Reading Notes on "Advanced Programming in the UNIX Environment" (APUE), Part 6

WUCANADA 2011-12-11

Sunday, March 23, 2008, 1:35

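Chapter 5 of APUE covers the standard I/O library. It is called "standard" because it is part of ISO C and has been implemented on many systems besides UNIX; the scanf and printf we used all the time in the second-semester freshman course "C Programming" belong to it. The main difference between standard I/O and the file I/O of Chapter 3 is buffering. read and write are system calls and involve no user-space buffer, whereas scanf and printf do not enter the kernel directly: they maintain a buffer in user space and call read or write to fill or drain that buffer at appropriate times.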
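
The difference can be observed from the outside. The sketch below is my own illustration, not from the book: it writes five bytes through standard I/O to a temporary file and asks the kernel (with fstat) how large the file is before and after fflush. The function names kernel_file_size and observe_buffering are made up for this note, and the sketch assumes a POSIX system where a temporary file is fully buffered by default.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>

/* Size of the file as the kernel currently sees it, or -1 on error. */
long kernel_file_size(FILE *fp)
{
    struct stat st;

    if (fstat(fileno(fp), &st) != 0)
        return -1;
    return (long)st.st_size;
}

/*
 * Write 5 bytes through standard I/O and record the kernel-visible size
 * before and after fflush. Returns 0 on success, -1 on failure.
 */
int observe_buffering(long *before_flush, long *after_flush)
{
    FILE *fp = tmpfile();   /* a regular file, assumed fully buffered by default */

    if (fp == NULL)
        return -1;
    fputs("hello", fp);                     /* lands in the user-space buffer */
    *before_flush = kernel_file_size(fp);
    fflush(fp);                             /* now the library calls write */
    *after_flush = kernel_file_size(fp);
    fclose(fp);
    return 0;
}
```

Called from a main function, observe_buffering should report a size of 0 before the flush and 5 after it, because the data only reaches the kernel when the library decides (or is told) to call write.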

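The first thing we meet is FILE, familiar yet strange: I have always used it without ever knowing what it really looks like. Here is its definition on Linux (glibc):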
struct _IO_FILE {
int _flags;        /* High-order word is _IO_MAGIC; rest is flags. */
#define _IO_file_flags _flags

/* The following pointers correspond to the C++ streambuf protocol. */
/* Note: Tk uses the _IO_read_ptr and _IO_read_end fields directly. */
char* _IO_read_ptr;    /* Current read pointer */
char* _IO_read_end;    /* End of get area. */
char* _IO_read_base;    /* Start of putback+get area. */
char* _IO_write_base;    /* Start of put area. */
char* _IO_write_ptr;    /* Current put pointer. */
char* _IO_write_end;    /* End of put area. */
char* _IO_buf_base;    /* Start of reserve area. */
char* _IO_buf_end;    /* End of reserve area. */
/* The following fields are used to support backing up and undo. */
char *_IO_save_base; /* Pointer to start of non-current get area. */
char *_IO_backup_base; /* Pointer to first valid character of backup area */
char *_IO_save_end; /* Pointer to end of non-current get area. */

struct _IO_marker *_markers;
struct _IO_FILE *_chain;
int _fileno;
#if 0
int _blksize;
#else
int _flags2;
#endif
_IO_off_t _old_offset; /* This used to be _offset but it's too small. */

#define __HAVE_COLUMN /* temporary */
/* 1+column number of pbase(); 0 is unknown. */
unsigned short _cur_column;
signed char _vtable_offset;
char _shortbuf[1];

/* char* _save_gptr; char* _save_egptr; */
_IO_lock_t *_lock;
#ifdef _IO_USE_OLD_IO_FILE
};

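A few of these fields are key to understanding buffering:
char* _IO_read_base;  // start of the read buffer
char* _IO_read_end;   // end of the read buffer
char* _IO_read_ptr;   // current position in the read buffer
char* _IO_write_base; // start of the write buffer
char* _IO_write_end;  // end of the write buffer
char* _IO_write_ptr;  // current position in the write buffer
char* _IO_buf_base;   // start of the buffer
char* _IO_buf_end;    // end of the buffer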

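The program below shows that these three buffers are really one and the same buffer, which the library allocates automatically on the first I/O operation and eventually frees automatically: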
/* test.c: print the buffer pointers inside a FILE (the fields are glibc-specific) */
#include <stdio.h>

/* Print the read, write and reserve-area pointers of fp under the heading `when`. */
static void dump(const char *when, FILE *fp)
{
    printf("%s\n", when);
    printf("read buffer base %p\n", (void *)fp->_IO_read_base);
    printf("read buffer end %p\n", (void *)fp->_IO_read_end);
    printf("read buffer current %p\n", (void *)fp->_IO_read_ptr);
    printf("write buffer base %p\n", (void *)fp->_IO_write_base);
    printf("write buffer end %p\n", (void *)fp->_IO_write_end);
    printf("write buffer current %p\n", (void *)fp->_IO_write_ptr);
    printf("buf buffer base %p\n", (void *)fp->_IO_buf_base);
    printf("buf buffer end %p\n", (void *)fp->_IO_buf_end);
}

int main(void)
{
    dump("before reading", stdin);
    fgetc(stdin);
    /* fputc('a', stdout); */
    dump("after reading", stdin);
    return 0;
}

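Standard I/O on UNIX provides three kinds of buffering: fully buffered, line buffered and unbuffered. This is the hardest part of the chapter to understand, so below I go through how each of them handles FILE:

1. Full buffering is normally used for disk files. Standard I/O moves as much data as it can between the file and the buffer, and the buffer is flushed when it fills up or when it is flushed by hand. Taking test.c above as the example, redirect stdin to a disk file and watch how the members of FILE change:
$ gcc test.c && ./a.out < datafile   # datafile should preferably contain several lines of data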

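The output shows that full buffering moves as much data as possible into the buffer. Even though I only asked for one character, standard I/O has already read the rest of the data into the buffer as well (up to stdin->_IO_buf_end - stdin->_IO_buf_base bytes), so further reads of the file do not need to touch the disk; they simply take data out of the buffer.

For a fully buffered read, _IO_read_base points to the start of the buffer, _IO_read_end points one past the last character read from disk into the buffer, and _IO_read_ptr points one past the last character the user has consumed. For a fully buffered write, _IO_write_base points to the start of the buffer, _IO_write_end points one past the last character of the buffer, and _IO_write_ptr points one past the last character the user has written.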
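
The read-ahead can also be seen without touching the glibc-specific fields. The following is a hedged sketch of my own (the name read_ahead_demo is made up): it reads a single character from a small temporary file and then compares the position standard I/O reports (ftell) with the offset the kernel holds for the descriptor (lseek). It assumes a POSIX system and a file smaller than the stdio buffer.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/*
 * Put `size` bytes in a temporary file, read ONE character back through
 * standard I/O, and report both positions.
 * Returns 0 on success, -1 on failure.
 */
int read_ahead_demo(int size, long *stdio_pos, long *kernel_pos)
{
    FILE *fp = tmpfile();
    int i;

    if (fp == NULL)
        return -1;
    for (i = 0; i < size; i++)
        fputc('x', fp);
    rewind(fp);                 /* flushes the data and seeks back to 0 */

    if (fgetc(fp) == EOF) {     /* the user asks for a single character */
        fclose(fp);
        return -1;
    }
    *stdio_pos = ftell(fp);                               /* what the user consumed */
    *kernel_pos = (long)lseek(fileno(fp), 0, SEEK_CUR);   /* what read() fetched */
    fclose(fp);
    return 0;
}
```

With a 100-byte file, the stdio position is 1 while the kernel offset is well past 1 (typically the whole file), which is the gap between _IO_read_ptr and _IO_read_end described above.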

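2. Line buffering is normally used for terminals such as standard input and standard output. The buffer is flushed immediately when any of the following three conditions is met:
(1) a '\n' is encountered
(2) the buffer is full
(3) the point stressed in the last paragraph of p. 135 of the book: when input is requested from an unbuffered stream, or from a line-buffered stream that has to fetch data from the kernel, all line-buffered output streams are flushed immediately.

The following program involves line buffering, yet many people, myself included, are baffled the first time they see it: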
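/* print.c: copy stdin to stdout one character at a time */
#include <stdio.h>

int main(void)
{
    int c;  /* int rather than char, so that EOF can be told apart from data */

    while ((c = getchar()) != EOF)
        putchar(c);
    return 0;
}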

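The usual misreading goes like this: following the program naively, one character should be printed for every character typed; and since stdin and stdout are both line buffered, by condition (3) above shouldn't every line-buffered stream, stdout included, be flushed as soon as a character is typed?

What actually happens is that typing a character produces no output on the terminal. Output appears only when we finish the line and press Enter, and then the whole line just typed (including the newline) is printed at once. Line buffering is behind this. On the first getchar the library allocates a buffer for stdin and reads one line of data into it (up to stdin->_IO_buf_end - stdin->_IO_buf_base bytes). The getchar calls that follow do not call read to go to the kernel but take characters straight from the buffer, so condition (3) is not met. Likewise, on the first putchar the library allocates a buffer for stdout and places the character to be written in it.

When the line is finished and Enter is pressed, the '\n' reaches putchar, condition (1) is met, and the buffer is flushed at once. The stdin buffer has been drained, i.e. stdin->_IO_read_ptr == stdin->_IO_read_end, and the stdout buffer is emptied too: the characters between stdout->_IO_write_base and stdout->_IO_write_ptr are sent to the terminal with write, after which stdout->_IO_write_ptr is reset to stdout->_IO_write_base. That is why we see one line of output for each line of input.

Condition (2) causes a flush as well (test.c shows that the line buffer has a fixed size; on my system it is 1024 bytes). It is easy to try: paste any text longer than 1024 bytes onto the command line and see whether it is flushed automatically. On my system it behaves exactly as expected and flushes automatically.

For a line-buffered read, _IO_read_base points to the start of the buffer, _IO_read_end points one past the last character read from the kernel into the buffer, and _IO_read_ptr points one past the last character the user has consumed. For a line-buffered write, _IO_write_base points to the start of the buffer, _IO_write_end also points to the start of the buffer, and _IO_write_ptr points one past the last character the user has written.

One more point about line buffering: a newline can be stored in the buffer by hand, and the library does not flush right away. The following example shows this: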
/* check.c: is stdout flushed when '\n' is stored in the buffer by hand? (glibc-specific fields) */
#include <stdio.h>

int main(void)
{
    char str[5] = "abc\n";

    fputc(str[0], stdout);
    fprintf(stderr, "%p\n", (void *)stdout->_IO_write_ptr);
    fputc(str[1], stdout);
    fprintf(stderr, "%p\n", (void *)stdout->_IO_write_ptr);
    fputc(str[2], stdout);
    fprintf(stderr, "%p\n", (void *)stdout->_IO_write_ptr);
    /* A '\n' stored by hand does not cause a flush, because the library never checks it. */
    *stdout->_IO_write_ptr++ = '\n';
    /* This '\n' goes through fputc: the buffer is flushed and _IO_write_ptr is reset to _IO_write_base. */
    fputc(str[3], stdout);
    fprintf(stderr, "%p\n", (void *)stdout->_IO_write_ptr);
    return 0;
}

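3. Unbuffered I/O is normally used for standard error. "Unbuffered" does not mean that the buffer size is 0 but that it is 1, which you can see by changing stdin to stderr in test.c. Every read or write on an unbuffered stream causes a flush.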
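
The three modes can be compared side by side. The sketch below is my own and only an illustration under stated assumptions: it selects a mode with setvbuf on a temporary file, writes a string, and uses fstat to see how many bytes have reached the kernel before any explicit fflush. The function name bytes_reaching_kernel is made up, and the behaviour described is what I expect on a POSIX system with glibc.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>

/*
 * Write `text` to a fresh temporary file using buffering mode `mode`
 * (_IOFBF, _IOLBF or _IONBF) and return how many bytes the kernel has
 * received BEFORE any explicit fflush or fclose. Returns -1 on failure.
 */
long bytes_reaching_kernel(int mode, const char *text)
{
    FILE *fp = tmpfile();
    struct stat st;
    long visible = -1;

    if (fp == NULL)
        return -1;
    /* setvbuf must be called before any other operation on the stream. */
    if (setvbuf(fp, NULL, mode, BUFSIZ) == 0 && fputs(text, fp) != EOF
            && fstat(fileno(fp), &st) == 0)
        visible = (long)st.st_size;
    fclose(fp);
    return visible;
}
```

Fully buffered output stays in user space even when it contains a newline, line-buffered output is pushed out as soon as a '\n' is written, and unbuffered output reaches the kernel on every call.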

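ISO C requires that:
(1) standard input and standard output are fully buffered if and only if they do not refer to an interactive device;
(2) standard error is never fully buffered.

Most systems default to:
(1) standard error is always unbuffered;
(2) any other stream is line buffered if it refers to a terminal device, and fully buffered otherwise.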
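
The default rule can be written down as a small function. This is only a restatement of the rule of thumb above, not a query of what the library actually chose, and the name predicted_default_mode is my own; it assumes a POSIX system for isatty and fileno.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/*
 * Predict the default buffering mode of a stream from the rule of thumb
 * above. This does not query the library; it only restates the rule.
 */
int predicted_default_mode(FILE *fp)
{
    if (fp == stderr)
        return _IONBF;          /* standard error: unbuffered */
    if (isatty(fileno(fp)))
        return _IOLBF;          /* terminal device: line buffered */
    return _IOFBF;              /* anything else: fully buffered */
}
```

For example, predicted_default_mode(stdout) gives _IOLBF when the program runs in a terminal and _IOFBF when its output is redirected to a file or a pipe.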

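setbuf and setvbuf look simple but hide a lot that the book does not spell out. For example, if you supply your own buffer, where should that buffer live? If it is just an ordinary local variable, it is released when the function returns, and the stream is left pointing at a buffer that no longer exists. Then there is the unbuffered case: once a stream has been set to unbuffered with setbuf, can characters be echoed while they are being typed? This idea has long given me a headache. I wanted a completely unbuffered mode, like the input and output in the assembly programs I used to write, where no keystroke is ever held back. My experiments show that this does not work; the input is still buffered. Why? Below is a passage quoted from a post on Google Groups: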

setbuf() has to do with the delivery of bytes between the
C library FILE* management layer and the OS I/O layer.

Calls to fread(), fgets(), fgetc(), and getchar() work within
whatever FILE* buffered data is available, and when that data
is exhausted, the calls request that the FILE* buffer be refilled
by the system I/O layer.

When full buffering is turned on, that refill operation results in the
FILE* layer requesting that the operating system hand it a full
buffer's worth of data; when buffering is turned off, that
refill operation results in the FILE* layer requesting that the
operating system return a single character.

Your error is in assuming that the operating system layer in
question is dealing with raw bytes directly from the terminal.
That is not the case. Instead, the relevant operating system layer
is dealing with bytes returned by the terminal device driver --
and the device driver does not pass those bytes up to the
operating system layer until the device driver is ready to do so.

As I indicated before, setting an input stream to be unbuffered
does NOT tell the operating system to tell the device driver
to go into any kind of "raw" single-character mode. There are
system-specific calls such as ioctl() and tcsetattr() that
control what the device driver will do.

In Unix-type systems, the terminal device driver by default works
on a line at a time, not passing the line onward until it detects
a sequence that indicates end-of-line. When the Unix-type
'line disciplines' are in effect, you can edit the line in various
ways before allowing it to be passed to the operating system.
For example, you might type cad and then realize you mistyped and so
press the deletion key and type an r; if you were to do so, and then
press return, it would be the word car that was passed to the
next layer, *not* the series of keys cad<delete>r
The device driver buffers the input to allow you to edit it,
and setting your input stream to unbuffered in your program does NOT
affect that device driver buffering.

If you want to do single-character I/O and you will worry about
things like inline editing yourself in your program, then you
will need to use system-specific calls to enable that I/O mode.

Before you head down that path, you should keep in mind that
you cannot handle mouse-highlight and copy and paste operations
just by looking at the key presses themselves: you have to work
with the graphical layer to do that, and that can get very messy.
Because of that, character-by-character I/O is probably best
reserved for interaction with non-graphical devices such as
modems and serial ports. If you -really- want character-by-
character I/O, such as because you are programming a graphical
game, then it is probably best to find a pre-written library that
handles the dirty work for you.
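
As a rough idea of what such a system-specific call looks like on a POSIX system, here is a minimal sketch using tcgetattr and tcsetattr from termios. It is my own illustration, not code from the post or the book, and the function names enter_char_mode and leave_char_mode are made up. It turns off the terminal driver's line editing and echo; the input stream would still need to be set to unbuffered (or read with read) to see each key as it arrives.

```c
#include <termios.h>
#include <unistd.h>

/*
 * Switch the terminal on `fd` to non-canonical, no-echo mode so that
 * read() returns after every key press. The previous settings are stored
 * in *saved. Returns 0 on success, -1 if fd is not a terminal or on error.
 */
int enter_char_mode(int fd, struct termios *saved)
{
    struct termios raw;

    if (tcgetattr(fd, saved) != 0)
        return -1;
    raw = *saved;
    raw.c_lflag &= ~(ICANON | ECHO);   /* no line editing, no echo */
    raw.c_cc[VMIN] = 1;                /* read() returns after 1 byte */
    raw.c_cc[VTIME] = 0;               /* with no timeout */
    return tcsetattr(fd, TCSAFLUSH, &raw) == 0 ? 0 : -1;
}

/* Restore the settings saved by enter_char_mode. */
int leave_char_mode(int fd, const struct termios *saved)
{
    return tcsetattr(fd, TCSAFLUSH, saved) == 0 ? 0 : -1;
}
```

A program would call enter_char_mode(STDIN_FILENO, &saved) at startup and leave_char_mode(STDIN_FILENO, &saved) before exiting, otherwise the shell is left with a terminal that no longer echoes.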

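Setting a stream to unbuffered in the program does not mean that the operating system stops buffering (that would be raw mode), and hardware buffering is involved as well. A user-supplied buffer should be a global or static variable.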
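
A minimal sketch of the buffer-lifetime point, assuming nothing beyond ISO C. The helper name attach_static_buffer is my own.

```c
#include <stdio.h>

/*
 * Give `fp` a fully buffered user-supplied buffer. The buffer is static,
 * so it outlives this function; an automatic (local) array here would be
 * gone by the time the stream uses it. Because there is only one buffer,
 * this helper must serve at most one open stream at a time.
 */
int attach_static_buffer(FILE *fp)
{
    static char buf[BUFSIZ];

    return setvbuf(fp, buf, _IOFBF, sizeof buf);
}
```

The helper has to be called right after the stream is opened and before any other operation on it. A buffer obtained from malloc would work just as well, as long as it is freed only after fclose.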

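When reading a file we usually test the return value against EOF to detect the end of the file, so why is there a feof function as well? Because getc and its relatives return EOF both at end of file and on a read error, feof() and ferror() are needed to tell the two cases apart. A related trap: in a binary file the byte 0xFF is valid data, and if the return value is stored in a char (where char is signed) it compares equal to EOF, so the file seems to end before it has been read completely; storing the return value in an int avoids this. Note that feof() does not look at the file length. It only reports the end-of-file indicator that the stream sets after a read has actually hit the end of the file.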
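
A short sketch of the usual pattern: read with an int, stop on EOF, then ask ferror and feof why the loop stopped. The function name count_bytes is my own.

```c
#include <stdio.h>

/*
 * Count the bytes remaining in `fp`. Returns the count when the loop
 * stopped at end of file, or -1 when it stopped because of a read error.
 */
long count_bytes(FILE *fp)
{
    long n = 0;
    int c;                      /* int, so that byte 0xFF (255) != EOF */

    while ((c = getc(fp)) != EOF)
        n++;
    if (ferror(fp))
        return -1;              /* EOF was returned because of an error */
    return feof(fp) ? n : -1;   /* EOF was returned at end of file */
}
```

Because c is an int, a 0xFF byte in the middle of a binary file is counted like any other byte and does not end the loop early.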

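Exercises:
5.1 setvbuf(fp, buf, buf ? _IOFBF : _IONBF, BUFSIZ)
5.2 Standard input and output are line buffered, so the whole line is read into the buffer at once and then copied into the user-defined buf in several batches
5.3 The string written to stdout is empty
5.4 It depends on whether the range of char can hold EOF; c is best changed to int
5.5 ??
5.6 After the data has been flushed with fflush, call fsync so that it is written to disk
5.7 Calling fgets requests input from the kernel, and stdin is a line-buffered stream, so condition (3) is met and the % prompt is printed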
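
My answers to 5.1 and 5.6 written out as code. This is a sketch under my own assumptions, not the book's solution: my_setbuf and flush_to_disk are names I made up, my_setbuf ignores the refinement of choosing line buffering for terminals, and fsync and fileno assume a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Exercise 5.1: setbuf expressed with setvbuf. */
void my_setbuf(FILE *fp, char *buf)
{
    /* A NULL buf turns buffering off; otherwise buf must hold BUFSIZ bytes. */
    setvbuf(fp, buf, buf != NULL ? _IOFBF : _IONBF, BUFSIZ);
}

/* Exercise 5.6: push a stream's data all the way to the disk. */
int flush_to_disk(FILE *fp)
{
    if (fflush(fp) != 0)            /* user-space buffer -> kernel */
        return -1;
    return fsync(fileno(fp));       /* kernel cache -> disk */
}
```

The order in flush_to_disk matters: fsync only knows about data the kernel already has, so calling it without fflush first would leave anything still in the stdio buffer unwritten.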