3. I/O Model

sven_ 2013-09-13

Data structures

[include/types/fd.h]
/* info about one given fd */
struct fdtab {
    struct {
        int (*f)(int fd); /* read/write function */
        struct buffer *b; /* read/write buffer */
    } cb[DIR_SIZE];
    void *owner; /* the session (or proxy) associated with this fd */
    struct { /* used by pollers which support speculative polling */
        unsigned char e; /* read and write events status. 4 bits */
        unsigned int s1; /* Position in spec list+1. 0=not in list. */
    } spec;
    unsigned short flags; /* various flags precising the exact status of this fd */
    unsigned char state; /* the state of this fd */
    unsigned char ev; /* event seen in return of poll() : FD_POLL_* */
};
struct poller {
    void *private; /* any private data for the poller */
    int REGPRM2 (*is_set)(const int fd, int dir); /* check if <fd> is being polled for dir <dir> */
    int REGPRM2 (*set)(const int fd, int dir); /* set polling on <fd> for <dir> */
    int REGPRM2 (*clr)(const int fd, int dir); /* clear polling on <fd> for <dir> */
    int REGPRM2 (*cond_s)(const int fd, int dir); /* set polling on <fd> for <dir> if unset */
    int REGPRM2 (*cond_c)(const int fd, int dir); /* clear polling on <fd> for <dir> if set */
    void REGPRM1 (*rem)(const int fd); /* remove any polling on <fd> */
    void REGPRM1 (*clo)(const int fd); /* mark <fd> as closed */
    void REGPRM2 (*poll)(struct poller *p, int exp); /* the poller itself */
    int REGPRM1 (*init)(struct poller *p); /* poller initialization */
    void REGPRM1 (*term)(struct poller *p); /* termination of this poller */
    int REGPRM1 (*test)(struct poller *p); /* pre-init check of the poller */
    int REGPRM1 (*fork)(struct poller *p); /* post-fork re-opening */
    const char *name; /* poller name */
    int pref; /* try pollers with higher preference first */
};
[src/fd.c]
struct fdtab *fdtab = NULL; /* array of all the file descriptors */
struct fdinfo *fdinfo = NULL; /* less-often used infos for file descriptors */
int maxfd; /* # of the highest fd + 1 */
int totalconn; /* total # of terminated sessions */
int actconn; /* # of active sessions */
struct poller pollers[MAX_POLLERS];
struct poller cur_poller;
int nbpollers = 0;

If you look at the fdtab and poller structures and then read ev_epoll.c, you may wonder why they are laid out this way. Reading ev_sepoll.c first clears up most of that confusion.

sepoll

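In HAProxy, the author takes the epoll model one step further to sepoll (I do not know whether anyone had proposed or used this approach earlier). In theory this model should be more efficient overall than plain epoll, even though it is built on top of epoll, because it avoids a good number of expensive epoll-related system calls.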

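In the code comments the author calls sepoll "speculative I/O". The principle is that a socket descriptor that has just been accepted can usually be read from right away, and a descriptor that has just finished connecting is usually writable. It also helps for connections that are already transferring data: if one side of a connection is already in the epoll wait queue, the other side still has to react, either by sending or by receiving, depending on the level of the (read/write) buffer.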
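
To make the principle concrete, here is a minimal sketch of "try the I/O first, poll only if it would block". It is not HAProxy code: `speculative_read()` and `set_nonblocking()` are hypothetical helpers written for this illustration, and the point where a real poller would call `epoll_ctl()` is only marked with a comment.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Put a descriptor in non-blocking mode; returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Hypothetical helper: attempt the read before registering anything.
 * Returns the number of bytes read (0 with *need_poll == 0 means EOF),
 * or -1 on a real error. *need_poll is set to 1 only when the read
 * would have blocked. */
static ssize_t speculative_read(int fd, void *buf, size_t len, int *need_poll)
{
    ssize_t n = read(fd, buf, len);

    *need_poll = 0;
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        /* Nothing available yet: only now would the fd be handed
         * to epoll (EPOLL_CTL_ADD with EPOLLIN). */
        *need_poll = 1;
        return 0;
    }
    return n;
}
```

When the speculative attempt succeeds, no epoll_ctl() call is made for that fd at all, which is where the saving described above comes from.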

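The author also describes the drawback of sepoll: it can starve the polled events. (My understanding of starvation here is that resources are available but are not given to those who need them. poll exists to handle many descriptors at once; if it only handles a few, it gains very little, because it is itself a system call. The poll queue therefore needs to hold a certain number of fds, otherwise it is starved.) According to the author, experiments show that when the epoll queue starves, the load shifts to spec I/O. Every read or write is then attempted and fails, which turns into a vicious circle and seriously degrades performance (the spec list holds many descriptors, and trying them over and over is bound to be costly). The number of events epoll processes at once can be reduced to address this, but the same cannot be done for the spec list, because experiments show that about 2/3 of the fds in the spec list are new and only 1/3 are old. The author says the solution rests on two facts. First, an fd that is in the spec list cannot also be registered in epoll at the same time. Second, even under very heavy load we are almost never interested in streaming reads and writes on the same fd simultaneously. I understand the second fact as follows: for a client, the backend only responds once the request data has been sent completely; for a server, the response is only sent back once the request has been received completely.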

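The author says the first fact implies that during starvation the poll wait queue will not hold more than half of the fds. Otherwise the spec list would hold fewer fds than the poll list, and there would be no starvation. The second fact implies that we are only interested in half of the maximum number of events at once (each descriptor is either reading or writing).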

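Reducing the number of events processed from the poll list at once is meant to counter poll list starvation. One way to picture it: suppose every fd is destroyed after one read and one write. By the second fact, the number of fds in the poll list does not drop during the read phase, so the impact is small. During the write phase, however, both the read and the write have completed, so a large number of fds may be removed in one go while replenishment cannot keep up, and this can cause starvation. Because the first fact caps how many can be handled at a time, fewer fds are withdrawn after finishing a read/write round. It also splits the fds in the poll list into two groups whose removal times are staggered, so fewer are removed at once and the newly arriving fds should be able to keep up.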

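What about starvation that happens simply because there are few fds to begin with, so the poll list gets very few of them? In that case the spec list does not hold many fds either, so the performance impact is small and can basically be ignored.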

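Finally, the author states that if we can guarantee the poll list maxsock/2/2 events during peak load, which means allocating maxsock/4 events to the poll list, there will be no starvation effect. The author does not spell out where maxsock/2/2 comes from. Judging from the explanation above, the first division by 2 presumably expresses that if the poll list holds at least maxsock/2 fds, it is not starved. I am not yet sure about the second division by 2. If it comes from the second fact, it does not seem entirely reasonable to me: a socket certainly carries two events, and if only one of them is handled at a time, the number of events is still the same as the number of sockets.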

Next, let us look at how sepoll processes events.

[src/ev_sepoll.c]
#define FD_EV_IN_SL 1
#define FD_EV_IN_PL 4
#define FD_EV_IDLE 0
#define FD_EV_SPEC (FD_EV_IN_SL)
#define FD_EV_WAIT (FD_EV_IN_PL)
#define FD_EV_STOP (FD_EV_IN_SL|FD_EV_IN_PL)
/* Those match any of R or W for Spec list or Poll list */
#define FD_EV_RW_SL (FD_EV_IN_SL | (FD_EV_IN_SL << 1))
#define FD_EV_RW_PL (FD_EV_IN_PL | (FD_EV_IN_PL << 1))
#define FD_EV_MASK_DIR (FD_EV_IN_SL|FD_EV_IN_PL)
#define FD_EV_IDLE_R 0
#define FD_EV_SPEC_R (FD_EV_IN_SL)
#define FD_EV_WAIT_R (FD_EV_IN_PL)
#define FD_EV_STOP_R (FD_EV_IN_SL|FD_EV_IN_PL)
#define FD_EV_MASK_R (FD_EV_IN_SL|FD_EV_IN_PL)
#define FD_EV_IDLE_W (FD_EV_IDLE_R << 1)
#define FD_EV_SPEC_W (FD_EV_SPEC_R << 1)
#define FD_EV_WAIT_W (FD_EV_WAIT_R << 1)
#define FD_EV_STOP_W (FD_EV_STOP_R << 1)
#define FD_EV_MASK_W (FD_EV_MASK_R << 1)
#define FD_EV_MASK (FD_EV_MASK_W | FD_EV_MASK_R)

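As these macros show, the read and write events for the spec list occupy the two lowest bits, and the read and write events for the poll list occupy the third and fourth bits.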
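
The bit layout can be checked with a few lines of C. The macro values below are copied from the excerpt above; `dir_state_name()` is a hypothetical helper added only for this illustration and does not exist in HAProxy.

```c
/* Values copied from the src/ev_sepoll.c excerpt above. */
#define FD_EV_IN_SL   1
#define FD_EV_IN_PL   4
#define FD_EV_RW_SL   (FD_EV_IN_SL | (FD_EV_IN_SL << 1))
#define FD_EV_RW_PL   (FD_EV_IN_PL | (FD_EV_IN_PL << 1))
#define FD_EV_SPEC_R  (FD_EV_IN_SL)
#define FD_EV_WAIT_R  (FD_EV_IN_PL)
#define FD_EV_STOP_R  (FD_EV_IN_SL | FD_EV_IN_PL)
#define FD_EV_MASK_R  (FD_EV_IN_SL | FD_EV_IN_PL)
#define FD_EV_SPEC_W  (FD_EV_SPEC_R << 1)
#define FD_EV_WAIT_W  (FD_EV_WAIT_R << 1)
#define FD_EV_MASK_W  (FD_EV_MASK_R << 1)
#define FD_EV_MASK    (FD_EV_MASK_W | FD_EV_MASK_R)

/* Hypothetical helper: name the state of one direction of spec.e.
 *   bit 0: read  in spec list    bit 1: write in spec list
 *   bit 2: read  in poll list    bit 3: write in poll list */
static const char *dir_state_name(unsigned char e, int is_write)
{
    /* Shift the write bits down so both directions compare
     * against the read-side constants. */
    unsigned int s = is_write ? (unsigned int)(e & FD_EV_MASK_W) >> 1
                              : (unsigned int)(e & FD_EV_MASK_R);

    switch (s) {
    case 0:            return "IDLE";
    case FD_EV_SPEC_R: return "SPEC";
    case FD_EV_WAIT_R: return "WAIT";
    default:           return "STOP";
    }
}
```

For example, an fd whose spec.e is `FD_EV_SPEC_R | FD_EV_WAIT_W` (value 9) is trying reads speculatively while its writes wait in epoll.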

[src/ev_sepoll.c]_do_poll()
REGPRM2 static void _do_poll(struct poller *p, int exp)
{
    static unsigned int last_skipped;
    static unsigned int spec_processed;
    int status, eo;
    int fd, opcode;
    int count;
    int spec_idx;
    int wait_time;
    int looping = 0;
 re_poll_once:
    /* Here we have two options :
     * - either walk the list forwards and hope to match more events
     * - or walk it backwards to minimize the number of changes and
     *   to make better use of the cache.
     * Tests have shown that walking backwards improves perf by 0.2%.
     */

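The fds in the spec list are handled first. The author notes that walking the spec list backwards improves performance by 0.2%; the code comment attributes this to fewer list changes and better cache use. It also helps that the spec list always stores the newest fds at the end, and the newest fds are very likely to be readable or writable right away.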

[src/ev_sepoll.c]
    status = 0;
    spec_idx = nbspec;
    while (likely(spec_idx > 0)) {
        int done;
        spec_idx--;
        fd = spec_list[spec_idx];
        eo = fdtab[fd].spec.e; /* save old events */
        if (looping && --fd_created < 0) {
            /* we were just checking the newly created FDs */
            break;
        }

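The loop takes the fd and looks up its entry in fdtab. If this is the second pass through the loop, its only purpose is to check the fds newly created when a listening fd called accept. The author uses a dedicated variable, fd_created, to record the number of newly created fds, and the loop exits as soon as those new fds have been handled.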

[src/ev_sepoll.c]_do_poll()
        /*
         * Process the speculative events.
         *
         * Principle: events which are marked FD_EV_SPEC are processed
         * with their assigned function. If the function returns 0, it
         * means there is nothing doable without polling first. We will
         * then convert the event to a pollable one by assigning them
         * the WAIT status.
         */

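The author states the principle: events marked FD_EV_SPEC are processed by calling their assigned function. If the function returns 0, nothing can be done right now, and the event first has to wait in the poller, so it is given the WAIT status.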

[src/ev_sepoll.c]_do_poll()
#ifdef DEBUG_DEV
        if (fdtab[fd].state == FD_STCLOSE) {
            fprintf(stderr,"fd=%d, fdtab[].ev=%x, fdtab[].spec.e=%x, .s=%d, idx=%d\n",
                    fd, fdtab[fd].ev, fdtab[fd].spec.e, fdtab[fd].spec.s1, spec_idx);
        }
#endif
        done = 0;
        fdtab[fd].ev &= FD_POLL_STICKY;
        if ((eo & FD_EV_MASK_R) == FD_EV_SPEC_R) {
            /* The owner is interested in reading from this FD */
            if (fdtab[fd].state != FD_STERROR) {
                /* Pretend there is something to read */
                fdtab[fd].ev |= FD_POLL_IN;
                if (!fdtab[fd].cb[DIR_RD].f(fd))
                    fdtab[fd].spec.e ^= (FD_EV_WAIT_R ^ FD_EV_SPEC_R);
                else
                    done = 1;
            }
        }
        else if ((eo & FD_EV_MASK_R) == FD_EV_STOP_R) {
            /* This FD was being polled and is now being removed. */
            fdtab[fd].spec.e &= ~FD_EV_MASK_R;
        }
        if ((eo & FD_EV_MASK_W) == FD_EV_SPEC_W) {
            /* The owner is interested in writing to this FD */
            if (fdtab[fd].state != FD_STERROR) {
                /* Pretend there is something to write */
                fdtab[fd].ev |= FD_POLL_OUT;
                if (!fdtab[fd].cb[DIR_WR].f(fd))
                    fdtab[fd].spec.e ^= (FD_EV_WAIT_W ^ FD_EV_SPEC_W);
                else
                    done = 1;
            }
        }
        else if ((eo & FD_EV_MASK_W) == FD_EV_STOP_W) {
            /* This FD was being polled and is now being removed. */
            fdtab[fd].spec.e &= ~FD_EV_MASK_W;
        }

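For a read event on an fd in the spec list: when the function returns 0, the FD_EV_SPEC_R state is dropped and replaced by FD_EV_WAIT_R, which means the descriptor should go into the poll wait queue. When the function returns non-zero, the speculative attempt succeeded, so the fd stays in the spec list and the success flag is recorded. While handling each event, fdtab[fd].ev also records which event was processed for that fd.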

For an fd that has been marked as stopped, all of its read event bits are cleared.

Write events are handled in the same way as read events.
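
The XOR in the read branch is easy to misread, so here is a small sketch of the state change for the read direction only. `read_step()` is a hypothetical helper, not a HAProxy function; it works on a single copy of the event bits (the real loop tests the saved value `eo` and updates `fdtab[fd].spec.e`), and it leaves out the FD_STERROR check.

```c
/* Read-direction values, copied from the macros quoted earlier. */
#define FD_EV_SPEC_R 1   /* read is in the spec list */
#define FD_EV_WAIT_R 4   /* read is in the poll list */
#define FD_EV_STOP_R 5   /* both bits set: being removed */
#define FD_EV_MASK_R 5

/* Hypothetical helper: apply the read-side transition to event bits <e>,
 * given <cb_ret>, the return value of the read callback. */
static unsigned char read_step(unsigned char e, int cb_ret)
{
    if ((e & FD_EV_MASK_R) == FD_EV_SPEC_R) {
        if (!cb_ret) {
            /* WAIT_R ^ SPEC_R == 5: flips bit 0 off and bit 2 on,
             * which turns SPEC_R into WAIT_R in one operation. */
            e ^= (FD_EV_WAIT_R ^ FD_EV_SPEC_R);
        }
        /* on success the bits stay as they are */
    }
    else if ((e & FD_EV_MASK_R) == FD_EV_STOP_R) {
        /* stopped: clear both read bits, leave the write bits alone */
        e &= (unsigned char)~FD_EV_MASK_R;
    }
    return e;
}
```

Because the XOR only touches bits 0 and 2, whatever state the write direction is in passes through unchanged.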

[src/ev_sepoll.c]_do_poll()
        status += done;
        /* one callback might already have closed the fd by itself */
        if (fdtab[fd].state == FD_STCLOSE)
            continue;

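As long as either the read or the write succeeded above, this speculative attempt counts as a success and is added to the tally. The fd may of course already have been closed inside its read or write function, in which case the remaining steps are unnecessary.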

[src/ev_sepoll.c]_do_poll()
        /* Now, we will adjust the event in the poll list. Indeed, it
         * is possible that an event which was previously in the poll
         * list now goes out, and the opposite is possible too. We can
         * have opposite changes for READ and WRITE too.
         */
        if ((eo ^ fdtab[fd].spec.e) & FD_EV_RW_PL) {
            /* poll status changed*/
            if ((fdtab[fd].spec.e & FD_EV_RW_PL) == 0) {
                /* fd removed from poll list */
                opcode = EPOLL_CTL_DEL;
            }
            else if ((eo & FD_EV_RW_PL) == 0) {
                /* new fd in the poll list */
                opcode = EPOLL_CTL_ADD;
            }
            else {
                /* fd status changed */
                opcode = EPOLL_CTL_MOD;
            }
            /* construct the epoll events based on new state */
            ev.events = 0;
            if (fdtab[fd].spec.e & FD_EV_WAIT_R)
                ev.events |= EPOLLIN;
            if (fdtab[fd].spec.e & FD_EV_WAIT_W)
                ev.events |= EPOLLOUT;
            ev.data.fd = fd;
            epoll_ctl(epoll_fd, opcode, fd, &ev);
        }

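The result of this expression follows from the three cases above. First, when the callback succeeded (done), spec.e is unchanged, so eo ^ fdtab[fd].spec.e == 0 and the branch is not entered. Second, when the function returned 0, the FD_EV_SPEC bit was cleared and the FD_EV_WAIT bit was set, so the result is non-zero and the branch is entered. Inside it the second branch is taken, that is, the fd is added to epoll (if the fd was already being polled for the other direction, the third branch is taken instead and the registration is modified). Third, an FD_EV_STOP state caused the event bits to be cleared, so the result is non-zero and the branch is entered. When no poll list bit is left in spec.e, the first branch is taken, that is, the fd is removed from the epoll list.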

Once the operation has been determined, it is applied to the fd in the poll list.
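
The three-way decision can be isolated into a pure function to check the cases by hand. This is a sketch, not HAProxy code: `choose_poll_op()` and the `poll_op` enum are hypothetical names standing in for the EPOLL_CTL_* opcodes, so the snippet does not depend on `<sys/epoll.h>`.

```c
#define FD_EV_RW_PL 0x0C   /* bits 2 and 3: read/write in poll list */

/* Hypothetical stand-ins for "no call", EPOLL_CTL_ADD/MOD/DEL. */
enum poll_op { POLL_OP_NONE, POLL_OP_ADD, POLL_OP_MOD, POLL_OP_DEL };

/* Decide which epoll operation is needed when the event bits of an fd
 * go from <eo> (old) to <e> (new). Mirrors the branch order shown above. */
static enum poll_op choose_poll_op(unsigned char eo, unsigned char e)
{
    if (!((eo ^ e) & FD_EV_RW_PL))
        return POLL_OP_NONE;   /* poll list bits unchanged */
    if ((e & FD_EV_RW_PL) == 0)
        return POLL_OP_DEL;    /* no direction is polled any more */
    if ((eo & FD_EV_RW_PL) == 0)
        return POLL_OP_ADD;    /* nothing was polled before */
    return POLL_OP_MOD;        /* polled before and after, set differs */
}
```

With the bit values from the macros: 1 -> 4 (SPEC_R became WAIT_R) gives ADD, 5 -> 0 (STOP_R cleared) gives DEL, and 9 -> 12 (read joins a write that was already polled) gives MOD.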

[src/ev_sepoll.c]_do_poll()
        if (!(fdtab[fd].spec.e & FD_EV_RW_SL)) {
            /* This fd switched to combinations of either WAIT or
             * IDLE. It must be removed from the spec list.
             */
            release_spec_entry(fd);
            continue;
        }
    }

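After the poll list has been updated, the code also checks whether the fd's new events no longer include any spec event. If so, the fd has to be removed from the spec list by calling release_spec_entry(). This ends the loop over the spec list.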
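
The body of release_spec_entry() is not shown in this article. The sketch below is one plausible way to do such a removal in constant time, assuming the spec list is a plain array and that each fd stores "position + 1, 0 = not in list" as the comment on `spec.s1` describes. The names `spec_pos`, `spec_add()` and `spec_release()` are hypothetical, and the array size is arbitrary.

```c
#define MAX_FD 16   /* arbitrary size for the illustration */

static unsigned int spec_pos[MAX_FD]; /* position in spec list + 1, 0 = not in list */
static int spec_list[MAX_FD];         /* fds currently in the spec list */
static int nbspec;                    /* number of entries in spec_list */

/* Append <fd> (assumed < MAX_FD) unless it is already in the list. */
static void spec_add(int fd)
{
    if (spec_pos[fd])
        return;
    spec_list[nbspec] = fd;
    nbspec++;
    spec_pos[fd] = (unsigned int)nbspec;
}

/* Remove <fd> by moving the last entry into its slot. */
static void spec_release(int fd)
{
    unsigned int pos = spec_pos[fd];

    if (!pos)
        return;                 /* not in the list */
    spec_pos[fd] = 0;
    pos--;                      /* now spec_list[pos] == fd */
    nbspec--;
    if ((int)pos == nbspec)
        return;                 /* it was the last entry */
    fd = spec_list[nbspec];     /* move the tail entry into the hole */
    spec_list[pos] = fd;
    spec_pos[fd] = pos + 1;
}
```

Under these assumptions, removal fits well with walking the list backwards: the entry moved into the freed slot comes from the tail, which that pass has normally already visited.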

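To summarize the flow above: the spec list is walked from back to front, and for each fd the function matching the event it is interested in is called to perform the input or output (all fds are non-blocking). If the call succeeds, the fd stays in the spec list and is counted among the fds handled successfully through spec. If it fails, the fd has to be put into the poll list to wait, because nothing can be done in the spec list until data arrives. If the descriptor has been stopped, it is removed from the poll list or the spec list.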
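
The whole pass can be imitated with a toy model, reduced to the read direction. Everything here is hypothetical scaffolding for illustration (`struct toy_fd`, `toy_spec_pass()`, the `readable` flag standing in for the read callback's result); real epoll calls, the write direction and the STOP state are left out.

```c
#define EV_SPEC_R 1   /* same values as FD_EV_SPEC_R / FD_EV_WAIT_R */
#define EV_WAIT_R 4
#define EV_MASK_R 5
#define EV_RW_SL  3   /* any spec list bit */

struct toy_fd {
    unsigned char e;  /* event bits, same layout as spec.e */
    int readable;     /* what the pretend read callback returns */
};

/* Walk <list> (holding *nb fd indexes into <tab>) from back to front.
 * Returns how many speculative reads succeeded; *to_poll receives the
 * number of fds that were switched to WAIT and dropped from the list. */
static int toy_spec_pass(struct toy_fd *tab, int *list, int *nb, int *to_poll)
{
    int status = 0;
    int idx = *nb;

    *to_poll = 0;
    while (idx > 0) {
        int fd;

        idx--;
        fd = list[idx];
        if ((tab[fd].e & EV_MASK_R) == EV_SPEC_R) {
            if (tab[fd].readable) {
                status++;                       /* stays in the spec list */
            } else {
                tab[fd].e ^= (EV_WAIT_R ^ EV_SPEC_R);
                (*to_poll)++;                   /* would be EPOLL_CTL_ADD */
            }
        }
        if (!(tab[fd].e & EV_RW_SL)) {
            /* no spec bit left: fill the slot with the tail entry */
            (*nb)--;
            list[idx] = list[*nb];
        }
    }
    return status;
}
```

Starting with four fds in the SPEC_R state of which two are readable, one pass leaves the two readable fds in the list and hands the other two to the poller.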