The Event Completion Framework for the Solaris Operating System

ShangShujie 2007-06-05

By Robert Benson, July 29, 2004

Contents

-	Introduction
-	Motivation
-	Solaris Event Completion API
-	Examples
-	Related Work
-	Conclusion
-	References
-	Acknowledgments
-	About the Author
-	Appendix: Code Example Listings

Introduction

An application's lifetime includes a number of events of interest. These events happen because of the application's interaction with the system in some well-defined frameworks. The Asynchronous I/O (AIO), timer, and poll frameworks are all good examples. As shown in this article, each one of these frameworks provides a solution to the problem on which it is focused, but does not extend any further. Due to the lack of crossover between these frameworks, application developers do not have a general way to gather multiple events of differing types using one framework.

This is the problem that the event completion framework shipped with the Solaris 10 Operating System (OS) is designed to solve. This framework provides a group of clients waiting on multiple objects (that is, AIO transactions, timers, files, and user-defined events) with a method to receive transaction completion events from different parts of the system in a scalable and performant manner. Additionally, the introduction of this framework enables developers with applications that leverage an event completion API to migrate from other operating systems to the Solaris 10 OS.

Within the Solaris 10 OS, the event completion framework focuses on providing a scalable, performant, and extendable framework that can incorporate new object and event types as they appear within the system.

Motivation

Prior to the Solaris 10 OS, there was no unified way for an application to reap its completed events. Within the Asynchronous I/O framework, the status of an I/O transaction has to be collected, or reaped, using the aio_error() function. If the application needs to set up a timer to fire at some point in the future, the application depends on the signal framework to receive notification of the timer expiration. In addition, applications commonly need to execute some form of I/O in order to read or write to a group of files or the network. Due to system complexity, the resource requested by the application might be busy, and thus the application would have to wait. Traditionally, application programmers used the poll(2) or poll(7D) interfaces to query, or poll, the system to see if the application could read from or write to the pertinent resource (such as a pipe, socket, and so on).

Because all of these frameworks were built independently, no unified methodology existed by which an application could gather events. For example, the poll functions are not general enough to return AIO read and write completion events or timer expiration events. In addition, none of these frameworks allowed for a threaded application to send user-defined events and payloads to a subset of the total amount of threads within the application. These points -- as well as the widely varying performance and scalability of the available frameworks to deliver event completion -- spurred the developers at Sun Microsystems, Inc. to develop a unified framework by which an application could reap an event of interest using one API.

A classic example of historic work within this area surrounds the poll(2) and poll(7D) interfaces. These interfaces work by taking an array of pollfd structures, which include file descriptors (fds) and a set of flags to indicate what events the application is waiting for on the list of fds. The poll(2) interface is a reasonable solution to the problem but only if the set of fds is small, does not change frequently, and the number of "active" fds is small in comparison to the total number of fds. In addition, the poll(2) functions block every time a fd is added to the list of fds to be monitored.
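For comparison with the port-based code later in this article, the poll(2) pattern described above can be sketched as follows. This is a minimal illustration, not code from the original listings; the helper name poll_readable() and the MAX_FDS cap are assumptions made for the example. Note how the whole interest set is rebuilt before each call and scanned after it, which is where the cost grows with the number of fds.

```c
#include <poll.h>

#define MAX_FDS 64   /* arbitrary cap chosen for this sketch */

/*
 * Wait until at least one descriptor in fds[] reports POLLIN.
 * Stores the readable descriptors in ready[] and returns how many
 * there are, 0 on timeout, or -1 on error (or if nfds is out of range).
 * poll_readable() is a hypothetical helper, not a system interface.
 */
int poll_readable(const int fds[], int nfds, int timeout_ms, int ready[])
{
    struct pollfd pfds[MAX_FDS];
    int i, n, nready = 0;

    if (nfds < 0 || nfds > MAX_FDS)
        return -1;

    /* The full interest set is rebuilt and passed in on every call. */
    for (i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }

    n = poll(pfds, (nfds_t)nfds, timeout_ms);
    if (n <= 0)
        return n;   /* 0 = timeout, -1 = error with errno set */

    /* Every entry is scanned, even if only one descriptor is active. */
    for (i = 0; i < nfds; i++) {
        if (pfds[i].revents & POLLIN)
            ready[nready++] = pfds[i].fd;
    }
    return nready;
}
```

A caller passes the same array on every iteration of its event loop, so the setup and the scan are repeated each time regardless of how many descriptors are actually active.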

To address these issues, the poll(7D) interface was created. poll(7D) is a more performant option than poll(2), and it should be leveraged in cases where there are a large number of fds to monitor. That said, the poll(7D) interface still has performance issues due to the limitations of the infrastructure shared by the poll interfaces. Specifically, because of the implementation, the response time is dependent on the number of fds within the list to be monitored. This illustrates the history and complexity of the problems that application developers needed to be aware of when implementing event-aware applications.

In the age of fast, cheap, multiprocessor systems, scalability has become a focus for many application developers. Due to the amount of time that has passed since the design and implementation of the frameworks mentioned above, most were based upon the idea that the process was the fundamental unit of execution, as opposed to the thread. For example, the AIO framework was designed to support AIO transactions in a per-process manner, and thus it does not scale well for highly multithreaded applications. With this in mind, event completion ports were designed to be used by a single thread or a subset of the threads in the application.

The architects of the event completion framework decided to build a new event framework within the Solaris OS kernel in order to avoid the gaps within the historic interfaces. In creating a new framework, they focused on solving the issues listed previously in this section.

Solaris Event Completion API

To give the developer a general idea of how to use the event completion framework, I would like to start out with a simple code example. The fundamental piece of the event completion framework is the port. Applications use ports to register and reap events on the objects of interest. Code Sample 1 gives a basic example of how to use the general event completion framework.

Code Sample 1: Example Event Completion Code

/* Create port to use for event completion */
int portfd = port_create();
...
/* Register, or associate, the objects and events you are
   interested in */
port_associate(portfd, ... );
...
/* Block until a single event appears on the port */
port_get(portfd, ... );

Note that using this framework is as simple as creating a port, registering the events you wish to receive for the objects of interest, and then using a single interface to reap a single event or multiple events from the previously created port. As will be seen later, the port_associate() call can be replaced with other initializing functions (such as timer_create(3RT), aio_read(3RT), and so on) in order to use event completion ports with other frameworks.

The Solaris 10 OS event completion API includes the functions listed in Code Sample 2.

Code Sample 2: Event Completion Function Specifics

int     port_create(void);
int     port_associate(int port, int source, uintptr_t object,
            int events, void *user);
int     port_dissociate(int port, int source, uintptr_t object);
int     port_send(int port, int events, void *user);
int     port_sendn(int ports[], int errors[], uint_t nent,
            int events, void *user);
int     port_get(int port, port_event_t *pe,
            const timespec_t *timeout);
int     port_getn(int port, port_event_t list[], uint_t max,
            uint_t *nget, const timespec_t *timeout);
int     port_alert(int port, int flags, int events, void *user);

The port_create(3C) function creates a port by which completion events can be delivered to a thread. This function returns a non-negative integer representing the port's identifier.

The port_associate(3C) function associates an object (such as a file, socket, timer, and so on) with a previously created port. The first parameter is the port identifier, which was the return value of the port_create() function. The second parameter, source, names the event source, and the third parameter, object, identifies the object that will be monitored by the port. Depending on the type of I/O the application is binding to the port, the object behind an event may be an aiocb structure (found in aio.h), a timer_t timer identifier (found in time.h), a pointer to a user-defined variable or structure, or a file descriptor. Please note that an object is automatically dissociated from the port once the object's event has been reaped from the port. This is required because the poll interface doesn't maintain any state. Thus, once an object's event has been reaped, port_associate(3C) must be used to reassociate the object with the port if there is still an interest in any events pertaining to that object.
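The reassociation rule can be illustrated with a short sketch. It assumes a file descriptor source (PORT_SOURCE_FD with the POLLIN event); the helper name reap_pollin() is hypothetical, and the non-Solaris branch exists only so that the fragment compiles on systems without port.h.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>
#if defined(__sun)
#include <port.h>
#endif

/*
 * Reap `rounds` POLLIN events for fd through a private event port.
 * Returns the number of events reaped, or -1 on error.
 * reap_pollin() is a hypothetical helper, not part of the port API.
 */
int reap_pollin(int fd, int rounds)
{
#if defined(__sun)
    port_event_t pe;
    int port, i, reaped = 0;

    if ((port = port_create()) < 0)
        return -1;

    for (i = 0; i < rounds; i++) {
        /* Reaping an event drops the association, so register each round. */
        if (port_associate(port, PORT_SOURCE_FD, (uintptr_t)fd,
            POLLIN, NULL) != 0)
            break;
        /* A NULL timeout blocks until an event arrives. */
        if (port_get(port, &pe, NULL) != 0)
            break;
        if (pe.portev_source == PORT_SOURCE_FD &&
            (pe.portev_events & POLLIN))
            reaped++;
        /* A real program would read from fd here before re-associating. */
    }
    (void) close(port);
    return (i == rounds) ? reaped : -1;
#else
    /* Event ports are Solaris-specific; report that they are unavailable. */
    (void) fd;
    (void) rounds;
    errno = ENOSYS;
    return -1;
#endif
}
```

Because this sketch never consumes the pending data, each new association reports the descriptor as readable again; an application would normally read from the descriptor between rounds.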

The port_dissociate(3C) function takes the object referenced by the third parameter out of the list of objects monitored by the port specified by the first parameter. The second parameter indicates the source of the events, which was indicated at the time of port association.

The port_send(3C) and port_sendn(3C) functions put a user-defined event onto the port indicated by the first parameter. The difference between the two functions is that port_sendn() takes an array of ports and can therefore send an event to more than one port. The events value indicates what user-defined event is being put on the port. The receiver can use this value to decide how to process the event when the application has a number of possible user-defined event types that could arrive on the port. In addition, the user pointer represents the user-defined payload that is delivered to the port for the receiver to consume. As shown later, in the Examples section, this can be as complex a structure as the application developer chooses.
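As a minimal sketch of this send-and-receive path, the fragment below posts one user-defined event and reaps it again from the same port. The helper name user_event_roundtrip() and the event value EV_JOB_READY are assumptions made for illustration; the non-Solaris branch is there only so that the fragment builds on systems without port.h.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#if defined(__sun)
#include <port.h>
#endif

#define EV_JOB_READY 1   /* hypothetical application-defined event value */

/*
 * Post a user-defined event carrying `payload`, then reap it again.
 * Returns the payload pointer as seen by the receiver, or NULL on
 * failure. user_event_roundtrip() is a hypothetical helper.
 */
void *user_event_roundtrip(void *payload)
{
#if defined(__sun)
    port_event_t pe;
    void *seen = NULL;
    int port;

    if ((port = port_create()) < 0)
        return NULL;

    /* The events value and the user pointer travel with the event. */
    if (port_send(port, EV_JOB_READY, payload) == 0 &&
        port_get(port, &pe, NULL) == 0 &&
        pe.portev_source == PORT_SOURCE_USER &&
        pe.portev_events == EV_JOB_READY)
        seen = pe.portev_user;

    (void) close(port);
    return seen;
#else
    /* Event ports are Solaris-specific; report that they are unavailable. */
    (void) payload;
    errno = ENOSYS;
    return NULL;
#endif
}
```

In a threaded program the port_send() and port_get() calls would typically sit in different threads that share the port identifier.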

The port_get(3C) and port_getn(3C) functions reap completed events from the port indicated by the first parameter, port. The difference between the two functions is that the port_getn() function can reap more than one event from the port. Again, once an object's event has been reaped, that object is dissociated from the port. When the port_getn() call returns, the number of reaped events is reflected by the value of the fourth parameter, nget. The timespec timeout parameters communicate how long the functions should block waiting for an event to arrive on the port. If there is an error (for example, if a timeout occurs), the functions return a value of -1 and set errno (ETIME in the case of a timeout).
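A batch reap with a timeout might look like the sketch below. It queues a few user events so that it is self-contained, then asks port_getn() for a given number of them. The helper name reap_batch() and the MAX_EVENTS size are assumptions for the example, and the handling of ETIME reflects my reading of the port_getn(3C) man page, so it should be checked against the installed documentation.

```c
#include <errno.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>
#if defined(__sun)
#include <port.h>
#endif

#define MAX_EVENTS 8   /* arbitrary batch size chosen for this sketch */

/*
 * Post nsend user events, then wait up to timeout_ms for `want` of them.
 * Returns how many events were reaped, or -1 on error.
 * reap_batch() is a hypothetical helper, not part of the port API.
 */
int reap_batch(int nsend, int want, long timeout_ms)
{
    if (nsend < 0 || want < 1 || want > MAX_EVENTS || timeout_ms < 0) {
        errno = EINVAL;
        return -1;
    }
#if defined(__sun)
    port_event_t list[MAX_EVENTS];
    struct timespec ts;
    uint_t nget;
    int port, i, rc, err;

    if ((port = port_create()) < 0)
        return -1;

    for (i = 0; i < nsend; i++) {
        if (port_send(port, i + 1, NULL) != 0) {
            (void) close(port);
            return -1;
        }
    }

    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;

    /* In: number of events to wait for. Out: number actually reaped. */
    nget = (uint_t)want;
    rc = port_getn(port, list, MAX_EVENTS, &nget, &ts);
    err = errno;
    (void) close(port);

    /* Assumption: on ETIME, nget holds the events that did arrive. */
    if (rc != 0 && err != ETIME) {
        errno = err;
        return -1;
    }
    return (int)nget;
#else
    /* Event ports are Solaris-specific; report that they are unavailable. */
    errno = ENOSYS;
    return -1;
#endif
}
```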

When the port_get(3C) or port_getn(3C) functions return with reaped events, the second parameter is one or more (depending on the function used) port_event_t structures filled with information to identify the event that took place. The structure listed in Code Sample 3 includes the source of the event, which can be found by using the values in Code Sample 4.

Code Sample 3: Event Completion Structure Listed in /usr/include/sys/port.h

typedef struct port_event {
        int         portev_events;    /* event data is source specific */
        ushort_t    portev_source;    /* event source */
        ushort_t    portev_pad;       /* port internal use */
        uintptr_t   portev_object;    /* source specific object */
        void        *portev_user;     /* user cookie */
} port_event_t;

Code Sample 4: Event Sources Listed in /usr/include/sys/port.h

#define PORT_SOURCE_AIO         1
#define PORT_SOURCE_TIMER       2
#define PORT_SOURCE_USER        3
#define PORT_SOURCE_FD          4
#define PORT_SOURCE_ALERT       5

Depending on the source of the event, the meaning of the portev_object member differs; the details are listed on the port_create() man page. For example, when the source of the event is an AIO transaction, portev_object refers to the aiocb structure for that transaction. As can be seen in the Examples section, the portev_user pointer can be used to consume the user-defined payload, which was indicated at the time the object was associated with the port.
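A receiver typically switches on portev_source before interpreting the other members. The sketch below shows only that dispatch step, using a hypothetical helper named port_source_name(). The fallback definitions repeat the values from Code Sample 4 and are there only so that the fragment compiles on systems without port.h.

```c
#include <stddef.h>
#if defined(__sun)
#include <port.h>
#else
/* Fallback values copied from Code Sample 4, for non-Solaris builds only. */
#define PORT_SOURCE_AIO    1
#define PORT_SOURCE_TIMER  2
#define PORT_SOURCE_USER   3
#define PORT_SOURCE_FD     4
#define PORT_SOURCE_ALERT  5
#endif

/*
 * Map a portev_source value to a printable name.
 * port_source_name() is a hypothetical helper, not part of the port API.
 */
const char *port_source_name(int source)
{
    switch (source) {
    case PORT_SOURCE_AIO:   return "AIO";    /* portev_object: the aiocb */
    case PORT_SOURCE_TIMER: return "TIMER";
    case PORT_SOURCE_USER:  return "USER";   /* sent with port_send() */
    case PORT_SOURCE_FD:    return "FD";
    case PORT_SOURCE_ALERT: return "ALERT";  /* port is in alert mode */
    default:                return "unknown";
    }
}
```

An event loop could call port_source_name(pe.portev_source) for logging, and then branch to source-specific handling in the same switch style.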

The port_alert(3C) function puts the port indicated by the first parameter, port, into alert mode by setting the third parameter, events, to a non-zero value. Once a port is put into alert mode, all of the threads waiting in the port_get() or port_getn() functions will awake with a PORT_SOURCE_ALERT event on the port. By setting the events parameter to 0, the port will be returned to a non-alert state.

When initiating an AIO transaction or arming a timer, the port_notify structure needs to be associated with the call (see Code Sample 5).

Code Sample 5: Event Notification Structure Listed in /usr/include/sys/port.h

typedef struct port_notify {
        int     portnfy_port;   /* bind request(s) to port */
        void    *portnfy_user;  /* user defined */
} port_notify_t;

In the case of AIO and timers, the port_notify_t structure is pointed to by the signal event structure's sigev_value.sival_ptr member (see the timer and AIO example listings in the Appendix).
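For a timer, the wiring between the sigevent structure and the port_notify_t structure can be sketched as follows. This is an illustration under stated assumptions rather than a copy of the Appendix listing: the helper name wait_for_timer() is hypothetical, SIGEV_PORT is the notification type I understand the Solaris 10 OS to use for ports, and the non-Solaris branch exists only so that the fragment compiles elsewhere. On the Solaris OS the program would need to be linked with the realtime library.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#if defined(__sun)
#include <port.h>
#endif

/*
 * Arm a one-shot timer that reports to a private event port, wait for
 * it to expire, and return the user cookie delivered with the event
 * (NULL on failure). wait_for_timer() is a hypothetical helper.
 */
void *wait_for_timer(long delay_ms, void *cookie)
{
    if (delay_ms <= 0) {        /* a zero it_value would disarm the timer */
        errno = EINVAL;
        return NULL;
    }
#if defined(__sun)
    port_notify_t pn;
    struct sigevent sev;
    struct itimerspec its;
    port_event_t pe;
    timer_t tid;
    void *seen = NULL;
    int port;

    if ((port = port_create()) < 0)
        return NULL;

    /* Bind the notification to the port and attach the user cookie. */
    pn.portnfy_port = port;
    pn.portnfy_user = cookie;
    memset(&sev, 0, sizeof (sev));
    sev.sigev_notify = SIGEV_PORT;
    sev.sigev_value.sival_ptr = &pn;

    if (timer_create(CLOCK_REALTIME, &sev, &tid) == 0) {
        memset(&its, 0, sizeof (its));
        its.it_value.tv_sec = delay_ms / 1000;
        its.it_value.tv_nsec = (delay_ms % 1000) * 1000000L;
        if (timer_settime(tid, 0, &its, NULL) == 0 &&
            port_get(port, &pe, NULL) == 0 &&
            pe.portev_source == PORT_SOURCE_TIMER)
            seen = pe.portev_user;
        (void) timer_delete(tid);
    }
    (void) close(port);
    return seen;
#else
    /* Event ports are Solaris-specific; report that they are unavailable. */
    (void) cookie;
    errno = ENOSYS;
    return NULL;
#endif
}
```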

Examples

To introduce the use of the event completion API, the following subsections include educational examples. Each one of the subsections takes one of the historic frameworks we have spoken about previously, giving the reader a bit of background information concerning the historic API and a pointer to a sample program that leverages the Solaris 10 OS event completion framework. The expectation is that the examples referenced here can help developers understand how the event completion API can be used in each scenario.

Asynchronous I/O

Asynchronous I/O is a framework by which an application can submit an I/O request that the system will handle without interacting with the application until the I/O request is complete. Generally, AIO is a framework that developers use to build applications that need to continue execution without waiting until an I/O request is complete. This need usually arises because an application has severe timing constraints.

The AIO framework within the Solaris OS has been built upon the aio_read(3RT) and aio_write(3RT) functions to submit the AIO requests. In older versions of the operating system, an application could reap a completed AIO transaction by using the aiowait(3AIO), aio_waitn(3RT), or aio_suspend(3RT) functions. Using these functions works well for processes with a few threads but not for highly multithreaded applications.

To provide an alternative, the new event completion framework within the Solaris 10 OS delivers the AIO event completion to a port. An application can reap the AIO event completion information using the port_get(3C) or port_getn(3C) functions. With the ability to create a port that is bound to a single thread or a group of threads, the developer of a highly multithreaded application can scale the AIO requests using the thread (as opposed to the process) as a basis.

In the Appendix, Listing 1 provides a simple program that initiates an AIO write and then reaps the status using the event completion framework.
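The general shape of such a program can be sketched as follows. This is not Listing 1 itself but a minimal illustration under assumptions: the helper name aio_write_via_port() is hypothetical, SIGEV_PORT is the notification type I understand the Solaris 10 OS to use for ports, and the non-Solaris branch exists only so that the fragment compiles elsewhere.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#if defined(__sun)
#include <aio.h>
#include <port.h>
#endif

/*
 * Write len bytes from buf to the start of fd with AIO and reap the
 * completion from a private event port. Returns the byte count, or -1.
 * aio_write_via_port() is a hypothetical helper.
 */
ssize_t aio_write_via_port(int fd, const void *buf, size_t len)
{
#if defined(__sun)
    struct aiocb cb;
    port_notify_t pn;
    port_event_t pe;
    ssize_t n = -1;
    int port, rc;

    if ((port = port_create()) < 0)
        return -1;

    pn.portnfy_port = port;     /* deliver the completion to this port */
    pn.portnfy_user = NULL;

    memset(&cb, 0, sizeof (cb));
    cb.aio_fildes = fd;
    cb.aio_buf = (void *)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;
    cb.aio_sigevent.sigev_notify = SIGEV_PORT;
    cb.aio_sigevent.sigev_value.sival_ptr = &pn;

    if (aio_write(&cb) == 0) {
        /* cb must stay valid until completion, so retry on EINTR. */
        do {
            rc = port_get(port, &pe, NULL);
        } while (rc != 0 && errno == EINTR);

        /* For an AIO event, portev_object identifies the aiocb. */
        if (rc == 0 && pe.portev_source == PORT_SOURCE_AIO &&
            (struct aiocb *)pe.portev_object == &cb &&
            aio_error(&cb) == 0)
            n = aio_return(&cb);
    }
    (void) close(port);
    return n;
#else
    /* Event ports are Solaris-specific; report that they are unavailable. */
    (void) fd;
    (void) buf;
    (void) len;
    errno = ENOSYS;
    return -1;
#endif
}
```

A multithreaded application would more likely create the port once and let a pool of threads call port_get() or port_getn() on it, rather than creating a port per request as this sketch does.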

Please note that the historic functions that were used within the AIO framework are still present within the Solaris 10 OS and function as expected.

Poll

Prior to the Solaris 10 OS, the best method to check if a fd was ready for reading and writing was to use the poll(2) or poll(7D) functions. poll(2) traditionally works well when the list of file descriptors is small and all the file descriptors in the list return with events. As was noted earlier, poll(7D) works well when the number of file descriptors does not change.

In the Solaris 10 OS, the event completion framework provides a way to reap the status of the fds within an application. As is mentioned in the poll(7D) man page, the event completion framework should be used in any situation where a developer would historically have used the poll(7D) interface. When using the event completion framework to reap fd status, the port remembers the registered file descriptors (unlike in the poll implementations). In addition, only new fds or fds that have an event pending need to be reactivated.

In the Appendix, Listing 2 shows a sample program that illustrates how to pass the POLLIN event as the fourth parameter in the port_associate(3C) call. This example shows how an application that historically would have been implemented using only the poll() interfaces can be written with ports.

Timers

Timers, created using the timer_create(3RT) function, are used within applications to deliver a signal when the timer expires. The signal delivered to the application is specified within the second parameter of the timer_create(3RT) function. Using the port_notify_t structure, we can instead have the expiration delivered as an event to a port of our choosing.

In the Appendix, Listing 3 provides a working sample of arming a timer and catching the expiration of that timer using the event completion framework.

User-Defined Events

In the Appendix, Listings 4 and 5 provide sample programs that illustrate how to send and receive user-defined events and payloads within a single thread and between processes. Please note that in Listing 5 the port identifier was passed through a pipe from one process to another in order for the processes to have access to the port's events.

Also, note that the code in Listing 5 contains two source files (denoted by 5a and 5b), port_sendfd_example and port_rcvfd_example. In order to run this example, please execute the port_sendfd_example binary first and then execute the port_rcvfd_example binary.

Related Work

Several other operating systems have implemented an event completion framework, to some extent. Within the following section I will step through several popular operating systems and describe the functionality they provide in comparison to the Solaris 10 OS event completion framework.

Windows

On Windows, the event completion functionality consists of the I/O Completion API in the Windows NT OS and the WaitForMultipleObjects API in the Windows Win32 OS.

The Microsoft Developer Network (MSDN) describes the I/O Completion framework as a pool of threads created when an application was started in order to process asynchronous I/O requests.¹ The threads within this pool are solely used to asynchronously complete I/O requests issued by the application. This framework consists of the CreateIoCompletionPort, GetQueuedCompletionStatus, and the PostQueuedCompletionStatus functions.

The CreateIoCompletionPort() call sets up a port with one or more file handles associated with it. When I/O operations (such as read, write, and so on) complete on these file handles, those events are posted to the port. In order to collect information about those events, the application has to call GetQueuedCompletionStatus(), which returns a key within the argument list to indicate the file that completed some I/O transaction. As with the port_get() function within the Solaris 10 OS, the argument list contains a timeout interval that indicates the maximum amount of time the call will wait for a completion event. And finally, the PostQueuedCompletionStatus() call can be used to post a completed I/O event into the port in lieu of the system. This last function is very similar in nature to the port_send() functionality in the Solaris 10 OS.

The WaitForMultipleObjects API provides a framework that takes an array of objects and waits for one or all of them to complete. This API can process objects of the following types: console input, user event, memory resource notification, mutex, process, semaphore, thread, and waitable timer. When the WaitForMultipleObjects() call is made and the completion of an event has not taken place, the calling thread enters the wait state.

The array of handles passed into the WaitForMultipleObjects framework can consist of a heterogeneous set of these objects. However, the array cannot contain multiple copies of the same handle. In addition, if one of these handles is closed before the wait timeout interval expires, the function's behavior is undefined.

The framework described here does not provide a simple, unified interface to create and use completion ports for asynchronous I/O, socket I/O, user events, and timers across the Windows OS variants.

FreeBSD, NetBSD, OpenBSD

FreeBSD, NetBSD, and OpenBSD provide the generic kqueue framework to take care of event completion.² The design of the kqueue framework provides a method to determine if AIO transactions, signal delivery, file transactions, process events (such as fork, exit, and so on), and file system changes have completed. The design goals of the kqueue project closely resemble those of the event completion framework within the Solaris OS due to the interest in creating a scalable framework to deliver events to threads. The architects of both frameworks decided early in the design phase to build an extendable system that could handle a growing number of objects (that is, files, pipes, sockets, and so on) and events.

The kqueue API consists of the kqueue() and kevent() functions. The kqueue() call creates a queue in which the application can register events of interest, such as AIO reads and writes, and so on. Once the queue has been created, the application has to register the events of interest using the kevent() call. The same kevent() call also reaps the completed events from the queue.

Linux Asynchronous I/O

The asynchronous I/O functionality has been integrated into Linux 2.6.³ For the last few years, prior to Linux 2.6, Ben LaHaise has maintained an AIO patch for the 2.4 Linux kernel.⁴ For our purposes, we will only examine the AIO functionality distributed within the standard Linux 2.6 kernel.

Within the Linux AIO framework, io_submit() and io_getevents() are the functions an application developer can use to submit I/O requests and to reap the completion or status of those requests, respectively. The Linux AIO framework supports read() and write() on raw disks and on files opened with O_DIRECT on the ext2, ext3, JFS, and XFS file systems. As of now, the Linux AIO framework does not support AIO fsync, AIO read()/write() on sockets and pipes, or files not opened with O_DIRECT.

AIO was not integrated within the standard Linux kernel until version 2.6. In addition, the AIO framework in Linux 2.6 has not been implemented as a general framework from which an application developer can use timers and user events.

Conclusion

In the past, developers had to rely on a group of frameworks to handle I/O events (such as AIO, poll(), timers, and so on) within an application. None of these frameworks allowed for an application thread to send an event with user-defined payload to another set of threads within the same application.

With the advent of the event completion framework within the Solaris 10 OS, a general framework has been implemented so that application developers can reap AIO, timer, poll(), and user-defined events using the same methods. In addition to extending functionality, the event completion framework has also focused on providing a more scalable and performant solution for the delivery of these events.

References

Kqueue: A Generic and Scalable Event Notification Facility (pdf)
Kernel Asynchronous I/O (AIO) Support for Linux
Linux-AIO Home Page

Acknowledgments

Thanks to Miguel Isenberg, Solaris 10 OS Event Completion Architect, for his invaluable documentation.

About the Author

Rob Benson is currently an engineer in the Market Development Engineering organization of Sun Microsystems. His group is focused on partner adoption of the Solaris OS, x86 Platform Edition.

Appendix: Code Example Listings

Listing 1: Example of using a port to reap the status of AIO
Listing 2: Threaded example of using ports to reap fd status using POLLIN events
Listing 3: Example of using a port to receive the firing of an expired timer
Listing 4: Example of using a port to send a user-defined payload
Listing 5a and Listing 5b: Examples of a port being shared between processes