Litherum: Asynchronous IO

Performing blocking IO in your applications main thread is generally considered to be bad practice, because a machine’s IO facilities operate asynchronously from the machine’s CPU. Therefore, when the machine is performing IO, the CPU is idle, waiting for the IO to complete. If we, instead, allow the CPU to continue doing work while the IO is being performed, the machine can perform both duties in parallel. However, this complicates the programming model; we now need some way of reacting when an IO request has been completed.

There are a couple ways of achieving this. The most straightforward way of doing something asynchronously is to do it synchronously on another thread. Then, while the second thread is blocked, the first thread can continue with any computation it might want to perform. If the main thread needs to perform some task that depends on the IO, you can use a mutex and condition variable [1] to make the main thread block on the results of the IO (but only once it has performed computation while the IO is occurring; otherwise the whole effort would be for naught). This model diminishes the amount of time the main thread is spent blocked, but it doesn’t eliminate it.

A similar model is to use continuation-passing style [2] to simply send any computation which depends on the IO to the thread which performs the IO. When that thread completes the IO, it can then simply run the continuation. However, this model requires your computation to hop between threads, which means reasoning about sequential ordering difficult.

Enter the runloop. We can set up a similar model where each thread has a work queue protected by a condition variable. A thread can enqueue a function invocation on another thread’s work queue, and then signal that thread’s condition variable to wake that thread up. Upon waking up, a thread simply performs all the tasks in its work queue, and then waits on its condition variable. This means that, if you want to perform some asynchronous IO combined with some computation which depends on the IO, you need four things:

Something representing the IO request you want to perform,
A callback which is to be queued when the IO request is complete,
A thread / run queue combination on which to perform the IO, and
A thread / run queue combination on which to perform the callback.

If we don’t care which particular thread a request runs on, we can aggregate threads into a thread pool, and the thread pool can create and destroy threads as necessary to handle all the in-flight requests.

This model is almost perfect; however, there’s a big problem with it: a particular thread can only perform a single IO request at a time. This means that we need a unique thread for every IO request that is in flight. Threads are fairly heavyweight, so this isn’t always reasonable. It would be useful if there was a way to perform multiple IO requests in a single system call. Actually that’s a bit of a red herring; all we really want is to be able to know when it would be possible to perform an IO request without blocking. That way, we can separate our IO into two parts - the part that blocks and the part that doesn’t. We’re interested in multiplexing the blocking part of multiple IO requests together (because if we can perform IO without blocking, we’ve already won).

For reading, this makes sense because we want to 1) Wait until the data is available, and 2) Actually receive the data. The first part definitely blocks, but if we’ve waited on the first part, then the second part shouldn’t block at all. For writing, we actually write into a buffer, which means that we have to 1) Wait until there is space in the buffer, and 2) Actually write the data. The waiting here is necessary because the kernel has internal buffers that only have a finite size, so they may be full.

So now, our run loop has the following structure:

Wait on a bunch of IO requests to become serviceable. This part will finish when at least one IO request is ready to be performed without blocking.
Figure out which of the many requests is ready.
Perform those IO requests (which shouldn’t block).
Perform any of the callbacks associated with those IO requests. These callbacks might add additional IO requests to the ones we’re supposed to be servicing.
Loop back around and start waiting again.

Using this model, we actually don’t need any secondary threads at all, since the blocking for all our IO requests is being completed by the same thread in the same system call. Also, note that this model only works if there’s only one thread reading/writing to a particular file descriptor - if someone takes your data out from under you, you’ll be stuck blocking on more data.

As you will notice, we’ve replaced our “blocking on a condition variable” to “blocking on IO requests.” This is a little problematic because it means that we’ve now lost the ability to be arbitrarily woken up by secondary threads (which applications can reasonably be expected to make). This problem is often solved by creating a “wake-up channel.” Kernels usually expose some kind of in-memory IO subsystem where one thread can write into some opaque object and another thread can read from it. If we create one of these ahead of time, and we always add reading-from-this-object to our in-flight IO request list, then another thread can write into this object and we will get woken up. Then, if we notice that this particular object is the one that woke us up, we can then look in our work list for anything that the writer may have enqueued there. (Therefore, in this method, there are no condition variables.)

Timers are a little interesting because we don’t want to have to create a thread for every timer that we create in our applications. Therefore, we want a single thread which will multiplex all the current timers that our application has created, and wake up for each one, one by one. Such a thread would have to sleep for a predetermined amount of time, but also might have to be woken up arbitrarily in case anyone creates a new timer which should expire earlier than any existing timer. This is usually implemented with the mechanism described above, except that our “wait for IO to be available” call usually also takes a timeout. We can set the timeout to whatever the next timer’s expiration time is, and any time someone creates a new timer, we can use the wake up channel to make sure the timeout gets updated.

Classically, this system is implemented with the select() system call [3], and the wake up channel is implemented with the pipe() system call [4], which works for reading/writing a particular file or network socket. However, modern operating systems (read: BSD) support waiting on more things than simply a particular file descriptor. The kevent() system call [5] was created, which can wait on:

Reading and writing to file descriptors (just like select() does)
AIO operations (more on this later)
Filesystem operations, like if a file is deleted, written to by another process, renamed, etc.
Child process exit()ing, fork()ing, or exec()ing
Receiving a signal (lower precedence than signal handlers)
IO on mach ports
Explicit support for timers

This means that you don’t have to have separate threads watching (polling?) for these events, and then sending messages back to your main thread when they have occurred. Instead, the main thread itself can watch for these events at the same time for watching for IO on file descriptors.

Linux has a similar system call called epoll() [6].

Even though this idea is pretty cool, you shouldn’t implement it yourself for a production project. There are many mature projects out there that implement this for you. Take a look at libevent [7] and libev [8]. On OS X, Grand Central Dispatch [9] and NSRunLoop [10] do this.

Now, after all that is said and done, I wanted to mention that modern kernels also have support for actually asynchronous IO (Recall how, in the previous section, we are performing synchronous IO, we’re just are performing it for all our requests at the same time). There are two relevant mechanisms: 1) Setting the O_NONBLOCK flag [11] on a file descriptor, and 2) using AIO calls [12].

O_NONBLOCK doesn’t actually cause file descriptors to perform asynchronous IO; however, it causes IO to error in the case that it would otherwise have blocked. That’s one way to avoid the “my main thread is sleeping” problem - simply never sleep! You can also specify this at open() time. It also looks like not all kinds of file descriptors, such as pipes, support this flag on all OSes.

AIO calls, such as aio_read() and aio_write() actually perform synchronous IO. The trouble here comes in when the kernel wants to let you know that your asynchronous IO is complete. The AIO calls let you specify a struct sigevent structure which encodes how you would like to be notified of IO completion. There are currently two mechanism for the notification: 1) You can provide a callback to the call that the OS will call on an indeterminate thread (which might cause the OS to create a new thread for such a purpose), or 2) You can specify a signal [13] that should be delivered to your process. OS X currently only supports the latter.

Signals have their own problems in this regard, though. Signals, first of all, are heavier weight than the previously mentioned alternative. The real doozie, however, is that a signal can be delivered at any point, and, depending on which version of which kernel you are running, on any thread (this is usually configurable, at least). The “at any point” criteria means that your thread is suspended, a new stack is created for the signal handler, and the signal handler is invoked, wherever the original thread happened to be executing. This can wreak havoc on critical sections in your code because run loop invocations are no longer atomic. This can make it very difficult to reason about your code or to show correctness.

You could also use kevent() to listen for AIO completions, but OS X doesn’t support this.

On some other operating systems (Windows, Solaris), you can specify an “IO Completion Port” [14] with your asynchronous IO operation, and when the operation is complete, an item is written to the port. You can then read from the port in your program at your own discretion.

[1] http://en.wikipedia.org/wiki/Monitor_(synchronization)#Condition_variables
[2] http://en.wikipedia.org/wiki/Continuation-passing_style
[3] http://en.wikipedia.org/wiki/Select_(Unix)
[4] http://en.wikipedia.org/wiki/Anonymous_pipe
[5] http://en.wikipedia.org/wiki/Kqueue
[6] http://en.wikipedia.org/wiki/Epoll
[7] http://libevent.org
[8] http://software.schmorp.de/pkg/libev.html
[9] http://en.wikipedia.org/wiki/Grand_Central_Dispatch
[10] https://developer.apple.com/library/mac/documentation/Cocoa/Reference/Foundation/Classes/NSRunLoop_Class/index.html
[11] https://www.freebsd.org/cgi/man.cgi?query=fcntl
[12] https://www.freebsd.org/cgi/man.cgi?query=aio_read
[13] http://en.wikipedia.org/wiki/Unix_signal
[14] http://msdn.microsoft.com/en-us/library/windows/desktop/aa365198(v=vs.85).aspx

Litherum

Saturday, January 10, 2015

Asynchronous IO

No comments:

Post a Comment