Thursday, January 29, 2015


Previously, I had described the logic behind B-splines. NURBS are not much different, but still require a little more thought.

The acronym NURBS stands for Non-Uniform Rational B-Spline. First off, the “non-uniform” part just means that the knots are not equally spaced. It seems silly to explicitly call this out in the name, as this is the natural state of B-splines.

That leaves us with the “Rational” part. Rational B-splines are B-splines where each basis function has a “weight” associated with it. However, this doesn’t simply mean that each of the basis functions gets naively multiplied by its weight. Instead, it’s a little more subtle than that.

First of all, recall that the original formulation of a B-spline is of the form Σ basis * value. If you naively multiply a particular basis function by a weight, that is mathematically equivalent to multiplying the value instead, which would effectively move your input point. The resultant curve might then not resemble our input values at all, and resembling the input values is the whole point of B-splines and NURBS.

Instead, we want to augment the contribution that a particular basis function makes, relative to the other basis functions. Let’s unpack that statement:
  1. We are interested in the “contribution” of a basis function. That means that we are interested in the value of a basis function compared to the sum of all the basis functions.
  2. We are augmenting a basis function in order to create a different basis function. Once we have our augmented basis functions, the formulation of a NURBS is still Σ newbasis * value.
  3. We want the weights to be relative. For example, if all the weights are 2, that should have the same resulting curve as if all the weights are 1.
With this goal in mind, it becomes clear that, as you increase the weight of a basis function, the curve gets drawn toward the associated value, rather than simply “upwards.”

Let’s take an example which will help us show how these weights get applied. Here is a chart that depicts 10 basis functions of order 4. The knot vector is [0, 1, 2, 3, 4, 100, 200, 300, 400, 500, 600, 601, 602, 603, 604].

Now, let’s show the same data in a stacked graph, in order to show each basis function’s contributions.

As you can see, the sum of all the basis functions is a constant 1. This isn’t an accident - it’s a fundamental requirement of B-splines. This is why the curves grow and shrink as you move knots around.

If we pick a particular input, we can see that more than one basis function contributes to the resulting sum, and the stacked graph makes each contribution easy to read off.

Going back to our goal, we want to augment the contributions of the constituent basis functions in proportion to their weights. This means that we don’t want to affect the sum total of the basis functions, but we do want to grow and shrink the contribution of each individual basis. The way we do that is pretty straightforward - multiply each of the basis functions by their weight, and then normalize by the new total. This looks like newbasis[i] = oldbasis[i] * weight[i] / (Σ oldbasis[k] * weight[k])
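As a sanity check, here is a minimal sketch of that normalization step (the `rational_basis` helper and the example basis values are my own, not from any library):

```python
# Sketch of the rational-basis construction, assuming we already have the
# values of each (non-rational) basis function at some parameter t.
def rational_basis(basis_values, weights):
    """Weight each basis value, then renormalize so the new values sum to 1."""
    weighted = [b * w for b, w in zip(basis_values, weights)]
    total = sum(weighted)
    return [v / total for v in weighted]

# Suppose, at some input, three basis functions evaluate to these values
# (they sum to 1, as B-spline bases do):
basis = [0.25, 0.5, 0.25]

# Uniform weights leave the contributions unchanged, and only the ratios
# between weights matter:
print(rational_basis(basis, [1, 1, 1]))   # [0.25, 0.5, 0.25]
print(rational_basis(basis, [2, 2, 2]))   # same result: weights are relative

# Boosting one weight grows its contribution at the others' expense,
# while the total stays 1:
print(rational_basis(basis, [1, 5, 1]))
```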

Here’s what the example data above looks like if you give one of the basis functions a weight of 5 and all the other ones a weight of 1. As you can see, the sum of all the basis functions is still 1; however, the contribution that the heavily-weighted basis function makes has grown proportionately.

Monday, January 26, 2015


B-splines are similar to Bezier curves in some ways and different in others. Both are calculated using a coefficient vector and a basis function vector. The benefit of B-splines over Bezier curves, however, is that B-splines’ basis functions are more local than Bezier curves’ basis functions.

Recall that the Bernstein polynomials were useful because each of them had a hump in a particular place along the (0, 1) range. However, the Bernstein polynomials are nonzero everywhere inside that same range. It would be beneficial if a particular basis function was 0-valued everywhere except in some small range, and if this range consisted entirely of the hump.

Such a function, however, is not a polynomial. There are no polynomials that are zero-valued everywhere except for some finite range, and nonzero within that range. This means that if we want to adopt such basis functions, our resulting curve won’t be a polynomial. Which is fine.

This also means that our resultant curve will be zero-valued outside of the union of all the ranges of all the basis functions.

So where do we put each of these nonzero ranges? Well, we want the ranges of the different basis functions to overlap (if they didn’t, then our resultant curve would be forced to 0 in all of the non-overlapping parts). The most logical way to do this is to cut up your range into k segments, and give each basis function a sliding window of consecutive segments. For example, you could give the first basis function segments 1 and 2, and give the second basis function segments 2 and 3, and give the third basis function segments 3 and 4, etc.

If we want to keep this general, we should also provide a way to change the width of that sliding window, so that each basis function can span more than two consecutive segments. If we give each basis function 4 segments, then the first basis function gets segments 1, 2, 3, and 4, while the second gets segments 2, 3, 4, and 5. We probably want this width to be a tunable parameter in our system. In the picture above, those basis functions are of order 4.

So how do we decide where on the number line the boundaries between our segments lie? Well, we could make them regularly spaced; however, that would limit the kinds of curves that we could create with this system. Instead, we should probably make these locations a tunable parameter to the system.

We now have the structure that we need to describe B-splines. A B-spline is characterized by three things:
  1. How many basis functions (and coefficient values) we want
  2. How many sequential segments each basis function is nonzero over. This is called the order of the spline, and is also equivalent to the order of the resulting polynomials and the basis polynomials.
  3. A vector of positions on the X axis which denote the locations of the segment boundaries. This is called the knot vector. Note that the number of elements in the knot vector is equal to the number of basis functions plus the order of the spline. Also note that the elements in this vector, called knots, must be non-decreasing (repeated knots are allowed).
With these pieces of information, we can generate each of our basis functions. The recurrence that generates these basis functions (the Cox-de Boor recursion) was constructed so that they form a basis over the vector space of splines. This is important because it means that any polynomial spline can be written as a linear combination of B-splines.
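As a sketch, that recurrence can be written out directly. This is a minimal, unoptimized recursive version; the function name and zero-based conventions are my own:

```python
# A minimal Cox-de Boor recursion. "order" means the number of consecutive
# segments each basis function spans, as described above; order 1 is the
# piecewise-constant base case.
def bspline_basis(i, k, t, knots):
    """Value of basis function i, of order k, at parameter t."""
    if k == 1:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    result = 0.0
    denom = knots[i + k - 1] - knots[i]
    if denom > 0:
        result += (t - knots[i]) / denom * bspline_basis(i, k - 1, t, knots)
    denom = knots[i + k] - knots[i + 1]
    if denom > 0:
        result += (knots[i + k] - t) / denom * bspline_basis(i + 1, k - 1, t, knots)
    return result

knots = [0, 1, 2, 3, 4, 5, 6, 7]
order = 4
n = len(knots) - order            # number of basis functions, per the rule above
# In the fully-covered region, the basis functions sum to 1 (partition of unity):
total = sum(bspline_basis(i, order, 3.5, knots) for i in range(n))
print(total)                      # ≈ 1.0
```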

Also note that discontinuities can be created if multiple knots are placed on top of each other (have the same value). The kind of discontinuity can be controlled with how many knots are all on top of each other.

Another interesting phenomenon: if the order of the B-spline is equal to the number of basis functions, and the first half of the knots all lie on top of each other while the second half all lie on top of each other, then you actually get a single effective interval (between the two piles of knots). The bases that traverse the entire range have the same behavior as the Bernstein polynomials (albeit limited to a particular domain). This means that every Bezier curve is representable as a B-spline.

So there you have it - the logical conclusion to the idea that the basis functions should be nonzero only in a particular range and zero everywhere else, generalized to a system which can represent any spline.

Sunday, January 25, 2015

Bezier Curves

In order to talk about Bezier curves, I first need to talk about polynomials.

A polynomial is a function y = a + b*x + c*x^2 + d*x^3 + … + p*x^n. Such a polynomial is said to have a “degree” of n.

Polynomials are smooth functions, so it is impossible to represent the curve by simply listing all the points on it. A better way to represent a particular polynomial is with a coefficient vector, which is [a, b, c, d, …, p] in the above example. If someone wants to recreate the original curve, they can do so by multiplying each term in the coefficient vector by successive powers of x and summing. This is important because, if you are trying to communicate a polynomial of degree n, you can do so by transferring only n + 1 coefficients (assuming that both ends have agreed ahead of time on the process of that multiplication with successive powers of x; this process is usually baked into code).

Another way to think of this operation is to perform a dot product between the coefficient vector and a basis vector. Therefore, the above polynomial could be represented by the dot product [a, b, c, d, …, p] · [x^0, x^1, x^2, x^3, …, x^n]. This is interesting because the values in the coefficient vector are simply scalars, while the values in the basis vector are functions themselves. In general, a polynomial of degree n can be communicated in n + 1 terms, assuming that both ends have agreed upon that basis vector ahead of time (and the basis functions form a basis over the vector space you are interested in).

However, there are alternative basis vectors. There are actually many different basis vectors that have been invented throughout history, but only one is relevant when discussing Bezier curves. The functions that Bezier curves use are called Bernstein polynomials. (For comparison, note that both the Fourier series and Taylor polynomials are different basis vectors.)

Before getting into Bernstein polynomials themselves, I want to discuss why other basis vectors are useful in the first place (naively, it seems like they just add the extra work of converting between different basis vectors). The reason is that each of the curves in a particular basis has some semantic meaning to it. In the Fourier series, each curve has a particular frequency. In the Taylor polynomial, each curve involves a particular derivative of a function.

The semantics behind the Bernstein polynomials are that each curve lies between y=0 and y=1, has a single hump, and the humps for all the functions are roughly equidistant across the interval from x=0 to x=1. In addition, as you move through the polynomials, the location of each function’s hump moves steadily to the right.

Here's a plot of all the Bernstein basis functions for degree 5 drawn on top of each other:

What this means is that a particular coefficient has an effect on the shape of the curve that is intuitive. Boosting a particular coefficient will have a local boosting effect on one part of the curve. This means that it is generally intuitive to figure out what values your coefficients should be if you have a particular shape in mind that you want your curve to hold. You can sculpt your coefficients and the resulting curve will have the same overall shape.
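A minimal sketch of those basis functions, assuming the standard Bernstein formula B(i, n, x) = C(n, i) * x^i * (1 - x)^(n - i):

```python
# The Bernstein basis of degree n: n + 1 polynomials on [0, 1], each with a
# single hump, and humps spread evenly across the interval.
from math import comb

def bernstein(i, n, x):
    """i-th Bernstein basis polynomial of degree n, evaluated at x in [0, 1]."""
    return comb(n, i) * x**i * (1 - x)**(n - i)

n = 5
values = [bernstein(i, n, 0.3) for i in range(n + 1)]
print(sum(values))   # ≈ 1.0: the Bernstein basis sums to 1 everywhere on [0, 1]

# Each function's hump peaks at x = i/n; e.g. the first one peaks at x = 0:
print(max(range(n + 1), key=lambda i: bernstein(i, n, 0.0)))  # 0
```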

For comparison, this property definitely does not hold for the original basis vector [x^0, x^1, x^2, x^3, …, x^n].

Usually when people talk about Bezier curves, they are actually talking about a sequence of Bezier curves that are connected tail-to-head, and they are talking about two-dimensional Bezier curves. A two-dimensional Bezier is just two Bezier curves, where each curve represents one particular dimension’s output values. (Each of the two Bezier curves has its own coefficient vector.) Because the coefficient vectors bear spatial resemblance to the output curve, the coefficient values and the output curve are usually plotted on the same surface.

In the two-dimensional case, a particular coefficient value is usually represented as a dot on the screen. That dot’s X position represents a coefficient in the X Bezier curve, and the dot’s Y position represents the coefficient in the Y Bezier curve. Once you have all the coefficients in the X Bezier curve, you can perform the dot product described above, which results in a description of all the X values that the resulting curve should hold when drawn. The same logic applies to the Y coordinates. Then, you can pair up the curve’s X and Y positions and plot them on the screen, eventually showing the entirety of the resultant curve.
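That procedure can be sketched briefly (the helper names are my own, and a real renderer would sample many values of t to trace out the whole curve):

```python
# Evaluate a 2D Bezier curve at parameter t by running the dot product
# (coefficients · Bernstein basis) once for X and once for Y.
from math import comb

def bernstein(i, n, t):
    return comb(n, i) * t**i * (1 - t)**(n - i)

def bezier_point(xs, ys, t):
    """One point on the 2D Bezier whose control dots are (xs[i], ys[i])."""
    n = len(xs) - 1
    basis = [bernstein(i, n, t) for i in range(n + 1)]
    return (sum(c * b for c, b in zip(xs, basis)),
            sum(c * b for c, b in zip(ys, basis)))

# X and Y coefficient vectors for a cubic (degree 3) Bezier:
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.0, 2.0, 0.0]
print(bezier_point(xs, ys, 0.0))   # (0.0, 0.0): the curve starts at the first dot
print(bezier_point(xs, ys, 1.0))   # (3.0, 0.0): and ends at the last
```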

Do this sequence for all of a path’s constituent Bezier curves, and voilà! You’ve got a description for the outline of a glyph in a font. In TrueType, all the Bezier curves that describe a glyph are of degree 2. In OpenType, the Bezier curves that describe a glyph can be of degree 3 (depending on … details).

Sunday, January 11, 2015

Mach Ports

Classically, Unix IO was performed on file descriptors. There are two kinds of interactions with a file descriptor: stream-based interactions and packet-based interactions. Stream-based interactions are designed for a flow of bytes, which is the model for file IO and TCP connections. When using these kinds of streams, you use the read()/write() family of system calls. The kernel can coalesce multiple write()s into a single read(), which makes sense - the model is a stream of bytes.

Packet-based IO is used for things like UDP socket connections. When performing this kind of IO, you use sendmsg()/recvmsg() and the semantics of reliability or ordering come from the underlying transport mechanism (meaning: if you use a UDP Internet connection, neither strict ordering nor reliable delivery are guaranteed).

Unix has the concept of pipes and sockets. Pipes are byte-oriented, while sockets can be either byte-oriented or packet-oriented, depending on how you create them (a socket can represent both TCP streams and UDP streams). The userland interface to both of these concepts is a file descriptor, which you pass to read()/write() and sendmsg()/recvmsg() for performing the required IO.

It is also possible to use these mechanisms for communication between processes on the same machine. The pipe() system call will create a new pipe with both ends available for reading and writing, and the socketpair() system call will do the same thing for sockets. Usually, when we’re interacting with another process, we want to perform some sort of RPC [1], so the packet-based model is a better conceptual fit than the stream model. (It is possible to use/create a multiplexer / demultiplexer so that you can use byte-oriented streams to pass messages, but packet-based communication doesn’t have this need.) Using a unix domain socket [2] even has the appropriate semantics for passing reliably and correctly ordered messages.

So what exactly is a packet, then? Well, it’s defined in struct msghdr. There are four conceptual pieces:
  1. The destination address. For unix domain sockets, this is ignored
  2. An array of buffers
  3. Protocol-specific control information. For unix domain sockets, this is a struct cmsghdr
  4. Flags
Everything is pretty straightforward, but it is interesting that you can specify control information in your packet. The cmsghdr structure consists of a vector of triplets which contain:
  1. The originating protocol. This should be SOL_SOCKET
  2. The protocol-specific type of control information (an int)
  3. An arbitrary buffer
For a unix domain socket, the type can be SCM_RIGHTS, which is used for sending a file descriptor over the socket. In particular, if I have a file descriptor, I can use the SCM_RIGHTS control type and specify the file descriptor. Then, when the message is delivered, the file descriptor that the reader gets will be a duplicate of the one sent. When passing the packet, the kernel will duplicate the file descriptor for the receiving process. You can also specify SCM_TIMESTAMP, SCM_CREDS, and SCM_TIMESTAMP_MONOTONIC types.
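Here is a sketch of that fd-passing mechanism, in Python for brevity, and using socketpair() rather than a named socket:

```python
# Passing a file descriptor over a unix domain socket with SCM_RIGHTS.
import array, os, socket

parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)

r, w = os.pipe()                  # some file descriptor worth sending

# Sender: attach the fd as SCM_RIGHTS control information on the packet.
fds = array.array("i", [r])
parent.sendmsg([b"here is an fd"],
               [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds.tobytes())])

# Receiver: the kernel delivers a *duplicate* of the descriptor.
msg, ancdata, flags, addr = child.recvmsg(1024, socket.CMSG_LEN(fds.itemsize))
level, ctype, data = ancdata[0]
received_fd = array.array("i", data)[0]

# The duplicate refers to the same underlying pipe as the original:
os.write(w, b"hello")
print(os.read(received_fd, 5))    # b'hello'
```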

Mach Ports [3] are conceptually similar to this model. A mach port is a message-based communications mechanism. However, it allows for much finer grained control than the packet-based approach described above.

The concept of a mach port is distinct from the concept of the rights that you have to interact with the port. The rights that you have for a particular port can change over time. There are three important types of rights: send rights, receive rights and send-once rights. A send-once right is exactly what it sounds like: the right to send to a port exactly one time.

Just like with socketpair(), you can create a new port with mach_port_allocate(), passing in MACH_PORT_RIGHT_RECEIVE. Once you have the receive right for the port, you can then grant yourself a send right with mach_port_insert_right().

You actually use the same call to send and receive from a port: mach_msg(). The message itself is modeled in the mach_msg_header_t structure, which has the following conceptual pieces:
  1. A remote port. When sending, this is the destination, and when receiving, this is the sender.
  2. A local port. When sending, you specify a port here for the receiver to reply to you with. This can be MACH_PORT_NULL if you don’t want the receiver to reply.
  3. A buffer
  4. A message ID. This is application-specific; the receiver can do whatever it wants with this. Usually it is an identifier of a particular RPC call; every function gets its own ID.
  5. Flags. These flags are used for the following:
    1. Remote right. This is used so you can specify which right you want to use when sending the message
    2. Local right. This is the right that the other end should get on the reply port that the sender specifies.
    3. Possibly mark the message as “complex.”
As you can see, mach ports aren’t bidirectional, and they don’t come in pairs. A port represents one end of a communication, so if you want to simply send a port a message, you don’t need a port of your own - you only need the destination port.

A “complex” message is where things get even more interesting. This is where you can specify an array of “descriptors” after the message header. There are various different kinds of descriptors. One kind wraps a mach port which can be sent to the receiver, similar to the SCM_RIGHTS example above. Another kind specifies out-of-line data which can be sent along with the message. This out-of-line descriptor specifies if the data should be copied to the receiver, or if the pages should be simply mapped into the receiver. A third type of descriptor serves as a combination of the two previous kinds, and allows you to specify an out-of-line array of port names to send to the receiver.

Performing asynchronous IO with mach ports is possible by using port sets. You can create a port set which contains particular ports, and use the port set in the mach_msg() call. The receive will complete when a message arrives on any of the ports in the set. You can therefore use this similarly to how I described using select() [4].

One of the big differences between file descriptors and mach ports is that ports are not inherited through a fork() the way that file descriptors are. This means that if a process wants to talk to its child processes, it must find them using a general lookup mechanism that any process can use to look up any service. This is implemented using a process’s “bootstrap port,” a port that every process automatically gets and can use to talk to a system service. That service’s job is to allow processes to register ports under a particular name, and to allow processes to find ports by a particular name. Therefore, if two processes want to talk to each other, one uses its bootstrap port to register a well-known service name, and the other asks its own bootstrap port, using the same well-known name, for a port it can use to communicate with the service. Therefore, there is a service discovery mechanism built right in to the mach ports infrastructure.

As you can see, all these additional facilities make mach ports ideal for interprocess RPC. Being able to explicitly codify reply messages, being able to explicitly give a message an ID which corresponds to the RPC call being performed, being able to send shared out-of-line memory, being able to send ports through ports, and being able to perform dynamic service discovery all stack up to make it a great choice.

However, I want to point out that mach ports don’t allow you to do anything that you couldn’t already do with socket packets. Sending a file descriptor over a unix domain socket is possible, as described above. Out of line data is possible to send by sending the file descriptor that represents a shared memory segment as created by shm_open() or shmget(). The smaller message limits for socket packets can be worked around by breaking up packets into smaller packets (Indeed, it looks like there is even some system facility for this with the MSG_WAITALL flag). File descriptors don’t have the same access controls that ports have, but that just means that disobeying the access controls is something that you can do with file descriptors that you can’t do with ports. Dynamic service discovery is possible with a process that listens on a well-known named pipe, like how D-Bus [5] works.

There is a big difference, though, between the two approaches. The performance of mach ports seems dramatically better than the performance of unix domain sockets. On my machine, the biggest size of a packet that I could send through a unix domain socket is 2048 bytes, which is fairly small when it comes to RPC calls. Using messages of this size, my throughput using mach messages is around 2.5 times faster than using unix domain sockets. In addition, the maximum size of a message that I could send through a mach port was on the order of 50 megabytes, which means that I would need many more socket packets to send a large message than I would need mach port messages, which would only make the case for unix domain sockets worse. I do wonder, however, if there is something innately different between sending a packet and sending a mach message that makes mach fundamentally faster, or if unix domain sockets could be sped up.


Saturday, January 10, 2015

Asynchronous IO

Performing blocking IO in your application’s main thread is generally considered to be bad practice, because a machine’s IO facilities operate asynchronously from the machine’s CPU. Therefore, when the machine is performing IO, the CPU is idle, waiting for the IO to complete. If we, instead, allow the CPU to continue doing work while the IO is being performed, the machine can perform both duties in parallel. However, this complicates the programming model; we now need some way of reacting when an IO request has been completed.

There are a couple ways of achieving this. The most straightforward way of doing something asynchronously is to do it synchronously on another thread. Then, while the second thread is blocked, the first thread can continue with any computation it might want to perform. If the main thread needs to perform some task that depends on the IO, you can use a mutex and condition variable [1] to make the main thread block on the results of the IO (but only once it has performed computation while the IO is occurring; otherwise the whole effort would be for naught). This model diminishes the amount of time the main thread spends blocked, but it doesn’t eliminate it.

A similar model is to use continuation-passing style [2] to simply send any computation which depends on the IO to the thread which performs the IO. When that thread completes the IO, it can then simply run the continuation. However, this model requires your computation to hop between threads, which makes reasoning about sequential ordering difficult.

Enter the runloop. We can set up a similar model where each thread has a work queue protected by a condition variable. A thread can enqueue a function invocation on another thread’s work queue, and then signal that thread’s condition variable to wake that thread up. Upon waking up, a thread simply performs all the tasks in its work queue, and then waits on its condition variable. This means that, if you want to perform some asynchronous IO combined with some computation which depends on the IO, you need four things:
  1. Something representing the IO request you want to perform,
  2. A callback which is to be queued when the IO request is complete,
  3. A thread / run queue combination on which to perform the IO, and
  4. A thread / run queue combination on which to perform the callback.
If we don’t care which particular thread a request runs on, we can aggregate threads into a thread pool, and the thread pool can create and destroy threads as necessary to handle all the in-flight requests.
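A minimal sketch of such a work queue, with hypothetical names (a real run loop would add shutdown handling, a thread pool, and so on):

```python
# A per-thread work queue guarded by a condition variable: other threads
# enqueue functions and signal; the owning thread drains and runs them.
import threading
from collections import deque

class WorkQueue:
    def __init__(self):
        self._tasks = deque()
        self._cond = threading.Condition()

    def enqueue(self, fn):
        with self._cond:
            self._tasks.append(fn)
            self._cond.notify()          # wake the owning thread

    def run_one_pass(self):
        with self._cond:
            while not self._tasks:
                self._cond.wait()        # sleep until someone enqueues work
            tasks, self._tasks = list(self._tasks), deque()
        for fn in tasks:                 # run the callbacks outside the lock
            fn()

results = []
q = WorkQueue()
worker = threading.Thread(target=q.run_one_pass)
worker.start()
q.enqueue(lambda: results.append("io done"))   # e.g. the IO-completion callback
worker.join()
print(results)   # ['io done']
```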

This model is almost perfect; however, there’s a big problem with it: a particular thread can only perform a single IO request at a time. This means that we need a unique thread for every IO request that is in flight. Threads are fairly heavyweight, so this isn’t always reasonable. It would be useful if there was a way to perform multiple IO requests in a single system call. Actually that’s a bit of a red herring; all we really want is to be able to know when it would be possible to perform an IO request without blocking. That way, we can separate our IO into two parts - the part that blocks and the part that doesn’t. We’re interested in multiplexing the blocking part of multiple IO requests together (because if we can perform IO without blocking, we’ve already won).

For reading, this makes sense because we want to 1) Wait until the data is available, and 2) Actually receive the data. The first part definitely blocks, but if we’ve waited on the first part, then the second part shouldn’t block at all. For writing, we actually write into a buffer, which means that we have to 1) Wait until there is space in the buffer, and 2) Actually write the data. The waiting here is necessary because the kernel has internal buffers that only have a finite size, so they may be full.

So now, our run loop has the following structure:
  1. Wait on a bunch of IO requests to become serviceable. This part will finish when at least one IO request is ready to be performed without blocking.
  2. Figure out which of the many requests is ready.
  3. Perform those IO requests (which shouldn’t block).
  4. Perform any of the callbacks associated with those IO requests. These callbacks might add additional IO requests to the ones we’re supposed to be servicing.
  5. Loop back around and start waiting again.
Using this model, we actually don’t need any secondary threads at all, since the blocking for all our IO requests is being completed by the same thread in the same system call. Also, note that this model only works if there’s only one thread reading/writing to a particular file descriptor - if someone takes your data out from under you, you’ll be stuck blocking on more data.
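As a sketch, Python’s selectors module (a portable wrapper over select()/kqueue()/epoll) makes the loop structure above concrete; the pipe write stands in for data arriving from outside:

```python
# A single-threaded wait/dispatch loop: block until a descriptor is ready,
# perform the non-blocking IO, then run the associated callback.
import os, selectors

sel = selectors.DefaultSelector()
r, w = os.pipe()
os.set_blocking(r, False)

done = []

def on_readable(fd):
    done.append(os.read(fd, 1024))   # step 3: this read won't block now
    sel.unregister(fd)               # nothing left to wait on

sel.register(r, selectors.EVENT_READ, on_readable)   # request + callback

os.write(w, b"ping")                 # simulate data arriving

while sel.get_map():
    for key, events in sel.select(): # step 1: block until something is ready
        key.data(key.fileobj)        # steps 2-4: dispatch to the callback
    # step 5: loop back around and wait again

print(done)   # [b'ping']
```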

As you will notice, we’ve replaced “blocking on a condition variable” with “blocking on IO requests.” This is a little problematic because it means that we’ve now lost the ability to be arbitrarily woken up by secondary threads (which applications can reasonably be expected to make). This problem is often solved by creating a “wake-up channel.” Kernels usually expose some kind of in-memory IO subsystem where one thread can write into some opaque object and another thread can read from it. If we create one of these ahead of time, and we always add reading-from-this-object to our in-flight IO request list, then another thread can write into this object and we will get woken up. Then, if we notice that this particular object is the one that woke us up, we can then look in our work list for anything that the writer may have enqueued there. (Therefore, in this method, there are no condition variables.)

Timers are a little interesting because we don’t want to have to create a thread for every timer that we create in our applications. Therefore, we want a single thread which will multiplex all the current timers that our application has created, and wake up for each one, one by one. Such a thread would have to sleep for a predetermined amount of time, but also might have to be woken up arbitrarily in case anyone creates a new timer which should expire earlier than any existing timer. This is usually implemented with the mechanism described above, except that our “wait for IO to be available” call usually also takes a timeout. We can set the timeout to whatever the next timer’s expiration time is, and any time someone creates a new timer, we can use the wake up channel to make sure the timeout gets updated.
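A sketch of that wake-up channel, using select() with a timeout and a pipe (the names are my own):

```python
# select() sleeps until the next timer's deadline, but writing one byte to
# the wake-up pipe ends the sleep early (e.g. when a sooner timer is created).
import os, select, threading, time

wake_r, wake_w = os.pipe()

def wait_until_next_timer(timeout):
    readable, _, _ = select.select([wake_r], [], [], timeout)
    if wake_r in readable:
        os.read(wake_r, 1)        # drain the wake-up byte
        return "woken early"      # recompute the next deadline and wait again
    return "timer fired"

# Another thread creates an "earlier timer" and pokes the channel:
threading.Thread(target=lambda: (time.sleep(0.05), os.write(wake_w, b"x"))).start()
msg = wait_until_next_timer(timeout=5.0)
print(msg)   # woken well before the 5-second timeout expires
```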

Classically, this system is implemented with the select() system call [3], which waits for reading/writing on a particular file or network socket, and the wake up channel is implemented with the pipe() system call [4]. However, modern operating systems (read: BSD) support waiting on more things than simply a particular file descriptor. The kevent() system call [5] was created, which can wait on:
  1. Reading and writing to file descriptors (just like select() does)
  2. AIO operations (more on this later)
  3. Filesystem operations, like if a file is deleted, written to by another process, renamed, etc.
  4. Child process exit()ing, fork()ing, or exec()ing
  5. Receiving a signal (lower precedence than signal handlers)
  6. IO on mach ports
  7. Explicit support for timers
This means that you don’t have to have separate threads watching (polling?) for these events, and then sending messages back to your main thread when they have occurred. Instead, the main thread itself can watch for these events at the same time for watching for IO on file descriptors.

Linux has a similar facility in the epoll family of system calls [6].

Even though this idea is pretty cool, you shouldn’t implement it yourself for a production project. There are many mature projects out there that implement this for you. Take a look at libevent [7] and libev [8]. On OS X, Grand Central Dispatch [9] and NSRunLoop [10] do this.

Now, after all that is said and done, I wanted to mention that modern kernels also have support for truly asynchronous IO (recall how, in the previous section, we were performing synchronous IO; we were just performing it for all our requests at the same time). There are two relevant mechanisms: 1) Setting the O_NONBLOCK flag [11] on a file descriptor, and 2) using AIO calls [12].

O_NONBLOCK doesn’t actually cause file descriptors to perform asynchronous IO; however, it causes IO to error in the case that it would otherwise have blocked. That’s one way to avoid the “my main thread is sleeping” problem - simply never sleep! You can also specify this at open() time. It also looks like not all kinds of file descriptors, such as pipes, support this flag on all OSes.
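A quick sketch of that behavior (in Python, os.set_blocking() flips the O_NONBLOCK flag for you):

```python
# With O_NONBLOCK set, a read that would have blocked fails with EAGAIN
# instead of sleeping.
import errno, os

r, w = os.pipe()
os.set_blocking(r, False)        # clears the "block on read" behavior

try:
    os.read(r, 1)                # the pipe is empty: this would have blocked
    outcome = "read succeeded"
except BlockingIOError as e:
    outcome = "EAGAIN" if e.errno == errno.EAGAIN else "other error"
print(outcome)                   # EAGAIN

os.write(w, b"x")
print(os.read(r, 1))             # b'x': data is available, so the read succeeds
```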

AIO calls, such as aio_read() and aio_write(), actually perform asynchronous IO. The trouble comes in when the kernel wants to let you know that your asynchronous IO is complete. The AIO calls let you specify a struct sigevent structure which encodes how you would like to be notified of IO completion. There are currently two mechanisms for the notification: 1) You can provide a callback that the OS will call on an indeterminate thread (which might cause the OS to create a new thread for that purpose), or 2) You can specify a signal [13] that should be delivered to your process. OS X currently only supports the latter.

Signals have their own problems in this regard, though. Signals, first of all, are heavier weight than the previously mentioned alternative. The real doozie, however, is that a signal can be delivered at any point, and, depending on which version of which kernel you are running, on any thread (this is usually configurable, at least). The “at any point” part means that your thread is suspended, a new stack is created for the signal handler, and the signal handler is invoked, wherever the original thread happened to be executing. This can wreak havoc on critical sections in your code because run loop iterations are no longer atomic. This can make it very difficult to reason about your code or to show correctness.

You could also use kevent() to listen for AIO completions, but OS X doesn’t support this.

On some other operating systems (Windows, Solaris), you can specify an “IO Completion Port” [14] with your asynchronous IO operation, and when the operation is complete, an item is written to the port. You can then read from the port in your program at your own discretion.