## Saturday, May 25, 2013

### Speeding up Counting Sort with Histopyramids

I was listening to Matt Swoboda’s 2012 GDC talk, where he mentions using histopyramids to optimize the marching cubes algorithm. You can also use histopyramids to speed up counting sort, as he describes.

A histopyramid is essentially a mip chain of a tensor, except instead of the reduction operator being an average, the reduction operator is addition. Because our input is only one dimensional, a histopyramid can more simply be described as adding elements pairwise in our input array, then recursing. For example, if our input list is this:

[3, 1, 4, 1, 5, 9, 2, 6]

then the next level of the histopyramid would be

[4, 5, 14, 8]

and the subsequent two levels are

[9, 22]
[31]

It can easily be seen that there are O(log(n)) levels in the histopyramid, and that the last element is the sum of all the elements in the original list. You can also see that, if you have a parallel processor, creating each level from the successive level has constant time delay, which means the delay for creating the entire histopyramid is O(log(n)). Memory allocation is easy because n + n/2 + n/4 + n/8 + ... + 1 just equal 2n, so you can simply allocate one array that’s twice the size of the input, then populate the second half of the input from the first half. There has to be synchronization between generating each level of the histopyramid, however, and because the OpenCL ‘barrier’ function only operates on thread groups and not globally, this has to be done with multiple kernel invocations.

Okay, so how do we use this to actually sort the list? Well, let’s associate each value in the histopyramid with the slice of the input array that contribute to its value. The last value in the histopyramid would therefore correspond to the entire input list. In the previous level of the histopyramid, the two values correspond to the first half of the input array and the second half of the input array, respectively, and so on and so forth.

Now, let’s say that we create a histopyramid of the frequency histogram. In such a histopyramid, each value in the histopyramid corresponds to the size of the output array that its slice encompasses. In our example above, this would mean that there are 31 values in the output array, 9 of which are between [0..3] and 22 of which are between [4..7] (because these are the bounds of the slices that each value corresponds to). If you look one level deeper, you can see that 4 of the output values are between [0..1], 5 values are between [2..3], 14 values are between [4..5], etc.

This means that, if i’m trying to populate a particular cell in the output list, I first ask “is the index of my cell less than 9 or greater than 9?” This is relevant because I know that the first 9 values are between [0..3]. If my index is greater than 9, I then shrink my range  to the 22 that I’m about to inspect (by subtracting 9 from my cell index), and ask if my index is now less than 14 or greater than 14. Eventually, I find the cell in the frequency histogram that my index corresponds to, so I can then fill in my cell with the frequency histogram’s cell’s index. You can see that this algorithm is O(log(n)).

Therefore, we have three passes. The length of the input array is ‘n’, and the size of the maximum value in the input array is ‘m’.
- To initialize the frequency histogram and the histopyramid to 0, 2m threads perform a O(1) computation
• To create the frequency histogram, n threads perform a O(1) computation (using atomic_inc). Because atomic operations have to be serialized, this has a delay of mode(input).
• To create the histopyramid, there are log(m) passes, the biggest of which requires m/2 threads, and they each perform a O(1) computation
• To create the output array, n threads perform a O(log(m)) computation.

Therefore, overall, the overall delay of the algorithm is O(mode(input) + log(m)). Also, max(n, 2m) threads are required to achieve this delay. The memory complexity is the same as for the serial case, because the histopyramid is the same size as the frequency histogram (so it’s just twice as big, which is the same asymptotic function).

### Counting Sort's Relation to Bead Sort

Counting sort is a sorting algorithm that doesn’t use comparisons to re-order its objects. Instead, counting sort creates a frequency histogram, then recreates the sorted list from the frequency histogram. For example, if you know that the input array consists of 3x of value A and 4x of value B, and that A < B, you know the sorted array is just AAABBBB. An obvious limitation of this is that the memory complexity of the sort is O(maximum value). Counting sort is the same as bucket sort, if our buckets have a size of 1.

Implementing counting sort on a scalar processor looks something like this (using my own made-up syntax):

max := maximum(input)  // Assumed to be known at compile time
histogram[] := allocate max cells
initialize histogram[] to 0
foreach input as x:
histogram[x]++
foreach histogram as (index, x):
for 0 to x:
emit index

However, I’m interested in implementing this on a parallel processor (such as a graphics card). The first loop is straightforward to implement, as it is embarrassingly parallel (the increment can be done with atomic operations). Each iteration of the second loop, however, depends on the previous iteration, which makes that loop entirely serial.

Interestingly enough, Wikipedia describes this second loop a little differently. Its description looks like this:

strictly_increasing := scanl1((+), histogram)  // scanl1 function taken from Haskell
foreach input as x:
output[atomic_inc(&strictly_increasing[x])] = x  // where atomic_inc returns the original value

In this approach, the last loop is also embarrasingly parallel, however, the scanl1 isn’t. Therefore, it’s not actually any better for a parallel processor.

There is, however, a way around this. What if we tried to calculate the strictly_increasing array directly? Instead of simply incrementing the histogram count of a particular value, we can modify the original loop to to increment all the histogram values to the right of the particular value. Incrementing each one of these histogram values is independent of each other, so we can actually do all these increments in parallel. This means that, instead of assigning each thread to a particular item in the input array, we can assign one thread to the combination of an item in the input array and a cell in the histogram. This means that we go to a two-dimensional work size. Each thread, therefore, simply executes “if (histogram_index > input_element) atomic_inc(&histogram[histogram_index]);”.

Looking at that code, it would appear that this is a constant-time sort. However, it isn’t. Those atomic increment operators need to be serialized on the memory bus. Therefore, the delay of this algorithm is on the order of whichever value needs to be incremented the most. Because the output of this histogram is strictly increasing, the rightmost value is the one that gets incremented the most. It also gets incremented for every element in the input list. This means that this algorithm is simply O(n).

Interestingly enough, we can actually use almost the exact same kernel to transform this histogram into the sorted list. The value of each value in the histogram corresponds to how many values in the input array are less than the index of that value in the histogram. Because we also know the length of the input array, we can determine, for each value in the histogram, how many values in the output array are greater than that particular index into the histogram. Well, if we know that k elements are greater than some threshold, why don’t we increment the last k elements in the output array? If we step along the thresholds one by one, the last elements in the array will be incremented according to their ultimate size. Therefore, if we go through all the thresholds (which are each of the indices in the histogram), we can recreate the sorted list. We can stepping through the histogram in one dimension, step through output indices in the other dimension. Each work item can check to see if it should increment its cell, and does so accordingly. This is essentially exactly the same code as the previous pass; the only difference is that the bounds are reversed. The number of work items is also the same because the bounds of the work grid are just flipped.

This second pass isn’t constant time either, for the same reason that the previous pass isn’t constant time. In particular, the increments that end up creating the greatest element have to be serialized. That means that this second pass is O(m), where m is the greatest value in the input array. Therefore, overall, this algorithm is O(n+m).

There is a caveat, however. Moving to a two-dimensional work size means creating a whole lot of threads. I’m running this on my graphics card, which is a parallel processor; however, my graphics card has some limitations. In particular, it only has two execution units, which means that I can only ::actually:: execute two work groups at the same time. Any more than that, and the GPU will have to context switch between the two groups. My graphics card also has a maximum work group size of 1024 elements. Because we have a two-dimensional work size, this means that my 1024 elements should be accessed as a 32x32 grid. We have two execution units, so our grid is effectively 32x64 elements big. This means that, if our maximum value that we want to sort is roughly the same as the length of the input list, we can only sort lists of about sqrt(32*64)=45 elements long without context switching the GPU. Context switching is bad because it means that our parallel threads aren’t actually executing in parallel, which is bad for our parallel sort. Therefore, moving to a two-dimensional work item size means we need more threads than a modern GPU can provide.

## Sunday, May 12, 2013

### Design of Mesa 3D Part 9: Vertex Buffer Objects

After the last series of posts, we now have a pretty good handle on how shaders are compiled. The next thing that a GL program typically does is create buffers and upload them. All of the relevant calls are in src/mesa/main/bufferobj.c.

We keep track of all the buffer objects in a hash table stored inside ctx->Shared->BufferObjects. This means that creating a buffer object is easy; just ask the hash table what its lowest unused key is (which is potentially an O(n) exhaustive search), ask the driver for a NewBufferObject(), then insert it into the hash table with that key. We have to surround that piece with a mutex because shared contexts may be executing buffer commands on different threads, and buffers are shared. The implementation of the software driver's NewBufferObject() function, or _mesa_new_buffer_object(), just malloc's a gl_buffer_object and initializes it with some default values.

Here's the layout of gl_buffer_object, defined in src/mesa/main/mtypes.h:

struct gl_buffer_object
{
GLint RefCount;
GLuint Name;
GLenum Usage;        /**< GL_STREAM_DRAW_ARB, GL_STREAM_READ_ARB, etc. */
GLsizeiptrARB Size;  /**< Size of buffer storage in bytes */
GLubyte *Data;       /**< Location of storage either in RAM or VRAM. */
/** Fields describing a mapped buffer */
/*@{*/
GLbitfield AccessFlags; /**< Mask of GL_MAP_x_BIT flags */
GLvoid *Pointer;     /**< User-space address of mapping */
GLintptr Offset;     /**< Mapped offset */
GLsizeiptr Length;   /**< Mapped length */
/*@}*/
GLboolean Written;   /**< Ever written to? (for debugging) */
};

_mesa_BindBuffer() is also fairly simple; it immediately delegates to bind_buffer_object().  We switch over the target to find which pointer in the context we're going to be updating, and then see if we can take an early out if the buffer is already bound properly. Otherwise, we look up the buffer in our hash map, and ask the driver to create the buffer if it didn't already exist. We then reference the buffer we found in our hash map, actually change the pointer in the context to point to the buffer, and ask the driver to BindBuffer(), which has no default implementation.

Alright, now let's allocate some space for the data with _mesa_BufferData(). After checking that the input data is good, we just hand the call off to the driver. The software driver (_mesa_buffer_data()) implements this by simply realloc'ing the Data pointer in the object, setting the fields for the size and usage of the buffer, and, if the input pointer is not NULL, memcpy()'ing the data into it. _mesa_BufferSubData() and _mesa_buffer_sub_dat()a work almost exactly the same way (with additional bounds checking).

_mesa_MapBuffer() creates an access bit string based on the access enum provided, does some error checking, then delegates to the driver's MapBuffer() function. If the driver was successful, we fill in the gl_buffer_object's AccessFlags field with the bit string we previously computed, then return the Pointer field of the gl_buffer_object. The driver function, mesa_buffer_map(), disregards the access parameter, and returns a pointer to the original data.

Using a software renderer sure makes it simple to deal with buffers! I'll get into how hardware drivers implement this in another post. For now, it's onward to glDraw* and the tnl pipeline!

## Monday, May 6, 2013

### Runtime of std::hash

One of the questions that I like to interview candidates for my team with involves hashing strings. I was talking with someone about it, and they insisted that computing a hash function takes constant time, (O(1)). I then asked how good the has function could really be if it didn't look at all the characters. He insisted that you could look at all the characters, but still have the function be constant time.

I decided that this would be a good opportunity to see how it's really done. The new C++ standard, C++11, includes std::hash, which does the computation that we're interested in. The implementation that FreeBSD's libstdc++ is also open source, so I can just go look at the source.

The symbol is defined in string.h, so let's take a look at that file. We can immediately see that hash<basic_string<_CharT, _Traits, _Allocator> >::operator() delegates to __do_string_hash() which delegates to __murmur2_or_cityhash(). That function is defined in include/memory, and has a really interesting trick to it, using templates:

// We use murmur2 when size_t is 32 bits, and cityhash64 when size_t
// is 64 bits.  This is because cityhash64 uses 64bit x 64bit
// multiplication, which can be very slow on 32-bit systems.
template <class _Size, size_t = sizeof(_Size)*__CHAR_BIT__>
struct __murmur2_or_cityhash;

As you can see, we're creating a new templated value called size_t. We can then use template specialization to define various implementations of the function. In particular, we have these two implementations:

template <class _Size>
struct __murmur2_or_cityhash<_Size, 32>
{
_Size operator()(const void* __key, _Size __len);
};

// murmur2
template <class _Size>
_Size
__murmur2_or_cityhash<_Size, 32>::operator()(const void* __key, _Size __len)

We've also got a cityhash64 implementation.

template <class _Size>
struct __murmur2_or_cityhash<_Size, 64>
{
_Size operator()(const void* __key, _Size __len);

...

// cityhash64
template <class _Size>
_Size
__murmur2_or_cityhash<_Size, 64>::operator()(const void* __key, _Size __len)

Cool! Alright, now let's read those functions to try to see what the running time is. First, let's take murmor. The main loop in this function looks like this:

for (; __len >= 4; __data += 4, __len -= 4)

In addition, __len isn't modified in the loop. Well, that makes it simple; this is clearly O(n). What about cityhash? Its main loop looks like this:

do {
...
__len -= 64;
} while (__len != 0);

__len isn't modified elsewhere within the loop. Well, looks like we've got an answer! std::hash is O(n).