Counting sort is a sorting algorithm that doesn’t use comparisons to re-order its objects. Instead, counting sort creates a frequency histogram, then recreates the sorted list from the frequency histogram. For example, if you know that the input array consists of 3x of value A and 4x of value B, and that A < B, you know the sorted array is just AAABBBB. An obvious limitation of this is that the memory complexity of the sort is O(maximum value). Counting sort is the same as bucket sort, if our buckets have a size of 1.
Implementing counting sort on a scalar processor looks something like this (using my own made-up syntax):
max := maximum(input) // Assumed to be known at compile time
histogram := allocate max cells
initialize histogram to 0
foreach input as x:
foreach histogram as (index, x):
for 0 to x:
However, I’m interested in implementing this on a parallel processor (such as a graphics card). The first loop is straightforward to implement, as it is embarrassingly parallel (the increment can be done with atomic operations). Each iteration of the second loop, however, depends on the previous iteration, which makes that loop entirely serial.
Interestingly enough, Wikipedia describes this second loop a little differently. Its description looks like this:
strictly_increasing := scanl1((+), histogram) // scanl1 function taken from Haskell
foreach input as x:
output[atomic_inc(&strictly_increasing[x])] = x // where atomic_inc returns the original value
In this approach, the last loop is also embarrasingly parallel, however, the scanl1 isn’t. Therefore, it’s not actually any better for a parallel processor.
There is, however, a way around this. What if we tried to calculate the strictly_increasing array directly? Instead of simply incrementing the histogram count of a particular value, we can modify the original loop to to increment all the histogram values to the right of the particular value. Incrementing each one of these histogram values is independent of each other, so we can actually do all these increments in parallel. This means that, instead of assigning each thread to a particular item in the input array, we can assign one thread to the combination of an item in the input array and a cell in the histogram. This means that we go to a two-dimensional work size. Each thread, therefore, simply executes “if (histogram_index > input_element) atomic_inc(&histogram[histogram_index]);”.
Looking at that code, it would appear that this is a constant-time sort. However, it isn’t. Those atomic increment operators need to be serialized on the memory bus. Therefore, the delay of this algorithm is on the order of whichever value needs to be incremented the most. Because the output of this histogram is strictly increasing, the rightmost value is the one that gets incremented the most. It also gets incremented for every element in the input list. This means that this algorithm is simply O(n).
Interestingly enough, we can actually use almost the exact same kernel to transform this histogram into the sorted list. The value of each value in the histogram corresponds to how many values in the input array are less than the index of that value in the histogram. Because we also know the length of the input array, we can determine, for each value in the histogram, how many values in the output array are greater than that particular index into the histogram. Well, if we know that k elements are greater than some threshold, why don’t we increment the last k elements in the output array? If we step along the thresholds one by one, the last elements in the array will be incremented according to their ultimate size. Therefore, if we go through all the thresholds (which are each of the indices in the histogram), we can recreate the sorted list. We can stepping through the histogram in one dimension, step through output indices in the other dimension. Each work item can check to see if it should increment its cell, and does so accordingly. This is essentially exactly the same code as the previous pass; the only difference is that the bounds are reversed. The number of work items is also the same because the bounds of the work grid are just flipped.
This second pass isn’t constant time either, for the same reason that the previous pass isn’t constant time. In particular, the increments that end up creating the greatest element have to be serialized. That means that this second pass is O(m), where m is the greatest value in the input array. Therefore, overall, this algorithm is O(n+m).
There is a caveat, however. Moving to a two-dimensional work size means creating a whole lot of threads. I’m running this on my graphics card, which is a parallel processor; however, my graphics card has some limitations. In particular, it only has two execution units, which means that I can only ::actually:: execute two work groups at the same time. Any more than that, and the GPU will have to context switch between the two groups. My graphics card also has a maximum work group size of 1024 elements. Because we have a two-dimensional work size, this means that my 1024 elements should be accessed as a 32x32 grid. We have two execution units, so our grid is effectively 32x64 elements big. This means that, if our maximum value that we want to sort is roughly the same as the length of the input list, we can only sort lists of about sqrt(32*64)=45 elements long without context switching the GPU. Context switching is bad because it means that our parallel threads aren’t actually executing in parallel, which is bad for our parallel sort. Therefore, moving to a two-dimensional work item size means we need more threads than a modern GPU can provide.