I was listening to Matt Swoboda’s 2012 GDC talk, where he mentions using histopyramids to optimize the marching cubes algorithm. You can also use histopyramids to speed up counting sort, as he describes.
A histopyramid is essentially a mip chain of a tensor, except instead of the reduction operator being an average, the reduction operator is addition. Because our input is only one dimensional, a histopyramid can more simply be described as adding elements pairwise in our input array, then recursing. For example, if our input list is this:
[3, 1, 4, 1, 5, 9, 2, 6]
then the next level of the histopyramid would be
[4, 5, 14, 8]
and the subsequent two levels are
[9, 22]
[31]
It can easily be seen that there are O(log(n)) levels in the histopyramid, and that the last element is the sum of all the elements in the original list. You can also see that, if you have a parallel processor, creating each level from the successive level has constant time delay, which means the delay for creating the entire histopyramid is O(log(n)). Memory allocation is easy because n + n/2 + n/4 + n/8 + ... + 1 just equal 2n, so you can simply allocate one array that’s twice the size of the input, then populate the second half of the input from the first half. There has to be synchronization between generating each level of the histopyramid, however, and because the OpenCL ‘barrier’ function only operates on thread groups and not globally, this has to be done with multiple kernel invocations.
Okay, so how do we use this to actually sort the list? Well, let’s associate each value in the histopyramid with the slice of the input array that contribute to its value. The last value in the histopyramid would therefore correspond to the entire input list. In the previous level of the histopyramid, the two values correspond to the first half of the input array and the second half of the input array, respectively, and so on and so forth.
Now, let’s say that we create a histopyramid of the frequency histogram. In such a histopyramid, each value in the histopyramid corresponds to the size of the output array that its slice encompasses. In our example above, this would mean that there are 31 values in the output array, 9 of which are between [0..3] and 22 of which are between [4..7] (because these are the bounds of the slices that each value corresponds to). If you look one level deeper, you can see that 4 of the output values are between [0..1], 5 values are between [2..3], 14 values are between [4..5], etc.
This means that, if i’m trying to populate a particular cell in the output list, I first ask “is the index of my cell less than 9 or greater than 9?” This is relevant because I know that the first 9 values are between [0..3]. If my index is greater than 9, I then shrink my range to the 22 that I’m about to inspect (by subtracting 9 from my cell index), and ask if my index is now less than 14 or greater than 14. Eventually, I find the cell in the frequency histogram that my index corresponds to, so I can then fill in my cell with the frequency histogram’s cell’s index. You can see that this algorithm is O(log(n)).
Therefore, we have three passes. The length of the input array is ‘n’, and the size of the maximum value in the input array is ‘m’.
- To initialize the frequency histogram and the histopyramid to 0, 2m threads perform a O(1) computation
- To create the frequency histogram, n threads perform a O(1) computation (using atomic_inc). Because atomic operations have to be serialized, this has a delay of mode(input).
- To create the histopyramid, there are log(m) passes, the biggest of which requires m/2 threads, and they each perform a O(1) computation
- To create the output array, n threads perform a O(log(m)) computation.
Therefore, overall, the overall delay of the algorithm is O(mode(input) + log(m)). Also, max(n, 2m) threads are required to achieve this delay. The memory complexity is the same as for the serial case, because the histopyramid is the same size as the frequency histogram (so it’s just twice as big, which is the same asymptotic function).