The way I was doing this before was to line up each of the beads in a giant 2-D array in one stage, make the beads fall in another stage, then manipulate the new array into telling me the sorted numbers in a third stage. I realized that I could combine the first two stages, speed up the entire program, and cut down on the amount of code that need exist.

Instead of setting up a 2-D array of booleans ahead of time, and then making them all fall, I could simply make them fall as I put them on the columns. This decreases the code size drastically. The algorithm for the first two stages now simply goes as such:

- Assign an index number to each column, from 0 to n.
- Give each column a single int counter variable, which will refer to the number of beads that are stacked up on that pole
- Have each column go through the numbers to be sorted, and, if the number to be sorted is greater than the index for the current column, increment the column’s count. Note: This can be done in a very tricky way by saying “count += num > index”.
- Have each column output it’s count to an output array, in the index of its own column index. Call this output array the “counts” array.

Now, you have that nice little triangle of sorted numbers. How should one go about reading them off the columns? Well, here’s the “aha” moment. The first number in the sorted array is the number of columns with beads in the first position of the column. How many columns have this property? Well, all the columns with counts greater than 0 have this property. Okay, now, the second number in the sorted array is the number of columns with breads in the second position of the column. How many columns have this property? Well, all the columns with counts greater than 1. The next number in the sorted array is the number of columns with counts greater than 2.

Now, this is the exact same logic as used in the first phase of the bead sort algorithm! Now, each column goes through the counts array, and if the value in the counts array is greater than the column index, that column’s count gets incremented. The only difference is now, the values of m and n are switched. On this second pass, each column is going through the count array, which is of size m (the maximum size of a number to sort). The new output array will be of size n (it has to be, because n is the number of numbers to sort, and the new output array is the sorted numbers).

The real interesting part of this algorithm is that you can simply run the parallel part of the algorithm twice, with different arguments. This leads to very small, clean, readable code. It also turns the runtime of the algorithm from O(m*n) to O(max(m, n)).

My cleanest implementation of this code is probably written in CUDA, and can be found here.