Recently (read: past couple of years) I have become interested in high performance computing. This includes everything from PlayStation 3s to GPGPUs to BlueGenes. It seems that each company has created it’s own framework and hardware with which to program a high performance program.
As I was researching all these different kinds of technologies, I thought to myself that each one was interesting, but programmed in a completely different way. For example, when programming with NVidia CUDA, you have to make sure that the execution path is the same for all the threads that you want to run (If you want the best performance). If you’re programming with the Cell Broadband Engine, you have to use their (arcane) mbox to pass messages between the multiple processors. If you’re using MPI, you have to have many computers that all have the same hardware, are set up with single sign on, and have a shared filesystem – certainly a non-trivial task.
Not only does the programming have many different quirks for each system, but the performance varies greatly as well. The lastest NVidia (295GTX) card boasts 289 gigaflops (http://en.wikipedia.org/wiki/GeForce_200_Series); the PlayStation 3 is reported to have 28 gigaflops (http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats). A beowulf cluster simply has the addition of all the flops of its constituent parts. Now, I’ve heard amazing things about each of these technologies, and those numbers don’t seem to reflect the hype.
I thought I would take a survey of each of these technologies to try to make some sense of the mess out there. I believe if I carefully document the problems I encounter, the pros and cons that I have found for each system, performance, and cost, both the world and I may be a little happier.
The problem that I have chosen to implement on each of these technologies is a program called Bead Sort (http://en.wikipedia.org/wiki/Bead_sort). It is a sorting algorithm (hence the name) and I thought it would be particularly appropriate since computer scientists like myself seem to be obsessed with sorting algorithms. I’ll try to explain it quickly here (I know clicking on links is strenuous!):
Imagine, if you will, there are 10 totem poles in front of you, all lined up. Next to the totem poles is a pile of gigantic washers – big enough to fit around the totem poles. Now, you want to represent the number 5, so you pick up five washers and put one on each of the first five totem poles. The washers slide down the poles, and now you have 5 washers lying at the bottom of the first five totem poles. Now you want to represent the number 7, so you follow suit and drop 7 washers down the first 7 totem poles. Now, however, there are already 5 washers on the floor, so the first 5 of this set of 7 land on the washers that are already there, and the two left over drop down to be flush with the floor. The sort works by dropping each of the numbers you want to sort down on the totem poles, and you’re left with a nice little triangle of washers on your totem poles, with each horizontal level required to have less washers on it than the level below it. Then you simply read off the number of washers in each level in order, and your list is now sorted.
The benefit of this sorting method is that, if each of the totem poles are handled in parallel (as they are via gravity), it is an O(n) sort (sort of, see below). You just drop sets of washers down on these poles, and the number of sets you drop is the number of numbers you want to sort. Once you’re done dropping the sets, you’re done. “But wait!” you may be screaming, “converting the number 5 to 11111 is itself an O(n) loop! That turns the computation into an O(n*m) one, where n is the number of numbers and m is the size of the maximum number!” I do have answer for you, though it may not seem very tasteful.
In theoretical computer science, complexity is a function of the size of the input. The idea is that if an algorithm is O(n^2), doubling the size of the input should quadruple the processing time required. The input to a program is technically a sequence of characters – not numbers themselves. Therefore, doubling the size of the input of a program, technically, should be doubling the number of characters in the input. The value of a number and the representation for a number are not linearly related, however. If you take the number “123” and double its representation to get “123123”, you don’t get a number that’s twice as big – you get a number that’s roughly 1000 times as big.
So, the loop that converts that “5” to 11111 is really just a loop to represent a number in the format that it should be represented in – it does have the property where doubling the representation to 1111111111 doubles its value. As computer scientists, and even mathematicians, we don’t want to write all those tally marks to represent large numbers, so we have adopted a terminology that lets us logarithmically decrease the number of digits we have to write. Therefore, counting the loop to turn digits into sequences of bits should not be counted. I know this argument is weak, but I feel that it at least provides a feeble defense.
One of the benefits of using this algorithm is that sorting is a standard practice in computer science; so many different sorting algorithms can be easily swapped out for each other. In addition, the contract that a sorting function makes is very well defined. I can use this to my advantage – I could make a separate object file for each of the different kinds of sorts that I want to use, each of which includes a sort function of the same name and signature. Then I can use the same calling code with all methods – I just have to choose which object file I link this calling code with to choose which implementation to test. This will work to an extent – I can only do this for programs that will run on the same hardware. Other than that, I will have to build an entirely new binary for each entirely new system that this will run on (i.e. a PlayStation 3 cannot share the calling binary that my laptop uses).
Anyway, I have been programming with NVidia CUDA for maybe a year or so, so I thought that I have a fair amount of experience with graphics cards, but I wanted to try a new platform for the first implementation. I have heard about OpenCL, and thought that I would give it a try, so I watched Dave Gohara’s OpenCL tutorials (http://macresearch.org/opencl) to learn about the platform.
My initial reaction was that OpenCL was (almost) everything that GPU computing should be. The first thing that struck me was that OpenCL uses just-in-time compilation, so the binary form of your kernel is not actually made at compile time. When you run your OpenCL program, you have to give your program the text of the C code that describes your kernel, and your program compiles, builds, links it, and runs it. This means that there is a compiler and linker inside the binary of my bead sort program. Not only that, but I do not get the benefit of using a well-established compiler like G++ - I have to hope that the OpenCL compiler is fast and optimized. I realize that computers today (and certainly any computer that you would be using OpenCL on would have a decent graphics card and would therefore be newish) are fast, powerful, have large disk drives and large amounts of memory, but something like this would make my software engineering professor cry. It throws cohesion out the window.
Now, I can see the problem that they were trying to solve. First of all, my program should run on whatever kind of hardware it wants to, regardless of where it was compiled. By compiling the kernel at runtime, it can be tuned for the specific hardware that will be running it. In addition, it has the benefit that I can compile my OpenCL program with any old compiler and not have to worry about the compiler having support for making binaries for graphics cards. CUDA solved this problem by simply creating their own compiler that does know how to compile binaries for specific graphics cards. Both ways work, and both ways have pros and cons, so I’m going to delay judgment of this design decision. I will say, however, that CUDA’s solution makes a lot of sense given that the hardware is fixed – the programs only need to run on NVidia hardware, and the fact that they are attempting to keep their technology closed and proprietary. OpenCL appears to be attempting to keep the options open for what hardware the program will run on.
That being said, OpenCL is entirely extendable. In fact, it’s designed so that you can change the device your program runs on by simply changing one line of code – the line of code that selects the active device. This works well enough that you can run your program on a CPU or a GPU just by changing this line. Think about that – it means that it will generate entirely different binaries depending on where you want to run your calculation, on the fly. This is incredibly powerful, and I can see this technology being extended to any of the various HPC systems out there. There is a current effort to port OpenCL to the PlayStation 3, to make it much easier to program with (http://sites.google.com/site/openclps3/). According to that website, even IBM has released an OpenCL implementation for the cell architecture. I have a dream, that one day I will be able to program all HPC hardware in one unified way. OpenCL appears that, one day, it may be that way.
So, after learning how to use OpenCL, I started to actually write my program. This actually required a fair amount of thinking. Clearly, one thread (or work-unit, in OpenCL terminology) should handle each totem pole. In addition, the presence of a washer at a specific place on the totem pole clearly should be represented by a bit (1 = present, 0 = absent). Therefore, the totem poles would be represented as a two-dimensional array of bits. Each thread will crawl up one of the totem poles and make them “fall.” However, computers do not have the concept of a single bit being addressable, so you must use bytes. Then, 8 threads must be responsible for a single byte. But each thread is not aware of the other threads operating on the same byte, so thread 1 can set a specific byte to be something (say a washer is falling on totem pole 1) but thread 5 is going to set it to be something completely different (say a washer is not falling on totem pole 5). Since the memory is byte-addressable, the threads will step on each others’ toes. Okay – simple solution: have 1 thread operate on each 8 bits in a byte, and wrap up the computation in a for-loop that executes 8 times, once for each bit in the byte.
Okay, but how about if column 4 has a washer every other spot, and column 7 has no washers in it. How would one go about actually making the washers “fall?” Well, the simple solution would be to simply count up the number of washers in each column, call it x, and set the n-x values of the totem pole to be 1 and the remaining values to be 0. This presents a problem, however, as the range of that last loop is data dependent. In the example given, column 4 is going to loop from 1 to n/2, and column 7 is going to loop from 0 to 0. As with any GPU programming, all the threads must have the same execution path for best performance. So the GPU threads cannot do the actual dropping of the washers – the only thing they can do is count the number of washers on a totem pole.
This, however, is okay! I have developed a linear time algorithm to take the counts for the totem poles and recreate the original numbers. Say you have counts of [7, 5, 5, 5, 3, 2] (meaning totem pole 1 has 7 washers on it, and totem pole 5 has 3 washers on it). Because there are 6 numbers here and therefore 6 totem poles in a row that have washers on them, you know that the largest row in the triangle (the bottom row) has 6 washers in it. In addition, the smallest number in that list is a 2, so you know that there are 2 rows with 6 washers in them. The output list is now [6, 6]. Now, erase these two largest rows because they have already been counted. Now the counts are [5, 3, 3, 3, 1]. There are 5 numbers in this list, so you know that there is a row with 5 washers in it. Add this to the output, and erase that row from the triangle. Now the output is [6, 6, 5] and the counts are [4, 2, 2, 2]. After repeating this until the entire pyramid has been removed, the output is now [6, 6, 5, 4, 4, 1, 1], which is the correct sequence. With proper value caching, this algorithm can be reduced to linear time. I believe it is a dynamic programming algorithm, but to say for sure would take more study.
So, everything is linear time. In addition, the timing for this last step on the CPU is around 0.1% or the timing for getting the totem pole counts on the GPU, so this last step can be safely ignored.
After coding this up, I felt that I needed to optimize a bit. The algorithm itself does not use shared (or “local,” in OpenCL lingo) memory at all – I figured this clearly was a problem. So, I gave each computational block (or “work group”) its own chunk of shared memory, and modified the kernel so that the computer copies chunks of data into this cache, operates on the cache, and then copies the next chunk in. The idea is that copying a splotch of data from global memory to shared memory is quick, assuming you format your data correctly. The main rules that govern this is, as I understand it, that all base pointers should just be divisible by 32, and the copy should be a “thread x copies byte x” one-to-one copy. Once the data is quickly copied to local memory, access to it is exceptionally quick, especially if you are accessing that data multiple times like I am – each thread has to deal with each byte 8 times for each of the 8 bits in the byte (This is where you see a greater increase in speed – n*speedup is much more noticeable than just a speedup).
In addition, I had to make sure that there were no bank conflicts. Each byte in local memory is located inside a bank, and if two thread attempt to access the same bank, one thread has to wait for the other to finish, resulting in a slowdown. From what I understand, shared memory byte x is located in bank x mod 32 – the bytes roll down the banks. Okay, so if the number of threads in a workgroup is limited to 32, each one can operate on its own bank, assuming the base pointer is divisible by 32 and each thread in the workgroup offsets from the base pointer by that thread’s id. Limiting the number of threads per workgroup to 32 will still keep each multiprocessor at full load, because the underlying driver will put two workgroups on the same multiprocessor if they will fit, and in this case the numbers are small enough that multiple workgroups will fit on the same multiprocessor.
So I coded this up in OpenCL and unhappily found that this new, memory-optimized version of the kernel is, in fact, slightly slower than the original, unoptimized version. This set me back a little bit – All that work for nothing! Perhaps I coded the solution incorrectly, but even after looking over the code everything seemed to be in place. After thinking about it for a little while, here is the explanation I came up with:
In the unoptimized version, the bytes being read are being read quickly, for the same reason that the optimized version can copy in data from global memory very quickly – the memory is coalesced. In both versions of the program, both versions are reading in the same amount of data from global memory – that should not account for any time differences. Therefore, the slowest part of the kernel is the same for both versions. As OpenCL has the concept of automatic variables, where the system will determine in which part of memory (local, global, etc) to put the variable, I am assuming that OpenCL is actually doing the caching that I explicitly programmed in under the hood in the unoptimized version. It is surely an intelligent system.
Looking back on programming in OpenCL, a fair amount of work was done for me under the hood (like the optimization). The amount of code to set up the OpenCL environment seems a little large and overly verbose, even though I know what the function of each line of code I wrote is. It makes sense that all those lines need to be there, but the sheer amount of the code I had to write just to run a kernel was a little daunting. Writing the kernel, though I had lots of things to keep in mind as I did it (such as memory coalescing and how I broke up the problem), seemed to be fairly straightforward. All in all, programming for OpenCL is very similar to programming with CUDA, with a couple more function calls that are required. Looking over the program afterwards seems deceptively simple – finding good values for the constants that I use was nontrivial, and having to pad the data so that its size is a multiple of 32 in both dimensions is not exactly elegant, but the final product is fairly clean. As this is the first framework that I have used, I cannot do any performance comparisons, but my laptop sorts anything I can throw at it (Anything that fits in the 128MB that is addressable on the video card) in under half a second. It surely is a fairly speedy framework.
My (uncommented) code can be found here.