Wednesday, October 20, 2010

Brute Forcing Einstein's Riddle with CUDA

The other day, I wrote a blog post describing how I brute forced Einstein's riddle with the Cell Broadband Engine. Being the geek that I am, I decided to see how the program would run on NVidia CUDA.

Here are some things I learned:

  • Programming is much easier in CUDA than on the Cell. You don't have to worry about memory alignment, you don't have to worry about DMA transfers, you don't have to worry about inter-SPU communication, you don't have to worry about waiting for asynchronous calls, or anything like that. All you do is a CUDA memcpy, and it just works.
  • NVidia's extension to the C language (namely, the <<<>>> syntax), seems a little inelegant to me. I like the Cell's build process better, it seems conceptually simpler, even though there are more steps.
  • Dealing with 2-D arrays is a little tricky in CUDA, because a pitch is involved. The reason a pitch is necessary is that memory is stored in banks. In a typical application, each thread will process a single row of a 2-D array. Now, if the length of the bank and the width of the array are not divisible easily, there can be many threads trying to access the same bank, which is bad. The solution, therefore, is to waste space, and pad empty space at the end of every row, so that each row exists in a different bank. This makes pointer arithmetic a little tricky, because, in order to access row i and column j, you have to take the pointer to the 2-d array, cast it as a char (because the unit of the pitch is in bytes), add i*pitch bytes to the pointer (which makes the pointer point to the correct row), cast it back to its original type, then add j to it. Ugliness, but it makes the program run faster.
  • I was having a problem where my kernels were failing to execute. It turns out it was because my blocksize was too large for the number of registers each thread was using. The idea is that a single multiprocessor has a pool of registers to use. Every thread in a block executes on the same multiprocessor. That means that the number of registers used is the number of threads in the block times the number of registers that each thread used. This multiplication must yield a number less than 8192, which is how many registers each multiprocessor has on my card. Fortunately, the number of threads in the block is adjustable (the number of registers the compiler uses is not adjustable).
  • The size of the grid you pass to CUDA can only be as big as 2^16-1. The maximum number of threads in the grid can be found above. This means that there is a maximum number of threads that can execute at a time. This number is several hundred times smaller than the size of the problem space that I was exploring. The solution to this is to call the kernel in a for loop, and pass an offset to all the threads in a particular kernel invocation. Each thread will then calculate its own unique index, add that to the offset, and use this sum as it's true index. That way, chunks of the problem space can be searched in series. This can be a little tricky, though, because CUDA threads don't support 64-bit ints, which are necessary to hold the entire problem space. This means that I had to pass the offset to the kernel in base 120 (as 5 ints), have each thread calculate its own local index, convert that to base 120 (so the index is now in a form of 5 ints), and perform a software addition of the two base-120 numbers.
  • The way CUDA works is that you write your kernel as if it is running in a single thread, using scalars. The CUDA compiler then figures out a way to run many copies of your program on a single vector processor. For example, say you have the code "a = b + c" in your program. Now, let's say that you have 4 threads executing this program simultaneously, and they are all executing the addition step at once. Let's call thread 1's a a1, thread 2's a a2, etc. If you have vectors of [b1, b2, b3, b4], and [c1, c2, c3, c4], you can add these and get [a1, a2, a3, a4] in a single vector add, rather than 4 scalar ads. Anyway, the compiler figures out a way to vectorize your program. When a kernel gets invoked, It schedules different blocks to run on the different multiprocessors. Inside a single block, the CUDA runtime executes threads in warps. This means that it executes threads 0-31 at the same time, using the vector processor, and using a first batch of registers (let's call them registers 0-39). After this warp has executed for some amount of time, that warp stops running and the runtime fetches the next warp, and executes threads 32-63 at the same time, using the same vector processor, using a second batch of registers (registers 40-59). It keeps executing warps this way until the block is done executing. The CUDA runtime is free to schedule some blocks to run after others have entirely completed, so there may be more blocks to be run than there are multiprocessors. If two threads in the same warp actually don't execute the same instruction at the same time (i.e. an "if" statement caused two different threads to execute different blocks), the CUDA runtime has no choice but execute both codepaths, and throw away the results for the wrong codepath for each group of threads. This is usually done with a "Select Bits" assembly command.
After this is all said and done, it turns out that the CUDA program is about 3.4 times faster than the Cell program. I attribute this to the fact that the problem is an embarrassingly parallel problem, and CUDA can execute multiple threads at the same time, if those threads follow the same codepath. I wonder which kind of problem has enough dynamic branching (or a problem that just isn't embarrassingly parallel) where it would beat CUDA in performance.
litherum@vincent:~/einstein$ time ./einstein
Found solution:
0 1 2 3 4
3 4 0 1 2
3 2 0 4 1
4 2 1 0 3
1 2 0 4 3
2 3 1 4 0

real    6m44.386s
user    6m43.105s
sys     0m1.132s
Here is my code.

No comments:

Post a Comment