Wednesday, October 20, 2010

Brute Forcing Einstein's Riddle with CUDA

The other day, I wrote a blog post describing how I brute forced Einstein's riddle with the Cell Broadband Engine. Being the geek that I am, I decided to see how the program would run on NVidia CUDA.

Here are some things I learned:

  • Programming is much easier in CUDA than on the Cell. You don't have to worry about memory alignment, you don't have to worry about DMA transfers, you don't have to worry about inter-SPU communication, you don't have to worry about waiting for asynchronous calls, or anything like that. All you do is a cudaMemcpy, and it just works (see the minimal host-side sketch after this list).
  • NVidia's extension to the C language (namely, the <<<>>> syntax) seems a little inelegant to me. I like the Cell's build process better; it seems conceptually simpler, even though there are more steps.
  • Dealing with 2-D arrays is a little tricky in CUDA, because a pitch is involved. The reason a pitch is necessary is that memory is stored in banks. In a typical application, each thread will process a single row of a 2-D array. If the width of the array doesn't divide evenly into the banks, many threads can end up contending for the same bank, which is bad. The solution, therefore, is to waste space: pad empty space onto the end of every row so that each row starts in a different bank. This makes pointer arithmetic a little tricky, because, in order to access row i and column j, you have to take the pointer to the 2-D array, cast it to a char * (because the pitch is measured in bytes), add i*pitch bytes to the pointer (which makes it point to the correct row), cast it back to its original type, then add j to it. Ugliness, but it makes the program run faster (see the pitched-access sketch after this list).
  • I was having a problem where my kernels were failing to execute. It turned out my block size was too large for the number of registers each thread was using. The idea is that a single multiprocessor has a pool of registers, and every thread in a block executes on the same multiprocessor. That means the total number of registers used is the number of threads in the block times the number of registers each thread uses, and this product must be at most 8192, which is how many registers each multiprocessor has on my card. Fortunately, the number of threads in the block is adjustable (the number of registers the compiler uses is not); the launch sketch after this list shows the arithmetic.
  • Each dimension of the grid you pass to CUDA can only be as big as 2^16-1, and the maximum number of threads per launch is that grid size times the block size (which is itself limited by the register budget above). This means there is a maximum number of threads that can execute per kernel invocation, and that number is several hundred times smaller than the size of the problem space I was exploring. The solution is to call the kernel in a for loop and pass an offset to all the threads in a particular kernel invocation. Each thread calculates its own unique index, adds the offset, and uses the sum as its true index; that way, chunks of the problem space are searched in series. This can be a little tricky, though, because CUDA threads don't support 64-bit ints, which are necessary to hold the entire problem space. This means I had to pass the offset to the kernel in base 120 (as 5 ints), have each thread calculate its own local index, convert that to base 120 (so the index is also 5 ints), and perform a software addition of the two base-120 numbers (see the chunking sketch after this list).
  • The way CUDA works is that you write your kernel as if it is running in a single thread, using scalars. The CUDA compiler then figures out a way to run many copies of your program on a single vector processor. For example, say you have the code "a = b + c" in your program, and say 4 threads are executing the addition step at once. Call thread 1's a a1, thread 2's a a2, and so on. If you have the vectors [b1, b2, b3, b4] and [c1, c2, c3, c4], you can add these and get [a1, a2, a3, a4] in a single vector add, rather than 4 scalar adds. Anyway, the compiler figures out a way to vectorize your program. When a kernel gets invoked, the runtime schedules different blocks to run on different multiprocessors. Inside a single block, the CUDA runtime executes threads in warps: it executes threads 0-31 at the same time, using the vector processor and a first batch of registers (let's call them registers 0-39). After this warp has executed for some amount of time, it stops running and the runtime fetches the next warp, executing threads 32-63 on the same vector processor with a second batch of registers (registers 40-79). It keeps executing warps this way until the block is done executing. The CUDA runtime is free to schedule some blocks to run after others have entirely completed, so there may be more blocks to be run than there are multiprocessors. If two threads in the same warp don't actually execute the same instruction at the same time (e.g. an "if" statement caused two threads to take different branches), the runtime has no choice but to execute both codepaths and throw away the results of the wrong codepath for each group of threads, usually via a "select bits" style assembly instruction (see the divergence sketch after this list).
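To make the first bullet concrete, here is roughly what a complete CUDA host program looks like. This is a minimal hypothetical sketch (the kernel and all the names are made up for illustration), not my actual solver:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void addOne(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;  /* each thread handles one element */
}

int main(void)
{
    const int n = 1024;
    int host[1024];
    for (int i = 0; i < n; ++i)
        host[i] = i;

    int *dev;
    cudaMalloc((void **)&dev, n * sizeof(int));

    /* No alignment rules, no DMA tags, no waiting: one call copies it over. */
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);

    /* Per the register bullet above: (threads per block) * (registers per
     * thread) must stay within the 8192 registers on my card's
     * multiprocessors, so e.g. a 40-register kernel caps the block at
     * 204 threads. */
    const int threadsPerBlock = 128;
    addOne<<<n / threadsPerBlock, threadsPerBlock>>>(dev, n);

    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    printf("host[5] = %d\n", host[5]);  /* prints 6 */
    return 0;
}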
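The pitched-access dance looks like this in practice; cudaMallocPitch and the cast-to-char * arithmetic are the real API, but the row-summing kernel is a hypothetical illustration:

#include <cuda_runtime.h>

__global__ void rowSum(float *array, size_t pitch, int width, float *sums)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per row */
    /* The pitch is measured in bytes, so cast to char * for the row step... */
    float *row = (float *)((char *)array + i * pitch);
    float total = 0.0f;
    for (int j = 0; j < width; ++j)
        total += row[j];  /* ...then index columns with the original type */
    sums[i] = total;
}

void allocatePitched(int width, int height)
{
    size_t pitch;
    float *devArray;
    /* The runtime pads each row and reports the padded length, in bytes. */
    cudaMallocPitch((void **)&devArray, &pitch, width * sizeof(float), height);
    /* ... fill devArray, then launch one thread per row ... */
}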
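The chunking scheme from the grid-size bullet can be sketched as follows. The names here are hypothetical, but the shape matches the description: the host walks the problem space in launch-sized chunks, and each thread folds its local index into the chunk's base-120 offset in software:

#include <cuda_runtime.h>

#define DIGITS 5
#define BASE   120

/* Add a small local index into a base-120 number (least significant digit
 * first), entirely in 32-bit arithmetic. */
__device__ void addBase120(const int offset[DIGITS], unsigned int idx,
                           int out[DIGITS])
{
    unsigned int carry = 0;
    for (int d = 0; d < DIGITS; ++d) {
        unsigned int v = offset[d] + idx % BASE + carry;
        idx /= BASE;
        out[d] = v % BASE;
        carry = v / BASE;
    }
}

__global__ void searchChunk(const int *offsetDigits, int *found)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int digits[DIGITS];
    addBase120(offsetDigits, idx, digits);  /* this thread's configuration */
    /* ... decode digits into five rows and test the 15 clues ... */
}

/* Host side: 64-bit ints are fine here, so convert the running offset. */
void toBase120(long long n, int digits[DIGITS])
{
    for (int d = 0; d < DIGITS; ++d) {
        digits[d] = (int)(n % BASE);
        n /= BASE;
    }
}

void runChunks(int *devDigits, int *devFound)
{
    const long long total = 24883200000LL;  /* 120^5 configurations */
    const int blocks = 65535, threads = 128;
    int hostDigits[DIGITS];
    for (long long offset = 0; offset < total;
         offset += (long long)blocks * threads) {
        toBase120(offset, hostDigits);
        cudaMemcpy(devDigits, hostDigits, sizeof hostDigits,
                   cudaMemcpyHostToDevice);
        /* A real version would bound-check the final, partial chunk. */
        searchChunk<<<blocks, threads>>>(devDigits, devFound);
    }
}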
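Finally, a tiny illustration of divergence: in this (hypothetical) kernel, the even and odd lanes of every warp take different branches, so each warp executes both paths back to back, masking off the inapplicable lanes each time:

__global__ void diverge(int *out)
{
    int i = threadIdx.x;
    if (i % 2 == 0)
        out[i] = i * 2;    /* even lanes run this path... */
    else
        out[i] = i + 100;  /* ...then odd lanes run this one */
}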
After all is said and done, it turns out that the CUDA program is about 3.4 times faster than the Cell program. I attribute this to the fact that the problem is embarrassingly parallel, and CUDA can execute many threads at the same time when those threads follow the same codepath. I wonder what kind of problem has enough dynamic branching (or just isn't embarrassingly parallel) that the Cell would beat CUDA in performance.
litherum@vincent:~/einstein$ time ./einstein
Found solution:
House
0 1 2 3 4
---------
3 4 0 1 2
3 2 0 4 1
4 2 1 0 3
1 2 0 4 3
2 3 1 4 0


real    6m44.386s
user    6m43.105s
sys     0m1.132s
litherum@vincent:~/einstein$
Here is my code.

Sunday, October 17, 2010

Brute Forcing Einstein’s Riddle

The other day, my girlfriend stumbled upon this riddle. It is attributed to Einstein, who supposedly claimed that 98% of the world’s population would not be able to solve it. It’s like one of those riddles you solve in third grade about matching 5 people and their favorite juices, or whatever. The setup of the riddle is fairly simple: there are five houses in a row. Each house is a different color and has an owner with a different nationality, who drinks a different beverage, smokes a different brand of cigar, and keeps a different pet. This means there are 5 different houses and 5 different criteria describing each house. The problem gives 15 statements that are true about the correct configuration.

Being the computer scientist that I am, I thought about how to get a computer to solve this for me. Brute forcing the solution seemed fairly simple, so I started there. If you don’t know what brute forcing a solution is: it’s writing a computer program that tries every possible input configuration to a problem until it finds the right one. Such a program can only work on a problem where the space of configurations is not too big (which is why brute-force attacks on common encryption don’t work: the key space is simply too big!)

Let’s get a feel for the problem space. If there were only one house, it would have 5 different choices for color, 5 different choices for nationality, 5 different choices for beverage, etc., making the number of possibilities 5*5*5*5*5, or 5^5 possible configurations. Now, if a second house were to exist, that house would have only 4 choices for color, because it can’t be the same color as the first house, whatever color that happens to be. By the same logic, the second house only has 4 choices for nationality, beverage, etc. This means that, for each configuration of the first house, the second house has 4*4*4*4*4 different configurations, for a total of 5^5*4^5 configurations with 2 houses. A third house would have only 3 choices in each dimension, bringing the total up to 5^5*4^5*3^5. Continuing the pattern, there are (5!)^5 different valid house configurations. This comes out to 24,883,200,000 configurations, or about 25 billion (the snippet below double-checks this arithmetic). That’s actually not that many configurations for a computer!
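As a quick sanity check of that arithmetic (a throwaway snippet, not part of the solver):

#include <stdio.h>

int main(void)
{
    long long total = 1;
    for (int choices = 5; choices >= 1; --choices)
        for (int dim = 0; dim < 5; ++dim)
            total *= choices;  /* each additional house multiplies in choices^5 */
    printf("%lld\n", total);   /* prints 24883200000 */
    return 0;
}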

So, how shall we represent each of these configurations? This seems simple enough. A particular configuration can be represented as a table of houses and the values assigned to them. An example would be the following:

            House 1  House 2  House 3  House 4  House 5
Color          1        2        3        4        5
Nationality    2        5        4        1        3
Beverage       4        5        1        3        2
Cigar          5        2        1        4        3
Pet            5        4        3        2        1
This configuration contains 25 numbers, and each number can be represented in a single byte, so this configuration can be represented as 25 bytes. However, given a configuration like this, it’s not immediately obvious what the “next” representation to test is. Also, I realized that between similar states, there is a lot of repeated information.

This led me to realize that the number of possibilities for each row is fairly small: there are only 5! = 120 possible different rows that are valid (because none of the numbers can be repeated in a single row.) Why not represent each row by a number from 0 to 119, and use this number as an index into a constant table of all the possible different rows? That would decrease the size of the configuration down to 5 bytes (a single byte can hold values up to 255, so 0-119 will surely fit). Also, the “next” configuration becomes very apparent: a configuration of [53, 92, 108, 7, 14] has a next configuration of [53, 92, 108, 7, 15]. Thus, enumerating all the configurations is simply counting from 0 to 120^5-1 in base 120. This provides a pretty simple way to enumerate the configurations of the problem: the i'th configuration can be found by simply converting i to base 120, as sketched below.
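A sketch of that conversion (the names are mine, not necessarily the actual code): decoding a configuration index into its five row indices is just repeated division by 120.

#define ROWS 5
#define BASE 120  /* 5! valid permutations per row */

/* Decode configuration index i into five row indices, least significant
 * digit first; digits[d] then indexes the table of permutations. */
void toBase120(long long i, int digits[ROWS])
{
    for (int d = 0; d < ROWS; ++d) {
        digits[d] = (int)(i % BASE);
        i /= BASE;
    }
}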

Testing a particular configuration is fairly straightforward. Because of the structure of the 15 clues, a function that finds the house number where property x is set to y is useful, as is a function that finds the value of property z in house w. Each of the tests can be formulated as a combination of those two functions (sketched below).
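Here is roughly what those helpers and one clue check might look like; the names, the enum encodings, and the config[property][house] layout are all hypothetical, chosen just for illustration:

enum { COLOR, NATIONALITY, BEVERAGE, CIGAR, PET };
enum { RED = 0, BRIT = 0 };  /* hypothetical value encodings */

/* House number whose property `prop` has value `value`. */
int houseWhere(int config[5][5], int prop, int value)
{
    for (int house = 0; house < 5; ++house)
        if (config[prop][house] == value)
            return house;
    return -1;  /* unreachable for valid configurations */
}

/* Value of property `prop` at house `house`. */
int valueAt(int config[5][5], int prop, int house)
{
    return config[prop][house];
}

/* Example clue: "the Brit lives in the red house". */
int clueBritRed(int config[5][5])
{
    return valueAt(config, COLOR, houseWhere(config, NATIONALITY, BRIT)) == RED;
}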

Then, I decided to run this on the Cell Broadband Engine, as found in the PlayStation 3. Last year I wrote a ray tracer on the Cell, but I felt that a revisit would be insightful. The structure of the program is pretty simple:
  1. Generate each of the possible rows (find all the permutations of the list [0, 1, 2, 3, 4]; see the sketch after this list)
  2. Divide up the problem into n pieces (simple because of the structure of each configuration)
  3. Send each range to a worker processor, along with the permutations generated earlier. Each worker processor walks its range, converting each trial index into a configuration and testing it. If it's the right configuration, push it to main memory with a DMA and notify the other processors that they should stop; if not, try the next one
  4. Print out the solution in main memory
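Step 1 can be done with a simple recursive swap-based generator, something like this sketch (the names are hypothetical):

#define N 5

static int perms[120][N];  /* all 5! = 120 permutations of 0..4 */
static int permCount = 0;

static void permute(int *a, int k)
{
    if (k == N) {  /* one full permutation has been built; record it */
        for (int i = 0; i < N; ++i)
            perms[permCount][i] = a[i];
        ++permCount;
        return;
    }
    for (int i = k; i < N; ++i) {
        int t = a[k]; a[k] = a[i]; a[i] = t;  /* choose element i for slot k */
        permute(a, k + 1);
        t = a[k]; a[k] = a[i]; a[i] = t;      /* undo the swap */
    }
}

/* Usage: int row[N] = {0, 1, 2, 3, 4}; permute(row, 0); => permCount == 120 */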
Programming on the Cell was fairly uneventful, but here are some interesting/frustrating points that I came across:
  • As described earlier, the problem size is about 25 billion, which is more than a 32-bit int can hold. Therefore, keeping track of each state has to be done in a 64-bit int (a long long)
  • Instead of converting an index into base 120 for every test, I kept the number in base 120, and implemented increment and test-less-than functions for base-120 numbers in software (see the base-120 sketch after this list)
  • DMAs can only be done in sizes divisible by 16 bytes, and only to addresses divisible by 16. This was super annoying in many cases. For example, when the solution was found, I wanted to DMA the 5 numbers of the correct configuration back to main memory. The DMA, however, had to be 32 bytes long, because I was using 4-byte ints: 5 ints is 20 bytes, and the next multiple of 16 is 32. This means that the local representation of the 5 numbers has to be at least 8 ints long, because, otherwise, the DMA might access memory that you haven't claimed as your own, and generate a segmentation fault (see the padded struct after this list)
  • DMAing structs from main memory to a worker processor's memory assumes that the internal representation of the struct is laid out the same on both sides. It isn't necessarily the same, though, if you use different versions of compilers for your SPU and PPU code.
  • SPE to SPE notification is done in a kind of interesting way: as a (high-priority) DMA to a memory-mapped section of host memory. The idea is that there is a section of main memory that's memory-mapped to each SPE's local memory. When you want to notify another processor of something, you DMA into the section of main memory that's mapped to the SPE you want to notify. That SPE declares a volatile variable that lives at the location of memory just written to by the DMA, and the C program reads from this variable every tick of an inner loop (polling). This means that, in order to send a notification to a program on another SPE, you have to know not only the location where that SPE's memory is mapped in main memory, but also the memory location inside the other SPE's executing code to write into. If the program reading the notification and the program writing it are the same executable, this second piece of information is free, because variables live at the same addresses no matter which SPU they're running on (the binaries are exactly the same, so they're laid out in memory the same, and the programs start at the same location in memory, because the SPU program loader loads all programs at the same starting address)
  • From the PPU's point of view, SPU programs run inside "contexts." When you want to run a program, you create a context, load a "SPU program object thingy" into the context, and tell the context to run. Every context has a memory-mapped range of memory that is mapped into wherever the SPU program is executing; loading the SPU program actually does a memcpy of the program binary into that memory-mapped space. This has the effect that you can have more contexts than SPUs, and SPUs can context switch between these contexts. Such a context switch has to re-set up the memory mapping so that the correct section of main memory is mapped to the correct SPE when the executing program changes. This also lets your program migrate from SPE to SPE as load changes.
  • Waiting for a DMA to complete is a two-step process:
    1. Write 1s into the bit positions for the DMA tags you want to wait for into a hardware register
    2. Wait, using this hardware register
    However, SPUs can be interrupted between these two steps, and the value of that register is not restored on a context switch back. This means that these two steps have to be surrounded by a critical section, via the calls mfc_begin_critical_section() and mfc_end_critical_section() (see the wait sketch after this list)
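The base-120 arithmetic mentioned above is short. A sketch, with digits stored least significant first (the names are mine):

#define DIGITS 5
#define BASE   120

/* Increment a base-120 number in place. */
void inc120(int d[DIGITS])
{
    for (int i = 0; i < DIGITS; ++i) {
        if (++d[i] < BASE)
            return;  /* no carry: done */
        d[i] = 0;    /* carry into the next digit */
    }
}

/* Nonzero if a < b, comparing from the most significant digit down. */
int less120(const int a[DIGITS], const int b[DIGITS])
{
    for (int i = DIGITS - 1; i >= 0; --i)
        if (a[i] != b[i])
            return a[i] < b[i];
    return 0;  /* equal */
}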
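The 32-byte workaround from the alignment bullet looks roughly like this (the aligned attribute is standard GCC/spu-gcc syntax; the struct name is mine):

/* The answer is 5 ints (20 bytes), but DMA sizes must be multiples of 16
 * bytes and buffers 16-byte aligned, so pad up to 8 ints (32 bytes). */
typedef struct __attribute__((aligned(16))) {
    int rows[5];
    int padding[3];
} solution_t;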
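And the guarded two-step DMA wait looks something like this sketch, using the standard spu_mfcio.h intrinsics:

#include <spu_mfcio.h>

void waitForDma(unsigned int tag)
{
    mfc_begin_critical_section();  /* keep the two steps atomic across... */
    mfc_write_tag_mask(1 << tag);  /* 1. select the tag(s) to wait on */
    mfc_read_tag_status_all();     /* 2. block until those DMAs complete */
    mfc_end_critical_section();    /* ...an SPU context switch */
}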
Anyway, after the program was written, it solved the problem in about 23 minutes.

litherum@takashi:~/einstein$ time ./einstein
Found solution:
House
0 1 2 3 4
---------
3 4 0 1 2
3 2 0 4 1
4 2 1 0 3
1 2 0 4 3
2 3 1 4 0


real    23m8.818s
user    0m0.002s
sys     0m0.015s
litherum@takashi:~/einstein$

It may be possible to vectorize the solution, but the implementation isn't immediately obvious. Because the problem has no natural vector structure, I would have to run 4 tests at once using the vector registers. However, the tests don't map onto vector registers very well, and I'm not sure how feasible this is, so I decided not to pursue it.

Here is my code.