Wednesday, October 20, 2010

Brute Forcing Einstein's Riddle with CUDA

The other day, I wrote a blog post describing how I brute forced Einstein's riddle with the Cell Broadband Engine. Being the geek that I am, I decided to see how the program would run on NVidia CUDA.

Here are some things I learned:

  • Programming is much easier in CUDA than on the Cell. You don't have to worry about memory alignment, you don't have to worry about DMA transfers, you don't have to worry about inter-SPU communication, you don't have to worry about waiting for asynchronous calls, or anything like that. All you do is a CUDA memcpy, and it just works.
  • NVidia's extension to the C language (namely, the <<<>>> syntax), seems a little inelegant to me. I like the Cell's build process better, it seems conceptually simpler, even though there are more steps.
  • Dealing with 2-D arrays is a little tricky in CUDA, because a pitch is involved. The reason a pitch is necessary is that memory is stored in banks. In a typical application, each thread will process a single row of a 2-D array. Now, if the length of the bank and the width of the array are not divisible easily, there can be many threads trying to access the same bank, which is bad. The solution, therefore, is to waste space, and pad empty space at the end of every row, so that each row exists in a different bank. This makes pointer arithmetic a little tricky, because, in order to access row i and column j, you have to take the pointer to the 2-d array, cast it as a char (because the unit of the pitch is in bytes), add i*pitch bytes to the pointer (which makes the pointer point to the correct row), cast it back to its original type, then add j to it. Ugliness, but it makes the program run faster.
  • I was having a problem where my kernels were failing to execute. It turns out it was because my blocksize was too large for the number of registers each thread was using. The idea is that a single multiprocessor has a pool of registers to use. Every thread in a block executes on the same multiprocessor. That means that the number of registers used is the number of threads in the block times the number of registers that each thread used. This multiplication must yield a number less than 8192, which is how many registers each multiprocessor has on my card. Fortunately, the number of threads in the block is adjustable (the number of registers the compiler uses is not adjustable).
  • The size of the grid you pass to CUDA can only be as big as 2^16-1. The maximum number of threads in the grid can be found above. This means that there is a maximum number of threads that can execute at a time. This number is several hundred times smaller than the size of the problem space that I was exploring. The solution to this is to call the kernel in a for loop, and pass an offset to all the threads in a particular kernel invocation. Each thread will then calculate its own unique index, add that to the offset, and use this sum as it's true index. That way, chunks of the problem space can be searched in series. This can be a little tricky, though, because CUDA threads don't support 64-bit ints, which are necessary to hold the entire problem space. This means that I had to pass the offset to the kernel in base 120 (as 5 ints), have each thread calculate its own local index, convert that to base 120 (so the index is now in a form of 5 ints), and perform a software addition of the two base-120 numbers.
  • The way CUDA works is that you write your kernel as if it is running in a single thread, using scalars. The CUDA compiler then figures out a way to run many copies of your program on a single vector processor. For example, say you have the code "a = b + c" in your program. Now, let's say that you have 4 threads executing this program simultaneously, and they are all executing the addition step at once. Let's call thread 1's a a1, thread 2's a a2, etc. If you have vectors of [b1, b2, b3, b4], and [c1, c2, c3, c4], you can add these and get [a1, a2, a3, a4] in a single vector add, rather than 4 scalar ads. Anyway, the compiler figures out a way to vectorize your program. When a kernel gets invoked, It schedules different blocks to run on the different multiprocessors. Inside a single block, the CUDA runtime executes threads in warps. This means that it executes threads 0-31 at the same time, using the vector processor, and using a first batch of registers (let's call them registers 0-39). After this warp has executed for some amount of time, that warp stops running and the runtime fetches the next warp, and executes threads 32-63 at the same time, using the same vector processor, using a second batch of registers (registers 40-59). It keeps executing warps this way until the block is done executing. The CUDA runtime is free to schedule some blocks to run after others have entirely completed, so there may be more blocks to be run than there are multiprocessors. If two threads in the same warp actually don't execute the same instruction at the same time (i.e. an "if" statement caused two different threads to execute different blocks), the CUDA runtime has no choice but execute both codepaths, and throw away the results for the wrong codepath for each group of threads. This is usually done with a "Select Bits" assembly command.
After this is all said and done, it turns out that the CUDA program is about 3.4 times faster than the Cell program. I attribute this to the fact that the problem is an embarrassingly parallel problem, and CUDA can execute multiple threads at the same time, if those threads follow the same codepath. I wonder which kind of problem has enough dynamic branching (or a problem that just isn't embarrassingly parallel) where it would beat CUDA in performance.
litherum@vincent:~/einstein$ time ./einstein
Found solution:
0 1 2 3 4
3 4 0 1 2
3 2 0 4 1
4 2 1 0 3
1 2 0 4 3
2 3 1 4 0

real    6m44.386s
user    6m43.105s
sys     0m1.132s
Here is my code.

Sunday, October 17, 2010

Brute Forcing Einstein’s Riddle

The other day, my girlfriend stumbled upon this riddle. It is a riddle created by Einstein, and he postulated that 98% of the world’s population would be able to solve it. The riddle is like one of those riddles you solve in third grade about matching 5 people and their favorite juices, or whatever. The setup of the riddle is fairly simple: There are five houses in a row. Each of these houses is a different color, and has an owner who has a different nationality, a different kind of beverage, a different brand of cigar, and keeps a different pet. This means there are 5 different houses and 5 different criteria to describe each house. The problem gives 15 statements that are true about the correct configuration.

Being the computer scientist that I am, I thought about how to get a computer to solve this for me. Brute forcing the solution seemed fairly simple, so I started there. If you don’t know what brute forcing a solution is, it’s a computer program that tries every possible input configuration to a problem until it finds the right configuration. Such a program can only work on a problem where the space of configurations is not to big (which is why brute-force attacks on common encryption doesn’t work: the key space is simply too big!)

Let’s get a feel for the problem space. If there was only one house, this house would have 5 different choices for color, 5 different choices for nationality, 5 different choices for beverage, etc., making the number of possibilities 5*5*5*5*5, or 5^5 possible configurations. Now, if a second house were to exist, that house would have only 4 choices for color, because it can’t be the same color as the first house, whatever color that happens to be. By the same logic, the second house only has 4 choices for nationality, beverage, etc. This means that, for each configuration of the first house, the second house has 4*4*4*4*4 different configurations, for a total of 5^5*4^5 configurations with 2 houses. A third house would only have 3 choices for each dimension, so this brings the total number of configurations up to 5^5*4^5*3^5. Continuing the pattern, there are (5!)^5 different valid house configurations. This comes out to 24,883,200,000 configurations, or about 24 billion. That’s actually not that many configurations for a computer!

So, how shall we represent each of these configurations? This seems fairly simple enough. A particular configuration can be represented as a table of houses and assignments to values. An example would be the following:

House 1House 2House 3House 4House 5
This configuration contains 25 numbers, and each number can be represented in a single byte, so this configuration can be represented as 25 bytes. However, given a configuration like this, it’s not immediately obvious what the “next” representation to test is. Also, I realized that between similar states, there is a lot of repeated information.

This led me to realize that the number of configurations for each row is fairly small: there are only 5! possible different rows that are valid (because none of the numbers can be repeated in a single row.) Why not represent each row by a number from 1 to 5!, and use this number as an index into a constant table of all the possible different rows? That would decrease the size of the configuration down to 5 bytes (a single byte can hold values up to 255, so 1-120 will surely fit). Also, the “next” configuration becomes very apparent: a configuration of [53, 92, 108, 7, 14] would have a next configuration of [53, 92, 108, 7, 15]. Thus, enumerating all the configurations is simply counting from 0 to 120^5-1 in base 120. This provides a pretty simple way to count the configurations of the problem: the i'th configuration can be found by simply converting i to base 120.

Testing a particular configuration is fairly straightforward. Because of the structure of the 15 clues, a function that finds the house number where property x is set to y would be useful. Another function that finds the value of property z in house w would also be useful. Each of the tests can be formulated as combinations of those two functions.

Then, I decided to run this on the Cell Broadband Engine, as found in the PlayStation 3. Last year I wrote a ray tracer on the Cell, but I felt that a revisit would be insightful. The structure of the program is pretty simple:
  1. Generate each of the possible rows (find all the permutations of the list [0, 1, 2, 3, 4]
  2. Divide up the problem into n pieces (simple because of the structure of each configuration)
  3. Send each range to each worker processor, along with the permutations generated earlier. Each worker processor goes through the range, converting each trial into a configuration and tests it. If it's the right configuration, push it to main memory in a DMA, and notify the other processors that they should stop. If it's not the right configuration, try the next one
  4. Print out the solution in main memory
Programming on the Cell was fairly uneventful, but here are some interesting/frustrating points that I came across:
  • As described earlier, the problem size is over 4 billion. Therefore, keeping track of each state has to be done in a 64-bit int (a long long)
  • Instead of converting an index into base 120 for every test, I kept the number in base 120, and implemented an increment and test-less-than function in software for base 120 numbers
  • DMAs can only be done in sizes divisible by 16 bytes, and only done to addresses divisible by 16. This was super annoying in many cases. For example, when the solution was found, I wanted to DMA the 5 numbers in the correct configuration back to main memory. The DMA, however, had to be 32 bytes long, though, because I was using 4 byte ints. This means that the local representation of the 5 numbers has to be at least 8 ints long, because, otherwise, the DMA might access memory that you haven't claimed as your own, and generate a segmentation fault.
  • DMAing structs from main memory to a worker processor's memory assumes that the internal representation of the struct is layed out the same. It isn't necessarily the same though, if you use different versions of compilers for your SPU and PPU code.
  • SPE to SPE notification is done in a kind of interesting way: it's done as a (high-priority) DMA to a memory-mapped section of host memory. The idea is that there is a section of main memory that's memory-mapped to each SPE's local memory. When you want to notify another processor of something, a DMA is done into the section of main memory that's mapped to the SPE you want to notify. That SPE then sets up a volatile variable that lives at the location of memory just written to by the DMA, and the C program can read from this variable every tick of an inner program loop (polling). This means that, in order to send a notification to a process on another SPE, you have to know the location where that SPE's memory is mapped to in main memory, but also the memory location inside of that other SPE's executing code to write into. If the executable that is doing the notification reading and notification writing are the same executable, then this second piece of information is free, because variables live in the same memory addresses no matter which SPU they're running on (because the binaries are exactly the same, so they're laid out in memory the same, and the programs start at the same location in memory, because the SPU program loader loads all programs into the same starting address)
  • From the PPU's point of view, SPU programs run inside "contexts." When you want to run a program, you create a context, load a "SPU program object thingy" into the context, and tell the context to run. Now, every context has a memory-mapped range of memory that is mapped into wherever the SPU program is executing. Loading the SPU program actually does a memcpy of the program binary into that memory mapped space. This has the effect that you can have more contexts than SPUs, and SPUs can context switch between these contexts. Also, this context switch has to re-set up memory mapping so that the correct section of main memory is mapped to the correct SPE when the executing program changes. Also, this lets your program migrate around from SPE to SPE as load changes.
  • Waiting for a DMA to complete is a two-step process:
    1. Write 1s into the bit positions for the DMA tags you want to wait for into a hardware register
    2. Wait, using this hardware register
    However, SPUs can be interrupted between these two steps, and the value of that register is not restored on a context switch back. This means that these two steps have to be surrounded with a critical section block, via the calls mfc_begin_critical_section() and mfc_end_critical_section()
Anyway, after the program was written, it solves the problem in about 23 minutes.

litherum@takashi:~/einstein$ time ./einstein
Found solution:
0 1 2 3 4
3 4 0 1 2
3 2 0 4 1
4 2 1 0 3
1 2 0 4 3
2 3 1 4 0

real    23m8.818s
user    0m0.002s
sys     0m0.015s

It may be possible to vectorize the solution, but the implementation for this isn't immediately obvious. Because there are no strict vectors, I would have to run 4 tests at once using the vector registers. However, the tests don't fit into vector registers very well, and I'm not sure how feasible this is. I decided not to pursue it.

Here is my code.

Wednesday, August 18, 2010

Prolog Use

It is no news that I have just taken a new job. This job has giving me a pretty nice setup (better than my setup at home, at any rate): (2) 24" monitors, next to each other. Everyone who has set up 2 monitors knows the potential problem of setting them up: making the computer realize which monitor is the left or the right can be annoying to fix. Of course, you could just reverse the order of the monitor connections into the computer, but that doesn't work if the two monitors are set up differently (maybe one is sitting vertically instead of horizontally) or if the two monitors are fundamentally different pieces of hardware. I was thinking: Wouldn't it be nice if the monitors themselves would automatically realize where they were in relation to each other?

This got me thinking about the problem. I felt that the most straightforward way for such a system to be set up would be if each monitor had a wireless transmitter/receiver inside of it. Not only that, but if each monitor had two directional antennas, one directed to each side of the monitor (assuming none of the monitors are on top of the others), then each monitor would know the set of other monitors that are on each side of it. However, finding the order (and therefore placement) of the monitors given this information is no easy task.

Or is it? This is a constraint satisfaction problem - the kind that Prolog is really good at solving. (Haskell, too, for what it's worth, but Prolog is a little more suited for this problem). Essentially, there are a set of possible scenarios: all the permutations of orders of the monitors. There are also a set of rules (or predicates) which must be true. Not only are there these predicates, but they all have a specific form: Monitor X is to the left of Monitor Y, and Monitor Y is to the right of Monitor X. Prolog has really good support for predicates, and predicates can even have arguments to them, so the similar form of all of these predicates can be captured in one higher-level predicate.

So, I wrote a prolog program that would solve one particular configuration of the problem. Doing all the actual logic was actually really easy and straightforward: the rightOf() predicate is simply 2 lines of simple recursion. Because of the architecture of Prolog and the fact that free variables can be set or not-set at will, a rightOf predicate is unnecessary: simply switch the free/bound argument to leftOf and it becomes rightOf.

The difficult part was actually getting the program to compile and run just the way I wanted it to. It would be nice if I could compile the Prolog program and tell the compiler to run a specific predicate in the executable (and not run the prolog repl). It turns out that this is much more difficult that you would expect, mostly because the two prolog compilers (gprolog and swipl) act very differently. The gprolog documentation (which sucks, by the way, it took me forever to figure out how to use a directive) said that in order to make the compiled program not run the repl, I needed to specify the --no-top-level flag to the compiler. This flag, however, makes the program do absolutely nothing, as no query gets run. The fix is to include a compiler directive to tell gprolog to run a specific predicate as soon as it loads the file (and, as this predicate is the only thing that prints information, is just as good of a time to run it as any other time). This works just fine.

However, swipl does things very differently. First of all, there is a command line flag you can pass to the compiler and it will compile the program to load in the prolog program, execute a specific directive, and then quit, which is exactly what I want. However, my program still has that compiler directive that makes it run the main predicate as soon as the file is loaded, so using this command line option makes the program run twice - not what I want. Not including this command line option, however, makes the compiler compile in the top level repl, which I want to skip. To get around this, I used the command line argument, but instead of pointing it to the predicate which runs my program, I pointed it to the predicate of "fail". This makes the compilation occur correctly in both compilers. This also has the side effect of the compiler actually running the program, because the compiler loads the program, and the directive tells Prolog to run the program when it's loaded... so compiling the program itself runs it.

Another difficulty was making prolog print out all the possible solutions to the satisfaction constraint problem, instead of just one solution. Doing that actually requires knowledge about how the internal Prolog algorithm works. Prolog will search the space it needs to search, and if it finds a successful state in the space, it will consider that predicate satisfied and stop searching. You (the programmer) can't change that. So, you have to have the predicate find a solution, print it out, and then fail. This makes the algorithm keep searching through the whole tree, because it never successfully satisfies the predicate. What you have, in the end, is a predicate that prints out all the correct answers to your problem, and then fails (because you can't have it fail during the search and then make the overall search successful). It seems like this is the standard architecture for the top-level predicate of your program... Perhaps this is standard practice to turn your program into these top-level predicates that are supposed to fail and then the logic predicates that have semantic meaning. Because of the setup described earlier, Prolog prints out a warning that the directive the compiler ran to generate the file failed. It's a harmless warning, but annoying nonetheless. There is a way to get rid of this, however: after your main predicate, have another predicate after the first main predicate just saying "main.". This makes main always return true (after doing the search), so the predicate the compiler directive calls is successful, and there is no warning.

The benefit of using this Prolog approach is that all of the data for this specific program is contained in one predicate. Everything else in the program is a constant (from one problem of this type to the next).

The system I'm envisioning is a shell script of some kind. It would first call an executable written in C of some program to actually do the device interfacing to simply read in the data from all the monitors. Then a Perl script (or something) could generate the Prolog program. This would be simple for the point above, where all the changing code in the prolog program is contained within one predicate. Then the script would compile the Prolog program and then run it. The operating system could then read in the output of the script (which is just the output of the Prolog program) and set up the order of the monitors accordingly. If there are 0 lines of output, there are conflicting results, and the user will have to configure the location of their monitors by hand. If there is one result, it should be chosen. If there are more than one result, a menu could be displayed with all of the possible setups.

There aren't very many domains where I've seen Prolog be a suitable candidate to solve a real problem. This is interesting, perhaps because (I believe) it is on.

My code can be found here.

Monday, June 14, 2010

Mountain View Pictures

I recently took a job a Google, Inc., headquartered in Mountain View. Last week, I took a trip out there to find an apartment to live in that was close enough to the Googleplex that I could bike to work most days. I found a fantastic place, and here are the pictures I took during the trip:

Mountain View Apartment Hunt

Hot Plate Simulation in Haskell

I have recently become interested in thermodynamics. Mind you – not really intense gigantic equation thermodynamics. I am really only interested in the basics. It isn’t even really thermodynamics – it’s more like solving some fairly simple differential equations with computers.

As such, I decided I would try to model one of the simplest thermodynamic systems: a hot plate in a room. The problem is incredibly simple: There is a room of finite size and low temperature. At one end of the room is a hot plate that has constant high temperature. The idea is to model the flow of heat throughout the system. For this example, we can assume the room is 2-dimensional.

The most straightforward thing to do is obviously to divide up the room into a grid of small squares. If these squares are small, it can be assumed that the temperature across the area of the square is constant, so each square can have a single temperature associated with it. The problem is then reduced to trying to figure out how the temperatures of these little squares changes over time.

The broad thermodynamic solution is fairly simple. The idea is that the rate of change of a particular point is proportional to the temperature difference between that particular point and the neighboring points. This makes intuitive sense – something really really cold heats up faster than something room temperature. As we are dealing with 2-dimensional squares, the neighboring squares can be the ones above, below, and on either side of a given square. Finding a single “surrounding” temperature can be done by simply taking the average of these four adjacent squares.

The formula can then be written dT/dt = alpha*deltaT. deltaT can therefore be written as (Tup + Tdown + Tright + Tleft)/4 – Tlocal. (little t is time, big T is temperature). This is a differential equation, and may or may not be able to be solved with analytical methods. I have in front of me, however, a computer, so perhaps I could use it to try to solve this problem. In a computer simulation, time can be simulated by computation time, so each dt can be approximated by each iteration of an outer loop. Performing the calculation on the right side creates the rate of change of the temperature. This must be multiplied by that dt to create the temperature delta that needs to be updated. If we assume that dt is a constant for each loop iteration, we can assume that it has the value of 1, because using any other value of dt will simply scale the time dimension of the final result. Therefore, dividing by one can be neglected, and the equation can simply be reduced to temperature += alpha*deltaT. This calculation must be performed for each point in the space, or for each little square.

Because this calculation must be performed for each of the little squares, it can be done in parallel (as there is no dependency on the squares). There is a trick, however. The temperature of neighboring squares is used in the calculation of the temperature of the current square. This means that the output of the calculation is the same type and semantics as the input. It would be easy to straightforwardly update the current temperature inside a thread. However, the neighboring squares may or may not have been calculated. If they have been calculated already, they would have used the old value for the current temperature. If they have yet to be calculated, they will use the new value for the current temperature, because you just overwrote the old one. The way to get around this is to use two buffers, an input buffer and an output buffer (or a read buffer and a write buffer; call them what you like). Reading in the values from the read buffer and writing to another buffer means that there won’t be any inconsistency in the system. After doing the computation for each of the little squares, simply swap the buffers for the next iteration.

I decided to write the program in Haskell, because it is my favorite language. The program requires an iteration over all the little squares in the requisite area, as discussed before. For each one of those little squares, the neighbors need to be accessed to find the surrounding temperature of the current square. Haskell’s lists are linked lists, however, so lookup is not O(1). The iteration could be done by using 5 iterators: one for the current square and four for the surrounding squares. This would be rather complicated to do, so I elected instead to use Haskell’s Vector library. Vectors have O(1) lookup time, so I could simply just query for the neighboring pixels.

Coding up the iteration was fairly simple. More difficult, though, was displaying the result. Viewing the temperatures as text is pretty visually unappealing, so I decided I would try again to use libGD. This turned out to be another headache (see this for previous headaches). Since that previous post, a Snow Leopard release of GHC has been released in the Ports library, so I have migrated to that build. I used Cabal to install Graphics.GD without any hitch. However, the library didn’t appear to work very well. I tested out the library in GHCi by creating an image, writing to it, and saving it to disk. However, GHCi segfaulted whenever an image fell out of scope. I’m assuming that this has to do with a bug in the image closing code that the garbage collector would invoke. I found that the code would work just fine if I used a text file and invoked runghc on it, however, so I decided to just do that instead of use the interactive mode. However, only about half of the drawing commands appeared to be showing up with this solution. Out of desperation, I tried to compile the program into an executable with ghc –make. This appeared to solve the problem completely, and all the drawing commands appeared in the resulting image. This solution works, but it is a bit of a hack. Oh well – it appears the GHC maintainers aren’t so concerned with certain specific libraries.

After running the program, I observed something quite interesting. It has to do with the walls of the area I am interested in. Finding the neighboring squares is pretty straightforward when you’re on the border of the arena: simply disregard the pixels that are beyond the boundary. This means that the denominator in the average calculation found above isn’t always a “4.” It is the number of neighboring squares that are within the bounds of the area in question. This led to a very straightforward looking heat transfer from one side of the area to the other. More interestingly, I tried hardcoding this denominator to 4 to see what would happen. Functionally, this means that all four surrounding squares are inspected, but the ones outside the area in question have a temperature of 0. This actually led to a very interesting looking heat transfer pattern – a sort of rounding of the border away from the edges of the area.

I also generated a couple thousand frames and found that after a while, the temperature doesn’t appear to be changing much if the area in question is large enough. This makes me wonder about convergence of the equation. It is true that the smaller the temperature difference, the slower the temperature change. This means that the temperature changes less and less as time goes on. However, after generating a few thousand frames, the temperature doesn’t seem to be changing at all. Intuitively, I know that the stable point of the system is when the entire area is the same temperature as the hot plate. I wonder if this is the kind of system where the analog, mathematical formula does not converge (outside of the stable point) but the digital, binary approximation does converge to something that is not the stable point. Food for thought.

I’m hoping to code this same program again in C and in CUDA to compare runtimes. CUDA should be an interesting build, as most of this problem is not computation time but communication time. Communicating between blocks in CUDA is an interesting problem, and I’m curious as to how to do it in the most optimal way. More on that later!

My Haskell code can be viewed here.

Tuesday, February 23, 2010

Bead Sort, Revisited

I have been implementing Bead Sort on a variety of different hardware in order to get a feel for programming on each of these different platforms. I haven’t actually spent much thought about the actual algorithm, until recently. I realized after thinking for only a couple minutes that I was going about the whole thing very, very incorrectly.

The way I was doing this before was to line up each of the beads in a giant 2-D array in one stage, make the beads fall in another stage, then manipulate the new array into telling me the sorted numbers in a third stage. I realized that I could combine the first two stages, speed up the entire program, and cut down on the amount of code that need exist.

Instead of setting up a 2-D array of booleans ahead of time, and then making them all fall, I could simply make them fall as I put them on the columns. This decreases the code size drastically. The algorithm for the first two stages now simply goes as such:
  1. Assign an index number to each column, from 0 to n.
  2. Give each column a single int counter variable, which will refer to the number of beads that are stacked up on that pole
  3. Have each column go through the numbers to be sorted, and, if the number to be sorted is greater than the index for the current column, increment the column’s count. Note: This can be done in a very tricky way by saying “count += num > index”.
  4. Have each column output it’s count to an output array, in the index of its own column index. Call this output array the “counts” array.
That’s it! Much, much simpler code. What’s more, it decreases the complexity from O(m*n) to O(n), where n is the number of numbers to sort, and m is the maximum size of the numbers to sort.
Now, you have that nice little triangle of sorted numbers. How should one go about reading them off the columns? Well, here’s the “aha” moment. The first number in the sorted array is the number of columns with beads in the first position of the column. How many columns have this property? Well, all the columns with counts greater than 0 have this property. Okay, now, the second number in the sorted array is the number of columns with breads in the second position of the column. How many columns have this property? Well, all the columns with counts greater than 1. The next number in the sorted array is the number of columns with counts greater than 2.

Now, this is the exact same logic as used in the first phase of the bead sort algorithm! Now, each column goes through the counts array, and if the value in the counts array is greater than the column index, that column’s count gets incremented. The only difference is now, the values of m and n are switched. On this second pass, each column is going through the count array, which is of size m (the maximum size of a number to sort). The new output array will be of size n (it has to be, because n is the number of numbers to sort, and the new output array is the sorted numbers).

The real interesting part of this algorithm is that you can simply run the parallel part of the algorithm twice, with different arguments. This leads to very small, clean, readable code. It also turns the runtime of the algorithm from O(m*n) to O(max(m, n)).

My cleanest implementation of this code is probably written in CUDA, and can be found here.

Memory Leaks in my 2-D Perlin Noise Generator

As noted previously, I have been in the process of optimizing my 2-Dimensional Perlin Noise generator. As of my last post, the noise generator had been optimized for time and not for space. I was hoping to see if I could optimize this time for space.

My first crack at the problem involved blindly assuming that Haskell was creating a gigantic thunk for the entire computation, and then executing the thunk. The logic was that a gigantic thunk would take up a gigantic amount of memory. Forcing Haskell to evaluate this thunk early would get rid of the thunk and would solve all my problems. So, as dictated by Real World Haskell, I went about using the seq function to force Haskell to evaluate my thunks as they were being created.

This didn’t actually turn out to solve my problem. The memory signature didn’t go down at all. After re-reading the relevant chapter in Real World Haskell, I found a paragraph that I had overlooked before. This paragraph stated that the seq command only evaluates down to the first constructor. This means that the seq command will evaluate 3 + 4 and get 7, but will not evaluate elements in a list. This is because the (:) operator is technically a constructor (The list type ([]) is defined as a : List | Empty). Therefore, calling seq on a list will not actually do anything – it will evaluate down to the first constructor, which is the first (:) in the list itself. Therfore, the strategy of using the seq command won’t work at all.

Then I took a step back. I had assumed that the problem was caused by a giant thunk; what if it wasn’t? I had no evidence to support either claim. Wouldn’t it be nice if Haskell had a profiling tool to tell me exactly where the memory leak is? I tweeted about the apparent lack of a profiling tool, and was responded to very quickly by a twitter-er named donsbot. He pointed me to the relevant chapter in Real World Haskell on profiling. As it turns out, Haskell has some very advanced tools for system profiling!

Later, I did some searching on donsbot, and I found that it referred to someone who goes by the handle dons, or Don S. A little bit later in the search, I found that this guy is a co-author of my very own Haskell book, Real World Haskell! I felt incredibly geeked out by learning this information – I blindly tweeted and the co-author of the book I’ve been using responded! I immediately started following him on twitter and that has led me to check out some of the really interesting work he’s doing, currently on the Vector libraries. He, too, is on Stack Overflow, and has some very thoughtful and inspiring answers on that site.

Anyway, after running some of these profiling tools, I found that almost all my memory was being put into lists of doubles. After looking through my code, I immediately found the main problem.

A Perlin Noise generator works by splitting the image into multiple levels. Each level is furthermore split into a grid of blocks, with keypoints pre-generated at the corners of each block. In my previous implementation, I was iterating through each level, generating each pixel in that level. I was then sorting the pixels to put them in the same order for each level, and using zip to combine all the pixels for a specific position into the final value. The problem was in the sorting step. In order to sort a list, the entire list has to be memory resident. Therefore, all the pixels in all the levels had to be memory resident.

The best way I came up with to solve this problem is to think of Haskell’s execution model as a streaming framework. If I wrote the program correctly, Haskell will be able to generate the value in a specific pixel, write it out to the image, have the garbage collector collect that pixel, and move on to the next pixel. This means that each pixel has to be independent of each other pixel. The way I could get around sorting the pixels is to always generate the pixels in the same order.

For example, in the old way, I would generate the pixels block by block. This would mean that, for two different levels with two different block structures, the pixels would be generated in these orders:

For a blocksize of 2x2, the pixels would be generated like this:

For a blocksize of 4x4, the pixels would be generated like this:

I had to alter this so that the pixels would be generated the same way, regardless of level. If I did this, Haskell could generate the same pixel in each level, apply the operator I used in zip, get the final pixel, and write it out to the image, without ever having to generate any more pixels. This turns the entire program into what can be thought of as a stream – it streams each pixel, one by one.

Modifying the program to do this was fairly straightforward. Loop through the X values, if you come across a new block, generate a new Hermite vector, else, use the existing Hermite vector. If you come to the end of a line, go to the next line. If the next line is in a new block, update the current line and block keypoints. Then, generate the Hermite vector for this new block. This function is even tail-recursive.

The new program certainly has a smaller memory footprint. It is still not quite where I would like to see it, however. The current memory graph shows that it doesn’t quite run in linear space. If I had programmed it correctly, it should run in linear space, so clearly I still have at least one unanticipated bug. It appears that a fair amount of memory is being held in the list data structure itself, which means that it is being held in the 'next' pointer for linked lists. Perhaps I could alleviate this problem by using Vectors?

Also, I bit the bullet and made the damn thing use LibGD directly.

Here is the current memory graph for the new program:

My code can be found here.

Thursday, January 28, 2010

More Assorted Closures

As I am learning more and more about functional programming, I am increasingly coming across the idea of closures. Closures deal with the concept of free variables, or variables that are not explicitly defined in the arguments to the function. The idea of a closure is that the binding of free variables in a function stays with that function. It can be thought of as saving the context a specific function runs in with the function itself.

Let's say you have a function with a free variable. Let's call it multbya, and let's define it to take an argument x, and return x * a, where a is a free value.

All languages will recognize a as a free variable. Now, different languages will do different things with this information. Many languages will simply freak out, saying "Oh God Oh God, you referenced a variable that I don't know anything about!" In some situations, this makes sense: Strongly typed languages (e.g. Scala freaks out) guarantee that the multiplication operator will work, but the language cannot guarantee that the multiplication operator is defined for any old symbol which may or may not be defined at runtime. Other languages (e.g. JavaScript doesn't freak out) may simply say "Well, I don't know what a is now, but fortunately I don't have to run the function now. I'm going to hope that a exists at runtime, and if it does, I'll use it. Also: hopefully the multiplication operator will exist for a."

Note that this concept only really applies in a REPL, as opposed to a file. In a file, nothing really is declared before or after anything else, because all the definitions are there in the file. The compiler can make 2 passes to the file, one to find symbols, and the other to define symbols. That way, symbols that are used can be defined after their use in the file. This is what Haskell appears to do.

This general idea, however, is a little different than what happens in a REPL. In Haskell and Prolog, the "program" is really what is referred to as a "knowledge base." The program doesn't flow: there are no variables and nothing changes. You define your program as a tree of transformations which transforms the input to the output. The "knowledge base" is just a list of those transformations. There are no variables - only arguments.

Okay, back to business. Let's say, for longevity's sake, that we define the variable a ahead of time. Let's set it equal to an integer equal to 3. Now, when you define multbya after defining a, all languages should be happy. Now, you can call the function and it should work just great.

Now, let's try to mask the variable a. Let's make another function, testit, with one variable, a. Inside this function, let's call multbya.

Now, in a language without closures, you could think of the body of the function simply being plopped down in place of the function call. Therefore, in a non-closured language, the a inside multbya is going to refer to the highest-level a, which is going to be the argument to testit. Therefore, in a non-closured language, the output of this function is going to be the argument to testit times the argument that testit called multbya with.

In a language with closures, the bindings associated free variables are passed around with the function. In technical language, the function is closed around its free variables. This means that, even though the highest level a is the argument to testit, multbya remembers its own binding to a. So, multbya remember that it was defined with a free variable of a, and that this free variable refers to the global a. This works anywhere, not just with global variables: functions remember what variables are closed under it so that a caller can't mask the callee's variables and change the function's functionality.

Not all languages use closures the same way, however. In our little example, let's pretend that we then change the value of the global a, then re-run testit. Now, this only works in languages where you can change values, or re-bind values.

Now, a language could deal with this in one of a few ways. Either a language lets you change values, or a language doesn't let you change values. Also, a closured language will either save values in closures, it will save references to values, or it will save symbolic references and a scope identifier.

If a language saves values in a closure, then changing the global value of a should have no effect on the output of testit.

If a language saves references in a closure, and if the language lets you change values, then the value returned from testit should change accordingly.

If references are saved, but the language simply re-binds the global name a to a new value, then the output should not be changed, because the reference still refers to the old value. Keeping this reference around makes sure that the garbage collector doesn't collect the value.

A way to get around this is to save a symbolic reference and a scope identifier. This is like saving the idea of "I want a variable a, and I want the one that's in the global namespace." Thus, no matter how references are handled under the hood, the output value is updated to use that variable.

All the languages that I have tried, except Haskell, have used either references or the last idea, because they have updated their output when I change a. As Haskell is purely functional, it might be doing some caching under the hood because values are not supposed to change.

Here is the Scala code I used:
hani:~ litherum$ scala
Welcome to Scala version (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_17).
Type in expressions to have them evaluated.
Type :help for more information.

scala> var a = 3
a: Int = 3

scala> def multbya(x : Int) : Int = x * a
multbya: (Int)Int

scala> def testit(a : Int) : Int = multbya(6)
test: (Int)Int

scala> testit(5)
res0: Int = 18

scala> a = 4
a: Int = 4

scala> testit(5)
res2: Int = 24

Here is the Clojure program I used:

hani:~ litherum$ clj
Clojure 1.1.0
user=> (def a 3)
user=> (defn multbya [x] (* x a))
user=> (defn testit [a] (multbya 6))
user=> (testit 5)
user=> (def a 4)
user=> (testit 5)

Here is the Common Lisp program I used:

hani:~ litherum$ clisp
i i i i i i i ooooo o ooooooo ooooo ooooo
I I I I I I I 8 8 8 8 8 o 8 8
I \ `+' / I 8 8 8 8 8 8
\ `-+-' / 8 8 8 ooooo 8oooo
`-__|__-' 8 8 8 8 8
| 8 o 8 8 o 8 8
------+------ ooooo 8oooooo ooo8ooo ooooo 8

Welcome to GNU CLISP 2.48 (2009-07-28)

Copyright (c) Bruno Haible, Michael Stoll 1992, 1993
Copyright (c) Bruno Haible, Marcus Daniels 1994-1997
Copyright (c) Bruno Haible, Pierpaolo Bernardi, Sam Steingold 1998
Copyright (c) Bruno Haible, Sam Steingold 1999-2000
Copyright (c) Sam Steingold, Bruno Haible 2001-2009

Type :h and hit Enter for context help.

[1]> (setf a 3)
[2]> (defun multbya (x) (* a x))
[3]> (multbya 4)
[4]> (defun testit (a) (multbya 6))
[5]> (testit 5)
[6]> (setf a 4)
[7]> (testit 5)

Here is the JavaScript program I used:

hani:rhino1_7R2 litherum$ java -jar js.jar
Rhino 1.7 release 2 2009 03 22
js> var a = 3
js> function multbya(x) {
> return x * a;
> }
js> function testit(a) {
> return multbya(6);
> }
js> testit(5)
js> a = 4
js> testit(5)

Here is another JavaScript program I used:

hani:v8-read-only litherum$ ./shell
V8 version
> var a = 3
> function multbya(x) { return x * a; }
> function testit(a) { return multbya(6); }
> testit(5)
> a = 4
> testit(5)

Here is the Python program I used:

hani:~ litherum$ python3.1
Python 3.1.1 (r311:74480, Sep 25 2009, 09:54:21)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> a = 3
>>> def multbya(x):
... return a * x
>>> def testit(a):
... return multbya(6)
>>> testit(5)
>>> a = 4
>>> testit(5)

Here is the Haskell program I used:

hani:~ litherum$ ghci
GHCi, version 6.10.4: :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer ... linking ... done.
Loading package base ... linking ... done.
Prelude> let a = 3
Prelude> let multbya x = x * a
Prelude> let testit a = multbya 6
Prelude> testit 5
Prelude> let a = 4
Prelude> testit 5

Friday, January 15, 2010

Google Interview #7

I just came from a particularly challenging interview at Google’s headquarters in Mountain View, California. I forget the name of the interviewer I had for interview # 7 (Edit: I now remember, but I won't recognize him by name), but it was a particularly interesting interview. Before I begin talking about it, I should probably say that I did extremely poorly on this interview, but I did end up learning quite a bit from the interviewer (who, I’m assuming, took pity on me and ended up telling me the answers to his two questions). I’m going to recreate the two questions as best I can here.

Question #1: Suppose you have 10 identical machines, each of which has:
  • 4GB memory
  • 500GB hard drive, with 400GB full of (Apache) log files.
How do you find the top 1 million most commonly visited pages in the log files, with a total computation time of under 24 hours?

This is roughly how the interview went:

Me: Well, because the data is unordered, each machine has to read all of the data on its disk, sequentially. It can then make a hash map where the key is a hash of the URL, and the value is the frequency count. As each computer reads a line, it extracts the URL, hashes it, and looks up the hash in the hash map. If the hash is in the hash map, it extracts the existing associated value, increments it, and stores the value back in the map under the same hash. Otherwise, it stores the hash in the hash map with a frequency of 1.

Interviewer: Okay, how would you then combine the results into one single top 1 million list?

Me: Well, you could combine the data hierarchically, where machines 0 and 1 combine their data, machines 2 and 3 combine their data, and the two combinations are then combined, etc, until one master list is obtained. However, combining the entire hash map would be too big - the last combination would be combining every single entry into a gigantic list. I guess you would then only take the top, say, 2 million entries from each hash map and combine those. That doesn't guarantee correctness, but it's probably good enough.

Interviewer: It's not good enough. Let's say a particular website is #2,000,001 in each of the log files. It is feasible that that website could be the most popular website, if, say, spots #1 - #2,000,000 are a tie. How could you guarantee correctness?

Me: Hrmph. Perhaps you could fit the whole hash map on one machine?

Him: You blow. That wouldn't fit in memory, causing that machine to swap heavily, slowing it down by 2 orders of magnitude, causing you to miss your deadline, and exploding the world.

Me: Yeah, I do blow. Please hire me!

Him: But think about this! It wouldn't fit in 4GB, but sitting in front of you, you have a total of 40GB of memory!

Me: Aha! You could make a distributed hash table. You could partition your hash space into 10 different spaces, and map each space to a different PC.

Him: Something like PC ID = hash(URL) mod 10.

Me: Yeah. Thanks for the hint.

Him: You bet. So how would you combine the results now?

Me: Well, because all the counts for each individual URL are in one place, you just have to combine the hash maps hierarchically, as I described before, except you only need to keep the top 1 million entries in the combination. This should go fairly quickly.

Him: Okay, fair enough. (Looks disappointed) Now, say the nice folks over at Microsoft are able to reverse engineer your algorithm completely. They then hit our website repeatedly for crafted URLs, so that significantly more than 1/n of the input URLs in the log files map to the same machine. Pretend that, now, 100% of the URLs in the log files have the quality where hash(URL) mod 10 == c, where c is a constant. What does this mean?

Me: Well, this is going to overload the hash table on one of the particular computers. It'll make the hash table swap out to disk, slowing down the computer, making us lose our deadline and ending in the world blowing up.

Him: So how would you solve this problem?

Me: You could take a hash of a hash of a hash...

Him: Pretend that 100% of your algorithm is known by the hacker.

Me: You could have a dynamic partitioning system, where, if the hash map on one machine becomes too big, it offloads part of its map to the next machine...

Him: That would be too complicated to program. Also, it would result in you missing your deadline. If you do it that way, you have to have one of the PCs dedicated to keeping track of what keys go where. All the other PCs have to hit this PC for each URL it encounters, introducing a bottleneck in the system, slowing everything down. Also, it would be slow to have two machines handling the mapping of key to machine, because they have to have consistent data, which introduces lots of cross traffic, and the whole thing is basically a giant clusterfuck.

Me: So, you can't change your hashing algorithm, and you can't artificially move your data around .... Seems like your stuck.

Him: WOW. You REALLY blow. Even harder than I thought!

--We stare at each other for a minute--

Him: Have you ever heard of universal hashing?

Me: Nope. By the way, did you know that I blow?

Him: I had gathered. Do you know what a salt is?

Me: No.

Him: A salt is a constant value you append to the end to the input of your hash function. If you choose a salt at random, the folks at Microsoft won't be able to know it (because it's data, not your algorithm). Now, they don't know where the hashes end up. You can then change your salt daily so they don't have time to guess your salt.

Me: Gee, that sure is smart.

Him: You blow.

Me: Sorry.

Question #2: Google has a web crawler. This web crawler periodically crawls the same sites repeatedly to keep track of changes. However, people get mad if Google crawls their website too often (for whatever reason). So, the crawler has to remember the last time it crawled a website, and can then decide whether or not it should crawl again. This means we need some sort of database system that maps URLs to a timestamp, and, for purposes of this interview, a 1-byte enumeration of a site's status. This will be something like {online, 404, 403, timeout, etc}. The crawler is going to want to query the system for a given URL, and we need to return the timestamp and the site status. When the crawler does end up crawling a site, it's going to update the information for a given URL. Also, the crawler may insert data for a new URL. Assume that this system is going to hold information for 200 billion websites, and we allow a one in a million failure rate (giving incorrect information is a failure). How would you design this system? And, how many machines would you need?

Me: Well the most straightforward solution would be to use a traditional database system, like MySQL or something.

Him: You suck. You didn't even ask me the rate of queries! This is the most obvious thing you should have asked me. The query rate is 50,000 queries / second.

Me: So that's too much for a traditional database system.

Him: Yep.

Me: Well, you could use the same system that we described in the last question, where you use a distributed hash map, where the keys are the url, and the value is a tuple of the timestamp and the status. You could use the salt thingy to make sure the data is distributed on the machines evenly.

Him: Okay. Sounds good. How many machines would you need, given the memory constraints of saying each computer has 16GB of usable memory?

Me: Well it's probably more than 1 machine.

Him: You really have no clue, do you?

Me: Yeah, I blow.

Him: Well, think about it this way. How many bits do you need in your hash?

Me: Well, the space of the hash should be big enough to provide a 1 in a million failure rate, so 2^n = 10^6

Him: That would be the space of a hash if you wanted to hash a single element.

Me: (Making this up as I go along) So I guess I multiply the right side by 20 billion?

Him: What would that make, then?

Me: Well,
2^n = 10^6 * 200 * 10^9
2^n = 200 * 10^15
2^n = 2 * 10^17
2^n = 2 * (2^10)^(17/3)
2^n = 2 * (2^10)^6
2^n = 2 * (2^60)
2^n = 2^61
n = 61 bits, rounding up to 64.

Him: 64 is nice and round. So how much space do you need for the timestamp?

Me: Well, Unix timestamps are 64-bits, so that sounds like a good place to start. You could compress that by subtracting the value for the first timestamp recorded. Also, if you don't need such high resolution, you could divide by some constant.

Him: That's fine. We'll just make it 56 bits. We've already said that the status is 1 byte, or 8 bits. This makes a total of how much data per row?

Me: 64 + 56 + 8 = 128

Him: Not quite. Hash maps have internal data structures to handle collisions.

Me: (I actually didn't understand this, as the problem statement said that hash collisions are okay if they're infrequent (1 in a million), but I went with it anyway) Okay, so give each hash a linked list. Therefore, the tuple of (timestamp, status) now becomes a triple of (timestamp, status, nextptr). Give each pointer 32 bits (Thanks, Matt, for pointing out that if the machines have 16GB memory (as described later), this should be 64 bits for a pointer). This brings the total up to 128 + 32 = 160 bits = 20 bytes.

Him: Sounds good so far. How many machines would you need then?

Me: Well, each row has 20 bytes, and there are 200 billion rows, so that's 4TB of memory. If each machine has 16GB of memory, that's 4000 GB / (16 GB / Machine) = 250 machines.

Him: Sounds pretty good. Now, if we distribute the 50,000 queries / second, do you think our 250 machines will be able to handle the load?

Me: Well, that means that each machine handles (50,000 queries / second) / (250 machines) = 200 queries / second / machine. That sounds pretty reasonable.

Him: I agree. Sounds good.

Me: Fantastic.

Edit: I have been offered (and accepted!) this job, even though this interview went atrociously. If you have had a similar terrible interview, don't give up hope!

Here are the pictures i've taken from Google's Campus.

2-Dimensional Perlin Noise Generator, Revisited

The 2-Dimensional Perlin Noise Generator I wrote had a major flaw – it was excessively slow. Most of the computation was due to the function that finds the four bounding keypoints given a particular point to calculate. The function that finds the color for a specific pixel was called in two nested loops. This function needed to find out the four bounding keypoints, so it filtered the list of keypoints a bunch of different ways.

I got to thinking about this and discovered that the bounding box hardly changes between a particular pixel and the next pixel. Because the loop iterates over pixels that are next to each other, I could keep track of the bounding box between calls, and now only need to check to see if the bounding box changes. However, once an entire row of pixels has been generated, the sequence of bounding boxes repeats.

It would be nice to think about the problem backwards and, instead of finding the bounding box that a particular pixel is in, find all the pixels that are in a specific bounding box. That way, I could iterate over the bounding boxes, and just sort the result at the end (to re-order the values into line-by-line order). Iterating over the bounding boxes is easy in two dimensions – split the list of keypoints into a matrix of keypoints, and iterate across the y dimension 2 at a time, and inside that, iterate across the x dimension 2 at a time. If the matrix is implemented as a list of lists, the iteration over y can be done by iterating over the main list, and the iteration over x can be done by iterating over the lists inside the main list. Because this is done iteratively, it can be done quickly. Iterating over the pixels in a block is similarly easy.

Then, I got to thinking about the bicubic interpolation performed at each step. It occurred to me that I was finding a dot product between a vector of coefficients and a Hermite vector. Looking at the code to generate the Hermite vector, I realized that this code only depends on the values of the local bounding box. That is to say, all the values within a single block share a Hermite vector. An exploitation of this fact was easy to calculate given the new structure of the program – a new Hermite vector was calculated inside the iteration over local blocks, and this vector was passed to the interpolation routine for each pixel in the block.

I also noticed that the output images were looking quite blocky, as you would expect from filling in a matrix of blocks of values. I realized, though, that one way to somewhat alleviate this problem would be to actually calculate derivatives. Previously, I was assuming that the derivatives at each keypoint were 0. This means that if successive keypoints have values of 0, 3, 6, 9, and 12, the output would look blocky because I was pretending the derivatives at the keypoints were all 0. This would lead to an output that was flat at each of the keypoints, and grew quickly between keypoints.

Calculating derivatives was a bit tricky. Three derivatives are necessary for each point – the partial x derivative, the partial y derivative, and the cross derivative. Early on, I decided that the cross derivative would be too difficult to determine, so I left that one 0. Thus, all that was left were the two partial derivatives. The partial x derivative for a point P should be calculated as rise/run for the two points immediately adjacent to P in the x dimension. However, if P is the left-most point in the row, P should be used instead of a point immediately to the left of P (because such a point does not exist). The opposite is true for the right-most point. I created a tricky function to do this.

Finding all the partial x derivatives involves turning the list of keypoints into a matrix, then mapping the iteration across each row of the matrix. Finding the partial y derivatives involves exactly the same thing, except done with the transpose of the matrix. Zipping everything back together is a bit tricky, but it isn’t too bad to make sure everything is done correctly.

The resulting image is still fairly blocky, but it appears to be a bit better. According to my calculations, introducing the optimizations increases the speed of the program by more than 1000x. This is more than I ever hoped.

However, there is a problem. Generating an 800x600 image with a depth of 4 levels of keypoints takes 10 gigabytes of memory (Good thing I have 12 in my PC!). I am not quite sure what all that memory is for, but I believe that either:
  • Haskell is building a gigantic thunk, and then evaluating it, or
  • Because the program is not tail recursive, it is holding an intermediate level of 480,000 pixel values at each 480,000 levels of recursion (one for each pixel in 800x600)
I plan to fix this problem next.

Ray Tracer

I have been working for some time on a ray tracer that runs on many different kinds of hardware. As such, I have decided to program it in C, because this seems to be the language that is most widely accessible between devices.

The idea is that a single CPU would generate a list of the primary rays as individual data items. This CPU function would then break up this list into n pieces, where n is a ray-tracing strategy. In this example, one strategy would be to use NVidia CUDA, and another strategy would be to use the CELL SPEs. Each of these strategy functions would then deal with strategy-specific code. Usually, these functions would inspect the current computer to find, for example, the number of CUDA-enabled devices in the system.

This function would then break up the list of individual data items even further, so that each device has an equal amount of data items to process. Usually it would then sets up m pthreads, where m is the number of devices in the system of that particular strategy (number of CUDA-enabled devices). Each of these m threads then runs a device-specific function on its own individual section of the list of data items.

This structure allows one to easily take advantage of all the compute devices in a system, because the system is hierarchically broken up into strategies and further into individual devices.

Because I designed this ray tracer to run with NVidia CUDA, I had to make some changes to the ray tracing code. First of all, NVIDIA CUDA does not support recursion. Ray tracing is, however, fundamentally a recursive process. The color of a ray is determined by the color of the intersected material and lighting, but also by the color of the reflected and refracted ray. In order to eliminate recursion, I had to implement a tree-based ray dependency tree. Any particular ray depends on both the reflected ray and the refracted ray, as determined by intersecting that particular ray with the world. Therefore, the complete computation can be viewed as a tree, where each node is a ray, and the children for a particular node are the reflected and refracted rays, generated by that ray. As I wanted to make the implementation for this tree as lightweight as possible, I chose to use an array, where the root of the tree exists in the array at index 1, the left child of a particular node n exists at index 2*n, and the right child exists at index 2*n+1.

Generating this tree involves running through the array, starting at index 1. For each element, intersect the particular ray with the world to find an intersection point, then use the intersection point to generate the reflected and refracted rays, which are put into index 2*n and 2*n+1, respectively. Using this method, the entire dependency tree can be populated with a simple, non-recursive, for loop. The loop should run across all the elements of the array, in increasing order, but stop halfway through the array – as the body of the for loop generates rays with index 2*n+1, it is important we stop when 2*n+1 is the last element in the array.

Then, actually evaluating the computation involves running through the array backwards, converting each ray to a color. This requires another array of the same size, except this one is filled with output colors. As with any ray tracer, a function exists to find the immediate color of a particular ray by intersecting it with the world and running shading calculation on the intersection point. This function is applied across the last half of the dependency tree array, because the recursion is stopped at the maximum depth of the recursion. The second half of the array is populated with rays that correspond to the maximum depth of the recursion. Then, the first half of the array is processed, going backwards. For any particular ray, its immediate color must be calculated, and then mixed with the color in slot 2*n (the color of the reflected ray) and the color in slot 2*n+1 (the color of the transmitted ray). Once this loop has completed, the final color is the color in index 1 of the color array.

In this model, special attention must be delivered to the rays that do not intersect the world. Because of the nature of the color conversion, each element in the ray dependency tree must be initialized. Therefore, if a ray does not intersect the world, both of that ray’s children can simply be copies of that ray itself. This guarantees that the entire subtree for this ray is populated with rays that do not intersect the world. When converting the rays to colors, if a particular ray does not intersect the world, a default background color must be used.

Note that this is not a particularly optimized ray tracer – spatial partitioning can be used to greatly speed up computation.

Unfortunately, I cannot get the program to run on the graphics card in my server (8800gt) or the graphics card in my laptop (9400m). Calculating the lighting values for a black and white image works, and calculating an unshaded color works, but the one line that multiplies the target color with the shading color makes the entire scene go black. My only guess is that the smaller lower-end graphics cards in my server and laptop just can’t handle the large kernel required to trace rays. That being said, my 2 GTX 295 cards run the kernel just fine. Using all 4 GPUs in the GTX 295s is about 7 times faster than using all 8 (pseudo-)cores in the Core i7.