I have been working for some time on a ray tracer that runs on many different kinds of hardware. As such, I have decided to program it in C, because this seems to be the language that is most widely accessible between devices.
The idea is that a single CPU would generate a list of the primary rays as individual data items. This CPU function would then break up this list into n pieces, where n is a ray-tracing strategy. In this example, one strategy would be to use NVidia CUDA, and another strategy would be to use the CELL SPEs. Each of these strategy functions would then deal with strategy-specific code. Usually, these functions would inspect the current computer to find, for example, the number of CUDA-enabled devices in the system.
This function would then break up the list of individual data items even further, so that each device has an equal amount of data items to process. Usually it would then sets up m pthreads, where m is the number of devices in the system of that particular strategy (number of CUDA-enabled devices). Each of these m threads then runs a device-specific function on its own individual section of the list of data items.
This structure allows one to easily take advantage of all the compute devices in a system, because the system is hierarchically broken up into strategies and further into individual devices.
Because I designed this ray tracer to run with NVidia CUDA, I had to make some changes to the ray tracing code. First of all, NVIDIA CUDA does not support recursion. Ray tracing is, however, fundamentally a recursive process. The color of a ray is determined by the color of the intersected material and lighting, but also by the color of the reflected and refracted ray. In order to eliminate recursion, I had to implement a tree-based ray dependency tree. Any particular ray depends on both the reflected ray and the refracted ray, as determined by intersecting that particular ray with the world. Therefore, the complete computation can be viewed as a tree, where each node is a ray, and the children for a particular node are the reflected and refracted rays, generated by that ray. As I wanted to make the implementation for this tree as lightweight as possible, I chose to use an array, where the root of the tree exists in the array at index 1, the left child of a particular node n exists at index 2*n, and the right child exists at index 2*n+1.
Generating this tree involves running through the array, starting at index 1. For each element, intersect the particular ray with the world to find an intersection point, then use the intersection point to generate the reflected and refracted rays, which are put into index 2*n and 2*n+1, respectively. Using this method, the entire dependency tree can be populated with a simple, non-recursive, for loop. The loop should run across all the elements of the array, in increasing order, but stop halfway through the array – as the body of the for loop generates rays with index 2*n+1, it is important we stop when 2*n+1 is the last element in the array.
Then, actually evaluating the computation involves running through the array backwards, converting each ray to a color. This requires another array of the same size, except this one is filled with output colors. As with any ray tracer, a function exists to find the immediate color of a particular ray by intersecting it with the world and running shading calculation on the intersection point. This function is applied across the last half of the dependency tree array, because the recursion is stopped at the maximum depth of the recursion. The second half of the array is populated with rays that correspond to the maximum depth of the recursion. Then, the first half of the array is processed, going backwards. For any particular ray, its immediate color must be calculated, and then mixed with the color in slot 2*n (the color of the reflected ray) and the color in slot 2*n+1 (the color of the transmitted ray). Once this loop has completed, the final color is the color in index 1 of the color array.
In this model, special attention must be delivered to the rays that do not intersect the world. Because of the nature of the color conversion, each element in the ray dependency tree must be initialized. Therefore, if a ray does not intersect the world, both of that ray’s children can simply be copies of that ray itself. This guarantees that the entire subtree for this ray is populated with rays that do not intersect the world. When converting the rays to colors, if a particular ray does not intersect the world, a default background color must be used.
Note that this is not a particularly optimized ray tracer – spatial partitioning can be used to greatly speed up computation.
Unfortunately, I cannot get the program to run on the graphics card in my server (8800gt) or the graphics card in my laptop (9400m). Calculating the lighting values for a black and white image works, and calculating an unshaded color works, but the one line that multiplies the target color with the shading color makes the entire scene go black. My only guess is that the smaller lower-end graphics cards in my server and laptop just can’t handle the large kernel required to trace rays. That being said, my 2 GTX 295 cards run the kernel just fine. Using all 4 GPUs in the GTX 295s is about 7 times faster than using all 8 (pseudo-)cores in the Core i7.