Being the computer scientist that I am, I thought about how to get a computer to solve this for me. Brute forcing the solution seemed fairly simple, so I started there. If you don’t know what brute forcing a solution is, it’s a computer program that tries every possible input configuration to a problem until it finds the right configuration. Such a program can only work on a problem where the space of configurations is not to big (which is why brute-force attacks on common encryption doesn’t work: the key space is simply too big!)
Let’s get a feel for the problem space. If there was only one house, this house would have 5 different choices for color, 5 different choices for nationality, 5 different choices for beverage, etc., making the number of possibilities 5*5*5*5*5, or 5^5 possible configurations. Now, if a second house were to exist, that house would have only 4 choices for color, because it can’t be the same color as the first house, whatever color that happens to be. By the same logic, the second house only has 4 choices for nationality, beverage, etc. This means that, for each configuration of the first house, the second house has 4*4*4*4*4 different configurations, for a total of 5^5*4^5 configurations with 2 houses. A third house would only have 3 choices for each dimension, so this brings the total number of configurations up to 5^5*4^5*3^5. Continuing the pattern, there are (5!)^5 different valid house configurations. This comes out to 24,883,200,000 configurations, or about 24 billion. That’s actually not that many configurations for a computer!
So, how shall we represent each of these configurations? This seems fairly simple enough. A particular configuration can be represented as a table of houses and assignments to values. An example would be the following:
|House 1||House 2||House 3||House 4||House 5|
This led me to realize that the number of configurations for each row is fairly small: there are only 5! possible different rows that are valid (because none of the numbers can be repeated in a single row.) Why not represent each row by a number from 1 to 5!, and use this number as an index into a constant table of all the possible different rows? That would decrease the size of the configuration down to 5 bytes (a single byte can hold values up to 255, so 1-120 will surely fit). Also, the “next” configuration becomes very apparent: a configuration of [53, 92, 108, 7, 14] would have a next configuration of [53, 92, 108, 7, 15]. Thus, enumerating all the configurations is simply counting from 0 to 120^5-1 in base 120. This provides a pretty simple way to count the configurations of the problem: the i'th configuration can be found by simply converting i to base 120.
Testing a particular configuration is fairly straightforward. Because of the structure of the 15 clues, a function that finds the house number where property x is set to y would be useful. Another function that finds the value of property z in house w would also be useful. Each of the tests can be formulated as combinations of those two functions.
Then, I decided to run this on the Cell Broadband Engine, as found in the PlayStation 3. Last year I wrote a ray tracer on the Cell, but I felt that a revisit would be insightful. The structure of the program is pretty simple:
- Generate each of the possible rows (find all the permutations of the list [0, 1, 2, 3, 4]
- Divide up the problem into n pieces (simple because of the structure of each configuration)
- Send each range to each worker processor, along with the permutations generated earlier. Each worker processor goes through the range, converting each trial into a configuration and tests it. If it's the right configuration, push it to main memory in a DMA, and notify the other processors that they should stop. If it's not the right configuration, try the next one
- Print out the solution in main memory
- As described earlier, the problem size is over 4 billion. Therefore, keeping track of each state has to be done in a 64-bit int (a long long)
- Instead of converting an index into base 120 for every test, I kept the number in base 120, and implemented an increment and test-less-than function in software for base 120 numbers
- DMAs can only be done in sizes divisible by 16 bytes, and only done to addresses divisible by 16. This was super annoying in many cases. For example, when the solution was found, I wanted to DMA the 5 numbers in the correct configuration back to main memory. The DMA, however, had to be 32 bytes long, though, because I was using 4 byte ints. This means that the local representation of the 5 numbers has to be at least 8 ints long, because, otherwise, the DMA might access memory that you haven't claimed as your own, and generate a segmentation fault.
- DMAing structs from main memory to a worker processor's memory assumes that the internal representation of the struct is layed out the same. It isn't necessarily the same though, if you use different versions of compilers for your SPU and PPU code.
- SPE to SPE notification is done in a kind of interesting way: it's done as a (high-priority) DMA to a memory-mapped section of host memory. The idea is that there is a section of main memory that's memory-mapped to each SPE's local memory. When you want to notify another processor of something, a DMA is done into the section of main memory that's mapped to the SPE you want to notify. That SPE then sets up a volatile variable that lives at the location of memory just written to by the DMA, and the C program can read from this variable every tick of an inner program loop (polling). This means that, in order to send a notification to a process on another SPE, you have to know the location where that SPE's memory is mapped to in main memory, but also the memory location inside of that other SPE's executing code to write into. If the executable that is doing the notification reading and notification writing are the same executable, then this second piece of information is free, because variables live in the same memory addresses no matter which SPU they're running on (because the binaries are exactly the same, so they're laid out in memory the same, and the programs start at the same location in memory, because the SPU program loader loads all programs into the same starting address)
- From the PPU's point of view, SPU programs run inside "contexts." When you want to run a program, you create a context, load a "SPU program object thingy" into the context, and tell the context to run. Now, every context has a memory-mapped range of memory that is mapped into wherever the SPU program is executing. Loading the SPU program actually does a memcpy of the program binary into that memory mapped space. This has the effect that you can have more contexts than SPUs, and SPUs can context switch between these contexts. Also, this context switch has to re-set up memory mapping so that the correct section of main memory is mapped to the correct SPE when the executing program changes. Also, this lets your program migrate around from SPE to SPE as load changes.
- Waiting for a DMA to complete is a two-step process:
- Write 1s into the bit positions for the DMA tags you want to wait for into a hardware register
- Wait, using this hardware register
litherum@takashi:~/einstein$ time ./einstein Found solution: House 0 1 2 3 4 --------- 3 4 0 1 2 3 2 0 4 1 4 2 1 0 3 1 2 0 4 3 2 3 1 4 0 real 23m8.818s user 0m0.002s sys 0m0.015s litherum@takashi:~/einstein$
It may be possible to vectorize the solution, but the implementation for this isn't immediately obvious. Because there are no strict vectors, I would have to run 4 tests at once using the vector registers. However, the tests don't fit into vector registers very well, and I'm not sure how feasible this is. I decided not to pursue it.
Here is my code.