Litherum: 2009

Sunday, December 27, 2009

2-Dimensional Perlin Noise Generator

I have recently finished updating my Perlin Noise Generator to work in two dimensions. This was actually a little bit more challenging that I expected, but here goes an explanation.

Perlin noise is an accumulation of many different wave functions, so the obvious thing to do is to separate each of these out and work on them one at a time. Each one needs a set of unique keypoints at regular intervals over the domain of the function. Because the noise generator is two dimensional, each keypoint needs to have an x coordinate, a y coordinate, and a magnitude. The amount of the keypoints should not be fixed, as the different functions that will need to be generated will have to have different frequencies and therefore different amounts of keypoints. Luckily, I wrote the one-dimensional keypoint generator with this in mind - now all I had to do to run this in two dimensions was to simply use tail recursion to call the one-dimensional keypoint generator many times (making sure to update the random generator each time, or all the random magnitudes will be the same) and append all the results to each other.

The next interesting part - figuring out which keypoints are the ones bounding a given (x, y) coordinate. The output of this function serves as input to an interpolation function, which uses the bounding box magnitudes to figure out what the magnitude of a given point would be. When I was thinking about how to do this, I was wondering what would happen if an (x, y) coordinate lied directly on a keypoint, or what if it doesn't lie in between a square of 4 but actually lies on a border of the square, in between exactly 2 keypoints. Based on this reasoning, I had this function return either 1, 2, or 4 keypoints; if the function returns 1 coordinate, the given point lies exactly on a keypoint, and the interpolation function can just use the magnitude of that point. If two points are returned, the function can use the linear interpolation that I already wrote in the last version of the noise generator. If four points are returned, then the full bicubic interpolation must be run.

That being said, my implementation of this function is not the best. In order to find if the given point lies on a keypoint, I filter the list of keypoints for one that lies on the given point. If this filtered list is empty, I proceed to filter the list of keypoints for all the ones that have only one coordinate equal to that coordinate in the given point (for the linear case). If this list isn't empty, I filter based on the greatest value that is less than the given point and the smalles value that is less than the given point, and return those. If none of those cases were true, I filter the list again twice to get all the points that are in the four quadrants around the given point. The top-left point is simply the last point in the top-left quadrant, and the bottom-right point is simply the first coordinate in the bottom-right quadrant (assuming the points come sorted, and filter doesn't rearrange the list). In order to get the top-right point and the bottom-left point, I had to do folds across the list to find one with minimum in one coordinate and maximum in the other. That makes for a ton of filters, each of which is O(n), and even 2 folds thrown into the mix. After I was done, running the program took a very long time.

Looking back, I feel that I should change the output of the function so that it always returns 4 points. If the given point lies on a point, I should return a specified square (say, to the bottom-right of the given point), and if a point lies on a line, I should return the square either below or to the right of the given point. This way, I can unify all the multiple cases into one case, limit the number of filters, and the interpolation function can deal with when a coordinate value = 0. Cubic and bicubic interpolation do this for us, so no additional code would have to be added.

The interpolation function itself is also a fairly interesting function. After reading the Wikipedia article on bicubic interpolation, it kindly provides a matrix that can be used. The use of this matrix is fairly simple - make a vector of all the values and partial derivatives as specified, multiply them by the given matrix, and use the output vector as coefficients into the simple double-sigma formula at the top of the page (Under "Bicubic spline interpolation"). Because Haskell is a very math-y language, implementing the double-sigma formula was extremely simple. After thinking for a second or two about how to do the matrix multiplication, I realized that Haskell's standard formulas were enough. Matrix multiplication is simply multiplying each entry i in row j with entry i in the vector. So, I am going to be doing some computation over ever row of the matrix - this looks like a map. The computation that I'm going to be doing is a dot product - which looks like a sum over a zip. All in all - the multiplication looks like this:


map (sum . (zipWith (*) vector)) matrix

Of course, this assumes that the matrix is in row-major form.

All in all, everything is mostly the same as the one-dimensional case, except with another dimension artificially added in. My code can be found here.

I did have a neat idea about making an image. All the two-dimensional noise images that I had seen have been grayscale, so that it looks kind of like fog. My idea was to simply run the algorithm 3 times, mapping the output into each of the red, green, and bue channels of the image, respectively. This would create a bright, more appealing image. Here is one such image, with 3 noise functions added together:

Saturday, December 5, 2009

One dimensional Perlin Noise Generator

A few weeks ago, I became a little obsessed (oxymoron, I know) with Perlin Noise. It was originally developed by Ken Perlin as a way to create noise that has variation that varies. While that may sound like a tongue twister, it actually makes a lot of sense.

Simply taking a list of uniform random numbers (Note: this is *not* what Perlin noise is), and perhaps doing some Hermite Spline interpolation between each pair of adjacent points will make a landscape that is equally jaggy as it is flat. There will be no global variation in the random numbers - the "next" value in the list is entirely independent of all previous values.

Now, beautiful noise has the quality that the "next" value lies close to, but not exactly on the current value. This quality should be able to be observed at both a global and local environment. This means that, if you step back and squint, the noise should not be too jaggy or too flat. In addition, if you zoom in and look really close, you should still see small perturbations, and they should not be too grand but not too small that they don't exist either.

Perlin was able to achieve this using a pretty simple idea which appears to come from sound waves. Say you have a low (bass) tone and a high (treble) tone in a song. The low tone is going to have a low frequency but a large amplitude. The large magnitude is necessary because any wave with a low frequency is going to have small amount of energy carried with it, so the amplitude needs to be artificially increased to give the wave a decent amount of energy. The treble tone, however, has a high frequency and therefore does not need to have as high of an amplitude. Adding these two tones together (playing both tones at the same time) can sound like music, and will result in a wave that has small slight perturbations from a longer, more baseline (get the pun?) wave.

Now, simply adding two waves does not create the beautiful noise that we are in search of. We want to add together many, many waves, so we have slight perturbations on slight perturbations all the way down the line, and we get a nice, smooth, varying wave. The waves that we will add together, have a special property. Lets say we have one wave in our list of waves to add together. If the next wave we use has the property that it has twice the frequency and half the amplitude of the wave we already have, an interesting thing occurs. The rate of change of the two waves turns out to be exactly the same. Doubling the frequency halves the wavelength, and the amplitude is halved, so the entire wave is actually just being miniaturized (and different random values are being chosen, of course). So the resultant wave will then seem to vary smoothly, because the perturbations have the same average rate of change as all the other waves. In music, doubling the frequency is the idea of an octave, and playing multiple octaves together indeed sounds beautiful, the way we want our noise to sound.

I've been working on implementing a Perlin Noise generator in Haskell, and have come across a couple things to watch out for. The first is that, because Haskell is purely functional, you have to handle random numbers delicately. In an imperative language, every call to rand() has a side effect of changing the system's random seed. In Haskell, however, this side effect cannot occur. This means that the random function must take a random seed as an argument and return a random number and a split random seed. If you want to generate a new random number, you must then use the new random seed. Therefore, you have to be very careful about how your random seed evolves around the flow of your program. Any function that uses the random function must return a tuple with whatever it would return, and an updated random seed that the caller can continue with. It is quite a headache passing all these random seeds around, even though I can easily see why it is necessary.

I wanted to draw the output of the function with some image drawing library. I have used libGD many times in the past, in many languages (C, C++, PHP). After looking for a bit, I found that there is a Haskell module that provides bindings into libGD. I know that there is a Haskell package manager called Cabal, so I look for Cabal to install within Ubuntu's package manager. It's a no go; even a quick search on the Internet confirmed that Cabal is not currently in APT. Oh well, I'll just download the source myself and install it. This works like a charm on my Ubuntu box, and I look for documentation on how to use this module. There is basically none, and I spend at least a couple hours looking over the two or three useful pages on the Internet pertinent to the topic. Finally I scrap together a couple lines that does what I want it to do - simply draw some pixels on an image in a 'for' loop.

Then I try to run the code on my laptop. My laptop is running Snow Leopard, a 64-bit (mostly) OS. However, 64-bit support for GHC is not enabled on Snow Leopard, so I had to install the 32-bit version of GHC. However, now when I try to install the Haskell module, it must link to a 32-bit version of libGD. Okay, no problem, I go and run `port install libgd +universal` hoping that this builds the 32-bit version of libGD. It starts to, but then errors on some assembly file saying that some x86 assembly commands do not work on x86_64. Awesome. I can't download a binary from the libGD website, because I need the development libraries as that is what the Haskell module will hook into in the end. At this point I gave up; It already runs on my desktop.

But not for long! After a system update to Ubuntu, the darn thing prints out an error message about how libpthread.so has an invalid ELF header. At this point, I have had my fill trying to get this module to work, and I figure it would be much less of a headache just to write that specific part of the program in a language that is much widely accepted.

My idea is to have the Haskell script simply output the locations of all the pixels that should be turned on in the image to standard output. Then, my Perl script would gobble that up, draw the image with libGD, and output the image to standard output. If this works, I could draw the image by simply piping the two commands together. One Perl script later, this is working (And the Perl script is quite cool, may I say ... "while (<>)" is always a shocker when you see it =D Yay for default variables!)

While I am not quite done with this project (I want to implement 2-D and 3-D and 4-D versions of the same program), I have found a couple things that are wrong with Haskell. It's been around for at least 20 years, but the community just isn't there. There are a couple mailing lists that are quite large, and even some textbooks about it, but there was really no help I could find trying to use the little module that I was trying to use. Documentation is slim for these corner cases. Also, maybe I don't know the language well enough, but it seemed freaking impossible to do the simplest things with the image. Simply drawing pixels in a loop seemed to be much harder than it should be - I had to use a list comprehension to make a list of IO commands (one for drawing each pixel) and then make a "Do IO Commands" function that recursively runs through a list and does each command one-by-one. Dealing with IO commands, which is required for what I'm trying to do, just seems entirely awkward. Like I said earlier, though, maybe I just haven't learned how to use them elegantly yet.

Soon, I will post the same version of the program in many dimensions! look out for it!

My current code can be found here.

Tuesday, November 24, 2009

Bead Sort in OpenCL

Recently (read: past couple of years) I have become interested in high performance computing. This includes everything from PlayStation 3s to GPGPUs to BlueGenes. It seems that each company has created it’s own framework and hardware with which to program a high performance program.

As I was researching all these different kinds of technologies, I thought to myself that each one was interesting, but programmed in a completely different way. For example, when programming with NVidia CUDA, you have to make sure that the execution path is the same for all the threads that you want to run (If you want the best performance). If you’re programming with the Cell Broadband Engine, you have to use their (arcane) mbox to pass messages between the multiple processors. If you’re using MPI, you have to have many computers that all have the same hardware, are set up with single sign on, and have a shared filesystem – certainly a non-trivial task.

Not only does the programming have many different quirks for each system, but the performance varies greatly as well. The lastest NVidia (295GTX) card boasts 289 gigaflops (http://en.wikipedia.org/wiki/GeForce_200_Series); the PlayStation 3 is reported to have 28 gigaflops (http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats). A beowulf cluster simply has the addition of all the flops of its constituent parts. Now, I’ve heard amazing things about each of these technologies, and those numbers don’t seem to reflect the hype.

I thought I would take a survey of each of these technologies to try to make some sense of the mess out there. I believe if I carefully document the problems I encounter, the pros and cons that I have found for each system, performance, and cost, both the world and I may be a little happier.

The problem that I have chosen to implement on each of these technologies is a program called Bead Sort (http://en.wikipedia.org/wiki/Bead_sort). It is a sorting algorithm (hence the name) and I thought it would be particularly appropriate since computer scientists like myself seem to be obsessed with sorting algorithms. I’ll try to explain it quickly here (I know clicking on links is strenuous!):

Imagine, if you will, there are 10 totem poles in front of you, all lined up. Next to the totem poles is a pile of gigantic washers – big enough to fit around the totem poles. Now, you want to represent the number 5, so you pick up five washers and put one on each of the first five totem poles. The washers slide down the poles, and now you have 5 washers lying at the bottom of the first five totem poles. Now you want to represent the number 7, so you follow suit and drop 7 washers down the first 7 totem poles. Now, however, there are already 5 washers on the floor, so the first 5 of this set of 7 land on the washers that are already there, and the two left over drop down to be flush with the floor. The sort works by dropping each of the numbers you want to sort down on the totem poles, and you’re left with a nice little triangle of washers on your totem poles, with each horizontal level required to have less washers on it than the level below it. Then you simply read off the number of washers in each level in order, and your list is now sorted.

The benefit of this sorting method is that, if each of the totem poles are handled in parallel (as they are via gravity), it is an O(n) sort (sort of, see below). You just drop sets of washers down on these poles, and the number of sets you drop is the number of numbers you want to sort. Once you’re done dropping the sets, you’re done. “But wait!” you may be screaming, “converting the number 5 to 11111 is itself an O(n) loop! That turns the computation into an O(n*m) one, where n is the number of numbers and m is the size of the maximum number!” I do have answer for you, though it may not seem very tasteful.

In theoretical computer science, complexity is a function of the size of the input. The idea is that if an algorithm is O(n^2), doubling the size of the input should quadruple the processing time required. The input to a program is technically a sequence of characters – not numbers themselves. Therefore, doubling the size of the input of a program, technically, should be doubling the number of characters in the input. The value of a number and the representation for a number are not linearly related, however. If you take the number “123” and double its representation to get “123123”, you don’t get a number that’s twice as big – you get a number that’s roughly 1000 times as big.

So, the loop that converts that “5” to 11111 is really just a loop to represent a number in the format that it should be represented in – it does have the property where doubling the representation to 1111111111 doubles its value. As computer scientists, and even mathematicians, we don’t want to write all those tally marks to represent large numbers, so we have adopted a terminology that lets us logarithmically decrease the number of digits we have to write. Therefore, counting the loop to turn digits into sequences of bits should not be counted. I know this argument is weak, but I feel that it at least provides a feeble defense.

One of the benefits of using this algorithm is that sorting is a standard practice in computer science; so many different sorting algorithms can be easily swapped out for each other. In addition, the contract that a sorting function makes is very well defined. I can use this to my advantage – I could make a separate object file for each of the different kinds of sorts that I want to use, each of which includes a sort function of the same name and signature. Then I can use the same calling code with all methods – I just have to choose which object file I link this calling code with to choose which implementation to test. This will work to an extent – I can only do this for programs that will run on the same hardware. Other than that, I will have to build an entirely new binary for each entirely new system that this will run on (i.e. a PlayStation 3 cannot share the calling binary that my laptop uses).

Anyway, I have been programming with NVidia CUDA for maybe a year or so, so I thought that I have a fair amount of experience with graphics cards, but I wanted to try a new platform for the first implementation. I have heard about OpenCL, and thought that I would give it a try, so I watched Dave Gohara’s OpenCL tutorials (http://macresearch.org/opencl) to learn about the platform.

My initial reaction was that OpenCL was (almost) everything that GPU computing should be. The first thing that struck me was that OpenCL uses just-in-time compilation, so the binary form of your kernel is not actually made at compile time. When you run your OpenCL program, you have to give your program the text of the C code that describes your kernel, and your program compiles, builds, links it, and runs it. This means that there is a compiler and linker inside the binary of my bead sort program. Not only that, but I do not get the benefit of using a well-established compiler like G++ - I have to hope that the OpenCL compiler is fast and optimized. I realize that computers today (and certainly any computer that you would be using OpenCL on would have a decent graphics card and would therefore be newish) are fast, powerful, have large disk drives and large amounts of memory, but something like this would make my software engineering professor cry. It throws cohesion out the window.

Now, I can see the problem that they were trying to solve. First of all, my program should run on whatever kind of hardware it wants to, regardless of where it was compiled. By compiling the kernel at runtime, it can be tuned for the specific hardware that will be running it. In addition, it has the benefit that I can compile my OpenCL program with any old compiler and not have to worry about the compiler having support for making binaries for graphics cards. CUDA solved this problem by simply creating their own compiler that does know how to compile binaries for specific graphics cards. Both ways work, and both ways have pros and cons, so I’m going to delay judgment of this design decision. I will say, however, that CUDA’s solution makes a lot of sense given that the hardware is fixed – the programs only need to run on NVidia hardware, and the fact that they are attempting to keep their technology closed and proprietary. OpenCL appears to be attempting to keep the options open for what hardware the program will run on.

That being said, OpenCL is entirely extendable. In fact, it’s designed so that you can change the device your program runs on by simply changing one line of code – the line of code that selects the active device. This works well enough that you can run your program on a CPU or a GPU just by changing this line. Think about that – it means that it will generate entirely different binaries depending on where you want to run your calculation, on the fly. This is incredibly powerful, and I can see this technology being extended to any of the various HPC systems out there. There is a current effort to port OpenCL to the PlayStation 3, to make it much easier to program with (http://sites.google.com/site/openclps3/). According to that website, even IBM has released an OpenCL implementation for the cell architecture. I have a dream, that one day I will be able to program all HPC hardware in one unified way. OpenCL appears that, one day, it may be that way.

So, after learning how to use OpenCL, I started to actually write my program. This actually required a fair amount of thinking. Clearly, one thread (or work-unit, in OpenCL terminology) should handle each totem pole. In addition, the presence of a washer at a specific place on the totem pole clearly should be represented by a bit (1 = present, 0 = absent). Therefore, the totem poles would be represented as a two-dimensional array of bits. Each thread will crawl up one of the totem poles and make them “fall.” However, computers do not have the concept of a single bit being addressable, so you must use bytes. Then, 8 threads must be responsible for a single byte. But each thread is not aware of the other threads operating on the same byte, so thread 1 can set a specific byte to be something (say a washer is falling on totem pole 1) but thread 5 is going to set it to be something completely different (say a washer is not falling on totem pole 5). Since the memory is byte-addressable, the threads will step on each others’ toes. Okay – simple solution: have 1 thread operate on each 8 bits in a byte, and wrap up the computation in a for-loop that executes 8 times, once for each bit in the byte.

Okay, but how about if column 4 has a washer every other spot, and column 7 has no washers in it. How would one go about actually making the washers “fall?” Well, the simple solution would be to simply count up the number of washers in each column, call it x, and set the n-x values of the totem pole to be 1 and the remaining values to be 0. This presents a problem, however, as the range of that last loop is data dependent. In the example given, column 4 is going to loop from 1 to n/2, and column 7 is going to loop from 0 to 0. As with any GPU programming, all the threads must have the same execution path for best performance. So the GPU threads cannot do the actual dropping of the washers – the only thing they can do is count the number of washers on a totem pole.

This, however, is okay! I have developed a linear time algorithm to take the counts for the totem poles and recreate the original numbers. Say you have counts of [7, 5, 5, 5, 3, 2] (meaning totem pole 1 has 7 washers on it, and totem pole 5 has 3 washers on it). Because there are 6 numbers here and therefore 6 totem poles in a row that have washers on them, you know that the largest row in the triangle (the bottom row) has 6 washers in it. In addition, the smallest number in that list is a 2, so you know that there are 2 rows with 6 washers in them. The output list is now [6, 6]. Now, erase these two largest rows because they have already been counted. Now the counts are [5, 3, 3, 3, 1]. There are 5 numbers in this list, so you know that there is a row with 5 washers in it. Add this to the output, and erase that row from the triangle. Now the output is [6, 6, 5] and the counts are [4, 2, 2, 2]. After repeating this until the entire pyramid has been removed, the output is now [6, 6, 5, 4, 4, 1, 1], which is the correct sequence. With proper value caching, this algorithm can be reduced to linear time. I believe it is a dynamic programming algorithm, but to say for sure would take more study.

So, everything is linear time. In addition, the timing for this last step on the CPU is around 0.1% or the timing for getting the totem pole counts on the GPU, so this last step can be safely ignored.

After coding this up, I felt that I needed to optimize a bit. The algorithm itself does not use shared (or “local,” in OpenCL lingo) memory at all – I figured this clearly was a problem. So, I gave each computational block (or “work group”) its own chunk of shared memory, and modified the kernel so that the computer copies chunks of data into this cache, operates on the cache, and then copies the next chunk in. The idea is that copying a splotch of data from global memory to shared memory is quick, assuming you format your data correctly. The main rules that govern this is, as I understand it, that all base pointers should just be divisible by 32, and the copy should be a “thread x copies byte x” one-to-one copy. Once the data is quickly copied to local memory, access to it is exceptionally quick, especially if you are accessing that data multiple times like I am – each thread has to deal with each byte 8 times for each of the 8 bits in the byte (This is where you see a greater increase in speed – n*speedup is much more noticeable than just a speedup).

In addition, I had to make sure that there were no bank conflicts. Each byte in local memory is located inside a bank, and if two thread attempt to access the same bank, one thread has to wait for the other to finish, resulting in a slowdown. From what I understand, shared memory byte x is located in bank x mod 32 – the bytes roll down the banks. Okay, so if the number of threads in a workgroup is limited to 32, each one can operate on its own bank, assuming the base pointer is divisible by 32 and each thread in the workgroup offsets from the base pointer by that thread’s id. Limiting the number of threads per workgroup to 32 will still keep each multiprocessor at full load, because the underlying driver will put two workgroups on the same multiprocessor if they will fit, and in this case the numbers are small enough that multiple workgroups will fit on the same multiprocessor.

So I coded this up in OpenCL and unhappily found that this new, memory-optimized version of the kernel is, in fact, slightly slower than the original, unoptimized version. This set me back a little bit – All that work for nothing! Perhaps I coded the solution incorrectly, but even after looking over the code everything seemed to be in place. After thinking about it for a little while, here is the explanation I came up with:

In the unoptimized version, the bytes being read are being read quickly, for the same reason that the optimized version can copy in data from global memory very quickly – the memory is coalesced. In both versions of the program, both versions are reading in the same amount of data from global memory – that should not account for any time differences. Therefore, the slowest part of the kernel is the same for both versions. As OpenCL has the concept of automatic variables, where the system will determine in which part of memory (local, global, etc) to put the variable, I am assuming that OpenCL is actually doing the caching that I explicitly programmed in under the hood in the unoptimized version. It is surely an intelligent system.

Looking back on programming in OpenCL, a fair amount of work was done for me under the hood (like the optimization). The amount of code to set up the OpenCL environment seems a little large and overly verbose, even though I know what the function of each line of code I wrote is. It makes sense that all those lines need to be there, but the sheer amount of the code I had to write just to run a kernel was a little daunting. Writing the kernel, though I had lots of things to keep in mind as I did it (such as memory coalescing and how I broke up the problem), seemed to be fairly straightforward. All in all, programming for OpenCL is very similar to programming with CUDA, with a couple more function calls that are required. Looking over the program afterwards seems deceptively simple – finding good values for the constants that I use was nontrivial, and having to pad the data so that its size is a multiple of 32 in both dimensions is not exactly elegant, but the final product is fairly clean. As this is the first framework that I have used, I cannot do any performance comparisons, but my laptop sorts anything I can throw at it (Anything that fits in the 128MB that is addressable on the video card) in under half a second. It surely is a fairly speedy framework.

My (uncommented) code can be found here.

Knights Tour Revisted

I have a good friend, Matt, who shares my interest for interesting problems. We were talking about this problem, and he immediately described why I was having the problems that I was having with the Knight's Tour.

In my previous solution, I had been explicitly saving each encountered state in a stack. Therefore, this stack was simply a linearization of the entire search tree - something with d^b nodes.

Now, the first time I had programmed this algorithm, I had used a breadth-first search, something where using an explicit queue was necessary. I had quickly realized that in order to port this algorithm to a depth-first search, I should just simply use a stack.

Matt quickly made the comment that using an explicit stack to save each and every state is completely wasteful. In fact, it is possible (and simple!) to use the system stack with a depth-first search, and use recursion to push and pop states on and off the stack. With this approach, only one path through the search tree is ever in memory at a time, so instead of holding d^b nodes, you simply hold a maximum of d nodes. Clearly this is a better solution.

Last week I applied for a job at Google, and the interviewer actually started talking to me about this problem. He said that, at least for this specific problem, it is possible to only store a single state. The motivation for this is clear, I assume: Some games have extremely complicated, large states, and holding multiple states in memory may not be feasible. After some thought about what he described, I realized that it is simply an extension of what Matt had told me about.

The idea is this: all levels of recursion share the same representation of the board. Because only one level of recursion is literally executing at a time (the deepest level), the board only needs to have one value at any given point in time. When you want to make a move and recurse to search that subtree, you can make the move on the shared board. Then, if that recursion fails to find a solution - un-make that move on the shared board. That way the precondition will still hold for the next move to make.

Now, clearly this solution only works if it's possible to un-make a move. In a game where the moves are super complicated, this solution will not work.

Also, my first solution was written in Haskell. As Haskell is purely functional, all of the data structures are immutable. This means that a copy of the state is made every time a move is made - it is impossible to change the single board in memory. Therefore, this solution would clearly not work in Haskell.

Also, it should be noted that this algorithm assumes that only one search path is executing at any time. Therefore, this algorithm is unable to be parallelized in any obvious way. Different subtrees cannot be searched in parallel, because they have to be operating on different boards; however, there is only one board in memory. So, this algorithm also does not scale to multiple CPUs. It might be possible to split the entire search tree into a small amount of subtrees at one-level deep, and run the algorithm in parallel on each of those sub-trees. In this case, however, each thread would have to have their own board to search on, so there are now a few boards in memory instead of just one.

I have posted previously about my lack of thinking on the topic of complexities in programming. So, in the spirit of self-improvement, I thought about this algorithm before I coded it up. Clearly, this algorithm would solve the memory problem. However, the tree for a 8x8 chessboard still is 64 levels deep. The breadth of the tree starts out at 8, but this diminishes as possible moves may have already been made (Clearly, or else there would be a gigantic number of solutions as any solution that makes it all 64 levels deep is a successful solution). Therefore, the tree is fairly deep and starts out fairly broad. As I am a rather visual person, I imagine it as being something vaguely circular - broad at the top, but then tapering off as the number of unmade moves is limited toward the bottom. Therefore I expected that my solution to the problem will still be slow, even if it isn't large.

I then thought about my state representation. The solution to this problem is a sequence of (X, Y) pairs. I will call a particular entry in the solution (Xp, Yp) and the next entry (Xp+1, Yp+1). The solution can be entirely crafted as a sequence of these pairs where the first pair is (0, 0) and abs(Xp+1 - Xp) + abs(Yp+1 - Yp) = 3. Not only that, but no pair on the solution may be reused, so the solution is also a permutation of ([0..width], [0..height]). The word permutation immediately clued me in to the idea that this problem is NP-complete (even though I already knew it was, this sealed the deal).

The search tree successor function can simply find all the (Xp+1, Yp+1)s that fit the bill. This approach, however, does not lend to using a permutation generator, as finding a permutation that has this quality is non-trivial. This means that the solution can entirely be crafted in terms of itself, and the representation of the board itself is unnecessary. That being said, storing the board allows for O(1) lookup to see if a particular solution is already in the solution, making sure that the solution is, in fact, a permutation. This is necessary because the permutation generator approach does not find successors easily, so the permutative quality of the solution must be artificially kept.

So anyway, I coded this up in C. My program finds that there is no solution to a knight's tour on a 4x4 board in 0.009 seconds. It finds that there is no solution to a knight's tour on a 5x5 board in 0.168 seconds. It finds that there is a solution to a knight's tour on a 6x6 board in 52.473 seconds. I started running the program on a 7x7 board, but killed the process after it ran for around 24 hours. Note that I am running these on my MacBook Pro laptop.

So here I am, trying to think of a way to make this faster. As soon as I figure it out, I'll make another post.

My code can be found here.

Monday, November 9, 2009

Project Euler problem 66 in Haskell

I just got finished programming up Project Euler problem 66 in Haskell. I must say, the solution is extremely elegant. One thing is certain - I never felt this good about programming in any imperative language. Something about framing a solution in as simple and powerful terms as possible just gets the dopamine flowing. Anyway, here is my solution:


continuedFraction :: Integral a => a -> a -> a -> [a]
continuedFraction root num denom = whole : (continuedFraction root newdenom outdenom)
   where whole = floor (((sqrt $ fromIntegral root) + (fromIntegral num)) / (fromIntegral denom))
         newdenom = (denom * whole) - num
         outdenom = (root - newdenom*newdenom) `div` denom

evaluateConvergent :: Integral a => [a] -> (a, a)
evaluateConvergent terms = foldr (\ term (x, y) -> (y + term * x, x)) (1, 0) terms

getConvergents :: Integral a => a -> [(a, a)]
getConvergents root = [evaluateConvergent $ take x $ continuedFraction root 0 1 | x <- [1..]]

getX :: Integral a => a -> a
getX d = fst $ head $ filter (\ (x, y) -> x*x - d*y*y == 1) $ getConvergents d

isPerfectSquare :: Integral a => a -> Bool
isPerfectSquare val = val == squareroot * squareroot
   where squareroot = floor $ sqrt $ fromIntegral val

argmax :: Ord b => (a -> b) -> [a] -> a
argmax func list = fst $ foldl1 (\ (x1, y1) (x2, y2) -> if y1 > y2 then (x1, y1) else (x2, y2)) $ zip list (map func list)

main = print $ argmax getX (filter (not . isPerfectSquare) [2..1000])

And there you have it. Solution runs in about 0.4 seconds on my laptop.

Sunday, November 1, 2009

New Laptop

As of late, I have bought a MacBook Pro. At the time, I did it mainly for completeness - I have used all of the major operating systems except for OS X (All the NT-based Windowses, Many distributions of Linux, FreeBSD, OpenSolaris). I have a personal commitment to portable code, so I thought that I would have one more platform to make sure my code runs on it.

However, this laptop quickly became my choice computer to use. It seems to be the best medium between a low-level Unix environment and a high-level Windows environment. The command-line is a staple of any unix developer's toolbox, and I find Windows' lack of an integrated terminal (that doesn't blow) very disgruntling. A note on that, actually - Why the hell can't we make terminals wider than 80 characters on Windows? Literally every other operating system has figured out a way to make that work. On the other hand, OS X has many supported applications (Something in which Linux and Unix as a whole are lacking) as well as having a very intuitive user interface (to an extent - Those silly minimization animations are ridiculous)

Anyway, enough harping on Apple. I believe that I must be starting to drink the Kool-Aid or something - I'm liking this computer more and more. In fact, I like it so much that I've decided to get rid of my old laptop and use this one instead. I used my old one because it was a tablet and I found it useful for taking notes in math classes. However, I have found that that HP laptop as a whole is shoddily made. For example, the trackpad will move the mouse and click, even when i'm nowhere near the laptop. It makes watching movies excessively annoying when the laptop is set up on a desk, I'm 2 feet away from it watching peacefully, and the mouse moves to the pause button and pushes it, completely unprompted. I have found that on paper, the two laptops cost about the same, have comparable power, but the MacBook Pro just has so much more polish.

So I have finally began the process of wiping that old laptop to sell. But then, what was I to do if I was on the go and needed to use Windows? Lo and behold, the MacBook Pro has an Intel processor, so I installed Windows 7 on it with the help of Boot Camp. And of course, I installed some of the games that I own, just to try them out. I know for a fact these games are completely unplayable on my older laptop, but on the MacBook Pro they run completely fine! Once I set the screen resolution to a little bit smaller than native, I was getting a consistent playable 30fps in Left 4 Dead.

One thing is clear - Apple's engineering clearly outshines HP's. Though this may not be a fair comparison, as the HP was a tablet, the usability speaks for itself. In fact, the MacBook Pro runs games so well, I believe I could take it to a lan party and play all night, and not have to lug my giant heavy desktop gaming system. So much better.

Knight's Tour

I am taking a Graph Theory class at my school, and last class my teacher handed out a problem to the class. He isn't going to collect the problem, and we are not even expected to solve the problem, but he thought that some of the class would appreciate it.

The problem is fairly simple. On a chessboard, there are 64 squares arranged in an 8x8 grid. A knight is capable of moving in an "L" shape, 2 squares in one direction and 1 square in an orthogonal direction in a single move. A knight starts out at the top left corner of the chess board. It has been shown that it is possible for the knight to move to each and every square on the chessboard exactly one time and end up in the top left corner again. What is the sequence of moves that the knight has to go through in order to complete this tour of the chessboard?

Before I begin, I want to say that I have not solved this problem yet. This post is not about a solution to the problem. It is about my thought processes and my ability to learn lessons from my failed attempt at solving this problem.

It is extraordinarily simple to reduce this into a graph problem; in fact, my professor did it for us. Each square on the chessboard represents a vertex, and there are edges going from each square to each of the 8 other squares that a knight can jump to from that square. The question is simply asking for a Hamiltonian Cycle starting from the top-left vertex. Now, it is well-known that the Hamiltonian Path problem is NP-Complete. That means that there is no quick algorithm for solving the problem (More formally, all known algorithms for solving this problem involve a branching factor that is proportional to the data size - think of it as having complexity of C^N rather than N^C.) Therefore, there is no quick straightforward algorithm to solve this problem (in the generic case).

I'll admit it - I'm not very good at math. Perhaps I have been tainted by computers, but it takes me a long time to do a problem like this. I have been jaded with the idea of solving problems by hand when I have a giant machine next to me that can solve it for me. All I have to do is tell the machine the rules of the problem, and away it goes! It sounds so much easier than actually having to make deductions about a problem.

The general problem itself is NP-Complete, so I am assuming that there are some additional rules that can be added because of the specific layout of this graph. However, my attempt at solving this problem immediately disregarded these additional rules. Instead of making intelligent deductions, I threw a standard algorithm and the basic rules at the computer and told it to go, in much the way that I have described above.

The actual algorithm I used is one that is a simple case of an Artificial Intelligence search algorithm. The idea is to make all the possible moves (in a structured order) and hopefully one of the moves is the "winning" move. Given a board, there is a set of moves that the knight can make, which result in a set of "neighboring" boards. Then, from each of these boards, the knight can move again, and again, so on and so forth. It is solved in much the same way that a constraint satisfaction problem is solved - follow possible path the knight can take all the way until it can't move anymore, and if that's not the winning move, back up a little and try a similar path with only the ending different. If that doesn't work, back up again, etc. This is a stack-based approach, which leads to a depth-first traversal of the problem.

I have had to implement this algorithm many times. Once was in Artificial Intelligence class, once was on my own (I was making a SuDoKu solver because I'm bad at math, and had to get a computer to solve the problem for me), and once in Computer Science 4 class. So needless to say, in an hour or two the algorithm was coded up in Haskell (my current favorite language) and here is my code for you to peruse:

import qualified Data.Set

data Board = Board [[Bool]] Int Int deriving (Show, Eq, Ord)

set2DIndex :: [[a]] -> Int -> Int -> a -> [[a]]
set2DIndex l x y v =
   [l !! c | c <- [0..x-1]] ++
   [[(l !! x) !! c | c <- [0..y-1]] ++
      [v] ++ 
      [(l !! x) !! c | c <- [y+1..7]]] ++
   [l !! c | c <- [x+1..7]]

setVisited :: [Board] -> [Board]
setVisited boards =
   [Board (set2DIndex visited posx posy True) posx posy | Board visited posx posy <- boards]

getSuccessors :: Board -> [Board]
getSuccessors (Board visited posx posy) =
   setVisited $
   filter (\(Board visited x y) -> not ((visited !! x) !! y)) $
   filter (\(Board visited x y) -> x >= 0 && x < 8 && y >= 0 && y < 8) $
   [Board visited (x + posx) (y + posy) | x <- [-1, 1], y <- [-2, 2]] ++
   [Board visited (x + posx) (y + posy) | y <- [-1, 1], x <- [-2, 2]]

isWin :: Board -> Bool
isWin (Board visited _ _) = foldl1 (&&) $ foldl1 (++) visited

-- Receives a queue/stack of board histories and a set of processed boards, returns the sequence of boards that get the solution
findTour :: [[Board]] -> Data.Set.Set Board -> [Board]
findTour queue processed
   | isWin board = last queue
   | Data.Set.member board processed = findTour (init queue) processed
   | otherwise = findTour ((init queue) ++ (map ((++) (last queue)) [[a] | a <- getSuccessors board])) (Data.Set.insert board processed)
   where board = last $ last queue

main = putStrLn $ show $ findTour [[Board [[False | x <- [1..8]] | y <- [1..8]] 0 0]] Data.Set.empty

Sorry about the lack of comments - I was never really intending this to be shown to a large audience. Hopefully the code is not too unreadable.

So I hit the big proverbial "RUN" button, and watched the Activity Monitor as GHC started eating all my CPU and memory. Then, the CPU graph fell off and because stable at around 10%, with around 70% of that being system time. In addition, the system froze up, last.fm stopped playing music, and the mouse would respond only sporadically. Clearly, the system was swapping heavily.

While I'm watching my computer slowly toast itself alive, I took a step back and thought about the problem I was trying to solve. I thought about the other problems that I had solved with this same standard algorithm. SuDoKu ran quickly (About 0.1 seconds if I remember correctly) because the search tree could be pruned so easily. A naive interpretation of the data structure would be that there are 9x9 cells, and each one could have any of the number 1..9 in it, so that makes for 9^81 possible boards. However, most of those are not valid (e.g. any subtree of a board starting with 9,9 is invalid), so entire giant subtrees are taken out of the search path. This makes for a well-pruned search tree that can be traversed fairly quickly.

The knight's tour problem, however, does not lend itself very well to pruning. Sure, the knight cannot move to a spot that it has already moved to, but the probability of that happening is fairly low for most of the calculation. In fact, given the state representation that I chose (a 8x8 grid of booleans representing whether the knight has been at that location or not), not a single state is invalid. Every combination of those boolean values represents a valid board, unlike now a SuDoKu board has many invalid states. My algorithm eventually is going to have to check each of the 2^64 states that the board could be in, not to mention that two states can be different even if the boards are the same but the knight is in a different position. In fact, because there is no search pruning, my approach is no better than simply running through each of these possibilities one by one. Given that there are 10^19 possibilities, clearly this is not going to work.

As soon as I realized this, the Control-C came crashing down. Granted, the system took about 5 minutes to respond to the keypresses, but after everything that had been shunted to the disk was read back into memory, everything appears to be in working order now. Then, I ate some spaghetti.

During the meal, I thought to myself that this phenomena had not been the first time that this had occurred to me. There have been at least 2 or 3 problems (mostly Project Euler problems, as these are designed to exploit this problem) where I have immediately launched into coding up what I thought was a solution to the problem without realizing the complexity of the solution. In particular, even a linear-time algorithm can be slow if N is larger than about 10^9.

The underlying problem is that I do not have a very intuitive knowledge of how big 'big' is. I see a computationally-expensive problem, and think "Oh, that's more than I can do, but a computer can also do more than I can do, so surely a computer can do that task!" It takes a special kind of programmer to not just launch into a problem, but to take a step back and to realize the complexity of the problem, as well as other implicit problems such as data overrun problems and the like. I heard a story that Donald Knuth, in his day, would want to write a program, but instead of sitting down at a terminal, he would think about the problem for weeks in a row. Finally, when he was ready, he would sit down at his desk with paper and pencil, and write out his program from start to finish, in one swath. This means that before he wrote code, he thought about all the details of the algorithm, the complexity, the formulation, everything. Now, clearly this isn't incremental improvements (which are almost always a good thing), but I want to take a lesson from Knuth. I want to know what I'm writing before I write it, not as or after I write it.

Note: My Project Euler profile can be found here.

Friday, October 9, 2009

Closures in Python and JavaScript

Here is a program written in Python:

def clo():
   a = 3
   def geta():
       return a
   def seta(newa):
       a = newa
   return [geta, seta]

closure = clo()
closure[1](5)
print(closure[0]())

And here is the same program written in JavaScript:

function clo() {
   var a = 3;
   function geta() {
       return a;
   }
   function seta(newa) {
       a = newa;
   }
   return [geta, seta];
}

closure = clo()
closure[1](5)
print(closure[0]())

Here is the output of running the Python program:

tuba:~ litherum$ python --version
Python 2.5.1
tuba:~ litherum$ python test.py
3

And here is the output of running the JavaScript program:

tuba:~ litherum$ java -jar Downloads/rhino1_7R2/js.jar test.js
5

Here are a couple observations of this:

Both languages printed out a value, so clearly both of them use closures

Python printed out the original value of a, so it appears that it creates a new closure for each function invocation.

JavaScript printed out the new value of a, so it appears that the closure is consistent across function invocations

I just noticed this and thought that it's fairly interesting food for thought.

Saturday, October 3, 2009

Misattribution

Joel Spolsky gave a talk at his business of software conference on making a number one product. One of his main points was this concept of misattribution, or trying to get people to feel that their happiness is caused by your software product, even if it may not be.

My own understanding of misattribution goes something like this: Say I'm going out on a first date to a restaurant with a girl. I want this girl to like me, so on the drive to the restaurant, I put on some music that I know she likes. Music provides a line directly to the emotional part of the brain, something which ordinary speech can not do. This music starts off the date by putting her in a good mood. Now, assuming I don't do anything particularly fiendish during the date, her general good mood should stick around. On the way home, the music once again cheers her up. When she arrives at home, she thinks back about the night and believes that because she was in a good mood for the entire evening, the date must have gone well, and therefore the two of us might be at least somewhat compatible. The kicker is this: she misattributed her feelings of happiness from the music with her feelings of me.

Now take this to the software world. What is actually happening here, when you woo your girl by putting on some Barry White? You are creating an experience. Experience is the entire night out at the restaurant; experience starts when you first hear about a product and ends when you last think about it. The iPod has been lauded and heralded so much lately, but I must say that Apple really has the experience down right. The package your iPod comes in is crisp and clean. Your iPod pops out of the case like it is from the future. Apple understands that, to a general user, it does not matter how many gigawhatsits your new processor has or how many megawhosings your monitor supports. They realized the power of misattribution early.

Every geek knows the slight joy felt when one end of a wire fits perfectly into another end of another wire. The pieces were simply made to fit together, and when you plug them together everything fits perfectly. This is the joy of unwrapping your new iPod: everything just seems to fit together in a way that makes it seem right. I know I like rounded corners, which the iPod employs. This does not make it play music any better; I just like rounded corners. I enjoy using my iPod partially because of the overall experience. Apple has found a way to make me misattribute my happiness of things fitting together correctly for my feelings of using an iPod.

At first glance, the concept of misattribution seems like trickery: Make the user think that your product is good by putting something shiny next to it. However, when Joel Spolsky says that we should make our users misattribute their positive feelings to our products, I believe he means that we should create an entire experience for our product. Do not simply make a product with X, Y, and Z features. Instead, think about how the user is going to interact with your product and make a sort of meta-product: one which you are not selling directly but is nevertheless sold with your product.

Last summer I read a blog post about the distinguishing factor between enterprise software and non-enterprise software. The main point the post made was that, for standard commercial software, the burden of integration is on the user. Okay, Fred, if you want to use Microsoft IIS, you have to figure out where it is going to run, how it is going to run, what software is going to run on top of it, and you have to hope that using this software is going to solve your problem. In enterprise software, you must act much more similar to a contractor - you are no longer selling a product but now selling a solution to a problem. It is now no longer up to the company to figure out how your product is going to solve their problems; It is up to you to actually solve the company's problems with your product.

This ties in completely with the aforementioned topics of misattribution and experience. Enterprise software has to work at a higher level than simply selling a product; they have to create a meta-product to sell to their customer. If you solve a company's problem with minimal overhead, the entire experience will be positive for the company, and they will misattribute their positive feelings onto your product, even if it is not the best product out there.

This describes a fundamental shift of computer programs: the best programs are not products but are instead services. They create an entire experience for the user and solves the user's problem without the user's intervention. If they do their job well, users will love it, even if it lacks some features.

Which would you rather drive: a crappier, more fun car? Or a car that goes faster but smacks you in the face repeatedly and requires intense concentration to drive?

Evil Corporation logos

As far as logos go, I place an extremely high value on elegance. Making an elegant logo, however, is a difficult task. Here are a couple things that I think can help:

Smooth, clean lines
Repetition of a theme
Simplicity
Few colors

I was playing Prototype the other day and the thought struck me: Evil corporations in video games generally seem to have extremely elegant logos. Here are 3 examples:

Umbrella Corporation from Resident Evil:

Gentek from Prototype (This is the best one I could find):

Armacham Technology Corporation from F.E.A.R.:

Decide for yourself, but I think that if I ever have to make a logo, I will use these as a starting point. Hopefully I won't be in the process of creating my own evil corporation :)