Tuesday, November 24, 2009

Bead Sort in OpenCL

Recently (read: past couple of years) I have become interested in high performance computing. This includes everything from PlayStation 3s to GPGPUs to BlueGenes. It seems that each company has created its own framework and hardware for writing high performance programs.

As I was researching all these different kinds of technologies, I thought to myself that each one was interesting, but programmed in a completely different way. For example, when programming with NVidia CUDA, you have to make sure that the execution path is the same for all the threads that you want to run (if you want the best performance). If you’re programming with the Cell Broadband Engine, you have to use its (arcane) mailboxes to pass messages between the multiple processors. If you’re using MPI, you typically need many computers that all have the same hardware, are set up with single sign-on, and share a filesystem – certainly a non-trivial task.

Not only does the programming have many different quirks for each system, but the performance varies greatly as well. The latest NVidia card (the GTX 295) boasts 289 gigaflops (http://en.wikipedia.org/wiki/GeForce_200_Series); the PlayStation 3 is reported to have 28 gigaflops (http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats). A Beowulf cluster is simply the sum of the flops of its constituent parts. Now, I’ve heard amazing things about each of these technologies, and those numbers don’t seem to reflect the hype.

I thought I would take a survey of each of these technologies to try to make some sense of the mess out there. I believe that if I carefully document the problems I encounter, the pros and cons I find for each system, the performance, and the cost, both the world and I may be a little happier.

The problem that I have chosen to implement on each of these technologies is a program called Bead Sort (http://en.wikipedia.org/wiki/Bead_sort). It is a sorting algorithm (hence the name) and I thought it would be particularly appropriate since computer scientists like myself seem to be obsessed with sorting algorithms. I’ll try to explain it quickly here (I know clicking on links is strenuous!):

Imagine, if you will, there are 10 totem poles in front of you, all lined up. Next to the totem poles is a pile of gigantic washers – big enough to fit around the totem poles. Now, you want to represent the number 5, so you pick up five washers and put one on each of the first five totem poles. The washers slide down the poles, and now you have 5 washers lying at the bottom of the first five totem poles. Now you want to represent the number 7, so you follow suit and drop 7 washers down the first 7 totem poles. Now, however, there are already 5 washers on the floor, so the first 5 of this set of 7 land on the washers that are already there, and the two left over drop down to be flush with the floor. The sort works by dropping each of the numbers you want to sort down on the totem poles, and you’re left with a nice little triangle of washers on your totem poles, with each horizontal level having no more washers on it than the level below it. Then you simply read off the number of washers in each level in order, and your list is now sorted.
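
To make the mechanics concrete, here is a minimal sequential sketch of the idea in plain C (purely illustrative; it is not the OpenCL version discussed below, and the names are mine):

#include <stdio.h>

/* Drop each input value onto the poles, then read the triangle back out.
 * poles[p] ends up holding the number of washers resting on pole p. */
void bead_sort(const int *input, int *output, int n, int max_value) {
    int poles[max_value];
    for (int p = 0; p < max_value; p++)
        poles[p] = 0;

    /* dropping the value v puts one washer on each of the first v poles */
    for (int i = 0; i < n; i++)
        for (int p = 0; p < input[i]; p++)
            poles[p]++;

    /* level i of the triangle has one washer for every pole taller than i,
     * so reading the levels from the floor up gives the values in
     * descending order */
    for (int i = 0; i < n; i++) {
        int level = 0;
        for (int p = 0; p < max_value; p++)
            if (poles[p] > i)
                level++;
        output[i] = level;
    }
}

int main(void) {
    int input[] = {5, 7, 1, 3};
    int output[4];
    bead_sort(input, output, 4, 8);
    for (int i = 0; i < 4; i++)
        printf("%d ", output[i]);   /* prints: 7 5 3 1 */
    printf("\n");
    return 0;
}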

The benefit of this sorting method is that, if each of the totem poles is handled in parallel (as they are via gravity), it is an O(n) sort (sort of, see below). You just drop sets of washers down on these poles, and the number of sets you drop is the number of numbers you want to sort. Once you’re done dropping the sets, you’re done. “But wait!” you may be screaming, “converting the number 5 to 11111 is itself an O(n) loop! That turns the computation into an O(n*m) one, where n is the number of numbers and m is the size of the maximum number!” I do have an answer for you, though it may not seem very tasteful.

In theoretical computer science, complexity is a function of the size of the input. The idea is that if an algorithm is O(n^2), doubling the size of the input should quadruple the processing time required. The input to a program is technically a sequence of characters – not numbers themselves. Therefore, doubling the size of the input of a program, technically, should be doubling the number of characters in the input. The value of a number and the representation for a number are not linearly related, however. If you take the number “123” and double its representation to get “123123”, you don’t get a number that’s twice as big – you get a number that’s roughly 1000 times as big.

So, the loop that converts that “5” to 11111 is really just a loop to represent a number in the format that it should be represented in – it does have the property where doubling the representation to 1111111111 doubles its value. As computer scientists, and even mathematicians, we don’t want to write all those tally marks to represent large numbers, so we have adopted a notation that lets us logarithmically decrease the number of digits we have to write. Therefore, the loop that turns digits into sequences of bits should not be counted against the algorithm. I know this argument is weak, but I feel that it at least provides a feeble defense.

One of the benefits of using this algorithm is that sorting is a standard practice in computer science, so many different sorting algorithms can easily be swapped out for one another. In addition, the contract that a sorting function makes is very well defined. I can use this to my advantage – I could make a separate object file for each of the different kinds of sorts that I want to use, each of which includes a sort function of the same name and signature. Then I can use the same calling code with all methods – I just have to choose which object file I link this calling code with to choose which implementation to test. This will work to an extent – I can only do this for programs that will run on the same hardware. Beyond that, I will have to build an entirely new binary for each entirely new system that this will run on (i.e. a PlayStation 3 cannot share the calling binary that my laptop uses).
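
To make that concrete, the shared contract could be as small as a header like the following (an illustrative sketch; the names are mine and not from my actual code):

/* sort.h - the one function every implementation's object file must export.
 * The calling code includes this header and is linked against exactly one
 * implementation (plain CPU, OpenCL, CUDA, and so on). */
#ifndef SORT_H
#define SORT_H

/* Sort `count` unsigned integers in place, in ascending order. */
void sort(unsigned int *values, unsigned int count);

#endif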

Anyway, I have been programming with NVidia CUDA for maybe a year or so, so I have a fair amount of experience with graphics cards, but I wanted to try a new platform for the first implementation. I had heard about OpenCL and thought that I would give it a try, so I watched Dave Gohara’s OpenCL tutorials (http://macresearch.org/opencl) to learn about the platform.

My initial reaction was that OpenCL was (almost) everything that GPU computing should be. The first thing that struck me was that OpenCL uses just-in-time compilation, so the binary form of your kernel is not actually made at compile time. When you run your OpenCL program, you have to give your program the text of the C code that describes your kernel, and your program compiles, builds, links it, and runs it. This means that there is a compiler and linker inside the binary of my bead sort program. Not only that, but I do not get the benefit of using a well-established compiler like G++ - I have to hope that the OpenCL compiler is fast and optimized. I realize that computers today (and certainly any computer that you would be using OpenCL on would have a decent graphics card and would therefore be newish) are fast, powerful, have large disk drives and large amounts of memory, but something like this would make my software engineering professor cry. It throws cohesion out the window.

Now, I can see the problem that they were trying to solve. First of all, my program should be able to run on whatever hardware it finds itself on, regardless of where it was compiled. By compiling the kernel at runtime, it can be tuned for the specific hardware that will be running it. In addition, it has the benefit that I can compile my OpenCL program with any old compiler and not have to worry about the compiler having support for making binaries for graphics cards. NVidia solved this problem in CUDA by creating its own compiler that does know how to compile binaries for specific graphics cards. Both ways work, and both ways have pros and cons, so I’m going to delay judgment of this design decision. I will say, however, that CUDA’s solution makes a lot of sense given that the hardware is fixed – the programs only need to run on NVidia hardware – and given that NVidia is attempting to keep its technology closed and proprietary. OpenCL appears to be attempting to keep the options open for what hardware the program will run on.

That being said, OpenCL is entirely extensible. In fact, it’s designed so that you can change the device your program runs on by simply changing one line of code – the line of code that selects the active device. This works well enough that you can run your program on a CPU or a GPU just by changing this line. Think about that – it means that it will generate entirely different binaries depending on where you want to run your calculation, on the fly. This is incredibly powerful, and I can see this technology being extended to any of the various HPC systems out there. There is a current effort to port OpenCL to the PlayStation 3, to make it much easier to program with (http://sites.google.com/site/openclps3/). According to that website, even IBM has released an OpenCL implementation for the Cell architecture. I have a dream that one day I will be able to program all HPC hardware in one unified way. It appears that OpenCL may one day be that way.
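
For illustration, here is roughly what the device selection looks like on the host side (a sketch from memory, not code from my program; error checking omitted). Swapping CL_DEVICE_TYPE_GPU for CL_DEVICE_TYPE_CPU is the one-line change I mean:

#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    char name[256];

    clGetPlatformIDs(1, &platform, NULL);

    /* The one line that picks the hardware: change CL_DEVICE_TYPE_GPU to
     * CL_DEVICE_TYPE_CPU and the rest of the program stays the same. */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof name, name, NULL);
    printf("running on: %s\n", name);
    return 0;
}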

So, after learning how to use OpenCL, I started to actually write my program. This required a fair amount of thinking. Clearly, one thread (or work-item, in OpenCL terminology) should handle each totem pole. In addition, the presence of a washer at a specific place on the totem pole clearly should be represented by a bit (1 = present, 0 = absent). Therefore, the totem poles would be represented as a two-dimensional array of bits. Each thread will crawl up one of the totem poles and make it “fall.” However, computers do not have the concept of a single bit being addressable, so you must use bytes. That would mean 8 threads are responsible for a single byte. But each thread is not aware of the other threads operating on the same byte, so thread 1 can set a specific byte to be something (say a washer is falling on totem pole 1) but thread 5 is going to set it to be something completely different (say a washer is not falling on totem pole 5). Since the memory is byte-addressable, the threads will step on each other’s toes. Okay – simple solution: have one thread operate on all 8 bits in a byte, and wrap up the computation in a for-loop that executes 8 times, once for each bit in the byte.

Okay, but what if column 4 has a washer in every other spot, and column 7 has no washers in it? How would one go about actually making the washers “fall?” Well, the simple solution would be to count up the number of washers in each column, call it x, and set the bottom x positions of the totem pole to 1 and the remaining n-x positions to 0. This presents a problem, however, as the range of that last loop is data dependent. In the example given, column 4 is going to loop from 1 to n/2, and column 7 is going to loop from 0 to 0. As with any GPU programming, all the threads must have the same execution path for best performance. So the GPU threads cannot do the actual dropping of the washers – the only thing they can do is count the number of washers on a totem pole.
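
To give a flavor of what that counting kernel might look like, here is a rough sketch in OpenCL C (illustrative only; it is not the kernel from my program, and the buffer layout and names are assumptions):

/* Each work-item owns one byte column (8 totem poles), walks every dropped
 * number (row), and accumulates one count per bit, i.e. per pole. */
__kernel void count_washers(__global const uchar *rows,   /* num_rows x num_byte_cols bytes */
                            __global uint *counts,        /* one count per totem pole */
                            const uint num_rows,
                            const uint num_byte_cols)
{
    uint gid = get_global_id(0);
    if (gid >= num_byte_cols)
        return;

    uint pole_counts[8] = {0, 0, 0, 0, 0, 0, 0, 0};

    /* every work-item runs exactly the same loop bounds, so there is no
     * divergence; only the data differs */
    for (uint r = 0; r < num_rows; r++) {
        uchar b = rows[r * num_byte_cols + gid];
        for (uint bit = 0; bit < 8; bit++)
            pole_counts[bit] += (b >> bit) & 1;
    }

    for (uint bit = 0; bit < 8; bit++)
        counts[gid * 8 + bit] = pole_counts[bit];
}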

This, however, is okay! I have developed a linear time algorithm to take the counts for the totem poles and recreate the original numbers. Say you have counts of [7, 5, 5, 5, 3, 2] (meaning totem pole 1 has 7 washers on it, and totem pole 5 has 3 washers on it). Because there are 6 numbers here and therefore 6 totem poles in a row that have washers on them, you know that the largest row in the triangle (the bottom row) has 6 washers in it. In addition, the smallest number in that list is a 2, so you know that there are 2 rows with 6 washers in them. The output list is now [6, 6]. Now, erase these two largest rows because they have already been counted. Now the counts are [5, 3, 3, 3, 1]. There are 5 numbers in this list, so you know that there is a row with 5 washers in it. Add this to the output, and erase that row from the triangle. Now the output is [6, 6, 5] and the counts are [4, 2, 2, 2]. After repeating this until the entire pyramid has been removed, the output is now [6, 6, 5, 4, 4, 1, 1], which is the correct sequence. With proper value caching, this algorithm can be reduced to linear time. I believe it is a dynamic programming algorithm, but to say for sure would take more study.
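
Here is a rough sketch of that reconstruction in C (my own illustrative code, not the program linked at the end of this post). It relies on counts[] being non-increasing, which the bead sort guarantees, and it produces the values largest first:

#include <stdio.h>

/* Turn per-pole washer counts back into the sorted values. Peeling rows off
 * the bottom of the triangle becomes a two-pointer scan, so the whole thing
 * is linear in (number of poles + number of values). */
void counts_to_sorted(const int *counts, int num_poles, int *output, int n) {
    int width = num_poles;    /* poles that still have washers on them */
    int removed = 0;          /* rows already peeled off the bottom */
    int out = 0;

    while (out < n) {
        /* shrink the row width past any poles that are now empty */
        while (width > 0 && counts[width - 1] - removed <= 0)
            width--;
        if (width == 0)
            break;
        /* every level between the old floor and the top of the shortest
         * remaining pole holds exactly `width` washers */
        int levels = counts[width - 1] - removed;
        for (int i = 0; i < levels && out < n; i++)
            output[out++] = width;
        removed += levels;
    }
}

int main(void) {
    int counts[] = {7, 5, 5, 5, 3, 2};
    int output[7];
    counts_to_sorted(counts, 6, output, 7);
    for (int i = 0; i < 7; i++)
        printf("%d ", output[i]);   /* prints: 6 6 5 4 4 1 1 */
    printf("\n");
    return 0;
}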

So, everything is linear time. In addition, the timing for this last step on the CPU is around 0.1% of the timing for getting the totem pole counts on the GPU, so this last step can be safely ignored.

After coding this up, I felt that I needed to optimize a bit. The algorithm itself does not use shared (or “local,” in OpenCL lingo) memory at all – I figured this clearly was a problem. So, I gave each computational block (or “work group”) its own chunk of shared memory, and modified the kernel so that the computer copies chunks of data into this cache, operates on the cache, and then copies the next chunk in. The idea is that copying a splotch of data from global memory to shared memory is quick, assuming you format your data correctly. The main rules that govern this, as I understand it, are that all base pointers should be divisible by 32, and the copy should be a “thread x copies byte x” one-to-one copy. Once the data is copied to local memory, access to it is exceptionally quick, especially if you are accessing that data multiple times like I am – each thread has to touch each byte 8 times, once for each of the 8 bits in the byte (this is where you see a greater increase in speed – n*speedup is much more noticeable than just a speedup).

In addition, I had to make sure that there were no bank conflicts. Each byte in local memory is located inside a bank, and if two threads attempt to access the same bank, one thread has to wait for the other to finish, resulting in a slowdown. From what I understand, shared memory byte x is located in bank x mod 32 – the bytes roll down the banks. Okay, so if the number of threads in a workgroup is limited to 32, each one can operate on its own bank, assuming the base pointer is divisible by 32 and each thread in the workgroup offsets from the base pointer by that thread’s id. Limiting the number of threads per workgroup to 32 will still keep each multiprocessor at full load, because the underlying driver will put two workgroups on the same multiprocessor if they will fit, and in this case the numbers are small enough that multiple workgroups will fit on the same multiprocessor.
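
As a rough sketch of that staging pattern (again illustrative, and not the exact kernel I wrote), with 32 work-items per work-group so that work-item lid always lands in bank lid:

__kernel void count_washers_cached(__global const uchar *rows,
                                   __global uint *counts,
                                   const uint num_rows,
                                   const uint num_byte_cols,
                                   __local uchar *cache)   /* 32 bytes, one per work-item */
{
    uint lid = get_local_id(0);   /* 0..31 within the work-group */
    uint gid = get_global_id(0);  /* which byte column this work-item owns */

    uint pole_counts[8] = {0, 0, 0, 0, 0, 0, 0, 0};

    for (uint r = 0; r < num_rows; r++) {
        /* "thread x copies byte x": one coalesced read from global memory,
         * with each work-item writing to its own bank */
        cache[lid] = rows[r * num_byte_cols + gid];
        barrier(CLK_LOCAL_MEM_FENCE);   /* wait for the whole copy to finish */

        /* the 8 re-reads of this byte now come out of local memory */
        for (uint bit = 0; bit < 8; bit++)
            pole_counts[bit] += (cache[lid] >> bit) & 1;
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    for (uint bit = 0; bit < 8; bit++)
        counts[gid * 8 + bit] = pole_counts[bit];
}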

So I coded this up in OpenCL and unhappily found that this new, memory-optimized version of the kernel is, in fact, slightly slower than the original, unoptimized version. This set me back a little bit – All that work for nothing! Perhaps I coded the solution incorrectly, but even after looking over the code everything seemed to be in place. After thinking about it for a little while, here is the explanation I came up with:

In the unoptimized version, the bytes are read quickly for the same reason that the optimized version can copy data in from global memory very quickly – the memory accesses are coalesced. Both versions read the same amount of data from global memory, so that should not account for any time difference. Therefore, the slowest part of the kernel is the same for both versions. As OpenCL has the concept of automatic variables, where the system determines which part of memory (local, global, etc.) a variable lives in, I am assuming that in the unoptimized version OpenCL was already doing, under the hood, the caching that I explicitly programmed in. It is surely an intelligent system.

Looking back on programming in OpenCL, a fair amount of work was done for me under the hood (like the optimization). The amount of code to set up the OpenCL environment seems a little large and overly verbose, even though I know what each line of code I wrote is there for. It makes sense that all those lines need to be there, but the sheer amount of code I had to write just to run a kernel was a little daunting. Writing the kernel, though I had lots of things to keep in mind as I did it (such as memory coalescing and how I broke up the problem), seemed to be fairly straightforward. All in all, programming for OpenCL is very similar to programming with CUDA, with a couple more function calls required. Looking over the program afterwards, it seems deceptively simple – finding good values for the constants that I use was nontrivial, and having to pad the data so that its size is a multiple of 32 in both dimensions is not exactly elegant, but the final product is fairly clean. As this is the first framework that I have used, I cannot do any performance comparisons, but my laptop sorts anything I can throw at it (anything that fits in the 128MB that is addressable on the video card) in under half a second. It surely is a fairly speedy framework.

My (uncommented) code can be found here.

Knight's Tour Revisited

I have a good friend, Matt, who shares my interest in interesting problems. We were talking about this problem, and he immediately saw why I was having the problems that I was having with the Knight's Tour.

In my previous solution, I had been explicitly saving each encountered state in a stack. Therefore, this stack was simply a linearization of the entire search tree - something with b^d nodes (for branching factor b and depth d).

Now, the first time I had programmed this algorithm, I had used a breadth-first search, something where an explicit queue was necessary. I quickly realized that in order to port this algorithm to a depth-first search, I should simply use a stack.

Matt quickly made the comment that using an explicit stack to save each and every state is completely wasteful. In fact, it is possible (and simple!) to use the system stack with a depth-first search, and use recursion to push and pop states on and off the stack. With this approach, only one path through the search tree is ever in memory at a time, so instead of holding b^d nodes, you hold a maximum of d nodes. Clearly this is a better solution.

Last week I applied for a job at Google, and the interviewer actually started talking to me about this problem. He said that, at least for this specific problem, it is possible to only store a single state. The motivation for this is clear, I assume: Some games have extremely complicated, large states, and holding multiple states in memory may not be feasible. After some thought about what he described, I realized that it is simply an extension of what Matt had told me about.

The idea is this: all levels of recursion share the same representation of the board. Because only one level of recursion is literally executing at a time (the deepest level), the board only needs to have one value at any given point in time. When you want to make a move and recurse to search that subtree, you can make the move on the shared board. Then, if that recursion fails to find a solution - un-make that move on the shared board. That way the precondition will still hold for the next move to make.
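
A minimal sketch of that idea in C (my own illustration, not the program linked below; to keep it short it searches for an open tour rather than the closed tour of the original problem):

#include <stdio.h>
#include <string.h>

#define N 5   /* small board so the naive search finishes quickly */

static const int dx[8] = { 1,  2,  2,  1, -1, -2, -2, -1 };
static const int dy[8] = { 2,  1, -1, -2, -2, -1,  1,  2 };

static int board[N][N];                  /* the single shared board: 1 = visited */
static int path_x[N * N], path_y[N * N];

/* Depth-first search that makes a move on the shared board, recurses, and
 * un-makes the move if that subtree fails. Only one board ever exists. */
static int tour(int x, int y, int depth) {
    board[x][y] = 1;                     /* make the move */
    path_x[depth] = x;
    path_y[depth] = y;

    if (depth == N * N - 1)
        return 1;                        /* every square has been visited */

    for (int m = 0; m < 8; m++) {
        int nx = x + dx[m], ny = y + dy[m];
        if (nx >= 0 && nx < N && ny >= 0 && ny < N && !board[nx][ny])
            if (tour(nx, ny, depth + 1))
                return 1;
    }

    board[x][y] = 0;                     /* un-make the move before backtracking */
    return 0;
}

int main(void) {
    memset(board, 0, sizeof board);
    if (tour(0, 0, 0)) {
        for (int i = 0; i < N * N; i++)
            printf("(%d, %d)\n", path_x[i], path_y[i]);
    } else {
        printf("no tour found\n");
    }
    return 0;
}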

Now, clearly this solution only works if it's possible to un-make a move. In a game where the moves are super complicated, this solution will not work.

Also, my first solution was written in Haskell. As Haskell is purely functional, its ordinary data structures are immutable. This means that a copy of the state is made every time a move is made - you cannot simply change the single board in memory. Therefore, this solution would not work in Haskell, at least not without resorting to mutable state in IO or ST.

Also, it should be noted that this algorithm assumes that only one search path is executing at any time. Therefore, this algorithm is unable to be parallelized in any obvious way. Different subtrees cannot be searched in parallel, because they would have to be operating on different boards; however, there is only one board in memory. So, this algorithm also does not scale to multiple CPUs. It might be possible to split the entire search tree into a small number of subtrees one level deep, and run the algorithm in parallel on each of those subtrees. In this case, however, each thread would have to have its own board to search on, so there would now be a few boards in memory instead of just one.

I have posted previously about my lack of thinking on the topic of complexities in programming. So, in the spirit of self-improvement, I thought about this algorithm before I coded it up. Clearly, this algorithm would solve the memory problem. However, the tree for an 8x8 chessboard is still 64 levels deep. The breadth of the tree is at most 8, and it diminishes as possible moves may have already been made (clearly, or else there would be a gigantic number of solutions, as any path that makes it all 64 levels deep is a solution). Therefore, the tree is fairly deep and starts out fairly broad. As I am a rather visual person, I imagine it as being something vaguely conical - broad at the top, but then tapering off as the number of unmade moves is limited toward the bottom. Therefore I expected that my solution to the problem would still be slow, even if it wasn't large.

I then thought about my state representation. The solution to this problem is a sequence of (X, Y) pairs. I will call a particular entry in the solution (Xp, Yp) and the next entry (Xp+1, Yp+1). The solution can be entirely crafted as a sequence of these pairs where the first pair is (0, 0) and abs(Xp+1 - Xp) + abs(Yp+1 - Yp) = 3 (with neither difference being zero). Not only that, but no pair in the solution may be reused, so the solution is also a permutation of the pairs in [0..width-1] x [0..height-1]. The word permutation immediately clued me in to the idea that this problem is NP-complete (even though I already knew it was, this sealed the deal).

The search tree successor function can simply find all the (Xp+1, Yp+1)s that fit the bill. This approach, however, does not lend itself to using a permutation generator, as finding a permutation that has this quality is non-trivial. This means that the solution can be crafted entirely in terms of itself, and the representation of the board itself is unnecessary. That being said, storing the board allows for an O(1) lookup to see whether a particular square is already in the solution, making sure that the solution is, in fact, a permutation. This is necessary because the permutation generator approach does not find successors easily, so the permutative quality of the solution must be maintained by hand.

So anyway, I coded this up in C. My program finds that there is no solution to a knight's tour on a 4x4 board in 0.009 seconds. It finds that there is no solution to a knight's tour on a 5x5 board in 0.168 seconds. It finds that there is a solution to a knight's tour on a 6x6 board in 52.473 seconds. I started running the program on a 7x7 board, but killed the process after it ran for around 24 hours. Note that I am running these on my MacBook Pro laptop.

So here I am, trying to think of a way to make this faster. As soon as I figure it out, I'll make another post.

My code can be found here.

Monday, November 9, 2009

Project Euler problem 66 in Haskell

I just got finished programming up Project Euler problem 66 in Haskell. I must say, the solution is extremely elegant. One thing is certain - I never felt this good about programming in any imperative language. Something about framing a solution in as simple and powerful terms as possible just gets the dopamine flowing. Anyway, here is my solution:

-- Terms of the continued fraction expansion of (sqrt root + num) / denom;
-- call with num = 0, denom = 1 to expand sqrt root itself.
continuedFraction :: Integral a => a -> a -> a -> [a]
continuedFraction root num denom = whole : (continuedFraction root newdenom outdenom)
    where whole = floor (((sqrt $ fromIntegral root) + (fromIntegral num)) / (fromIntegral denom))
          newdenom = (denom * whole) - num
          outdenom = (root - newdenom*newdenom) `div` denom

-- Collapse a finite list of continued fraction terms into (numerator, denominator).
evaluateConvergent :: Integral a => [a] -> (a, a)
evaluateConvergent terms = foldr (\ term (x, y) -> (y + term * x, x)) (1, 0) terms

-- The successive convergents of sqrt root.
getConvergents :: Integral a => a -> [(a, a)]
getConvergents root = [evaluateConvergent $ take x $ continuedFraction root 0 1 | x <- [1..]]

-- The x of the first convergent satisfying Pell's equation x^2 - d*y^2 = 1.
getX :: Integral a => a -> a
getX d = fst $ head $ filter (\ (x, y) -> x*x - d*y*y == 1) $ getConvergents d

isPerfectSquare :: Integral a => a -> Bool
isPerfectSquare val = val == squareroot * squareroot
    where squareroot = floor $ sqrt $ fromIntegral val

-- The element of the list for which func is largest.
argmax :: Ord b => (a -> b) -> [a] -> a
argmax func list = fst $ foldl1 (\ (x1, y1) (x2, y2) -> if y1 > y2 then (x1, y1) else (x2, y2)) $ zip list (map func list)

main = print $ argmax getX (filter (not . isPerfectSquare) [2..1000])

And there you have it. Solution runs in about 0.4 seconds on my laptop.

Sunday, November 1, 2009

New Laptop

I recently bought a MacBook Pro. At the time, I did it mainly for completeness - I have used all of the major operating systems except for OS X (all the NT-based Windowses, many distributions of Linux, FreeBSD, OpenSolaris). I have a personal commitment to portable code, so I thought that I would have one more platform to make sure my code runs on.

However, this laptop quickly became my computer of choice. It seems to be the best middle ground between a low-level Unix environment and a high-level Windows environment. The command line is a staple of any Unix developer's toolbox, and I find Windows' lack of an integrated terminal (that doesn't blow) very disgruntling. A note on that, actually - why the hell can't we make terminals wider than 80 characters on Windows? Literally every other operating system has figured out a way to make that work. On the other hand, OS X has many supported applications (something in which Linux and Unix as a whole are lacking) as well as a very intuitive user interface (to an extent - those silly minimization animations are ridiculous).

Anyway, enough harping on Apple. I believe that I must be starting to drink the Kool-Aid or something - I'm liking this computer more and more. In fact, I like it so much that I've decided to get rid of my old laptop and use this one instead. I used my old one because it was a tablet and I found it useful for taking notes in math classes. However, I have found that that HP laptop as a whole is shoddily made. For example, the trackpad will move the mouse and click, even when I'm nowhere near the laptop. It makes watching movies excessively annoying when the laptop is set up on a desk, I'm 2 feet away from it watching peacefully, and the mouse moves to the pause button and pushes it, completely unprompted. On paper, the two laptops cost about the same and have comparable power, but the MacBook Pro just has so much more polish.

So I have finally begun the process of wiping that old laptop to sell it. But then, what was I to do if I was on the go and needed to use Windows? Lo and behold, the MacBook Pro has an Intel processor, so I installed Windows 7 on it with the help of Boot Camp. And of course, I installed some of the games that I own, just to try them out. I know for a fact these games are completely unplayable on my older laptop, but on the MacBook Pro they run completely fine! Once I set the screen resolution to a little bit smaller than native, I was getting a consistent, playable 30fps in Left 4 Dead.



One thing is clear - Apple's engineering outshines HP's. Though this may not be a fair comparison, as the HP was a tablet, the usability speaks for itself. In fact, the MacBook Pro runs games so well, I believe I could take it to a LAN party and play all night, and not have to lug my giant heavy desktop gaming system. So much better.

Knight's Tour

I am taking a Graph Theory class at my school, and last class the professor handed out a problem. He isn't going to collect it, and we are not even expected to solve it, but he thought that some of the class would appreciate it.

The problem is fairly simple. On a chessboard, there are 64 squares arranged in an 8x8 grid. A knight is capable of moving in an "L" shape, 2 squares in one direction and 1 square in an orthogonal direction in a single move. A knight starts out at the top left corner of the chess board. It has been shown that it is possible for the knight to move to each and every square on the chessboard exactly one time and end up in the top left corner again. What is the sequence of moves that the knight has to go through in order to complete this tour of the chessboard?

Before I begin, I want to say that I have not solved this problem yet. This post is not about a solution to the problem. It is about my thought processes and my ability to learn lessons from my failed attempt at solving this problem.

It is extraordinarily simple to reduce this into a graph problem; in fact, my professor did it for us. Each square on the chessboard represents a vertex, and there are edges going from each square to each of the (up to 8) other squares that a knight can jump to from that square. The question is simply asking for a Hamiltonian Cycle starting from the top-left vertex. Now, it is well-known that the Hamiltonian Path problem is NP-Complete. That means that there is no known polynomial-time algorithm for solving the problem (more formally, all known algorithms for this problem take time that grows exponentially with the size of the input - think of it as having complexity of C^N rather than N^C). Therefore, there is no quick straightforward algorithm to solve this problem (in the generic case).

I'll admit it - I'm not very good at math. Perhaps I have been tainted by computers, but it takes me a long time to do a problem like this. I have been jaded with the idea of solving problems by hand when I have a giant machine next to me that can solve it for me. All I have to do is tell the machine the rules of the problem, and away it goes! It sounds so much easier than actually having to make deductions about a problem.

The general problem itself is NP-Complete, so I am assuming that there are some additional rules that can be added because of the specific layout of this graph. However, my attempt at solving this problem immediately disregarded these additional rules. Instead of making intelligent deductions, I threw a standard algorithm and the basic rules at the computer and told it to go, in much the way that I have described above.

The actual algorithm I used is a simple case of an Artificial Intelligence search algorithm. The idea is to make all the possible moves (in a structured order) and hope that one of them is the "winning" move. Given a board, there is a set of moves that the knight can make, which result in a set of "neighboring" boards. Then, from each of these boards, the knight can move again, and again, so on and so forth. It is solved in much the same way that a constraint satisfaction problem is solved - follow a possible path the knight can take all the way until it can't move anymore, and if that's not the winning move, back up a little and try a similar path with only the ending different. If that doesn't work, back up again, etc. This is a stack-based approach, which leads to a depth-first traversal of the problem.

I have had to implement this algorithm many times. Once was in Artificial Intelligence class, once was on my own (I was making a SuDoKu solver because I'm bad at math, and had to get a computer to solve the problem for me), and once in Computer Science 4 class. So needless to say, in an hour or two the algorithm was coded up in Haskell (my current favorite language) and here is my code for you to peruse:
import qualified Data.Set

data Board = Board [[Bool]] Int Int deriving (Show, Eq, Ord)

-- Return the grid l with the element at (x, y) replaced by v (the board is hardcoded as 8x8).
set2DIndex :: [[a]] -> Int -> Int -> a -> [[a]]
set2DIndex l x y v =
   [l !! c | c <- [0..x-1]] ++
   [[(l !! x) !! c | c <- [0..y-1]] ++
      [v] ++
      [(l !! x) !! c | c <- [y+1..7]]] ++
   [l !! c | c <- [x+1..7]]

-- Mark each board's current knight position as visited.
setVisited :: [Board] -> [Board]
setVisited boards =
   [Board (set2DIndex visited posx posy True) posx posy | Board visited posx posy <- boards]

-- All boards reachable in one knight move that stay on the board and land on an unvisited square.
getSuccessors :: Board -> [Board]
getSuccessors (Board visited posx posy) =
   setVisited $
   filter (\(Board visited x y) -> not ((visited !! x) !! y)) $
   filter (\(Board visited x y) -> x >= 0 && x < 8 && y >= 0 && y < 8) $
   [Board visited (x + posx) (y + posy) | x <- [-1, 1], y <- [-2, 2]] ++
   [Board visited (x + posx) (y + posy) | y <- [-1, 1], x <- [-2, 2]]

-- The tour is complete once every square has been visited.
isWin :: Board -> Bool
isWin (Board visited _ _) = foldl1 (&&) $ foldl1 (++) visited

-- Receives a queue/stack of board histories and a set of processed boards, returns the sequence of boards that get the solution
findTour :: [[Board]] -> Data.Set.Set Board -> [Board]
findTour queue processed
   | isWin board = last queue
   | Data.Set.member board processed = findTour (init queue) processed
   | otherwise = findTour ((init queue) ++ (map ((++) (last queue)) [[a] | a <- getSuccessors board])) (Data.Set.insert board processed)
   where board = last $ last queue

main = putStrLn $ show $ findTour [[Board [[False | x <- [1..8]] | y <- [1..8]] 0 0]] Data.Set.empty

Sorry about the sparseness of the comments - I never really intended this to be shown to a large audience. Hopefully the code is not too unreadable.

So I hit the big proverbial "RUN" button, and watched the Activity Monitor as GHC started eating all my CPU and memory. Then, the CPU graph fell off and became stable at around 10%, with around 70% of that being system time. In addition, the system froze up, last.fm stopped playing music, and the mouse would respond only sporadically. Clearly, the system was swapping heavily.

While I was watching my computer slowly toast itself alive, I took a step back and thought about the problem I was trying to solve. I thought about the other problems that I had solved with this same standard algorithm. SuDoKu ran quickly (about 0.1 seconds if I remember correctly) because the search tree could be pruned so easily. A naive interpretation of the data structure would be that there are 9x9 cells, and each one could have any of the numbers 1..9 in it, so that makes for 9^81 possible boards. However, most of those are not valid (e.g. any subtree rooted at a board whose first row starts with 9, 9 is invalid), so entire giant subtrees are taken out of the search path. This makes for a well-pruned search tree that can be traversed fairly quickly.

The knight's tour problem, however, does not lend itself very well to pruning. Sure, the knight cannot move to a spot that it has already moved to, but the probability of that happening is fairly low for most of the calculation. In fact, given the state representation that I chose (an 8x8 grid of booleans representing whether the knight has been at that location or not), not a single state is invalid. Every combination of those boolean values represents a valid board, whereas a SuDoKu board has many invalid states. My algorithm is eventually going to have to check each of the 2^64 states that the board could be in, not to mention that two states can be different even if the boards are the same but the knight is in a different position. In fact, because there is no search pruning, my approach is no better than simply running through each of these possibilities one by one. Given that there are roughly 10^19 possibilities (2^64 is about 1.8 x 10^19), clearly this is not going to work.

As soon as I realized this, the Control-C came crashing down. Granted, the system took about 5 minutes to respond to the keypresses, but after everything that had been shunted to the disk was read back into memory, everything appears to be in working order now. Then, I ate some spaghetti.

During the meal, I thought to myself that this was not the first time this phenomenon had occurred. There have been at least 2 or 3 problems (mostly Project Euler problems, as these are designed to exploit exactly this weakness) where I have immediately launched into coding up what I thought was a solution without thinking about its complexity. In particular, even a linear-time algorithm can be slow if N is larger than about 10^9.

The underlying problem is that I do not have a very intuitive knowledge of how big 'big' is. I see a computationally-expensive problem, and think "Oh, that's more than I can do, but a computer can also do more than I can do, so surely a computer can do that task!" It takes a special kind of programmer to not just launch into a problem, but to take a step back and realize the complexity of the problem, as well as other implicit problems such as data overrun problems and the like. I heard a story that Donald Knuth, in his day, would want to write a program, but instead of sitting down at a terminal, he would think about the problem for weeks in a row. Finally, when he was ready, he would sit down at his desk with paper and pencil, and write out his program from start to finish, in one swath. This means that before he wrote code, he thought about all the details of the algorithm, the complexity, the formulation, everything. Now, clearly this isn't incremental improvement (which is almost always a good thing), but I want to take a lesson from Knuth. I want to know what I'm writing before I write it, not as or after I write it.

Note: My Project Euler profile can be found here.