## Tuesday, February 23, 2010

I have been implementing Bead Sort on a variety of different hardware in order to get a feel for programming on each of these platforms. Until recently, I hadn't given much thought to the algorithm itself. After thinking about it for only a couple of minutes, I realized that I had been going about the whole thing very, very incorrectly.

The way I was doing this before was to line up each of the beads in a giant 2-D array in one stage, make the beads fall in a second stage, and then manipulate the resulting array into telling me the sorted numbers in a third stage. I realized that I could combine the first two stages, speed up the entire program, and cut down on the amount of code that needs to exist.

Instead of setting up a 2-D array of booleans ahead of time and then making them all fall, I could simply make the beads fall as I put them on the columns. This decreases the code size drastically. The algorithm for the first two stages is now simply:
1. Assign an index number to each column, from 0 up. (The first pass needs one column per possible bead position, so m columns, where m is the largest value to sort.)
2. Give each column a single int counter variable, which refers to the number of beads stacked up on that pole.
3. Have each column go through the numbers to be sorted and, if the number is greater than the column's index, increment the column's count. Note: this can be done in a very tricky, branch-free way by saying `count += num > index`.
4. Have each column write its count to an output array, at the index of its own column. Call this output array the "counts" array.

That's it! Much, much simpler code. What's more, it decreases the parallel running time of these stages from O(m*n) to O(n), where n is the number of numbers to sort and m is the maximum value among them.

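The counting pass can be sketched in a few lines of Haskell, with each list element standing in for one parallel column:

```haskell
-- One counting pass: column i ends up holding the number of input
-- values strictly greater than i.
countPass :: Int -> [Int] -> [Int]
countPass cols xs = [ length (filter (> i) xs) | i <- [0 .. cols - 1] ]
```

For the input [3, 1, 2] with three columns, this yields [3, 2, 1]: three values exceed 0, two exceed 1, and one exceeds 2.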
Now you have that nice little triangle of sorted beads. How should one go about reading the numbers off the columns? Well, here's the "aha" moment. The first number in the sorted array is the number of columns with beads in the first position of the column. How many columns have this property? Well, all the columns with counts greater than 0. Okay, now, the second number in the sorted array is the number of columns with beads in the second position of the column. How many columns have this property? Well, all the columns with counts greater than 1. The next number in the sorted array is the number of columns with counts greater than 2, and so on.

Now, this is the exact same logic used in the first phase of the bead sort algorithm! Each column goes through the counts array, and if the value in the counts array is greater than the column index, that column's count gets incremented. The only difference is that the roles of m and n are now switched. On this second pass, each column goes through the counts array, which is of size m (the maximum value to sort), and the new output array is of size n (it has to be, because n is the number of numbers to sort, and the new output array holds the sorted numbers).

The really interesting part of this algorithm is that you can simply run the parallel part twice, with different arguments. This leads to very small, clean, readable code. It also turns the parallel runtime of the algorithm from O(m*n) into O(max(m, n)).
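As a sketch of that idea (this is not the CUDA version linked below, just the same two-pass structure written sequentially in Haskell, for non-negative integers):

```haskell
-- One counting pass: column i ends up holding how many of the input
-- values are strictly greater than i.
pass :: Int -> [Int] -> [Int]
pass cols xs = [ length (filter (> i) xs) | i <- [0 .. cols - 1] ]

-- Bead sort: run the pass with m columns over the n inputs, then with
-- n columns over the m counts. The result is the inputs in descending
-- order.
beadSort :: [Int] -> [Int]
beadSort xs = pass n (pass m xs)
  where
    n = length xs          -- the number of numbers to sort
    m = maximum (0 : xs)   -- the largest value to sort
```

For example, beadSort [3, 1, 2] evaluates to [3, 2, 1].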

My cleanest implementation of this code is probably written in CUDA, and can be found here.

### Memory Leaks in my 2-D Perlin Noise Generator

As noted previously, I have been in the process of optimizing my 2-dimensional Perlin noise generator. As of my last post, the generator had been optimized for time but not for space. This time, I wanted to see if I could optimize it for space.

My first crack at the problem involved blindly assuming that Haskell was creating a gigantic thunk for the entire computation and then executing it. The logic was that a gigantic thunk would take up a gigantic amount of memory, so forcing Haskell to evaluate the thunk early would get rid of it and solve all my problems. So, as dictated by Real World Haskell, I went about using the seq function to force Haskell to evaluate my thunks as they were being created.

This didn't actually solve my problem; the memory footprint didn't go down at all. After re-reading the relevant chapter in Real World Haskell, I found a paragraph that I had overlooked before. It states that seq only evaluates its argument down to the first constructor. This means that seq will evaluate 3 + 4 and get 7, but will not evaluate the elements of a list. That is because the (:) operator is technically a constructor (the list type is defined, roughly, as `data List a = Cons a (List a) | Nil`, with (:) playing the role of Cons). Therefore, calling seq on a list will not actually do anything: it evaluates down to the first constructor, which is the first (:) of the list itself. So the strategy of using seq won't work at all.
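This behavior is easy to demonstrate (a minimal sketch; the deepseq package's `deepseq` is the usual way to force an entire structure, element by element):

```haskell
main :: IO ()
main = do
  -- A list whose elements would crash the program if evaluated.
  let xs = [undefined, undefined] :: [Int]
  -- seq forces only the outermost (:) constructor, so this is fine.
  xs `seq` putStrLn "seq stopped at the first (:)"
  -- Forcing an element, by contrast, would crash:
  -- head xs `seq` return ()
```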

Then I took a step back. I had assumed that the problem was caused by a giant thunk, but what if it wasn't? I had no evidence to support either claim. Wouldn't it be nice if Haskell had a profiling tool to tell me exactly where the memory leak is? I tweeted about the apparent lack of such a tool, and very quickly got a reply from a Twitter user named donsbot. He pointed me to the relevant chapter in Real World Haskell on profiling. As it turns out, Haskell has some very advanced tools for system profiling!

Later, I did some searching on donsbot and found that it belongs to someone who goes by the handle dons, or Don S. A little later in the search, I found that this guy is a co-author of my very own Haskell book, Real World Haskell! I was incredibly geeked out to learn this: I blindly tweeted, and the co-author of the book I've been using responded! I immediately started following him on Twitter, which has led me to check out some of the really interesting work he's doing, currently on the Vector libraries. He is also on Stack Overflow, where he has some very thoughtful and inspiring answers.

Anyway, after running some of these profiling tools, I found that almost all my memory was being put into lists of doubles. After looking through my code, I immediately found the main problem.

A Perlin noise generator works by splitting the image into multiple levels. Each level is further split into a grid of blocks, with keypoints pre-generated at the corners of each block. In my previous implementation, I iterated through each level, generating every pixel in that level. I then sorted the pixels to put them in the same order for each level, and used zip to combine all the pixels for a given position into the final value. The problem was the sorting step: in order to sort a list, the entire list has to be memory resident. Therefore, all the pixels in all the levels had to be memory resident at once.

The best way I came up with to solve this problem is to think of Haskell's execution model as a streaming framework. If I write the program correctly, Haskell can generate the value of a specific pixel, write it out to the image, let the garbage collector collect that pixel, and move on to the next one. This requires each pixel to be independent of every other pixel. The way to get around sorting the pixels is to always generate them in the same order.

For example, in the old way, I would generate the pixels block by block. This would mean that, for two different levels with two different block structures, the pixels would be generated in these orders:

For a blocksize of 2x2, the pixels would be generated like this:
    11 12 15 16 9 10 13 14 3 4 7 8 1 2 5 6

For a blocksize of 4x4, the pixels would be generated like this:
    13 14 15 16 9 10 11 12 5 6 7 8 1 2 3 4

I had to alter this so that the pixels would be generated in the same order regardless of level. If I did this, Haskell could generate the same pixel in each level, apply the operator I used in zip, get the final pixel, and write it out to the image, without ever having to generate any more pixels. This turns the entire program into what can be thought of as a stream: it streams each pixel out, one by one.
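With the orders aligned, the combining step is just a lazy zip across the levels (a sketch; `(+)` stands in for whatever operator the real generator applies to corresponding pixels):

```haskell
-- Combine per-level pixel streams element by element. Because zipWith
-- is lazy, consuming the result one pixel at a time only ever demands
-- one pixel from each level, so no level's full pixel list is ever
-- memory resident.
combineLevels :: [[Double]] -> [Double]
combineLevels = foldr1 (zipWith (+))
```

For example, combineLevels [[1, 2], [10, 20]] is [11, 22].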

Modifying the program to do this was fairly straightforward. Loop through the x values; if you come across a new block, generate a new Hermite vector, otherwise reuse the existing one. If you come to the end of a line, go to the next line. If the next line is in a new block, update the current line and block keypoints, then generate the Hermite vector for the new block. This function is even tail-recursive.
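A skeleton of that loop might look like this; the names and the per-block setup are hypothetical stand-ins for the real generator's code, which does actual Hermite interpolation:

```haskell
-- Walk one scan line in x order, recomputing the per-block data only
-- when x crosses into a new block; otherwise the previous value is
-- reused. The cons cell is produced before the recursive call, so the
-- list can be consumed as a stream.
genLine :: Int -> Int -> [Double]
genLine width blockSize = go 0 (hermite 0)
  where
    go x h
      | x >= width = []
      | otherwise  = sample h' x : go (x + 1) h'
      where
        -- fresh vector at a block boundary, reuse otherwise
        h' | x `mod` blockSize == 0 = hermite (x `div` blockSize)
           | otherwise              = h
    hermite b  = fromIntegral b      -- hypothetical per-block setup
    sample h x = h + fromIntegral x  -- hypothetical pixel evaluation
```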

The new program certainly has a smaller memory footprint. It is still not quite where I would like it to be, however. The current memory graph shows that it doesn't quite run in linear space. If I had programmed it correctly, it should run in linear space, so clearly I still have at least one unanticipated bug. It appears that a fair amount of memory is being held in the list data structure itself, which means it is being held in the 'next' pointers of the linked lists. Perhaps I could alleviate this problem by using Vectors?

Also, I bit the bullet and made the damn thing use LibGD directly.

Here is the current memory graph for the new program:

My code can be found here.