Thursday, January 28, 2010

More Assorted Closures

As I learn more and more about functional programming, I keep coming across the idea of closures. Closures deal with the concept of free variables: variables used in a function that are not defined by its arguments or its local bindings. The idea of a closure is that the binding of a function's free variables stays with that function. It can be thought of as saving the context a function was defined in along with the function itself.

Let's say you have a function with a free variable. Let's call it multbya, and let's define it to take an argument x and return x * a, where a is a free variable.

All languages will recognize a as a free variable; what they do with that information differs. Many languages will simply freak out, saying "Oh God Oh God, you referenced a variable that I don't know anything about!" In some situations, this makes sense: statically typed languages (e.g. Scala freaks out) guarantee at compile time that the multiplication operator will work, and the language cannot guarantee that multiplication is defined for some symbol which may or may not exist at runtime. Other languages (e.g. JavaScript doesn't freak out) may simply say "Well, I don't know what a is now, but fortunately I don't have to run the function now. I'm going to hope that a exists at runtime, and if it does, I'll use it. Also: hopefully the multiplication operator will be defined for a."
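Python sits in the lenient camp too: the definition below compiles fine, and the free variable only matters when the function is actually called. A quick sketch:

```python
def multbya(x):
    # 'a' is free here; Python doesn't try to resolve it until the call
    return x * a

try:
    multbya(6)          # fails: 'a' doesn't exist yet
except NameError as e:
    print("lookup failed:", e)

a = 3
print(multbya(6))       # 18: the global 'a' now exists
```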

Note that this concept only really applies in a REPL, as opposed to a file. In a file, nothing is really declared "before" or "after" anything else, because all the definitions are there in the file. The compiler can make two passes over the file: one to collect symbols, and one to resolve them. That way, a symbol can be used earlier in the file than it is defined. This is what Haskell appears to do.

This general idea, however, is a little different from what happens in a REPL. In Haskell and Prolog, the "program" is really what is referred to as a "knowledge base." The program doesn't flow: nothing changes. You define your program as a tree of transformations which turns the input into the output, and the "knowledge base" is just the list of those transformations. There are no variables - only arguments.

Okay, back to business. Let's say, for the sake of argument, that we define the variable a ahead of time, setting it to the integer 3. Now, if you define multbya after defining a, all languages should be happy, and calling the function should work just great.

Now, let's try to mask the variable a. Let's make another function, testit, which takes one argument, also named a. Inside this function, let's call multbya.

Now, in a language without closures, you can think of the body of the function as simply being plopped down in place of the function call (this is essentially dynamic scoping). In a non-closured language, the a inside multbya refers to whatever a is in scope at the call site, which here is the argument to testit. Therefore, the output of this function is going to be the argument to testit times the argument that testit called multbya with.

In a language with closures, the bindings of free variables are passed around with the function. In technical language, the function is closed over its free variables. This means that, even though the a in scope at the call site is the argument to testit, multbya remembers its own binding for a: it was defined with a free variable a, and that free variable refers to the global a. This works anywhere, not just with global variables: functions remember which variables they are closed over, so a caller can't mask the callee's variables and change the function's behavior.

Not all languages use closures the same way, however. In our little example, let's change the value of the global a and then re-run testit. (This only works in languages that let you change, or at least re-bind, values.)

Now, a language could deal with this in one of a few ways. Either a language lets you change values, or it doesn't. And a closured language will save one of three things in its closures: values, references to values, or symbolic references along with a scope identifier.

If a language saves values in a closure, then changing the global value of a should have no effect on the output of testit.

If a language saves references in a closure, and if the language lets you change values, then the value returned from testit should change accordingly.

If references are saved, but the language simply re-binds the global name a to a new value, then the output should not be changed, because the reference still refers to the old value. Keeping this reference around makes sure that the garbage collector doesn't collect the value.

A way to get around this is to save a symbolic reference and a scope identifier. This is like saving the idea of "I want a variable a, and I want the one that's in the global namespace." Thus, no matter how references are handled under the hood, the output value is updated to use that variable.
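Python, for what it's worth, does both: a free variable that resolves to a global is looked up by name in the module's namespace at call time (the symbolic-reference idea), while a free variable from an enclosing function is captured as a cell, a reference to the variable itself. A sketch of the second case (make_multbya is my own made-up name):

```python
def make_multbya():
    a = 3
    def multbya(x):
        return x * a        # closed over the variable a, not the value 3
    def set_a(value):
        nonlocal a          # rebinds the same cell that multbya sees
        a = value
    return multbya, set_a

multbya, set_a = make_multbya()
print(multbya(6))  # 18
set_a(4)
print(multbya(6))  # 24: the closure observes the rebinding
```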

All the languages that I have tried, except Haskell, have used either references or the last idea, because they updated their output when I changed a. Haskell's output did not change, but the reason isn't caching: in GHCi, let a = 4 creates a new binding that shadows the old one, and multbya is still closed over the original a - which, Haskell being purely functional, can never change.

Here is the Scala code I used:
hani:~ litherum$ scala
Welcome to Scala version (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_17).
Type in expressions to have them evaluated.
Type :help for more information.

scala> var a = 3
a: Int = 3

scala> def multbya(x : Int) : Int = x * a
multbya: (Int)Int

scala> def testit(a : Int) : Int = multbya(6)
testit: (Int)Int

scala> testit(5)
res0: Int = 18

scala> a = 4
a: Int = 4

scala> testit(5)
res2: Int = 24

Here is the Clojure program I used:

hani:~ litherum$ clj
Clojure 1.1.0
user=> (def a 3)
#'user/a
user=> (defn multbya [x] (* x a))
#'user/multbya
user=> (defn testit [a] (multbya 6))
#'user/testit
user=> (testit 5)
18
user=> (def a 4)
#'user/a
user=> (testit 5)
24

Here is the Common Lisp program I used:

hani:~ litherum$ clisp
i i i i i i i ooooo o ooooooo ooooo ooooo
I I I I I I I 8 8 8 8 8 o 8 8
I \ `+' / I 8 8 8 8 8 8
\ `-+-' / 8 8 8 ooooo 8oooo
`-__|__-' 8 8 8 8 8
| 8 o 8 8 o 8 8
------+------ ooooo 8oooooo ooo8ooo ooooo 8

Welcome to GNU CLISP 2.48 (2009-07-28)

Copyright (c) Bruno Haible, Michael Stoll 1992, 1993
Copyright (c) Bruno Haible, Marcus Daniels 1994-1997
Copyright (c) Bruno Haible, Pierpaolo Bernardi, Sam Steingold 1998
Copyright (c) Bruno Haible, Sam Steingold 1999-2000
Copyright (c) Sam Steingold, Bruno Haible 2001-2009

Type :h and hit Enter for context help.

[1]> (setf a 3)
3
[2]> (defun multbya (x) (* a x))
MULTBYA
[3]> (multbya 4)
12
[4]> (defun testit (a) (multbya 6))
TESTIT
[5]> (testit 5)
18
[6]> (setf a 4)
4
[7]> (testit 5)
24

Here is the JavaScript program I used:

hani:rhino1_7R2 litherum$ java -jar js.jar
Rhino 1.7 release 2 2009 03 22
js> var a = 3
js> function multbya(x) {
> return x * a;
> }
js> function testit(a) {
> return multbya(6);
> }
js> testit(5)
18
js> a = 4
4
js> testit(5)
24

Here is another JavaScript program I used:

hani:v8-read-only litherum$ ./shell
V8 version
> var a = 3
> function multbya(x) { return x * a; }
> function testit(a) { return multbya(6); }
> testit(5)
18
> a = 4
4
> testit(5)
24

Here is the Python program I used:

hani:~ litherum$ python3.1
Python 3.1.1 (r311:74480, Sep 25 2009, 09:54:21)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> a = 3
>>> def multbya(x):
...     return a * x
...
>>> def testit(a):
...     return multbya(6)
...
>>> testit(5)
18
>>> a = 4
>>> testit(5)
24

Here is the Haskell program I used:

hani:~ litherum$ ghci
GHCi, version 6.10.4: :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer ... linking ... done.
Loading package base ... linking ... done.
Prelude> let a = 3
Prelude> let multbya x = x * a
Prelude> let testit a = multbya 6
Prelude> testit 5
18
Prelude> let a = 4
Prelude> testit 5
18

Friday, January 15, 2010

Google Interview #7

I just came from a particularly challenging interview at Google’s headquarters in Mountain View, California. I forget the name of the interviewer I had for interview #7 (Edit: I now remember, but I won't identify him by name), but it was a particularly interesting interview. Before I begin talking about it, I should probably say that I did extremely poorly on this interview, but I did end up learning quite a bit from the interviewer (who, I’m assuming, took pity on me and ended up telling me the answers to his two questions). I’m going to recreate the two questions as best I can here.

Question #1: Suppose you have 10 identical machines, each of which has:
  • 4GB memory
  • 500GB hard drive, with 400GB full of (Apache) log files.
How do you find the top 1 million most commonly visited pages in the log files, with a total computation time of under 24 hours?

This is roughly how the interview went:

Me: Well, because the data is unordered, each machine has to read all of the data on its disk, sequentially. It can then make a hash map where the key is a hash of the URL, and the value is the frequency count. As each computer reads a line, it extracts the URL, hashes it, and looks up the hash in the hash map. If the hash is in the hash map, it extracts the existing associated value, increments it, and stores the value back in the map under the same hash. Otherwise, it stores the hash in the hash map with a frequency of 1.
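What I was describing is, per machine, just a word count over URLs. A rough sketch in Python (the log lines here are made up, and a real Apache log would need sturdier parsing):

```python
from collections import Counter

def count_urls(lines):
    """Tally how often each URL appears in common-log-format-ish lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) > 6:
            counts[fields[6]] += 1   # field 6 is the request path in this format
    return counts

log = [
    '1.2.3.4 - - [28/Jan/2010:10:00:00 -0800] "GET /index.html HTTP/1.1" 200 512',
    '1.2.3.4 - - [28/Jan/2010:10:00:01 -0800] "GET /about.html HTTP/1.1" 200 128',
    '5.6.7.8 - - [28/Jan/2010:10:00:02 -0800] "GET /index.html HTTP/1.1" 200 512',
]
print(count_urls(log).most_common(1))  # [('/index.html', 2)]
```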

Interviewer: Okay, how would you then combine the results into one single top 1 million list?

Me: Well, you could combine the data hierarchically, where machines 0 and 1 combine their data, machines 2 and 3 combine their data, and the two combinations are then combined, etc, until one master list is obtained. However, combining the entire hash map would be too big - the last combination would be combining every single entry into a gigantic list. I guess you would then only take the top, say, 2 million entries from each hash map and combine those. That doesn't guarantee correctness, but it's probably good enough.

Interviewer: It's not good enough. Let's say a particular website is #2,000,001 in each of the log files. It is feasible that that website could be the most popular website, if, say, spots #1 - #2,000,000 are a tie. How could you guarantee correctness?

Me: Hrmph. Perhaps you could fit the whole hash map on one machine?

Him: You blow. That wouldn't fit in memory, causing that machine to swap heavily, slowing it down by 2 orders of magnitude, causing you to miss your deadline, and exploding the world.

Me: Yeah, I do blow. Please hire me!

Him: But think about this! It wouldn't fit in 4GB, but sitting in front of you, you have a total of 40GB of memory!

Me: Aha! You could make a distributed hash table. You could partition your hash space into 10 different spaces, and map each space to a different PC.

Him: Something like PC ID = hash(URL) mod 10.

Me: Yeah. Thanks for the hint.

Him: You bet. So how would you combine the results now?

Me: Well, because all the counts for each individual URL are in one place, you just have to combine the hash maps hierarchically, as I described before, except you only need to keep the top 1 million entries in the combination. This should go fairly quickly.
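With the counts partitioned so that each URL lives on exactly one machine, truncating during the merge is safe. A toy sketch of the combine step (top-2 instead of top-1,000,000, with made-up counts):

```python
from collections import Counter

def merge_top_k(count_maps, k):
    """Combine per-machine counts and keep only the top k entries.
    No URL appears in more than one map, so truncation loses nothing."""
    merged = Counter()
    for counts in count_maps:
        merged.update(counts)
    return merged.most_common(k)

machine0 = Counter({"/a": 900, "/b": 40})
machine1 = Counter({"/c": 500, "/d": 10})
print(merge_top_k([machine0, machine1], k=2))  # [('/a', 900), ('/c', 500)]
```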

Him: Okay, fair enough. (Looks disappointed) Now, say the nice folks over at Microsoft are able to reverse engineer your algorithm completely. They then hit our website repeatedly with crafted URLs, so that significantly more than 1/n of the input URLs in the log files map to the same machine. Pretend that, now, 100% of the URLs in the log files have the property that hash(URL) mod 10 == c, where c is a constant. What does this mean?

Me: Well, this is going to overload the hash table on one of the particular computers. It'll make the hash table swap out to disk, slowing down the computer, making us lose our deadline and ending in the world blowing up.

Him: So how would you solve this problem?

Me: You could take a hash of a hash of a hash...

Him: Pretend that 100% of your algorithm is known by the hacker.

Me: You could have a dynamic partitioning system, where, if the hash map on one machine becomes too big, it offloads part of its map to the next machine...

Him: That would be too complicated to program. Also, it would result in you missing your deadline. If you do it that way, you have to have one of the PCs dedicated to keeping track of what keys go where. All the other PCs have to hit this PC for each URL it encounters, introducing a bottleneck in the system, slowing everything down. Also, it would be slow to have two machines handling the mapping of key to machine, because they have to have consistent data, which introduces lots of cross traffic, and the whole thing is basically a giant clusterfuck.

Me: So, you can't change your hashing algorithm, and you can't artificially move your data around... Seems like you're stuck.

Him: WOW. You REALLY blow. Even harder than I thought!

--We stare at each other for a minute--

Him: Have you ever heard of universal hashing?

Me: Nope. By the way, did you know that I blow?

Him: I had gathered. Do you know what a salt is?

Me: No.

Him: A salt is a constant value you append to the input of your hash function. If you choose a salt at random, the folks at Microsoft won't be able to know it (because it's data, not your algorithm). Now, they don't know where the hashes end up. You can then change your salt daily so they don't have time to guess it.
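In code, the salted partitioning he described looks something like this (sketched with SHA-256 from Python's hashlib; the salt value is made up):

```python
import hashlib

SALT = b"rotate-me-daily"  # chosen at random, changed periodically

def machine_for(url, n_machines=10):
    """Pick a machine by hashing salt + URL. An attacker who knows the
    algorithm but not the salt can't predict where a URL will land."""
    digest = hashlib.sha256(SALT + url.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_machines

print(machine_for("http://example.com/index.html"))
```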

Me: Gee, that sure is smart.

Him: You blow.

Me: Sorry.

Question #2: Google has a web crawler. This web crawler periodically crawls the same sites repeatedly to keep track of changes. However, people get mad if Google crawls their website too often (for whatever reason). So, the crawler has to remember the last time it crawled a website, and can then decide whether or not it should crawl again. This means we need some sort of database system that maps URLs to a timestamp, and, for purposes of this interview, a 1-byte enumeration of a site's status. This will be something like {online, 404, 403, timeout, etc}. The crawler is going to want to query the system for a given URL, and we need to return the timestamp and the site status. When the crawler does end up crawling a site, it's going to update the information for a given URL. Also, the crawler may insert data for a new URL. Assume that this system is going to hold information for 200 billion websites, and we allow a one in a million failure rate (giving incorrect information is a failure). How would you design this system? And, how many machines would you need?

Me: Well the most straightforward solution would be to use a traditional database system, like MySQL or something.

Him: You suck. You didn't even ask me the rate of queries! This is the most obvious thing you should have asked me. The query rate is 50,000 queries / second.

Me: So that's too much for a traditional database system.

Him: Yep.

Me: Well, you could use the same system that we described in the last question: a distributed hash map, where the keys are the URLs, and the value is a tuple of the timestamp and the status. You could use the salt trick to make sure the data is distributed evenly across the machines.

Him: Okay. Sounds good. How many machines would you need, given the memory constraints of saying each computer has 16GB of usable memory?

Me: Well it's probably more than 1 machine.

Him: You really have no clue, do you?

Me: Yeah, I blow.

Him: Well, think about it this way. How many bits do you need in your hash?

Me: Well, the space of the hash should be big enough to provide a 1 in a million failure rate, so 2^n = 10^6

Him: That would be the space of a hash if you wanted to hash a single element.

Me: (Making this up as I go along) So I guess I multiply the right side by 200 billion?

Him: What would that make, then?

Me: Well,
2^n = 10^6 * 200 * 10^9
2^n = 200 * 10^15
2^n = 2 * 10^17
2^n ≈ 2 * (2^10)^(17/3)    (using 10^3 ≈ 2^10, and rounding 17/3 up to 6)
2^n ≈ 2 * 2^60
2^n ≈ 2^61
n ≈ 61 bits, rounding up to 64.

Him: 64 is nice and round. So how much space do you need for the timestamp?

Me: Well, Unix timestamps are 64-bits, so that sounds like a good place to start. You could compress that by subtracting the value for the first timestamp recorded. Also, if you don't need such high resolution, you could divide by some constant.

Him: That's fine. We'll just make it 56 bits. We've already said that the status is 1 byte, or 8 bits. This makes a total of how much data per row?

Me: 64 + 56 + 8 = 128

Him: Not quite. Hash maps have internal data structures to handle collisions.

Me: (I actually didn't understand this, as the problem statement said that hash collisions are okay if they're infrequent (1 in a million), but I went with it anyway) Okay, so give each hash a linked list. Therefore, the tuple of (timestamp, status) now becomes a triple of (timestamp, status, nextptr). Give each pointer 32 bits (Thanks, Matt, for pointing out that if the machines have 16GB memory (as described later), this should be 64 bits for a pointer). This brings the total up to 128 + 32 = 160 bits = 20 bytes.

Him: Sounds good so far. How many machines would you need then?

Me: Well, each row has 20 bytes, and there are 200 billion rows, so that's 4TB of memory. If each machine has 16GB of memory, that's 4000 GB / (16 GB / Machine) = 250 machines.

Him: Sounds pretty good. Now, if we distribute the 50,000 queries / second, do you think our 250 machines will be able to handle the load?

Me: Well, that means that each machine handles (50,000 queries / second) / (250 machines) = 200 queries / second / machine. That sounds pretty reasonable.

Him: I agree. Sounds good.

Me: Fantastic.

Edit: I have been offered (and accepted!) this job, even though this interview went atrociously. If you have had a similar terrible interview, don't give up hope!

Here are the pictures I've taken from Google's campus.

2-Dimensional Perlin Noise Generator, Revisited

The 2-Dimensional Perlin Noise Generator I wrote had a major flaw – it was excessively slow. Most of the computation was due to the function that finds the four bounding keypoints given a particular point to calculate. The function that finds the color for a specific pixel was called in two nested loops. This function needed to find out the four bounding keypoints, so it filtered the list of keypoints a bunch of different ways.

I got to thinking about this and realized that the bounding box hardly changes from one pixel to the next. Because the loop iterates over adjacent pixels, I could keep track of the bounding box between calls, and only check whether it has changed. Better still, once an entire row of pixels has been generated, the sequence of bounding boxes repeats for the next row.

It would be nice to think about the problem backwards and, instead of finding the bounding box that a particular pixel is in, find all the pixels that are in a specific bounding box. That way, I could iterate over the bounding boxes, and just sort the result at the end (to re-order the values into line-by-line order). Iterating over the bounding boxes is easy in two dimensions – split the list of keypoints into a matrix of keypoints, and iterate across the y dimension 2 at a time, and inside that, iterate across the x dimension 2 at a time. If the matrix is implemented as a list of lists, the iteration over y can be done by iterating over the main list, and the iteration over x can be done by iterating over the lists inside the main list. Because this is done iteratively, it can be done quickly. Iterating over the pixels in a block is similarly easy.
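The idea of walking the keypoint matrix two rows and two columns at a time can be sketched like this (in Python rather than Haskell, with a tiny made-up grid):

```python
def blocks(grid):
    """Yield each 2x2 block of adjacent keypoints as
    (top-left, top-right, bottom-left, bottom-right)."""
    for row0, row1 in zip(grid, grid[1:]):   # adjacent pairs of rows
        for i in range(len(row0) - 1):       # adjacent pairs of columns
            yield row0[i], row0[i + 1], row1[i], row1[i + 1]

grid = [[0, 3, 6],
        [1, 4, 7]]
print(list(blocks(grid)))  # [(0, 3, 1, 4), (3, 6, 4, 7)]
```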

Then, I got to thinking about the bicubic interpolation performed at each step. It occurred to me that I was finding a dot product between a vector of coefficients and a Hermite vector. Looking at the code to generate the Hermite vector, I realized that this code only depends on the values of the local bounding box. That is to say, all the values within a single block share a Hermite vector. An exploitation of this fact was easy to calculate given the new structure of the program – a new Hermite vector was calculated inside the iteration over local blocks, and this vector was passed to the interpolation routine for each pixel in the block.

I also noticed that the output images were looking quite blocky, as you would expect from filling in a matrix of blocks of values. One way to somewhat alleviate this problem is to actually calculate derivatives. Previously, I was assuming the derivative at each keypoint was 0, so even if successive keypoints have values of 0, 3, 6, 9, and 12 (a straight line), the output looks blocky: flat at each keypoint and steep in between.

Calculating derivatives was a bit tricky. Three derivatives are necessary for each point – the partial x derivative, the partial y derivative, and the cross derivative. Early on, I decided that the cross derivative would be too difficult to determine, so I left that one 0. Thus, all that was left were the two partial derivatives. The partial x derivative for a point P should be calculated as rise/run for the two points immediately adjacent to P in the x dimension. However, if P is the left-most point in the row, P should be used instead of a point immediately to the left of P (because such a point does not exist). The opposite is true for the right-most point. I created a tricky function to do this.

Finding all the partial x derivatives involves turning the list of keypoints into a matrix, then mapping the iteration across each row of the matrix. Finding the partial y derivatives involves exactly the same thing, except done with the transpose of the matrix. Zipping everything back together is a bit tricky, but it isn’t too bad to make sure everything is done correctly.
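The scheme above amounts to central differences with clamping at the edges, and the y direction really is just the same computation on the transpose. A sketch in Python (assuming unit spacing between keypoints):

```python
def partial_x_row(row):
    """rise/run between each point's neighbors, clamping at the ends."""
    n = len(row)
    out = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        out.append((row[hi] - row[lo]) / (hi - lo))
    return out

def partial_x(grid):
    return [partial_x_row(row) for row in grid]

def partial_y(grid):
    transposed = list(zip(*grid))                 # flip, differentiate, flip back
    cols = [partial_x_row(list(col)) for col in transposed]
    return [list(row) for row in zip(*cols)]

print(partial_x([[0, 3, 6, 9, 12]]))  # [[3.0, 3.0, 3.0, 3.0, 3.0]]
```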

The resulting image is still fairly blocky, but it appears to be a bit better. According to my calculations, introducing the optimizations increases the speed of the program by more than 1000x. This is more than I ever hoped.

However, there is a problem. Generating an 800x600 image with a depth of 4 levels of keypoints takes 10 gigabytes of memory (Good thing I have 12 in my PC!). I am not quite sure what all that memory is for, but I believe that either:
  • Haskell is building a gigantic thunk, and then evaluating it, or
  • Because the program is not tail recursive, it is holding an intermediate value at each of 480,000 levels of recursion (one level for each pixel in the 800x600 image)
I plan to fix this problem next.

Ray Tracer

I have been working for some time on a ray tracer that runs on many different kinds of hardware. As such, I have decided to program it in C, because this seems to be the language that is most widely accessible between devices.

The idea is that a single CPU would generate a list of the primary rays as individual data items. This CPU function would then break up this list into n pieces, where n is the number of ray-tracing strategies. In this example, one strategy would be to use NVIDIA CUDA, and another strategy would be to use the Cell SPEs. Each of these strategy functions would then deal with strategy-specific code. Usually, these functions would inspect the current computer to find, for example, the number of CUDA-enabled devices in the system.

This function would then break up the list of individual data items even further, so that each device has an equal number of data items to process. Usually it would then set up m pthreads, where m is the number of devices of that particular strategy in the system (e.g. the number of CUDA-enabled devices). Each of these m threads then runs a device-specific function on its own section of the list of data items.

This structure allows one to easily take advantage of all the compute devices in a system, because the system is hierarchically broken up into strategies and further into individual devices.

Because I designed this ray tracer to run with NVIDIA CUDA, I had to make some changes to the ray tracing code. First of all, NVIDIA CUDA does not support recursion. Ray tracing is, however, fundamentally a recursive process: the color of a ray is determined by the color of the intersected material and lighting, but also by the colors of the reflected and refracted rays. In order to eliminate recursion, I had to implement a ray dependency tree. Any particular ray depends on both the reflected ray and the refracted ray, as determined by intersecting that ray with the world. Therefore, the complete computation can be viewed as a tree, where each node is a ray, and the children of a particular node are the reflected and refracted rays generated by that ray. As I wanted to make the implementation of this tree as lightweight as possible, I chose to use an array, where the root of the tree lives at index 1, the left child of node n lives at index 2*n, and the right child lives at index 2*n+1.

Generating this tree involves running through the array, starting at index 1. For each element n, intersect that ray with the world to find an intersection point, then use the intersection point to generate the reflected and refracted rays, which go into indices 2*n and 2*n+1, respectively. Using this method, the entire dependency tree can be populated with a simple, non-recursive for loop. The loop runs across the elements of the array in increasing order, but must stop halfway through: since the body writes to index 2*n+1, it has to stop before 2*n+1 would run past the end of the array.

Then, actually evaluating the computation involves running through the array backwards, converting each ray to a color. This requires another array of the same size, filled with output colors. As with any ray tracer, a function exists to find the immediate color of a particular ray by intersecting it with the world and running the shading calculation on the intersection point. The second half of the dependency tree array holds the rays at the maximum recursion depth, so for those rays the immediate color is the final color. Then the first half of the array is processed, going backwards: for any particular ray n, its immediate color is calculated and then mixed with the color in slot 2*n (the color of the reflected ray) and the color in slot 2*n+1 (the color of the refracted ray). Once this loop has completed, the final color is the color at index 1 of the color array.
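Here is a sketch of those two loops in Python, with numbers standing in for rays and colors, and made-up reflect/refract/shade/mix functions standing in for the real intersection and shading code:

```python
DEPTH = 3               # maximum "recursion" depth
SIZE = 2 ** DEPTH       # 1-based heap layout: valid indices are 1..SIZE-1

# stand-ins for the real ray-tracing primitives
reflect = lambda ray: ray * 2
refract = lambda ray: ray * 2 + 1
shade   = lambda ray: float(ray)               # immediate color of a ray
mix     = lambda c, refl, refr: c + 0.5 * (refl + refr)

def trace(primary):
    # pass 1: populate the dependency tree, stopping before 2*i+1 overflows
    rays = [None] * SIZE
    rays[1] = primary
    for i in range(1, SIZE // 2):
        rays[2 * i] = reflect(rays[i])
        rays[2 * i + 1] = refract(rays[i])
    # pass 2: walk backwards, converting rays to colors
    colors = [0.0] * SIZE
    for i in range(SIZE - 1, 0, -1):
        c = shade(rays[i])
        if 2 * i + 1 < SIZE:                   # interior node: blend in children
            c = mix(c, colors[2 * i], colors[2 * i + 1])
        colors[i] = c
    return colors[1]

print(trace(1))  # 9.0
```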

In this model, special attention must be delivered to the rays that do not intersect the world. Because of the nature of the color conversion, each element in the ray dependency tree must be initialized. Therefore, if a ray does not intersect the world, both of that ray’s children can simply be copies of that ray itself. This guarantees that the entire subtree for this ray is populated with rays that do not intersect the world. When converting the rays to colors, if a particular ray does not intersect the world, a default background color must be used.

Note that this is not a particularly optimized ray tracer – spatial partitioning can be used to greatly speed up computation.

Unfortunately, I cannot get the program to run on the graphics card in my server (an 8800 GT) or the graphics card in my laptop (a 9400M). Calculating the lighting values for a black-and-white image works, and calculating an unshaded color works, but the one line that multiplies the target color with the shading color makes the entire scene go black. My only guess is that these lower-end graphics cards just can't handle the large kernel required to trace rays. That said, my two GTX 295 cards run the kernel just fine. Using all 4 GPUs in the GTX 295s is about 7 times faster than using all 8 (pseudo-)cores in the Core i7.