## Friday, January 15, 2010

I just came from a particularly challenging interview at Google’s headquarters in Mountain View, California. I forget the name of the interviewer I had for interview # 7 (Edit: I now remember, but I won't recognize him by name), but it was a particularly interesting interview. Before I begin talking about it, I should probably say that I did extremely poorly on this interview, but I did end up learning quite a bit from the interviewer (who, I’m assuming, took pity on me and ended up telling me the answers to his two questions). I’m going to recreate the two questions as best I can here.

Question #1: Suppose you have 10 identical machines, each of which has:
• 4GB memory
• 500GB hard drive, with 400GB full of (Apache) log files.
How do you find the top 1 million most commonly visited pages in the log files, with a total computation time of under 24 hours?

This is roughly how the interview went:

Me: Well, because the data is unordered, each machine has to read all of the data on its disk, sequentially. It can then make a hash map where the key is a hash of the URL, and the value is the frequency count. As each computer reads a line, it extracts the URL, hashes it, and looks up the hash in the hash map. If the hash is in the hash map, it extracts the existing associated value, increments it, and stores the value back in the map under the same hash. Otherwise, it stores the hash in the hash map with a frequency of 1.

Interviewer: Okay, how would you then combine the results into one single top 1 million list?

Me: Well, you could combine the data hierarchically, where machines 0 and 1 combine their data, machines 2 and 3 combine their data, and the two combinations are then combined, etc, until one master list is obtained. However, combining the entire hash map would be too big - the last combination would be combining every single entry into a gigantic list. I guess you would then only take the top, say, 2 million entries from each hash map and combine those. That doesn't guarantee correctness, but it's probably good enough.

Interviewer: It's not good enough. Let's say a particular website is #2,000,001 in each of the log files. It is feasible that that website could be the most popular website, if, say, spots #1 - #2,000,000 are a tie. How could you guarantee correctness?

Me: Hrmph. Perhaps you could fit the whole hash map on one machine?

Him: You blow. That wouldn't fit in memory, causing that machine to swap heavily, slowing it down by 2 orders of magnitude, causing you to miss your deadline, and exploding the world.

Me: Yeah, I do blow. Please hire me!

Him: But think about this! It wouldn't fit in 4GB, but sitting in front of you, you have a total of 40GB of memory!

Me: Aha! You could make a distributed hash table. You could partition your hash space into 10 different spaces, and map each space to a different PC.

Him: Something like PC ID = hash(URL) mod 10.

Me: Yeah. Thanks for the hint.

Him: You bet. So how would you combine the results now?

Me: Well, because all the counts for each individual URL are in one place, you just have to combine the hash maps hierarchically, as I described before, except you only need to keep the top 1 million entries in the combination. This should go fairly quickly.

Him: Okay, fair enough. (Looks disappointed) Now, say the nice folks over at Microsoft are able to reverse engineer your algorithm completely. They then hit our website repeatedly for crafted URLs, so that significantly more than 1/n of the input URLs in the log files map to the same machine. Pretend that, now, 100% of the URLs in the log files have the quality where hash(URL) mod 10 == c, where c is a constant. What does this mean?

Me: Well, this is going to overload the hash table on one of the particular computers. It'll make the hash table swap out to disk, slowing down the computer, making us lose our deadline and ending in the world blowing up.

Him: So how would you solve this problem?

Me: You could take a hash of a hash of a hash...

Him: Pretend that 100% of your algorithm is known by the hacker.

Me: You could have a dynamic partitioning system, where, if the hash map on one machine becomes too big, it offloads part of its map to the next machine...

Him: That would be too complicated to program. Also, it would result in you missing your deadline. If you do it that way, you have to have one of the PCs dedicated to keeping track of what keys go where. All the other PCs have to hit this PC for each URL it encounters, introducing a bottleneck in the system, slowing everything down. Also, it would be slow to have two machines handling the mapping of key to machine, because they have to have consistent data, which introduces lots of cross traffic, and the whole thing is basically a giant clusterfuck.

Me: So, you can't change your hashing algorithm, and you can't artificially move your data around .... Seems like your stuck.

Him: WOW. You REALLY blow. Even harder than I thought!

--We stare at each other for a minute--

Him: Have you ever heard of universal hashing?

Me: Nope. By the way, did you know that I blow?

Him: I had gathered. Do you know what a salt is?

Me: No.

Him: A salt is a constant value you append to the end to the input of your hash function. If you choose a salt at random, the folks at Microsoft won't be able to know it (because it's data, not your algorithm). Now, they don't know where the hashes end up. You can then change your salt daily so they don't have time to guess your salt.

Me: Gee, that sure is smart.

Him: You blow.

Me: Sorry.

Question #2: Google has a web crawler. This web crawler periodically crawls the same sites repeatedly to keep track of changes. However, people get mad if Google crawls their website too often (for whatever reason). So, the crawler has to remember the last time it crawled a website, and can then decide whether or not it should crawl again. This means we need some sort of database system that maps URLs to a timestamp, and, for purposes of this interview, a 1-byte enumeration of a site's status. This will be something like {online, 404, 403, timeout, etc}. The crawler is going to want to query the system for a given URL, and we need to return the timestamp and the site status. When the crawler does end up crawling a site, it's going to update the information for a given URL. Also, the crawler may insert data for a new URL. Assume that this system is going to hold information for 200 billion websites, and we allow a one in a million failure rate (giving incorrect information is a failure). How would you design this system? And, how many machines would you need?

Me: Well the most straightforward solution would be to use a traditional database system, like MySQL or something.

Him: You suck. You didn't even ask me the rate of queries! This is the most obvious thing you should have asked me. The query rate is 50,000 queries / second.

Me: So that's too much for a traditional database system.

Him: Yep.

Me: Well, you could use the same system that we described in the last question, where you use a distributed hash map, where the keys are the url, and the value is a tuple of the timestamp and the status. You could use the salt thingy to make sure the data is distributed on the machines evenly.

Him: Okay. Sounds good. How many machines would you need, given the memory constraints of saying each computer has 16GB of usable memory?

Me: Well it's probably more than 1 machine.

Him: You really have no clue, do you?

Me: Yeah, I blow.

Him: Well, think about it this way. How many bits do you need in your hash?

Me: Well, the space of the hash should be big enough to provide a 1 in a million failure rate, so 2^n = 10^6

Him: That would be the space of a hash if you wanted to hash a single element.

Me: (Making this up as I go along) So I guess I multiply the right side by 20 billion?

Him: What would that make, then?

Me: Well,
2^n = 10^6 * 200 * 10^9
2^n = 200 * 10^15
2^n = 2 * 10^17
2^n = 2 * (2^10)^(17/3)
2^n = 2 * (2^10)^6
2^n = 2 * (2^60)
2^n = 2^61
n = 61 bits, rounding up to 64.

Him: 64 is nice and round. So how much space do you need for the timestamp?

Me: Well, Unix timestamps are 64-bits, so that sounds like a good place to start. You could compress that by subtracting the value for the first timestamp recorded. Also, if you don't need such high resolution, you could divide by some constant.

Him: That's fine. We'll just make it 56 bits. We've already said that the status is 1 byte, or 8 bits. This makes a total of how much data per row?

Me: 64 + 56 + 8 = 128

Him: Not quite. Hash maps have internal data structures to handle collisions.

Me: (I actually didn't understand this, as the problem statement said that hash collisions are okay if they're infrequent (1 in a million), but I went with it anyway) Okay, so give each hash a linked list. Therefore, the tuple of (timestamp, status) now becomes a triple of (timestamp, status, nextptr). Give each pointer 32 bits (Thanks, Matt, for pointing out that if the machines have 16GB memory (as described later), this should be 64 bits for a pointer). This brings the total up to 128 + 32 = 160 bits = 20 bytes.

Him: Sounds good so far. How many machines would you need then?

Me: Well, each row has 20 bytes, and there are 200 billion rows, so that's 4TB of memory. If each machine has 16GB of memory, that's 4000 GB / (16 GB / Machine) = 250 machines.

Him: Sounds pretty good. Now, if we distribute the 50,000 queries / second, do you think our 250 machines will be able to handle the load?

Me: Well, that means that each machine handles (50,000 queries / second) / (250 machines) = 200 queries / second / machine. That sounds pretty reasonable.

Him: I agree. Sounds good.

Me: Fantastic.

Edit: I have been offered (and accepted!) this job, even though this interview went atrociously. If you have had a similar terrible interview, don't give up hope!

Here are the pictures i've taken from Google's Campus.