One of the questions that I like to interview candidates for my team with involves hashing strings. I was talking with someone about it, and they insisted that computing a hash function takes constant time, (O(1)). I then asked how good the has function could really be if it didn't look at all the characters. He insisted that you could look at all the characters, but still have the function be constant time.
I decided that this would be a good opportunity to see how it's really done. The new C++ standard, C++11, includes std::hash, which does the computation that we're interested in. The implementation that FreeBSD's libstdc++ is also open source, so I can just go look at the source.
The symbol is defined in string.h, so let's take a look at that file. We can immediately see that hash<basic_string<_CharT, _Traits, _Allocator> >::operator() delegates to __do_string_hash() which delegates to __murmur2_or_cityhash(). That function is defined in include/memory, and has a really interesting trick to it, using templates:
// We use murmur2 when size_t is 32 bits, and cityhash64 when size_t
// is 64 bits. This is because cityhash64 uses 64bit x 64bit
// multiplication, which can be very slow on 32-bit systems.
template <class _Size, size_t = sizeof(_Size)*__CHAR_BIT__>
struct __murmur2_or_cityhash;
As you can see, we're creating a new templated value called size_t. We can then use template specialization to define various implementations of the function. In particular, we have these two implementations:
template <class _Size>
struct __murmur2_or_cityhash<_Size, 32>
{
_Size operator()(const void* __key, _Size __len);
};
// murmur2
template <class _Size>
_Size
__murmur2_or_cityhash<_Size, 32>::operator()(const void* __key, _Size __len)
We've also got a cityhash64 implementation.
template <class _Size>
struct __murmur2_or_cityhash<_Size, 64>
{
_Size operator()(const void* __key, _Size __len);
...
// cityhash64
template <class _Size>
_Size
__murmur2_or_cityhash<_Size, 64>::operator()(const void* __key, _Size __len)
Cool! Alright, now let's read those functions to try to see what the running time is. First, let's take murmor. The main loop in this function looks like this:
for (; __len >= 4; __data += 4, __len -= 4)
In addition, __len isn't modified in the loop. Well, that makes it simple; this is clearly O(n). What about cityhash? Its main loop looks like this:
do {
...
__len -= 64;
} while (__len != 0);
__len isn't modified elsewhere within the loop. Well, looks like we've got an answer! std::hash is O(n).
the __murmur2_or_cityhash function is actually in include/utility not include/memory
ReplyDelete