Saturday, August 02, 2008

Resolving Hash Collisions

Two main techniques for resolving hash collisions:

[1] Direct Chaining: When a hashed location is already occupied, the new key and its data are put into a linked list (chain) pointed to from that hash table slot. If there are a large number of collisions, chaining can degrade to linear access time to reach the key we want. Variants of this technique use a balanced search tree instead of a linked list, thereby reducing the search time to O(log n). (A minimal sketch of chaining appears right after this list.)

[2] Open Addressing: Instead of using a dynamic data structure (thereby avoiding extra memory allocation calls and being more cache friendly), this technique uses the other slots in the hash table (presumably a cache-friendly array) to resolve collisions.
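To make the chaining idea concrete, here is a minimal sketch of a chained hash table. The class name, the int-to-string key/value types and the fixed bucket count are illustrative assumptions, not the interface of any particular library:

    // Minimal sketch of direct chaining: each bucket owns a linked list
    // of (key, value) pairs that hash to it.
    #include <cstddef>
    #include <list>
    #include <string>
    #include <utility>
    #include <vector>

    class ChainedMap {
        std::vector<std::list<std::pair<int, std::string>>> buckets_;

    public:
        explicit ChainedMap(std::size_t num_buckets) : buckets_(num_buckets) {}

        void insert(int key, const std::string& value) {
            auto& chain = buckets_[bucket_of(key)];
            for (auto& kv : chain)               // overwrite if the key is already present
                if (kv.first == key) { kv.second = value; return; }
            chain.emplace_back(key, value);      // otherwise append to the chain
        }

        const std::string* find(int key) const {
            const auto& chain = buckets_[bucket_of(key)];
            for (const auto& kv : chain)         // walk the chain: cost is O(chain length)
                if (kv.first == key) return &kv.second;
            return nullptr;
        }

    private:
        std::size_t bucket_of(int key) const {
            return static_cast<std::size_t>(key) % buckets_.size();
        }
    };

A lookup walks only the chain of the key's bucket, so its cost is proportional to that chain's length; swapping the std::list for a balanced search tree is what brings the worst case down to O(log n), as noted in [1].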

There are various open addressing schemes for resolving hash collisions, which differ in how well they reduce the probability of further collisions (and thus how close searches in the hash table stay to constant time) and in cache friendliness (spatial locality). Some common techniques (sketched in code after this list) are:

(1) Double Hashing: It applies two hash functions to resolve collisions. To search for key k, first apply h1(k); if that slot holds a different key, i.e. Key(entry(h1(k))) != k, then on every subsequent probe (say the ith) try the entry given by: h(i,k) = (h1(k) + i*h2(k)) (mod m), where m is the size of the hash table.

(2) Linear Probing: It uses a single hash function and resolves collisions using consecutive slots in the hash table. Given that we have had i-1 collisions, on the ith probe (searching for a free entry during insertion, or for the key's satellite data while querying the table), it examines the entry given by: h(i,k) = (h(k) + c*i) (mod m), where m is the size of the table and c is a constant.

(3) Quadratic Probing: It again uses a single hash function, but when a collision occurs, unlike linear probing, it uses a quadratic function of the number of collisions seen so far to find the next slot. Given that we have had i-1 collisions, on the ith probe (searching for a free entry during insertion, or for the key's satellite data while querying the table), it examines the entry given by: h(i,k) = (h(k) + c1*i + c2*i*i) (mod m), where c1 and c2 are constants and m is the size of the hash table.
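As a minimal sketch, the three probe sequences above can be written as plain functions from (key, probe number i) to a table index. The particular h1 and h2 below and the constants c = 1, c1 = c2 = 1 are arbitrary illustrative choices, not a recommendation:

    #include <cstddef>

    std::size_t h1(std::size_t k, std::size_t m) { return k % m; }
    std::size_t h2(std::size_t k, std::size_t m) { return 1 + (k % (m - 1)); } // must never be 0

    // Double hashing: h(i,k) = (h1(k) + i*h2(k)) mod m
    std::size_t double_hash_probe(std::size_t k, std::size_t i, std::size_t m) {
        return (h1(k, m) + i * h2(k, m)) % m;
    }

    // Linear probing: h(i,k) = (h(k) + c*i) mod m, with c = 1 here
    std::size_t linear_probe(std::size_t k, std::size_t i, std::size_t m) {
        return (h1(k, m) + i) % m;
    }

    // Quadratic probing: h(i,k) = (h(k) + c1*i + c2*i*i) mod m, with c1 = c2 = 1 here
    std::size_t quadratic_probe(std::size_t k, std::size_t i, std::size_t m) {
        return (h1(k, m) + i + i * i) % m;
    }

Note that h2(k) must never evaluate to 0 (otherwise double hashing keeps probing the same slot), and picking h2(k) relatively prime to m guarantees the probe sequence eventually visits every slot.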

Double hashing spreads data more uniformly around the hash table (depending, of course, on the choice of h2(k)) than linear probing, which on a collision simply uses the immediately following entry (consecutive locations are used for resolving collisions) to hold the key and its satellite data. This means that linear probing is likely to cause more collisions: keys that genuinely hash to a particular slot may get pushed several slots away by a long run of entries from keys that collided earlier, a phenomenon called clustering. But if the first hash function is really good (in terms of spreading data uniformly at random across the table) and the load factor is sufficiently small, then linear probing is likely to be more cache friendly than double hashing. Quadratic probing spreads data (i.e., all the keys that collided) more uniformly throughout the table than linear probing, although linear probing is the most cache friendly of the three collision resolution techniques (note, however, that this may be offset by the increased likelihood of collisions). Double hashing may also be computationally more expensive, since it requires computing two hash functions.

Open addressing (double hashing in particular, why?) generally poses problems for deletion, since a deleted slot leaves behind an unwanted marker (a tombstone) which can only be reclaimed by a subsequent insert. Imagine a situation where there are a lot of deletes in the hash table: searching for an entry may no longer be constant time, since a lookup may have to step past many tombstoned slots before finding the right entry. In contrast, with direct chaining the deleted entries do not consume any space in the table per se (only the chain, which is a linked list/tree, shrinks; being a dynamic data structure, it is more space efficient).
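A rough sketch of tombstone-based deletion in an open-addressed table is shown below. Linear probing is used only to keep the probe loop short, and the slot states and names are assumptions for illustration, not taken from any particular implementation:

    #include <cstddef>
    #include <vector>

    enum class SlotState { Empty, Occupied, Deleted };   // Deleted == tombstone

    struct Slot {
        SlotState state = SlotState::Empty;
        int key = 0;
    };

    class OpenTable {
        std::vector<Slot> slots_;

        std::size_t probe(int key, std::size_t i) const {   // linear probing, for brevity
            return (static_cast<std::size_t>(key) + i) % slots_.size();
        }

    public:
        explicit OpenTable(std::size_t m) : slots_(m) {}

        bool erase(int key) {
            for (std::size_t i = 0; i < slots_.size(); ++i) {
                Slot& s = slots_[probe(key, i)];
                if (s.state == SlotState::Empty) return false;        // a real hole: key absent
                if (s.state == SlotState::Occupied && s.key == key) {
                    s.state = SlotState::Deleted;                      // leave a tombstone behind
                    return true;
                }
            }
            return false;
        }

        bool contains(int key) const {
            for (std::size_t i = 0; i < slots_.size(); ++i) {
                const Slot& s = slots_[probe(key, i)];
                if (s.state == SlotState::Empty) return false;         // stop at a real hole
                if (s.state == SlotState::Occupied && s.key == key) return true;
                // Deleted slots are skipped: the search must continue past tombstones.
            }
            return false;
        }
    };

The important detail is that contains() steps over Deleted slots instead of stopping at them; that is exactly why a table with many tombstones loses its near-constant lookup time until it is rehashed.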

Question: What technique is commonly used to resolve hash collisions in many (known) C++ projects ?

* LLVM (a powerful compiler system for C, C++ and many other languages) uses a quadratically probed hash table in implementing StringMap (a map from C++ strings to objects) and DenseMap (a map usually used to hold pairs of small values, e.g. mapping a pointer to a pointer, which is space efficient) as described here

* SGI's hash_map (distributed on Linux as __gnu_cxx::hash_map) uses chaining to resolve hash collisions (with a linked list?)

* Google Sparse Hash (a couple of dense and sparse hash maps/sets from Google) uses quadratic probing (with triangular numbers) as described here; a sketch of triangular-number probing follows this list
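For reference, the triangular-number variant of quadratic probing mentioned for Google Sparse Hash can be sketched as a single probe function; the function name and signature here are illustrative:

    #include <cstddef>

    // i-th probe offset is the i-th triangular number: 0, 1, 3, 6, 10, ...
    std::size_t triangular_probe(std::size_t hash, std::size_t i, std::size_t m) {
        return (hash + i * (i + 1) / 2) % m;
    }

When m is a power of two, the triangular offsets visit every slot exactly once before repeating, which is one reason this variant of quadratic probing is popular.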


2 Comments:

At 8:41 PM , Blogger bvk-chaitanya said...

Nice write-up, Prakash. I will keep it as a reference, seriously :-)

I think deletion could be a serious problem for double hashing because you need access to the keys of the other elements in the collision chain, not just the deleted element.

For example, say four keys k1, k2, k3, k4 collided into the same bucket. Now a delete for k1 needs to either move k2, k3, k4 into their proper places, or we should mark k1's bucket as empty. For linear probing and quadratic probing, finding the positions of the k2, k3, k4 elements is very easy, some f(c,i), whereas for double hashing their positions are f(k2,i), f(k3,i), f(k4,i). Most probably you won't have access to k2, k3, k4 when you are deleting k1.

Does it sound reasonable?

 
At 9:48 AM , Blogger Prakash said...

Yeah Chaitu.. you are right... instead of trying to move items (which I guess is really inefficient), usually a special marker called "deleted" is used (to distinguish it from "empty", so that a search for a valid item further along the probe sequence does not terminate early and fail). However, this also has its own set of problems: too many deletes can result in an under-utilized table which needs to be garbage collected (all entries rehashed). In general, deletion seems to be a pain in all open addressing schemes because of these issues. But open addressing schemes are used mostly in compilers, LLVM being a case in point, where deletion is usually rare (one example is their use in symbol tables).

I found most of this information here: http://www.cs.pitt.edu/~kirk/cs1501/notes/hash.html

Thanks :)

 
