Sunday, August 17, 2008

OpenMP for parallelism

Using OpenMP is one way to exploit multi-core processors via shared-memory parallelism (as opposed to the Message Passing Interface, which is better suited to distributed memory spread across the nodes of a tightly coupled cluster, where each node is an individual uni- or multi-processor). OpenMP is usually used to exploit loop-level parallelism: threads are spawned just before a time-consuming loop, each thread operates on different iterations of the loop, and all the threads then join at the end of the loop (this is the usual case; there are ways to allow completed threads to continue). This is in contrast to pthreads, which is usually suited to expressing parallelism at a much higher level such as a function or a whole program (i.e., with pthreads a process can spawn off multiple threads at the start of execution, since the starting point of each pthread corresponds to a function entry point). Some features of OpenMP:

- Parallelism is expressed using compiler preprocessor directives inserted before the structured blocks of code (usually loops) that are to be parallelized. In particular, OpenMP uses the #pragma directive (short for "pragmatic", originally used to expose compiler- and platform-specific features) to specify various aspects of the parallelism: the number of threads, insertion of barriers, variables to be privatized by each thread, synchronization methods, various kinds of reductions (sum etc.), and so on (a minimal sketch follows the list of consequences below). There are many consequences of this:

Firstly, to get performance out of the parallelism, the compiler has to recognize these OpenMP directives. However, if the compiler does not support the OpenMP-specific directives, the program will still compile and run as a single thread (as opposed to pthreads, which would need the program to be strewn with #ifdef directives, which depending on how one looks at it may be worse than or similar to the #pragmas of OpenMP, maintenance-wise). That way, sequential equivalence is preserved.

Secondly, when the number of threads is left unspecified, the compiler/runtime is free to choose the best number of threads for a given architecture. This way, the goal of incremental/scalable parallelism is also achieved to a certain extent (scaling a program to newer architectures with more cores without changing much of the source code).

Thirdly, the compiler is free to experiment with various loop parallelization techniques such as DOALL (individual iterations of a loop executed on separate cores) or other forms of pipelined parallelism like DSWP. This is a good reason for many optimizing compiler vendors to openly adopt OpenMP as a parallelism technology for future many-cores (see here). Support for auto-parallelization techniques with respect to OpenMP is (was?) planned in gcc.
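As promised above, here is a minimal sketch of what an OpenMP loop directive looks like (the loop and numbers are purely illustrative). Compiled without OpenMP support (e.g. without -fopenmp in gcc), the #pragma is simply ignored and the loop runs sequentially, which is exactly the sequential-equivalence property mentioned above:

```cpp
#include <cstdio>

int main() {
    const int n = 1000000;
    double sum = 0.0;
    // Threads are spawned here; the iterations are divided among them;
    // each thread accumulates a private copy of "sum" which the reduction
    // clause combines; all threads join at the end of the loop.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; ++i)
        sum += 1.0 / i;
    std::printf("harmonic(%d) = %f\n", n, sum);
    return 0;
}
```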

- OpenMP mainly supports fork-join parallelism: a single master thread spawns off many worker threads, all the worker threads join at a barrier once they complete, and the master thread continues. It is also possible to make this worksharing dynamic using the schedule clause in the OpenMP directive (i.e., prefer runtime-directed dynamic scheduling for mapping the iteration space of a loop onto the underlying threads, rather than a static partitioning of iterations to threads; a sketch follows). This permits experimenting with different kinds of scheduling algorithms in the OpenMP runtime library (that way, the various work-stealing algorithms used by systems like Cilk could be implemented, although Cilk is geared more toward task parallelism as opposed to the loop/data parallelism exposed in OpenMP programs).
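A sketch of the schedule clause just mentioned, assuming iterations have uneven cost (the expensive() function below is a made-up stand-in for such work):

```cpp
#include <cmath>
#include <cstdio>

// Stand-in for work whose cost varies from iteration to iteration.
double expensive(int i) {
    double x = 0.0;
    for (int j = 0; j < i % 1000; ++j)  // uneven amount of work per i
        x += std::sin(j);
    return x;
}

int main() {
    const int n = 100000;
    double total = 0.0;
    // schedule(dynamic, 64): idle threads grab the next 64-iteration chunk
    // at runtime, instead of a fixed (static) partition decided up front.
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += expensive(i);
    std::printf("total = %f\n", total);
    return 0;
}
```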

- A limitation of OpenMP loops is that the code to be parallelized needs to form a structured block, and no branches out of the block are permitted (only program exits are allowed). Not sure if these restrictions still exist.

- It is quite possible to use pthreads, OpenMP and MPI together in a single program to express parallelism at various levels (and also to provide a nightmarish test case for all the static analysis tools devised specifically to find concurrency-related bugs :-)). MPI can be used at the highest level to express cluster-level parallelism, pthreads to express task/thread-level parallelism, and OpenMP to provide loop-level parallelism (a sketch of the MPI + OpenMP combination follows). It may well turn out that the overhead of the parallel constructs (creating threads, communication and synchronization) far outweighs the benefits, and one experiences the not too uncommon case of a slowdown instead of a speedup.
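A hedged sketch of the MPI + OpenMP hybrid just described (pthreads omitted for brevity), assuming an MPI library and an OpenMP-capable compiler (e.g. mpicxx -fopenmp); the computation itself is illustrative:

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    int provided;
    // MPI_THREAD_FUNNELED: only the master thread of each rank makes MPI calls.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0;
    // OpenMP parallelizes the loop within each MPI rank (e.g. one rank per node).
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; ++i)
        local_sum += 1.0 / (i + 1);

    double global_sum = 0.0;
    // MPI combines the per-rank partial sums across the cluster.
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("global sum = %f\n", global_sum);
    MPI_Finalize();
    return 0;
}
```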

[Ref] Patterns for Parallel Programming

Saturday, August 02, 2008

Resolving Hash Collisions

Two main techniques for resolving hash collisions:

[1] Direct Chaining: When a hashed location is already occupied, the new key and its data are put into a linked list (chain) pointed to from that slot of the hash table. If there are a large number of collisions, chaining can degenerate into a linear-time search to reach the key we want. Variants of this technique use a balanced search tree instead of a linked list, reducing the search time to O(log n). A minimal sketch of chaining follows.
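A minimal sketch of direct chaining, assuming string keys and int values; the class name ChainedMap is made up for illustration:

```cpp
#include <cstddef>
#include <forward_list>
#include <functional>
#include <string>
#include <utility>
#include <vector>

class ChainedMap {
    // Each bucket holds a singly linked list of colliding (key, value) pairs.
    std::vector<std::forward_list<std::pair<std::string, int>>> buckets_;
public:
    explicit ChainedMap(std::size_t n) : buckets_(n) {}

    void insert(const std::string &key, int value) {
        auto &chain = buckets_[std::hash<std::string>{}(key) % buckets_.size()];
        for (auto &kv : chain)             // overwrite if the key is already present
            if (kv.first == key) { kv.second = value; return; }
        chain.push_front({key, value});    // otherwise prepend to the chain
    }

    int *find(const std::string &key) {    // O(chain length) in the worst case
        auto &chain = buckets_[std::hash<std::string>{}(key) % buckets_.size()];
        for (auto &kv : chain)
            if (kv.first == key) return &kv.second;
        return nullptr;
    }
};
```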

[2] Open Addressing: Instead of using a dynamic data structure (thereby reducing memory allocation calls and being more cache friendly), this technique uses the other slots of the hash table itself (presumably a cache-friendly array) to resolve collisions.

There are various ways to resolve collisions using open addressing; they differ in how well they reduce the probability of further collisions (and thereby keep searches in the hash table close to constant time) and in their cache friendliness (spatial locality). Some common techniques are listed below (a sketch of all three probe sequences follows the list):

(1) Double Hashing: This applies two hash functions to resolve collisions. To search for key k, first apply h1(k); if there is a collision, i.e. Key(entry(h1(k))) != k, a second hash function is used in conjunction with the first: on the ith subsequent probe, try the entry given by h(i,k) = (h1(k) + i*h2(k)) (mod m), where m is the size of the hash table.

(2) Linear Probing: This uses a single hash function and, to resolve collisions, uses consecutive slots in the hash table. Given that we have had i-1 collisions, on the ith probe (searching for a free entry during insertion, or for the key's satellite data while querying the table), it examines the entry given by h(i,k) = (h(k) + c*i) (mod m), where m is the size of the table and c is a constant.

(3) Quadratic Probing: This again uses a single hash function but, unlike linear probing, every time a collision occurs it uses a quadratic function of the number of collisions so far to find the next slot. Given that we have had i-1 collisions, on the ith probe (searching for a free entry during insertion, or for the key's satellite data while querying the table), it examines the entry given by h(i,k) = (h(k) + c1*i + c2*i*i) (mod m), where c1 and c2 are constants and m is the size of the hash table.
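Here is the sketch of the three probe sequences referred to above, written as plain index generators; the function names are illustrative and the constants are fixed to simple values:

```cpp
#include <cstddef>

// Linear probing: h(i,k) = (h(k) + c*i) mod m, with c = 1 here.
std::size_t probe_linear(std::size_t h, std::size_t i, std::size_t m) {
    return (h + i) % m;
}

// Quadratic probing: h(i,k) = (h(k) + c1*i + c2*i*i) mod m, with c1 = c2 = 1.
std::size_t probe_quadratic(std::size_t h, std::size_t i, std::size_t m) {
    return (h + i + i * i) % m;
}

// Double hashing: h(i,k) = (h1(k) + i*h2(k)) mod m. h2(k) must be nonzero
// (and ideally coprime with m) so the probe sequence can visit every slot.
std::size_t probe_double(std::size_t h1, std::size_t h2,
                         std::size_t i, std::size_t m) {
    return (h1 + i * h2) % m;
}
```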

Double hashing spreads data more uniformly around the hash table (depending, of course, on the choice of h2(k)) than linear probing, which, when it encounters a collision, fills the key and its satellite data into the immediately following entry (consecutive locations are used to resolve collisions). This means that linear probing is likely to cause more collisions, since keys that genuinely hash to a particular entry may get offset by some number of slots due to a long run of entries from keys that collided earlier; this phenomenon is called clustering. But if the hash function is really good (in terms of spreading data uniformly at random across the table) and the load factor is sufficiently small, then linear probing is likely to be more cache friendly than double hashing. Quadratic probing spreads the colliding keys more uniformly throughout the table than linear probing, although linear probing is the most cache friendly of the three collision resolution techniques (note, however, that this may be offset by the increased likelihood of collisions). Double hashing may also be computationally more expensive, since it requires computing two hash functions.

Open addressing (double hashing in particular, why?) generally poses problems with deletion, since a deleted slot cannot simply be emptied; it must be marked with a tombstone, an unwanted spot that can be reclaimed only by a subsequent insert. Imagine a situation where there are a lot of deletes on the hash table: a search for an entry may no longer be constant time, since there are a lot of tombstoned slots one may have to walk through before finding the right entry. In contrast, with direct chaining the deleted entries do not consume any space in the table per se (only the chain, a linked list or tree, shrinks, which as a dynamic data structure is more space efficient). A sketch of tombstone-based deletion follows.
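The sketch of tombstone-based deletion just mentioned, assuming an int-keyed open-addressed table with linear probing (the types and names here are made up for illustration):

```cpp
#include <cstddef>
#include <vector>

enum class State { EMPTY, OCCUPIED, TOMBSTONE };

struct Slot { int key = 0; State state = State::EMPTY; };

struct ProbedSet {
    std::vector<Slot> slots;
    explicit ProbedSet(std::size_t m) : slots(m) {}

    // Marking a deleted slot EMPTY would break any probe chain that ran
    // through it; TOMBSTONE keeps the chain walkable but wastes the slot
    // until a later insert reuses it.
    bool erase(int key) {
        const std::size_t m = slots.size();
        const std::size_t h = static_cast<std::size_t>(key) % m;
        for (std::size_t i = 0; i < m; ++i) {
            Slot &s = slots[(h + i) % m];           // linear probe sequence
            if (s.state == State::EMPTY) return false;  // chain ends: not present
            if (s.state == State::OCCUPIED && s.key == key) {
                s.state = State::TOMBSTONE;          // keep the chain intact
                return true;
            }
        }
        return false;
    }
};
```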

Question: What technique is commonly used to resolve hash collisions in many well-known C++ projects?

* LLVM (a powerful compiler system for C, C++ and many other languages) uses a quadratically probed hash table to implement StringMap (a map from C++ strings to objects) and DenseMap (a map usually used to hold pairs of small values, e.g., pointer to pointer, which is space efficient), as described here

* SGI's hash_map (distributed in Linux as __gnu_cxx::hash_map) uses chaining (a linked list?) to resolve hash collisions

* Google Sparse Hash (a couple of dense and sparse hash maps/sets from Google) uses quadratic probing (via triangular numbers), as described here
