Using OpenMP is one way to exploit multi-cores through shared-memory parallelism (as against the Message Passing Interface, which is better suited to distributed memory across the different nodes of a tightly coupled cluster, where each node is an individual uni/multi-processor). OpenMP is usually used to exploit loop-level parallelism: threads are spawned just before a time-consuming loop, each thread operates on a different subset of the loop's iterations, and all the threads then join at the end of the loop (this is the usual case; the nowait clause allows completed threads to continue). This is in contrast to pthreads, which is usually suited to expressing parallelism at a much higher level, like a function or a whole program (i.e., with pthreads a process typically spawns multiple threads at the start of execution, since the starting point of each pthread corresponds to a function entry point). A minimal example is sketched below.
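As a rough sketch (the array-scaling loop here is just a made-up example), a typical OpenMP loop in C looks like this:

```c
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];

    /* The master thread forks a team of threads here; each thread
       works on a different chunk of the iteration space. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    /* Implicit barrier: all threads have joined before execution continues. */
    printf("a[0] = %f\n", a[0]);
    return 0;
}
```

Built with something like gcc -fopenmp, the loop runs in parallel; without the flag, the same source compiles as an ordinary sequential program. Some features of OpenMP: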
- Parallelism is expressed using compiler preprocessor directives inserted before the structured blocks of code (usually loops) that are to be parallelized. In particular, it uses the #pragma directive (short for "pragmatic", which was initially used to support compiler- and platform-specific features) to specify various aspects of the parallelism: the number of threads, insertion of barriers, variables to be privatized by each thread, synchronization methods, various kinds of reductions (sum, etc.), and so on. This has several consequences:
Firstly, to get any performance from the parallelism, the compiler has to recognize these OpenMP directives. However, if the compiler does not support the OpenMP-specific directives, the program will still compile and run as a single thread (as against pthreads, where the program would need to be strewn with #ifdef directives, which, depending on how one looks at it, may be worse than or similar to OpenMP's #pragmas, maintenance-wise). That way, sequential equivalence is preserved; the sketch after these three points shows the usual #ifdef _OPENMP idiom for the few calls that do depend on the OpenMP runtime.
Secondly, when the number of threads is left unspecified, the implementation is free to choose the best number of threads for a given architecture. This way, the goal of incremental/scalable parallelism is also achieved to a certain extent (scaling a program to newer architectures with more cores without changing much of the source code).
Thirdly, the compiler is free to experiment with various loop-parallelization techniques such as DOALL (individual iterations of a loop executed on separate cores) or other forms of pipelined parallelism like DSWP. This is a good reason for many optimizing-compiler vendors to openly adopt OpenMP as a parallelism technology for future many-cores (see here). Support for auto-parallelization targeting OpenMP is (was?) planned in gcc.
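To make the first two points concrete, here is a sketch (the dot-product loop is a made-up example) showing a reduction clause, the thread count left to the implementation, and the usual #ifdef _OPENMP guard around the one call that actually depends on the OpenMP runtime:

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>   /* needed only for the runtime API, not for the pragmas */
#endif

#define N 1000000

int main(void)
{
    static double a[N], b[N];
    double dot = 0.0;

    /* Each thread accumulates into its own copy of 'dot' and the partial
       sums are combined at the end; the thread count is left unspecified. */
    #pragma omp parallel for reduction(+:dot)
    for (int i = 0; i < N; i++)
        dot += a[i] * b[i];

#ifdef _OPENMP
    printf("up to %d threads available\n", omp_get_max_threads());
#endif
    printf("dot = %f\n", dot);
    return 0;
}
```

A compiler without OpenMP support simply ignores the pragma and compiles this as an ordinary sequential loop, which is exactly the sequential-equivalence point above.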
- OpenMP mainly supports fork-join parallelism, where a single master thread spawns off many worker threads, all the worker threads join at a barrier once they complete, and the master thread continues. It is also possible to make this worksharing dynamic using the schedule clause in the OpenMP directive (i.e., prefer a more runtime-directed dynamic scheduling for mapping the iteration space of a loop onto the underlying threads rather than a static partitioning of iterations to threads; see the sketch below). This permits experiments with different kinds of scheduling algorithms in the OpenMP runtime library (that way, various work-stealing algorithms used by systems like Cilk can be implemented, although Cilk is geared more toward task parallelism as against the loop/data parallelism exposed in OpenMP programs).
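For example (the loop body is hypothetical, chosen only because its iterations take very uneven amounts of time), the schedule clause is all that changes between a static and a dynamic mapping of iterations to threads:

```c
#include <math.h>

#define N 100000

/* Iteration costs vary wildly, so a static partitioning would leave some
   threads idle while others are still working; dynamic scheduling hands
   out chunks of 64 iterations to whichever thread is free. */
void smooth(double *x)
{
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < N; i++) {
        int iters = (i % 1000) + 1;   /* uneven per-iteration work */
        for (int k = 0; k < iters; k++)
            x[i] = sqrt(x[i] * x[i] + 1.0);
    }
}
```

Changing schedule(dynamic, 64) to schedule(static) (or dropping the clause) reverts to a fixed partitioning of the iteration space among the threads.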
- A limitation of OpenMP loops is that the code to be parallelized must be a structured block, and no branches out of the block are permitted (only program exits are allowed). I am not sure whether these restrictions still exist.
- It is quite possible to use pthreads, OpenMP and MPI together in a single program to express parallelism at various levels (and also to provide a nightmarish test case for all the static-analysis tools devised specifically to find concurrency-related bugs :-)). MPI can be used at the highest level to express cluster-level parallelism, pthreads to express task/thread-level parallelism, and OpenMP to provide loop-level parallelism; a minimal hybrid sketch follows below. It may well turn out that the overhead of the parallel constructs (creating threads, communication and synchronization) far outweighs the benefits, and one experiences the not-too-uncommon case of a slowdown instead of a speedup.
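As a rough sketch of the hybrid MPI+OpenMP style (the vector sum is a made-up example; only the MPI and OpenMP calls themselves are standard), each MPI process handles one node of the cluster and uses OpenMP threads for the loop inside it:

```c
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    int provided, rank;
    double local = 0.0, global = 0.0;
    static double a[N];

    /* Request thread support sufficient for OpenMP regions that do not
       themselves make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Loop-level (OpenMP) parallelism within each MPI process. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++)
        local += a[i];

    /* Cluster-level (MPI) parallelism: combine the per-process sums. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```

Whether this buys any speedup depends entirely on whether the per-iteration work is large enough to hide the thread-creation and message-passing overheads mentioned above.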
[Ref] Patterns for Parallel Programming
Nice overview of OpenMP capabilities, but your reference to Cilk is archaic. The paper you refer to was the first of many, many papers on the Cilk project. A better reference is this paper, which won the 2008 PLDI award for Best Paper of 1998. Other papers on the Cilk project are available here.
Incidentally, the MIT Cilk project was spun off in 2006 to Cilk Arts, a start-up which provides Cilk++ extensions for C++, including a "cilk_for" construct for loop parallelism. We've also produced a free e-book on multicore software which addresses some of the same issues you do.
Nice to see the multicore-software space heating up! Thanks for your blog post!