Sunday, August 17, 2008

OpenMP for parallelism

Using OpenMP is one way to exploit multi-cores through shared-memory parallelism (as against the Message Passing Interface, which is more suitable for distributed memory across the nodes of a tightly coupled cluster, where each node is an individual uni/multi-processor). OpenMP is usually used to exploit loop-level parallelism: threads are spawned just before a time-consuming loop, each thread operates on different iterations of the loop, and all the threads join at the end of the loop (this is the usual case; there are ways to allow completed threads to continue). This is in contrast to pthreads, which is usually suited to expressing parallelism at a much higher level such as a function or a whole program (i.e., with pthreads a process spawns off multiple threads at the start of execution, since the starting point of each pthread corresponds to a function entry point). Some features of OpenMP:

- Parallelism is expressed using compiler directives inserted before the structured blocks of code which are to be parallelized (usually loops). In particular, it uses the #pragma directive (short for "pragmatic", originally used to support compiler- and platform-specific features) to specify various aspects of parallelism: the number of threads, insertion of barriers, variables to be privatized by each thread, synchronization methods, various kinds of reductions (sum etc.) and so on (a minimal sketch appears after this list). There are many consequences of this:

Firstly, to get performance from the parallelism, the compiler has to recognize these OpenMP directives. However, if the compiler has no OpenMP support, the program will still compile and run as a single thread (as against pthreads, which would need the program to be strewn with #ifdef directives; depending on how one looks at it, maintenance-wise this may be worse than or similar to the #pragmas of OpenMP). That way, sequential equivalence is preserved.

Secondly, when the number of threads is left unspecified, the compiler/runtime is free to choose the best number of threads for a given architecture. This way, the goal of incremental/scalable parallelism is also achieved to a certain extent (scaling a program to newer architectures with more cores without changing much of the source code).

Thirdly, the compiler is free to experiment with various loop parallelization techniques like DOALL (individual iterations of a loop executed on separate cores) or other forms of pipelined parallelism like DSWP. This is a good reason for many optimizing compiler vendors to openly adopt OpenMP as a parallelism technology for future many-cores (see here). Support for auto-parallelization techniques targeting OpenMP is (was?) planned in gcc.

- OpenMP mainly supports fork-join parallelism, where a single master thread spawns off many worker threads, all the worker threads join at a barrier once they complete, and the master thread continues. Also, it is possible to make this worksharing dynamic using the schedule clause in the OpenMP directive (i.e., prefer runtime-directed dynamic scheduling for mapping the iteration space of a loop onto the underlying threads rather than a static partitioning of iterations to threads). This permits experiments with different kinds of scheduling algorithms in the OpenMP runtime library (that way the work-stealing algorithms used by systems like Cilk can be implemented, although Cilk is more geared toward task parallelism as against the loop/data parallelism exposed in OpenMP programs).

- A limitation of OpenMP loops is that the code must form structured blocks to be amenable to parallelization, and no branches out of the block are permitted (only program exits). Not sure if these restrictions still exist.

- It is quite possible to use pthreads, OpenMP and MPI together in a single program to express parallelism at various levels (and also to provide a nightmarish test case for all static analysis tools devised specifically to find concurrency-related bugs :-)). MPI can be used at the highest level to express cluster-level parallelism, pthreads to express task/thread-level parallelism, while OpenMP provides loop-level parallelism. It may well turn out that the overhead of using the parallel constructs (creating threads, communication and synchronization) far outweighs the benefits, and one experiences the not-too-uncommon case of a slowdown instead of a speedup.
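A minimal sketch of this loop-level style (the array, the reduction and the schedule clause are illustrative choices, not from any particular benchmark). With an OpenMP-aware compiler it is built with something like g++ -fopenmp; without that flag the pragma is simply ignored and the loop runs sequentially:

#include <cstdio>
#include <vector>

int main() {
    std::vector<double> a(1000000, 1.0);
    double sum = 0.0;

    // Fork-join: worker threads are spawned here, each thread gets chunks of the
    // iteration space (handed out at runtime because of schedule(dynamic)), the
    // partial sums are combined by the reduction clause, and all threads join at
    // the implicit barrier at the end of the loop.
    #pragma omp parallel for reduction(+:sum) schedule(dynamic, 1024)
    for (long i = 0; i < (long)a.size(); ++i) {
        sum += a[i];
    }

    std::printf("sum = %f\n", sum);
    return 0;
}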

[Ref] Patterns for Parallel Programming

Saturday, August 02, 2008

Resolving Hash Collisions

Two main techniques for resolving hash collisions:

[1] Direct Chaining: When a hashed location is already occupied, the new key and its data are put into a linked list (chain) pointed to from the hash table slot. If there are a large number of collisions, chaining can degrade to linear access time to reach the key we want. Variants of this technique use a balanced search tree instead of a linked list, thereby reducing the search time to O(log n).

[2] Open Addressing: Instead of using a dynamic data structure (thereby reducing memory allocation calls and being more cache friendly), this technique uses the other slots of the hash table itself (presumably a cache-friendly array) to resolve collisions.

There are various open-addressing schemes for resolving collisions, which differ in how well they reduce the probability of further collisions (and thereby keep searches close to constant time) and in how cache friendly they are (spatial locality). Some common techniques are listed below (a small probing sketch follows the list):

(1) Double Hashing: It uses two hash functions to resolve collisions. To search for key k, first apply h1(k); in case Key(entry(h1(k))) != k, then on the ith subsequent probe try the entry given by: h(i,k) = (h1(k) + i*h2(k)) (mod m), where m is the size of the hash table.

(2) Linear Probing: It uses a single hash function and, to resolve collisions, uses consecutive slots in the hash table. Given that we have had i-1 collisions, on the ith probe (searching for a free slot during insertion, or for the key's satellite data when querying the table), it examines the entry given by: h(i,k) = (h(k) + c*i) (mod m), where m is the size of the table and c is a constant (usually 1).

(3) Quadratic Probing: It again uses a single hash function, but every time a collision occurs, unlike linear probing, it uses a quadratic function of the number of collisions so far to find the next slot. Given that we have had i-1 collisions, on the ith probe it examines the entry given by: h(i,k) = (h(k) + c1*i + c2*i*i) (mod m), where c1 and c2 are constants and m is the size of the hash table.
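To make the probe sequences concrete, here is a small illustrative sketch of open addressing with linear probing (the table layout, hash function and tombstone handling are simplifications of my own, not taken from any particular library). Swapping the probe step (h + i) for (h + c1*i + c2*i*i) or (h1 + i*h2) gives quadratic probing and double hashing respectively:

#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// A toy open-addressing table mapping std::string -> int using linear probing.
class ProbingTable {
    enum class State { Empty, Full, Tombstone };
    struct Slot { State state = State::Empty; std::string key; int value = 0; };
    std::vector<Slot> slots_;

    std::size_t home(const std::string& key) const {
        return std::hash<std::string>{}(key) % slots_.size();
    }

public:
    explicit ProbingTable(std::size_t capacity) : slots_(capacity) {}

    bool insert(const std::string& key, int value) {
        std::size_t h = home(key);
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            Slot& s = slots_[(h + i) % slots_.size()];   // linear probe: h, h+1, h+2, ...
            if (s.state != State::Full || s.key == key) {
                // (A real table would first scan the rest of the chain for the key
                //  before reusing a tombstone, to avoid duplicate entries.)
                s = {State::Full, key, value};
                return true;
            }
        }
        return false;                                    // table is completely full
    }

    std::optional<int> find(const std::string& key) const {
        std::size_t h = home(key);
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            const Slot& s = slots_[(h + i) % slots_.size()];
            if (s.state == State::Empty) return std::nullopt;  // a hole ends the probe chain
            if (s.state == State::Full && s.key == key) return s.value;
            // Tombstones are stepped over, which is why heavy deletion slows lookups.
        }
        return std::nullopt;
    }

    void erase(const std::string& key) {
        std::size_t h = home(key);
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            Slot& s = slots_[(h + i) % slots_.size()];
            if (s.state == State::Empty) return;
            if (s.state == State::Full && s.key == key) {
                s.state = State::Tombstone;              // leave a tombstone, not a hole
                return;
            }
        }
    }
};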

Double hashing spreads data more uniformly around the hash table (depending of course on the choice of h2(k)) than linear probing, which on a collision simply uses the next entry (consecutive locations are used for resolving collisions) to place the key and its satellite data. This means linear probing is likely to cause more collisions, since keys that genuinely hash to a particular slot may get pushed several slots away by a long run of entries from keys that collided earlier; this phenomenon is called clustering. But if the first hash function is really good (in terms of spreading data uniformly at random across the table) and the load factor is sufficiently small, then linear probing is likely to be more cache friendly than double hashing. Quadratic probing spreads the colliding keys more uniformly throughout the table than linear probing, although linear probing is the most cache-friendly of the three techniques (note, however, that this may be offset by the increased likelihood of collisions). Double hashing may also be computationally more expensive since it requires two hash functions.

Open Addressing (double hashing in particular, why?) generally poses problems with deletion, since a deleted slot leaves behind a marker (tombstone) which can only be reused by a subsequent insert. Imagine a situation where there are a lot of deletes in the hash table: a search may no longer be constant time, since there may be many tombstoned slots to step through before finding the right entry. In contrast, with direct chaining the deleted entries do not consume any space in the table per se (only the chain, a linked list/tree, shrinks, which as a dynamic data structure is more space efficient).

Question: Which technique is commonly used to resolve hash collisions in well-known C++ projects?

* LLVM (a powerful compiler system for C, C++ and many other languages) uses quadratically probed hash tables in implementing StringMap (a map from strings to objects) and DenseMap (a map usually used to hold pairs of small values, e.g. pointer to pointer, which is space efficient), as described here

* SGI's hash_map (distributed on Linux as __gnu_cxx::hash_map) uses chaining (with linked lists?) to resolve hash collisions

* Google Sparse Hash (a pair of dense and sparse hash maps/sets from Google) uses quadratic probing (in the form of triangular numbers), as described here


Friday, June 13, 2008

Use of setjmp/longjmp

Basically, these are calls used to recover from exceptional situations in a C program running on UNIX. They work as follows:

Suppose you have a long call sequence, and when some exceptional condition is detected deep in the call chain you want to handle the situation somewhere high up in the chain (ignoring all the intermediate call sites and stack frames). It can be done as follows:

(1) First, call setjmp at the handling point higher up in the call chain, passing it a (typically global) environment variable of type jmp_buf. It saves the program state (PC, SP, general-purpose registers etc.) into the environment and returns 0 at that program point; execution then continues normally.

(2) When an exceptional situation occurs deep in the call chain, call longjmp with the same environment variable. This restores the register values from the environment and control returns from the point where setjmp was originally called (this time with the non-zero value passed to longjmp).
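A minimal sketch of that flow (the function names and the error condition are made up for illustration):

#include <csetjmp>
#include <cstdio>

static std::jmp_buf env;   // the saved "environment" (PC, SP, registers, ...)

static void innermost(int value) {
    if (value < 0) {
        // Exceptional situation deep in the call chain: jump straight back to
        // where setjmp was called, making it return 2 this time.
        std::longjmp(env, 2);
    }
    std::printf("value is fine: %d\n", value);
}

static void middle(int value) { innermost(value); }

int main() {
    if (setjmp(env) == 0) {
        // First return from setjmp: normal execution continues from here.
        middle(-1);
    } else {
        // Second "return", reached via longjmp from deep in the call chain.
        std::printf("recovered from the error high up in the call chain\n");
    }
    return 0;
}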

Update from Chaitu: As can be seen from the above, jumping from somewhere deep down the call chain to some point way higher up can be a recipe for memory leaks, since there is no way to deallocate heap objects created in the intermediate functions, unless a custom memory manager is used and some hack (or technique?) is put in place to identify all the objects created by the functions called below the handling point.
[Reference]

Tuesday, June 10, 2008

C++: Buffer Growth and STL vector

What's the best strategy for managing Buffer Growth ?

For example, in the STL vector class, the vector is initially allocated some space; when it runs out of space and you want to add new elements, it has to do a *realloc* and copy all existing elements from the old space into the newly allocated space. If at every addition the vector is grown by just 1, then for an eventual vector of size N there will have been O(N) reallocs/allocations and each element will have been copied O(N) times on average [Why?]. The same holds whenever the buffer grows by a fixed amount (say 64 bytes or so), with some small constant-factor advantage. If, on the other hand, when you run out of space you realloc (X + X/2) elements, where X is the current size of the vector, then in the long run only O(log N) reallocs happen and each element is copied roughly twice on average, where N is the final size of the vector.

To see the O(log N) bound: if S(k) is the capacity after k reallocs, then S(k) = (3/2)*S(k-1) = (3/2)^k * S(0). Setting S(k) = n and solving gives k = log_{3/2}(n/S(0)), i.e., O(log n) reallocs to reach a size of n.
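A tiny sketch of the geometric-growth idea (the 1.5x factor matches the discussion above; everything else is illustrative and not the actual libstdc++ vector implementation):

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// A toy growable buffer of ints that grows by a factor of 1.5, so pushing N
// elements costs O(log N) reallocations and roughly 2 copies per element.
struct GrowBuf {
    int*        data = nullptr;
    std::size_t size = 0;
    std::size_t capacity = 0;

    void push_back(int x) {
        if (size == capacity) {
            std::size_t new_cap = capacity ? capacity + capacity / 2 : 4;   // X + X/2
            int* new_data = static_cast<int*>(std::malloc(new_cap * sizeof(int)));
            if (data) {
                std::memcpy(new_data, data, size * sizeof(int));   // copy the old elements over
                std::free(data);
            }
            data = new_data;
            capacity = new_cap;
        }
        data[size++] = x;
    }
};

int main() {
    GrowBuf b;
    for (int i = 0; i < 1000; ++i) b.push_back(i);
    std::printf("size = %zu, capacity = %zu\n", b.size, b.capacity);
    std::free(b.data);
    return 0;
}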

[Reference: More Exceptional C++, Herb Sutter]

Thunks

* A thunk is a small subroutine used for a variety of purposes like :
- Implementing call by name in Algol
- Providing a way of matching actual parameters with formal parameters (especially useful when calling conventions differ between incompatible modules).

* As a side effect of the above two, thunks provide a form of lazy evaluation. That is, if you pass parameters (which are expressions themselves) to functions, eager evaluation evaluates the whole expression/parameter list and then passes it on to the called function, irrespective of whether it is used at all (at runtime). With a thunk, the first time a parameter is used, the thunk translates/evaluates the expression into an address or the value at that address, depending on whether the parameter is used as an lvalue or an rvalue, and returns it. This evaluation takes place only once, and every subsequent time the thunk already has the evaluated result.

* This is especially useful in functional programming, where the parameters passed may be functions themselves, and evaluation of a function is done only when necessary. This also helps in dealing with infinite data and with termination of recursive functions passed as parameters [call by need].

* A thunk is implemented as a list of three items :
- Thunk Evaluator
- Expression ( denoting the actual parameter)
- Environment ( in which the thunk is evaluated ).

Once the thunk evaluates the expression in the environment, it replaces the 'Thunk Evaluator' with the value so that subsequent calls will give the value directly. In some sense, thunks are closures specifically for parameter passing.
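A rough C++ analogue of the (evaluator, expression, environment) triple described above, using a lambda as the suspended expression and memoizing the first evaluation (this is my own illustration, not how Algol compilers actually emitted thunks):

#include <cstdio>
#include <functional>
#include <utility>

// A memoizing thunk: the wrapped expression is evaluated only the first time
// force() is called; later calls return the cached value.
template <typename T>
class Thunk {
    std::function<T()> expr_;   // the suspended actual parameter plus its captured environment
    bool evaluated_ = false;
    T value_{};

public:
    explicit Thunk(std::function<T()> expr) : expr_(std::move(expr)) {}

    T force() {
        if (!evaluated_) {
            value_ = expr_();    // evaluate once, in the captured environment
            evaluated_ = true;
        }
        return value_;           // subsequent calls reuse the cached value
    }
};

static int maybe_use(bool use_it, Thunk<int> arg) {
    // With eager evaluation the expression would have been computed regardless;
    // here it runs only if the parameter is actually used.
    return use_it ? arg.force() + arg.force() : 0;   // the second force() hits the cache
}

int main() {
    int x = 20;
    Thunk<int> t([x]() { std::puts("evaluating..."); return x * 2 + 2; });
    std::printf("%d\n", maybe_use(true, t));   // prints "evaluating..." once, then 84
    return 0;
}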

[Reference 1]
[Reference 2]
[Reference 3]


Behind malloc() and free()

* Whenever we use malloc(), if it finds that memory is not available within the data segment of the process, it internally calls the brk() system call (as a last step). This system call grows the (virtual) data segment address space of the current process by the requested amount. It is actually quite possible that although brk() returns successfully, physical memory is not yet available (it is typically committed only when the pages are first touched).

* When we use free(), it returns the specific chunk of memory to the allocation library (not to the OS). The next time you call malloc(), the allocation library may return the freed space without calling the brk() system call. Thus the OS just increases the virtual address space of the data segment (via brk()), but the management of the returned memory is up to the allocation library. Also, I don't see a system call to explicitly reduce the virtual address space of a process (a kind of reverse brk()). So once the VA space grows, it is up to the allocation library to manage this memory. Update: You did not look around enough! In fact, calling sbrk() with a negative increment (or brk() with a lower break address) reduces the VM address space of the process by that amount; the only thing is that the allocation library can't do this unless it is sure that the last contiguous portion of VM (that is to be returned to the OS) is not used by the process.

* All the memory management policies like first fit, best fit, maintaining a free list, storing the size within a header of the allocated block and so on happen in the allocation library, using the memory obtained via brk(). Only when there is no memory available does malloc() go to the OS (a tiny sbrk-based sketch of this last step appears after this list).

* Some of the popular allocators are Doug Lea Malloc, BSD Malloc and Hoard

* Also, memory management can be shared between the program and the allocator by using reference counts (for data which is read-only in some cases, or even for shared data) and garbage collection (no explicit free, which helps in that we don't need to keep track of all pointers that point to a data chunk, or worry about freeing a pointer twice / using a freed pointer).

* Another memory management technique is memory pools. For example, the Apache web server has different memory pools: one that lasts a request, one that lasts a connection, one that lasts a thread and so on. GNU's obstack implementation is one way of using allocation pools [it provides reuse of similarly sized data chunks].
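Going back to the first point about brk(): here is a highly simplified bump-pointer sketch of that last, OS-facing step (real allocators such as dlmalloc keep free lists, size headers, bins and so on; sbrk() is the increment-style wrapper over the brk mechanism, and it is considered obsolete on modern systems):

#include <unistd.h>   // sbrk()
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Toy "malloc": grow the data segment with sbrk() and hand out the new space.
// There is no header, no free list and no free(); just the request to the OS.
static void* toy_alloc(std::size_t size) {
    void* p = sbrk(static_cast<std::intptr_t>(size));   // extend the data segment
    return (p == reinterpret_cast<void*>(-1)) ? nullptr : p;
}

int main() {
    int* a = static_cast<int*>(toy_alloc(100 * sizeof(int)));
    if (a) {
        a[0] = 42;
        std::printf("allocated at %p, a[0] = %d\n", static_cast<void*>(a), a[0]);
    }
    return 0;
}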

[Reference]


Worst case Number of Phi Nodes in a SSA graph

Q. What is the worst case number of phi functions in a SSA graph ?

It could be O(N^2). Consider this program:

while (...)
{
    while (...)
    {
        while (...)
        {
            i = ...
        }
        i = ...
    }
    i = ...
}
... = i

A minimal phi insertion (confirm this!) has 5 phi nodes here, which is around N/2 for a program of size N. With V variables written this way (and V itself proportional to N), the number of phi functions would be around (N/2)*V, i.e., O(N^2) phi functions.

i0 = ...
while (...)
{
    i1 = phi(i0, i7)
    while (...)
    {
        i2 = phi(i1, i5)
        while (...)
        {
            i3 = i2;
        }
        i4 = phi(i3, i2)
        i5 = i4
    }
    i6 = phi(i4, i1)
    i7 = i6
}
i8 = phi(i7, i0)

There is some interesting discussion on this in comp.compilers here


The alloca routine: Stack allocation

The alloca routine allows a C program to allocate memory on the stack instead of the heap. The advantage is that this memory is cleaned up automatically once the allocating routine exits, and it does not need an explicit free. So you probably would not need to worry about memory leaks (but you are obviously restricting the lifetime of the object to the method boundary; an escape analysis could use this to convert malloc calls to alloca calls for objects that do not escape the allocating function).

How could this be implemented ?

It is just a manipulation of the stack pointer (esp), using the size passed as an argument to the alloca routine (in a register). Since the base pointer (ebp) holds the base of the current function's stack frame, when the function returns the stack is unwound using ebp and the de-allocation is automatically taken care of.
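A small illustrative use (alloca is non-standard; on Linux it comes from <alloca.h>, and the usual advice is to prefer a fixed-size buffer or a variable-length array where possible):

#include <alloca.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

static void greet(const char* name) {
    // The space comes from this function's stack frame; it is reclaimed
    // automatically when greet() returns, with no free() needed.
    std::size_t len = std::strlen(name) + 1;
    char* copy = static_cast<char*>(alloca(len));
    std::memcpy(copy, name, len);
    std::printf("hello, %s\n", copy);
}   // stack pointer restored here; 'copy' is gone

int main() {
    greet("alloca");
    return 0;
}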

What could be the potential disadvantages ?

Size is obviously a restriction, as the stack space grows linearly and only up to a limit. There are other disadvantages mentioned in this Wine mailing list thread.

A long list of pros and cons are discussed in the gdb developer's mailing list. See here


Pointer Analysis: C vs Java

* Stack variables and the heap: In C/C++, objects can be allocated on the stack (not just scalar locals/pointer variables). We can make use of this property to eliminate any points-to edges from heap objects to objects on the stack once a function returns. This can be done only when we are sure that the given program is well-written, has no dangling references (which itself would require another analysis, I guess) and its pointer references respect the lifetimes of the pointed-to objects. Alternately, we can track such a points-to edge (from a heap object to a stack object) and, if there is a de-reference of this pointer, flag an error/warning in the compiler (a small example of this situation follows below). In Java, objects are not allocated on the stack, and hence there can't be a reference from a field of a heap object to anything on the stack at all (since you can't take the address of a local scalar variable, and an assignment always assigns a reference to another heap object, never to a stack variable). So aliasing between heap and stack objects/variables never arises in Java. The same applies to pointers from global variables to stack variables.

* Direct manipulation of pointers: In C/C++, it is possible to take the address of a variable (using the & operator), add or subtract values (offsets) to it, and then indirectly access the variable at the new address (using *), which makes it possible for a pointer to point directly into the middle of a (logical) data structure. So things like alignment and byte-level reasoning become more important in the analysis to safely infer points-to sets. It is also possible to arbitrarily cast pointers from one type to another, which means declared types are usually useless from the analysis perspective. A lot of this can be handled by representing such operations with explicit cast instructions in the IR and by specifying alignments (for things like unions) explicitly in the IR; all of this is done in the LLVM IR. Java, on the other hand, is strongly typed and permits only indirect, type-safe manipulation of references (or an explicit exception is thrown, as in the case of casts). This means the declared types not only can safely be used in the analysis but can also be used to make it more precise (references of two different classes which are not related in any way by the inheritance hierarchy can't point to the same object).
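A tiny example of the heap-to-stack situation from the first point, which is expressible in C/C++ but not in Java (illustrative and deliberately buggy):

#include <cstdio>
#include <cstdlib>

struct Node { int* field; };

static void store_local(Node* n) {
    int local = 42;
    n->field = &local;   // the heap object now points to a stack slot...
}                        // ...which dies here, leaving a dangling points-to edge

int main() {
    Node* n = static_cast<Node*>(std::malloc(sizeof(Node)));
    store_local(n);
    // Dereferencing n->field here would be a dangling access. A sound C points-to
    // analysis must either keep this heap-to-stack edge (and possibly warn on the
    // dereference) or assume/prove that lifetimes are respected. In Java there is
    // no & operator, so this edge cannot even be expressed.
    std::free(n);
    return 0;
}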


Just In Time Compilation

A just-in-time compiler is somewhat midway between an interpreter and a static compiler. Historically, JITs were designed to overcome the slow performance of interpreters by caching native code for frequently interpreted source statements. But nowadays, with the advent of multicore, JITs are playing an interesting role in leveraging the extra cores for a variety of tasks like data- or target-CPU-specialized code generation, well beyond the realm of traditional static compilers. Some comparisons with other models of compilation/execution:
  • JIT vs Interpretation: Both usually operate on the same low-level IR (bytecode), and translation to native behaviour happens at runtime. However, an interpreter processes one instruction at a time and usually does not cache generated native code, so the work is repeated on re-execution. A JIT compiler, on the other hand, translates a much coarser entity like a function or module at a time, and the translated native code is cached to speed up execution.
  • JIT vs Static Compilation: A static compiler translates high-level code like C source to a low-level IR, and the low-level IR in turn to machine code, at the same (compile) time. In a JIT, this is broken into two steps: translation from C source code to low-level IR is done at compile time, while translation from low-level IR to machine code is done at runtime (just before execution). Some of the potential advantages of a JIT over static compilation are: target-CPU specialization (where at runtime it is determined whether the CPU supports certain types of instructions, e.g. vector instructions, and code is emitted using those instructions), whole-program dynamic optimization (inlining dynamic library calls) and runtime data specialization (where a JIT can generate different versions of a function depending upon certain variable values known only at runtime, which may in turn enable optimizations like memoization).
  • JIT vs Dynamic Binary Translation: A JIT usually has more semantic information (e.g., low-level IR with types, basic-block information etc.) and recompilation is done from the low-level IR (usually called bytecode). A dynamic binary translator, on the other hand, re-translates machine code to machine code, either by going through an intermediate form (Machine Code → ByteCode → Machine Code) or directly (Machine Code → Machine Code). Because machine code carries very little semantic information, re-translation is usually done for smaller compilation units, like a basic block, and unlike a JIT, whole-program optimization is usually not possible.


Monday, October 30, 2006

Questions and more questions

1. Which of the following orders would result in better code generation:
(a) Instruction Scheduling and then Register Allocation
(b) Register Allocation and then Instruction Scheduling

Tuesday, October 04, 2005

GCJ, StringBuilder & Escape Analysis

Sun has recently introduced the 'StringBuilder' class in JDK 1.5, which is meant to be an unsynchronized version of the StringBuffer class. Whenever a string is only going to be used by a single thread, you use a StringBuilder rather than a StringBuffer. They claim that StringBuilder is almost always faster than StringBuffer.

Consider this problem: you have lots of code in a Java application (possibly an ERP web application) which uses StringBuffers once every 10 lines of code (we know that, don't we, with application frameworks appending URLs, javascript targets etc.), and what with threads being an absolute taboo in web applications, wouldn't there be a lot to gain if we used StringBuilders instead of StringBuffers?

Now how exactly do we find the StringBuffers that are local to a single thread/method and replace them with StringBuilders? Just think of the millions of lines of code we would need to check to make sure only a single thread/method accesses a given StringBuffer.

This is where escape analysis could help, and it is what the GCJ guys are discussing here.
But the one point on which I am not clear is why inheritance should pose a (big) problem. Can't we first do a Class Hierarchy Analysis (CHA) / Rapid Type Analysis (RTA) to constrain the set of methods that could possibly be called?

Friday, September 16, 2005

Computing Square Root, Fixed Point Iteration & Lisp

A value y is called a fixed point of a function f when f(y) = y.
We can use this to find the square root of a number x.

  • Guess a value for the square root, say y
  • Find the value of Avg(y, x/y). If it is the same as y, then we are done.
  • Else assign Avg(x/y, y) to y (the new, improved value). Go back to step 2.

Note that if y = Avg(x/y, y), then (for positive y) y must be sqrt(x). So when the fixed point of the function y → Avg(x/y, y) is reached, the value of y is the square root.

So, if we have a black box which computes the fixed point of any function, then we can pass it the function y → Avg(x/y, y) and find its fixed point.

Thus, in terms of a functional language, if we have a function for computing the fixed point, we can pass any function (the averaging function in this case) to it to obtain the desired value. In LISP, this means we write a higher-order procedure which takes another procedure as input [the function being a first-class value here] and returns the fixed point.
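The same idea in a small C++ sketch: a higher-order fixed_point routine that takes the function y → Avg(x/y, y) as a value (the names and the tolerance are my own, and this is only a sketch):

#include <cmath>
#include <cstdio>
#include <functional>

// Generic fixed-point iteration: keep applying f until f(y) is (close enough to) y.
static double fixed_point(const std::function<double(double)>& f, double guess) {
    const double tolerance = 1e-9;
    double y = guess;
    while (std::fabs(f(y) - y) > tolerance) {
        y = f(y);   // the new, improved value
    }
    return y;
}

static double my_sqrt(double x) {
    // Pass the function y -> Avg(x/y, y) to the fixed-point black box.
    return fixed_point([x](double y) { return (y + x / y) / 2.0; }, 1.0);
}

int main() {
    std::printf("sqrt(2) is approximately %.9f\n", my_sqrt(2.0));
    return 0;
}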

Note that fixed point iteration that we use here is exactly the same as that used for data flow analysis in any control flow graph [ lattice of functions ? ]

As an aside, the kinds of abstraction that LISP supports: black-box abstraction (higher-order procedures), conventional interfaces (generics, OOP), meta-linguistic abstraction (being able to write other programming languages in LISP, e.g. a logic programming language, register machines).

[Reference]