Friday, June 13, 2008

Use of setjmp/longjmp

Basically, these are calls that can be used to recover from exceptional situations in a C program running on UNIX. They work as follows:

Suppose you have a long call sequence, and when some exceptional condition is detected deep in the call chain, you want to handle the situation somewhere high up in the chain (ignoring all the intermediate call sites and stack frames). This can be done as follows:

(1) First, call setjmp at the higher point in the call chain, passing it a global environment variable (a jmp_buf). It saves the program state (pc, sp, general-purpose registers, etc.) into the environment and returns 0 to the current program point. Execution continues normally.

(2) When an exceptional situation occurs deep in the call chain, call longjmp with the same environment variable. This just restores the register values from the environment, so control returns a second time from the point where setjmp was originally called (now with the nonzero value passed to longjmp, e.g. 1).
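
A minimal sketch of the pattern (all function names other than setjmp/longjmp are made up for illustration):

#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;                /* the saved environment */

static void deep_in_call_chain(void)
{
    /* ... some exceptional condition is detected ... */
    longjmp(env, 1);               /* control reappears at the setjmp below */
}

static void intermediate(void)
{
    deep_in_call_chain();
    /* everything here is skipped when longjmp fires */
}

int main(void)
{
    if (setjmp(env) == 0) {        /* first return: 0, state saved */
        intermediate();
    } else {                       /* second return: the value passed to longjmp */
        printf("recovered from the exceptional condition\n");
    }
    return 0;
}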

Update from Chaitu: As can be seen from the above, jumping from somewhere deep down the call chain to some point way higher up can be a recipe for memory leaks, since there is no way to deallocate heap objects created in the intermediate functions, unless a custom memory manager is used and some hack (or technique?) is put in place to identify all the objects created by the functions below the handling function.
[Reference]

Tuesday, June 10, 2008

C++: Buffer Growth and STL vector

What's the best strategy for managing buffer growth?

For example, an STL vector is initially allocated some space; when it runs out of space and you want to add new elements, it has to do a *realloc* and then copy all the existing elements over from the old space into the newly allocated space. If at every addition the vector is grown by just 1, then for an eventual vector of size N there would have been O(N) reallocs, and each element on average would have been copied O(N) times (the total copying work is 1 + 2 + ... + (N-1), which is O(N^2), i.e., O(N) copies per element). The same holds whenever the buffer grows by a fixed amount (say 64 bytes or so), with some small constant-factor advantage. If, on the other hand, when you run out of space you realloc to (X + X/2) elements, where X is the current size of the vector, then it turns out that in the long run O(log N) reallocs happen and each element is copied on average about twice, where N is the final size of the vector.

To see why: if S(0) is the initial capacity and each realloc grows the buffer by a factor of 3/2, then

n = S(k) = (3/2) * S(k-1) = (3/2)^k * S(0)

Solving this recurrence gives k = log_{3/2}(n / S(0)) = O(log n), where k is the number of iterations/reallocs needed to reach the final size n. The total number of element copies is S(0) + S(1) + ... + S(k-1), a geometric series that sums to about 2n, i.e., roughly two copies per element.
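
Here is a quick instrumented sketch to check this empirically: a toy realloc-based buffer growing by ~1.5x, with error handling omitted and counters added purely for measurement. For n = 1,000,000 it should report a few dozen reallocs and an average of about 2 copies per element.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t cap = 1, size = 0, n = 1000000;
    size_t reallocs = 0, copies = 0;
    int *buf = malloc(cap * sizeof *buf);

    for (size_t i = 0; i < n; i++) {
        if (size == cap) {
            cap += cap / 2 + 1;                    /* grow by ~1.5x */
            buf = realloc(buf, cap * sizeof *buf); /* may copy everything */
            reallocs++;
            copies += size;                        /* count the copied elements */
        }
        buf[size++] = (int)i;
    }
    printf("reallocs = %zu, avg copies per element = %.2f\n",
           reallocs, (double)copies / (double)n);
    free(buf);
    return 0;
}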

[Reference: More Exceptional C++, Herb Sutter]

Thunks

* A thunk is a small subroutine used for a variety of purposes, like:
- Implementing call by name in Algol
- Providing a way of matching actual parameters with formal parameters (especially useful when calling conventions differ between incompatible modules).

* As a side effect of the above two, thunks provide a way of doing lazy evaluation. That is, if you are passing parameters (which are expressions themselves) to a function, eager evaluation evaluates the whole expression/parameter list and passes it to the called function, irrespective of whether it is ever used (at runtime). With a thunk, the first time a parameter is used, the thunk translates/evaluates the expression into an address, or the value at that address, depending on whether the parameter is used as an lvalue or an rvalue, and returns it. This evaluation takes place only once; every subsequent time, the thunk already has the evaluated result.

* This is especially useful in functional programming, where the parameters passed are functions themselves, and evaluation is done only when necessary. This also helps in dealing with infinite data structures and with termination of recursive functions passed as parameters [call by need].

* A thunk is implemented as a list of three items:
- Thunk Evaluator
- Expression ( denoting the actual parameter)
- Environment ( in which the thunk is evaluated ).

Once the thunk evaluates the expression in the environment, it replaces the 'Thunk Evaluator' with the value so that subsequent calls will give the value directly. In some sense, thunks are closures specifically for parameter passing.
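
A minimal sketch of such a thunk in C (the names thunk, force, eval_square are made up for illustration; in Algol or a lazy functional language the compiler generates this machinery):

#include <stdio.h>

typedef struct thunk {
    int (*eval)(struct thunk *);   /* the 'Thunk Evaluator' */
    int  *env;                     /* the environment: here, just a pointer to x */
    int   value;                   /* cached result of the expression */
    int   forced;                  /* has the expression been evaluated yet? */
} thunk;

/* the expression denoting the actual parameter, e.g. 'x * x' */
static int eval_square(thunk *t)
{
    return (*t->env) * (*t->env);
}

/* evaluate once, then serve the cached value on every later use */
static int force(thunk *t)
{
    if (!t->forced) {
        t->value  = t->eval(t);    /* evaluate the expression in its environment */
        t->forced = 1;             /* 'replace the evaluator with the value' */
    }
    return t->value;
}

/* the callee touches the parameter lazily: if flag is 0, it is never evaluated */
static int f(int flag, thunk *t)
{
    return flag ? force(t) + force(t) : 0;   /* eval_square runs at most once */
}

int main(void)
{
    int x = 7;
    thunk t = { eval_square, &x, 0, 0 };
    printf("%d\n", f(1, &t));      /* prints 98 */
    return 0;
}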

[Reference 1]
[Reference 2]
[Reference 3]


Behind malloc() and free()

* Whenever we use malloc(), if it finds that memory is not available within the data segment of the process, it (as a last step) internally calls the brk() system call. This system call increases the (virtual) data segment address space of the current process by the requested amount. Actually, it is quite possible that although brk() returns successfully, physical memory is not yet available (it is typically committed lazily, on first touch).

* When we use free(), it returns the specific chunk of memory to the allocation library (not to the OS). The next time you call malloc(), the allocation library may return the freed space without calling the brk() system call. Thus the OS just increases the virtual address space of the data segment (via brk()), but the management of memory that is returned is up to the allocation library. Also, I don't see a system call to explicitly reduce the virtual address space of a process (a kind of reverse brk()). So once the virtual address space increases, it is up to the allocation library to manage this memory. Update: You did not look around enough! In fact, calling sbrk() with a negative increment (i.e., moving the break back down) reduces the VM address space of the process by that amount; the only thing is that the allocation library can't do this unless it is sure that the last contiguous portion of VM (the part being returned to the OS) is not used by the process.

* All the memory management policies, like first fit, best fit, maintaining a free list, storing the size within a header of the allocated block, and so on, happen in the allocation library, using the memory obtained via brk(). Only when no memory is available does malloc() go to the OS.
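
A toy sketch of the header-plus-free-list idea, loosely in the style of the K&R allocator (first fit; block splitting, coalescing and alignment are all omitted, and sbrk() is used rather than raw brk() for convenience):

#include <stddef.h>
#include <unistd.h>   /* sbrk */

typedef struct header {
    struct header *next;   /* next block on the free list */
    size_t         size;   /* usable size of this block in bytes */
} header;

static header *free_list;

void *my_malloc(size_t size)
{
    header **pp, *h;

    /* first fit: reuse a previously freed block if one is big enough */
    for (pp = &free_list; (h = *pp) != NULL; pp = &h->next) {
        if (h->size >= size) {
            *pp = h->next;             /* unlink it from the free list */
            return h + 1;              /* payload starts right after the header */
        }
    }

    /* nothing fits: only now do we go to the OS for more address space */
    h = sbrk(sizeof(header) + size);
    if (h == (void *)-1)
        return NULL;
    h->size = size;
    return h + 1;
}

void my_free(void *p)
{
    /* the chunk goes back to the allocator's free list, not to the OS */
    header *h = (header *)p - 1;
    h->next   = free_list;
    free_list = h;
}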

* Some of the popular allocators are Doug Lea Malloc, BSD Malloc and Hoard

* Also, memory management can be split between the program and the memory manager by using reference counts (for data which is read-only in some cases, or even for shared data), or by using garbage collection (no explicit free, which helps in that we don't need to keep track of all the pointers that point to a data chunk, or worry about freeing a pointer twice or using a freed pointer).

* Another memory management technique is memory pools. For example, the Apache web server has different memory pools: one that lasts a request, one that lasts a connection, one that lasts a thread, and so on. GNU's obstack implementation is one way of using allocation pools (it provides reuse of similar-sized data chunks).

[Reference]


Worst case Number of Phi Nodes in an SSA graph

Q. What is the worst-case number of phi functions in an SSA graph?

It could be O(N^2). Consider this program:

while (...)
{
    while (...)
    {
        while (...)
        {
            i = ...
        }
        i = ...
    }
    i = ...
}
... = i

A minimal phi insertion (confirm this!) has 5 phi nodes for this one variable, which is roughly N/2 for a program of N statements. For V variables, the number of phi functions would be (N/2)*V; since V can itself be O(N), that is O(N^2) phi functions in the worst case.

i0 = ...
while (...)
{
    i1 = phi(i0, i7)
    while (...)
    {
        i2 = phi(i1, i5)
        while (...)
        {
            i3 = ...
        }
        i4 = phi(i3, i2)
        i5 = i4
    }
    i6 = phi(i4, i1)
    i7 = i6
}
i8 = phi(i7, i0)

There is some interesting discussion on this in comp.compilers here


The alloca routine: Stack allocation

The alloca routine allows a C program to allocate memory on the stack instead of the heap. The advantage is that this memory is cleaned up once the allocating routine exits, and it does not need an explicit 'free'. So you probably would not need to worry about memory leaks (but you are obviously restricting the lifetime of the object to the method boundary; an escape analysis could use this to convert malloc calls to alloca calls for captured, i.e. non-escaping, objects).

How could this be implemented ?

It is essentially just a manipulation of the stack pointer (esp): the size passed as an argument to the alloca routine is subtracted from esp, and the new address is returned. Since the base pointer (ebp) holds the base of the current function's stack frame, when the function returns, the stack is unwound using ebp and the de-allocation is automatically taken care of.
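
A small usage sketch (alloca is non-standard; on glibc it is declared in <alloca.h>):

#include <alloca.h>
#include <stdio.h>
#include <string.h>

static void greet(const char *name)
{
    /* allocated on greet's stack frame, reclaimed when greet returns */
    size_t len = strlen(name) + 8;
    char *buf  = alloca(len);
    snprintf(buf, len, "Hello, %s", name);
    puts(buf);
    /* no free() needed: unwinding the frame de-allocates buf */
}

int main(void)
{
    greet("world");
    return 0;
}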

What could be the potential disadvantages ?

Size is obviously a restriction, as the stack grows only up to a fixed limit (and unlike malloc returning NULL, alloca has no way to report failure). There are other disadvantages mentioned in this Wine mailing list thread.

A long list of pros and cons is discussed in the gdb developers' mailing list. See here


Pointer Analysis: C vs Java

* Stack variables and the heap: In C/C++, objects can be allocated on the stack (not just scalar locals/pointer variables). We can make use of this property to eliminate any points-to edges from heap objects to objects on the stack once a function returns. This can be done only when we are sure that the given program is well-written and has no dangling references (which would itself require another analysis, I guess), i.e., pointer references in the program respect the lifetimes of the pointed-to objects. Alternatively, we can track such a points-to edge (from heap to stack object), and if there is a de-reference of this pointer, flag an error/warning in the compiler. In Java, objects are not allocated on the stack, so there can't be any reference from a field of a heap object to an object on the stack at all (you can't take the address of a local scalar variable, and an assignment always assigns a 'reference' to another heap object, never a stack variable). So aliasing between heap and stack objects/variables never arises in Java. The same applies to pointers from global variables to stack variables. (A sketch of the C situation appears below.)
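
A sketch of the C situation just described (the names are hypothetical):

#include <stdio.h>
#include <stdlib.h>

struct node { int *field; };

static void store_local(struct node *n)
{
    int local = 42;
    n->field = &local;   /* a heap object now points to a stack variable */
}                        /* local's frame dies here: the edge dangles */

int main(void)
{
    struct node *n = malloc(sizeof *n);
    store_local(n);
    /* an analysis can either kill the heap->stack edge at the return of
       store_local, or keep the edge and warn on the dereference below */
    printf("%d\n", *n->field);   /* dangling dereference: undefined behavior */
    free(n);
    return 0;
}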

* Direct Manipulation of Pointers: In C/C++, it is possible to take the address of a variable (using the & operator), add or subtract values (offsets) to it, and then indirectly access the variable at the new address (using *), which makes it possible for a pointer to point directly into the middle of a (logical) data structure. So things like alignment and byte-level reasoning become more important in the analysis to safely infer points-to sets. It is also possible to arbitrarily cast pointers from one type to another, which means declared types are usually useless from the analysis perspective. A lot of this can be handled by representing such operations with special cast instructions in the IR and by specifying alignments (for things like unions) explicitly in the IR; all of this is done in the LLVM IR. Java, on the other hand, is strongly typed and permits only indirect, type-safe manipulation of pointers (or an explicit exception is thrown, as in the case of casts). This means the declared types not only can safely be used in the analysis but can also make the analysis more precise (scalars of two different classes which are not related in any way by the inheritance hierarchy can't point to the same object). A small example of both C behaviors follows.
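
For example (a small illustrative program, using offsetof for the byte-level offset):

#include <stddef.h>
#include <stdio.h>

struct pair { int a; int b; };

int main(void)
{
    struct pair p = { 1, 2 };

    /* a pointer directly into the middle of a logical data structure */
    char *base = (char *)&p;
    int  *pb   = (int *)(base + offsetof(struct pair, b));
    *pb = 99;                                  /* writes p.b */

    /* an arbitrary cast: the declared type tells the analysis little */
    unsigned *alias = (unsigned *)&p.a;
    *alias = 7u;

    printf("%d %d\n", p.a, p.b);               /* prints: 7 99 */
    return 0;
}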


Just In Time Compilation

A just-in-time compiler is one which is somewhat mid-way between an interpreter and a static compiler. Historically, JITs were designed to overcome the slow performance of interpreters by caching native code for frequently interpreted source statements. But nowadays, with the advent of multicore, JITs are playing an interesting role in leveraging the extra cores for a variety of tasks, like code generation specialized to the data or to the target CPU, well beyond the realm of traditional static compilers. Some comparisons with other models of compilation/execution:
  • JIT vs Interpretation: Both usually operate on the same low-level IR (bytecode), and translation to native code is done at runtime. However, an interpreter handles one instruction at a time and usually does not cache the generated native code, so the same instruction is re-translated every time it runs. A JIT compiler, on the other hand, translates a much coarser-grained entity, like a function or module, at a time, and the translated native code is cached to speed up execution (a tiny sketch of runtime code generation appears after this list).
  • JIT vs Static Compilation: A static compiler translates high-level code (like C source) to a low-level IR, and the low-level IR in turn to machine code, at the same (compile) time. In a JIT, this is broken into two steps: translation from C source code to low-level IR is done at compile time, while translation from low-level IR to machine code is done at runtime (just before execution). Some potential advantages of a JIT over static compilation are: target CPU specialization (at runtime it is determined whether the CPU supports certain instructions, e.g. vector instructions, and code is emitted using them), whole-program dynamic optimization (e.g. inlining dynamic library calls), and runtime data specialization (a JIT can generate different versions of a function depending on variable values known only at runtime, which may in turn enable optimizations like memoization).
  • JIT vs Dynamic Binary Translation: A JIT usually has more semantic information (e.g. low-level IR with types, basic-block information, etc.), and recompilation is usually done from the low-level IR (usually called bytecode). A dynamic binary translator, on the other hand, re-translates machine code to machine code, either by going through bytecode (Machine Code → Bytecode → Machine Code) or directly (Machine Code → Machine Code). Because machine code carries very little semantic information, re-translation is usually done for smaller compilation units, like a basic block, and unlike a JIT, whole-program optimization is usually not possible.
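
As a flavor of what "translate at runtime, cache the native code, then jump to it" means at the lowest level, here is a minimal sketch for x86-64 Linux. It hand-codes the six bytes of 'mov eax, 42; ret' instead of translating real bytecode, and systems enforcing W^X may refuse the writable-and-executable mapping:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* machine code for: int f(void) { return 42; }
       mov eax, 42  ->  B8 2A 00 00 00
       ret          ->  C3                          */
    unsigned char code[] = { 0xB8, 42, 0, 0, 0, 0xC3 };

    void *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return 1;
    memcpy(mem, code, sizeof code);   /* 'translate' and cache the native code */

    int (*f)(void) = (int (*)(void))mem;
    printf("%d\n", f());              /* runs the generated code: prints 42 */

    munmap(mem, 4096);
    return 0;
}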
