Saturday, 12 May 2012

Garbage collection

Garbage collection !!


This is the most critical as well as the most mysterious issue. Having Nikhil one of the experts on this topic as co-writer of this blog, I am going to take a look at Garbage collection, its strategy and its implications.
Before I start with the memory allocation discussion, I would like to know about the different memory types.
Any program/process has its own memory structure allocated. The memory structure will be divided in below types:
  • Code Segment
  • Data Segment
  • Stack
  • Heap


    1. Code Segment :
Code segment contains the compiled code of the program. Whenever a program executes, the instructions are loaded in memory and executed. This can be verified by taking a look at the .exe file you are trying to execute or the .class file created of your java program. It shows some instructions in binary format. The code segment contains that data.

         2.  Data Segment :
This area stores the static/global variables.

The next two sections that we are going to discuss are used to store data. Stack and Heap. The stack is a place in the computer memory where all the variables that are declared and initialized before runtime are stored. The heap is the section of computer memory where all the variables created or initialized at runtime are stored.

        3.  Stack :
The stack is the section of memory that is allocated for automatic variables within functions.
Data is stored in stack using the Last In First Out (LIFO) method. This means that storage in the memory is allocated and de-allocated at only one end of the memory called the top of the stack. Thus we can remove the data item that is most recently stored.

         4. Heap :
The heap is the area used to allocate memory at the runtime [Dynamic Memory Allocation]. Unlike stack, the data item is allocated/de-allocated arbitrarily depending on the program’s need. The pattern of allocation and size of blocks is not known until run time.

Today, almost all the programming languages use dynamic memory allocation. This allows object to be allocated/deallocated even if their total size was not known at the time that the program was compiled and its life time too. A dynamically allocated object is stored on a heap, rather than on the stack or statically (data segment). Heap allows to -
a.       Choose dynamically the size of new objects
b.      Define and use recursive data structures such as lists, trees and maps
c.       Return newly created objects to the parent procedure
d.      Return a function as the result of another function
Heap allocated objects are accessed through references. Typically, a reference is a pointer to the object.

Why do I need Automatic Memory Management?

This is the question that comes in mind when we think about our conventional ways of handling memory allocation and automatic memory management. I have written this code and I am completely aware of what is going on into the code, then what is the need to rely on automatic memory management? Let’s take a look at the possible issues with Explicit memory allocation.
       If I need to handle the memory management, I can use operators like free or delete to free the memory. Still there are some issues that can arise like dangling pointers. Sometimes the memory can be freed prematurely even if it is being referenced. This causes the program to go unstable. If the program uses the pointer reference to go further with the execution, the results are unpredictable. If you are lucky, the program will crash immediately and then the debugging can begin. In other cases, the program can lead further with the execution and provide you the incorrect result. This makes the situation worst as you are not exactly aware of what’s happening behind the scene.
The next issue is that the developer fails to free the memory- causing Memory Leak. For the small applications, this loss can be benign but in case of large applications, memory leak can show its significant appearance and effect. The Memory leak can further lead to memory management issues for allocating the memory and further may cause your application to go Out Of Memory.
       This issue further becomes more serious when we talk about multithreaded applications or multiprocessor machines. What if the particular object is being shared by two different processes? Or even in worst case, the object is being shared by multiple threads. Now there is the need for more advanced algorithms to make the objects thread-safe. This further brings me to the quote-“Liveness is the global property.” But the way we deal with freeing the memory is local one. Underlines the need of Automatic Memory Management.


Automatic Memory Management:

          The issues stated above can be addressed by the Garbage Collection. The garbage collector reclaims memory only if the object does not hold any references. Thus, Dangling Pointer issue is down now. Garbage collector traces down the objects and identifies the unreachable objects (the ones which are not being referenced - Garbage) and then are reclaimed. In case of Automatic memory management, there will be only one collector running to reclaim the memory, thus making the memory management thread-safe.
Garbage collection marks every object currently not being referenced and frees the object. Thus reducing the possibility of Memory Leak. Garbage Collector does not guarantee that Memory Leak will not happen but still ensures that it brings down its possibility. Every data structure that is not being referenced is released. Also if it has children associated then they are also reclaimed. However it cannot do anything if a particular data structure is going endlessly. Memory management cannot be the issue of only Garbage Collector, but it is also responsibility of Software Engineer.

               
Garbage Collection Parameters:
      Going forward we will be taking a look at few collectors, any strategy/algorithm for garbage collections has its own tradeoffs. We will list down and take a look at each performance parameter for Garbage Collector.
Often, we are not able to find the perfect algorithm for Garbage Collection. Every algorithm comes with its pros and cons. Thus we need to trade-off on some parameters depending on our requirements. It has been observed that every algorithm has its own standpoints and is 15% faster than others on its few designated parameters.
   
     1. Safety :
Most important parameter for garbage collection. Garbage Collection should be safe. The garbage collector must never reclaim the live objects. However this imposes a performance cost on the implementation.

        2. Throughput :
Every user wants his program to run faster. Its one of the aspect concerned with garbage collection is that the garbage collection should take very less time. This is the ratio of marking/sweeping time. However the user is more interested to spend a very less time for memory management task. Thus the complete process of memory allocation/de-allocation should take less time. It is observed that the memory allocation takes the more time as compared to collection. Thus there are few approaches to reduce the memory allocation time. The approaches like mark-sweep-compact, compacts the memory after collecting the objects to reduce the fragmentation.

     3. Completeness and promptness :
Looking at garbage collections ideally from the completeness perspective, the garbage collection should collect all garbage in the memory, though it is not always possible or desirable. However it is never advised to collect the complete heap from performance perspective. Thus the heap can be divided in generations and then collect each generation.
        Taking  a look further, the object which turns garbage after a garbage collection is started is collected in next cycle. This object is called as floating garbage. Thus completeness cannot be the only property to look.

     4. Pause Time :
           It is always desired to have less interference of garbage collector to the program execution. Many garbage collectors introduce a pause time in the program execution as they stop the memory operation during collection. It is always expected to have very less pause time. Thus one of the approaches is to have generational garbage collector where the heap is divided in generations. The more frequent and quick collection takes place on the nursery generation and then very few collections on old generation. This allows reducing the pause time. Still you need to deal with the sizes of nursery and old generation to make sure that we achieve min pause time.
          Parallel garbage collectors stop the world to perform collection but reduce the pause time by employing multiple threads. Concurrent and incremental collectors aim to reduce pause times still further by occasionally performing a small quantum of collection work interleaved or in parallel with memory allocations.


We will take a look at other Garbage Collection algorithms/policies in forthcoming post.