Wednesday, December 18, 2013

Memory leaks even with WeakReferences

A crash report arrived at my desk the other day. The system had crashed because it ran out of memory, and the major memory consumer was a WeakHashMap. Very interesting, since WeakHashMaps are usually used precisely so that memory can be freed when it's needed.

First some background. Imagine you've built a JSP page which generates an HTML page with a lot of URLs on it. You construct those URLs from different parameters. You use URLEncoder since you need valid URLs regardless of the URL parameter contents you print. Once everything works fine, you realize that your URLs share many strings, so URLEncoder is called unnecessarily often. You try to optimize the situation by creating a cache around URLEncoder:

class CachedUrlEncoder {
 private static Map<String, String> encodedMap = new HashMap<String,String>();

 public String encode(String str) {
  String encodedStr = encodedMap.get(str);
  if (encodedStr == null) {
   encodedStr = URLEncoder.encode(str);
   encodedMap.put(str, encodedStr);
  }
  return encodedStr;
 }
}

(The example is not thread-safe on purpose. We are not talking about concurrency, are we? ;) Also notice that the one-parameter encode() method is deprecated, because it uses the system default encoding to encode the string.)
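Since the one-argument overload is deprecated, the cache can call the two-argument encode() with an explicit charset instead. A minimal sketch (the class name CachedUrlEncoder2 is just for illustration):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;

class CachedUrlEncoder2 {
    private static final Map<String, String> encodedMap = new HashMap<String, String>();

    public static String encode(String str) {
        String encodedStr = encodedMap.get(str);
        if (encodedStr == null) {
            try {
                // explicit charset instead of the deprecated one-argument overload
                encodedStr = URLEncoder.encode(str, "UTF-8");
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException("UTF-8 is always supported", e);
            }
            encodedMap.put(str, encodedStr);
        }
        return encodedStr;
    }
}
```

The second call for the same input returns the cached instance, which is exactly the behavior that matters for the rest of this article.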

This cache would fill up the memory very quickly; after all, it is never cleared. But there is also no particular point in time when the cache should be cleared, since the cached data never becomes outdated. Ideally, a cache should use a lot of memory when no other subsystem needs it, and free that memory as soon as it becomes required. For this purpose the Java runtime has SoftReference, WeakReference and the utility classes which use them. WeakHashMap, for example, is the perfect match for this scenario. A WeakHashMap references the values using normal hard references and the keys using weak references. As soon as a key is no longer referenced anywhere else (by a soft or hard reference), the whole entry will be freed. Here's the example rewritten to use WeakHashMap:


class CachedUrlEncoder {
 private static Map<String, String> encodedMap = new WeakHashMap<String,String>();

 public String encode(String str) {
  String encodedStr = encodedMap.get(str);
  if (encodedStr == null) {
   encodedStr = URLEncoder.encode(str);
   encodedMap.put(str, encodedStr);
  }
  return encodedStr;
 }
}

That was easy. Sadly, at runtime you will notice that this code contains a memory leak, and a not so obvious one. Let's analyse the situation.

As mentioned already, a map entry is freed as soon as its key is no longer referenced by any hard or soft reference. In our case this should happen immediately: after we've written the encoded string to the output stream of the JSP page, the string is not referenced any more. Still we're experiencing a memory leak. As so often, the devil is in the details. The Sun implementation of URLEncoder.encode() optimizes by returning the very reference it received when there is no encoding work to do. This is clever and saves resources, but here it bites us badly: if encode() returns the same reference, the code calls Map.put() with the same reference as key and value. Effectively it looks like:

encodedMap.put(str, str);

After that line the map has an entry with a weak reference to str as key and a hard reference to str as value. The entry itself keeps the key strongly reachable and thereby prevents it from ever being garbage collected!

That's mean.
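The effect can be observed directly. The following sketch compares a self-referential entry with a normal one; it assumes that System.gc() actually clears weak references, which the JVM does not guarantee (hence the retry loop), so treat it as a demonstration, not production code:

```java
import java.util.Map;
import java.util.WeakHashMap;

class WeakSelfReferenceDemo {
    // returns the map sizes after dropping the keys and requesting GC:
    // [size of map with value == key, size of map with independent value]
    static int[] run() {
        Map<String, String> selfRef = new WeakHashMap<String, String>();
        Map<String, String> normal  = new WeakHashMap<String, String>();

        String key1 = new String("leaky"); // new String() avoids the interned literal
        selfRef.put(key1, key1);           // value holds a hard reference to the key
        String key2 = new String("fine");
        normal.put(key2, "encoded");       // value is independent of the key

        key1 = null;                       // drop our strong references
        key2 = null;

        for (int i = 0; i < 50 && normal.size() > 0; i++) {
            System.gc();                   // hint only; usually clears weak refs in HotSpot
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return new int[] { selfRef.size(), normal.size() };
    }
}
```

The self-referential map keeps its entry forever, while the normal map drops its entry as soon as the collector runs.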

The fix is simple once one knows the cause: we render the optimization of URLEncoder useless:

class CachedUrlEncoder {
 private static Map<String, String> encodedMap = new WeakHashMap<String,String>();

 public String encode(String str) {
  String encodedStr = encodedMap.get(str);
  if (encodedStr == null) {
   encodedStr = URLEncoder.encode(str);
   if (str == encodedStr) {
    encodedStr = new String(encodedStr);
   }
   encodedMap.put(str, encodedStr);
  }
  return encodedStr;
 }
}

Luckily, in older JDKs the implementation of the String copy constructor was also smart: it did not copy the char data, but created a new string object referencing the same char array as the old string, which is safe since strings are immutable. (Since Java 7u6, new String(String) does copy the backing array.) Either way, the fix creates some overhead, but not much.
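The fix works because new String(str) yields a distinct object that still compares equal, so the WeakHashMap key and value are no longer the same reference. A tiny sketch (the helper name is hypothetical):

```java
class StringCopyDemo {
    // the copy is a different object but has the same content and hash code,
    // so it behaves identically as a map value
    static boolean distinctButEqual(String s) {
        String copy = new String(s);
        return copy != s && copy.equals(s) && copy.hashCode() == s.hashCode();
    }
}
```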

Monday, December 16, 2013

An Executor is not a Thread - or: correct ThreadPoolExecutor error handling

Java 1.5 introduced the Executor framework. In summary: if you have a number of tasks (let's say 20) and you want them to be processed in parallel by a couple of threads (let's say 6), then an Executor is the solution to look for. But there are some surprising caveats which may lead to problems, and they are hard to diagnose.

Introduction

Executor is an interface. The ExecutorService interface extends Executor. One of the standard implementations of ExecutorService is ThreadPoolExecutor. It manages a pool of threads and executes the tasks you give it, which you supply in the form of Runnable instances. A typical Runnable implementation looks like this:

public class MyWorker implements Runnable {
    private final Object data;

    public MyWorker(final Object data) {
        this.data = data;
    }

    @Override
    public void run() {
        process();
    }

    private void process() {
        // process data
    }
}

You initialize a ThreadPoolExecutor like this:

int nThreads = 8;
Executor executor = new ThreadPoolExecutor(nThreads, nThreads, 0, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());

And since this is so cumbersome, there is a helper class for that:

int nThreads = 8;
Executor executor = Executors.newFixedThreadPool(nThreads);

This is how to use an Executor wrapped in a method:

public void executeRunnables(final List<Runnable> runnables) {
    int nThreads = 8;
    Executor executor = Executors.newFixedThreadPool(nThreads);
    for (final Runnable command : runnables) {
        executor.execute(command);
    }
}

This method returns to the caller almost immediately: execute() only hands the task over, it does not wait for it to finish. But usually you want to wait until all the tasks have been processed before returning to the caller. For this purpose ExecutorService defines the methods shutdown() and awaitTermination():

public void executeRunnables(final List<Runnable> runnables) throws InterruptedException {
    final int nThreads = 8;
    final ExecutorService executor = Executors.newFixedThreadPool(nThreads);

    for (final Runnable command : runnables) {
        executor.execute(command);
    }

    executor.shutdown();

    executor.awaitTermination(2, TimeUnit.SECONDS);
}

(Notice the change of the executor type from Executor to ExecutorService.) shutdown() puts the executor into "finish your work" mode; in this mode the executor will not accept new tasks. awaitTermination() waits until all threads have processed all tasks. The timeout ensures that your program doesn't wait forever if something goes wrong (awaitTermination() returns false if the timeout elapses before termination).

Error handling

So what if something goes wrong? What if one of your tasks throws an exception? How do you handle that? How do you even know something went wrong? One possible approach is to catch and collect all exceptions inside the Runnable implementation:

public class MyRunnable implements Runnable {
    private final Object            data;
    private final List<Exception>   exceptions;

    public MyRunnable(final Object data, final List<Exception> exceptions) {
        this.data = data;
        this.exceptions = exceptions;
    }

    @Override
    public void run() {
        try {
            process();
        } catch (final Exception ex) {
            exceptions.add(ex);
        }
    }

    private void process() {
        // process data
    }
}

But this approach has two major drawbacks. First, it would also catch the InterruptedException which may be used to control the thread under normal conditions. Second, the responsibility for error handling is moved into each Runnable implementation; you are going to implement a lot of Runnables, and it's easy to forget something. Executor error handling calls for a generic solution.

Java 1.5 added a method to register exception handlers for exactly this purpose: Thread.setUncaughtExceptionHandler(). Exceptions which are not handled by the run() method of the thread will be forwarded to this handler. Let's modify the above example to use an exception handler:

/**
 * Implementation of an UncaughtExceptionHandler which stores all exceptions in a List.
 */
private static class ExceptionCollector implements UncaughtExceptionHandler {
    final List<Throwable> exceptions = Collections.synchronizedList(new LinkedList<Throwable>());

    @Override
    public void uncaughtException(final Thread t, final Throwable e) {
        exceptions.add(e);
    }
}

/**
 * A ThreadFactory which registers an UncaughtExceptionHandler.
 */
private static class ThreadWithUncaughtExHandlerFactory implements ThreadFactory {
    private final UncaughtExceptionHandler    exHandler;

    public ThreadWithUncaughtExHandlerFactory(final UncaughtExceptionHandler exHandler) {
        this.exHandler = exHandler;
    }

    @Override
    public Thread newThread(final Runnable r) {
        final Thread t = new Thread(r);
        t.setUncaughtExceptionHandler(exHandler);
        return t;
    }
}

public void executeRunnables(final List<Runnable> runnables) throws InterruptedException {
    final int nThreads = 8;
    // UncaughtExceptionHandler which will collect exceptions:
    final ExceptionCollector exHandler = new ExceptionCollector();
    // create an executor with a custom ThreadFactory:
    final ExecutorService executor = Executors.newFixedThreadPool(nThreads,
                                       new ThreadWithUncaughtExHandlerFactory(exHandler));

    for (final Runnable command : runnables) {
        executor.execute(command);
    }

    executor.shutdown();

    executor.awaitTermination(2, TimeUnit.SECONDS);

    if (exHandler.exceptions.size() > 0) {
        // rethrow first of the collected exceptions
        throw new RuntimeException(exHandler.exceptions.size() +
                    " exceptions occurred. First exception:", exHandler.exceptions.get(0));
    }
}


Now this looks good! If an exception is not handled by a Runnable implementation, the thread will receive it and pass it on to the uncaught exception handler, which stores it in the list for later use. executeRunnables() waits until all threads are done with their work and then checks the exception list. If there is an entry, it is wrapped in a RuntimeException and rethrown. Instead of just passing the first exception, it's also possible to attach the whole exception list to the thrown exception. This way the caller will be notified.

Or won't it?

Well, this article would probably not exist if everything were as simple as it looks. The truth is: it doesn't work. At least not always. Which makes it even worse: sometimes it works and sometimes it doesn't.

Problem analysis

A thread can, in general, process only one Runnable: when the Thread.run() method exits, the thread dies. The ThreadPoolExecutor implements a trick to make a thread process multiple Runnables: it uses its own Runnable implementation. The threads are started with a Runnable implementation (Worker) which fetches other Runnables (your Runnables) from the ExecutorService and executes them: ThreadPoolExecutor -> Thread -> Worker -> YourRunnable. When an uncaught exception occurs in your Runnable implementation, it ends up in the finally block of Worker.run(). In this finally block the Worker class tells the ThreadPoolExecutor that it has "finished" the work. The exception has not yet arrived at the Thread class, but the ThreadPoolExecutor has already registered the worker as idle.

And here's where the fun begins. The awaitTermination() method is invoked once all Runnables have been passed to the executor. This happens so quickly that probably none of the Runnables have finished their work yet. A Worker switches to "idle" when an exception occurs, before the exception reaches the Thread class. If the situation is similar for the other threads (or if they have finished their work), all Workers signal "idle", and awaitTermination() returns. The main thread reaches the line where it checks the size of the collected exception list, and this may happen before any (or some) of the threads had the chance to call the UncaughtExceptionHandler. It depends on the order of execution whether, or how many, exceptions are added to the list of uncaught exceptions before the main thread reads it.

A very unexpected behaviour. But I won't leave you without a working solution. So let's make it work.

Correct solution

We are lucky that the ThreadPoolExecutor class was designed for extensibility. There is an empty protected method afterExecute(Runnable r, Throwable t), which is invoked directly after the run() method of our Runnable, before the worker signals that it has finished the work. The correct solution is to extend ThreadPoolExecutor to handle uncaught exceptions:

public class ExceptionAwareThreadPoolExecutor extends ThreadPoolExecutor {
    private final List<Throwable> uncaughtExceptions =
                    Collections.synchronizedList(new LinkedList<Throwable>());

    // ThreadPoolExecutor has no default constructor, so the subclass must
    // provide one which forwards the pool configuration
    public ExceptionAwareThreadPoolExecutor(final int nThreads) {
        super(nThreads, nThreads, 0, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());
    }

    @Override
    protected void afterExecute(final Runnable r, final Throwable t) {
        super.afterExecute(r, t);
        if (t != null) uncaughtExceptions.add(t);
    }

    public List<Throwable> getUncaughtExceptions() {
        return Collections.unmodifiableList(uncaughtExceptions);
    }
}
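Putting it together, here is a self-contained sketch of how the extended executor might be used. The pool parameters mirror what Executors.newFixedThreadPool() uses; the demo class and its run() helper are assumptions of this sketch, not part of the original article:

```java
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ExceptionAwareDemo {
    static class ExceptionAwareThreadPoolExecutor extends ThreadPoolExecutor {
        private final List<Throwable> uncaughtExceptions =
                Collections.synchronizedList(new LinkedList<Throwable>());

        public ExceptionAwareThreadPoolExecutor(int nThreads) {
            super(nThreads, nThreads, 0L, TimeUnit.MILLISECONDS,
                  new LinkedBlockingQueue<Runnable>());
        }

        @Override
        protected void afterExecute(Runnable r, Throwable t) {
            super.afterExecute(r, t);
            if (t != null) uncaughtExceptions.add(t);
        }

        public List<Throwable> getUncaughtExceptions() {
            return Collections.unmodifiableList(uncaughtExceptions);
        }
    }

    // runs nTasks runnables of which nFailing throw; returns how many
    // exceptions were collected (or -1 if we were interrupted while waiting)
    static int run(int nTasks, final int nFailing) {
        ExceptionAwareThreadPoolExecutor executor = new ExceptionAwareThreadPoolExecutor(4);
        for (int i = 0; i < nTasks; i++) {
            final boolean fail = i < nFailing;
            executor.execute(new Runnable() {
                @Override public void run() {
                    if (fail) throw new RuntimeException("task failed");
                }
            });
        }
        executor.shutdown();
        try {
            // afterExecute() runs before a worker picks up the next task, so
            // once the pool has terminated, all exceptions have been collected
            executor.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
        return executor.getUncaughtExceptions().size();
    }
}
```

Unlike the UncaughtExceptionHandler variant, this count is reliable: the collection happens before the worker reports itself idle, so there is no race with awaitTermination().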




Sunday, December 15, 2013

Resolving circular dependencies in C++

I have stumbled several times upon situations where I created a circular dependency between two classes. Probably every C++ developer has. A circular dependency is bad design in general: it increases coupling between classes and thus makes changes more difficult (Circular dependency - Wikipedia). But some languages (for example Java) let you create classes which depend on each other without compiler errors, so circular dependencies are often created and used without anyone noticing. Not so in C++. The strict processing order of C++ compilers will spill out a bunch of errors when you try to make two classes depend on each other.
But you surely had a reason for this attempt. So how do you achieve your aim while avoiding a circular dependency? See this example of a thread-safe, usage-based, load-balanced object pool implementation:

// File: ObjectPool.h

#pragma once

#include <list>
#include <stdlib.h>
using namespace std;

// provides basic locking capabilities
#include "Lock.h"

// object type supplied by the pool
#include "Element.h"

class ObjectPool {
private:
 list<Element*> elements_;
 Lock lock_;

public:
 ObjectPool(size_t size) {
  for (size_t i = 0; i < size; i++) {
   elements_.push_back(new Element());
  }
 }

 Element *take() {
  // acquires the pool lock. AutoLockAndRelease will release the lock when scope is left
  AutoLockAndRelease autoLock(lock_);

  Element *chosen = getLeastUsedElement();
  chosen->incUsageCount();
  return chosen;
 }

 void release(Element *element) {
  AutoLockAndRelease autoLock(lock_);

  assertThisIsOurElement(element);
  element->decUsageCount();
 }

private:
 // checks if this element belongs to this pool
 void assertThisIsOurElement(Element *element) {
  // implementation details omitted
 }

 Element *getLeastUsedElement() {
  // implementation details omitted
 }
};




// File: Element.h

#pragma once

class Element {
private:
 int usageCount_;

public:
 Element():usageCount_(0) {
 }

 void incUsageCount() {
  usageCount_++;
 }

 void decUsageCount() {
  usageCount_--;
 }

 int getUsageCount() {
  return usageCount_;
 }
};


This is how it'd be used:

ObjectPool pool(5);
Element *element = pool.take();
// use element
pool.release(element);

I spare you the implementation details of Lock.h. Just assume it contains a lock implementation capable of synchronizing threads.

I'd like to modify this object pool and add the capability for an element to release itself. I'd like to use the pool like this:

ObjectPool pool(5);
Element *element = pool.take();
// use element
element->release();

Here are the code changes implementing this feature:

// File: ObjectPool.h

 ...
public:
 ObjectPool(size_t size) {
  for (size_t i = 0; i < size; i++) {
   elements_.push_back(new Element(*this));
  }
 }

 ...




// File: Element.h
...
class Element {
private:
 int usageCount_;
 ObjectPool &parent_;

public:
 Element(ObjectPool &parent):usageCount_(0), parent_(parent) {
 }

 ...

 void release() {
  parent_.release(this);
 }
};

This example suffers from two problems:
  1. A very subtle problem: the reference to the ObjectPool instance is made available to the Element instance before the constructor of the ObjectPool class has finished initializing the instance. This is known as an early reference leak and should be avoided (Java theory and practice: Safe construction techniques, Should you use the this pointer in the constructor?).
  2. The obvious one: The example doesn't compile. My compiler issues the error "'ObjectPool' does not name a type". This arises from the attempt to create a circular dependency: ObjectPool depends on Element, and Element depends on ObjectPool.
Let's first have a look at what we are actually trying to achieve here. The ObjectPool class manages a specified number of elements. For each element a usage count is maintained. The user calls take() to receive the element with the lowest usage count. The ObjectPool class uses a lock to synchronize take() and release() invocations so it can be used safely in a threaded context. Once the user is done using an element, they invoke release() on the element, which in turn invokes release() on its parent, which performs a synchronized release. So how do we resolve the circular dependency while preserving the logic?

The solution is to decouple the ObjectPool and the Element class. I'll show here how to decouple both using an interface. The idea is that the Element class doesn't have to know the whole implementation of the ObjectPool class; it only needs to know the release() method. So we create an interface containing the release method and let ObjectPool implement it. Then we tell Element only about the interface, but not about the ObjectPool class.

Here is the interface:

// File: Releaser.h

#pragma once

template<typename T>
class Releaser {
public:
 virtual ~Releaser() {};
 virtual void release(T *element) = 0;
};

It defines a single release method with a template type. The template type is essential here: without it, the interface would need to include the Element type to use it as a parameter type, which would again create a circular dependency: ObjectPool -> Releaser -> Element -> Releaser. Using the template trick we break the cycle. Here are the modified implementations of ObjectPool and Element.

// File: ObjectPool.h

#pragma once

#include <list>
#include <stdlib.h>
using namespace std;

#include "Lock.h"
#include "Element.h"
#include "Releaser.h"

class ObjectPool: public Releaser<Element> {
private:
 list<Element*> elements_;
 Lock lock_;

public:
 ObjectPool(size_t size) {
  for (size_t i = 0; i < size; i++) {
   elements_.push_back(new Element(*this));
  }
 }

 Element *take() {
  AutoLockAndRelease autoLock(lock_);

  Element *chosen = getLeastUsedElement();
  chosen->incUsageCount();
  return chosen;
 }

 void release(Element *element) {
  AutoLockAndRelease autoLock(lock_);

  assertThisIsOurElement(element);
  element->decUsageCount();
 }

private:
 void assertThisIsOurElement(Element *element) {
  // implementation details omitted
 }

 Element *getLeastUsedElement() {
  // implementation details omitted
 }
};



// File: Element.h

#pragma once

#include "Releaser.h"

class Element {
private:
 int usageCount_;
 Releaser<Element> &releaser_;

public:
 Element(Releaser<Element> &releaser):usageCount_(0), releaser_(releaser) {
 }

 void incUsageCount() {
  usageCount_++;
 }

 void decUsageCount() {
  usageCount_--;
 }

 int getUsageCount() {
  return usageCount_;
 }

 void release() {
  releaser_.release(this);
 }
};


Now Element only references Releaser, and the dependency cycle is broken.
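To see the decoupled design in action, here is a single-file sketch that compiles and runs. Locking is omitted and getLeastUsedElement() is implemented as a simple linear scan; both are assumptions of this sketch, and the early-reference-leak issue from problem 1 is left in place, as in the article:

```cpp
#include <cassert>
#include <cstddef>
#include <list>

template <typename T>
class Releaser {
public:
    virtual ~Releaser() {}
    virtual void release(T *element) = 0;
};

class Element {
    int usageCount_;
    Releaser<Element> &releaser_;
public:
    explicit Element(Releaser<Element> &releaser) : usageCount_(0), releaser_(releaser) {}
    void incUsageCount() { usageCount_++; }
    void decUsageCount() { usageCount_--; }
    int getUsageCount() const { return usageCount_; }
    void release() { releaser_.release(this); }
};

class ObjectPool : public Releaser<Element> {
    std::list<Element*> elements_;
public:
    explicit ObjectPool(size_t size) {
        for (size_t i = 0; i < size; i++)
            elements_.push_back(new Element(*this)); // early reference leak, see problem 1
    }
    ~ObjectPool() {
        for (std::list<Element*>::iterator it = elements_.begin(); it != elements_.end(); ++it)
            delete *it;
    }
    Element *take() {
        // linear scan for the least used element (minimal stand-in implementation)
        Element *chosen = 0;
        for (std::list<Element*>::iterator it = elements_.begin(); it != elements_.end(); ++it)
            if (chosen == 0 || (*it)->getUsageCount() < chosen->getUsageCount())
                chosen = *it;
        chosen->incUsageCount();
        return chosen;
    }
    virtual void release(Element *element) { element->decUsageCount(); }
};
```

Element sees only the Releaser<Element> base, so the two headers could now live in separate files without any cycle.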

Wednesday, November 20, 2013

Linux Process Memory Layout

This article describes what the memory layout of a Linux process looks like.
Each Linux process starts with several memory blocks: a code block holding the executable code of the program, stack blocks (one for each thread) and several data blocks (one for the constants of the program, one for dynamic usage). Sometimes those blocks are called segments. In the case of the data blocks, we'll call them arenas in this article. The initial data block for dynamic usage is called the main arena. You'll see it named "[heap]" in the contents of /proc/xxx/maps of any Linux process (replace xxx by the PID).

Basics


A Linux program can use the system functions brk()/sbrk() to change the size of its main arena. It can also use mmap() to get new arenas from the system. But usually programs use a memory management library instead of the system functions. The memory management library used by default is the libc library. One can override this, but usually all Linux processes use it. It exports functions like malloc(), realloc(), calloc() and free() to the program. The program uses those to allocate and free memory, and the memory management library itself uses brk(), sbrk() and mmap() to allocate the memory from the system.

libc creates data structures inside the arenas to split each arena into smaller blocks. We'll call these structures "heaps". So arenas are huge blocks issued by the system to libc; heaps are the data structures inside these arenas (yes, there are several heaps) used to manage smaller blocks.

Java/JNI memory layout

But a process doesn't have to use the libc functions. Java, for example, has its own memory management functions. Java starts just like every other Linux process, but then it uses mmap() to allocate memory for its own Java heap (and for the other Java memory areas, like the PermGen space for classes or the code cache for the JIT compiler) according to the size settings ("-Xmx" and others). All the memory used by Java objects and classes is *not* managed by libc. But Java also has parts which are implemented in native (JNI) code. If this native code requires memory, it uses libc functions. The same goes for (JNI) libraries loaded by Java: just like the Java native part, they also use libc functions.

Difference vss/rss

When using Linux tools which report memory usage (ps/top), you will stumble upon two TLAs: vss and rss (aliases: vsz, rsz). They mean:
  • Virtual Set Size 
  • Resident Set Size 
They name two different memory allocation types. vss is the reserved address space; rss is the physically allocated memory (simplified, see "Details For The Hard Core Developer 1"). One could (theoretically) allocate 16,777,216 terabytes of address space without using even one byte of physical memory. Only when the program asks the system to allocate physical memory is physical memory actually allocated (simplified, see "Details For The Hard Core Developer 2"). How does the system distinguish between address space allocation and physical memory allocation? When the process allocates memory from the system, it passes access permissions to mmap():


// mmap() syntax: void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

void *block1;
// this will reserve 200 bytes without any access permissions at a starting address chosen by the system
block1 = mmap(NULL, 200, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

This is just an address space reservation. The address space starting at block1 and spanning 200 bytes will not be handed out by any other mmap() call. How does a process allocate physical memory? By changing the access permissions:

// syntax of mprotect(): int mprotect(const void *addr, size_t len, int prot);
mprotect(block1, 100, PROT_READ | PROT_WRITE); 

After this invocation the first 100 bytes will be available for read and write. The remaining 100 bytes will still be just reserved address space.

But why do we need to reserve address space at all? If you are curious, see "Details For The Hard Core Developer 3".

How libc Manages Memory

As already said, the system allocates the main arena block for each process at process startup. libc creates a heap data structure inside this main arena block: the main heap. libc supports multi-threaded programs, so invoking the memory functions in parallel is possible. Like any other program, libc has to use locking to guarantee data structure integrity while several threads call libc functions at the same time. To increase performance, libc tries to reduce lock contention by having one lock per heap and by creating more heaps. When a memory request arrives at libc, it tries to lock the heap last used by the requesting thread. If the thread has not used any heap yet, libc tries to lock the main heap. If locking fails, libc tries the next existing heap. If all heaps have been tried, libc creates a new heap. A heap is created by requesting a new arena from the system (mmap()) and writing heap structures into it:

void *arena;
// reserves 64MB of address space. libc heaps always start at 64MB size.
arena = mmap(NULL, 64*1024*1024, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// allocate the necessary memory at the beginning of the heap
mprotect(arena, 1000, PROT_READ | PROT_WRITE);
// after this invocation libc can use the first 1000 bytes of the reserved address space

Once a heap has been locked successfully, libc tries to find a free block which satisfies the requested size. If a block is found, it is returned. If no block is found, libc tries to increase the size of the heap. It can do this in two ways.
  1. Allocating bytes in the reserved address space (by calling mprotect() to increase the area which has read/write permissions in the arena). This can only succeed if there is still room to expand.
  2. Increasing the address space and allocating bytes there (by calling mremap(), which assigns more address space; heaps are resized in 64MB steps). This can only succeed if the address space bordering the end of the current arena is still free.
If both fail (arena is full, arena cannot be resized) a new heap will be created. So heaps are created in two cases: to satisfy memory needs, to reduce lock contention.

How can we use this knowledge to optimize the memory footprint of our program? Let's assume we have a process with 100 threads, and let's assume there is heavy load, so all 100 threads call into libc at the same time. In this case libc will create 100 heaps, each 64MB in size. (vss will instantly increase to 6.4GB, but rss will remain low, because the physical memory is not allocated yet.) That doesn't sound so bad yet (rss is still low). But let's continue this horror scenario. Assume the heavy load led to full exhaustion of the available memory: all 100 heaps are full. Then the load stops and the used memory is freed. Nearly everything is freed; just some small blocks remain (which, for example, are used as thread-local buffers). Unluckily these buffers were allocated late, so they lie at the "end" of each heap. The heap is nearly empty, but at its end sits this small used block. libc is capable of shrinking heaps to return memory to the system: if the free block at the end of a heap is big enough, libc shrinks the heap. But since there is a used block at the end, libc cannot do this; a heap cannot have gaps. So even though the memory is not used by the program, the rss and vss of the program will remain at 6.4GB.

So what do we learn from this? Don't use more threads than necessary. After all, not all of those threads can run truly in parallel; the system is limited to a much smaller number of CPU cores. Often waiting for resources is compensated for by having more threads. If it's network resources, try to switch to asynchronous processing so a thread can do other work while the network interface is processing data. This way you can reduce the number of required threads. The other thing we learned: if you use long-lived buffers, do free them occasionally. (This only applies to unmanaged languages like C; Java is able to move its memory blocks to reduce fragmentation.)

/proc/xxx/maps

Everyone can have a look at the blocks issued by the system to a program. The file maps in the proc folder of each process contains this information. The contents look like this (example of a bash process):

00400000-004e1000 r-xp 00000000 fc:00 524291                             /bin/bash
006e0000-006e1000 r--p 000e0000 fc:00 524291                             /bin/bash
006e1000-006ea000 rw-p 000e1000 fc:00 524291                             /bin/bash
006ea000-006f0000 rw-p 00000000 00:00 0
00ec7000-01224000 rw-p 00000000 00:00 0                                  [heap]
7fc285b08000-7fc285b14000 r-xp 00000000 fc:00 6419                       /lib/x86_64-linux-gnu/libnss_files-2.15.so
7fc285b14000-7fc285d13000 ---p 0000c000 fc:00 6419                       /lib/x86_64-linux-gnu/libnss_files-2.15.so
7fc285d13000-7fc285d14000 r--p 0000b000 fc:00 6419                       /lib/x86_64-linux-gnu/libnss_files-2.15.so
7fc285d14000-7fc285d15000 rw-p 0000c000 fc:00 6419                       /lib/x86_64-linux-gnu/libnss_files-2.15.so
7fc285d15000-7fc285d1f000 r-xp 00000000 fc:00 5919                       /lib/x86_64-linux-gnu/libnss_nis-2.15.so
7fc285d1f000-7fc285f1f000 ---p 0000a000 fc:00 5919                       /lib/x86_64-linux-gnu/libnss_nis-2.15.so
7fc285f1f000-7fc285f20000 r--p 0000a000 fc:00 5919                       /lib/x86_64-linux-gnu/libnss_nis-2.15.so
7fc285f20000-7fc285f21000 rw-p 0000b000 fc:00 5919                       /lib/x86_64-linux-gnu/libnss_nis-2.15.so
7fc285f21000-7fc285f38000 r-xp 00000000 fc:00 6429                       /lib/x86_64-linux-gnu/libnsl-2.15.so
7fc285f38000-7fc286137000 ---p 00017000 fc:00 6429                       /lib/x86_64-linux-gnu/libnsl-2.15.so
7fc286137000-7fc286138000 r--p 00016000 fc:00 6429                       /lib/x86_64-linux-gnu/libnsl-2.15.so
7fc286138000-7fc286139000 rw-p 00017000 fc:00 6429                       /lib/x86_64-linux-gnu/libnsl-2.15.so
7fc286139000-7fc28613b000 rw-p 00000000 00:00 0
7fc28613b000-7fc286143000 r-xp 00000000 fc:00 6433                       /lib/x86_64-linux-gnu/libnss_compat-2.15.so
7fc286143000-7fc286342000 ---p 00008000 fc:00 6433                       /lib/x86_64-linux-gnu/libnss_compat-2.15.so
7fc286342000-7fc286343000 r--p 00007000 fc:00 6433                       /lib/x86_64-linux-gnu/libnss_compat-2.15.so
7fc286343000-7fc286344000 rw-p 00008000 fc:00 6433                       /lib/x86_64-linux-gnu/libnss_compat-2.15.so
7fc286344000-7fc2867c2000 r--p 00000000 fc:00 531094                     /usr/lib/locale/locale-archive
7fc2867c2000-7fc286977000 r-xp 00000000 fc:00 6434                       /lib/x86_64-linux-gnu/libc-2.15.so
7fc286977000-7fc286b76000 ---p 001b5000 fc:00 6434                       /lib/x86_64-linux-gnu/libc-2.15.so
7fc286b76000-7fc286b7a000 r--p 001b4000 fc:00 6434                       /lib/x86_64-linux-gnu/libc-2.15.so
7fc286b7a000-7fc286b7c000 rw-p 001b8000 fc:00 6434                       /lib/x86_64-linux-gnu/libc-2.15.so
7fc286b7c000-7fc286b81000 rw-p 00000000 00:00 0
7fc286b81000-7fc286b83000 r-xp 00000000 fc:00 6432                       /lib/x86_64-linux-gnu/libdl-2.15.so
7fc286b83000-7fc286d83000 ---p 00002000 fc:00 6432                       /lib/x86_64-linux-gnu/libdl-2.15.so
7fc286d83000-7fc286d84000 r--p 00002000 fc:00 6432                       /lib/x86_64-linux-gnu/libdl-2.15.so
7fc286d84000-7fc286d85000 rw-p 00003000 fc:00 6432                       /lib/x86_64-linux-gnu/libdl-2.15.so
7fc286d85000-7fc286da9000 r-xp 00000000 fc:00 340                        /lib/x86_64-linux-gnu/libtinfo.so.5.9
7fc286da9000-7fc286fa8000 ---p 00024000 fc:00 340                        /lib/x86_64-linux-gnu/libtinfo.so.5.9
7fc286fa8000-7fc286fac000 r--p 00023000 fc:00 340                        /lib/x86_64-linux-gnu/libtinfo.so.5.9
7fc286fac000-7fc286fad000 rw-p 00027000 fc:00 340                        /lib/x86_64-linux-gnu/libtinfo.so.5.9
7fc286fad000-7fc286fcf000 r-xp 00000000 fc:00 6422                       /lib/x86_64-linux-gnu/ld-2.15.so
7fc2871b3000-7fc2871c1000 r--p 00000000 fc:00 265459                     /usr/share/locale-langpack/de/LC_MESSAGES/bash.mo
7fc2871c1000-7fc2871c4000 rw-p 00000000 00:00 0
7fc2871c6000-7fc2871cd000 r--s 00000000 fc:00 535727                     /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
7fc2871cd000-7fc2871cf000 rw-p 00000000 00:00 0
7fc2871cf000-7fc2871d0000 r--p 00022000 fc:00 6422                       /lib/x86_64-linux-gnu/ld-2.15.so
7fc2871d0000-7fc2871d2000 rw-p 00023000 fc:00 6422                       /lib/x86_64-linux-gnu/ld-2.15.so
7fff98beb000-7fff98c0c000 rw-p 00000000 00:00 0                          [stack]
7fff98d48000-7fff98d49000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

The columns are:
  • Address range of the block
  • Access permissions
  • Offset into the file, if the block is a memory-mapped file
  • Device number (of the mapped file)
  • Inode (of the mapped file)
  • Name of the block or name of the file (if mapped)

    The tool pmap can display the same information in a more readable format. For identifying individual blocks, see "Details For The Hard Core Developer", point 4.
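As a quick illustration, a single maps line can be split into these columns programmatically. This is a minimal Java sketch of my own (not part of the tool described below), assuming the standard maps line layout:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Splits one /proc/PID/maps line into the columns listed above:
// address range, permissions, offset, device, inode, optional path.
public class MapsLine {
    private static final Pattern LINE = Pattern.compile(
        "([0-9a-f]+)-([0-9a-f]+)\\s+(\\S+)\\s+([0-9a-f]+)\\s+(\\S+)\\s+(\\d+)\\s*(.*)");

    // Size of the block in bytes (end address minus start address).
    public static long sizeOf(String mapsLine) {
        Matcher m = LINE.matcher(mapsLine);
        if (!m.matches()) throw new IllegalArgumentException(mapsLine);
        long start = Long.parseLong(m.group(1), 16);
        long end   = Long.parseLong(m.group(2), 16);
        return end - start;
    }

    // Access permissions, e.g. "rw-p" or "---p".
    public static String permsOf(String mapsLine) {
        Matcher m = LINE.matcher(mapsLine);
        if (!m.matches()) throw new IllegalArgumentException(mapsLine);
        return m.group(3);
    }
}
```

For the libnsl line above, sizeOf() returns 8192 bytes (two 4k pages).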

    Tools

    I've written a tool that reads the maps of a Java process and guesses the role of the different blocks. The tool is committed here: javaJniMemUsage.pl. It runs non-intrusively on a server with very limited prerequisites (only Perl and Linux).

    javaJniMemUsage.pl [OPTIONS] PID|proc-maps-file
      OPTIONS
       -c - print output as CSV
       -h - print CSV header line before data line
      PID - process id of the process whose memory maps should be read.
      proc-maps-file - a file containing memory maps information; /proc/PID/maps can be used directly.
    

    One can invoke the tool with a PID, in which case it will read /proc/PID/maps, or with a file containing a maps dump. Without any other parameters the tool dumps human-readable output.


    Here is a sample output of a busy java application server:

    7f2096bf9000 -      5066752 (   4M 852K), rw-p, 0,
    7f20973b6000 -      5066752 (   4M 852K), rw-p, 0,
    7f2097b73000 -      5066752 (   4M 852K), rw-p, 0,
    7f2098330000 -      5066752 (   4M 852K), rw-p, 0,
    7f2098aed000 -      5066752 (   4M 852K), rw-p, 0,
    7f20bc2fc000 -      5066752 (   4M 852K), rw-p, 0,
    7f20d8327000 -      5066752 (   4M 852K), rw-p, 0,
    7f20f8327000 -      5066752 (   4M 852K), rw-p, 0,
    7f21161fb000 -      2703360 (   2M 592K), rw-p, 0,
    7f2116778000 -      2703360 (   2M 592K), rw-p, 0,
    7f2116cf5000 -      2703360 (   2M 592K), rw-p, 0,
    7f2117272000 -      2703360 (   2M 592K), rw-p, 0,
    7f21177ef000 -      2703360 (   2M 592K), rw-p, 0,
    7f2117d6c000 -      2703360 (   2M 592K), rw-p, 0,
    7f211c551000 -      2703360 (   2M 592K), rw-p, 0,
    7f211cce3000 -      2703360 (   2M 592K), rw-p, 0,
    7f21205d6000 -         8192 (        8K), rw-p, 0,
    7f21211ef000 -        12288 (       12K), ---p, 0,
    7f21211f2000 -      2088960 (  1M 1016K), rw-p, 0, [stack:28219]
    7f2121ed4000 -        16384 (       16K), rw-p, 0,
    7f2122182000 -         4096 (        4K), rw-p, 0,
    7f2122604000 -         8192 (        8K), rw-p, 0,
    7f2122f6f000 -         8192 (        8K), rw-p, 0,
    7f21236c6000 -        12288 (       12K), rw-p, 0,
    7f2123b89000 -       151552 (      148K), rw-p, 0,
    7f2128000000 -         4096 (        4K), r--p, 0,
    7f212801f000 -      1540096 (   1M 480K), rw-p, 0,
    7f2128d82000 -         4096 (        4K), ---p, 0,
    7f2128d83000 -     10731520 (  10M 240K), rw-p, 0, [stack:28198]
    7f212a058000 -         4096 (        4K), ---p, 0,
    7f212a059000 -      1208320 (   1M 156K), rw-p, 0, [stack:28190]
    7f212a180000 -       118784 (      116K), ---p, 0,
    7f212a19d000 -       770048 (      752K), rw-p, 0,
    7f212a259000 -      8003584 (   7M 648K), rw-p, 0,
    7f212a9fb000 -        36864 (       36K), ---p, 0,
    7f212aa04000 -       348160 (      340K), rw-p, 0,
    7f212aa59000 -       159744 (      156K), rw-p, 0,
    7f212aa80000 -       118784 (      116K), ---p, 0,
    7f212aa9d000 -       770048 (      752K), rw-p, 0,
    7f212ab59000 -      8003584 (   7M 648K), rw-p, 0,
    7f212b2fb000 -        36864 (       36K), ---p, 0,
    7f212b304000 -       348160 (      340K), rw-p, 0,
    7f212b359000 -      3526656 (   3M 372K), rw-p, 0,
    7f212b6b6000 -       667648 (      652K), ---p, 0,
    7f212b759000 -         4096 (        4K), rw-p, 0,
    7f212b75a000 -     17432576 (  16M 640K), rwxp, 0,
    7f212c7fa000 -     32899072 (  31M 384K), rw-p, 0,
    7f212f196000 -         8192 (        8K), rw-p, 0,
    7f212ff6b000 -        86016 (       84K), rw-p, 0,
    7f2130b39000 -       167936 (      164K), rw-p, 0,
    7f2130ee7000 -        20480 (       20K), rw-p, 0,
    7f213150c000 -        16384 (       16K), rw-p, 0,
    7f2131747000 -       438272 (      428K), rw-p, 0,
    7f21317b2000 -       512000 (      500K), rw-p, 0,
    7f2131837000 -        12288 (       12K), ---p, 0,
    7f213183a000 -      1060864 (    1M 12K), rw-p, 0, [stack:28189]
    7f2131942000 -         4096 (        4K), rw-p, 0,
    7f2131943000 -         4096 (        4K), r--p, 0,
    7f2131944000 -         8192 (        8K), rw-p, 0,
    7f2131948000 -         4096 (        4K), rw-p, 0,
    7fff955d3000 -       135168 (      132K), rw-p, 0, [stack]
    Java-Blocks =
           count=8
            addr=[660000000,664df0000,668650000,680000000,774220000,7755e0000,780000000,7eb830000]
             rss= 6G 108M 64K
             vsz= 6G 512M
    sizeInMappedFiles =  142M 536K
    sizeInSystemMappings =  8K
    main-arena =  1G 528M 244K
    stacks =
         1M
           count=8
             rss= 8M
             vsz= 8M 32K
      1016K
           count=72
             rss= 71M 448K
             vsz= 72M 288K
         8M
           count=179
             rss= 1G 408M
             vsz= 1G 408M 716K
    libc arenas =
       128M
           count=18
            addr=[7f1f00000000,7f1f90000000,7f1fa8000000,7f1fb0000000,7f1fc8000000,7f1fd8000000,7f1fe0000000,7f1ff8000000,7f2000000000,7f2008000000,7f2010000000,7f2030000000,7f2048000000,7f2050000000,7f2058000000,7f2064000000,7f20dc000000,7f20fc000000]
             rss= 2G 192M 556K
             vsz= 2G 256M
        64M
           count=65
            addr=[7f1ef8000000,7f1f08000000,7f1f10000000,7f1f14000000,7f1f18000000,7f1f1c000000,7f1f20000000,7f1f28000000,7f1f2c000000,7f1f30000000,7f1f34000000,7f1f38000000,7f1f3c000000,7f1f40000000,7f1f44000000,7f1f48000000,7f1f4c000000,7f1f54000000,7f1f58000000,7f1f5c000000,7f1f60000000,7f1f64000000,7f1f6c000000,7f1f74000000,7f1f78000000,7f1f7c000000,7f1f80000000,7f1f84000000,7f1f88000000,7f1f8c000000,7f1f98000000,7f1f9c000000,7f1fa0000000,7f1fa4000000,7f1fb8000000,7f1fc0000000,7f1fd0000000,7f1fd4000000,7f1fe8000000,7f1ff0000000,7f1ff4000000,7f2018000000,7f201c000000,7f2020000000,7f2024000000,7f2028000000,7f202c000000,7f2038000000,7f203c000000,7f2040000000,7f2060000000,7f206c000000,7f2070000000,7f2084000000,7f20b8000000,7f20d4000000,7f20e4000000,7f20e8000000,7f20ec000000,7f20f4000000,7f2104000000,7f2108000000,7f2110000000,7f2118000000,7f2124000000]
             rss= 3G 861M 148K
             vsz= 4G 64M
    unknown rss = 145M 616K, vsz =  146M 580K
    sum rss = 15G 275M 28K, vsz = 16G 90M 356K
    

    First it dumps a list of blocks it was not able to identify. (The format changes slightly: instead of the end address, the size of the block is printed.) The size of these blocks is summed up at the bottom as "unknown rss"; this should be a small number. After that, it dumps the categories of blocks it did identify. Each category is grouped into paragraphs of the same block size. Each size paragraph contains the number of blocks found and the total size occupied by those blocks. The total is always split into vsz (vss) and rss; vsz includes rss, which is why it is always the bigger number. Some size paragraphs also print the addresses of the identified blocks; these can usually be ignored. Finally, "sum rss" and "sum vsz" are the sums over all blocks. They should match the values reported by ps/top.

    In the sample output above, we see that the 15GB rss is mainly used by these categories:
    • Java takes 6GB, exactly as specified by the Java parameters (-Xmx6g).
    • Thread stacks consume 1.5GB!
    • And finally the native heaps consume 7.5GB, split into 83 heaps plus the main arena/heap.

      Details For The Hard Core Developer

      1. Rss includes the size of shared libraries mapped into memory. While this really uses physical memory, the memory is shared between all processes using the same shared library. So if one process loads a 5MB shared library, rss for this process will be 5MB plus the process's own stuff. If 10 processes load the same 5MB shared library, the rss of each process will be 5MB plus that process's stuff, but only 5MB of physical memory was actually spent. So rss is not exactly the physical memory size.
      2. Even if the system grants "physical" memory to the process, it does not commit that memory yet. Two processes could each allocate the size of the whole system memory (assuming there is enough swap space) without anything bad happening. The system splits the memory into pages, and only when a process starts to access the memory is the page containing the accessed address considered "dirty" and backed by physical memory.
      3. This is necessary so libc can get a huge block from the system without blocking memory that is not yet used. But why should we use libc at all and not just allocate memory using mmap() directly? There are several reasons: 1. The memory returned by mmap() is aligned at page boundaries. The page size is usually 4kB, so if we allocate 10 bytes, the rest of the page's address space (4086 bytes) is wasted. Since the system manages memory in pages, this means real memory is wasted. So we need libc to manage small memory blocks with smaller alignments. 2. There is a limited number of blocks which the system can hand out via mmap(). Since some processes use millions of memory blocks, this would exhaust that resource very quickly.
      4. We can guess three types of memory blocks in Java programs: 1. the Java blocks, 2. libc heaps, 3. stacks. Java blocks are usually at the beginning of the list and are huge. Their size equals the specified heap sizes (-Xmx, -XX:MaxPermSize), possibly split into several blocks; especially into two blocks where one has read/write access and the other no access permissions (this is the growing heap). libc heaps have a size of exactly 64MB (64*1024*1024 bytes) or a multiple of that (128MB, 192MB, ...). Each heap is also split into two blocks with different access permissions; if the heap is full, there is only one block. Stacks also have a recognizable size, usually a multiple of exactly 1MB; 1MB and 8MB stacks are the ones I've seen most often in Java programs. Stacks have one more special property: stack blocks are always preceded by a guarding block. The guarding block is the size of one page (4k) and has no access permissions. This is used to detect stack overflows. The guarding block precedes the stack block because stacks grow from higher addresses to lower addresses; if the stack overflows, the code accesses the address space of this no-access block.
      5. Further deep detail reading: http://www.blackhat.com/presentations/bh-usa-07/Ferguson/Whitepaper/bh-usa-07-ferguson-WP.pdf
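Points 3 and 4 can be illustrated with a little arithmetic. The following Java sketch is my own (helper names and thresholds are assumptions, not taken from the tool): it computes the page-alignment waste of an mmap()-style allocation and guesses a block's role from its size alone:

```java
// Numeric sketch of points 3 and 4 above (my own helpers, not the tool's code).
public class BlockHeuristics {
    static final long PAGE = 4096;          // usual page size
    static final long MB = 1024L * 1024L;

    // Point 3: mmap() returns page-aligned blocks, so a tiny allocation
    // wastes almost a whole page of address space.
    public static long wastedIfMmapped(long requested) {
        long pages = (requested + PAGE - 1) / PAGE;  // round up to whole pages
        return pages * PAGE - requested;
    }

    // Point 4: guess the role of an anonymous rw-p block from its size.
    public static String guess(long sizeBytes) {
        if (sizeBytes >= 64 * MB && sizeBytes % (64 * MB) == 0) {
            return "libc-heap";      // exactly 64MB or a multiple of it
        }
        if (sizeBytes >= MB && sizeBytes <= 16 * MB && sizeBytes % MB == 0) {
            return "thread-stack";   // typically 1MB or 8MB
        }
        return "unknown";
    }
}
```

For a 10-byte allocation, wastedIfMmapped() returns 4086, the number quoted in point 3.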

      It’s not my code! I googled it.

      Actually this blog post is not about Google, and it is not even about copying code. But it is indeed about code which I did not write. A couple of days ago we started using Sonar to check our code for coding style and convention violations. Well, this is not that new: we had already been using Checkstyle, PMD and FindBugs for a couple of years. But the switch to Sonar brought the heap of violations in our code back to my mind. Sonar says we have around 3000-5000 violations in our projects. Probably most of them are legitimate. But some of them are not:


      Here you see some of the violations found in the equals() method of the Configuration class. Nearly every line has a violation. The problem is: equals() is an automatically generated method, and coding convention violations in generated code are just useless. Generated code doesn't have to be maintainable. It doesn't have to be readable.

      I thought about how to tell Sonar to ignore this code. One could use the //NOSONAR comment to make Sonar ignore lines. But you’d have to place it on every line. Or you could use

      @SuppressWarnings("all")

      but this would suppress all warnings, not only Sonar violations (reference).

      Then I stumbled upon the @Generated annotation, which has been part of Java since 1.6. Using this annotation, code generators could automatically mark generated code, making life easier for code analyzers and developers. So in a perfect world my Eclipse code generator would generate this method:

      @Override
      @Generated("Eclipse source generator")
      public boolean equals(final Object obj) {
       if (this == obj) return true;
       if (obj == null) return false;
       if (getClass() != obj.getClass()) return false;
       final ConfigurationBase other = (ConfigurationBase) obj;
       if (autoRefreshDatabaseEnabled != other.autoRefreshDatabaseEnabled) return false;
       ....
      }

      and all code analyzers would magically ignore this method.
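One caveat: javax.annotation.Generated has SOURCE retention, so it is invisible at runtime; a real analyzer has to inspect the source or the AST. To demonstrate the skipping idea with plain reflection anyway, this sketch of mine uses its own RUNTIME-retained marker annotation (an assumption, not part of the JDK):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Sketch: how a checker could skip methods marked as generated.
public class GeneratedCheck {
    // Hypothetical RUNTIME-retained stand-in for @Generated.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface MarkedGenerated {}

    // True if the named method carries the marker and should be ignored.
    public static boolean shouldSkip(Class<?> cls, String methodName, Class<?>... params) {
        try {
            Method m = cls.getMethod(methodName, params);
            return m.isAnnotationPresent(MarkedGenerated.class);
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    static class Example {
        @MarkedGenerated
        @Override
        public boolean equals(Object obj) { return this == obj; }
    }
}
```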

      This idea has already been picked up by the Sonar team but is sadly not yet implemented.

      Samstag, 5. Oktober 2013

      Build Fennec (Firefox Mobile) - for the first time

      My first time compiling Fennec on Windows, and my first time compiling Fennec at all. I got the Ubuntu virtual machine image from Brad Lassey's blog as described in the MozillaWiki. I updated the Android SDK, the Android NDK and the mozilla-central sources, then followed the steps to build Fennec as described in the Wiki article. But as always, IT never works out without problems.

      For the impatient: the real problem was a mismatch of the real Android SDK path and the path in mozconfig. Now the whole story:

      First, the PYTHON environment variable just had no effect. The mach python script was loaded by the old 2.6 Python interpreter. The workaround was to invoke mach using Python 2.7 directly:
      > python2.7 mach build
      This did the job and the script started the configuration step. But it didn't get far: after 10 seconds the build stopped and hung (one CPU core busy). It seemed like an unusual step to hang at. The process list revealed an ld.gold process which refused to finish. Looking deeper showed that /usr/bin/ld.gold is a symbolic link to /usr/bin/hardened-ld, which itself is a Perl script. Analyzing the script revealed that hardened-ld tries to resolve all symbolic links until the link points to hardened-ld. Then it appends ".real" to the name and checks if that file exists. If it does, it executes the file; if it does not, it executes the last resolved name:
      while (-l $tool && ($link = resolve_link($tool)) !~ /$self$/) {
          $tool = $link;
      }
      if (-x "$tool.real") {
          $tool = "$tool.real";
      }
      So if there were a /usr/bin/my-ld pointing to /usr/bin/ld.gold, which itself pointed to /usr/bin/hardened-ld, and I executed /usr/bin/my-ld, the script would resolve the links up to /usr/bin/ld.gold, check for /usr/bin/ld.gold.real and execute it if it exists. If it does not exist, it would execute /usr/bin/ld.gold.
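The dispatch logic of that Perl snippet can be sketched in Java, with the symlink table modelled as a map (all names here are mine; only the algorithm is taken from the script above). It returns the path that would be executed:

```java
import java.util.Map;
import java.util.Set;

// Sketch of the hardened-ld dispatch: follow symlinks until the target
// is hardened-ld itself, then prefer the ".real" binary if it exists.
public class HardenedLd {
    public static String dispatch(String tool, String self,
                                  Map<String, String> links,
                                  Set<String> existingFiles) {
        // Resolve links until the next hop would be hardened-ld ($self).
        while (links.containsKey(tool) && !links.get(tool).endsWith(self)) {
            tool = links.get(tool);
        }
        // Prefer the ".real" binary if present ...
        if (existingFiles.contains(tool + ".real")) {
            return tool + ".real";
        }
        // ... otherwise execute the last resolved name. If that name is
        // itself a link back to hardened-ld, the script re-execs forever.
        return tool;
    }
}
```

With /usr/bin/ld.gold.real missing, dispatch() hands back /usr/bin/ld.gold, which is exactly the self-invocation loop described next.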
      Back to my problem. In my case /usr/bin/ld.gold was executed and /usr/bin/ld.gold.real was missing, so ld.gold just invoked ld.gold in a never-ending loop. That's why the build hung. I guess this is a bug in the VM I got from Brad Lassey. After updating the dated Ubuntu, /usr/bin/ld.gold.real appeared. However, the endless-loop problem in hardened-ld itself remains, so I filed a bug report for it.
      This step did not yet lead to a nice working build. Instead it surfaced a new error message:

      /usr/bin/ld.gold.real: fatal error: /home/mozilla/android-ndk-r8e/platforms/android-5/arch-arm/usr/lib/crtbegin_so.o: unsupported ELF machine number 40
      That's a weird error message. But searching the internet gave me this important hint: "It seams it use a native ld (for x86) to link an ARM object." Looks like the build was using the system default ld instead of the ld from the android-ndk toolchain.
      Updating the toolchain path in mozconfig gave me another error: my android-sdk path was wrong. Jesus, how I love useful error messages. When I updated the Android SDK, as recommended by the Wiki article, I just unpacked the tar. This left the SDK in a path named android-sdk-linux, but mozconfig was referencing android-sdk-linux_x86. Fixing that did the magic.
      So this was my mozconfig in the end:
      HOME=`echo $HOME`
      # Add the correct paths here:
      ac_add_options --with-android-ndk="$HOME/android-ndk-r8e"
      ac_add_options --with-android-sdk="$HOME/android-sdk-linux_x86/platforms/android-17"
      ac_add_options --with-android-version=17
      ac_add_options --with-android-tools="$HOME/android-sdk-linux_x86/tools"
      ac_add_options --with-android-toolchain="$HOME/android-ndk-r8e/toolchains/arm-linux-androideabi-4.7/prebuilt/linux-x86"
      ac_add_options --with-android-platform="$HOME/android-ndk-r8e/platforms/android-14/arch-arm"
      # android options
      ac_add_options --enable-application=mobile/android
      ac_add_options --target=arm-linux-androideabi
      ac_add_options --with-endian=little
      ac_add_options --with-ccache

      mk_add_options MOZ_OBJDIR=./objdir-droid
      mk_add_options MOZ_MAKE_FLAGS="-j9 -s"

      Conclusion: informative and correct error messages are invaluable.