Archive for the 'operating system' Category
openmoko memory allocator and hoard

I wrote a little program that allocates blocks of random size from 0 to 100 MB and compared the results for the default allocator and hoard. Hoard behaves better in both speed and overhead, though not by much. I will publish the results as soon as I have time to put them in a decent format.

microkernel on openmoko freerunner

I am happy to see that there are other experiments than linux on openmoko…

http://www.linuxdevices.com/news/NS8294545513.html

and

http://wiki.ok-labs.com/

Qt extended 4.5 super fast on openmoko

Well, compared with Om 2008.9 (the openmoko stack), the Qt Extended stack, which runs without X11, seems to be much, much faster on openmoko. Honestly, I do not understand why this agony with the Enlightenment libraries and the old and bad X11, when there are much better ways to talk to the graphics card than X11, and to make the user happy…

Well, apparently it was a bad idea, and a bit of a disappointment, to buy the openmoko for more than 300 euros when something like the eten X800 is available for around 300 euros and at least has a working software stack… :) and with minimal effort a linux can be built for it…

linux left, windows right pocket…

Yes, at the moment linux is in the right pocket until linux catches up. Then I will swap. I am talking about the neo freerunner vs. the eten m800. After months of problems, the m800 is now stable, and I have quite a lot of software on it that the freerunner cannot provide at the moment.

tbbmalloc vs. hoard vs. tcmalloc vs. malloc or intel vs. google vs. hoard vs. linux

I wrote a little single-threaded benchmark to compare the memory overhead that different allocators generate for small blocks. The candidates are: the thread building blocks scalable allocator, hoard, google’s tcmalloc and the standard malloc. The benchmark allocates a total of around 300 MB of useful memory in blocks of random size. The upper limit of the block size is set by a parameter to the program. I ran a batch for each allocator with max sizes of 8, 16, 32, 64 etc. Below are the results:

./standard max_size: 8, time: 17810000 ticks, util size: 273471 kB, virt mem: 1250072 kB, overhead 357%
./tbb max_size: 8, time: 15190000 ticks, util size: 273471 kB, virt mem: 640204 kB, overhead 134%
./google max_size: 8, time: 18760000 ticks, util size: 273471 kB, virt mem: 642636 kB, overhead 134%
./hoard max_size: 8, time: 16430000 ticks, util size: 273471 kB, virt mem: 626300 kB, overhead 129%
./standard max_size: 16, time: 9680000 ticks, util size: 293005 kB, virt mem: 683792 kB, overhead 133%
./tbb max_size: 16, time: 8870000 ticks, util size: 293005 kB, virt mem: 461004 kB, overhead 57%
./google max_size: 16, time: 9270000 ticks, util size: 293005 kB, virt mem: 466380 kB, overhead 59%
./hoard max_size: 16, time: 7830000 ticks, util size: 293005 kB, virt mem: 450428 kB, overhead 53%
./standard max_size: 32, time: 5220000 ticks, util size: 302209 kB, virt mem: 472988 kB, overhead 56%
./tbb max_size: 32, time: 3520000 ticks, util size: 302209 kB, virt mem: 385228 kB, overhead 27%
./google max_size: 32, time: 4650000 ticks, util size: 302209 kB, virt mem: 393292 kB, overhead 30%
./hoard max_size: 32, time: 4320000 ticks, util size: 302209 kB, virt mem: 376508 kB, overhead 24%
./standard max_size: 64, time: 2590000 ticks, util size: 307102 kB, virt mem: 386396 kB, overhead 25%
./tbb max_size: 64, time: 1620000 ticks, util size: 307102 kB, virt mem: 351436 kB, overhead 14%
./google max_size: 64, time: 2530000 ticks, util size: 307102 kB, virt mem: 360396 kB, overhead 17%
./hoard max_size: 64, time: 2230000 ticks, util size: 307102 kB, virt mem: 343676 kB, overhead 11%
./standard max_size: 128, time: 1140000 ticks, util size: 309567 kB, virt mem: 347720 kB, overhead 12%
./tbb max_size: 128, time: 1090000 ticks, util size: 309567 kB, virt mem: 340172 kB, overhead 9%
./google max_size: 128, time: 1580000 ticks, util size: 309567 kB, virt mem: 345932 kB, overhead 11%
./hoard max_size: 128, time: 1160000 ticks, util size: 309567 kB, virt mem: 335740 kB, overhead 8%
./standard max_size: 256, time: 830000 ticks, util size: 310797 kB, virt mem: 329504 kB, overhead 6%
./tbb max_size: 256, time: 620000 ticks, util size: 310797 kB, virt mem: 347340 kB, overhead 11%
./google max_size: 256, time: 900000 ticks, util size: 310797 kB, virt mem: 344908 kB, overhead 10%
./hoard max_size: 256, time: 860000 ticks, util size: 310797 kB, virt mem: 339004 kB, overhead 9%

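
For readers who want to reproduce numbers of this kind, here is a minimal sketch of how such a benchmark could look. It is not the original program: the names `overhead_percent`, `virtual_memory_kb` and `run_benchmark` are made up for illustration, reading `VmSize` from `/proc/self/status` assumes Linux, and the bookkeeping vector adds its own memory to the measurement. The overhead formula (extra virtual memory relative to useful memory, truncated to whole percent) is inferred from the result lines above, which it matches.

```cpp
#include <cstddef>
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <random>
#include <string>
#include <vector>

// Overhead as it appears in the result lines: extra virtual memory
// relative to the useful (requested) memory, in whole percent.
long long overhead_percent(long long util_kb, long long virt_kb) {
    return (virt_kb - util_kb) * 100 / util_kb;
}

// Assumption: Linux. Returns the VmSize value (in kB) of this process,
// or -1 if /proc/self/status is missing or has no VmSize line.
long virtual_memory_kb() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.compare(0, 7, "VmSize:") == 0) {
            return std::stol(line.substr(7));  // skips whitespace, stops at " kB"
        }
    }
    return -1;
}

struct Result {
    long long util_bytes;  // sum of all requested sizes
    std::clock_t ticks;    // clock() ticks spent allocating
    long virt_kb;          // virtual memory after allocating, -1 if unknown
};

// Allocates blocks of random size in [1, max_size] until at least
// total_bytes of useful memory have been requested.
Result run_benchmark(std::size_t max_size, long long total_bytes, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> size_dist(1, max_size);
    std::vector<void*> blocks;  // note: this vector also counts toward VmSize
    long long util = 0;

    std::clock_t start = std::clock();
    while (util < total_bytes) {
        std::size_t n = size_dist(rng);
        void* p = std::malloc(n);
        if (p == nullptr) break;  // out of memory: stop early
        blocks.push_back(p);
        util += static_cast<long long>(n);
    }
    std::clock_t ticks = std::clock() - start;

    long virt = virtual_memory_kb();  // measure before releasing anything
    for (void* p : blocks) std::free(p);
    return {util, ticks, virt};
}
```

To compare allocators with a sketch like this, the same binary would be run once per allocator (for example with the allocator library linked in or preloaded) and once per max size.
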
thread building blocks allocator improvement proposal

I would like to propose a little architectural improvement for tbbmalloc. (Lack of time does not allow me to double-check whether what I post here is true :) )

As I described in some previous posts on this blog, integrating tbbmalloc is a little problematic, as tbb itself relies on malloc. This means that a complete replacement of malloc with tbbmalloc is impossible.

Custom malloc implementations try to get linked with the host application ahead of libc, so that they are the ones that provide the symbols of the malloc family. This linking (in the case of shared libraries) can happen at runtime as well.

An elegant way to combine malloc with tbbmalloc would be to extend the signature of the tbbmalloc functions with an extra formal parameter: a pointer to the default function of the malloc family. All calls in the custom malloc definition could then be redirected to tbbmalloc together with a pointer to the libc function (looked up with getsymbol etc.). tbb could decide whether the allocation is its job (small block) and, if not, redirect the call to the passed function. If the pointer is null, tbb would call malloc itself, which allows more flexibility.

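so the malloc would look like this:

/* cached pointer to the malloc of libc, looked up on first use */
static void *(*libc_malloc)(size_t) = NULL;

void *malloc(size_t size)
{
    if (!libc_malloc) /* check if the libc malloc symbol was already found */
    {
        libc_malloc = find_libc_malloc();
    }
    /* proposed signature: tbb gets the fallback together with the size */
    return tbb_malloc(libc_malloc, size);
}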
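
To make the proposal a bit more concrete, here is a rough sketch of what the allocator side could do with the extra parameter. Nothing in it is real tbb code: `pool_malloc` stands in for the proposed `tbb_malloc(fallback, size)`, the threshold `kSmallBlockLimit` is an arbitrary value, and the "pool" is just a bump allocator over a static array so that the example is self-contained.

```cpp
#include <cstddef>
#include <cstdlib>

// Type of the fallback passed in by the malloc wrapper (e.g. libc's malloc).
using malloc_fn = void* (*)(std::size_t);

// Hypothetical threshold below which the pool considers a request "its job".
constexpr std::size_t kSmallBlockLimit = 64;

// Stand-in for the small-block pool: a bump allocator over a static arena.
// A real allocator would manage size classes and support freeing.
alignas(std::max_align_t) static unsigned char g_arena[1 << 16];
static std::size_t g_arena_used = 0;

static void* small_block_alloc(std::size_t size) {
    constexpr std::size_t align = alignof(std::max_align_t);
    std::size_t rounded = (size + align - 1) / align * align;
    if (rounded == 0) rounded = align;  // zero-byte requests still get a block
    if (rounded > sizeof(g_arena) - g_arena_used) return nullptr;  // arena full
    void* p = g_arena + g_arena_used;
    g_arena_used += rounded;
    return p;
}

// Sketch of the proposed signature: the caller hands over the default
// malloc, and the pool only keeps the requests it wants to serve itself.
void* pool_malloc(malloc_fn fallback, std::size_t size) {
    if (size <= kSmallBlockLimit) {
        if (void* p = small_block_alloc(size)) return p;
        // arena exhausted: fall through to the fallback path
    }
    if (fallback != nullptr) return fallback(size);  // not our job: redirect
    return std::malloc(size);                        // null pointer: call malloc
}
```

The point of the design is that the pool never needs to know where the default malloc comes from; the wrapper that overrides malloc looks it up once and passes it along.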

thread building blocks on hp-ux, continued

Scenario:

legacy application (server) that uses 1 GB of memory.

Using the tbb allocator:

memory usage dropped to 1/3, that is, about 300 MB!

Correction: this is not true, the measurement was wrong. There is, however, a gain of ~17%.

but:

it is impossible to completely replace malloc, as tbb itself calls malloc to allocate the large blocks that it uses for the small blocks.

(partial) solution:

override the c++ new operator, and redirect it to the tbb allocator.
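
A minimal sketch of this partial solution follows. The two functions `small_block_malloc` and `small_block_free` are stand-ins I made up so that the example compiles without tbb; with tbb installed, the calls would presumably go to `scalable_malloc` and `scalable_free` from `tbb/scalable_allocator.h` instead. The counter exists only to show that the replacement is actually used.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Counts how many allocations went through the replaced operator new.
static std::size_t g_new_calls = 0;

// Stand-ins for the tbb allocator entry points (assumption: with tbb
// these would be scalable_malloc / scalable_free).
static void* small_block_malloc(std::size_t size) {
    ++g_new_calls;
    return std::malloc(size != 0 ? size : 1);  // never request zero bytes
}
static void small_block_free(void* p) { std::free(p); }

// Replacing the global operator new redirects every C++ "new" in the
// program, while plain malloc stays untouched for tbb's own use.
void* operator new(std::size_t size) {
    if (void* p = small_block_malloc(size)) return p;
    throw std::bad_alloc();
}
void operator delete(void* p) noexcept { small_block_free(p); }
void operator delete(void* p, std::size_t) noexcept { small_block_free(p); }

// Array forms simply forward to the scalar forms.
void* operator new[](std::size_t size) { return operator new(size); }
void operator delete[](void* p) noexcept { operator delete(p); }
void operator delete[](void* p, std::size_t) noexcept { operator delete(p); }
```

This only covers memory that the application requests through new; anything allocated with malloc directly, including C libraries, still goes to the default allocator.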

Recommendation for tbb:

extend the signature of the allocator function with an additional formal parameter: a pointer to the default malloc function.

thread building blocks on hp-ux, the chicken and the egg

I was reminded of Intel’s thread building blocks. The default allocator on hp-ux (pa-risc) allocates 16 bytes of memory even for blocks smaller than 8 bytes. If you have a system that allocates hundreds of megabytes of useful memory in small blocks of less than 8 bytes, this means a serious overhead! I tried hoard (version 2.1.2a), the only one that compiles on hp-ux pa-risc, but even hoard allocates 16 bytes.
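
To put a number on that overhead, here is a small arithmetic sketch. It assumes a simplified model in which the allocator rounds every request up to a multiple of a fixed granule and ignores any per-block header, so real allocators may do worse. The function names are made up for illustration.

```cpp
#include <cstddef>

// Bytes consumed by one request if the allocator rounds every block up
// to a multiple of `granule` (and hands out at least one granule).
std::size_t rounded_size(std::size_t request, std::size_t granule) {
    if (request == 0) return granule;
    return (request + granule - 1) / granule * granule;
}

// Wasted bytes relative to the requested bytes, in whole percent.
// Returns 0 for a zero-byte request to avoid dividing by zero.
std::size_t rounding_overhead_percent(std::size_t request, std::size_t granule) {
    if (request == 0) return 0;
    return (rounded_size(request, granule) - request) * 100 / request;
}
```

Under this model a 6-byte request costs 16 bytes with a 16-byte granule and 8 bytes with an 8-byte granule, and even a full 8-byte request carries 100% overhead when the granule is 16 bytes.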

TBB seems to allocate 8 bytes, but there is a little obstacle that has to be solved first. During initialisation of a multithreaded application the first library to be initialised is libpthread, probably due to some global variable. This requires memory, and if you redirect malloc to the tbb allocator, the application enters a deadlock, as can be seen in the following call stack:

#0 0xc020ba90 in __ksleep+0x10 () from /lib/libc.2
#1 0xc006023c in __spin_lock+0x10c () from /lib/libpthread.1
#2 0xc0057dc0 in pthread_key_create+0x34 () from /lib/libpthread.1
#3 0xc504cc40 in _Z17initMemoryManagerv () at ../../src/tbbmalloc/MemoryAllocator.cpp:1217
#4 0xc504cda8 in _Z19checkInitializationv () at ../../src/tbbmalloc/MemoryAllocator.cpp:1238
#5 0xc504d25c in scalable_malloc () at ../../src/tbbmalloc/MemoryAllocator.cpp:1343
#6 0xc07cf3b8 in malloc+0x58 () from custom_malloc.cpp
#7 0xc005845c in __pthread_specific_startup+0x20 () from /lib/libpthread.1
#8 0xc00597ec in __pthread_startup+0x1a8 () from /lib/libpthread.1
#9 0xc0027e58 in finish_dld_main+0x1520 () from /usr/lib/dld.sl
#10 0xc00154b8 in _dld_main+0x1c8 () from /usr/lib/dld.sl
#11 0x23b8e08 in __map_dld+0x650 ()
#12 0x23b82dc in $START$+0xd4 ()
#13 0xc006023c in __spin_lock+0x10c () from /lib/libpthread.1
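
One common way around this kind of chicken-and-egg problem is to keep the real allocator out of the picture during startup and to serve early requests from a static buffer that needs no locks and no pthread calls. The sketch below shows the idea only; I have not verified that it resolves the hp-ux case above. All names are hypothetical, `real_allocator_malloc` stands in for `scalable_malloc`, the buffer size is arbitrary, and the early phase is assumed to be single-threaded.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Static buffer for requests that arrive before startup has finished.
alignas(std::max_align_t) static unsigned char g_bootstrap[8192];
static std::size_t g_bootstrap_used = 0;
static bool g_startup_done = false;

// Hypothetical hook: called once the runtime libraries are initialised,
// for example at the beginning of main().
void mark_startup_done() { g_startup_done = true; }

// True if p points into the bootstrap buffer.
static bool from_bootstrap(const void* p) {
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    std::uintptr_t lo = reinterpret_cast<std::uintptr_t>(g_bootstrap);
    return addr >= lo && addr < lo + sizeof(g_bootstrap);
}

// Stand-ins for the real allocator (assumption: scalable_malloc/free).
static void* real_allocator_malloc(std::size_t size) { return std::malloc(size); }
static void real_allocator_free(void* p) { std::free(p); }

void* early_safe_malloc(std::size_t size) {
    if (g_startup_done) return real_allocator_malloc(size);

    // Startup phase: only bump an offset, never touch locks or pthread.
    if (size > sizeof(g_bootstrap)) return nullptr;
    constexpr std::size_t align = alignof(std::max_align_t);
    std::size_t rounded = (size + align - 1) / align * align;
    if (rounded == 0) rounded = align;
    if (rounded > sizeof(g_bootstrap) - g_bootstrap_used) return nullptr;
    void* p = g_bootstrap + g_bootstrap_used;
    g_bootstrap_used += rounded;
    return p;
}

void early_safe_free(void* p) {
    // Bootstrap blocks are never released; everything else goes back
    // to the real allocator.
    if (p == nullptr || from_bootstrap(p)) return;
    real_allocator_free(p);
}
```

The price is a small amount of memory that is never returned, which seems acceptable if only the few allocations made during library initialisation end up in the buffer.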