I was recently reminded of Intel’s Threading Building Blocks (TBB). The default allocator on HP-UX (PA-RISC) allocates 16 bytes of memory even for blocks smaller than 8 bytes. If your system allocates hundreds of megabytes of useful data in small blocks of less than 8 bytes each, this means serious overhead: more than half of the allocated memory is wasted. I tried Hoard (version 2.1.2a), the only one that compiles on HP-UX PA-RISC, but even Hoard allocates 16 bytes.
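One rough way to check how much memory an allocator really spends per small block is to allocate many blocks of the same size and look at the smallest distance between neighbouring addresses. The sketch below does that. The names `estimate_block_stride` and `wasted_fraction` are made up for illustration, and the measurement is a heuristic: the smallest gap is an upper bound on the per-block footprint that is usually tight, not an exact figure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

// Hypothetical helper: estimate how many bytes the allocator spends per
// block of `request` bytes, using the smallest gap between sorted addresses.
std::size_t estimate_block_stride(std::size_t request, std::size_t count = 10000) {
    std::vector<std::uintptr_t> addrs;
    addrs.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        void* p = std::malloc(request);
        if (p != nullptr) {
            addrs.push_back(reinterpret_cast<std::uintptr_t>(p));
        }
    }
    std::sort(addrs.begin(), addrs.end());

    std::size_t best = std::numeric_limits<std::size_t>::max();
    for (std::size_t i = 1; i < addrs.size(); ++i) {
        best = std::min(best, static_cast<std::size_t>(addrs[i] - addrs[i - 1]));
    }

    // Release everything again before returning.
    for (std::uintptr_t a : addrs) {
        std::free(reinterpret_cast<void*>(a));
    }
    return addrs.size() < 2 ? 0 : best;
}

// Fraction of each block that is not payload, e.g. 8 bytes in a 16 byte block.
double wasted_fraction(std::size_t payload, std::size_t stride) {
    return 1.0 - static_cast<double>(payload) / static_cast<double>(stride);
}
```

Calling `estimate_block_stride(8)` and passing the result to `wasted_fraction(8, stride)` gives a quick estimate of the overhead for 8 byte blocks with whatever allocator the program is linked against.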
TBB seems to allocate 8 bytes, but there is a small problem that has to be solved first. During initialisation of a multithreaded application, the first library to be initialised is libpthread, probably due to some global variable. This requires memory, and if you redirect malloc to the TBB allocator, the process deadlocks, as can be seen in the following call stack:
#0 0xc020ba90 in __ksleep+0x10 () from /lib/libc.2
#1 0xc006023c in __spin_lock+0x10c () from /lib/libpthread.1
#2 0xc0057dc0 in pthread_key_create+0x34 () from /lib/libpthread.1
#3 0xc504cc40 in _Z17initMemoryManagerv () at ../../src/tbbmalloc/MemoryAllocator.cpp:1217
#4 0xc504cda8 in _Z19checkInitializationv () at ../../src/tbbmalloc/MemoryAllocator.cpp:1238
#5 0xc504d25c in scalable_malloc () at ../../src/tbbmalloc/MemoryAllocator.cpp:1343
#6 0xc07cf3b8 in malloc+0x58 () from custom_malloc.cpp
#7 0xc005845c in __pthread_specific_startup+0x20 () from /lib/libpthread.1
#8 0xc00597ec in __pthread_startup+0x1a8 () from /lib/libpthread.1
#9 0xc0027e58 in finish_dld_main+0x1520 () from /usr/lib/dld.sl
#10 0xc00154b8 in _dld_main+0x1c8 () from /usr/lib/dld.sl
#11 0x23b8e08 in __map_dld+0x650 ()
#12 0x23b82dc in $START$+0xd4 ()
#13 0xc006023c in __spin_lock+0x10c () from /lib/libpthread.1
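One plausible reading of this trace is that libpthread's startup code calls malloc while holding an internal lock, and the allocator's own initialisation then calls `pthread_key_create`, which waits for that same lock. Under that assumption, a possible workaround is to serve the very first allocations from a static pool that touches no pthread function, and to switch to the real allocator only once startup has finished. The sketch below illustrates the idea only. All names (`early_malloc`, `early_free`, `mark_runtime_ready`, the pool size) are hypothetical, `std::malloc` stands in for `scalable_malloc`, and how the "ready" flag gets set on a real system is left open.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

namespace {

// Static pool for allocations made before the threading runtime is usable.
// Assumption: startup is still single-threaded here, so no locking is needed.
alignas(16) unsigned char bootstrap_pool[64 * 1024];
std::size_t bootstrap_used = 0;
bool runtime_ready = false;  // hypothetical flag, set once startup is over

bool is_from_bootstrap_pool(const void* p) {
    const std::uintptr_t a = reinterpret_cast<std::uintptr_t>(p);
    const std::uintptr_t b = reinterpret_cast<std::uintptr_t>(bootstrap_pool);
    return a >= b && a < b + sizeof(bootstrap_pool);
}

// Bump allocation: no free list, no pthread calls, 16 byte alignment.
void* bootstrap_alloc(std::size_t size) {
    const std::size_t rounded = (size + 15) & ~static_cast<std::size_t>(15);
    if (rounded > sizeof(bootstrap_pool) - bootstrap_used) {
        return nullptr;  // pool exhausted
    }
    void* p = bootstrap_pool + bootstrap_used;
    bootstrap_used += rounded;
    return p;
}

}  // namespace

// To be called once the threading library has finished initialising.
void mark_runtime_ready() { runtime_ready = true; }

// Hypothetical replacement body for the custom malloc wrapper.
void* early_malloc(std::size_t size) {
    if (!runtime_ready) {
        return bootstrap_alloc(size);  // never enters the real allocator
    }
    return std::malloc(size);  // stand-in for scalable_malloc(size)
}

void early_free(void* p) {
    if (p == nullptr || is_from_bootstrap_pool(p)) {
        return;  // bootstrap memory is never handed back
    }
    std::free(p);  // stand-in for scalable_free(p)
}
```

The pool only has to cover the few allocations made during startup, so its wasteful 16 byte rounding does not matter for the overall memory footprint. A real version would also need matching `realloc` and `calloc` wrappers that recognise pool pointers.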