Intel started Threading Building Blocks (TBB); I am going to start Cloud Building Blocks. I skimmed through the TBB documentation and have an idea of what is going on there, but I realised that although TBB gives me valuable tools for one computer with multiple cores, it gives me no comparable tools for working on a farm of computers. And I think the future belongs to what is called cloud computing. I have therefore set up CloudBuildingBlocks.org, at first as a wiki to collect some ideas; maybe it will become a project…
I wrote a little single-threaded benchmark to compare the memory overhead of different allocators for small blocks. The candidates are the Threading Building Blocks scalable allocator, Hoard, Google's tcmalloc and the standard malloc. The benchmark allocates a total of around 300 MB of useful memory in blocks of random size. The upper limit of the block size is given as a parameter to the program. I ran a batch for each allocator with max sizes of 8, 16, 32, 64 etc. Below are the results:
./standard max_size: 8, time: 17810000 ticks, util size: 273471 kB, virt mem: 1250072 kB, overhead 357%
./tbb max_size: 8, time: 15190000 ticks, util size: 273471 kB, virt mem: 640204 kB, overhead 134%
./google max_size: 8, time: 18760000 ticks, util size: 273471 kB, virt mem: 642636 kB, overhead 134%
./hoard max_size: 8, time: 16430000 ticks, util size: 273471 kB, virt mem: 626300 kB, overhead 129%
./standard max_size: 16, time: 9680000 ticks, util size: 293005 kB, virt mem: 683792 kB, overhead 133%
./tbb max_size: 16, time: 8870000 ticks, util size: 293005 kB, virt mem: 461004 kB, overhead 57%
./google max_size: 16, time: 9270000 ticks, util size: 293005 kB, virt mem: 466380 kB, overhead 59%
./hoard max_size: 16, time: 7830000 ticks, util size: 293005 kB, virt mem: 450428 kB, overhead 53%
./standard max_size: 32, time: 5220000 ticks, util size: 302209 kB, virt mem: 472988 kB, overhead 56%
./tbb max_size: 32, time: 3520000 ticks, util size: 302209 kB, virt mem: 385228 kB, overhead 27%
./google max_size: 32, time: 4650000 ticks, util size: 302209 kB, virt mem: 393292 kB, overhead 30%
./hoard max_size: 32, time: 4320000 ticks, util size: 302209 kB, virt mem: 376508 kB, overhead 24%
./standard max_size: 64, time: 2590000 ticks, util size: 307102 kB, virt mem: 386396 kB, overhead 25%
./tbb max_size: 64, time: 1620000 ticks, util size: 307102 kB, virt mem: 351436 kB, overhead 14%
./google max_size: 64, time: 2530000 ticks, util size: 307102 kB, virt mem: 360396 kB, overhead 17%
./hoard max_size: 64, time: 2230000 ticks, util size: 307102 kB, virt mem: 343676 kB, overhead 11%
./standard max_size: 128, time: 1140000 ticks, util size: 309567 kB, virt mem: 347720 kB, overhead 12%
./tbb max_size: 128, time: 1090000 ticks, util size: 309567 kB, virt mem: 340172 kB, overhead 9%
./google max_size: 128, time: 1580000 ticks, util size: 309567 kB, virt mem: 345932 kB, overhead 11%
./hoard max_size: 128, time: 1160000 ticks, util size: 309567 kB, virt mem: 335740 kB, overhead 8%
./standard max_size: 256, time: 830000 ticks, util size: 310797 kB, virt mem: 329504 kB, overhead 6%
./tbb max_size: 256, time: 620000 ticks, util size: 310797 kB, virt mem: 347340 kB, overhead 11%
./google max_size: 256, time: 900000 ticks, util size: 310797 kB, virt mem: 344908 kB, overhead 10%
./hoard max_size: 256, time: 860000 ticks, util size: 310797 kB, virt mem: 339004 kB, overhead 9%
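For readers who want to reproduce something similar, here is a minimal sketch of what such a benchmark could look like. It is not the original program: the names `BenchResult`, `run_benchmark` and `overhead_percent` are my own, the random number generator and seed are assumptions, and reading the virtual memory size of the process is left out because it is platform specific. The overhead formula is inferred from the result lines above, where the printed percentage matches (virt mem - util size) / util size, truncated to a whole number.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <ctime>
#include <random>
#include <vector>

// Result of one run. Field names are illustrative, not the original program's.
struct BenchResult {
    std::size_t useful_bytes;  // sum of all requested block sizes
    std::size_t blocks;        // number of blocks allocated
    std::clock_t ticks;        // clock() ticks spent in the allocation loop
};

// Allocate blocks of random size in [1, max_size] until target_bytes of
// useful memory has been requested. All blocks stay live until the end,
// so the process size reflects the allocator's per-block overhead.
BenchResult run_benchmark(std::size_t max_size, std::size_t target_bytes,
                          unsigned seed = 42) {
    assert(max_size > 0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> dist(1, max_size);
    std::vector<void*> live;  // note: this bookkeeping also uses memory
    BenchResult r{0, 0, 0};

    const std::clock_t start = std::clock();
    while (r.useful_bytes < target_bytes) {
        const std::size_t n = dist(rng);
        void* p = std::malloc(n);  // resolves to whichever allocator is linked in
        if (!p) break;             // out of memory: stop early
        live.push_back(p);
        r.useful_bytes += n;
        ++r.blocks;
    }
    r.ticks = std::clock() - start;

    // A real measurement would read the process's virtual size here,
    // before anything is freed.
    for (void* p : live) std::free(p);
    return r;
}

// Overhead in percent of the useful size, truncated to a whole number.
long overhead_percent(long virt_kb, long util_kb) {
    assert(util_kb > 0);
    return (virt_kb - util_kb) * 100 / util_kb;
}
```

Calling `run_benchmark(8, 300u * 1024 * 1024)` would roughly correspond to the `max_size: 8` runs, and `overhead_percent(1250072, 273471)` gives the 357 shown for the standard allocator. To compare allocators, the same binary would be linked or preloaded with each allocator library in turn.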
I would like to propose a little architectural improvement for tbbmalloc. (Lack of time does not allow me to double-check whether what I post here is true :) )
As I described in some previous posts on my blog, integrating tbbmalloc is a little problematic, because TBB itself relies on malloc. This means that completely replacing malloc with tbbmalloc is impossible.
Custom malloc implementations try to get linked with the host application ahead of libc, so that they are the ones providing the symbols of the malloc family. In the case of shared libraries, this linking can also happen at run time.
An elegant way to combine malloc with tbbmalloc would be to extend the signature of the tbbmalloc functions with an extra formal parameter: a pointer to the default malloc-family function. All calls in the custom malloc definition could then be redirected to tbbmalloc together with a pointer to the libc function (looked up with a get-symbol call such as dlsym). tbbmalloc could decide whether the allocation is its job (a small block) and, if not, redirect the call to the function that was passed in. If the pointer is null, it would call malloc itself, which allows more flexibility.
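So malloc would look like this:
    #include <stddef.h>

    /* tbb_malloc() and find_libc_malloc() are the proposed functions;
       neither exists in TBB today. */
    typedef void *(*malloc_fn)(size_t);
    static malloc_fn libc_malloc = NULL;

    void *malloc(size_t size)
    {
        if (!libc_malloc)  /* libc malloc symbol not resolved yet */
        {
            libc_malloc = find_libc_malloc();
        }
        return tbb_malloc(libc_malloc, size);
    }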
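To make the idea a bit more concrete, here is a small sketch of the other side of that call: an allocator entry point that takes the fallback pointer and decides where a request goes. Everything in it is hypothetical. `proposed_malloc`, `fallback_malloc_fn`, `kSmallLimit` and the toy bump arena are stand-ins of mine and say nothing about how tbbmalloc is really built. A matching free would need the same kind of dispatch and is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Signature of a malloc-family fallback, for example libc's malloc.
using fallback_malloc_fn = void* (*)(std::size_t);

// Hypothetical small-block threshold and call counters, for illustration only.
constexpr std::size_t kSmallLimit = 256;
std::size_t g_small_calls = 0;
std::size_t g_fallback_calls = 0;

// Toy small-block path: bump allocation from a static arena, never freed.
// It stands in for the allocator's real small-block machinery.
static void* small_block_alloc(std::size_t size) {
    alignas(std::max_align_t) static unsigned char arena[64 * 1024];
    static std::size_t used = 0;
    const std::size_t align = alignof(std::max_align_t);
    const std::size_t rounded = (size + align - 1) / align * align;
    if (used + rounded > sizeof(arena)) return nullptr;  // arena exhausted
    void* p = arena + used;
    used += rounded;
    return p;
}

// Used when the caller passes a null fallback pointer.
static void* default_fallback(std::size_t size) { return std::malloc(size); }

// The proposed shape: the caller hands in the default malloc, and the
// allocator only keeps the requests it considers its job (small blocks).
void* proposed_malloc(fallback_malloc_fn fallback, std::size_t size) {
    if (size > 0 && size <= kSmallLimit) {
        if (void* p = small_block_alloc(size)) {
            ++g_small_calls;
            return p;
        }
    }
    ++g_fallback_calls;
    if (!fallback) fallback = default_fallback;
    return fallback(size);
}
```

With this shape, `proposed_malloc(nullptr, 32)` is served from the small-block path, while `proposed_malloc(nullptr, 4096)` ends up in the fallback. In an interposed malloc the fallback would be the libc function, which on Linux can be looked up with `dlsym(RTLD_NEXT, "malloc")`.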
Scenario:
A legacy application (a server) that uses 1 GB of memory.
Using the TBB allocator:
Memory usage dropped to 1/3, that is 300 MB!
This is not true… the measurement was wrong… however, there is a gain of ~17%.
But:
It is impossible to replace malloc completely, as TBB itself calls malloc to allocate the large blocks that it uses for the small blocks.
(Partial) solution:
Override the C++ new operator and redirect it to the TBB allocator.
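A minimal sketch of such an override, under some assumptions: `backend_alloc` and `backend_free` are placeholder names of mine and simply forward to `std::malloc` and `std::free` so that the example builds without TBB. With TBB installed, they would call `scalable_malloc` and `scalable_free` from `tbb/scalable_allocator.h` instead. The counters are only there to show that the replaced operators are actually used, and the C++17 aligned overloads are not covered.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Counters for demonstration only.
std::size_t g_new_calls = 0;
std::size_t g_delete_calls = 0;

// Placeholders for the real allocator. With TBB these two would call
// scalable_malloc() and scalable_free() from tbb/scalable_allocator.h.
static void* backend_alloc(std::size_t n) { return std::malloc(n); }
static void backend_free(void* p) { std::free(p); }

// Replacement for the global operator new. The default array form and the
// nothrow forms forward to this one, so they are redirected as well.
void* operator new(std::size_t size) {
    ++g_new_calls;
    if (size == 0) size = 1;  // operator new must return a unique pointer
    if (void* p = backend_alloc(size)) return p;
    throw std::bad_alloc();
}

// Matching replacement for the global operator delete.
void operator delete(void* p) noexcept {
    if (p) ++g_delete_calls;
    backend_free(p);
}

// Sized variant (C++14), forwarded to the unsized one.
void operator delete(void* p, std::size_t) noexcept { operator delete(p); }
```

Only memory obtained through `new` goes to the replacement allocator this way. Anything the application or its libraries allocate with plain malloc still goes to libc, which is why this is only a partial solution.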
Recommendation for TBB:
Extend the definition of the allocator function with a further formal parameter: a pointer to the default malloc function.