Archive for the 'performance' Category
when faster is slower, libc world

Investigating different memmem string search algorithms for my future vi like text editor, turns out that the libc replacements for embededd devices are slower than the native gnulibc. The test program opens a 300MB file (linux kernel header and source files merged-that is my reference), and searches for the (random) string “xxxxxxxxxxxxxxxxboeuoeuboeuoeu”. I linked against the memmem functions from 4 different libraries + a Boyer-Moore algorithm implementation. The results below show the time to search through the are the results (measured with time utility):

* dietlibc: 3.529
* newlibc: 1.955
* libc: 1.238
* klibc: 0.694
* Boyer Moore impl: 0.228

Did anybody really hear about klibc? Me not before. But it seems that it is providing the fastest algorithm. Dietlibc and newlibc are interesting projects for embedded devices. The goal was not speed, but size. However for a string search algorithm space shuld not be so much of compromise.

Making further tests… hm.. doing simple grep hm… result: 0.157, searching for the same string in the same file (about file cache: all programs had the same chance of having the file inside the fs cache). Well the reason is that grep is not loading the full file into the memory, it just loads chunks, and searches inside those chunks. The test program loads the full buffer into memory. The numbers above are for a mmap-ed version. Mmap seems to be overall better solution. However the page misses introduce too much noise in the results of the measurement of the algorithm. However in order to get more accurate results, I modified the test program to allocate ~300MB, and “read” the file into the 300MB. Reading the file is slower than the going through the mmap, as you can see below. However once having the full file in RAM, the measurements are more accurate:

with malloc/read
* Boyer-Moore: read/mmap time 0.358, memmem time 0.092
* klibc: read/mmap time 0.358, memmem time 0.555
* libc: read/mmap time 0.357, memmem time 1.100
* newlib: read/mmap time 0.358, memmem time 1.177
* dietlibc: read/mmap time 0.358, memmem time 3.344

So, the “Naïve string search algorithm” used by dietlibc is clearly the looser. Having in mind that dietlibc is intended for embedded devices, would make sense to improve it.

note: the read/mmap functions are from glibc for klibc/dietlibc/newlibc

distributed build with gnu make, the minusj

I was just nanometers close to design a nice distributed build system, however the gnu make when called with -j does not tell to the target builder process any information (through environment variable) about the number of the instance started in parallel…

There is however a more complicated way, if you would like to better use the resources of a build team (suppose for eg. everybody has quadcores), where instead of calling the, let’s say gcc compiler, you call a wrapper, that would be a client application to a builddistributor server. The server would know which resources are not loaded in the network and distribute a compilation there…

 

 

upstart openmoko

Due to the complex dependencies in the linux boot process the safest way to boot is to run the different initialization scripts sequentially. The result is a long boot process, subsystems are initialized earlier than needed etc.

There are different approaches like upstart, initng, runit, einit that try to improve the boot process on linux, each approach concentrating on several aspects, like running initialization scripts in parallel, handling device generated events in a more flexible way etc. Ubuntu for example comes now with upstart as default init, but it failes to use even at minimum the strength of the ideas behind upstart.

With the intent to speed up the linux process both for desktop and for embedded devices like ubuntu, I started to write some scripts that generates native upstart scripts based on the information from the /etc/init.d/* scripts. As first stage it would be enough to create the native upstart script as wrapper scripts around the /etc/init.d/* scripts.

One could speed up the boot process if initialization scripts could be run in parallel. As we know certain scripts need to have other scripts already started before. To simplify my description I will use the terms target as synonim for the script and dependencies for scripts that need to be started. Having the dependency relation ship we identify two categories of scripts: dependants and dependees. Of course the dependee can be dependant and so on… It is a general requirement that there should be no cyclic dependencies in sysVinit scripts.

The most trivial approach is to create upstart scripts for the dependees containing following

start on starting …. #list those targets that require the current one

pre-start exec /etc/init.d/$MYSCRIPT start

pre-stop exec /etc/init.d/$MYSCRIPT stop

the dependants (targets) would contain sthg. like:

stop on stopping … #list those dependencies that if stopped, the current one should be stopped

pre-start exec /etc/init.d/$MYSCRIPT start

pre-stop exec /etc/init.d/$MYSCRIPT stop

the two above would be merged like:

start on starting …. #list those targets that require the current one

stop on stopping … #list those dependencies that if stopped, the current one should be stopped

pre-start exec /etc/init.d/$MYSCRIPT start

pre-stop exec /etc/init.d/$MYSCRIPT stop

the first version of a python script that would generate such wrappers would be sthg. like… (GPL!!!!)

def createfile(file, startingon):
f = open(”/etc/event.d/upgen/” + file, ‘w’)
f.write(’#file automatically generated by upgen \n\n’)
for dep in startingon:
f.write(’start on starting ‘ + dep + ‘\n\n’)
f.write(’\n\n’)
f.write(’pre-start exec /etc/init.d/’ + file + ‘ start \n\n’)
f.write(’pre-stop exec /etc/init.d/’ + file + ‘ stop\n\n’)
f.close()

def createfiles(startingonmap):
for file in startingonmap:
createfile(file, startingonmap[file])

depfile=’/etc/init.d/.depend.boot’

startingonmap={}

for line in open(depfile, ‘r’).readlines():
#skip lines with
if (line.find(’TARGETS =’) != -1) or (line.find(’INTERACTIVE =’) != -1):
pass
else:
t = line.split(’:')
dependant = t[0]
dependies = t[1].split()
for dependy in dependies:
inversemap = {}
try:
inversemap = startingonmap[dependy]
except:
pass
inversemap[dependant] = 1
startingonmap[dependy]=inversemap

createfiles(startingonmap)

openmoko memory allocator and hoard

Wrote a little program that allocates randomly sizes from 0 to 100 MB and compared the results for the default allocator and hoard. Hoard behaves better both as speed and overhead, however not much better. Will publish the results as soon I will have time to put them in a decent format.

openmoko project blown up like the global financial markets

As I was saying in my previous posts, I am more than disappointed about the openmoko project. The apple iphone is successful because the main man behind it wants one thing. And he wants that the users are happy with the device. And when are the users happy? If they have a usable phone… And openmoko has failed to provide that, however it had enough lego elements at disposal, and also enough time. Nobody will want to use a phone that takes 5 minutes to boot, and then another 1 minute to go to the phone interface and make a call. The problem is that the system (of the project) has to many freedoms and too many forces with random intensities in time and the result is a brownian movement, and it does not look to tend towards an equilibrium state, or it tends to an equilibrium state which is general failure :)

I have the following questions to the community:
1. Why was in not possible to create a simpler kernel that loads just the basic things to make the phone available and start anything else when needed. Or loading the kernel is slow due to printing the messages to the slow framebuffer?
2. why is this story with glamo so stupid? I was reading the specs. I spent (psychologically) 50 dolars more to have an accelerated graphic card for opengl… and it is not there… I have no acceleration on eten m800 but gl es is working on it!!!!
3. why is everything so slow? Why not moving to sthg. fast like qt embedded on top of framebuffer. I believe that tweaking of framebuffer (directfb) driver would be easier than X to activate someway the graphics acceleration.

The answer is following: the main guy behind it is just incapable, all the story is a blown up story like the global financial system… (in this case the guy behind is like the big guys who manipulate the financial markets :))

p.s. Nobody should believe that have sthg. to do with QT or Nokia other than liking the QT embedded:)

tbbmalloc vs. hoard vs. tcmalloc vs. tcmalloc or intel vs. google vs. hoard vs. linux

I wrote a little benchmark (single threaded) to compare the overhead of allocated memory generated by different allocators for small blocks. The candidates are: thread building blocks scalable allocator, hoard,  google’s tcmalloc and the standard malloc. The benchmark allocates a total useful memory of around 300MB in random size blocks. The limit of the sizes is determined by the parameter to the program. I run a batch for each allocator for max sizes: 8, 16, 32, 64 etc. Below are the results:

./standard max_size: 8, time: 17810000 ticks, util size: 273471 kB, virt mem: 1250072 kB, overhead 357%
./tbb max_size: 8, time: 15190000 ticks, util size: 273471 kB, virt mem: 640204 kB, overhead 134%
./google max_size: 8, time: 18760000 ticks, util size: 273471 kB, virt mem: 642636 kB, overhead 134%
./hoard max_size: 8, time: 16430000 ticks, util size: 273471 kB, virt mem: 626300 kB, overhead 129%
./standard max_size: 16, time: 9680000 ticks, util size: 293005 kB, virt mem: 683792 kB, overhead 133%
./tbb max_size: 16, time: 8870000 ticks, util size: 293005 kB, virt mem: 461004 kB, overhead 57%
./google max_size: 16, time: 9270000 ticks, util size: 293005 kB, virt mem: 466380 kB, overhead 59%
./hoard max_size: 16, time: 7830000 ticks, util size: 293005 kB, virt mem: 450428 kB, overhead 53%
./standard max_size: 32, time: 5220000 ticks, util size: 302209 kB, virt mem: 472988 kB, overhead 56%
./tbb max_size: 32, time: 3520000 ticks, util size: 302209 kB, virt mem: 385228 kB, overhead 27%
./google max_size: 32, time: 4650000 ticks, util size: 302209 kB, virt mem: 393292 kB, overhead 30%
./hoard max_size: 32, time: 4320000 ticks, util size: 302209 kB, virt mem: 376508 kB, overhead 24%
./standard max_size: 64, time: 2590000 ticks, util size: 307102 kB, virt mem: 386396 kB, overhead 25%
./tbb max_size: 64, time: 1620000 ticks, util size: 307102 kB, virt mem: 351436 kB, overhead 14%
./google max_size: 64, time: 2530000 ticks, util size: 307102 kB, virt mem: 360396 kB, overhead 17%
./hoard max_size: 64, time: 2230000 ticks, util size: 307102 kB, virt mem: 343676 kB, overhead 11%
./standard max_size: 128, time: 1140000 ticks, util size: 309567 kB, virt mem: 347720 kB, overhead 12%
./tbb max_size: 128, time: 1090000 ticks, util size: 309567 kB, virt mem: 340172 kB, overhead 9%
./google max_size: 128, time: 1580000 ticks, util size: 309567 kB, virt mem: 345932 kB, overhead 11%
./hoard max_size: 128, time: 1160000 ticks, util size: 309567 kB, virt mem: 335740 kB, overhead 8%
./standard max_size: 256, time: 830000 ticks, util size: 310797 kB, virt mem: 329504 kB, overhead 6%
./tbb max_size: 256, time: 620000 ticks, util size: 310797 kB, virt mem: 347340 kB, overhead 11%
./google max_size: 256, time: 900000 ticks, util size: 310797 kB, virt mem: 344908 kB, overhead 10%

./hoard max_size: 256, time: 860000 ticks, util size: 310797 kB, virt mem: 339004 kB, overhead 9%