Wednesday, March 04, 2009

Benchmarking Stackless, Kamaelia and Fibra

In the cooperative threading world, as far as Python goes, there are a few choices. Stackless is the most well known, I guess. Since reading this post, I've been itching to see how Fibra compares to Kamaelia and Stackless. Now that I've implemented 'channels' or as I like to call them, 'tubes' in Fibra, I can implement the Stackless hackysack benchmark, and finally get a direct comparison of task switching speed between Fibra, Kamaelia and Stackless.
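For readers unfamiliar with the benchmark: N players sit in a circle, one is handed a 'hackysack' message, and whoever holds it kicks it to a random peer until a fixed number of kicks has happened, so the run time is dominated by task switches and message passing. As a rough sketch of the pattern only (plain generators with a naive round-robin loop; every name here is invented for illustration and is not the benchmark code):

```python
import random
from collections import deque

def player(name, circle, state):
    # Each resume of this generator is one task switch; the holder of
    # the sack kicks it to a random peer and bumps the shared counter.
    while state['kicks'] < state['turns']:
        if state['holder'] == name:
            state['holder'] = random.choice(circle)
            state['kicks'] += 1
        yield  # give up control; this switch is what the benchmark times

def run(players=10, turns=100):
    state = {'holder': 0, 'kicks': 0, 'turns': turns}
    circle = list(range(players))
    tasks = deque(player(n, circle, state) for n in circle)
    while tasks:  # naive round-robin scheduler
        task = tasks.popleft()
        try:
            next(task)
            tasks.append(task)
        except StopIteration:
            pass
    return state['kicks']
```

With real channels the scheduler only wakes the player who was kicked to; this sketch resumes every player on every round, which is exactly the overhead a smarter scheduler avoids.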

Fibra results:
python -m timeit -n 3 -s "import hackysack" "hackysack.runit(10000,1000, dbg=0)"
3 loops, best of 3: 227 msec per loop

Kamaelia Results: Using this code.
python -m timeit -n 3 -s "import hackysack" "hackysack.runit(10000,1000)"
3 loops, best of 3: 2.41 sec per loop

Stackless results: Using this code.
python -m timeit -n 3 -s "import hackysack" "hackysack.runit(10000,1000, dbg=0)"
3 loops, best of 3: 31.5 msec per loop

These benchmarks show that Stackless is 7x faster than Fibra, and Fibra is 10x faster than Kamaelia.

This is the code I used to benchmark Fibra.

import fibra
import random
import sys

scheduler = fibra.schedule()

class hackysacker:
    counter = 0
    def __init__(self, name, circle):
        self.name = name
        self.circle = circle
        circle.append(self)
        self.messageQueue = fibra.Tube()
        scheduler.install(self.messageLoop())

    def incrementCounter(self):
        hackysacker.counter += 1
        if hackysacker.counter >= turns:
            # game over: tell the first player to go home
            hs = self.circle[0]
            if hs is not self:
                return hs.messageQueue.push('exit')

    def messageLoop(self):
        while 1:
            message = yield self.messageQueue.pop()
            if message == "exit":
                debugPrint("%s is going home" % self.name)
                return
            debugPrint("%s got hackeysack from %s" % (self.name, message.name))
            kickTo = self.circle[random.randint(0, len(self.circle)-1)]
            debugPrint("%s kicking hackeysack to %s" % (self.name, kickTo.name))
            yield self.incrementCounter()
            yield kickTo.messageQueue.push(self)

def debugPrint(x):
    if debug:
        print x

turns = 5

def runit(hs=10000, ts=1000, dbg=1):
    global hackysackers, turns, debug
    hackysackers = hs
    turns = ts
    debug = dbg

    hackysacker.counter = 0
    circle = []
    one = hackysacker('1', circle)

    for i in range(hackysackers):
        hackysacker(str(i), circle)

    def main():
        yield one.messageQueue.push(one)

    scheduler.install(main())
    scheduler.run()

if __name__ == "__main__":
    runit()

Anonymous said...


Your Kamaelia link requires a www. in front.


Simon Wittber said...


mono said...

Shouldn't you disable debugging for the Kamaelia test as well?

Simon Wittber said...

The Kamaelia code was modified and has no debugging statements in it, so there is nothing to turn off.

René Dudfield said...

sounds pretty good :)

I don't know how fibra works... so to help understand your code... could you please add some comments?

I wonder how it would compare to a threaded version?


Simon Wittber said...

I'm not really sure how the hackysack thing works either, I just copied the stackless example and replaced the threaded pieces with fibra pieces.

A threaded version will crash your computer. 10000 OS level threads is simply too much! :-)

René Dudfield said...

ah oh well. I'm sure python threads would be waaay slower... it'd just be interesting to compare them :)

well, you can use 10,000 on some systems if configured correctly... even 100,000

eg. 10,000 python threads on ubuntu...

ulimit -s 300;python -c "import pygame.threads;pygame.threads.init(10000);pygame.threads.tmap(lambda x:x, range(10000))";ulimit -s 8096

Simon Wittber said...

Cool! I didn't know you could do that.

On my Mac, I had to limit it to 1000 threads.

Threads = 830 msec
Fibra = 39.8 msec
Kamaelia = 324 msec
Stackless = 9.8 msec

Hmm this is interesting. I might graph the scaling properties of each library.

Michael said...

Hi Simon,

FYI, I've blogged about this, and why I think Fibra is so much quicker in this case. Hackysack is a pretty unnatural scenario for any concurrency system IMO, but that doesn't stop it being an interesting benchmark, and doesn't stop your results being really cool :-)

I think the key difference really is that you've focussed on a better scheduler, and schedule generally passive things, whereas we have a (deliberately) simple scheduler and schedule generally active things.

Whatever the reason though, nice work! :)

Regards, Michael

Simon Wittber said...

Greetings Michael.

Thanks for the response post, but I can't seem to leave a comment at your blog in Safari?

The behavior you observed in the Fibra scheduler comes from the handler classes. The core scheduler degenerates into a round-robin scheduler when no handlers are used. Using a different handler for each yielded type allows Fibra to implement specialised scheduling rules when a task yields control.

For example, if a task is waiting for a value to arrive via its tube, the tube handler will remove the task from the main schedule (swaps the task out) until a value arrives in the tube. The sleep handler does something similar. This keeps the number of active tasks to a minimum.
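A toy illustration of this swap-out behaviour may help. Everything below is invented for the sketch (Tube, Pop, Push, Scheduler are not Fibra's real API): tasks that yield a pop request on an empty tube are parked on the tube rather than rescheduled, and a later push wakes exactly one of them, so the round-robin queue only ever holds runnable tasks.

```python
from collections import deque

class Tube:
    """A channel: pop parks the calling task until a value arrives."""
    def __init__(self):
        self.values = deque()    # buffered values
        self.waiters = deque()   # tasks swapped out, waiting on pop

class Pop:
    def __init__(self, tube):
        self.tube = tube

class Push:
    def __init__(self, tube, value):
        self.tube, self.value = tube, value

class Scheduler:
    """Round-robin core; yielded request types get special handling."""
    def __init__(self):
        self.active = deque()    # (task, value to send on resume)

    def install(self, task):
        self.active.append((task, None))

    def run(self):
        while self.active:
            task, value = self.active.popleft()
            try:
                request = task.send(value)
            except StopIteration:
                continue
            if isinstance(request, Pop):       # tube handler: pop
                if request.tube.values:
                    self.active.append((task, request.tube.values.popleft()))
                else:
                    request.tube.waiters.append(task)  # parked, not rescheduled
            elif isinstance(request, Push):    # tube handler: push
                if request.tube.waiters:       # wake a parked task
                    self.active.append((request.tube.waiters.popleft(),
                                        request.value))
                else:
                    request.tube.values.append(request.value)
                self.active.append((task, None))
            else:                              # plain yield: round-robin
                self.active.append((task, None))

# A producer/consumer pair: the consumer only runs when data exists.
sched, tube, log = Scheduler(), Tube(), []

def consumer():
    while True:
        v = yield Pop(tube)
        if v is None:
            return
        log.append(v)

def producer():
    for i in range(3):
        yield Push(tube, i)
    yield Push(tube, None)   # tell the consumer to stop

sched.install(consumer())
sched.install(producer())
sched.run()
```

After `run()` returns, `log` holds the three produced values; the consumer was never resumed while its tube was empty.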

Michael said...

I'm not sure what's up with comments on my blog - I use dojo to enable wysiwyg editing, and the version used there doesn't always do the right thing, so rather than discriminate against Safari users I've switched them off completely for the moment.

I'll have to delve into why Kamaelia is being so much slower here in more detail, but overheads inside the scheduler are my biggest suspicion. Since you're running 10x faster, that really shows Kamaelia is doing __really__ badly, and that really sucks ... and thanks :) for the heads up on just how badly it sucks. :-/

Your graphs show that as well. Regarding the debugging comment above, I doubt that's a major factor (could be, but I doubt it's *that* much of a factor).

Simon Wittber said...

Right, in case I wasn't being exactly clear, I think the difference in performance is likely due to the implementation of the Tube, rather than the core scheduler algorithm. Though, in the case of Fibra, the Tube wouldn't work without the scheduler...

As far as the debugging flag is concerned, I think all the debugging code was stripped from the Kamaelia test case, so the Kamaelia test has an extremely tiny advantage in this regard.

Unknown said...

Do you plan on integrating the old network module into the new version of fibra ?

Anonymous said...

Thanks for running these tests. I was wondering how Parallel Python compares when added to the mix, understanding that it's more oriented towards SMP than coroutines.

Anonymous said...

self.circle[random.randint(0,len(self.circle)-1)] should be written as random.choice(self.circle)
