Monday, June 15, 2009

Stackless vs GIL: It's a draw, cont.

I decided to double check my assumptions about Stackless and the GIL on a multicore machine, using the same example provided by David Beazley in his slides, which highlights the GIL contention issue. It's probably a better example than my previous test.

import stackless
import time
import threading

def count(n):
    while n > 0:
        n -= 1

T = time.clock()
count(100000000)
count(100000000)
print time.clock() - T

>>> 17.37
A normal sequential execution takes about 17 seconds.
T = time.clock()
t1 = threading.Thread(target=count,args=(100000000,))
t1.start()
t2 = threading.Thread(target=count,args=(100000000,))
t2.start()
t1.join(); t2.join()
print time.clock() - T

>>> 53.22
This confirms David's experiment. It takes much longer due to threads fighting over the GIL.
T = time.clock()
stackless.tasklet(count)(100000000)
stackless.tasklet(count)(100000000)
while stackless.getruncount() > 1:
    task = stackless.run(100)
    if task:
        task.insert()
print time.clock() - T

>>> 25.34
Twice as fast as the threaded solution, and roughly 50% slower than the sequential solution. Cool.

We can get much better results by coarsening the scheduling granularity, i.e. running more instructions between switches. If we raise the tick count from 100 to 1000, the Stackless solution takes 17.71 seconds and the threaded solution takes 22.25 seconds. This is interesting behavior, which I can't yet explain.
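For the threaded case, the equivalent knob is CPython's check interval, which controls how many bytecode instructions run between GIL release checks. A minimal sketch of raising it (using sys.setcheckinterval on Python 2; Python 3.2+ replaced it with sys.setswitchinterval, which takes seconds instead of ticks):

```python
import sys
import time
import threading

def count(n):
    while n > 0:
        n -= 1

# Raise the check interval: fewer GIL release checks, fewer forced switches.
if hasattr(sys, 'setcheckinterval'):   # Python 2 (and 3.x before 3.9): ticks
    sys.setcheckinterval(1000)
else:                                  # Python 3.9+: seconds between switches
    sys.setswitchinterval(0.05)

T = time.time()
t1 = threading.Thread(target=count, args=(10000000,))
t2 = threading.Thread(target=count, args=(10000000,))
t1.start(); t2.start()
t1.join(); t2.join()
print(time.time() - T)
```

The loop size here is smaller than in the post's benchmark, so the timings are not directly comparable; the point is only where the knob lives.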

5 comments:

ulrik said...

Here is something random, but... Python 2.6 also has multiprocessing. I wanted to try this, though I only have a single core, so it's a bit ridiculous I guess. And yes, the ranking comes out quite randomly, but multiprocessing can be faster than threading. How that changes once you add IPC is a big issue, of course.

Two runs:

$ python2.6 vsthread.py
Sequential: 12.8867890835
Threaded: 13.4743509293
Multiproc: 13.142551899
$ python2.6 vsthread.py
Sequential: 13.8414778709
Threaded: 13.215129137
Multiproc: 13.1815469265

Code snippet http://python.pastebin.com/m7e0e5273
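The pastebin link is no longer available; a rough reconstruction of the three-way comparison ulrik describes (my own sketch, not the original code) could look like:

```python
import time
from threading import Thread
from multiprocessing import Process

def count(n):
    while n > 0:
        n -= 1

def bench(label, workers):
    # Start all workers, wait for them, and report wall-clock time.
    T = time.time()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print('%s: %s' % (label, time.time() - T))

if __name__ == '__main__':
    N = 10000000
    T = time.time()
    count(N); count(N)
    print('Sequential: %s' % (time.time() - T))
    bench('Threaded', [Thread(target=count, args=(N,)) for _ in range(2)])
    # Processes sidestep the GIL entirely, so on multicore this should scale.
    bench('Multiproc', [Process(target=count, args=(N,)) for _ in range(2)])
```

On a single core all three should land close together, as in ulrik's runs; the multiprocessing win only shows up with a second core.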

Seun Osewa said...

I stumbled upon a 'trivial solution' to the slowdown on multicore systems. What do you think?

csenger said...

For Zope (a threaded server) benchmarks showed that as a rule-of-thumb pystones/50 is a good value for checkinterval. Today it's mostly a value > 1000.
Some details and links can be found here:
http://plone.org/products/jarn.checkinterval

Ben Sizer said...

Surely the increased granularity simply reduces the number of context switches (whether involving locks or not) which would lead to a predictable speed-up?

Catalin Festila said...

Nice article !
