Monday, May 03, 2010

Consistent Hashing in Python, Redux.

import hashlib
import bisect


class ConsistentHash(object):
    def __init__(self, D=5):
        # Sorted list of (partition, node) pairs forming the ring.
        self.keyspace = []
        # Number of virtual replica points per node.
        self.D = D

    def partition(self, arg):
        # Map an arbitrary value to a point on the ring using the
        # first 4 hex digits (16 bits) of its MD5 digest.
        md5 = hashlib.md5(str(arg)).hexdigest()
        return int(md5[:4], 16)

    def add(self, hash):
        # Insert D replica points for this node, keeping the ring sorted.
        for i in xrange(self.D):
            k = self.partition("%s.%s" % (hash, i))
            bisect.insort(self.keyspace, (k, hash))

    def remove(self, hash):
        # Drop every replica point belonging to this node.
        self.keyspace = [i for i in self.keyspace if i[1] != hash]

    def __getitem__(self, i):
        # Index into the ring with wraparound.
        return self.keyspace[i % len(self.keyspace)][1]

    def hash(self, key):
        # Locate the key's position on the ring and return the node
        # owning the point just before it, wrapping past zero.
        p = self.partition(key)
        i = bisect.bisect_left(self.keyspace, (p, None))
        return self[i - 1]
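Since the listing above targets Python 2 (`xrange`, hashing a `str` directly), here is a sketch of a Python 3 rendition with a short usage demo; the node names and sample keys are made up for illustration. It also exercises the property that makes the hashing "consistent": removing one node only remaps the keys that lived on it.

```python
import hashlib
import bisect


class ConsistentHash:
    """Python 3 port of the ring above; the behavior is unchanged."""

    def __init__(self, D=5):
        self.keyspace = []  # sorted (partition, node) pairs
        self.D = D          # virtual replica points per node

    def partition(self, arg):
        md5 = hashlib.md5(str(arg).encode()).hexdigest()
        return int(md5[:4], 16)

    def add(self, node):
        for i in range(self.D):
            k = self.partition("%s.%s" % (node, i))
            bisect.insort(self.keyspace, (k, node))

    def remove(self, node):
        self.keyspace = [p for p in self.keyspace if p[1] != node]

    def __getitem__(self, i):
        return self.keyspace[i % len(self.keyspace)][1]

    def hash(self, key):
        p = self.partition(key)
        # "" sorts before any node name, like None did in Python 2.
        i = bisect.bisect_left(self.keyspace, (p, ""))
        return self[i - 1]


ring = ConsistentHash(D=50)
for node in ("a", "b", "c"):
    ring.add(node)

keys = ["key-%d" % n for n in range(1000)]
before = {k: ring.hash(k) for k in keys}

# Removing one node only remaps the keys it owned; every other
# key keeps its previous assignment.
ring.remove("b")
after = {k: ring.hash(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
assert all(before[k] == "b" for k in moved)
```

The `(p, "")` probe works because keyspace entries are tuples: when a key's partition collides exactly with a ring point, the empty string sorts before any node name, so the lookup lands in the same place the Python 2 version's `(p, None)` did.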

2 comments:

Jim Wilcoxson said...

Very beautiful! A few suggestions:

- if even distribution is important, the number of replicas must be very high, like 4K, 8K, or even higher. For memcached, I think they use 160 replicas, and other sites have recommended 100-200. I think a default of 5 is a little misleading, although for millions of keys, it may not be important

- instead of using only 4 hex digits (16 bits) of the hash, you may want to use 8 or 16, especially if using a high number of replicas to get a more even distribution

- I think there may be an edge case problem with bisect. See:
http://michaelnielsen.org/blog/consistent-hashing/
When I compared your version with Michael's, both using the same number of hash bits, yours required many more replicas to get a similarly even distribution of keys
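To make the first two points concrete, here is a small measurement sketch (the node names and key counts are made up): it buckets sample keys per node at a low and a high replica count so the spread can be compared, and it shows that truncating to 4 hex digits caps the ring at 16**4 = 65536 distinct positions.

```python
import hashlib
import bisect
from collections import Counter


def partition(arg, digits=4):
    # Truncating to `digits` hex characters caps the ring at
    # 16**digits positions: 65536 for 4, about 4.3 billion for 8.
    return int(hashlib.md5(str(arg).encode()).hexdigest()[:digits], 16)


def build_ring(nodes, replicas):
    ring = []
    for node in nodes:
        for i in range(replicas):
            bisect.insort(ring, (partition("%s.%s" % (node, i)), node))
    return ring


def lookup(ring, key):
    i = bisect.bisect_left(ring, (partition(key), ""))
    return ring[i - 1][1]  # negative index wraps around the ring


nodes = ("a", "b", "c")
keys = ["key-%d" % n for n in range(3000)]

for replicas in (5, 200):
    ring = build_ring(nodes, replicas)
    counts = Counter(lookup(ring, k) for k in keys)
    print(replicas, dict(counts))
```

With only a handful of replicas per node the per-node counts tend to be visibly lopsided; raising the replica count smooths the arcs each node owns, which is the effect the comment describes.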

Mike said...

Could you explain in layman's terms what each function does please? I'd really appreciate that! Thanks
