Consistent Hashing for Distributed Cache Systems, the topic itself seems to be very heavy and confusing. Before jumping on to the article, let us break down the topic into simple terms and draw a conclusion on what this article aims to achieve. This is part-1 of the series in which build the intuition as to why consistent hashing is needed. In the next part, we will define Consistent Hashing for Distributed Cache Systems and write a simple go implementation.
Coming back to the topic, here is it’s break down.
Consistent : Uniform/invariable
Hashing: The transformation of a string of characters into a usually shorter fixed-length value.
Cache: A key-value store, given a key you can get the value.
Distributed System: System whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another.
Now let us try to stitch the pieces together and throw some sense into the topic.
What is a Distributed Cache System?
As already mentioned, a cache is nothing but a Key-Value store, you store a value against a key, and can later fetch it. It is just like the map data structure. (Going deeper into the cache is out of the scope of this article).
Now suppose you have a web application running and it uses a cache. But we are talking about distributed computing here, so your application is hosted across multiple servers. That means you have multiple servers running the same web code any request coming from client is routed to one of those servers by a load balancer.
Now take into account that your service is using a cache. Now since there are N servers which can use this key-value store, you cannot save the cache on any of those N servers. You need to have a dedicated machine (let us call it C) which stores this cache, and a proper semaphore (OS Concept) to avoid a race condition.
Now let us take into action one more scenario, suppose the key-value pair is huge, in millions or billions, do you think a single machine can save that much data? Also, there would be N servers making a constant request to the C (central cache), don’t you think too much traffic will cause undue latency?
To save the day, we make use of what is known as Distributed Caching. Instead of keeping the complete cache on one machine, we distribute the cache between (say) k machines. The complete cache (key-value) store is now divided into k parts, and each part is saved on one machine. Thus, none of the k machines are overloaded in terms of memory or terms of traffic. Sounds good, right??
No, we have one more problem to solve, given a key, the server should know which of the k machines holds that key. There has to be a way of getting to know precisely which of the k machines have the key.
Some algorithms/techniques answer the question, given the key, which of the cache machines holds that key. From now on we will be calling the cache machine as a cache server, so to reformulate the topic, Given a finite number of cache servers, and a key, we want to design an efficient way to determine which of the cache servers holds that key.
Reiterating the problem statement, we have N cache servers and given a key we need to locate the server which contains the key. The problem can also be rephrased as, given a key, which server to save the key to. Let us examine it. Let f(x) be the function which tells which server the key
x will be stored on. Suppose
save call comes for
key1 with value as
v1 . We calculate
f(key1) which gives us the server
N , and we save
v1 on that server
N . Again when the
fetch call comes for
key1 we calculate
f(key1) to get the server
N and fetch the value from the server. What we need to find here is the black box function
A very naive way of handling this would be to take the hash modulo of the number of servers. Let us define a hash function h(x), for any key x this function returns an integer hash value. Every time a key comes, we compute the hash, modulo it by the number of servers
N and whatever the result is we save the key, or try to fetch the key from that particular server.
Let us take an example, we have 4 cache servers (S0, S1, S2, S3), and some keys, let us see how the allocation occurs.
Key Hash Modulo Server k1 43 3 S3 k2 34 2 S2 k3 21 1 S1 k4 16 0 S0 k5 23 3 S3 k6 41 1 S1
Let us now suppose out of our 4 servers, one (S3) crashed, it not available anymore for fetching value from. We had a request for
k5 , now since our
N is 3 now, our formula gives us
23 mod 3 = 2 . We will look into S2, will result in a cache miss and the result will be fetched from the source and saved in the S2 now. Thus we see when a cache server is removed, some of the keys are fetched from the source and saved into one of the available servers. Let us try to see what happens with other keys.
Key Hash Modulo Server k1 43 1 S1 k2 34 1 S1 k3 21 0 S0 k4 16 1 S1 k5 23 2 S2 k6 41 2 S2
This table gives us a very useful insight. The server location for almost all the keys had to be changed. In an ideal world, we would have wanted that only the keys that were saved on the server that crashed (
S3) to be reshuffled, but here every key is going through reshuffling.
What does it cost?
As soon as a cache server is removed (or there is any change in the cache system), almost all the keys are reshuffled. Thus after any change
fetch for all the keys face a
cache miss and they have to be fetched from the source again. This increases the load on
Source . This problem is described as rehashing problem.
In our next article, we are going to study Consistent Hashing for Distributed Cache Systems which mitigates this rehashing problem and write a basic Golang implementation of the same.