A system that is a truly isolated, independent entity is rare. Most systems have dependencies on other systems. A classic example is a database. Almost any system that does something useful is backed by a database, which makes the database a dependency of the system. If something happens to the database, the system is compromised, unless of course you have built a fallback for that case. These fallbacks, the little hacks a designer adopts to keep the system up and running, are what make the system resilient.
This is a series focusing on designing resilient systems, and it primarily deals with the concept of the circuit breaker. This particular blog post is a theoretical introduction to circuit breakers and will be followed by a simple implementation of a circuit breaker in the Go programming language. But before moving on to circuit breakers, let us discuss a very simple example of a resilient system.
Resilience in Action
What, according to you, is the most essential part of a system? Undoubtedly, it is the part that interacts with the user. No matter how optimized your backend is or how scalable your system is, if the user experience is not good, nothing else matters. And this is where you see the most innovative ways to make a system resilient. Suppose there is a hypothetical app identical to Truecaller. Every time you receive a phone call, it shows you the name of the caller with a little photograph. Suppose the photograph is stored in an S3 bucket or on a CDN. What happens when the network is slow, or when S3 or the CDN is down?
Do you stop seeing the number? Or do you see an ugly “image not found” icon on your phone screen? The answer is no. Instead, the UI of the application shows you a placeholder image of a person.
The majority of the customers don’t even notice that something went wrong under the hood. That is a very basic example of a truly resilient system.
While designing a system one must remember that anything that can go wrong, will go wrong. It is impossible to design a system that never fails, but it is possible to design a system that takes care of the majority of those failures.
Let us also look at an example of fault tolerance on the backend. Suppose your system takes in two locations and fetches the distance between them using an external API. Now suppose the external API is down. You can build a fallback for this case: use basic trigonometry on the coordinates of the two locations to give an approximate result. This is better than crashing the application. Maintaining a cache of recently queried data is another example of fault-tolerant design. Even when connectivity to the database is down, you can serve the last queried data from the cache and return it to the end user with a simple message such as “Data last refreshed X minutes ago”.
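To make the distance fallback concrete, here is a minimal sketch in Go. It assumes the locations are latitude/longitude pairs and uses the haversine formula as the “basic trigonometry”. The names Point, DistanceFunc, Distance, and ErrAPIDown are made up for illustration, and the external API is represented by a plain function rather than any real client.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// Point is a location in decimal degrees (an assumption for this sketch).
type Point struct{ Lat, Lon float64 }

// ErrAPIDown is a hypothetical error returned when the external API is unreachable.
var ErrAPIDown = errors.New("distance API unavailable")

const earthRadiusKm = 6371.0

// haversineKm approximates the straight-line (great-circle) distance in km.
func haversineKm(a, b Point) float64 {
	rad := func(deg float64) float64 { return deg * math.Pi / 180 }
	dLat := rad(b.Lat - a.Lat)
	dLon := rad(b.Lon - a.Lon)
	h := math.Sin(dLat/2)*math.Sin(dLat/2) +
		math.Cos(rad(a.Lat))*math.Cos(rad(b.Lat))*math.Sin(dLon/2)*math.Sin(dLon/2)
	return 2 * earthRadiusKm * math.Asin(math.Sqrt(h))
}

// DistanceFunc stands in for the call to the external distance API.
type DistanceFunc func(a, b Point) (float64, error)

// Distance tries the external API first and falls back to the local
// approximation if the call fails. The second return value tells the
// caller whether the result is only approximate.
func Distance(api DistanceFunc, a, b Point) (km float64, approximate bool) {
	if d, err := api(a, b); err == nil {
		return d, false
	}
	return haversineKm(a, b), true
}

func main() {
	// Simulate an outage of the external API.
	down := func(a, b Point) (float64, error) { return 0, ErrAPIDown }
	km, approx := Distance(down, Point{0, 0}, Point{0, 1})
	fmt.Printf("%.1f km (approximate: %v)\n", km, approx)
}
```

The approximation is a straight-line distance, so it will usually be shorter than a road distance returned by a real API. Returning the approximate flag lets the caller tell the user that the number is an estimate.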
Introduction to Circuit Breakers
A circuit breaker is a household safety device that breaks the electrical circuit when something goes wrong with the electricity supply, so that even if the supply is compromised, the household appliances stay safe. This makes the whole electrical system more resilient. Adding a transfer switch to the system lets you draw supply from a generator instead, which is a sensible fallback mechanism. But how does that make sense in software design?
Let’s look at what a circuit looks like in the software world.
This is an example of a circuit in terms of software. The path from the point where the user calls the backend to the point where they receive the response is a circuit. The above picture, of course, is a circuit in the closed state. This is also called the happy path, where everything goes as it should. But what happens when the external service misbehaves? What happens when the external service is down? Well, when x% of at least y requests fail in a rolling window of t seconds, the circuit is opened. Please don’t panic about x, y, and t. We will cover them in detail later in the post.
When the external service is down, the circuit breaker opens and does not let you call the external service for a while. Well, what have we achieved with this? It looks like nothing: the end user still does not get a response to their request. So how do circuit breakers even help?
Let’s first look at what we achieved here:
- The external service is already down, so we should not waste more of our resources by making connection requests to it and waiting for a response. We can keep the circuit open for a brief time and then try again to see if the service is up. This saves a lot of resources that would otherwise be stuck waiting on a service that is down.
- Also, keeping traffic away from the external service for a while gives it time to recover.
- The user does not have to wait 10-15 seconds only to find that the request failed. The system immediately tells them that the request cannot be completed because the external service is down. (Swiggy does this when UPI is down: the corresponding payment option is automatically disabled.)
Circuit breakers are not just sanity checks, they are first-hand informers for you. They tell you if you can trust the external service at that particular time or not. And this information lets you implement a fallback mechanism for your system.
So we have seen how a fallback is triggered when the circuit is open. But what can we do in the fallback function? Remember the intro to this post: if you find that the circuit representing the database connection is open, the fallback is to fetch old data from the cache. It is better than no data at all.
When to open the circuit
When to open the circuit depends on the SLA between you and the external service. A few parameters control this decision, and they need to be configured carefully.
- Default Timeout: How long do you want to wait before declaring that a request has timed out? The latency of the external API adds to the latency of your system. A very short timeout increases the number of timeout errors you have to deal with, whereas a long one makes your service slow.
- Error Percentage Threshold: The minimum percentage of failed requests needed to open the circuit. A low value opens the circuit even when only a few requests are lost to network issues, whereas a very high value in some ways defeats the purpose of the circuit breaker.
- Request Volume Threshold: The minimum number of requests needed before deciding whether the circuit has to be opened. Let us see how it differs from the Error Percentage Threshold. Suppose you define that a 50% error rate opens the circuit. But 50% of what? That is where the Request Volume Threshold comes into the picture. If it is very low, say 2, even one failed request will open the circuit, whereas if it is very high, the circuit rarely opens and the purpose of the circuit breaker is again defeated.
- Sleep Window: A circuit cannot remain open forever. The Sleep Window is how long the circuit stays open before the connection to the external service is tested again.
What errors should open the circuit
One thing is for sure: not all errors should be considered when deciding whether to open the circuit. Errors originating at the user end, like Bad Request (HTTP 400) and Unauthorized (HTTP 401), should not be considered at all. Otherwise, a malicious user could send a lot of bad requests, force the circuit open, and disrupt the normal functioning of your app.
However, errors originating from the server, like Internal Server Error (HTTP 500) or Service Unavailable (HTTP 503), should contribute to the decision.
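This rule can be sketched as a small classifier in Go. The function name countsAsFailure and the errTimeout value are invented for this example. Treating transport-level errors such as timeouts as failures is an assumption on my part, though it fits with the timeout parameter discussed earlier.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// errTimeout is a hypothetical transport-level error used for the demo.
var errTimeout = errors.New("request timed out")

// countsAsFailure reports whether the outcome of a call should count
// towards opening the circuit. status is ignored when err is non-nil.
func countsAsFailure(status int, err error) bool {
	if err != nil {
		// No usable response at all (timeout, connection refused, ...).
		return true
	}
	// 5xx means the server failed; 4xx means the caller sent a bad request.
	return status >= http.StatusInternalServerError
}

func main() {
	statuses := []int{
		http.StatusOK,
		http.StatusBadRequest,
		http.StatusUnauthorized,
		http.StatusInternalServerError,
		http.StatusServiceUnavailable,
	}
	for _, s := range statuses {
		fmt.Println(s, http.StatusText(s), "counts:", countsAsFailure(s, nil))
	}
	fmt.Println("timeout counts:", countsAsFailure(0, errTimeout))
}
```

Only the outcomes for which this function returns true would be recorded as failures in the circuit breaker. Client errors are simply passed back to the caller.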
This article was just a brief introduction to the concept of circuit breakers and possible use cases. We will discuss more about implementing circuit breakers in upcoming posts.