System Design Series — Availability
All about building robust, functional, and scalable systems.
In the previous writeup we learned about latency and throughput. In this writeup we will learn about system availability. You can interpret availability in a couple of ways. One of them is how resistant your system is to failures: what happens when your authentication server goes down, or when your database server goes down? The property that covers these cases is called “Fault Tolerance”, which is the ability of the system to keep working when some of its components fail.
Availability can also be understood as the percentage of time in a year during which your system is operational enough that all its primary functions work. Most systems today are expected to be highly available, and for system designers availability is one of the primary goals while designing a system.
Take medium.com as an example. The site hosts writeups, blogs, and publications. What if it goes down for a few hours? The world is not going to end. But if aeroplane support software (the software that keeps airplanes operational during flight) goes down for even a few minutes, hundreds of lives are at stake, which is completely unacceptable. Likewise, if YouTube went down for hours it would cause chaos, because hundreds of millions of people use it every day.
So let us come to the real question: what does it mean for a system to be available? We typically measure availability as the percentage of time the system is up in a given year. In the industry most systems target very high availability, so we end up expressing availability not as a raw percentage but in “nines”.
If the system is online for 99% of the year, we say its availability is two nines; if it is online for 99.9% of the year, its availability is three nines, and so on. The chart from Wikipedia lists the downtime that corresponds to each availability level, in days and hours.
In the two nines case the system can be down for 3.65 days a year, which is unacceptable for most businesses today, and even more unacceptable if your system supports life-and-death scenarios.
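The arithmetic behind the nines is easy to check yourself. Here is a minimal sketch in Python, assuming a 365-day year; the function name downtime_hours_per_year is just an illustrative choice, not something from the chart.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year

def downtime_hours_per_year(nines: int) -> float:
    """Allowed downtime in hours per year for a given number of nines.

    Two nines means 99% uptime, so the system may be down 1% of the year;
    each extra nine divides the allowed downtime by ten.
    """
    downtime_fraction = 10 ** (-nines)
    return downtime_fraction * HOURS_PER_YEAR

if __name__ == "__main__":
    for n in range(2, 6):
        hours = downtime_hours_per_year(n)
        print(f"{n} nines: {hours:.4f} hours/year ({hours / 24:.4f} days)")
```

For two nines this gives 87.6 hours, which is the 3.65 days mentioned above. For five nines it comes to a little over five minutes a year.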
Nowadays five nines is often regarded as the gold standard for availability, and it is a common target when designing a system. Many service providers have something called an SLA (service level agreement), which is an agreement between a service provider and the customers or end users of the system. In an explicitly written SLA the provider tells customers, hey, we guarantee this much service availability. SLO is often confused with SLA; it is related to SLA but it is not the same. SLO stands for service level objective, and you can think of SLOs as the components of an SLA. If I provide a service for you and I promise a percentage of system uptime, that uptime percentage is an SLO. Major cloud service providers like GCP, AWS, and Azure publish clear-cut SLAs on the product pages of their websites.
Having high availability can come with a tradeoff, such as higher latency or lower throughput. Not every part of the system needs to be highly available, but some parts, like the payment service or the authentication service, must be.
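To make the idea of an uptime SLO concrete, here is a small sketch that checks measured downtime against an uptime target. The function name, the 30-day period, and the 99.9% target are assumptions for illustration and are not taken from any real provider's SLA.

```python
def meets_uptime_slo(downtime_minutes: float,
                     period_minutes: float,
                     slo_percent: float) -> bool:
    """Return True if the measured uptime percentage meets the SLO."""
    uptime_percent = 100 * (1 - downtime_minutes / period_minutes)
    return uptime_percent >= slo_percent

if __name__ == "__main__":
    thirty_days = 30 * 24 * 60  # hypothetical measurement period, in minutes
    # A hypothetical 99.9% SLO allows about 43.2 minutes of downtime in 30 days.
    print(meets_uptime_slo(30, thirty_days, 99.9))  # 30 minutes down
    print(meets_uptime_slo(60, thirty_days, 99.9))  # 60 minutes down
```

With these numbers, 30 minutes of downtime stays within the objective and 60 minutes breaks it.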
How do we increase the availability of the system?
A system shouldn't have a single point of failure, that is, a single component whose failure causes the whole system to go down. This is achieved through “Redundancy”, which means duplicating or replicating certain parts of the system so that no single one of them can take everything down. Let us assume a simple client-server application where a client sends a request to the server and the server communicates with the database server. Here the server is a single point of failure. So we make the server redundant: we duplicate the server and add a load balancer to distribute traffic evenly across the copies. But now the load balancer itself is the single point of failure, so we have to duplicate the load balancer as well (discussed in the load balancer writeup). There are two types of redundancy, active and passive.
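The redundant-server setup can be sketched in a few lines of Python. This is a toy round-robin load balancer, not how any production load balancer is implemented; the class name, the method names, and the server labels are all made up for illustration. The point it shows is that traffic keeps flowing when one server is marked as down.

```python
class RoundRobinBalancer:
    """Toy load balancer: rotates over servers and skips unhealthy ones."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0  # index of the next server to try

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        """Return the next healthy server, trying each server at most once."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")

if __name__ == "__main__":
    lb = RoundRobinBalancer(["server-a", "server-b", "server-c"])
    print([lb.pick() for _ in range(3)])
    lb.mark_down("server-b")  # simulate a failure
    print([lb.pick() for _ in range(4)])  # server-b is skipped
```

Note that this balancer object is itself a single instance, which is exactly the problem described above: in a real deployment it would need to be duplicated too.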
With active redundancy the redundant elements run at the same time, so an element can fail, be repaired, or be substituted with minimal disruption of system performance. Passive redundancy is the application of redundant elements only when an active element fails (e.g., using a spare tire on a vehicle in the event of a flat tire).
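Passive redundancy is the spare tire case: the standby does nothing until the active element fails. Here is a minimal sketch of that idea, assuming each element is just a Python callable and that a failure shows up as a ConnectionError. The PassiveFailover class and the two handler functions are hypothetical names used only for this example.

```python
class PassiveFailover:
    """Sends requests to the active element; swaps in the standby on failure."""

    def __init__(self, primary, standby):
        self.active = primary
        self.standby = standby

    def handle(self, request):
        try:
            return self.active(request)
        except ConnectionError:
            # The active element failed: promote the standby and retry once.
            self.active, self.standby = self.standby, self.active
            return self.active(request)

def broken_primary(request):
    # Hypothetical element that is always down.
    raise ConnectionError("primary is down")

def healthy_standby(request):
    return f"standby handled {request}"

if __name__ == "__main__":
    service = PassiveFailover(broken_primary, healthy_standby)
    print(service.handle("GET /profile"))
```

An active setup would instead keep every replica serving traffic all the time, so a failure only reduces capacity rather than triggering a switch.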
You can find me on LinkedIn — https://www.linkedin.com/in/connectayush
Thank you :)