High Availability vs Fault Tolerance vs Disaster Recovery

This guide compares three important concepts: high availability, fault tolerance, and disaster recovery.

A high availability design keeps a system operational for a very high percentage of the time. High availability systems are a good fit for applications that must be restored quickly but can withstand a short interruption during a failure. The design includes redundant components as well as mechanisms for failure detection and workload redirection. These can be achieved using load balancers and Auto Scaling groups.
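Failure detection and workload redirection can be sketched in a few lines. The example below is a minimal illustration, not how any particular load balancer is implemented: the LoadBalancer class, the backend names, and the dictionary-based health check are all hypothetical stand-ins for a real health probe.

```python
from itertools import cycle
from typing import Callable, Iterable, Optional


class LoadBalancer:
    """Round-robin balancer that skips backends failing a health check.

    Hypothetical sketch: a real balancer would probe backends over the
    network instead of calling an in-process function.
    """

    def __init__(self, backends: Iterable[str], is_healthy: Callable[[str], bool]):
        self._backends = list(backends)
        self._is_healthy = is_healthy
        self._ring = cycle(self._backends)

    def pick(self) -> Optional[str]:
        # Try each backend at most once per call so we never loop forever.
        for _ in range(len(self._backends)):
            candidate = next(self._ring)
            if self._is_healthy(candidate):
                return candidate
        return None  # every redundant component is down


# Simulated health state for two redundant application servers (assumed names).
health = {"app-1": True, "app-2": True}
balancer = LoadBalancer(["app-1", "app-2"], lambda name: health[name])

first = balancer.pick()       # served by app-1
health["app-1"] = False       # failure is detected by the health check
redirected = balancer.pick()  # traffic is redirected to app-2
```

Requests already in flight on the failed backend are still lost in this model, which is the short interruption a highly available design accepts.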

A fault tolerant design keeps the system up and working even when components fail. A fault tolerant system must be highly available as well. The difference is that a fault tolerant system has no service interruption but a higher cost, while a highly available environment accepts a minimal service interruption. During a failure, a fault tolerant system stays available, possibly with degraded performance. It typically consists of two tightly coupled components that mirror each other, so if the primary component goes down, the secondary component is ready to take over. This design is useful for critical applications that require zero downtime.
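The mirrored pair can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions: the Replica and MirroredPair classes are hypothetical, writes are mirrored synchronously, and resynchronizing a replica after it comes back is left out.

```python
class Replica:
    """One of two mirrored components holding a full copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

    def write(self, key, value):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

    def read(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


class MirroredPair:
    """Sends every write to both replicas and reads from the first live one."""

    def __init__(self, primary, secondary):
        self.replicas = [primary, secondary]

    def write(self, key, value):
        acknowledged = 0
        for replica in self.replicas:
            try:
                replica.write(key, value)
                acknowledged += 1
            except ConnectionError:
                continue  # the surviving mirror keeps the service running
        if acknowledged == 0:
            raise RuntimeError("both replicas are down")

    def read(self, key):
        for replica in self.replicas:
            try:
                return replica.read(key)
            except ConnectionError:
                continue  # fall through to the mirror
        raise RuntimeError("both replicas are down")


primary, secondary = Replica("primary"), Replica("secondary")
pair = MirroredPair(primary, secondary)
pair.write("order-1", "paid")
primary.alive = False            # the primary component goes down
status = pair.read("order-1")    # the secondary answers without interruption
```

Because the secondary already holds every write, the caller sees no interruption when the primary fails. The cost is that every component is duplicated and every write is performed twice.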

Disaster recovery ensures that when damage is beyond repair, the system can preserve key data and bring servers back up in the same state. A disaster can be the failure of individual components or of the entire physical infrastructure.

Disaster recovery involves a set of policies, tools, and procedures that enable the recovery of systems and infrastructure. It requires a secondary location where critical data and workloads can be restored after a disruptive event. A disaster recovery solution goes one level beyond high availability and fault tolerance while still building on both designs. Disaster recovery design has two important objectives: Recovery Point Objective (RPO) and Recovery Time Objective (RTO).

Recovery Point Objective, or RPO, is the maximum acceptable time between the creation of the last recoverable copy of key business data and the moment a disaster occurs. In other words, it is the amount of recent data the business can afford to lose. It can be reduced by taking frequent backups and snapshots and by keeping transaction logs.
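As a worked example, the data actually lost in a disaster is the gap between the disaster and the most recent recoverable copy, and that gap can be checked against the RPO. The function name, timestamps, and the four-hour objective below are illustrative assumptions.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, disaster_time):
    """Return the time between the last usable backup and the disaster."""
    usable = [t for t in backup_times if t <= disaster_time]
    if not usable:
        raise ValueError("no recoverable copy exists before the disaster")
    return disaster_time - max(usable)


# Hypothetical schedule: a backup every six hours.
backups = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 6, 0),
    datetime(2024, 1, 1, 12, 0),
]
disaster = datetime(2024, 1, 1, 14, 30)
rpo = timedelta(hours=4)  # assumed business objective

lost = data_loss_window(backups, disaster)  # 2 hours 30 minutes of data
within_rpo = lost <= rpo
```

With a six-hour backup interval the worst case is a disaster just before the next backup, which loses almost six hours of data. To guarantee a four-hour RPO under this schedule, the interval between backups would have to be four hours or less.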

Recovery Time Objective, or RTO, is the maximum acceptable time between the moment a disaster occurs and the moment the system is restored to an operational state and handed back to the business. It can be improved by keeping spare hardware and ready-to-use components on standby.
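The effect of spare hardware on recovery time can be shown with simple arithmetic: total recovery time is the sum of the recovery steps, and standby components shorten the provisioning step. The step names and all durations below are made-up figures for illustration, as is the four-hour objective.

```python
from datetime import timedelta
from typing import Mapping


def total_recovery_time(steps: Mapping[str, timedelta]) -> timedelta:
    """Add up the duration of each recovery step."""
    return sum(steps.values(), timedelta())


# Hypothetical recovery plan with no spare hardware on hand.
cold_site = {
    "detect": timedelta(minutes=15),
    "provision_hardware": timedelta(hours=4),
    "restore_data": timedelta(hours=2),
    "verify": timedelta(minutes=30),
}

# Same plan, but with spare hardware ready to use (assumed 10-minute start).
warm_site = dict(cold_site, provision_hardware=timedelta(minutes=10))

rto = timedelta(hours=4)  # assumed business objective

cold_total = total_recovery_time(cold_site)  # 6 hours 45 minutes
warm_total = total_recovery_time(warm_site)  # 2 hours 55 minutes
cold_meets_rto = cold_total <= rto
warm_meets_rto = warm_total <= rto
```

In this example only the plan with standby hardware fits inside the objective, because provisioning dominates the cold recovery.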