Re-thinking high-availability

Virtualization and cloud computing have brought new approaches to computer architecture and data center operation, so I believe a brief review of key availability concepts, and of how they apply to cloud computing, is worthwhile. Are the strategies adopted years ago still valid today?

Usually, availability is measured by a combination of two metrics:

                MTBF – Mean Time between Failures

                MTTR – Mean Time To Repair

These two metrics describe how many times a system fails in a period of time and how long, on average, it takes to get back online. Using them, overall availability is given by this formula:

                D = MTBF / (MTBF + MTTR)

Where D is availability (e.g. 99.997%)
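
As a quick illustration, here is a minimal sketch of that calculation in Python; the MTBF and MTTR figures are invented for the example:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Overall availability D = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical server: fails once every 10,000 hours on average and
# takes 30 minutes to get back online.
d = availability(mtbf_hours=10_000, mttr_hours=0.5)
print(f"{d:.5%}")  # ~99.99500%
```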

So, to reason about availability we must treat the concept in layers, as part of the system organization. The strategy is usually sliced into five layers:

  • Electronic Components: The first layer covers the elementary hardware components from an MTBF perspective, i.e. the quality and endurance of hard disks, memory, CPUs, power supplies, etc.
  • Data Protection: Mechanisms and algorithms embedded in the hardware architecture that guarantee the coherency, consistency and integrity of data as it flows through components, such as a value generated by the CPU being stored in cache and kept coherent with neighbouring caches and memory, the capacity to recover from bit flips, and disk integrity and durability;
  • Component Redundancy: To prevent the failure of a single hardware component from compromising the system, adding redundancy for the lower-MTBF components can improve the availability of the entire system: redundant power supplies, RAID configurations for hard disks, redundant NICs, memory sparing technology, etc. (see the sketch after this list);
  • Server Redundancy: Redundancy of the entire server, with the technology appropriate to the system: a failover cluster for scale-up applications or load balancers for scale-out applications, to make the environment more reliable in case of a server failure;
  • Disaster Recovery: Assuming that a major disaster can happen, this layer is designed to address a data center disaster, or at least to provide a documented procedure to restore the entire system.
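
To make the Component Redundancy layer concrete, here is a minimal sketch of the standard parallel-redundancy estimate, assuming independent failures; the 99.9% figure is invented for the example:

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of N redundant copies: the system is down only
    when every copy is down at the same time (independent failures)."""
    return 1 - (1 - component_availability) ** copies

# Hypothetical power supply with 99.9% availability on its own.
single = 0.999
print(f"single supply:  {single:.4%}")                            # 99.9000%
print(f"redundant pair: {parallel_availability(single, 2):.4%}")  # 99.9999%
```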

With virtualization at the base of cloud computing strategies, embedding failover capabilities and the scalability that comes from sharing compute resources, server redundancy effectively becomes a built-in feature of a cloud infrastructure.

If you look at each layer of availability, you will find that for most enterprise-class servers the components used to assemble the machine already have a high MTBF at no extra cost, thanks to improvements in manufacturing processes and the scale of the computer industry. Basic data protection is present at the CPU and memory level, but anything more advanced, and any redundancy, adds to the bottom-line cost.

Let’s take a moment and re-think the approach. If you have a stateless application, you only lose the active sessions on the failing machine; the application itself remains available. For a stateful application you don’t just lose the active sessions, you are also penalized by the MTTR, but that penalty is predictable and can be planned for. Accepting the time it takes to boot the guest machines and services on one of the remaining servers in the pool, and redirecting the investment from RAS features into more servers, could be a fair approach that still provides good response times. And as a bonus, you can use the extra machines to absorb peak demand when necessary and ease your budget.
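
As a rough back-of-the-envelope comparison, here is a sketch of that trade-off using the same availability formula as above; every MTBF and MTTR figure is invented purely for illustration:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical scenario A: RAS-heavy server; failures are rare and
# repaired quickly thanks to redundant components.
ras_heavy = availability(mtbf_hours=50_000, mttr_hours=0.5)

# Hypothetical scenario B: commodity server in a pool; failures are more
# frequent, and "repair" is the time to boot the guests on another host.
pooled = availability(mtbf_hours=20_000, mttr_hours=0.25)

print(f"RAS-heavy: {ras_heavy:.5%}")  # ~99.99900%
print(f"pooled:    {pooled:.5%}")     # ~99.99875%
```

The point is not the exact numbers, but that a predictable, short MTTR can keep overall availability in the same range while the money saved on RAS buys extra capacity.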

It’s for these reasons that some manufacturers are shipping machines that share components instead of putting them in redundant configurations, such as twin servers sharing the same power supply, or blade servers that share I/O through the enclosure.

Sometimes we have to review our values in order to stay competitive. I’m pretty sure that, seven years ago, predicting this kind of demand and these ways of achieving better availability with lower CapEx would have been considered insane… It might be time to rethink your strategy.

Best Regards!