Can your business afford downtime?
In recent years, organizations have been moving their business-critical services and infrastructure to the cloud. Cloud solutions are cost-efficient and flexible, but the question remains: are they reliable?
Downtime has always been one of the major disadvantages of cloud computing. Even a Service Level Agreement of 99.99% availability may be too low for critical services: it still allows close to an hour of downtime per year, and that single hour may cost your business dearly.
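To see what an availability percentage means in practice, it helps to convert it into a yearly downtime budget. The short Python sketch below does that arithmetic. The function name and the 365-day year are assumptions made for illustration, and real SLAs often measure availability per month and exclude planned maintenance.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, that an SLA still permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Compare a few common availability levels.
    for sla in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% availability allows {minutes:.1f} minutes of downtime per year")
```

At 99.99%, the budget works out to roughly 52.6 minutes per year, which is the "single hour" at stake.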
Common problems experienced
Every activity that depends on computer systems, big or small, will at some point face a computer failure.
The unavailability of a critical application can be due to three types of problems:
- Hardware and environment, including the complete failure of a computer room (20% of problems)
- Software: regression on software update, overloaded service, software bug (40% of problems)
- Human errors: administration error and inability to properly restart a critical service (40% of problems)
How to solve the issue
The first rule is to keep your solution simple. A complex high availability system can itself lead to further downtime.
For hardware and environment failures in the cloud, redundancy is required: at least two Virtual Machines (VMs) must be available to run a critical application. The VMs must be placed in different availability zones (isolated locations) of the cloud provider so that the application survives the complete failure of one location. Your high availability solution must be able to:
- detect the failure of one VM
- restart the critical application on the other VM
- replicate the application's data in real time from one VM to the other, so that no data is lost in case of failure (synchronous real-time replication)
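The three requirements above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions, not how any particular product works: the `VM` and `TwoVMCluster` classes, the heartbeat timeout, and the zone names are all hypothetical, time is passed in explicitly so the example stays deterministic, and both VMs are assumed reachable when a write happens.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    """Hypothetical stand-in for a cloud VM hosting the critical application."""
    name: str
    zone: str
    running_app: bool = False
    last_heartbeat: float = 0.0
    data: list = field(default_factory=list)


class TwoVMCluster:
    """Minimal sketch: failure detection, failover, synchronous replication."""

    def __init__(self, primary: VM, secondary: VM, heartbeat_timeout: float = 3.0):
        if primary.zone == secondary.zone:
            raise ValueError("place the two VMs in different availability zones")
        self.primary, self.secondary = primary, secondary
        self.heartbeat_timeout = heartbeat_timeout
        self.primary.running_app = True

    def is_alive(self, vm: VM, now: float) -> bool:
        # A VM is considered failed once its heartbeat is older than the timeout.
        return now - vm.last_heartbeat <= self.heartbeat_timeout

    def write(self, record) -> None:
        # Synchronous replication: the write only returns once both copies
        # exist (assumes both VMs are reachable; resynchronization not shown).
        self.primary.data.append(record)
        self.secondary.data.append(record)

    def check(self, now: float) -> bool:
        """Fail over if the primary is silent; return True when a failover ran."""
        if self.is_alive(self.primary, now):
            return False
        if not self.is_alive(self.secondary, now):
            return False  # no healthy VM left to take over
        self.primary.running_app = False
        self.primary, self.secondary = self.secondary, self.primary
        self.primary.running_app = True  # restart the application on the other VM
        return True


if __name__ == "__main__":
    vm_a, vm_b = VM("vm-a", "zone-1"), VM("vm-b", "zone-2")
    cluster = TwoVMCluster(vm_a, vm_b)
    cluster.write("order-1")
    vm_b.last_heartbeat = 5.0          # vm-a has been silent since t=0
    print("failover ran:", cluster.check(now=5.0))
    print("application now runs on:", cluster.primary.name)
    print("data available after failover:", cluster.primary.data)
```

Because every write is copied before it is acknowledged, the surviving VM already holds the data when it takes over.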
For software failures, your high availability solution must be able to:
- monitor the critical application with process checkers and custom checkers
- restart the application locally in case of failure
- switch to the other VM if the local restart keeps failing on a VM
- allow a smooth upgrade procedure, VM by VM, when applying fixes to the application, to avoid regressions on software updates
- implement load balancing to avoid failures of an overloaded application
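The "restart locally first, then switch" rule from the list above can be sketched in a few lines. The `supervise` function, the `FlakyApp` class, and the limit of three local restarts are assumptions chosen for illustration. In a real deployment, `is_healthy` would be backed by a process checker or a custom checker, and the `"failover"` result would trigger the switch to the other VM.

```python
from typing import Callable


def supervise(is_healthy: Callable[[], bool],
              restart: Callable[[], None],
              max_local_restarts: int = 3) -> str:
    """Return 'healthy', 'restarted' or 'failover' for one monitoring cycle."""
    if is_healthy():
        return "healthy"
    for _ in range(max_local_restarts):
        restart()                      # try to recover on the same VM first
        if is_healthy():
            return "restarted"
    return "failover"                  # local restarts exhausted: use the other VM


class FlakyApp:
    """Hypothetical application that recovers after a given number of restarts."""

    def __init__(self, restarts_needed: int):
        self.restarts_needed = restarts_needed
        self.restarts = 0

    def is_healthy(self) -> bool:
        return self.restarts >= self.restarts_needed

    def restart(self) -> None:
        self.restarts += 1


if __name__ == "__main__":
    for needed in (0, 2, 10):
        app = FlakyApp(needed)
        print(needed, "restart(s) needed:", supervise(app.is_healthy, app.restart))
```

Capping the number of local restarts keeps the decision simple: a VM that cannot bring the application back quickly hands over to its peer instead of retrying forever.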
For human errors, your solution must:
- ensure no special computer skills are required
- provide a very simple administration console to stop, start and switch the critical application between VMs
- ensure all procedures are automatic, with no manual operation: failover when there is a failure, but also failback when a failed server must rejoin the cluster
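To illustrate how small such an administration interface can be, here is a sketch of a console that exposes only three operations. The `ConsoleCluster` class and the VM names are hypothetical. A real console would also drive the automatic failover and failback procedures, which are not shown here.

```python
class ConsoleCluster:
    """Hypothetical two-VM cluster exposing only start, stop and switch."""

    def __init__(self, first_vm: str, second_vm: str):
        self.vms = (first_vm, second_vm)
        self.active = None  # name of the VM running the application, if any

    def start(self) -> str:
        """Start the application on the first VM unless it is already running."""
        if self.active is None:
            self.active = self.vms[0]
        return self.active

    def stop(self) -> None:
        """Stop the application wherever it is running."""
        self.active = None

    def switch(self) -> str:
        """Move the running application to the other VM."""
        if self.active is None:
            raise RuntimeError("application is stopped; start it first")
        self.active = self.vms[1] if self.active == self.vms[0] else self.vms[0]
        return self.active


if __name__ == "__main__":
    console = ConsoleCluster("vm-a", "vm-b")
    print("started on:", console.start())
    print("switched to:", console.switch())
    console.stop()
    print("running on:", console.active)
```

Keeping the operator's vocabulary to three verbs leaves little room for an administration error.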
The move to the cloud will continue apace. It must, if businesses are to take advantage of all the benefits digital transformation can bring. But unless you manage risk, your plans will fail.
Make sure you understand the business continuity plans that are in place for when a computer failure occurs.