Fire and brimstone coming down from the skies! Rivers and seas boiling! Dogs and cats living together… mass hysteria!
I’ve been in or around IT Operations for 30+ years, and I’ve seen some efficient shops and some inefficient shops. One of the things that continues to surprise me is the number of organizations with no idea what downtime truly costs them. A quick internet search turns up numerous studies on the top causes of downtime and its extravagant hourly cost, both tangible and intangible. Getting close to the mythical 100% up-time is expensive, no argument there. But without understanding the tangible and intangible costs of downtime to your organization, it’s nearly impossible to justify the investment needed to maintain and improve availability to the bean-counters of the world.
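To put numbers behind that business case, here’s a back-of-envelope sketch. Every figure in it – the revenue rate, the headcount, and the multiplier standing in for intangibles like reputation and churn – is a made-up placeholder; substitute your own organization’s numbers.

```python
# Back-of-envelope outage cost estimate. All figures are hypothetical
# placeholders -- plug in your own organization's numbers.

def downtime_cost(hours_down, revenue_per_hour, staff_idled,
                  loaded_hourly_rate, intangible_multiplier=1.25):
    """Tangible cost (lost revenue + idle labor), padded by a rough
    multiplier for intangibles such as reputation and customer churn."""
    lost_revenue = hours_down * revenue_per_hour
    idle_labor = hours_down * staff_idled * loaded_hourly_rate
    return (lost_revenue + idle_labor) * intangible_multiplier

# A hypothetical 4-hour order-entry outage:
print(downtime_cost(4, 50_000, 120, 45))  # 277000.0
```

Even with conservative inputs, a single multi-hour outage often lands in the same ballpark as the availability investment it would have justified.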
Availability is IT’s responsibility, but it can’t be IT’s decision alone: all business units of an organization should have input and agreement on which applications are mission critical, as well as on the cost of those applications not being available.
This also helps ensure application inter-dependencies are addressed. Recovering and getting a back-end order entry database up quickly doesn’t do much good if the front-end customer-facing web servers are still down. And no, the fact that Charlie in Accounting can’t get to the company’s team softball schedule doesn’t make it critical. There’s critical and then there’s inconvenient when it comes to downtime.
I categorize availability and the ability to recover into three types: Disaster Recovery (DR), Business Continuance (BC) and High Availability (HA).
The main difference besides cost is Time. A wise friend of mine used to say, “you can hire more personnel, you can buy more servers, storage, switches, bandwidth; but you can’t buy more time.”
Two key items related to time must be understood: the Recovery Time Objective (RTO) – how long it takes to recover my servers/DBs/apps – and the Recovery Point Objective (RPO) – how old or out-of-date is the data I’m recovering to.
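A quick way to internalize the two metrics: with periodic (asynchronous) data protection, worst-case RPO is simply the replication interval, while RTO is the whole clock from failure to verified service. The numbers in this sketch are illustrative assumptions, not recommended targets.

```python
def worst_case_rpo(replication_interval_min):
    """If data ships every N minutes, a failure just before the next
    cycle loses up to N minutes of data."""
    return replication_interval_min

def total_rto(detect_min, decide_min, restore_min, validate_min):
    """RTO covers everything: noticing the outage, deciding which plan
    to invoke, restoring systems, and verifying they actually work."""
    return detect_min + decide_min + restore_min + validate_min

print(worst_case_rpo(15))         # 15  -> up to 15 minutes of data lost
print(total_rto(10, 20, 90, 30))  # 150 -> 2.5 hours until back in business
```

Note that the restore step is often the smallest slice of the RTO clock; detection, decision-making and validation eat time too.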
Disaster Recovery (DR) – typically implies a partial or complete application(s) or data center outage that greatly affects the internal end-users and external customers.
Whether a local or remote recovery plan is in place, a decision must be made on which plan to implement. These decisions take time depending on the issue, plus you still have to bring systems up and working at the main site or the DR site. Tick, Tick, Tick …. And no, having Charlie in Accounting take home backup tapes is not a “good” DR plan. You laugh, but you’ve seen it. RTO can be measured in many hours or days.
Business Continuance (BC) – is less of a plan and more of a system or set process to follow to maintain business functionality.
Depending on the issue, recovery can range from a manual paper process to a simple server reboot to failing over to a warm site. BC implies manual intervention and possible loss of productivity, but some level of business functionality is maintained. For example, a company may have to take customer orders on paper and enter them later once the systems are up. Regardless of the scenario, RTO can be measured in minutes or hours.
High Availability (HA) – HA implies automated failover, with minimal or no manual intervention.
HA is an extension of BC by setting up resources to be fault tolerant. Yes, it’s more costly, but it provides the quickest recovery time and least business impact. RTO can be measured in seconds or minutes.
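One way to keep the three categories straight is to key them to the RTO an application actually requires. The cut-off values in this sketch are my own illustrative assumptions, not industry standards.

```python
def availability_tier(required_rto_min):
    """Map a required RTO (in minutes) to the categories above.
    Thresholds are illustrative assumptions, not standards."""
    if required_rto_min <= 5:
        return "HA"   # automated failover: seconds to minutes
    if required_rto_min <= 240:
        return "BC"   # manual intervention: minutes to hours
    return "DR"       # site-level recovery: hours to days

print(availability_tier(2))     # HA
print(availability_tier(60))    # BC
print(availability_tier(1440))  # DR
```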
I’m also a firm believer in the physical separation of servers (OSs, DBs, applications) along with their storage (data) whenever possible, whether in different racks, on different floors, across campus, across town or across the country. At a minimum, a secondary site is imperative for a truly effective DR, BC or HA environment. Regardless of the amount of redundancy in a data center, a single data center is a single point of failure. This brings up the cost topic again; that’s why it’s critical to understand the impact of downtime on your business and the risk of a disaster hitting your location. The cost of one outage could fund a DR, BC or HA environment.
So if I separate my stuff, how do I keep everything synchronized?
How do I ensure my data isn’t so far out of whack that it’s next to useless? The good news is there’s a virtual cornucopia of options available. Metro or stretch clustering of servers is mature and stable (though can you say split-brain?), and storage arrays and storage virtualization engines offer a variety of mirroring and replication features. Many of these features also move up the stack to the databases and applications. And don’t forget the third-party software vendors that focus solely on getting data from point A to point B effectively. It doesn’t take a degree in rocketology to realize this isn’t simple; there are lots of options and variables. Synchronous versus asynchronous mirroring, continuous data protection with journaling, WAN optimization, compression, de-dupe, encryption, bandwidth cost and limitations, and infrastructure are just a few of the many things to consider. For example, synchronous mirroring is a must for HA, but it has distance limitations and doesn’t protect you from logical corruption. Only you can decide what’s best for your organization and data.
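To see why synchronous mirroring pays a latency tax while asynchronous mirroring risks data loss, here’s a toy control-flow sketch. It is not a real replication protocol – just the shape of the trade-off.

```python
# Toy sketch of the sync-vs-async mirroring trade-off. Hypothetical
# names and logic; real replication engines are far more involved.
import queue

remote_copy = []              # what the secondary site holds
async_backlog = queue.Queue() # writes acknowledged but not yet shipped

def write_sync(block):
    """Synchronous mirror: the write isn't acknowledged until the
    remote copy is updated. Zero RPO, but every write waits out the
    round trip -- hence the distance limitations."""
    remote_copy.append(block)  # simulate waiting for the remote ack
    return "acked"

def write_async(block):
    """Asynchronous mirror: acknowledge locally, ship later. Fast and
    distance-tolerant, but anything still in the backlog is lost if
    the primary dies -- the backlog depth is your RPO."""
    async_backlog.put(block)
    return "acked-locally"

def ship_backlog():
    """Background replication cycle draining queued writes."""
    while not async_backlog.empty():
        remote_copy.append(async_backlog.get())

write_sync("order-1")
write_async("order-2")   # at risk until the next replication cycle runs
ship_backlog()
print(remote_copy)       # ['order-1', 'order-2']
```

Also note what neither mode gives you: if an application writes garbage, both mirrors faithfully replicate the garbage, which is why journaling or point-in-time copies still matter for logical corruption.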
I recently ran across a whitepaper stating that up to 90% of downtime is planned. Anyone who has pulled an all-nighter after a failed firmware upgrade, or spent a weekend or holiday doing a hardware refresh, understands that a planned outage can be more intrusive than an unplanned one. A properly configured BC or HA environment not only protects you from unplanned outages, it can greatly reduce the time and risk of planned outages or eliminate the need for an outage altogether. Don’t overlook planned downtime when creating a business case for a DR, BC or HA environment – it can be a crucial selling point.
Obviously, I can’t address all of the details and implications in such a short space. Luckily, there is a ton of information on the net, and numerous companies consult with organizations to implement architectures that meet their availability requirements.
Key downtime considerations and action items:
- Know the cost to your business of both planned and unplanned downtime
- Get organizational input and buy-in
- Categorize servers, databases and applications by importance, weighing both tangible and intangible costs
- Not every server/DB/App needs to be configured for HA
- Do your research. The good news is there are tons of options, and the bad news is there are tons of options. The market continues to drive new innovations.
- Test and document procedures. Things change, so make sure you periodically re-test and adjust documentation accordingly. Testing has a huge impact on your RTO.
Keep an open mind to all the possibilities to solve your particular needs. One size does not fit all.