How much time does the average data center spend swapping hot-swappable drives to maintain 99.999% or better uptime?
How much does this cost in manpower? How much time is spent waiting for data to re-stripe across these drives before they are usable in an array? And why are modern data centers so complacent about this practice?
These are the questions I ask many potential customers in early discussions to see if X-IO self-healing storage arrays are a good fit for their environments. Usually this discussion starts when I tell my audience that X-IO's Intelligent Storage Element (ISE) DataPacs are engineered to dramatically limit the environmental degradation that would otherwise shorten drive MTTF (mean time to failure) for HDDs and SSDs, and that our Matrixed RAID (redundant array of independent disks) system further extends drive life. Given our patented architectures, we can deliver 99.999% data availability for five years with a sealed drive array. For emphasis, I usually add, "'sealed' means no hot-swappable drives." It's at this point that one of two reactions occurs: either the listener sits forward, eager to learn how this is done, or I get a look that essentially says I've lost my mind.
If I can explain the above in more detail, hopefully you’ll agree that my mind is mostly intact.
The crucial metric that determines the reliability of a drive array is the MTTF or MTBF (mean time between failures) of each individual drive in that array.
These terms are used somewhat interchangeably, and both are based on a theoretical lifetime for the modern SAS drive of 300,000 to 1.2 million hours. I actually prefer Seagate's redefinition of this metric as the Annualized Failure Rate (AFR). Whichever definition you use, the idea is simple: in an array, drives fail at a predictable rate. Seagate's AFR estimates fall between 0.5% and 1%. So let's consider a modern array.
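The link between the two metrics is worth making explicit. Under the standard constant-failure-rate simplification (an assumption on my part, not something the vendors spell out in this discussion), AFR follows directly from MTBF:

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days x 24 hours

def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annualized failure rate, assuming an exponential (constant-rate)
    failure model -- the usual simplification in reliability math."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

# A 1.2-million-hour MTBF works out to roughly 0.73% per year,
# consistent with the 0.5%-1% AFR range cited above.
print(f"{afr_from_mtbf(1_200_000):.2%}")
```

Note that the lower end of the theoretical-lifetime range (300,000 hours) would imply an AFR near 3%, which is why field-measured AFR figures are the more useful planning number.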
An array of 1,024 × 1.2TB drives delivers a 1PB RAID 5 array composed of 5-disk parity groups. In modern arrays, additional capacity of roughly 20% is common to ensure high availability by provisioning hot spares, so with parity and spare overhead, the modern 1PB array will have 1,474 drives in this example. With an AFR of 0.75%, 11 of these drives will fail each year. Assuming a five-year life, 55 drives will be replaced in this 1PB array. In a 2PB array, 110 drives will require replacement over five years; in a 3PB array, 165. Smaller drives mean more replacements. Larger drives mean longer rebuild times. Why five years? Because X-IO offers a five-year hardware warranty on the sealed ISE units.
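The arithmetic above can be sketched in a few lines of Python. The compounding of two 20% overheads (parity, then hot spares) is my reading of how 1,024 data drives become roughly 1,474 total; the 0.75% AFR and five-year term come straight from the discussion:

```python
# Back-of-envelope failure arithmetic. The two stacked 20% overheads are
# an assumption about how the ~1,474-drive figure was derived; the AFR
# and warranty period are from the text.

DATA_DRIVES_PER_PB = 1024   # 1,024 x 1.2TB drives per usable PB
PARITY_OVERHEAD = 0.20      # RAID 5 parity groups
SPARE_OVERHEAD = 0.20       # provisioned hot spares
AFR = 0.0075                # within Seagate's 0.5%-1% range
WARRANTY_YEARS = 5          # X-IO's hardware warranty period

def replacements(petabytes: int) -> tuple[int, int]:
    """Return (total drives in the array, drives replaced over the warranty)."""
    total = int(DATA_DRIVES_PER_PB * petabytes
                * (1 + PARITY_OVERHEAD) * (1 + SPARE_OVERHEAD))
    per_year = round(total * AFR)
    return total, per_year * WARRANTY_YEARS

for pb in (1, 2, 3):
    total, swaps = replacements(pb)
    print(f"{pb}PB: {total} drives, ~{swaps} replacements in {WARRANTY_YEARS} years")
```

Running this reproduces the figures in the text: 55, 110, and 165 replacements for the 1PB, 2PB, and 3PB arrays respectively.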
So what’s the manpower cost to replace these drives over a five-year period? These data are less available, mostly because modern data centers accept the burden of maintaining hot-swap arrays as a sunk cost. Conventional thinking holds that this expense cannot be altered, i.e., it’s not a variable expense, so just pay it and move on. Consider the estimates below. Your experience may vary, but this is not an unrealistic accounting:
A storage admin in a medium-sized IT organization can have a fully burdened cost to the organization of $75/hour. Each drive replacement, absent complications, costs about $45 (roughly 36 minutes of admin time). Over five years, that 1PB array of hot-swap drives costs the IT department $2,475; a 2PB array costs $4,950; that 3PB array, $7,425. If your IT organization has challenging logistics, e.g. rigorous Change Control, Good Manufacturing Practice (GMP) requirements, or physical access issues, it’s not unreasonable to estimate the five-year cost of replacing drives in a 1PB array at between $5K and $10K. I’m familiar with a large telecommunications data center that has a dedicated tech wheeling a cart around several days per week for an entire day of disk replacements each time. This is a single data center, and this company has five data centers in the US alone.
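Using the per-petabyte replacement counts derived earlier, the labor math is a one-liner. The $75/hour burdened rate and $45 per uncomplicated swap are the estimates from the text; everything else follows:

```python
# Labor-cost arithmetic from the text: a $75/hour fully burdened admin
# and about $45 per uncomplicated swap (i.e., ~36 minutes of admin time).

HOURLY_RATE = 75.0          # fully burdened cost, $/hour
COST_PER_SWAP = 45.0        # absent complications
SWAPS_PER_PB_5YR = 55       # from the AFR estimate above

def five_year_labor_cost(petabytes: int) -> float:
    return petabytes * SWAPS_PER_PB_5YR * COST_PER_SWAP

for pb in (1, 2, 3):
    print(f"{pb}PB array: ${five_year_labor_cost(pb):,.0f} over five years")
# 1PB -> $2,475; 2PB -> $4,950; 3PB -> $7,425
```

These figures are the best case; the complications discussed next only push them higher.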
Not included in this cost estimate is the time spent waiting for and monitoring the re-striping operation on the replaced drive. A quick Google search will show that this is a fraught endeavor, with many users reporting errors that then require more time and IT admin man-hours to sort out. There is also the outlier case of data pins bent at the backplane when a replacement drive is inserted. Admittedly, this is rare, but not unknown, and the consequences of a backplane replacement are nightmarish. A more common occurrence is an operator pulling the wrong drive, destroying data availability (recall there is already one failed drive in the RAID group); the healthy drive is slammed back in, and the actual failed drive is then pulled.
Modern automobiles are increasingly sealed systems: serviced far less frequently, serviced by highly trained experts, and routinely lasting 200,000 miles or more. Why are modern storage array manufacturers still relying on the hot-swap drive as the core of their availability strategy? Find out in Part 2: How Much Are You Spending to Swap Drives?