Foreword: The question "If a tree falls in the forest and no one is around to hear it, does it make a sound?" is often used in technical circles to make the point that things happen, sometimes without consequence. That said, not every "silent" case goes without consequences. For a storage device, data errors generally have a significant impact, and that impact is generally not positive. Is there anything worse than a data error? Yes, there is. For a storage device, the "silent" case is the worst case. The impact of a "silent data error" can be far worse than that of an error which was detected. A silent data error may go unnoticed, but if you compare it to the falling tree, it could be a tree that falls on your house! BAM!
Just as most people insure their houses to protect against risks like falling trees, you should insure your data's integrity. You can start by asking your storage device manufacturer how their design helps protect against silent errors, and by asking for data that proves it. Does your SSD manufacturer blast their SSDs with radiation and tell you how they perform? Intel does.
In this blog post, Intel Fellow and Director of Reliability Methods, Neal R. Mielke, explores the concept of "silent data errors", details how Intel SSD® Data Center Family products are designed to minimize the likelihood of silent errors, and provides data comparing Intel SSD silent error performance to several competing SSDs.
You probably don’t think much about supernovas, but what if one drained your bank account? Part of the job of designing an enterprise-class Solid State Drive (SSD) is to make sure that things like that don’t happen. It’s easy to claim to have done the design right, but harder to prove it.
Drives can have two types of data errors. The sinister type that threatens your bank account is known as the silent data error. A silent data error occurs when a drive sends bad data to the host, without telling the host that there’s been an error. The host will likely just use the bad data – garbage in, garbage out. It’s easy to imagine nightmare scenarios, like incorrect billings, the wrong product being shipped, a flight directed to the wrong city, or your bank balance dropping from $10,005 to $5. ‘Data integrity’ simply means reducing the occurrence of silent data errors to a very low probability. The other type is the detected data error (often called an uncorrectable data error), which occurs when a drive can’t retrieve a particular sector (maybe a Hard Disk Drive (HDD) has a scratch on the disk) and instead sends an error code to the host.
Enterprise systems tolerate detected errors but not silent ones. That’s because a system that’s told about an error can correct it through RAID or by switching to a duplicate copy of the data, but with a silent error it won’t know to try. So silent errors have that nightmare-scenario risk. Drive datasheets specify detected error rates in the range of 1 error per 1E16 or 1E17 bits read, but it’s common for data center customers to demand silent error rates that are several orders of magnitude lower. That’s far below the threshold for measurability in normal testing, so in practice it means zero tolerance. This zero-tolerance mindset usually gets translated into action by requiring certain design features and theoretical calculations (more on these later). Those are good things, but we all know that design choices and theoretical calculations don’t always work out; design and theory said that the Titanic was unsinkable. So, in practice zero tolerance for silent errors sometimes means checking off certain design features and then hoping for the best. At Intel we’ve tried to take an extra step, building a zero-tolerance mindset not only into our design choices but also into how we test our drives.
Why would a drive return incorrect data? Some people incorrectly assume that silent errors occur simply when there are too many bit flips in the storage medium (say, from a scratch on an HDD). But strong Error Detection and Correction Codes (EDAC), the kind that Intel uses, can do a great job of detecting errors in the media even in the rare event that they can’t correct the data. This EDAC strength is one of those things that can be mathematically calculated. So, with careful design, media bit flips that can’t be corrected will become detected errors, not silent ones. The main silent error risk is from bit flips elsewhere: in the drive’s controller (primarily its internal SRAMs and flip flops), or in the DRAM the controller uses as external memory. The controller manages the transfer of data between host and media, so a bit flip that causes the controller to do the wrong thing could result in a silent error.
This is where supernovas come in. These exploding stars created many of the cosmic rays that rain down on us and also the radioactive elements like uranium and thorium in our soil, trace amounts of which end up in the controller and DRAM. This means that all integrated circuits are exposed to low levels of ionizing radiation. And ionizing radiation can cause a bit to flip, an event known as a soft error. Soft errors are rare, but they do occur, and they can’t be eliminated. The unavoidability of soft errors is the reason why enterprise servers rely on ECC-protected DRAM. Enterprise drives need to be designed to handle soft errors, too.
Handling soft errors requires many design steps. A good first step is to protect the SRAMs and DRAM with parity or EDAC. But those don’t protect flip flops in the controller’s logic circuitry, and soft errors there can also corrupt user data. One approach that Intel takes is to envelop the user data in what’s known as end-to-end data protection. When the host writes a sector, the controller appends a set of Cyclic Redundancy Check (CRC) bits (think of parity on steroids) before passing the whole bundle through the rest of the circuitry to the NAND. When the host later reads that sector, the controller checks that the CRC bits still jibe with the user data. If there’s been a bit flip, the CRC will detect it. This is another aspect that can be worked out mathematically and shown to meet ultra-low silent error rate requirements.
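The write-then-verify flow described above can be sketched in a few lines. This is only an illustration, not Intel's implementation: the 512-byte sector size, the choice of CRC-32 from Python's zlib module, and the helper names protect, verify, and flip_bit are all assumptions made for the example.

```python
import zlib

SECTOR_SIZE = 512  # bytes; assumed sector size for this sketch


def protect(user_data: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the user data as it enters the drive."""
    crc = zlib.crc32(user_data)
    return user_data + crc.to_bytes(4, "big")


def verify(bundle: bytes) -> bytes:
    """Recompute the CRC on the way out; raise instead of returning bad data."""
    user_data = bundle[:-4]
    stored = int.from_bytes(bundle[-4:], "big")
    if zlib.crc32(user_data) != stored:
        raise IOError("CRC mismatch: report a detected error to the host")
    return user_data


def flip_bit(buf: bytes, bit_index: int) -> bytes:
    """Simulate a soft error by inverting one bit of the buffer."""
    out = bytearray(buf)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


sector = bytes(range(256)) * 2          # 512 bytes of example user data
bundle = protect(sector)                # what travels through the controller
assert verify(bundle) == sector         # a clean read passes the check
try:
    verify(flip_bit(bundle, 1000))      # one flipped bit anywhere in the bundle
except IOError as err:
    print("bit flip became a detected error:", err)
```

The point of the sketch is the outcome, not the particular code: a flipped bit turns into an error the host is told about, rather than bad data the host silently consumes.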
This end-to-end approach might seem sufficient, but it isn’t. It’s possible for a controller that’s been confused by a bit flip to return the correct data but for the wrong sector, or out-of-date (stale) data for the correct sector. Either way, the CRC bits will jibe with the user data, so the error will be silent. Such problems have been observed in the field (see Bairavasundaram et al, “An Analysis of Data Corruption in the Storage Stack,” ACM Transactions on Storage, Vol. 4, No. 3, Nov 2008). Detecting those scenarios requires other consistency checks, which are employed in Intel enterprise drives. Unlike RAM EDAC and end-to-end protection, these consistency checks and the responses to them are more matters of firmware design than hardware design.
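The article does not say which consistency checks Intel uses. Purely as an assumed illustration of the idea, one way to catch both failure modes is to bind the check value to the sector address (LBA) and to a per-sector write counter that is tracked separately. The function names and the counter scheme below are hypothetical.

```python
import zlib


def data_crc(data: bytes) -> int:
    """CRC over the user data only (plain end-to-end protection)."""
    return zlib.crc32(data)


def bound_crc(lba: int, generation: int, data: bytes) -> int:
    """CRC over the sector address, a write counter, and the user data."""
    header = lba.to_bytes(8, "big") + generation.to_bytes(8, "big")
    return zlib.crc32(header + data)


def check_read(lba: int, expected_generation: int,
               data: bytes, stored_tag: int) -> bool:
    """True only if the data belongs to this LBA and to its latest write."""
    return bound_crc(lba, expected_generation, data) == stored_tag


old, new, other = b"A" * 512, b"B" * 512, b"C" * 512

# Sector 7 was written twice (generations 1 and 2); sector 8 holds other data.
tag_old = bound_crc(7, 1, old)
tag_new = bound_crc(7, 2, new)
tag_other = bound_crc(8, 2, other)

# A data-only CRC is self-consistent for stale or misdirected data,
# so it cannot flag either case.
assert data_crc(old) == zlib.crc32(old)
assert data_crc(other) == zlib.crc32(other)

print(check_read(7, 2, new, tag_new))      # correct, current data
print(check_read(7, 2, old, tag_old))      # stale data for the right sector
print(check_read(7, 2, other, tag_other))  # valid data from the wrong sector
```

In this sketch only the first read passes. The design choice that matters is that the expected address and write counter come from somewhere other than the returned bundle; otherwise a confused controller could supply a matching but wrong pair.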
What happens if a consistency check fails? If an Intel enterprise drive detects the error during execution of an operation like garbage collection that isn’t critical to data integrity, it may simply abort the operation. But if the error affects a data-critical operation, and the drive can’t be sure of the data integrity, our zero-tolerance design philosophy is to take any steps necessary to come as close as possible to preventing silent data errors. Sometimes that might mean returning an uncorrectable error status. Sometimes that might require having the drive freeze permanently, or “brick”. A bricked drive, of course, is a failed drive, but this aligns with Intel’s zero tolerance policy for silent errors, and we make sure that the drive’s failure rate meets its datasheet specification.
If you’re thinking that all this is just words and promises, we don’t blame you. Our attitude is “trust but verify”. We trust our engineering approach but just to be sure we verify it experimentally. Of course, like everyone else in the industry we run a lot of drives for a long time – a typical drive qualification in the industry involves continuously running 1000 drives for 1000 hours (6 weeks). But if you do the arithmetic you’ll see that the number of bits read would be on the order of only 1E18 bits. A silent error rate of 1 error per 1E19 bits read is often considered grossly unacceptable in the enterprise field, but this kind of qualification won’t even detect it. So we use two other ways to verify our engineering.
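The arithmetic can be sketched as follows. The drive count and duration come from the text; the sustained read throughput of 50 MB/s per drive is an assumption chosen for illustration (real qualification workloads mix reads and writes), and the function names are hypothetical.

```python
def bits_read(drives: int, hours: float, read_mb_per_s: float) -> float:
    """Total bits read by a population of drives running continuously."""
    seconds = hours * 3600
    return drives * seconds * read_mb_per_s * 1e6 * 8


def expected_errors(total_bits: float, bits_per_error: float) -> float:
    """Mean number of errors expected at a rate of 1 error per bits_per_error."""
    return total_bits / bits_per_error


# 1000 drives for 1000 hours; 50 MB/s average read rate is an assumption.
total = bits_read(drives=1000, hours=1000, read_mb_per_s=50)
print(f"bits read: {total:.2e}")
print(f"expected silent errors at 1 per 1E19 bits: {expected_errors(total, 1e19):.3f}")
```

Under that assumption the qualification reads about 1.4E18 bits, so a drive with a silent error rate of 1 per 1E19 bits would be expected to show well under one error in the whole run.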
The first way is firmware validation. During special tests, we use software techniques to artificially inject bit flips into the controller’s SRAMs and external DRAM and check that the result isn’t a silent error. The key here is that we can inject far more errors than would happen just by running a lot of drives for a few weeks. This approach works well for RAMs, but it’s hard to inject errors into flip flops that are deeply embedded in the logic circuitry.
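A toy version of such an injection campaign is sketched below. It is not Intel's validation tooling: the buffer stands in for a RAM holding one CRC-protected sector, and the names read_path and inject_campaign are hypothetical. Each trial flips one random bit and classifies the outcome.

```python
import random
import zlib


def read_path(ram: bytearray) -> bytes:
    """Model of the read path: return data only if its CRC still matches."""
    data = bytes(ram[:-4])
    stored = int.from_bytes(ram[-4:], "big")
    if zlib.crc32(data) != stored:
        raise IOError("detected error")
    return data


def inject_campaign(trials: int, seed: int = 0) -> dict:
    """Flip one random bit per trial and count how each trial ends."""
    rng = random.Random(seed)
    golden = bytes(rng.randrange(256) for _ in range(512))
    counts = {"detected": 0, "silent": 0, "masked": 0}
    for _ in range(trials):
        ram = bytearray(golden + zlib.crc32(golden).to_bytes(4, "big"))
        bit = rng.randrange(len(ram) * 8)
        ram[bit // 8] ^= 1 << (bit % 8)  # the injected soft error
        try:
            result = read_path(ram)
        except IOError:
            counts["detected"] += 1
            continue
        # No error was reported: wrong data is silent, right data is harmless.
        counts["silent" if result != golden else "masked"] += 1
    return counts


print(inject_campaign(10_000))
```

Because every flip lands inside the CRC-protected bundle, this toy model reports every trial as detected. A real campaign is interesting precisely where that is not guaranteed, such as flips in control structures rather than in the data itself.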
The second way is to expose SSDs to accelerated levels of radiation. This causes flips everywhere, from the RAMs to the flip flops. Radiation testing is the standard approach for soft errors in the Integrated Circuit (IC) industry because the soft error rate can be accelerated by a huge amount (think 100 million times). For SSDs, that means that in hours or days it’s possible to measure error rates that are a thousand times smaller than what a normal qualification can measure. It’s surprising to us that accelerated soft error testing of SSDs and hard drives isn’t more common; we’re aware of only one publication (from Cisco*, see http://www.flashmemorysummit.com/English/Collaterals/Proceedings/2012/20120823_S303A_Shah.pdf). We’ve tested SSDs at the Indiana University Cyclotron Facility (IUCF)* and at the US Government’s Los Alamos Nuclear Science Center (LANSCE)*.
The IUCF testing confirmed one of Cisco’s conclusions: the most obvious effect of soft errors in SSDs is that the drives freeze, or hang. Sometimes the result is a bricked drive, but sometimes the drive re-boots after being given a power cycle. In the Los Alamos testing, we sought to answer one previously-unanswered question: when drives re-boot, do they return 100% correct data, or were silent errors generated as the result of the hang?
The table below summarizes the Los Alamos results for one Intel drive and three competing models, all enterprise data center class SSDs advertised as having end-to-end data protection. All the drives hung at a certain rate when exposed to the Los Alamos radiation beam. If you divide that rate by the acceleration of the beam, you have the projected rate under normal conditions. The table shows the projected % of drives that would hang per year because of cosmic rays at sea level. Different columns show the rates for all hangs, the rates for hangs that led to bricked drives, and the rates for hangs that led to drive re-booting after a power cycle. None of the drives had silent errors during run-time before the hang. Instead, the “Silent Errors” column shows how often drives would hang, re-boot, and then return silent data errors. When a drive never bricked or never had a silent error, the main entry is zero and we show in parentheses the 90% upper confidence limit (based on the sample size tested). We tested all drives with the same random read/write workload.
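|Drive||All hangs (% of drives/year)||Hangs leading to bricked drive||Hangs leading to re-boot after power cycle||Silent Errors|
|Intel® SSD DC S3700 Series||0.029%||0.028%||0.001%||0 (<0.001%)|
|Competitor B||0.255%||0 (<0.1%)||0.255%||0.255%|
|Competitor C||0.133%||0.066%||0.066%||0 (<0.08%)|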
We were pleased that the results verified our design approach. The Intel drive had a low rate of hanging and bricking, far below the datasheet failure rate specification, despite our decision to sometimes intentionally brick a drive if we can’t be certain of data integrity. And that approach worked: the Intel drive as tested revealed no silent errors. Of course, we can’t say that the Intel drive has a silent error rate of zero – we’d need an infinite sample size (of SSDs or beam time) to measure zero. All we can say from this data set is that the rate would be less than 0.001%/year, or less than 1 per 100K drives.
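The projection and the zero-event limit can be sketched as follows. The article does not state how its 90% limits were computed; the sketch assumes the events follow a Poisson process, which is a common choice for rare-event counts. The function names are hypothetical, and the beam numbers in the usage lines are placeholders, not measured data.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days


def projected_rate(beam_events: int, beam_drive_hours: float,
                   acceleration: float) -> float:
    """Field rate in events per drive-year, scaled down from beam exposure."""
    equivalent_drive_years = beam_drive_hours * acceleration / HOURS_PER_YEAR
    return beam_events / equivalent_drive_years


def zero_event_upper_limit(equivalent_drive_years: float,
                           confidence: float = 0.90) -> float:
    """Upper limit on the rate when no events were seen (Poisson assumption)."""
    return -math.log(1 - confidence) / equivalent_drive_years


def exposure_needed(rate_limit: float, confidence: float = 0.90) -> float:
    """Equivalent drive-years of event-free exposure needed to claim a limit."""
    return -math.log(1 - confidence) / rate_limit


# Placeholder inputs: 10 events in 1 drive-hour of beam at 1e8 acceleration.
print(projected_rate(10, 1.0, 1e8))

# Exposure implied by a limit of 0.001%/year (1e-5 per drive-year) at 90%.
print(exposure_needed(1e-5))
```

Under the Poisson assumption, claiming less than 0.001%/year with 90% confidence and zero observed events corresponds to roughly 230,000 equivalent drive-years of exposure, which is why beam acceleration is needed to get there in hours or days.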
The competing models either bricked more often or had silent data errors, or both. We can’t extrapolate from these three models to the whole industry, but we can draw some conclusions. One is that no drive should be expected to be immune to soft errors; the only question is how often it’s affected and in what way. Another – a big one – is that a drive identified as having end-to-end data protection doesn’t necessarily have a low silent error rate: silent errors on 0.255% of drives per year (drive B) would be considered alarming by many in the enterprise storage business.
It’s worth reiterating that the silent errors occurred after a drive hung and then was re-booted via a power cycle. For the two models that had silent errors, the silent errors were of the stale-data type. The data returned for certain sectors had been correct at an earlier point in time. Then the host re-wrote those sectors and the drive acknowledged completion of those writes – but after the hang those sectors were found to actually contain the old data.
It’s wise for customers to consider what value they put on data integrity, especially the issue of stale data after a re-boot. In some applications, data integrity won’t matter; no one may care if a few pixels in a JPEG picture get corrupted. But in other applications the answer may be that data integrity is of critical importance. If so, it’s important to evaluate not just whether a drive claims to have end-to-end data protection, but also whether the drive has been proven to deliver real data integrity. If the wrong drive is used in an important application, the resulting data corruption could have unpredictable consequences. In some cases, they may be as unpredictable as the consequences of supernovas that exploded billions of years ago.