"The computer says 'No'," I was told as I was turned away from a tram ride at a nice ski resort that had recently upgraded its ticketing to an advanced, automated system. The system could track everything from a person's season pass status to the roster of everyone taking the base-to-peak tram ride at any given time. I later found out that everything was fine except that one of the resort's systems had gone down, which made me think: what if the computer had said 'No' midway up on my ride?
Similarly, enterprises today demand reliability in their datacenters for operations ranging from customer-facing CRM for call centers to the backend databases cranking out account settlements. Reliability and availability are essential to our perceptions of quality, yet many of us still value power and performance most when choosing servers for our datacenters.
Yet most computer circuits are susceptible to what are called soft errors. These are non-permanent data or operation errors caused by random energetic particles in the environment, such as alpha particles, cosmic rays, or thermal neutrons. Computers work with binary signals, and an energetic particle can flip a signal from '1' to '0' or from '0' to '1' in submicron circuits, resulting in errors that can sometimes be observed in calculations. While engineers strive to minimize these types of errors by adding checking and correction circuits, there is also an important feature we should look for in mission-critical servers: error prevention.
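To make the idea of a soft error concrete, here is a minimal Python sketch of a single bit flip and of a parity check, one of the simplest checking schemes. It is an illustration only: the function names `parity` and `flip_bit` and the sample value are my own assumptions, not anything taken from a real processor, and parity can only detect a single flipped bit, not correct it.

```python
def parity(word: int) -> int:
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") % 2


def flip_bit(word: int, position: int) -> int:
    """Simulate a soft error by inverting one bit (hypothetical helper)."""
    return word ^ (1 << position)


# An 8-bit value as it was originally written, plus its stored parity bit.
stored = 0b10110010
stored_parity = parity(stored)

# A stray particle flips bit 3; the data is now silently wrong.
corrupted = flip_bit(stored, 3)

# Recomputing parity on read exposes the mismatch.
error_detected = parity(corrupted) != stored_parity

print(f"stored    = {stored:08b}")
print(f"corrupted = {corrupted:08b}")
print(f"error detected: {error_detected}")
```

Detection like this still leaves the system with an error to log and recover from, which is why preventing the flip in the first place is attractive.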
An error prevented is one that never has to be detected, corrected, logged, and recovered from. A high-end server processor such as the Itanium processor 9300 series, based on the EPIC architecture, was conceived with error prevention as a design goal. It makes extensive use of soft-error hardened and resistant latches and registers (memory elements), which are 100 times and 80 times more resilient, respectively, than their non-hardened versions. In fact, over 99 percent of the latches in the system interconnect functional areas, the highways within the Itanium 9300 processor, are soft-error resilient.
There are many RAS features to consider on a mission-critical processor, such as advanced machine check architecture (MCA), physical (electrically isolated) partition handling, and Cache Safe Technology. The following whitepaper, link here, is a good start for those interested in these advanced features. Yet it's also important not to overlook the role error prevention plays in silently improving reliability and availability. Combined with a mission-critical system design and a hardened operating system, it means companies will be much less likely to encounter a catastrophic event that they cannot recover from, which equates to increased savings to their bottom line.
As a final thought, the best RAS feature is the one that you seldom notice: by preventing more soft errors, it means your computers will more often correctly say 'Yes'.
Till next time!