How Intel’s diagnostic tool helps datacenter IT administrators manage service quality and uptime
Availability is the single greatest asset in datacenter operations. From the board and C-suite to IT operations teams, availability is job one for companies built on private or hybrid cloud. CPUs are the foundation of the computing stack. Rigorous metrics and processes are in place to ensure the highest quality in semiconductor manufacturing, yet the reality is that a small number of CPUs may still go rogue in the field. Google researchers dubbed these “mercurial cores,” which sounds like a binge-worthy Netflix series but in reality is a debug nightmare without the right tools and screening processes.
Analytic tools have helped manage system health, yet they have traditionally been built to monitor faults in software and storage. In technology products, including semiconductor electronics, faults can be driven by multiple factors, including early life failures, random defects in manufacturing, test coverage gaps, logic issues, wear-out, and even cosmic radiation. Cosmic rays from outside the solar system have been found to affect technology by causing a “bit flip,” in which a “0” in a program’s binary code becomes a “1,” or vice versa (Business Insider).
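To make the “bit flip” concrete, here is a minimal Python sketch of what a single inverted bit does to a stored value. The function name `flip_bit` and the sample byte are illustrative assumptions, not part of any Intel tool; the sketch only simulates the effect in software.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with the bit at position `bit` inverted (0 -> 1 or 1 -> 0)."""
    return value ^ (1 << bit)


# Hypothetical example: one byte holding the ASCII code for "A".
original = 0b01000001
corrupted = flip_bit(original, 1)  # invert bit 1, the second-lowest bit

print(f"{original:08b} -> {corrupted:08b}")
print(f"{chr(original)!r} -> {chr(corrupted)!r}")
```

A single flipped bit is enough to turn one valid value into a different valid value, which is why this kind of fault can pass unnoticed unless the result is checked against a known-good answer.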
CPU Faults and Data Corruption
The increasing complexity of modern IT infrastructure requires datacenter systems to implement effective predictive analytics software. Without an effective diagnostic tool, organizations are prone to unplanned downtime and data corruption stemming from system failures. CPUs, especially those with high fault tolerance, are commonly overlooked as the root cause of data corruption and system failure. While rare, silent data corruption (SDC) and silent data errors (SDE) in CPUs, as noted by Facebook and Google, are especially difficult to resolve. IDC notes that IT administrators need help in “proactively addressing potential service quality and uptime issues.”
Maintain a Proactive Posture
Organizations need to be proactive rather than reactive, because debugging processor corruption after the fact can take months. To avoid this, enterprises need a system that warns of potential CPU errors, failures, and data corruption. A tool that performs periodic testing is ideal for managing system health: each test gives the enterprise a chance to replace flawed equipment and prevent future failures. Datacenter managers understand and expect system faults to occur on occasion, which is why they build flexibility and redundancy into their fleets and plan for periodic maintenance to ensure data integrity and high availability. Intel understands that each customer’s workload is unique, so we have created multiple options for running the diagnostic tool to fit into a predictive maintenance plan.
Simplify the Diagnostic Process with Intel
That's why we designed the Intel® Data Center Diagnostic Tool (DCDIAG) to pinpoint a faulty CPU in less than an hour. The tool also provides a background test mode, which runs tests for only one second per hour with minimal impact on system performance. Diagnostic tools have been used in the client space for a long time, but for the first time, datacenters will also have access to these dynamic tools. The Intel® Data Center Diagnostic Tool performs tests similar to those run in hyperscale cloud providers’ datacenters. IT administrators who use the Intel® Data Center Diagnostic Tool as part of a regular system maintenance program can proactively identify potential problems and address service quality and uptime issues before they occur.
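For a rough sense of scale, the arithmetic behind “one second per hour” can be worked out as follows. This is a back-of-the-envelope sketch that assumes exactly one second of testing in every wall-clock hour, all year round. The variable names are illustrative, and it does not measure the tool’s actual CPU or performance overhead.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365

# Figure quoted in the text; everything below is derived from it.
test_seconds_per_hour = 1

# Fraction of wall-clock time spent in background testing.
duty_cycle = test_seconds_per_hour / SECONDS_PER_HOUR

# Accumulated test time per day (seconds) and per year (hours).
seconds_per_day = test_seconds_per_hour * HOURS_PER_DAY
hours_per_year = seconds_per_day * DAYS_PER_YEAR / SECONDS_PER_HOUR

print(f"share of time spent testing: {duty_cycle:.4%}")
print(f"test time per day: {seconds_per_day} s")
print(f"test time per year: {hours_per_year:.2f} h")
```

Under these assumptions, background testing occupies well under a tenth of a percent of wall-clock time while still adding up to a couple of hours of test time per system per year.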
IDC acknowledged Intel® Data Center Diagnostic Tool’s potential positive impact on enterprise datacenters in the IDC White Paper, sponsored by Intel, How a Diagnostic Tool Can Maintain Service Quality and Uptime by Discovering Potential System Failures Before They Happen. Authors Lucas Mearian and Ashish Nadkarni explain, “Along with pretesting equipment, diagnostic tools should be used to continue to monitor the health of systems as they mature over time . . . Intel can address a massive market made up of IT administrators who are increasingly facing the ever-growing difficulty of proactively addressing potential service quality and uptime issues because of processor malfunctions.”
Currently, the Intel® Data Center Diagnostic Tool is a Linux application that supports 1st, 2nd, and 3rd Generation Intel® Xeon® Scalable Processors (formerly code-named Skylake, Cascade Lake, and Ice Lake). In the future, Intel plans to release a Windows version of the diagnostic tool. Learn more about how this new tool advances our work to elevate quality in an increasingly complex, heterogeneous, and disaggregated world.