How Intel Strengthens System Health

How Intel’s diagnostic tool helps datacenter IT administrators manage service quality and uptime

Availability is the single greatest asset in datacenter operations. From the board and C-suite to IT operations teams, availability is job 1 for companies built on private or hybrid cloud. CPUs are the foundation of the computing stack. Rigorous metrics and processes are in place to ensure the highest quality in semiconductor manufacturing, yet the reality is that a small number of CPUs may still go rogue in the field. Google researchers dubbed these “mercurial cores,” a name that sounds like a binge-worthy Netflix series, but in reality they are a debug nightmare without the right tools and screening processes.

The development of analytic tools has helped manage system health, yet these tools have been traditionally developed for monitoring faults in software and storage. In technology products, including semiconductor electronics, faults can be driven by multiple factors, including early life failures, random defects in manufacturing, test coverage gaps, logic issues, wear-out, and even cosmic radiation. Cosmic rays from outside the solar system have been found to affect technology by creating a “bit flip” so that a “0” in a program’s binary code becomes a “1,” and vice versa (Business Insider).
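To make the “bit flip” concrete, here is a minimal Python sketch. It is an illustration only, not part of any Intel tool, and the helper name flip_bit is hypothetical. It inverts a single bit of a stored value and shows that nothing fails visibly; the value is simply wrong until it is compared against a known-good copy.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with the bit at position `bit` inverted (0 -> 1 or 1 -> 0)."""
    return value ^ (1 << bit)


original = 0b01000001               # 65, the ASCII code for "A"
corrupted = flip_bit(original, 1)   # bit 1 goes from 0 to 1: 0b01000011 = 67 = "C"

print(chr(original), "->", chr(corrupted))  # prints: A -> C

# No exception is raised by the flip itself; only a comparison against a
# known-good value reveals that the data changed.
print("match" if original == corrupted else "mismatch")  # prints: mismatch
```

Flipping the same bit a second time restores the original value, which is why a single flipped bit can be so hard to spot without an explicit check.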

CPU Faults and Data Corruption

The increasing complexity of modern IT infrastructure requires datacenter systems to implement effective predictive analytics software. Without an effective diagnostic tool, organizations are prone to unplanned downtime and data corruption stemming from system failures. CPUs, especially those with high fault tolerance, are commonly overlooked as the root cause of data corruption and system failure. While rare, silent data corruption or errors (SDC, SDE) in CPUs, as noted by Facebook and Google, are especially difficult to resolve. IDC notes that IT administrators need help in “proactively addressing potential service quality and uptime issues.”

Maintain a Proactive Posture

Organizations need to be proactive rather than reactive; trying to debug processor corruption can take months. To avoid this, enterprises need a system that warns of potential CPU errors, failures, and data corruption. A tool that performs periodic testing is ideal for managing system health. Each time a test is performed, the enterprise has a chance to replace flawed equipment and prevent future failures. Datacenter managers understand and expect system faults to occur on occasion, which is why they build flexibility and redundancy into their fleets and plan for periodic maintenance to ensure data integrity and high availability. Intel understands each customer’s workload is unique, so we have created multiple options for running the diagnostic tool to fit each customer’s predictive maintenance plan.

Simplify the Diagnostic Process with Intel

That's why we designed the Intel® Data Center Diagnostic Tool (DCDIAG) to pinpoint a faulty CPU in less than an hour. The tool also provides a background test mode, which runs tests for only one second per hour with minimal impact on system performance. Diagnostic tools have been used in the client space for a long time, but for the first time, datacenters will also have access to these dynamic tools. The Intel® Data Center Diagnostic Tool performs tests similar to those run in hyperscale cloud providers’ datacenters. IT administrators who use the Intel® Data Center Diagnostic Tool as part of a regular system maintenance program can proactively identify potential problems, addressing service quality and uptime issues before they occur.
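For a sense of scale, the one-second-per-hour figure can be turned into a duty cycle. The Python arithmetic below is only a back-of-the-envelope sketch. It assumes the test fully occupies the CPU for that one second, which the article does not state, and it says nothing about the tool's actual scheduling.

```python
SECONDS_PER_HOUR = 3600
test_seconds_per_hour = 1  # figure quoted in the article for background mode

# Fraction of wall-clock time spent testing (assumes the test is the only
# work running during its one-second window).
duty_cycle = test_seconds_per_hour / SECONDS_PER_HOUR
seconds_per_day = test_seconds_per_hour * 24

print(f"{duty_cycle:.4%} of each hour")      # prints: 0.0278% of each hour
print(f"{seconds_per_day} seconds per day")  # prints: 24 seconds per day
```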

IDC acknowledged Intel® Data Center Diagnostic Tool’s potential positive impact on enterprise datacenters in the IDC White Paper, sponsored by Intel, How a Diagnostic Tool Can Maintain Service Quality and Uptime by Discovering Potential System Failures Before They Happen. Authors Lucas Mearian and Ashish Nadkarni explain, “Along with pretesting equipment, diagnostic tools should be used to continue to monitor the health of systems as they mature over time . . . Intel can address a massive market made up of IT administrators who are increasingly facing the ever-growing difficulty of proactively addressing potential service quality and uptime issues because of processor malfunctions.”

Currently, the Intel® Data Center Diagnostic Tool is a Linux application that supports 1st, 2nd, and 3rd Generation Intel® Xeon® Scalable Processors (formerly code-named Skylake, Cascade Lake, and Ice Lake). In the future, Intel plans to release a Windows version of the diagnostic tool. Learn more about how this new tool advances our work to elevate quality in an increasingly complex, heterogeneous, and disaggregated world.

Rebecca Weekly

About Rebecca Weekly

Hyperscale Strategy and Execution, Intel Corporation Vice President, General Manager, and Senior Principal Engineer. Rebecca leads the organization that influences every aspect of Intel’s cloud platform solutions. Together they shape Intel’s development, production, and business strategy for Hyperscale Cloud Service Providers by driving strategic collaborations with key partners to ensure platform requirements meet customer needs. Rebecca is the Open Compute Project chair and president of the board and is on Fortune’s 40 Under 40 list of most influential people in technology. In her "spare" time, she is the lead singer of the funk and soul band Sinister Dexter and pursues her passion for dance and choreography. She has two amazing little boys and loves to run (after them, and on her own). Rebecca graduated from MIT with a degree in Computer Science and Electrical Engineering.