Article co-written by Monika Sane.
Telemetry is an umbrella term for the tools, utilities, and protocols used to remotely extract and decode information for debugging potential issues with Intel® SSDs. Telemetry works over industry-standard protocols and eliminates or minimizes the need to remove SSDs from customer systems to retrieve debug logs. It thus enables host tools, Intel technical sales specialists (TSSs), Intel application engineers (AEs), and Intel engineering teams to better identify and debug performance excursions, exception events, and critical failures in Intel® SSDs without sending the physical drive to Intel for failure analysis.
This capability is designed in accordance with the NVMe* 1.3 telemetry specification and the corresponding ACS-4 SATA definitions, both of which are common industry standards, and is expected to accelerate debugging of external and internal bug sightings pertaining to Intel® SSDs. The key difference between NVMe and SATA is that SATA drives have no controller-initiated telemetry capability.
Purpose of Change
Today’s method uses vendor-specific commands that are not accessible in customer firmware and are not available on (Intel) locked SSDs. Thus far, we have asked customers to pull the basic internal logs using Intel external tools. If we need more information from the drive, we ask the customer to ship us the drive so that we can use an internal tool called triage. (Basically, the customer needs to remove the drive and either send it back or take it to a special test PC that has access to the Intel® database for unlocking, and then extract the data.) This process can cost time and be detrimental to the debug schedule.
As an improvement over the traditional approach, the NVMe 1.3 telemetry commands give customers a standard method to collect and parse data from a running system, and that data enables debugging of performance issues. Such issues may not be reproducible in Intel internal testing or on instrumented drive setups. Telemetry also creates a standard method for customers to gather the essential data that Intel engineers need to identify and debug critical failures, again without sending the physical drive to Intel for failure analysis. Customers also have the flexibility to develop standards-based tools for analyzing the gathered data. Hence we call it “enhancing customer triage”.
The NVMe 1.3 specification defines two new log pages: 1) the host-initiated telemetry log (log page identifier 0x07) and 2) the controller-initiated telemetry log (log page identifier 0x08). Intel has decided to use the NVMe 1.3 telemetry specification across all of our products (including SATA products). The specification also defines that the data returned for each log page contains up to three consecutive data areas. Intel defines which internal data is packaged into the telemetry data sets.
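To make the log layout concrete, here is a minimal Python sketch that reads the 512-byte header of a telemetry log and works out where each of the three data areas sits. The byte offsets used (log identifier at byte 0, IEEE OUI at bytes 5 to 7, three little-endian "last block" fields at bytes 8 to 13, 512-byte data blocks) reflect my reading of the NVMe 1.3 log page layout and should be checked against the specification. The names `parse_header` and `data_area_ranges` are hypothetical, and the payload inside each area is Intel-defined and is not decoded here.

```python
import struct
from typing import List, NamedTuple, Optional, Tuple

BLOCK_SIZE = 512  # assumed size of the header and of each telemetry data block


class TelemetryHeader(NamedTuple):
    log_id: int                       # 0x07 host-initiated, 0x08 controller-initiated
    ieee_oui: bytes                   # 3-byte vendor identifier
    last_block: Tuple[int, int, int]  # last block number of data areas 1, 2, 3


def parse_header(buf: bytes) -> TelemetryHeader:
    """Parse the first 512 bytes of a telemetry log (offsets are assumptions)."""
    if len(buf) < BLOCK_SIZE:
        raise ValueError("telemetry header needs at least 512 bytes")
    log_id = buf[0]
    if log_id not in (0x07, 0x08):
        raise ValueError(f"unexpected log identifier 0x{log_id:02x}")
    ieee_oui = bytes(buf[5:8])
    area1, area2, area3 = struct.unpack_from("<HHH", buf, 8)
    return TelemetryHeader(log_id, ieee_oui, (area1, area2, area3))


def data_area_ranges(hdr: TelemetryHeader) -> List[Optional[Tuple[int, int]]]:
    """Return (start, end) byte offsets per data area, or None if the area is empty.

    Block 0 is the header, so data block n starts at byte n * BLOCK_SIZE.
    Each area begins right after the previous area's last block.
    """
    ranges: List[Optional[Tuple[int, int]]] = []
    prev_last = 0
    for last in hdr.last_block:
        if last > prev_last:
            ranges.append(((prev_last + 1) * BLOCK_SIZE, (last + 1) * BLOCK_SIZE))
            prev_last = last
        else:
            ranges.append(None)  # area holds no blocks
    return ranges
```

For example, a header whose last-block fields are 2, 2, and 5 describes a data area 1 that covers bytes 512 through 1535, an empty data area 2, and a data area 3 that covers bytes 1536 through 3071.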
Telemetry basically includes three steps:
- Pulling/collecting the information.
- Parsing the information.
- Analyzing the information.
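As a hedged illustration of the first step, the sketch below builds a command line for the open-source nvme-cli utility on Linux, whose `telemetry-log` subcommand saves a telemetry log to a file. Using nvme-cli here is my assumption and is not the Intel tool chain this article describes. The helper names, device path, and output file are hypothetical, and the flags should be confirmed against the installed nvme-cli version. Parsing and analysis depend on Intel-defined data formats and are left out.

```python
import subprocess
from typing import List


def build_pull_command(device: str, out_file: str, data_area: int = 3,
                       controller_initiated: bool = False) -> List[str]:
    """Build an nvme-cli command that saves a telemetry log to out_file.

    data_area selects how many of the three data areas to retrieve (1, 2, or 3).
    """
    if data_area not in (1, 2, 3):
        raise ValueError("data_area must be 1, 2, or 3")
    cmd = ["nvme", "telemetry-log", device,
           f"--output-file={out_file}", f"--data-area={data_area}"]
    if controller_initiated:
        # Read the log the controller generated on its own (NVMe only).
        cmd.append("--controller-init")
    else:
        # Ask the controller to capture fresh host-initiated data.
        cmd.append("--host-generate=1")
    return cmd


def pull_telemetry(device: str, out_file: str, **kwargs) -> None:
    """Run the command; raises CalledProcessError if nvme-cli reports failure."""
    subprocess.run(build_pull_command(device, out_file, **kwargs), check=True)
```

Calling `pull_telemetry("/dev/nvme0", "telemetry.bin")` would typically need root privileges and an NVMe drive that supports telemetry. The resulting file can then be handed to whatever parsing and analysis tools are available.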
For current products, Intel is developing an end-to-end tool chain to retrieve logs in customer environments (with various operating systems such as Linux, Microsoft Windows, etc.). Intel will ensure completeness of the logs (assert dumps, nLog, side trace, etc.) in order to debug sightings effectively and completely without needing access to the drive (unless, of course, the drive is dead). Intel would also like to use this method to debug internally generated sightings, such as those from validation (both continuous validation and product validation).
In my next article, we'll look at three primary use case implementations of telemetry standards and tools for Intel® SSDs.