Telemetry: Enhancing Customer Triage of Intel® SSDs

Article co-written by Monika Sane.

Telemetry refers to an umbrella of tools, utilities, and protocols used to remotely extract and decode information for debugging potential issues with Intel® SSDs. Telemetry works over industry-standard protocols and eliminates or minimizes the need to remove SSDs from customer systems to retrieve debug logs. It thus enables host tools, Intel technical sales specialists (TSSs), Intel application engineers (AEs), and Intel engineering teams to better identify and debug performance excursions, exception events, and critical failures in Intel® SSDs without sending the physical drive to Intel for failure analysis.

This capability is designed in accordance with the NVMe* 1.3 telemetry specification and the corresponding ACS-4 SATA definitions, both of which are common industry standards, and is expected to accelerate debugging of external and internal bug sightings pertaining to Intel® SSDs. The key difference between NVMe and SATA is that SATA drives have no controller-initiated capability.

Purpose of Change

Today’s method uses vendor-specific commands that are not accessible in customer firmware and are not available on (Intel) locked SSDs. Thus far, we have asked customers to pull the basic internal logs with Intel external tools. If we need more information from the drive, we ask the customer to ship us the drive so that we can use an internal tool called triage. (Basically, the customer needs to remove the drive and either send it back or take it to a special test PC that has access to the Intel® database for unlocking, and then extract the data.) This process can cost time and be detrimental to the debug schedule.

As an improvement over the traditional approach, the NVMe 1.3 telemetry commands give customers a standard method to collect data from a running system, data that enables debug of performance issues. Such issues may not be reproducible in Intel internal testing or on instrumented drive setups. Telemetry also creates a standard method for customers to gather the essential data that Intel engineers need to identify and debug critical failures, again without sending the physical drive to Intel for failure analysis. Customers also have the flexibility to develop standards-based tools for analyzing the gathered data. Hence we call it “enhancing customer triage”.

Telemetry Method

The NVMe 1.3 specification defines two new log pages: 1) the host-initiated telemetry log (log page identifier 0x07) and 2) the controller-initiated telemetry log (log page identifier 0x08). Intel has decided to use the NVMe 1.3 telemetry specification across all of our products (including SATA products). The NVMe 1.3 telemetry specification also defines that the data returned for each log page contains up to three consecutive data areas. Intel defines what internal data needs to be packaged into the telemetry data sets.
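To make the log page layout more concrete, here is a minimal Python sketch that reads the first 512-byte header block of a telemetry log and works out which blocks belong to each of the three data areas. The byte offsets (log identifier at byte 0, IEEE OUI at bytes 5 to 7, the three "last block" fields at bytes 8 to 13) reflect my reading of the NVMe 1.3 header layout and should be checked against the specification. The names `TelemetryHeader`, `parse_header`, and `area_ranges` are made up for illustration and are not part of any Intel tool.

```python
import struct
from dataclasses import dataclass
from typing import List, Optional, Tuple

BLOCK_SIZE = 512             # telemetry data is addressed in 512-byte blocks
HOST_INITIATED = 0x07        # log page identifiers named in the article
CONTROLLER_INITIATED = 0x08


@dataclass
class TelemetryHeader:
    log_id: int
    ieee_oui: int
    area_last_block: Tuple[int, int, int]  # last block of data areas 1, 2, 3

    def area_ranges(self) -> List[Optional[Tuple[int, int]]]:
        """Return (first_block, last_block) for each data area, or None if empty.

        Block 0 is the header itself, so data area 1 starts at block 1 and
        each later area starts right after the previous area's last block.
        """
        ranges: List[Optional[Tuple[int, int]]] = []
        prev_last = 0
        for last in self.area_last_block:
            if last > prev_last:
                ranges.append((prev_last + 1, last))
                prev_last = last
            else:
                ranges.append(None)  # area carries no data
        return ranges


def parse_header(buf: bytes) -> TelemetryHeader:
    """Decode the header block (assumed offsets, see the note above)."""
    if len(buf) < BLOCK_SIZE:
        raise ValueError("telemetry header must be at least 512 bytes")
    log_id = buf[0]
    if log_id not in (HOST_INITIATED, CONTROLLER_INITIATED):
        raise ValueError(f"not a telemetry log page: 0x{log_id:02x}")
    ieee_oui = int.from_bytes(buf[5:8], "little")
    a1, a2, a3 = struct.unpack_from("<HHH", buf, 8)  # three 16-bit fields
    return TelemetryHeader(log_id, ieee_oui, (a1, a2, a3))
```

For a header whose three last-block fields are 2, 2, and 5, `area_ranges()` reports data area 1 as blocks 1 to 2, data area 2 as empty, and data area 3 as blocks 3 to 5. What the blocks inside each area actually mean is defined by Intel and is not decoded here.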

Telemetry basically includes three steps:

  1. Pulling/collecting the information.
  2. Parsing the information.
  3. Analyzing the information.
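The three steps above can be sketched as a tiny pipeline. This is only an illustration of the flow, not Intel's tool chain: the function names (`collect`, `parse`, `analyze`, `triage`, `from_file`) are hypothetical, the "analysis" is a trivial stand-in that counts non-empty blocks, and the only format assumption is that the log arrives as a whole number of 512-byte blocks with the header in block 0.

```python
from pathlib import Path
from typing import Callable, Dict, List

BLOCK_SIZE = 512  # assumed block granularity of the telemetry log


def from_file(path: str) -> Callable[[], bytes]:
    """Build a source that reads a previously saved raw log dump."""
    return lambda: Path(path).read_bytes()


def collect(source: Callable[[], bytes]) -> bytes:
    """Step 1: pull the raw log from whatever `source` provides."""
    raw = source()
    if len(raw) == 0 or len(raw) % BLOCK_SIZE != 0:
        raise ValueError("log must be a non-empty multiple of 512 bytes")
    return raw


def parse(raw: bytes) -> List[bytes]:
    """Step 2: split the raw log into blocks (block 0 is the header)."""
    return [raw[i:i + BLOCK_SIZE] for i in range(0, len(raw), BLOCK_SIZE)]


def analyze(blocks: List[bytes]) -> Dict[str, int]:
    """Step 3: placeholder analysis; real decoding is vendor-defined."""
    data = blocks[1:]
    return {
        "log_id": blocks[0][0],
        "data_blocks": len(data),
        "non_empty_blocks": sum(1 for block in data if any(block)),
    }


def triage(source: Callable[[], bytes]) -> Dict[str, int]:
    """Run collect, parse, and analyze in order."""
    return analyze(parse(collect(source)))
```

A call such as `triage(from_file("telemetry.bin"))` would run all three steps on a saved dump, where `telemetry.bin` is a made-up file name. Keeping the source pluggable means the same parse and analyze code can be reused whether the log comes from a live drive or from a file a customer sends in.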

For current products, Intel is developing an end-to-end tool chain to retrieve logs in customer environments (with various operating systems such as Linux, Microsoft Windows, etc.). Intel will ensure completeness of the logs (assert dumps, nLog, side trace, etc.) in order to effectively and completely debug sightings without needing access to the drive (unless, of course, the drive is dead). Intel would also like to use this method to debug internally generated sightings, such as those generated from validation (both continuous validation and product validation).

In my next article, we'll look at three primary use case implementations of telemetry standards and tools for Intel® SSDs.

Behnam Eliyahu

About Behnam Eliyahu

Behnam Eliyahu is an Application Engineer in the Non-Volatile Memory Solutions Group (NSG). He is responsible for technical enabling and support of flash storage (SSDs) and non-volatile memory and storage (Intel® Optane™) for EMEA customers. His responsibilities include escalation for field representatives, direct coverage of key customers, design-win activities, product training, support of product qualification and validation, support of solution development, and sustaining support of SSDs after launch. He works with storage innovators, OEMs, and ISVs to help drive the storage transition to PCI Express, NVMe-based 3D NAND flash products, and Optane™, and to support ever-expanding scale-up and scale-out storage needs. Prior to that, he was an enterprise SSD firmware team lead at the NSG design center in Longmont, Colorado. His team developed the flash translation layer for both 2D and 3D NAND technologies for the PCIe product line.