Telemetry: Customer Triage Use Cases for Intel® SSDs

Article co-written by Monika Sane.

In my previous article, I introduced Intel's use of telemetry to remotely gather debug data from Intel® SSDs. While there is the potential for any number of different event scenarios that pull the telemetry data, all of the scenarios boil down to just three main use cases. The details of use case implementations vary depending on the scenario; however, the general use case parameters remain the same.

As shown in Figure 1a, a telemetry log can consist of three data areas. None of the data areas has a fixed size. For host-initiated logs there is a target limit of 100 KB on data area 1, and the command has a hard limit of 31.9 MB, but the actual data area sizes are dynamic.
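Because the data area sizes are dynamic, a host has to read them from the log header before deciding how much to fetch. The following is a minimal sketch of that step, not Intel's tooling. It assumes the header layout in the NVMe Telemetry Host-Initiated log page as we understand it: three little-endian 16-bit "last block" fields at byte offsets 8, 10, and 12, counted in 512-byte blocks. Verify these offsets against the specification revision your drive implements. The function name `data_area_sizes` is hypothetical.

```python
import struct

# Assumption: telemetry data is addressed in 512-byte blocks.
BLOCK_SIZE = 512


def data_area_sizes(header: bytes) -> dict:
    """Return the size in bytes of data areas 1, 2 and 3.

    Assumes the header stores the last block number of each data area
    as little-endian 16-bit values at byte offsets 8, 10 and 12.
    """
    if len(header) < 14:
        raise ValueError("need at least the first 14 bytes of the header")
    last1, last2, last3 = struct.unpack_from("<HHH", header, 8)
    if not (last1 <= last2 <= last3):
        raise ValueError("last-block fields must be non-decreasing")
    return {
        1: last1 * BLOCK_SIZE,
        2: (last2 - last1) * BLOCK_SIZE,
        3: (last3 - last2) * BLOCK_SIZE,
    }


def max_data_bytes() -> int:
    """Largest data payload a 16-bit last-block field can describe."""
    return 0xFFFF * BLOCK_SIZE
```

Under these assumptions, a data area 1 that ends at block 200 is exactly 100 KB, and the 16-bit block field caps the data at just under 32 MB, which appears consistent with the limits quoted above.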

Use Case 1: Periodic Status Pull

In this use case, the host periodically pulls telemetry log data from the SSD. The frequency and the data set read will vary according to the individual scenario. Ideally, the less frequent the host polling event, the larger the data set that the host will pull. Possible reasons for a periodic telemetry data pull include, but are not limited to, a performance excursion during customer test, an excursion in the field, and long-term health monitoring in the field.

Requirements: Minimal data assembly overhead, as shown in Figure 1b, minimizes the chance that the data capture itself masks a performance issue. A small transfer size (data area 1 only) likewise minimizes the chance that the telemetry fetch masks the issue, and it reduces the load on system network resources when the data is uploaded.

Use Case 2: Single Event

In this use case, the host has detected a singular error event. This use case arises when the host detects an error that the SSD does not. Possible errors include, but are not restricted to, a command timeout, long command latency, or a data miscompare.

Requirements: In this case, the goal is maximum data capture, as shown in Figure 1c. The root cause of the error may have occurred a relatively long time before it is detected, so we need the full depth of data available (data areas 1, 2, and 3).
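The difference between the first two use cases comes down to how deep into the log the host reads. As a rough sketch, and assuming the log begins with a 512-byte header in block 0 followed by the data areas in order (our reading of the NVMe telemetry log layout, which should be checked against the specification), the read length can be computed as follows. The function `read_length` and the sample block numbers are hypothetical.

```python
# Assumption: 512-byte blocks, with the log header occupying block 0.
BLOCK_SIZE = 512


def read_length(last_blocks: tuple, deepest_area: int) -> int:
    """Bytes to read from the start of the log to cover areas 1..deepest_area.

    last_blocks holds the last block number of data areas 1, 2 and 3,
    as reported in the log header.
    """
    if deepest_area not in (1, 2, 3):
        raise ValueError("deepest_area must be 1, 2 or 3")
    # +1 adds the header block that precedes the data areas.
    return (last_blocks[deepest_area - 1] + 1) * BLOCK_SIZE


# Hypothetical last-block values for one log.
last_blocks = (200, 1000, 4000)
periodic_pull = read_length(last_blocks, 1)   # Use Case 1: data area 1 only
single_event = read_length(last_blocks, 3)    # Use Case 2: data areas 1 to 3
```

With these sample values, the periodic pull reads roughly 100 KB while the single-event pull reads about 2 MB, which illustrates why the periodic case is the one kept small.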

Figure 1: Telemetry data logs

Use Case 3: Gathering Both Dynamic and Static Telemetry Data at Varying Frequencies

Pull data areas 1 through 3 at a lower frequency to gather static data, and pull this data first. To get dynamic data, pull data area 1 at the desired frequency.
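One way to picture this mixed cadence is as a schedule of pulls. The sketch below is illustrative only: the `Pull` type, the `plan_pulls` function, and the periods used are hypothetical and do not come from any Intel tool. It simply marks each pull as either a full pull (data areas 1 through 3) or a data-area-1 pull, with a full pull always first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pull:
    time_s: int        # seconds since collection started
    data_areas: tuple  # data areas to read in this pull


def plan_pulls(duration_s: int, dynamic_period_s: int, static_period_s: int) -> list:
    """Plan full pulls at the slower static period and area-1 pulls in between.

    The pull at time 0 is always a full pull, so static data is gathered first.
    """
    if dynamic_period_s <= 0 or static_period_s <= 0:
        raise ValueError("periods must be positive")
    # Union of both cadences, so static pulls are kept even when the
    # static period is not a multiple of the dynamic period.
    times = sorted(
        set(range(0, duration_s + 1, dynamic_period_s))
        | set(range(0, duration_s + 1, static_period_s))
    )
    pulls = []
    for t in times:
        if t % static_period_s == 0:
            pulls.append(Pull(t, (1, 2, 3)))
        else:
            pulls.append(Pull(t, (1,)))
    return pulls


# Hypothetical cadence: area 1 every 10 s, all areas every 30 s, for one minute.
schedule = plan_pulls(duration_s=60, dynamic_period_s=10, static_period_s=30)
```

In a real collector each entry would trigger a log read of the corresponding depth. The periods themselves depend on the scenario being debugged.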

Along with data collection in firmware via telemetry, Intel is also building intelligence (interpretation and thresholds) into a separate tool, which may be a standalone tool or a standard Linux-based one. Initially, our focus is on capturing telemetry data for Intel internal use to remotely debug issues. We are therefore making sure that Intel internal tools can pull, parse, and analyze all pertinent information in the logs. We will then determine what information can be made available to customers to address their requests.

Efficiency Calculations

Key metrics for debug efficiency are time to data (TTD), defined as the time from when a bug appears at the customer to when actionable data is delivered to the drive vendor, and time to first failure analysis (FA) summary (TFFAS), defined as the time from first bug appearance to when the first FA summary is provided.
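Both metrics are simple differences between timestamps measured from the first appearance of the bug. The sketch below shows the arithmetic; the function name `debug_metrics` and the timestamps are made up for illustration and are not measured results.

```python
from datetime import datetime, timedelta


def debug_metrics(bug_seen: datetime, data_delivered: datetime,
                  fa_summary: datetime) -> dict:
    """Compute TTD and TFFAS, both measured from first bug appearance."""
    if not (bug_seen <= data_delivered <= fa_summary):
        raise ValueError("expected bug_seen <= data_delivered <= fa_summary")
    return {
        "TTD": data_delivered - bug_seen,   # time to data
        "TFFAS": fa_summary - bug_seen,     # time to first FA summary
    }


# Hypothetical timeline for a single bug.
bug_seen = datetime(2024, 1, 8, 9, 0)
metrics = debug_metrics(
    bug_seen,
    data_delivered=bug_seen + timedelta(hours=3),
    fa_summary=bug_seen + timedelta(days=1),
)
```

Because the first FA summary cannot precede the data it is based on, TFFAS is never shorter than TTD, so any reduction in TTD directly shortens the earliest possible TFFAS.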

For the current mode of operation, TTD can take from days to as much as a week, depending on such variables as how far away a drive vendor’s field person is and how easy it is to get to the drive. TFFAS is directly affected by this delay—without data to analyze, there is nothing to communicate for the first FA summary. Both TTD and TFFAS are expected to be reduced significantly using the telemetry feature.

Once the customer has integrated telemetry, collecting the telemetry data is virtually instantaneous and happens in-circuit. Delivering that data to Intel, or to the customer's own technician, depends on the customer's processes. The time needed to analyze the telemetry data and reach the first FA summary is also significantly reduced by Intel® automated tools, such as the triage tool. Once a bug occurs, for example, the customer can pull the telemetry logs and provide them to the drive vendor on the same day, likely within hours of the first failure occurrence, without any intervention from the drive vendor's application engineer (AE).

Another benefit of telemetry is that it is an industry standard, so customers can use drives from multiple vendors and still expect to pull telemetry files from every drive in their products with the same standard command. We would like OEMs to adopt this for all vendors, not just for Intel® SSDs. In theory, an OEM should only need to write a single utility to gather logs from all of its vendors' drives, since telemetry is now a standard mechanism in both NVMe and SATA.

Efficiency also means lighter-touch support from Intel AEs and technical sales specialists, relative to what we have had to do until now without telemetry. It also means shipping fewer failed drives back and forth and eliminating the time that takes.

Future of Telemetry

Intel plans to enhance telemetry capabilities in future products, through the following means:

  • Partner with internal FSE experts to define next-generation debug hooks (in both hardware and firmware).
  • Implement and follow through on the next-generation hooks.
  • Partner with all existing FSE forums and experts to achieve the above objectives.

Customer Response

Intel customers worldwide have expressed great interest in telemetry as a debug aid. Every customer asked for the ability to interpret some portion of the telemetry log (first-level, self-serve analysis) in order to understand a potential failure and take some action in advance without involving Intel. Intel is excited to evolve this feature in the upcoming months. Please stay tuned for more information on telemetry.

Behnam Eliyahu

About Behnam Eliyahu

Behnam Eliyahu is an Application Engineer in the Non-Volatile Memory Solutions Group (NSG). He is responsible for technical enabling and support of flash storage (SSDs) and non-volatile memory and storage (Intel® Optane™) for EMEA customers. His responsibilities include escalation for field representatives, direct coverage of key customers, design-win activities, product training, support of product qualification and validation, support of solution development, and sustaining support of SSDs after launch. He works with storage innovators, OEMs, and ISVs to help drive the storage transition to PCI Express, NVMe-based 3D NAND flash products, and Optane™, and to support ever-expanding scale-up and scale-out storage needs. Prior to that, he was an enterprise SSD firmware team lead in the NSG design center in Longmont, Colorado. His team developed the flash translation layer for both 2D and 3D NAND technologies for the PCIe product line.