Do you know what hardware telemetry is, and how it can be used to modernize your data center? You might not, because it is data that generally doesn’t present itself to you unless you take steps to expose it. I’m talking about the stream of data generated by the components on your hardware platform, such as the CPU, memory, and PCIe interface.
Telemetry data can help you diagnose thorny problems, and it can be an unsung hero for detecting and identifying issues like memory failure and for managing system-health elements like power consumption and thermal efficiency. Check out this infographic on the five benefits of telemetry monitoring to learn more.
Monitoring your telemetry data is a key step toward creating a modern autonomous data center: an intelligent data center that can predict, diagnose, and respond to workload needs for infrastructure. But first you need to start collecting the data.
Get a First Taste of Telemetry
If you are new to telemetry, a good place to start is with the Intel Telemetry Collector (ITC). With ITC, you can ingest and visualize data from various sources and multiple machines to gain an understanding of the different metrics available.
ITC collects telemetry such as power and thermal statistics, performance counters, process activities, threads, and more. The ITC visualization highlights typical pain points such as memory bandwidth, NUMA imbalance, and interrupt request (IRQ) affinity issues. With access to the same collection of tools that Intel’s own performance engineers use, you can identify system imbalances, frequency inefficiency, and memory and input/output (I/O) issues. With ITC, you’ll begin to understand the depth of telemetry data provided by Intel architecture. Intel telemetry is so robust that advanced users are able to optimize their code by looking at every single memory access to debug issues and identify bottlenecks.
If this sounds interesting to you, talk to your Intel representative about how to get access to ITC.
Build a Telemetry Software Stack
ITC is a good first step into telemetry, but it is not a fleet solution. To monitor all your server telemetry, you’ll want to build a scalable software stack. There are various options here, including Elasticsearch, Logstash, and Kibana (ELK). But the easiest and best-supported toolchain for this purpose is the Prometheus stack.
Prometheus is an open source monitoring system and time-series database that collects metrics by periodically scraping them from telemetry agents installed on all your servers. For a telemetry agent, you’ll probably want to use collectd or Telegraf. To build a dashboard for viewing the data in Prometheus, I recommend Grafana.
Example of telemetry visualization using Grafana
This Prometheus stack supports the open telemetry standard, is containerized, and is not difficult to set up. Intel supports the Prometheus toolchain by adding exporters for Intel hardware components, either through collectd or Telegraf exporters or through direct exporters.
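To make the exporter idea a little more concrete, here is a minimal sketch of what an agent hands to Prometheus when it is scraped: plain text lines in the Prometheus exposition format. In practice collectd, Telegraf, or a purpose-built exporter produces this for you; the helper names and the metric name node_package_power_watts below are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# One sample is a set of labels plus a numeric reading.
Sample = Tuple[Dict[str, str], float]


def escape_label(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str, samples: List[Sample]) -> str:
    """Render one gauge metric family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        pairs = [f'{key}="{escape_label(val)}"' for key, val in sorted(labels.items())]
        label_str = ",".join(pairs)
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric name and readings, for illustration only.
    print(render_gauge(
        "node_package_power_watts",
        "Hypothetical CPU package power draw.",
        [({"socket": "0"}, 41.5), ({"socket": "1"}, 39.0)],
    ))
```

A real agent would serve this text over HTTP so that Prometheus can scrape it on a schedule; the sketch only shows the shape of the payload.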
Reap the Benefits
The first benefit you’ll likely notice when you start monitoring your telemetry is an improved ability to diagnose and troubleshoot. It can be difficult and time-consuming to identify exactly what’s gone wrong in your data center. You might not know whether a problem or degradation is caused by a failure of hardware, such as memory, or by a workload imbalance that’s overloading certain servers or components. Without the right information, you can find yourself constantly struggling to put out fires without knowing exactly what is causing them.
Which applications are using your memory? Do you have a lot of resource contention in your cache? Do you have a properly multithreaded application that’s being well distributed across cores, or is it only using one core? Is your server throttling? What are the power and temperature conditions inside each server? With telemetry, you get a finer-grained picture of exactly what is going on in all of your servers, and that picture answers these kinds of questions so you can resolve problems faster and more simply.
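Once the data is in Prometheus, questions like these become queries. The sketch below uses the Prometheus HTTP API instant-query endpoint (/api/v1/query) to fetch the latest value of a metric for each server. The server address and the metric name server_cpu_temperature_celsius are assumptions for illustration; the metric names you actually see depend on the agent and plugins you deploy.

```python
import json
from typing import Any, Dict
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query."""
    query_string = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def parse_instant_vector(payload: Dict[str, Any]) -> Dict[str, float]:
    """Map each series' `instance` label to its latest value."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector result")
    latest = {}
    for series in data["result"]:
        instance = series["metric"].get("instance", "unknown")
        _timestamp, value = series["value"]  # the value arrives as a string
        latest[instance] = float(value)
    return latest


def query_prometheus(base_url: str, promql: str, timeout: float = 5.0) -> Dict[str, float]:
    """Run an instant query against a Prometheus server and parse the result."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as response:
        return parse_instant_vector(json.load(response))


if __name__ == "__main__":
    # Hypothetical server address and metric name.
    temps = query_prometheus("http://prometheus.example:9090",
                             "server_cpu_temperature_celsius")
    for host, celsius in sorted(temps.items()):
        print(host, celsius)
```

Grafana issues the same kind of query behind each dashboard panel, so a script like this is mainly useful for ad hoc checks or for feeding telemetry into your own tooling.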
Come for the troubleshooting but stay for the opportunities. Your immediate need might be to put out fires more quickly. But once you have telemetry monitoring in place, you’ll discover that everything about managing your data center starts getting easier and more efficient.
When you can see where the bottlenecks and overloads are occurring over time, you can start fine-tuning your processes and orchestrating your workload placement based on real-time online workload profiling to avoid those performance bottlenecks. This helps to protect network performance and balance the load across the infrastructure. Intel has worked with hyperscalers to solve highly advanced problems using Intel telemetry. One hyperscaler was able to eliminate up to 96 percent of instruction cache misses. Another demonstrated 4.5 percent performance improvements by optimizing configurations for microservices using telemetry data for A/B testing.
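As a rough illustration of telemetry-aware workload placement, the sketch below picks the server with the lowest recent average utilization that still has room for a new workload. It is a deliberately simplified stand-in for what a real orchestrator does; the function name, the 0.8 utilization ceiling, and the sample data are all assumptions, not anything Intel or the hyperscalers mentioned above are known to use.

```python
from statistics import mean
from typing import Dict, List, Optional


def pick_host(recent_utilization: Dict[str, List[float]],
              demand: float,
              ceiling: float = 0.8) -> Optional[str]:
    """Return the least-loaded host that can absorb `demand`, or None.

    `recent_utilization` maps a host name to recent utilization samples
    in the range 0.0 to 1.0; `demand` is the extra utilization the new
    workload is expected to add. Hosts with no samples are skipped.
    """
    averages = {host: mean(samples)
                for host, samples in recent_utilization.items() if samples}
    fitting = {host: load for host, load in averages.items()
               if load + demand <= ceiling}
    if not fitting:
        return None
    # Break ties on host name so the choice is deterministic.
    return min(fitting, key=lambda host: (fitting[host], host))


if __name__ == "__main__":
    telemetry = {
        "server-a": [0.7, 0.9],
        "server-b": [0.2, 0.4],
        "server-c": [0.5, 0.5],
    }
    print(pick_host(telemetry, demand=0.2))
```

Averaging over a window of samples rather than using the latest reading keeps a single momentary spike from steering placement decisions.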
Meanwhile, your Grafana dashboards can trigger alerts when conditions arise that need attention, and automated alerts are a first step toward automating your data center on a grander scale. Our vision of a modern autonomous data center is one where human intervention is increasingly replaced by automated processes. By applying machine learning to all this telemetry data, systems will be able to predict and prevent some problems and to diagnose and resolve others.
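The logic behind a typical threshold alert is simple enough to sketch: fire only when a reading has stayed above a limit for a sustained period, so that a brief spike does not page anyone. The code below is a simplified illustration of that idea, not Grafana's actual evaluation engine; the function name, the 85-degree threshold, and the sample readings are assumptions.

```python
from typing import List, Tuple


def should_alert(samples: List[Tuple[float, float]],
                 threshold: float,
                 hold_seconds: float) -> bool:
    """Return True if the metric has stayed above `threshold` for at
    least `hold_seconds`, measured up to the most recent sample.

    `samples` is a list of (timestamp_seconds, value) pairs sorted by time.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # start of the current breach
        else:
            breach_start = None  # reading recovered, so reset the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds


if __name__ == "__main__":
    # Hypothetical CPU temperature readings taken once a minute.
    readings = [(0, 70.0), (60, 86.0), (120, 88.0), (180, 90.0)]
    print(should_alert(readings, threshold=85.0, hold_seconds=120))
```

The hold period is the design choice that matters here: a longer hold means fewer false alarms but a slower reaction to a genuine problem.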
To learn more about getting started with telemetry monitoring, check out this white paper.
Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
All product plans and roadmaps are subject to change without notice.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.