Recently, I posted my blog, “Do You Know What Your Data Center is Up To?,” to jump-start a discussion about making your data center autonomous. One thing I realized after posting that first blog is that I didn’t give any technical details. This blog addresses that gap, providing some real “meat” that you can use to start the shift to an autonomous data center.
First, let me talk about the reference architecture for telemetry collection. A telemetry architecture includes the following key components:
- Learning: Ingesting the key telemetry data points, defining baselines, and then identifying trends which drive prescriptive controls.
- Control: The scheduler that acts upon the telemetry data to make changes.
- Alerting: When collected data falls outside the norm, the alerting system sends a notification to the DC Ops team.
- Visualization: The system that shows the collected data in graph format.
- TSDB (time-series database): The system that captures inbound telemetry (push) or scrapes (pull) data from collection endpoints.
- Scalable & Expandable Collection Tool: An agent or application that captures the key telemetry and either pushes or pulls the information in a given time interval.
Each component has a number of options and potential combinations to create the right architecture for your needs.
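To make the Learning and Alerting components above concrete, here is a minimal Python sketch (the function names, sample values, and three-sigma threshold are all illustrative assumptions, not taken from any specific tool): it builds a baseline from historical telemetry samples, then flags new samples that deviate too far from that baseline, which is the trigger an alerting component would act on.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Compute a simple baseline (mean, standard deviation)
    from historical telemetry samples -- the 'Learning' step."""
    return mean(samples), stdev(samples)

def out_of_norm(value, baseline, n_sigma=3.0):
    """Flag a sample more than n_sigma standard deviations from
    the baseline -- the condition the 'Alerting' step reacts to."""
    mu, sigma = baseline
    return abs(value - mu) > n_sigma * sigma

# Hypothetical CPU-utilization history (percent), e.g. collected by CollectD
history = [41.0, 43.5, 40.2, 42.8, 44.1, 41.9, 43.0, 42.2]
baseline = build_baseline(history)

print(out_of_norm(42.5, baseline))  # typical sample -> False
print(out_of_norm(97.0, baseline))  # spike worth alerting on -> True
```

Real systems would use rolling windows and per-metric thresholds, but the shape is the same: learn a baseline, compare incoming telemetry against it, and notify when something drifts.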
Now that you know what an end-to-end telemetry collection system looks like, let’s consider a couple real-world solution architectures that showcase a few of these components. These solutions are industry-known and widely available to deploy throughout your data center.
The following solution architecture combines CollectD* with Grafana* and Prometheus*. What makes this compelling is the community support for Grafana, which allows you to easily download and deploy graphs into your instance. Prometheus scrapes metrics from the local machines, so the nodes are not actively reporting out; instead, they expose their telemetry for the pull. Installation is rather simple for both Grafana and Prometheus, and adding the node configuration to your instance is easy.
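To give a feel for the pull model, here is a small Python sketch that parses the JSON shape a Prometheus server returns from its `/api/v1/query` endpoint. The embedded response is a hand-written sample in that format; the metric name and instance are illustrative, not from a real deployment.

```python
import json

# Hand-written sample in the shape of a Prometheus /api/v1/query
# response for an instant-vector query.
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "node_cpu_utilization",
                           "instance": "node1:9100"},
                "value": [1609459200, "42.5"],
            }
        ],
    },
})

def extract_values(body):
    """Return {instance: float_value} from an instant-vector response."""
    payload = json.loads(body)
    if payload["status"] != "success":
        raise RuntimeError("Prometheus query failed")
    return {
        r["metric"].get("instance", "unknown"): float(r["value"][1])
        for r in payload["data"]["result"]
    }

print(extract_values(sample_response))  # {'node1:9100': 42.5}
```

In a live setup you would fetch the body with something like `urllib.request.urlopen("http://<prometheus-host>:9090/api/v1/query?query=...")` and feed it to `extract_values`; the host name here is a placeholder.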
As another option, you could use CollectD with the ELK stack (Logstash*, Elasticsearch*, and Kibana*). What I have found very intriguing about this solution is that if you are using Docker*, there are several ready-made containers you can deploy, from an all-in-one container to a three-container model. This solution requires each node to configure the “Network” plug-in in CollectD to push its information to Logstash at a given time interval.
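The push model in this architecture can be sketched in a few lines of Python. This is a simplified illustration, not the CollectD Network plug-in's actual wire format (which is binary): it assumes a hypothetical Logstash TCP input configured with a newline-delimited JSON codec on port 5000, and the host name and metric fields are made up.

```python
import json
import socket
import time

def encode_event(host, plugin, value):
    """Serialize one telemetry reading as a newline-delimited JSON
    event, the framing a JSON-lines TCP input expects."""
    event = {
        "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "host": host,
        "plugin": plugin,
        "value": value,
    }
    return (json.dumps(event) + "\n").encode("utf-8")

def push_event(payload, logstash_host="logstash.example.com", port=5000):
    """Push one encoded event over TCP (hypothetical endpoint)."""
    with socket.create_connection((logstash_host, port), timeout=5) as sock:
        sock.sendall(payload)

# Encode a sample CPU reading; push_event(payload) would ship it
# to the collector on a timer.
payload = encode_event("node1", "cpu", 42.5)
print(payload)
```

The key contrast with the Prometheus architecture is in who initiates: here each node opens the connection and pushes on an interval, whereas Prometheus pulls on its own schedule.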
Both of these solution architectures are open source, which means they have great community support and more configuration options than you can imagine. In addition, CollectD runs on the top server OS platforms, such as CentOS*, Red Hat Enterprise Linux* (RHEL), Ubuntu*, and FreeBSD*.
With each of these solution architectures, the scale factor is important. Anyone can set up a telemetry system on a single node or even a rack, and this is perfect for testing and for showing the business value and TCO of such a system. But when it’s time to scale to thousands of nodes, you need to start thinking about deployment. Do I use a Docker image to deploy via Kubernetes*? Do I use Chef*, Ansible*, or a custom tool to make the telemetry system part of my automated build and deployment process? These are all good questions to think through when considering future telemetry you may need for new use cases and adjacent platform components.
So let’s keep this going… what do I collect? Well, it depends. What is your pain point? Here’s a mapping of pain points to the telemetry you should collect (not an exhaustive list):
| Pain Point | What to Collect | Why You Need It |
|---|---|---|
| Infrastructure Efficiency (Power Management, Thermal Management, and Workload Optimization) | System-level Info | Gathers the system-level information around power and thermals that is exposed through the Baseboard Management Controller (BMC) to the OS |
| | Utilization | Utilization of the system over time |
| | Performance Information | Monitoring the performance of the CPU, cache, and memory. Needed to determine the amount of effective headroom on the system; used to target specific machines for workload co-location |
| Reliability | Power & Temperature Spikes | Need to know if there are any system issues, such as power spikes or high temperatures, causing the failure or rapid degradation of the device |
| | Memory Errors | Logging tool for memory errors seen in the system |
| | Drive Health | Shows the wear-out indication of a drive along with various other health-related metrics (for example, unsafe power downs) |
| | DIMM Performance & Health | Raw performance and health for the DIMM |
| | Utilization | Utilization of the system and an indicator of whether the system is highly used at the time of failure |
| | Throughput | Monitoring high throughput and its relation to failure or potential failure. Used alongside MCElog and ipmitool for failure trend detection |
| Performance | Performance Metrics | System-level performance for each cgroup, which is necessary to understand the resource consumption of each cgroup (that is, workload) |
| | Memory Access | Measuring the memory accesses of the system, which is used for calculating the effective headroom for resources |
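As a worked example of one use of utilization and performance telemetry, estimating effective headroom for workload co-location, here is a short Python sketch. The samples and the 95th-percentile rule are illustrative assumptions, not a prescribed method.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of numeric samples."""
    ordered = sorted(samples)
    rank = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[rank]

def effective_headroom(utilization_pct, capacity_pct=100.0, pct=95):
    """Headroom left on a node: capacity minus a high-percentile
    utilization, a conservative basis for deciding whether another
    workload can be co-located on the machine."""
    return capacity_pct - percentile(utilization_pct, pct)

# Hypothetical per-minute CPU utilization samples (percent) for one node
samples = [35, 40, 42, 38, 55, 60, 41, 39, 44, 70]
print(effective_headroom(samples))  # 30.0
```

Using a high percentile rather than the mean protects against co-locating workloads onto a node whose average looks idle but whose peaks are already near capacity.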
I hope this blog has made you even more excited about transforming your data center into an autonomous environment. Stay tuned for my next blog, which will delve deeper into the exciting projects Intel is working on that can help you with that transformation.