The journey to a modern autonomous data center can become a reality. What is a modern autonomous data center? It combines the latest compute, storage, memory, network, and accelerator hardware with software automation for management and infrastructure efficiency.
While this sounds exciting, there are fundamentals I covered in a previous blog, "Do You Know What Your Data Center Is Up To?", which discussed key use cases for modernizing infrastructure management. In my follow-up blog, "To Collect or Not Collect?", I focused on what kind of data center information is relevant to collect for your specific use case, and how to collect that data. Now, let’s take the next step and dig into what Intel is doing to enable the modern infrastructure. I’m going to review our commercial offering, Intel® Data Center Manager (Intel® DCM), and several open source projects, then close with “the art of what is possible.” Let’s go!
Managing Rising Energy Costs
First, let’s start with our commercial solution for power and thermal management, Intel® Data Center Manager (Intel® DCM). Energy costs are the fastest-rising expense for today’s data centers. Intel® DCM provides real-time power and thermal consumption data—giving you the clarity you need to lower power usage, increase rack density, and prolong operation during outages.
- Real-Time Power and Thermal Monitoring: Accurate power and thermal consumption data gives you the insights needed to manage your data center or co-location power usage and hotspots.
- Health Monitoring and Utilization: Granular sub-component failure analysis and out-of-band real-time utilization data (CPU, disk, and memory).
- Increase Rack Density: Maximize server count per rack in a fixed-rack power envelope for increased data center utilization.
- Optimize Power Usage: Optimize power profiles per server, rack, floor, or workload/application and reduce electricity costs.
- Power Through Outages: Continue or prolong operations during power outages. Business is better with shorter and less frequent outages.
“Intel® DCM helps you take control of your data center and drive efficiency.”
– Ajay Garg, Director, Intel® DCM
The Open Source Community
Now, as we shift to open source projects, let’s start by talking about collectors—specifically collectd, a mature system statistics collection daemon widely used across the industry. It consists of a core daemon and a set of read and write plugins. Its pluggable architecture lets you collect chosen metrics with read plugins; in the northbound direction, write plugins push data to different places, such as OpenStack* services or databases, or send it over the network.
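To make the read/write plugin model concrete, here is a minimal collectd.conf sketch: two read plugins gather CPU and memory metrics, and the network write plugin pushes them northbound. The collection interval and server address are illustrative placeholders.

```
# Minimal collectd.conf sketch: read plugins collect, write plugins push.
Interval 10                  # seconds between collections (placeholder)

LoadPlugin cpu               # read plugin: per-CPU utilization
LoadPlugin memory            # read plugin: memory usage
LoadPlugin network           # write plugin: send metrics over the network

<Plugin network>
  # Placeholder target: a remote collectd instance or other collector
  Server "192.0.2.10" "25826"
</Plugin>
```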
As an active community member and plugin developer on collectd, Intel aims to expose key telemetry that can help drive the maturity of the modern autonomous data center, from collection to prediction to automation, providing true platform service assurance. Specifically, we are working on several plugins that expose telemetry from Intel’s unique platform features, shown in the table below. Leveraging a combination of these metrics can truly differentiate your solution by improving quality of service, reliability and resiliency, fault management, and compute, network, and storage resource utilization. Through its interfaces to cloud-native monitoring solutions like Prometheus* and Nagios*, collectd provides the essential metrics and integration points for optimal resource utilization and for feeding machine learning and artificial intelligence algorithms.
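As one example of the cloud-native integration mentioned above, collectd’s write_prometheus plugin exposes collected metrics over HTTP (on port 9103 by default) so a Prometheus server can scrape them. A minimal prometheus.yml fragment might look like the sketch below; the host names are placeholders.

```
scrape_configs:
  - job_name: 'collectd'
    static_configs:
      # Placeholder hosts running collectd with write_prometheus loaded
      - targets: ['node1:9103', 'node2:9103']
```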
Table 1: collectd Plugins Pertaining to Various Platform and Cloud Technologies Enabled by Intel

| Technology | collectd Plugins |
| --- | --- |
| Intel® Run Sure Technology / RAS | mcelog, PCIe AER, logparser: metrics and notifications pertaining to Intel® Run Sure Technology memory, PCIe, UPI, core, and LLC RAS features |
| Intel® Resource Director Technology (Intel® RDT) | intel_rdt: metrics from Cache Allocation Technology (CAT) and Memory Bandwidth Allocation (MBA) |
| OVS* | ovs_stats, ovs_events: metrics related to Open vSwitch* |
| DPDK* | dpdkstat, dpdkevents, hugepages: DPDK-related metrics |
| OpenStack* | Gnocchi, Ceilometer, Aodh: integration with OpenStack projects |
| Cloud | write_kafka, write_prometheus, node_exporter, VES: integration with various cloud platforms |
| Storage | RAID, SMART: storage-related metrics |
| Out of Band | IPMI, Redfish: telemetry from out-of-band platform features |
| Platform | intel_pmu: CPU performance monitoring unit (PMU) counters |
Having been on the road talking with customers, I’ve found healthy knowledge of collectd; however, the question I get asked is which plugins to use for which use cases. For example, if a customer is concerned with SSD reliability, which tools need to be installed, which plugins should be turned on, and at what collection rate? Another use case is, “How do I detect noisy neighbor issues, and what metrics would help identify them?” These are great questions! We have all of this documented for public consumption, and I share it when I discuss each use case with customers.
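For the SSD-reliability question above, one plausible starting point is collectd’s smart plugin, which reads S.M.A.R.T. health attributes from drives (it requires root privileges and libatasmart). A sketch of the relevant collectd.conf lines follows; the disk-matching regex is a placeholder to adapt to your device naming.

```
LoadPlugin smart

<Plugin smart>
  # Placeholder regex: monitor SATA/SAS-style device names
  Disk "/^[hs]d[a-z]/"
  IgnoreSelected false
</Plugin>
```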
The open source project we are investing in to provide use case, event, and metric information is called OPNFV Barometer. As the only telemetry project in the OPNFV ecosystem, it provides a scalable, easily deployable containerized solution that takes all of our plugins and delivers data via a configured, ready-to-run package using Ansible*. It deploys and interconnects the latest collectd plugins, InfluxDB* (a time series database), and Grafana* (a visualization tool), all in less than 10 minutes via “One Click Install”! You can then visualize the various telemetry data sets available to leverage for your deployments. I highly recommend trying this out in the lab to see how easy it is to install and configure.
“Collectd and OPNFV Barometer provide out-of-the-box, easy-to-scale, open source metrics and monitoring solutions for all your infrastructure telemetry needs.”
– Sunku Ranganath, Network Software Engineer, Intel
Let’s now connect all the dots from use case to specific plugin and explore “the art of what is possible.” One specific use case is memory reliability, in which we leverage CPU, memory, storage, and platform telemetry and correlate it with memory errors. Specifically, we use the following collectd plugins: mcelog, intel_pmu (Intel® Performance Monitoring Unit counters), IPMI, and the CPU/memory/disk plugins. We then feed this information into an algorithm we developed based on a Dynamic Bayesian Network to produce a probability of failure for each memory module. From there, as you can imagine, we can notify an operations console or a scheduler to put a server into maintenance mode, reducing downtime and enabling a faster mean time to repair (MTTR).
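Intel’s actual Dynamic Bayesian Network is not published here, but the flavor of the prediction step can be illustrated with a toy Bayesian update in Python: each collection interval, the belief that a DIMM is failing is updated from its corrected-error count (as mcelog might report). All probabilities, thresholds, and the example counts below are illustrative placeholders, not the production model.

```python
# Toy sketch: update failure probability of a DIMM from corrected-error
# observations using Bayes' rule. Likelihoods and threshold are placeholders.

def posterior_failure_prob(prior, ce_count,
                           p_burst_given_failing=0.90,
                           p_burst_given_healthy=0.05,
                           threshold=10):
    """Return P(failing | this interval's corrected-error count)."""
    observed_burst = ce_count >= threshold  # e.g. mcelog corrected errors
    if observed_burst:
        num = p_burst_given_failing * prior
        den = num + p_burst_given_healthy * (1 - prior)
    else:
        num = (1 - p_burst_given_failing) * prior
        den = num + (1 - p_burst_given_healthy) * (1 - prior)
    return num / den

# Feed successive intervals of telemetry through the filter:
p = 0.01                       # prior belief the DIMM is failing
for count in [0, 2, 15, 40]:   # corrected errors per interval (made up)
    p = posterior_failure_prob(p, count)
print(round(p, 3))
```

A real deployment would model temporal dependencies across many telemetry streams at once; this sketch only shows how repeated error bursts drive the failure probability up, at which point an orchestrator could cordon the node for maintenance.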
What’s next? Let’s finish out “the art of what is possible” and get to installing, using, and analyzing the data. Stay tuned for my next blog!