The journey to transform your data center into a modern autonomous data center can become a reality. What is a modern autonomous data center? It combines the latest hardware (compute, storage, memory, network, and accelerators) and scheduling and orchestration with software automation for management and infrastructure efficiency. While this sounds exciting, there are fundamentals to cover first. In an earlier blog (What is your data center up to), I discussed key use cases for going modern in infrastructure management. In the next blog (To collect or not collect), the focus was on what kind of data center information is relevant to collect for your specific use case and how to collect that data. Then I shifted to what Intel is doing to enable modern infrastructure. Now let’s move on to a few implementation and reference items that will help you move from batter to a cake.
First, let’s talk about rolling out telemetry. While you have many options available for provisioning systems and applications, for this example we chose to use Ansible*. The goal was to start from a base system, install all core OS packages, bring in the packages required for collection, and then start the collectd service while making the data available via collectd_exporter for a Time Series Database (TSDB) to scrape. My peer Marco Righini helped develop the Ansible playbook (a sample of which is shown in the code snippet below), and we posted a working reference on GitHub (https://github.com/JoshHilliker/Telemetry-Infra).
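Separately from the playbook, it can help to look at what the TSDB actually sees at the end of that pipeline. The sketch below parses a few lines of Prometheus-style text exposition output, which is the kind of text a scraper reads from an exporter. The sample metric names, labels, and values are made up for illustration, and the parser is deliberately minimal: it skips lines with timestamps and does not handle escaped quotes inside label values.

```python
import re

# Matches 'name{labels} value' or 'name value'. This is a simplification of
# the Prometheus text exposition format, not a complete parser.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_exposition(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # anything this simple pattern cannot read is skipped
        name, label_text, value = match.groups()
        labels = dict(LABEL_RE.findall(label_text or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical scrape output; real metric and label names depend on the
# plugins you enabled and on your exporter version.
SAMPLE = """\
# TYPE collectd_memory gauge
collectd_memory{instance="lab-node-1",memory="used"} 2.5e+09
collectd_memory{instance="lab-node-1",memory="free"} 1.0e+09
collectd_uptime{instance="lab-node-1"} 86400
"""

if __name__ == "__main__":
    for name, labels, value in parse_exposition(SAMPLE):
        print(name, labels, value)
```

If the metrics you expect are missing from a scrape like this, the problem is usually upstream in collectd or the exporter, not in the TSDB.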
Shifting to the configuration of collectd, for this example we stripped the collectd.conf file down to focus on just a few metrics that we want to expose. You can always pull down the latest version of collectd and then leverage the 100+ plugins defined by the community. If you are ready to try it out, I would recommend doing this in a lab/dev environment first and then building what you want to roll out in your production environment. For a lab best known method: my peer Karl Vietmeier uses Vagrant for rapid startup, recycle, and test. (https://github.com/kvietmeier)
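When you trim a collectd.conf like this, a quick way to confirm what is still switched on is to list the uncommented LoadPlugin lines. Here is a minimal sketch of that idea. The sample configuration is hypothetical, and the pattern only covers the simple `LoadPlugin name` form (optionally quoted or written as a block opener), so treat it as a lab helper and not a full config parser.

```python
import re

# Simple forms only: LoadPlugin name, LoadPlugin "name", or <LoadPlugin name>.
# Lines that start with '#' never match, so commented-out plugins are ignored.
LOADPLUGIN_RE = re.compile(r'^\s*<?\s*LoadPlugin\s+"?([A-Za-z0-9_]+)"?\s*>?\s*$')


def enabled_plugins(conf_text):
    """Return plugin names from uncommented LoadPlugin lines, in file order."""
    plugins = []
    for line in conf_text.splitlines():
        match = LOADPLUGIN_RE.match(line)
        if match and match.group(1) not in plugins:
            plugins.append(match.group(1))
    return plugins


# Hypothetical trimmed-down collectd.conf for a lab system.
SAMPLE_CONF = """\
LoadPlugin syslog
LoadPlugin cpu
LoadPlugin memory
#LoadPlugin ipmi
LoadPlugin write_http
"""

if __name__ == "__main__":
    print(enabled_plugins(SAMPLE_CONF))
```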
Here are a few gotchas that tripped me up the first few times with collectd:
1) Syntax matters.
2) Having all the prerequisites is important, or you may not have the plugin or the right library to make the plugin start.
3) If the service fails, most likely a plugin did not load correctly or a configuration block is not closed correctly. Check “journalctl -xe” after you start the service to see more, or cat the log file (this example is with CentOS* v7).
4) Check the syntax in the collectd documentation (https://collectd.org/documentation/manpages/collectd.conf.5.shtml), because there may be a flag for reporting percentages versus absolute values that will help with alerting later.
5) When enabling new plugins, do one at a time because it makes troubleshooting that much easier (compared to having to # out each one and reverse the process).
6) Intel has a few plugins that you will need to get, such as intel_rdt and intel_pmu.
7) Use the syslog plugin to enable logging (see the configuration block) because it will help with troubleshooting later.
8) Sequence matters. If you are going to use the write_http plugin, the endpoint it writes to needs to be running before you start or restart collectd. Therefore, start collectd_exporter first, then start collectd.
9) If you leave a plugin’s configuration block commented out (#), the plugin uses its defaults, which can make things easier. For example, IPMI will give you all sensors by default; uncomment the block if you want to pare the list down to the key ones.
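Gotcha 3 is easy to catch before you ever restart the service. The sketch below is a small, unofficial checker that walks a collectd.conf and reports blocks that are opened but not closed, or closed without being opened. The sample configuration is an assumption for illustration (including the URL, which you should replace with wherever your exporter listens), and the comment handling is naive: it treats any `#` as the start of a comment, even inside a quoted string.

```python
import re

OPEN_RE = re.compile(r'^\s*<([A-Za-z_][A-Za-z0-9_]*)\b[^>]*>\s*$')
CLOSE_RE = re.compile(r'^\s*</([A-Za-z_][A-Za-z0-9_]*)\s*>\s*$')


def find_block_errors(conf_text):
    """Return a list of human-readable problems with <Block> ... </Block> nesting."""
    errors = []
    stack = []  # (block_name, line_number) for every block still open
    for lineno, raw in enumerate(conf_text.splitlines(), start=1):
        line = raw.split("#", 1)[0]  # naive comment stripping
        close = CLOSE_RE.match(line)
        opened = OPEN_RE.match(line)
        if close:
            name = close.group(1)
            if name not in [open_name for open_name, _ in stack]:
                errors.append(f"line {lineno}: </{name}> has no matching opening block")
                continue
            # Pop inner blocks until we reach the one being closed.
            while stack:
                open_name, open_line = stack.pop()
                if open_name == name:
                    break
                errors.append(
                    f"line {open_line}: <{open_name}> is not closed before "
                    f"</{name}> on line {lineno}"
                )
        elif opened:
            stack.append((opened.group(1), lineno))
    for open_name, open_line in stack:
        errors.append(f"line {open_line}: <{open_name}> is never closed")
    return errors


# Hypothetical write_http block; adjust the URL for your own exporter.
GOOD_CONF = """\
LoadPlugin write_http
<Plugin write_http>
  <Node "exporter">
    URL "http://localhost:9103/collectd-post"
    Format "JSON"
  </Node>
</Plugin>
"""
BAD_CONF = GOOD_CONF.replace("  </Node>\n", "", 1)  # simulate a forgotten close

if __name__ == "__main__":
    print(find_block_errors(GOOD_CONF))
    for problem in find_block_errors(BAD_CONF):
        print(problem)
```

A check like this only looks at block structure. It does not know whether a plugin or option name is valid, so it complements the service logs and does not replace them.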
Last month I was asked, “What does the output look like? What do the graphs look like?” Great questions! For this example, I used collectd + Prometheus* + Grafana* to show what is possible. In the graph below, I show a few metrics that matter, such as the Intel® Optane™ SSD Wearability indicator, also called E9H or Endurance Analyzer. You will see it in the SMART metrics for the drive, and you can set an alarm to alert you when it dips to a different level. I also show the memory used, which is a good indicator of memory trends. While this is not the MCELOG output, it is a good metric to keep an eye on.
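To make the "alert on a dip" idea concrete, here is a minimal sketch of the logic in plain Python. In a real deployment you would more likely express this as an alert in your monitoring stack; this only illustrates the comparison. The sample readings, the timestamps, and the `min_drop` parameter are all made up for the example.

```python
def find_dips(samples, min_drop=1.0):
    """Return (timestamp, previous, current) for each drop of at least min_drop.

    samples is an iterable of (timestamp, value) pairs in time order.
    """
    dips = []
    previous = None
    for timestamp, value in samples:
        if previous is not None and previous - value >= min_drop:
            dips.append((timestamp, previous, value))
        previous = value
    return dips


# Hypothetical hourly readings of a wear indicator that counts down from 100.
WEAR_SAMPLES = [
    (0, 100.0),
    (3600, 100.0),
    (7200, 99.0),
    (10800, 99.0),
    (14400, 97.0),
]

if __name__ == "__main__":
    for timestamp, before, after in find_dips(WEAR_SAMPLES):
        print(f"t={timestamp}s: indicator dropped from {before} to {after}")
```

Comparing against the previous reading, instead of a fixed floor, flags every change in level, which suits a slow-moving indicator like this one.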
The JSON file is posted on GitHub (https://github.com/JoshHilliker/Telemetry-Infra).
Where will the journey take you next? For the next blog I’m going to dive into more Intel® Data Center products and how they fit into the modern infrastructure vision.