When I started this blog I focused on the what, why, Intel Telemetry, and a little on the how. After hearing more feedback while out on the road talking to customers, I thought now would be a great time to dig in a little deeper on the “how”, from the initial infrastructure analysis to scaling out telemetry for your DC. I appreciate hearing your feedback on my prior blogs and I will continue to capture the journey to the Modern Autonomous Data Center.
When we engage with customers, we like to do a quick analysis of the infrastructure to understand the configuration, workload, hardware, software, network, power and thermals. This quick analysis helps identify areas to focus on, whether it’s higher density, dynamic power capping, thermal controls or optimizing the hardware configuration. I have discovered that the analysis phase alone gives great insights. Intel has done a great job of creating a few key tools that we can share. Ask your Intel Sales rep for more info!
So... let’s dive right into the quick analysis conversation. After working with one of our customers and hearing their feedback, we decided internally to pull our telemetry and performance engineers together and collaborate on a single telemetry package for this very analysis. We called it OTP [One Telemetry Package]. This package pulls together all of our key tools and does a rapid collection, ingestion and graphing of the key elements and insights for a customer review. We also looked at ways to make the footprint per node as small as possible, document all dependencies, and even check those dependencies and any contentions before running the tool. This is the same package our performance engineers would use to do an initial performance review of your systems and environment!
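To give a feel for the "check dependencies before running" idea, here is a minimal Python sketch. It is not OTP code: the tool list and the function names `check_dependencies` and `preflight` are hypothetical and only illustrate the pattern of verifying that required executables are on the PATH before a collection starts.

```python
import shutil

# Hypothetical dependency list for illustration; the real OTP
# requirements are not listed in this post.
REQUIRED_TOOLS = ["python3", "tar"]


def check_dependencies(tools, which=shutil.which):
    """Return the names of tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def preflight(tools, which=shutil.which):
    """Return (ok, message) so the caller can decide whether to collect."""
    missing = check_dependencies(tools, which)
    if missing:
        return False, "missing: " + ", ".join(missing)
    return True, "all dependencies found"


if __name__ == "__main__":
    ok, message = preflight(REQUIRED_TOOLS)
    print(message)
```

The `which` parameter is there only so the lookup can be swapped out in a test. A real pre-flight would also need to look for contention, such as another collector already using the same counters, which this sketch does not attempt.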
Now that you have a quick overview of what OTP can do, let’s talk about the standard steps we go through in the analysis phase.
- Step 1. Showcase our MADC Vision and telemetry details
- Step 2. Understand customer pain points and pick a focus area
- Step 3. Collect and analyze details by running OTP for 24 hours on a set of systems
- Step 4. Define a plan/proposal/model
- Step 5. Pilot/POC project
“This is great data, how do I scale this into production?” What a great question, and it comes with its own set of challenges. As we tackle the scale question, we go back to our work on collectd and Intel plugins. We have taken all the required telemetry metrics from the OTP tooling set and translated them into collectd plugins, along with a proper collectd configuration. We are also working on a collectd container that will have all of this in it for easy deployment in container environments. While the collection rate at scale will differ from the rate used in the analysis phase, it will still serve the purpose.
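As a rough illustration of how one set of plugins can be reused with a different collection rate, here is a small Python sketch that renders a minimal collectd configuration. `Interval` and `LoadPlugin` are standard collectd options, and the plugin names are real collectd plugins, but this particular selection and both interval values are my assumptions for the example. They are not the OTP-derived configuration, and `render_collectd_conf` is a hypothetical helper.

```python
def render_collectd_conf(interval_seconds, plugins):
    """Build a minimal collectd.conf body as a string."""
    lines = [f"Interval {interval_seconds}"]
    lines += [f"LoadPlugin {name}" for name in plugins]
    return "\n".join(lines) + "\n"


# Assumed values for illustration only.
ANALYSIS_INTERVAL = 1   # seconds, short-lived deep dive on a few nodes
SCALE_INTERVAL = 10     # seconds, continuous collection across the DC
PLUGINS = ["cpu", "memory", "ipmi", "turbostat"]

if __name__ == "__main__":
    print(render_collectd_conf(SCALE_INTERVAL, PLUGINS))
```

The output is incomplete on purpose: a usable configuration would also need a write plugin to send the metrics somewhere, plus any per-plugin settings.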
I want to end this blog with a shout-out to the Telemetry Task Force at Intel that made OTP and the Scale output possible: Samantha Alt, Harshad Sane, Vaishali Karanth, David Shade, Inaki Madrigal, Naren Nayak, Jack Cannon. As you see us out on the road, please give us your input, your feedback and what you would like to see. Until the next blog... telemetry on!