As the parent of three, the youngest still in middle school, I’ve learned to keep track of how my kids spend their time and money—and who they spend it with. As a long-time Intel employee, I have spent much of my career working on remote and automated health management of client PCs—think Intel® vPro™ Technology, CPP (Compliance, Patching & Provisioning), and so on. Over the last few months I have shifted to a new role: working with servers, specifically in the cloud solutions space, focused on infrastructure modernization and telemetry. What I am finding is consistent from parenting to PCs to servers: knowing what signs to watch for, and then taking appropriate action, is critical.
Knowing what’s going on under the covers in the data center is especially key for Cloud Service Providers (CSPs). The CSP market is fiercely competitive. Margins are often slim and inefficiencies in the data center can mean the difference between making or losing money. Customers want services for the lowest price, but they also expect high availability. To meet these demands, CSPs have to increase data center operational and cost efficiency—and that’s where telemetry can help.
So, What’s the Big Deal About Telemetry?
Over the last two months I have dug in on telemetry, including the latest developments with NodeManager*, Intelligent Platform Management Interface (IPMI), collectors, time series databases, visualization capabilities, orchestrators, containers, and more. It’s been very cool to see the progress that’s been made in the cloud space since my initial days in hybrid cloud a decade ago. However, I’m not just excited about technology for technology’s sake. I’ve asked myself, “what are the real-world use cases for using telemetry—what can CSPs do with it and how can it help their business?” While I’m still learning, I think I’ve heard and seen enough to pose the “Big Three”:
- Power management: Static power capping on a single node is old hat. What if you could do dynamic power capping, looking at the complete rack level? What if you could understand what your nodes are doing and make real-time changes based on fluctuating power needs? What if you could automatically balance competing power caps?
- Thermal management: Think about the possibilities if you could use NodeManager to see what is happening with thermals in a box, as well as its position in the rack—and then interface with the orchestrator to move workloads, or talk to the CRAC unit to bump cooling up a notch.
- Workload optimization: Then, what if you could drill down one more step, to determine which nodes are underutilized and to dynamically move workloads around to take advantage of free CPU cycles?
If you haven’t noticed, there’s a common theme here: Do more! Do it more efficiently!
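To make the rack-level power-capping idea above a little more concrete, here is a minimal sketch of one possible balancing policy: split a rack’s power budget across nodes in proportion to each node’s current demand. The node names, wattages, and the proportional policy itself are all illustrative assumptions, not a description of how NodeManager actually allocates caps.

```python
def balance_power_caps(rack_budget_w, demands_w):
    """Split a rack-level power budget (watts) across nodes in
    proportion to each node's current demand, never granting a
    node more than it is asking for."""
    total_demand = sum(demands_w.values())
    if total_demand <= rack_budget_w:
        # The budget covers everyone: cap each node at its own demand.
        return dict(demands_w)
    # Otherwise scale every node's cap down proportionally.
    scale = rack_budget_w / total_demand
    return {node: demand * scale for node, demand in demands_w.items()}

# Hypothetical rack: 1000 W budget, 1250 W of aggregate demand.
caps = balance_power_caps(
    rack_budget_w=1000.0,
    demands_w={"node1": 400.0, "node2": 500.0, "node3": 350.0},
)
```

In this example every node is scaled to 80% of its demand, so the caps sum exactly to the rack budget. A production policy would also react to telemetry in real time—re-running the balancing loop as demand fluctuates—which is exactly the “dynamic” part of dynamic power capping.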
This isn’t just a pipe dream—the telemetry technology exists TODAY. You can start your journey to lower Power Usage Effectiveness (PUE) and Total Cost of Ownership (TCO) and increase resource utilization by learning how to expose the telemetry already present in your data center—that’s what the rest of this blog will talk about. Then, you can move up the telemetry maturity ladder by interacting with the telemetry data to take action, use analytics to automate monitoring and action, and finally add in machine learning and artificial intelligence to create a truly autonomous, self-monitoring and self-healing data center.
So, how do you get started on this amazing journey?
First Stop: Expose the Telemetry You Already Have
Your data center is probably already equipped with temperature and airflow sensors. If you’re using servers based on Intel® architecture, you also already have sensors and platform telemetry throughout the data center that can help with server performance, reliability, efficiency, and capacity planning. For example, modern Intel® Xeon® processors include sensors that monitor cache, CPU, memory, and I/O utilization, as well as sensors for airflow and outlet temperature.
Going back to my parenting analogy, suppose your tween is supposed to come straight home from school and you have a video camera inside your house. If you never look at the video stream, that camera is useless! Similarly, all these sensors in your data center don’t do much good if they aren’t being exposed to a data center management tool. Phase 1 of your telemetry journey is about filling any gaps in tooling and ensuring that your data center management tool can talk to all the sensors. Scalability is paramount—whatever solution you choose must be able to scale to thousands of nodes.
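As a sketch of what this Phase 1 plumbing amounts to, here is a bare-bones collector loop: poll every node’s sensors on an interval and hand each sample to a sink such as a time-series database writer. The node names, sensor names, and the stand-in reader are hypothetical; a real deployment would read the values over IPMI or from NodeManager, and would parallelize the polling to scale to thousands of nodes.

```python
import time

def collect(nodes, read_sensors, sink, interval_s=10.0, cycles=1):
    """Poll each node's sensors once per interval and pass every
    sample, tagged with node and timestamp, to a sink (for example,
    a time-series database writer)."""
    for cycle in range(cycles):
        now = time.time()
        for node in nodes:
            for sensor, value in read_sensors(node).items():
                sink({"ts": now, "node": node, "sensor": sensor, "value": value})
        if cycle < cycles - 1:
            time.sleep(interval_s)

# Usage with a stand-in reader (a real one would query IPMI/NodeManager):
samples = []
collect(
    nodes=["node1", "node2"],
    read_sensors=lambda node: {"inlet_temp_c": 24.0, "power_w": 210.0},
    sink=samples.append,
)
```

Keeping the reader and the sink pluggable like this is one way to swap in different collectors and databases later without rewriting the loop.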
A good place to start is to take a close look at your PUE. One of the biggest drains on energy, and therefore TCO, is the inability to pinpoint power inefficiencies in the data center. Unfortunately, in many cases, CSPs measure their PUE only once or twice a month, getting data from the data center’s uninterruptible power supply (UPS). Even daily measurements taken at the power distribution unit (PDU) don’t provide the whole picture. Using the Intel® Power Thermal Aware Solution (Intel® PTAS) lets you measure power at each individual server, using sensors already present in the CPU—a feature available in all modern Intel® Xeon® processors.
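Once you have per-server power readings, PUE itself is simple arithmetic: total facility power divided by the power the IT equipment actually consumes. The numbers below are made up for illustration; the point is that summing real per-server readings gives a far more trustworthy denominator than a monthly snapshot at the UPS.

```python
def pue(total_facility_kw, per_server_kw):
    """Power Usage Effectiveness: total facility power divided by
    IT equipment power. A perfect data center would score 1.0;
    every watt spent on cooling and distribution pushes it higher."""
    return total_facility_kw / sum(per_server_kw)

# Hypothetical example: 150 kW at the utility meter,
# 100 kW summed across per-server telemetry -> PUE of 1.5.
ratio = pue(150.0, [25.0, 25.0, 25.0, 25.0])
```

Tracking this ratio continuously, rather than monthly, is what lets you spot (and fix) inefficiencies while they are happening.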
While power is one important metric for PUE, temperature also plays a role. Temperature sensors placed on server racks do not provide insight into what’s really going on inside each individual server. But accessing the data provided by inlet, outlet, and airflow sensors on Intel® Xeon® processors can provide superior thermal data. With a good visualization tool, you can set thresholds for power and thermal data to enable better overall data center PUE management.
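The threshold idea can be sketched in a few lines: scan the collected samples and flag any reading above its limit. The threshold values, node names, and sensor names here are hypothetical stand-ins; real limits depend on your hardware and SLAs, and a real deployment would wire the flagged breaches into dashboards or alerting.

```python
# Hypothetical thresholds; real values depend on hardware and SLAs.
THRESHOLDS = {"outlet_temp_c": 45.0, "power_w": 350.0}

def breaches(samples, thresholds=THRESHOLDS):
    """Return (node, sensor, value) for every sample above its threshold."""
    return [
        (s["node"], s["sensor"], s["value"])
        for s in samples
        if s["sensor"] in thresholds and s["value"] > thresholds[s["sensor"]]
    ]

alerts = breaches([
    {"node": "node1", "sensor": "outlet_temp_c", "value": 47.2},
    {"node": "node2", "sensor": "outlet_temp_c", "value": 41.0},
    {"node": "node2", "sensor": "power_w", "value": 360.0},
])
```

Here only the hot outlet on node1 and the over-budget power draw on node2 are flagged—exactly the kind of per-server signal a rack-level sensor would miss.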
By exposing telemetry continuously, CSPs can collect a wealth of data on power, thermals and utilization—improving efficiency across the entire data center footprint. And, unlike your kids, who may resent your monitoring their activities, your servers won’t mind a bit. Once you start putting your telemetry data to work, your customers will enjoy enhanced uptime, your bosses will appreciate lower TCO, and your IT folks will be able to focus on proactive data center innovations instead of running around reacting to whatever fire they have to put out at that moment. Heck, even the environment will thank you as you optimize your PUE.
I’m super excited about the data center efficiency and automation enabled by telemetry and Intel technology. My future blogs will explore in more depth how to put modern infrastructure and telemetry to work to help drive lower PUE and TCO and increase resource utilization.