Here’s a dilemma many data center operators face: we’d like to quantify all aspects of data center efficiency, but many measurements are too difficult or too costly to make. So, we either succumb to inaccessible data or we compromise, accept what we can measure, and move ahead.
PUE is a great example of what compromising can accomplish. PUE, despite known imperfections, has become the de facto industry standard for data center infrastructure efficiency. It focuses on a high-impact area of data center cost efficiency in a simple but clear way and provides a basis for defining improvements. It’s not perfect, but it’s relevant. It’s not overly complex; instead, it relies on easily available data. And PUE has done more to improve data center efficiency than anything short of Moore’s Law.
So, where does this leave us? Well, the answer is, with an elephant in the data center.
The most wasteful energy consumers in data centers (especially in low-PUE data centers) are inefficient servers. Here is some data collected recently from an actual data center by an architect in our HDC team, William (Bill) Carter:
The data is from Bill's walkthrough of just one data center in a typical large company. The details may or may not be relevant to your specific data center, but the general lesson might be. In this particular data center, servers older than 2007 consume 60% of the energy but contribute only an estimated 4% of the compute capability. Energy consumption by IT equipment is dominated by inefficient servers. That's a pretty big elephant!
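To put the 60% and 4% figures side by side, here is a rough bit of arithmetic. It assumes both shares are fractions of the same server population, so the newer servers account for the remaining 40% of energy and 96% of compute; the variable names are just for illustration.

```python
# Shares quoted above for servers older than 2007.
old_energy_share = 0.60
old_compute_share = 0.04

# Assumption: the newer servers make up the remainder of both totals.
new_energy_share = 1.0 - old_energy_share
new_compute_share = 1.0 - old_compute_share

# Energy spent per unit of compute, in relative units.
old_energy_per_compute = old_energy_share / old_compute_share
new_energy_per_compute = new_energy_share / new_compute_share

ratio = old_energy_per_compute / new_energy_per_compute
print(f"Old servers use roughly {ratio:.0f}x the energy per unit of compute")
```

Under that assumption, the older servers burn about 36 times as much energy per unit of compute as the newer ones.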
This observation got me thinking about what ways one could build a framework to assess this problem.
There are lots of ideas already. The Green Grid is working toward a Productivity Proxy as a non-proprietary computational model for understanding the workload portion of the energy efficiency equation. But this is still a few years away at best. Emerson Electric proposed the CUPS model in 2008 to reflect the workload-dependent aspects of energy efficiency. Its simplicity is appealing, and it’s one of the proxies The Green Grid is exploring for broader use.
But that elephant, inefficient servers, is still lurking in your data center. We need a way to take action today.
So let me stir up the pot and propose an idea for a kind of Server Usage Effectiveness, or SUE, metric. In the spirit of PUE, the first order of business is to keep it relevant and simple. Let's not try to boil the ocean with overanalysis. Let's get started today!
So what’s the one thing we know about computers? That their efficiency has closely followed a “Moore’s Law” type progression of doubling in efficiency every two years. That means, all else held constant, a server that is two years old is about half as efficient as a server you’d purchase today.
We can take this idea and simplify a server efficiency assessment as follows:
To simplify the math, we take the age in whole years: a server less than one year old has an age of zero, one older than one year and less than two has an age of one, and so on. The term $2^{-\mathrm{Age}/2} \approx 0.707^{\mathrm{Age}}$ gives an approximate age-based performance gap compared to the most current generation of server. The idea is similar to Emerson’s CUPS idea, except that instead of setting 2002 = 1.0, SUE sets Today = 1.0. So, yes, you need to update SUE annually.
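As a minimal sketch of the age term described above, the helper below truncates a server's age to whole years and returns the corresponding factor. The function names are made up for this illustration, and treating a server that is exactly one year old as age one is an assumption.

```python
import math

def whole_year_age(age_in_years: float) -> int:
    """Truncate age to whole years (assumption: exactly 1.0 years counts as age 1)."""
    return math.floor(age_in_years)

def age_factor(age_in_years: float) -> float:
    """Relative performance versus a current server: 2 ** (-Age / 2)."""
    return 2.0 ** (-whole_year_age(age_in_years) / 2.0)

for age in (0.5, 1.5, 2.5, 4.0):
    print(f"age {age} years -> factor {age_factor(age):.3f}")
```

A server in its second year comes out at about 0.707 of a current one, a server in its third year at 0.5, and so on.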
Here’s how SUE might work: let’s say I’ve bought 500 servers per year for the previous four years. My baseline server population would be as listed in the table below. In the current year, I refresh my oldest servers with newer ones.
Following the above rules, you end up with the result
What does SUE mean in words? An SUE = 2.2 simply implies that you have 2.2 times the number of servers you actually need (based on current data center productivity and workload, etc.). Pretty straightforward.
In the above data from Bill, for instance, I could very easily (and approximately) assess an SUE. Before the 2010 server refresh, it might have been as high as 3.5 or more. The current SUE is around 1.5, a good number but still with plenty of room for improvement (the company is still paying the operational costs of about fifty percent more servers than it needs). Of course, this is an approximate analysis to illustrate the point. Greater granularity of the data is needed to be more specific.
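Here is a rough sketch of how the 500-servers-per-year example could be computed. It rests on two assumptions that are not spelled out in the text: that SUE is the total server count divided by the age-discounted server count (each server weighted by 2^(-Age/2)), and that the baseline cohorts have whole-year ages of 1 through 4. The function name `sue` and the dictionary layout are just for illustration.

```python
def sue(servers_by_age):
    """Assumed form of SUE: total servers / sum(count * 2 ** (-age / 2))."""
    total = sum(servers_by_age.values())
    effective = sum(n * 2.0 ** (-age / 2.0) for age, n in servers_by_age.items())
    return total / effective

# Assumed baseline: 500 servers bought in each of the previous four years.
baseline = {1: 500, 2: 500, 3: 500, 4: 500}

# Assumed refresh: the oldest cohort is replaced one-for-one with new servers.
refreshed = {0: 500, 1: 500, 2: 500, 3: 500}

print(f"baseline SUE:  {sue(baseline):.2f}")
print(f"refreshed SUE: {sue(refreshed):.2f}")
```

Under these assumptions the baseline works out to about 2.2, in line with the figure quoted above, and the one-for-one refresh to about 1.56. A refresh modeled differently (for example, fewer new servers replacing the old ones) would give a different number.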
So what are some pluses and minuses of this approach? Well, before we launch into this, let’s be honest and admit nothing is perfect. Observations like, “you can game the system,” “it doesn’t account for x or y or z,” and “you need to think” are acknowledged (since they’re basically true for everything).
Here’s what I like about this approach:
- It accounts for system performance. This is the biggest factor driving energy efficiency in our industry. It’s the elephant.
- It’s quick. I can make an assessment of SUE in a couple of hours for almost any data center, without impacting operations.
- I can talk about the results in plain English. This is a big benefit when talking with management.
- It’s a “contemporary” metric, meaning it captures the time evolution of server performance relative to today’s productivity.
- It follows the scheme of PUE where 1.0 is ideal.
Here’s what you might not like:
- You need to think before you use it. It is not the answer to every question about data center performance.
- It’s approximate. That’s the compromise needed to avoid analysis paralysis. Tools such as the productivity proxy are on the roadmap and we can use them when they’re available. But this lets us get started today.
- It doesn’t account for the differences in system architecture or absolute performance. This can be accounted for, of course, but at a cost of increased complexity.
- It doesn’t account for data center operational efficiency or actual workloads. That route again adds complexity. One day all servers will be fully instrumented and that information will be readily available.
- It's a relative metric and doesn't allow comparison of different data centers. That's true, and it is true of PUE as well. I actually think this is a hidden strength.
I was chatting with Michael K. Patterson, a data center architect here at Intel, on this latter point and he flashed on an idea that brings us even closer to what he calls the “holy grail” of a total-data center efficiency metric. He proposed defining a “TUE” as
Let’s take TUE for a test drive. Say my data center has a PUE of 1.8. Its TUE in the 500 server/year example before refresh would be 4.0. With a server refresh, TUE would improve to 2.7, reflecting not a gain in infrastructure efficiency but a substantial efficiency gain at the data center level from the IT hardware itself.
This approach may help us understand data center level improvements and make PUE an even more valuable metric. For instance, one of the common concerns with PUE is that replacing older inefficient servers with fewer efficient ones can degrade PUE, even if the data center’s energy use and work output have improved. This approach partially addresses that concern by focusing purely on the servers' influence on the data center.
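The numbers in this test drive are consistent with TUE being the product of PUE and SUE (1.8 × 2.2 ≈ 4.0). The sketch below takes that product form as an assumption and backs out the SUE implied by the post-refresh TUE of 2.7; the function name `tue` is hypothetical.

```python
def tue(pue: float, sue: float) -> float:
    """Assumed form of TUE: infrastructure overhead times server overhead."""
    return pue * sue

pue = 1.8

# Before refresh, using the SUE of 2.2 from the 500 server/year example.
tue_before = tue(pue, 2.2)

# After refresh, the quoted TUE of 2.7 implies this SUE under the same assumption.
implied_sue_after = 2.7 / pue

print(f"TUE before refresh: {tue_before:.2f}")
print(f"SUE implied by a TUE of 2.7: {implied_sue_after:.2f}")
```

Because PUE is held at 1.8 in both cases, the whole improvement in TUE comes from the SUE term.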
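To see how a server refresh can make PUE look worse while total energy use falls, consider the toy model below. All of the numbers are hypothetical and chosen only for illustration: part of the facility overhead is assumed to be fixed and part is assumed to scale with IT load.

```python
def facility_power(it_kw: float, fixed_overhead_kw: float, overhead_per_it_kw: float) -> float:
    """Total facility power: IT load plus fixed and load-proportional overhead."""
    return it_kw + fixed_overhead_kw + overhead_per_it_kw * it_kw

# Hypothetical facility: 400 kW fixed overhead plus 0.4 kW of overhead per IT kW.
FIXED_KW = 400.0
PER_IT_KW = 0.4

it_before = 1000.0  # hypothetical IT load with the old servers
it_after = 600.0    # hypothetical IT load after consolidating onto fewer servers

total_before = facility_power(it_before, FIXED_KW, PER_IT_KW)
total_after = facility_power(it_after, FIXED_KW, PER_IT_KW)

pue_before = total_before / it_before
pue_after = total_after / it_after

print(f"before: {total_before:.0f} kW total, PUE {pue_before:.2f}")
print(f"after:  {total_after:.0f} kW total, PUE {pue_after:.2f}")
```

In this made-up case the facility draws less power after the refresh, yet PUE rises because the fixed overhead is spread over a smaller IT load. A server-side term is one way to credit the improvement that PUE alone misses.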
This is not a complete picture of the data center, however, and a hidden assumption is that servers make up the bulk of the IT energy use. These and other factors could be added to the model, but at the expense of complexity and the risk of analysis paralysis. Meanwhile, this approach does let us start measuring the elephant today. Mike, Bill, and I intend to explore these ideas in coming weeks and, if they pan out, perhaps test the waters with the Green Grid.
To summarize, inefficient servers are potentially the biggest efficiency drain in your data center. To date, it has proven difficult to systematically attack the overall efficiency of the servers in your data center in the way PUE has so successfully put a microscope on infrastructure. The Server Effectiveness idea above is a “zeroth order” approximation that overcomes the complexity and cost of detailed server data and lets you begin to size up the elephant in your data center: inefficient servers.
So tell me, what do you think? Is this a way to start the conversation about the elephant? Could this be useful in your data center today?
Do you have suggestions to improve this kind of assessment (while still keeping it easily measurable)? Mike, Bill, and I are definitely interested!