Turning the Tide: Energy Efficient Servers and the Efficiency Conundrum

A while ago I wrote about the CIO’s dilemma with PUE. Just as you’d never run a business looking only at gross margin, you can’t run an energy efficient datacenter looking only at PUE; it is critical to pay attention to bottom-line indicators such as datacenter productivity and total power consumption. In this blog I want to take the discussion sideways, to what I will call the Efficiency Conundrum.

A current hot question is, “what additional datacenter efficiency metrics should be developed?” Why new metrics? As it was explained to me at a recent Green Grid meeting, legacy data centers typically ran in the PUE = 2.0 range, which left a lot of room for improvement. Newer data centers, on the other hand, are in the PUE = 1.2 range, with some claiming as low as 1.10. So we are fast approaching the point of diminishing returns for improving PUE (at least for new data centers).

Does that mean we are at optimal efficiency? Of course not. But the question is where we can look for further efficiency gains, and what metrics should be used to measure them.

An area that has been proposed is extending the idea of PUE inside the server itself. For instance, the fans and power conversion inside the server box perform much the same function as the fans and UPSes in the data center. How might we look at efficiency here?

James Hamilton proposed a tPUE that addresses this concern. The idea is to “charge all the fans and power conversion equipment [inside and outside the server] to the infrastructure.” There is also ongoing discussion about a “server PUE,” which might scale things differently but makes essentially the same assumption: the fans and power conversion inside the server should be counted against the overall efficiency of the data center.
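To see how much that charge matters, here is a rough back-of-the-envelope sketch in Python. The facility power and the internal overhead fraction are assumptions chosen for illustration, not measurements of any real facility:

```python
# Rough illustration of how tPUE differs from PUE, using assumed numbers.
# All values below are made up for the example, not measurements.

facility_power_kw = 1200.0   # total power entering the data center (assumed)
it_power_kw = 1000.0         # power delivered to the server racks (assumed)

pue = facility_power_kw / it_power_kw
print(f"PUE  = {pue:.2f}")   # 1.20 -- looks excellent

# tPUE also charges the fans and power conversion *inside* the servers
# to the infrastructure. Assume (for illustration) that 15% of server
# power goes to internal fans and VR/PSU conversion losses.
internal_overhead_fraction = 0.15
useful_it_power_kw = it_power_kw * (1 - internal_overhead_fraction)

tpue = facility_power_kw / useful_it_power_kw
print(f"tPUE = {tpue:.2f}")  # ~1.41 -- the same facility looks less efficient
```

Under those assumed numbers, a facility reporting a PUE of 1.20 would show a tPUE around 1.41, which is the point of the metric: the inefficiency inside the box becomes visible.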

These are all great ideas, but before jumping in, it is always good to think about the unintended consequences we might encounter. Here is where we run into at least two conundrums.

A short while ago I was talking with two of our power-thermal architects at Intel, Sandeep Ahuja and Robin Steinbrecher, about the consequences of some interesting server optimization work they are investigating.

At a high level, there are two competing effects inside the server. Airflow increases linearly with fan speed, but fan power rises much faster: cubic in speed according to the fan laws, though closer to quadratic in practice for DC brushless fans. On top of this, heat transfer does not increase linearly with airflow, which makes each small further decrease in temperature very costly in fan power. Pulling in the other direction, sub-threshold leakage current (and hence leakage power) increases as processor temperature increases.

The trade-offs look like this:

[Figure: Optimizing wall power – the trade-off between fan power and CPU leakage power]
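As a rough numerical sketch of that bathtub curve, here is a toy model in Python. Every constant in it is an assumption picked to make the shape visible; nothing here characterizes a real server:

```python
# Toy model of the fan-speed "bathtub curve" in the figure above.
# Every constant is an assumption for illustration only.

def total_power_w(fan_speed):
    """Wall power (W) as a function of normalized fan speed (0.2 to 1.0)."""
    # Fan power rises steeply with speed: cubic by the fan laws,
    # closer to quadratic in practice for DC brushless fans.
    fan_power = 30.0 * fan_speed ** 2.5

    # Heat transfer improves sub-linearly with airflow, so CPU
    # temperature falls with diminishing returns as the fan speeds up.
    cpu_temp_c = 45.0 + 30.0 / fan_speed ** 0.8

    # Sub-threshold leakage grows roughly exponentially with
    # temperature (assumed here to double every 20 degrees C).
    leakage_power = 10.0 * 2.0 ** ((cpu_temp_c - 65.0) / 20.0)

    dynamic_power = 80.0  # workload-dependent; held constant here
    return fan_power + leakage_power + dynamic_power

speeds = [s / 100 for s in range(20, 101, 5)]
best = min(speeds, key=total_power_w)
print(f"Wall power is minimized near fan speed {best:.2f}: "
      f"{total_power_w(best):.1f} W")
```

Shrink the leakage coefficient in this model and the bottom of the bathtub disappears: the slowest fan always wins. That is exactly the design dependence described next.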

This is the first conundrum: not all fan power is bad when viewed at a system level. The actual optimization point is highly dependent on the server design and CPU characteristics. In some cases there is no bottom to the bathtub, and driving fan speed lower always wins. In other cases, the reduction in CPU static power more than makes up for the added fan power; unless your metric accounts for this behavior, it may steer you toward higher total power consumption.

Another problematic area is the power consumption of the silicon itself, which for the Intel® Xeon® 5600 series varies depending on the particular version of the processor.

For example, the Xeon® L5640 processor, with a Thermal Design Power envelope of 60W, offers six cores, 12MB of cache, and 1333MHz DDR3 support. Depending on workload conditions, this processor may be the optimal performance choice for a customer, yet it would worsen the server efficiency score, because lowering CPU power shrinks the CPU’s share of total server power relative to the other components.
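A quick arithmetic sketch shows the effect; the wattages here are illustrative round numbers, not measurements:

```python
# Illustrative (made-up) wattages showing how a lower-power CPU can
# worsen a "server PUE"-style ratio even as total consumption improves.

other_components_w = 100.0  # fans, memory, drives, conversion losses, ...

for label, cpu_w in [("standard 95W part", 95.0),
                     ("low-power 60W part", 60.0)]:
    total_w = cpu_w + other_components_w
    ratio = total_w / cpu_w  # total server power per watt of CPU power
    print(f"{label}: total {total_w:.0f} W, ratio {ratio:.2f}")

# standard 95W part:  total 195 W, ratio 2.05
# low-power 60W part: total 160 W, ratio 2.67  <- metric worsens while
#                                                 actual consumption improves
```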

A similar example is the recently introduced low-power DDR3L memory. As a result of its lower operating voltage, it lowers memory power substantially while preserving high performance.
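DDR3L operates at 1.35V versus 1.5V for standard DDR3, and since DRAM power scales roughly with the square of the supply voltage, that drop alone works out to about (1.35/1.5)² ≈ 0.81, a savings on the order of 20% before any other improvements.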

Both of these examples point to the second conundrum: choosing energy efficient silicon components can improve bottom-line indicators like energy consumption and performance, but may negatively impact other efficiency indicators that look more holistically at the system and data center. The efficiencies of the silicon components will show up in server energy efficiency metrics that gauge system power consumption against performance, but may not influence metrics that merely weigh trade-offs in energy consumption between components without comprehending how the two interact.

So, to reiterate: extending efficiency metrics beyond data center PUE may well be needed. I’ve pointed out two potential “unintended consequences” of too tight a focus on efficiency metrics alone. The CIO’s conundrum is that doing the right thing (choosing energy efficient components) may look worse if efficiency metrics are all you report to the CEO and the board. Your call to action is to not fall into that trap: if you are going to tout “efficiency,” make sure you can also track the bottom line.