Management practices from the HPC world can get even bigger results in smaller-scale operations.
In 2014, industry watchers saw a major rise in hyperscale computing. Hadoop and other cluster architectures that originated in academic and research circles have become almost commonplace in the industry. Big data and business analytics are driving huge demand for computing power, and 2015 should be another big year in the datacenter world.
What would you do if you had the same operating budget as one of the hyperscale data centers? It might sound like winning the lottery, or entering a world without limitations, but any datacenter manager knows that infrastructure scaling requires tackling even bigger technology challenges -- which is why it makes sense to watch and learn from the pioneers who are pushing the limits.
Lesson 1: Don't lose sight of the "little" data
When the datacenter scales up, most IT teams look for a management console that can provide an intuitive, holistic view that simplifies common administrative tasks. Teams that manage the largest datacenters have also learned to look for a console that taps into the fine-grained data made available by today's datacenter platforms. This includes real-time power usage and temperature for every server, rack, row, or room full of computing equipment.
Management consoles that integrate energy management middleware can aggregate these datacenter data points into at-a-glance thermal and power maps, and log all of the data for trend analysis and capacity planning. The data can be leveraged for a variety of cost-cutting practices. For example, datacenter teams can more efficiently provision racks based on actual power consumption. Without an understanding of real-time patterns, datacenter teams must rely on power supply ratings and static lab measurements.
A sample use case illustrates the significant differences between real-time monitoring and static calculations. When provisioning a rack with 4,000 watts capacity, traditional calculations resulted in one datacenter team installing approximately 10 servers per rack. (In this example, the server power supplies are rated at 650 watts, and lab testing has shown that 400 watts is a safe bet for expected configurations.)
The same team carried out real-time monitoring of power consumption, and found that servers rarely exceeded 250 watts. This knowledge led them to increase rack provisioning to 16 servers -- a 60% increase in capacity. To prevent damage in the event that servers in any particular rack create demand that would push the total power above the rack threshold, the datacenter team simultaneously introduced protective power capping for each rack, which is explained in more detail in Lesson 5 below.
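The arithmetic behind this example is easy to check. The short Python sketch below reworks it using only the figures quoted above; the function and variable names are illustrative and not taken from any particular management tool.

```python
def servers_per_rack(rack_capacity_watts: int, watts_per_server: int) -> int:
    """Whole servers that fit under the rack's power capacity."""
    return rack_capacity_watts // watts_per_server


RACK_CAPACITY_W = 4000

nameplate_count = servers_per_rack(RACK_CAPACITY_W, 650)  # power supply rating
static_count = servers_per_rack(RACK_CAPACITY_W, 400)     # lab-tested estimate
measured_count = servers_per_rack(RACK_CAPACITY_W, 250)   # observed real-time peak

# Percentage gain from provisioning on measured rather than lab-tested draw.
increase_pct = (measured_count - static_count) * 100 / static_count

print(nameplate_count, static_count, measured_count, increase_pct)
```

Provisioning on the power supply rating alone would be more conservative still, which is why measured consumption matters so much to rack density.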
Lesson 2: Get rid of your ghosts
Once a datacenter team is equipped to monitor real-time power consumption, evaluating workload distribution across the datacenter becomes a simple exercise. Servers and racks that are routinely under-utilized are easy to spot. Over time, datacenter managers can determine which servers can be consolidated or eliminated. Ghost servers, the systems that are powered up but idle, can be put into power-conserving sleep modes. These and other energy-conserving steps avoid energy waste and therefore trim the utility budget. Real-world cases have shown that the average datacenter, regardless of size, can trim 15 to 20 percent from that budget by tackling ghost servers.
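One way to picture ghost-server detection is as a simple filter over logged power readings. The Python sketch below is a minimal illustration, not a description of any vendor's method: the server names, the sample readings, and the 10 percent margin above idle draw are all hypothetical assumptions.

```python
def find_ghost_servers(power_log, idle_watts, tolerance=0.10):
    """Return names of servers whose draw never rose meaningfully above idle.

    power_log:  dict of server name -> list of power samples in watts
    idle_watts: dict of server name -> known idle draw in watts
    tolerance:  assumed margin above idle that still counts as "idle"
    """
    ghosts = []
    for name, samples in power_log.items():
        if not samples:
            continue  # no readings, so no conclusion can be drawn
        ceiling = idle_watts[name] * (1 + tolerance)
        if max(samples) <= ceiling:
            ghosts.append(name)
    return sorted(ghosts)


# Hypothetical readings collected over an observation window.
power_log = {
    "web-01": [180, 240, 210],
    "db-02": [95, 98, 97],
    "batch-03": [100, 102, 230],
}
idle_watts = {"web-01": 100, "db-02": 95, "batch-03": 100}

print(find_ghost_servers(power_log, idle_watts))
```

In practice the observation window should be long enough to cover periodic jobs, so that a server that only works at month-end is not mistaken for a ghost.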
Lesson 3: Choose software over hardware
Hyperscale operations often span multiple geographically distributed datacenters, making remote management vital for day-to-day continuity of services. The current global economy has put many businesses and organizations into the same situation, with IT trying to efficiently manage multiple sites without duplicating staff or wasting time traveling between locations.
Remote keyboard, video, and mouse (KVM) technology has evolved over the past decades to help IT teams keep up, but hardware KVM solutions have become increasingly complex as a result. To avoid managing the management overlay itself, the operators of many of the world's largest and most complex infrastructures are adopting software KVM solutions and, more recently, virtualized KVM solutions.
Even for the average datacenter, the cost savings add up quickly. IT teams should total the costs of any existing KVM switches, dongles, and related licenses (switch software, in-band and out-of-band licenses, etc.). A typical hardware KVM switching solution can cost more than $500K for the switch, $125K for switch software, and another $500K for in-band and out-of-band node licenses. Even the dongles can add up to more than $250K. By comparison, software KVM solutions can avoid more than $1M in hardware KVM costs.
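Totaling the line items quoted above shows where the $1M figure comes from. The Python sketch below uses only those numbers; because two of them are stated as "more than," the result should be read as a lower bound.

```python
# Hardware KVM line items as quoted in the text, in US dollars.
hardware_kvm_costs = {
    "switch": 500_000,
    "switch software": 125_000,
    "in-band and out-of-band node licenses": 500_000,
    "dongles": 250_000,
}

total = sum(hardware_kvm_costs.values())
print(f"${total:,}")
```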
Lesson 4: Turn up the heat
With many years of experience monitoring and managing energy and thermal patterns, some of the largest datacenters in the world have pioneered high ambient temperature operation. Published numbers show that raising the ambient temperature in the datacenter by 1°C results in a 2% decrease in the site power bill.
When raising the ambient temperature of a datacenter, it is important to check regularly for hot spots and to monitor datacenter devices in real time for temperature-related issues. With effective monitoring, the operating temperature can be adjusted gradually and the savings evaluated against the budget and capacity plans.
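For planning purposes, the 2% rule of thumb can be turned into a rough estimate. The Python sketch below assumes the saving is additive for each degree, which is a simplification that only makes sense for small increases; the published figure covers a single degree, and the $100,000 bill is a hypothetical input.

```python
SAVINGS_PCT_PER_DEGREE_C = 2  # published rule of thumb quoted in the text


def estimated_power_bill(current_bill, degrees_raised_c):
    """Estimate the site power bill after raising the ambient temperature.

    Assumption: each degree saves 2% of the original bill (additive).
    Real savings depend on the cooling plant and should be measured.
    """
    if degrees_raised_c < 0:
        raise ValueError("degrees_raised_c must not be negative")
    savings = current_bill * SAVINGS_PCT_PER_DEGREE_C * degrees_raised_c / 100
    return current_bill - savings


# Hypothetical example: a $100,000 bill and a gradual 3 degree increase.
print(estimated_power_bill(100_000, 3))
```

Comparing an estimate like this with the metered bill after each adjustment is one way to evaluate the savings against the budget.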
Lesson 5: Don't fry your racks
Since IT is expected -- mandated -- to identify and avoid failures that would otherwise disrupt critical business operations, any proactive management techniques that have been proven in hyperscale datacenters should be evaluated for potential application in smaller datacenters. High operating temperatures can have a devastating effect on hardware, and it is important to keep a close eye on how this can impact equipment uptime and life cycles.
Many cluster architectures, such as Hadoop, build in redundancy and dynamic load balancing to recover seamlessly from failures. The same foundational monitoring, alerts, and automated controls that help minimize hyperscale energy requirements can help smaller sites identify and eliminate hot spots that have a long-term impact on equipment health. The holistic approach to power and temperature also helps maintain a more consistent environment in the datacenter, which ultimately avoids equipment-damaging temperatures and power spikes.
Besides environment control, IT teams can also take advantage of leading-edge energy management solutions that offer power-capping capabilities. By setting power thresholds, racks can be liberally provisioned without the risk of power spikes. In some regions, power capping is crucial for protecting datacenters from noisy, unreliable power sources.
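The idea behind a rack-level power cap can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than how any particular energy management product works: it assumes that, when demand exceeds the rack threshold, every server is scaled back by the same proportion, and the server names and the 300-watt spike are hypothetical.

```python
def apply_rack_cap(demand_watts, rack_cap_watts):
    """Return per-server power limits that keep the rack at or under its cap.

    demand_watts:   dict of server name -> requested draw in whole watts
    rack_cap_watts: rack power threshold in whole watts

    Assumption: all servers are throttled by the same proportion. Real
    capping policies may instead prioritize critical workloads.
    """
    total = sum(demand_watts.values())
    if total <= rack_cap_watts:
        return dict(demand_watts)  # under the threshold, nothing to cap
    # Integer division rounds each limit down, so the sum cannot exceed the cap.
    return {
        name: watts * rack_cap_watts // total
        for name, watts in demand_watts.items()
    }


# Hypothetical spike: 16 servers each asking for 300 W in a 4,000 W rack.
demand = {f"server-{i:02d}": 300 for i in range(1, 17)}
limits = apply_rack_cap(demand, 4000)

print(limits["server-01"], sum(limits.values()))
```

With a cap in place, a rack provisioned on measured consumption stays within its threshold even when every server peaks at once.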
Following the leaders
Thankfully, most datacenters operate at a scale that carries much lower risk than the largest datacenters and hyperscale computing environments. However, datacenters of any size should make it a priority to reduce energy costs and avoid service disruptions. By adopting proven approaches and taking advantage of all the real-time data throughout the datacenter, IT and facilities can follow the lead of hyperscale sites and get big results with relatively small initial investments.