Monday, May 16, 2011

New ASHRAE Temp/Humidity Guidelines

FYI, ASHRAE (the American Society of Heating, Refrigerating and Air-Conditioning Engineers) has released new guidelines for temperature and humidity conditions for IT equipment. Among the changes (see p.8) are new classes A3 and A4, which allow temperatures of 5C-40C and 5C-45C in data centers (respectively), along with relative humidity (RH) ranges of 8-85% and 8-90%.

There are even guidelines for lower humidity levels if certain procedures are followed.

--kb

P.S. Thanks to Pasi Vaananen for the heads-up.

Sunday, March 7, 2010

Energy Star for Servers Should Require Right-sized Power Supplies

The EPA is currently developing Tier 2 (second-phase) requirements for servers with the Energy Star ratings. Although the EPA is probably right to back away from the non-standard definition they were developing to capture "net power loss" for servers, they should still look at adjusting server power supply efficiencies when power supplies are oversized.

Some vendors may ship servers that only draw 200W at 100% utilization with a power supply that can provide 1200W. Traditional ways of evaluating power supplies measure those power supplies across their full rated capacity. If a power supply is sized appropriately, this makes sense. However, a power supply rated far above any load the system will see in real life should be de-rated.

For example, a 2000W power supply might have good efficiency at 50% and 100% of load, but power efficiency tends to drop off at lower load levels, particularly those below 25% of maximum load. Take that same 2000W power supply and put it in a server drawing a maximum of 200W, and the power supply would never operate above 10% load. The normal power supply rating levels are of little value if the realistic power draw is much lower than the rated power draw.

To be fair, the tested configurations of servers don't always represent the highest possible loading: adding extra memory, additional hard drives, and extra PCI Express cards can increase a server's power draw. But having no upper limit leaves too much wiggle room and jeopardizes the integrity of the Energy Star rating method.

One possible solution to work around this is as follows:
  1. Measure the server power consumption under an acceptable benchmark such as SPECpower_ssj2008. Record 2x the maximum power draw (i.e., at 100% load in the benchmark).
  2. Look at the rated output power for the power supply or power supplies needed to operate the server in that configuration [ignore redundant power supplies used for reliability purposes]. Record the sum of the power of all the non-redundant power supply output power ratings.
  3. If the answer in Step 2 is less than or equal to the value from Step 1, no adjustment is needed. Skip Steps 4 and 5.
  4. If the answer in Step 2 is more than the value in Step 1, plot the efficiency rating of the non-redundant power supplies. Extrapolate the efficiency of the power supply (power supplies) at the value recorded in Step 1. Extrapolate the efficiency at 50% and 25% of the value shown in Step 1. Do the same for any other power supply levels normally required, but rate them as a ratio of the value shown in Step 1.
  5. Evaluate the efficiency of the system based on the load levels and efficiency determined in Step 4 above.
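The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are mine, the efficiency curve is a made-up set of points for a hypothetical 2000W supply, and I use linear interpolation between measured points where the steps say "extrapolate."

```python
from bisect import bisect_left


def efficiency_at(curve, load_w):
    """Linearly interpolate efficiency at load_w from (watts, efficiency) points.

    Loads outside the measured range are clamped to the nearest measured
    point, which is a simplification.
    """
    points = sorted(curve)
    loads = [w for w, _ in points]
    if load_w <= loads[0]:
        return points[0][1]
    if load_w >= loads[-1]:
        return points[-1][1]
    i = bisect_left(loads, load_w)
    (w0, e0), (w1, e1) = points[i - 1], points[i]
    return e0 + (e1 - e0) * (load_w - w0) / (w1 - w0)


def derated_efficiencies(max_draw_w, rated_w, curve, fractions=(1.0, 0.5, 0.25)):
    """Return {load fraction: efficiency} following Steps 1-4.

    max_draw_w: measured draw at 100% benchmark load (Step 1 doubles it).
    rated_w: summed output rating of the non-redundant supplies (Step 2).
    """
    reference_w = 2 * max_draw_w      # Step 1
    if rated_w <= reference_w:        # Step 3: supply is not oversized
        reference_w = rated_w
    # Step 4: evaluate the curve at fractions of the reference power
    return {f: efficiency_at(curve, f * reference_w) for f in fractions}


# Hypothetical efficiency curve for a 2000W supply (illustrative numbers only).
CURVE = [(100, 0.70), (200, 0.78), (400, 0.85),
         (500, 0.87), (1000, 0.92), (2000, 0.90)]

oversized = derated_efficiencies(max_draw_w=200, rated_w=2000, curve=CURVE)
right_sized = derated_efficiencies(max_draw_w=1000, rated_w=2000, curve=CURVE)
```

With these invented numbers, the 200W server gets rated at 400W, 200W, and 100W (roughly 85%, 78%, and 70% efficiency) instead of at 2000W, 1000W, and 500W (90%, 92%, and 87%). The values from Step 4 would then feed the system evaluation in Step 5.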

This adjustment would correct ratings for power supplies oversized for the systems they're being tested with. This will incent server vendors to right-size power supplies to better match the real power range of the systems they're being rated for.

--kb

Saturday, February 20, 2010

Facebook's HipHop Software Efficiency

Haiping Zhao has a great blog entry showcasing how Facebook has been able to improve the efficiency of their applications through the development of HipHop for PHP. Simply stated, HipHop transforms PHP code into optimized C++ code. According to Zhao, this technology reduced the average CPU utilization on their servers by 50%.

This showcases two things in particular:
  • Software can have a major impact on system efficiency. Even relatively good solutions like PHP can still be improved.
  • Metrics that look only at hardware-centric criteria often ignore the benefits of more efficient software.

This second bullet merits further elaboration. Administrators using CPU utilization as an approximation of total server work accomplished would erroneously conclude that their servers were doing only half as much work with HipHop as they were beforehand. In fact, the servers would be doing the same amount of work, just more efficiently thanks to better software.

Future posts will talk about ways to measure useful work.

--kb

Tuesday, February 16, 2010

Good IBM doc on cpufreq

Previous posts have talked about power management in Linux, including information about the cpufreq module. IBM has published a good document explaining many aspects of how to use the cpufreq module. This document is worth a look.

Monday, February 15, 2010

Intel® Energy Checker SDK Released

For nearly the last two years, I've been working with a colleague named Jamel Tayeb to develop a tool that could be used to help measure the energy efficiency of software (and data centers as an aggregate measure). I'm happy to say that the Intel® Energy Checker SDK has now finally gone public and is available for download from http://whatif.intel.com.

Most of the technology world's attention regarding energy efficiency has focused on hardware: better processors, better memory, better disks, better power conversion, etc. This is good, but it overlooks the substantial contribution that better software can make towards improving energy efficiency. An automobile driver who drives over the top of a hill may use more energy than someone who drives around the hill; likewise, software designed with energy efficiency in mind may use a different algorithm than a brute-force approach that seems simpler at first.

The Intel® Energy Checker SDK provides developers and systems integrators a simple API that they can use to measure the amount of "useful work" performed by the system and then correlate the useful work with energy consumption. The useful work is not the number of instructions executed, cycles retired, or the average CPU utilization--that's not why you buy software. For example, you buy e-mail software to do things like send e-mails, so the measures of useful work can be the number of messages sent, the number of kilobytes in those messages sent, the number of messages received, and the number of kilobytes in those messages received. Software developers can choose what measures of useful work they export and how often they choose to export this information.

The SDK includes tools to measure the rate of power usage and to measure/calculate energy consumption over time. The SDK supports several external power meters as well as the ability to read energy consumption directly from power supplies having certain levels of instrumentation.

The software developer can easily aggregate/weight the work done in their application(s) with work done in other instrumented applications and compare that to the energy consumed by the system or systems under test to determine energy efficiency. This is an important step towards making software more energy efficient and may lead towards energy-aware algorithms in leading software packages. In turn, this will help administrators measure the aggregate useful work of their facilities, rather than simply measuring hardware-centric metrics that actually penalize more efficient software.
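To make the idea concrete, here is a minimal sketch of weighting useful-work counters and dividing by measured energy. It does not use the SDK's actual API: the function names, counter names, weights, and power samples are all hypothetical, chosen only to show the arithmetic.

```python
def energy_joules(samples):
    """Integrate (time_s, power_w) samples with the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += (p0 + p1) / 2 * (t1 - t0)
    return total


def work_per_joule(counters, weights, samples):
    """Weighted useful work divided by the energy consumed while doing it."""
    work = sum(weights[name] * count for name, count in counters.items())
    return work / energy_joules(samples)


# Hypothetical e-mail server: two counters exported over a two-minute window.
counters = {"messages_sent": 1200, "messages_received": 1800}
weights = {"messages_sent": 1.0, "messages_received": 0.5}   # assumed weighting
samples = [(0, 200.0), (60, 220.0), (120, 200.0)]            # (seconds, watts)

joules = energy_joules(samples)                       # 25200.0 J
efficiency = work_per_joule(counters, weights, samples)
```

In this made-up case the server does 2100 weighted units of work on 25,200 joules, or about 0.083 units per joule. A software change that halves CPU utilization while sending the same messages would show up here as an improvement, not as a drop in work.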

The SDK is available free of charge (and without royalties) from http://software.intel.com/en-us/articles/intel-energy-checker-sdk/. The SDK supports Windows, Linux, Solaris 10, and MacOS X. Source code for the core API and many utilities is included, though Intel distributes some utilities in binary form only. Check it out!

--kb

Thursday, September 3, 2009

Cisco & Sun Servers Spar for Best Humidity Support

One of the clear trends in data centers is to improve data center efficiency by making greater use of HVAC economizer modes. For air-side (dry-side or outside air) economizers, one of the keys to broad adoption is to be able to use outside air across as broad a relative humidity (RH) range as possible.

Among the major blade vendors, Cisco appeared to have taken the lead by offering support for the broadest operating humidity range, but Sun appears to have matched Cisco recently:

(All humidity ranges are non-condensing. All data is from vendor web sites as of 9/3/09).

Ever-widening ranges for supported humidity make the use of dry-side economizers more feasible. If vendors were able to support 0-100% relative humidity, data center operators wouldn't need to worry about humidification/de-humidification controls. Eliminating such controls and systems could lower capital costs, reduce operating costs, lower the carbon footprint of facilities, and lower their water footprint as well.

--kb

Thursday, July 16, 2009

Adding a Geographic Element to PUE Calculations

The PUE metric has become one of the most significant metrics for measuring the gross efficiency of a data center. As data center operators boast of PUE numbers that approach the optimal rating of 1.0, it's often difficult to separate out environmental or regional factors.

Is a PUE of 1.5 in Phoenix better or worse than a PUE of 1.4 in Seattle?

It depends. In absolute numbers, the lower PUE indicates the more efficient facility. However, achieving a PUE of 1.5 in Phoenix is much more difficult than achieving an equivalent or slightly lower number in Seattle because Phoenix is so much hotter and requires more air conditioning. Moving data centers to cooler locations helps the PUE rating, but sometimes data centers need to be located in a specific city or region. How can you compare PUE values in regions with different environmental conditions?

One possible approach is to add a geographic compensating factor:

gPUE = G * PUE

The geographic compensating factor G would be determined by The Green Grid or other trusted body based on compiled weather data. Ideally, this could be calculated empirically through a formula using data maintained by the U.S. Department of Energy (refer to this blog link for information on that data and a free tool to visually represent that data).

This approach would allow somebody to measure the technical innovation of a given facility while providing an adjustment to account for geographic disparities in temperature, wind, solar loading, etc. It's not a perfect solution (since some cooling optimizations might not work in cooler or hotter climates), but it provides some measure of equalization to facilitate more equitable comparisons between PUE claims in different locations.
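As a quick illustration of how gPUE = G * PUE would change the Phoenix vs. Seattle comparison, here is a small sketch. The G values are invented purely for the example; no such factors have been published, and real ones would have to come from weather data as described above.

```python
def gpue(pue, g):
    """Geographically compensated PUE: gPUE = G * PUE."""
    return g * pue


# Hypothetical compensating factors (assumptions, not published values):
# a hot climate gets G < 1 to offset its heavier cooling burden.
G = {"Phoenix": 0.85, "Seattle": 1.00}
PUE = {"Phoenix": 1.5, "Seattle": 1.4}

adjusted = {city: gpue(PUE[city], G[city]) for city in PUE}
best = min(adjusted, key=adjusted.get)
```

With these assumed factors, Phoenix's gPUE works out to about 1.275 against Seattle's 1.4, so the Phoenix facility would rank as the better-engineered one despite its higher raw PUE.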

--kb

Monday, June 15, 2009

Making Ice to Lower PUE and TCO

At night, demand on the grid is lower, energy costs tend to be lower, and temperatures are also lower. These three factors make night an attractive time to produce thermal storage. This allows facility managers to time-shift HVAC-related energy costs to reduce peak demands on the grid and lower energy costs.

Although some facility managers have developed their own methods for time-shifting HVAC energy requirements, Ice Energy may be the first vendor to market a product specifically designed to do this. The Ice Bear* distributed energy storage system provides up to 5 tons of cooling load during peak hours.
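The time-shifting economics come down to simple arithmetic. The sketch below is not based on Ice Energy's figures: the tariff rates, the daily cooling energy, and the storage overhead (extra energy spent making and holding ice) are all assumptions for illustration.

```python
def daily_savings(cooling_kwh, peak_rate, offpeak_rate, storage_overhead=0.0):
    """Savings from buying cooling energy off-peak instead of on-peak.

    storage_overhead is the extra fraction of energy needed when the
    cooling is produced ahead of time and stored (0.15 means 15% more).
    """
    peak_cost = cooling_kwh * peak_rate
    shifted_cost = cooling_kwh * (1 + storage_overhead) * offpeak_rate
    return peak_cost - shifted_cost


# Hypothetical numbers: 100 kWh/day of peak-hour cooling energy,
# $0.20/kWh on-peak, $0.08/kWh off-peak, 15% storage overhead.
savings = daily_savings(100, 0.20, 0.08, storage_overhead=0.15)
```

Under these assumptions the shift saves about $10.80 per day even though it uses more total energy, which is why the spread between peak and off-peak rates matters more than the storage losses.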

It's good to see innovative products like this coming to market.

--kb

Wednesday, May 27, 2009

Truckin' Down the Information Superhighway

Last week, I was talking with a friend from Sun who is involved with Sun's containerized data centers. He mentioned that since they helped the Internet Archive put 3.2 petabytes of storage in a shipping container, they figured they could put the container on a truck, take 7 days to ship the container across the country, and still average >40 Gbps over that 7-day period!
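The arithmetic behind that figure is easy to check. A minimal sketch, assuming decimal petabytes (10^15 bytes) and a full 7 x 24 hours in transit:

```python
def effective_gbps(petabytes, days):
    """Average throughput of physically shipping data, in gigabits per second."""
    bits = petabytes * 1e15 * 8          # decimal petabytes -> bits
    seconds = days * 24 * 60 * 60
    return bits / seconds / 1e9


truck = effective_gbps(3.2, 7)   # roughly 42.3 Gbps
```

That comes out a little over 42 Gbps, consistent with the ">40 Gbps" claim.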

Coincidentally, two days later Amazon introduced Amazon Web Services Import/Export with a blog that starts off with the following colorful quote attributed to Andy Tanenbaum:
Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

Amazon Web Services Import/Export allows people to send USB or eSATA hard drives/media to Amazon for data sets that are impractical to send over available communications links.

It turns out that the bulk version of sneakernet may be the most expeditious way to move data. The more things change, the more things stay the same.

--kb
Note: Revised title on 5/29/09.

Monday, May 11, 2009

How a Good Metric Could Drive Bad Behaviors

The PUE (Power Usage Effectiveness) metric from The Green Grid has become a widely referenced benchmark in the data center community, and justifiably so. However, there can be a dark side to following this metric blindly.


Introduction

PUE is defined as follows:

PUE = Total Facility Power/IT Equipment Power

Using the PUE metric, a facility manager can judge what ratio of power is lost in "overhead" (infrastructure) to operate the facility. A PUE of 1.6 to 2.0 is typical, but facility managers are striving to approach a PUE of 1.0, the idealized state.
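For readers who prefer to see the calculation, here is a small sketch of the definition above. The function names and the 1600 kW / 1000 kW figures are illustrative assumptions.

```python
def pue(total_facility_kw, it_equipment_kw):
    """PUE = Total Facility Power / IT Equipment Power."""
    return total_facility_kw / it_equipment_kw


def overhead_fraction(pue_value):
    """Share of total facility power that does not reach IT equipment."""
    return 1 - 1 / pue_value


example_pue = pue(1600.0, 1000.0)               # 1.6
example_overhead = overhead_fraction(example_pue)
```

A facility drawing 1600 kW to run 1000 kW of IT gear has a PUE of 1.6, meaning 37.5% of its power goes to overhead. At the idealized PUE of 1.0 that fraction is zero.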

Companies wishing to drive more sustainable practices may incent facility managers to improve facility PUE levels. However, if this is done without regard to overall energy or other resource consumption, it could drive inefficient behaviors.



Issue #1: Dissimilar Infrastructure Power Scaling

If a facility manager tracks PUE over a variety of workloads, they will see how the data center's infrastructure power consumption tracks with the IT load. Ideally, the infrastructure overhead (HVAC system, UPS system, etc.) will scale linearly with the consumption of the servers and other gear in the data center, but this is rarely the case.



In many cases, the fixed overhead for power and cooling systems will become a higher percentage of overall power consumption as the IT load diminishes. In other cases, there will be significant step functions in overall power consumption as large infrastructure items such as chillers, CRACs, or other equipment are turned on or off (as depicted in the graph to the left).

In such situations, reducing the IT power consumption could increase the PUE even if it reduces the overall energy consumption of the data center. People will often act in the direction towards which they are incented (i.e., what improves their paycheck). Managers incented to improve PUE without any clear tie-in to overall energy consumption might be reluctant to shut off unused servers or aggressively implement power saving features on their IT infrastructure if it increased their PUE--even if doing so would reduce overall facility power consumption.
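A toy model shows how this plays out. It assumes infrastructure power is a fixed overhead plus a part proportional to IT load; the 300 kW fixed overhead and 0.3 proportional factor are invented numbers, and real facilities will also show the step functions mentioned above.

```python
def facility(it_kw, fixed_overhead_kw=300.0, proportional=0.3):
    """Return (total_kw, pue) for a given IT load under a simple linear model."""
    infra_kw = fixed_overhead_kw + proportional * it_kw
    total_kw = it_kw + infra_kw
    return total_kw, total_kw / it_kw


total_high, pue_high = facility(1000.0)   # 1600 kW total, PUE 1.6
total_low, pue_low = facility(700.0)      # 1210 kW total, PUE about 1.73
```

Shutting off 300 kW of unused servers cuts total consumption by 390 kW in this model, yet PUE gets worse, rising from 1.6 to roughly 1.73. A manager paid on PUE alone would be penalized for doing the right thing.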

Ensuring overall energy consumption is part of the incentive package (not just PUE) is critical to driving the desired behaviors.

[Part of this needs to be linked with overall productivity of the data center so that increased use of the data centers is encouraged while still incenting improved efficiency. I'll write about this in an upcoming post.]



Issue #2: Shifting Infrastructure Loads to IT

Another issue to watch is a desire to classify some infrastructure-like services as IT loads in order to improve PUE efficiencies. Examples of this include moving UPS systems into IT racks or putting large air-mover devices into equipment cabinets and trying to classify them as IT loads. This is "gaming" the system and should be actively discouraged.

The Green Grid is aware of this issue and is adding more guidelines to help people improve the accuracy and consistency of their PUE reporting.



Issue #3: Improving Infrastructure Efficiency at the Expense of IT

The third issue to watch is a move towards facility or equipment practices that reduce the infrastructure power consumption but increase the IT power consumption. In particular, the adoption of higher operating temperatures for data centers warrants close scrutiny.

I've noted previously that there are significant gains possible by raising data center temperatures and making greater use of dry-side or wet-side economizers. However, it's important to compare the energy savings on the infrastructure side with the energy costs on the IT side. At higher temperatures, leakage currents in silicon increase and fans inside servers need to run faster to move more air through each server.

Increase the IT consumption and lower the infrastructure consumption and you get a two-fer: the infrastructure overhead in the PUE numerator goes down and the PUE denominator goes up, lowering the overall PUE even if total facility power stays flat or rises. However, if the net power consumption doesn't go down, it usually** doesn't make sense to increase the ambient temperature. Once again, looking at overall power consumption in addition to PUE is important in incenting the proper behaviors.
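Here is a small sketch of that check, comparing both PUE and total power before and after a temperature change. The kW figures are hypothetical, chosen to show a case where PUE improves while total consumption gets worse.

```python
def pue(it_kw, infra_kw):
    """PUE from separate IT and infrastructure loads."""
    return (it_kw + infra_kw) / it_kw


def compare(before, after):
    """Each argument is an (it_kw, infra_kw) pair."""
    return {
        "pue_change": pue(*after) - pue(*before),
        "total_kw_change": sum(after) - sum(before),
    }


cool = (1000.0, 600.0)   # assumed loads at the lower setpoint
warm = (1080.0, 530.0)   # assumed: server fans and leakage up, cooling down
result = compare(cool, warm)
```

With these made-up loads, PUE drops from 1.6 to about 1.49, but the facility draws 10 kW more overall. Judged on PUE alone this looks like a win; judged on total power it is not.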

--kb


**Note: For greenfield (new) data centers or substantial data center retrofits, raising the allowed data center temperature may eliminate or substantially reduce the CapEx (capital expenditure) cost for that data center even if the direct energy costs are slightly higher. For example, if a data center doesn't need to purchase a chiller unit, that could shave millions of dollars off the construction cost for a facility. In such cases, more complicated parameters will be needed to evaluate the benefits of raising the ambient temperature in the facility; these likely will include a net present value analysis for the CapEx savings vs. OpEx (operating expense) costs, consideration of real estate savings, etc. The real win is when CapEx costs are avoided AND OpEx costs are lower.