Thursday, April 9, 2009

More on Google's Battery-backed Servers

As noted in Evaluating Google's Battery-backed Server Approach, there are a number of benefits to Google's recently-disclosed practice of putting VRLA batteries on every server, but there are quite a few drawbacks as well.

One of the drawbacks not discussed in the prior post is a set of issues related to power transients and harmonics. With a conventional data center, there are multiple levels of power transformation and isolation between the individual server and the grid. Power usually comes in at high or medium voltage to a transformer and comes out as low voltage (<600 V) before going to a UPS and a PDU.

In an effort to improve efficiency and reduce capital costs, facility managers are looking at removing some of these isolation layers. This is fine to a certain extent. After all, there are a lot of small businesses that run one or two servers on their own, and there aren't major problems with them. In those cases, however, there are usually relatively few computers hooked together on the same side of the electrical transformer that provides power to the building. This transformer provides isolation from building to building (or zone to zone in some installations).

When you scale up into a large data center, however, you get thousands and thousands of servers in the same building. If you remove those extra layers of isolation, the burden of providing that isolation falls to the power supplies in the individual servers. If servers use traditional AC power supplies, issues like phase balancing and power factor correction across all the separate power supplies become much more interdependent.

The issues can be helped or hurt depending on what's nearby. Servers without isolation near an aluminum smelter, sawmill, subway, or steel mill may see wide fluctuations in their power quality, which can result in unexplained errors.

I've seen cases with marginal power feeds where individual racks of servers seem to work fine, but the aggregate load when all servers are operating causes enough of a voltage sag that some servers occasionally don't work right. Let me tell you, those are a real pain to diagnose.

On the other hand, if you're somebody like Google or Microsoft who can locate data centers in places like The Dalles, Oregon or Quincy, Washington that are just a stone's throw from major hydroelectric dams or other sources of power, perhaps you can rely on nice clean power all the time.

External power-quality factors may be the least of a data center manager's problems, however. The big concern with eliminating the intermediate isolation is that transients and other power line problems from one power supply can affect the operation of adjacent systems, and these can build up to significant levels if fault isolation and filtering are not provided.

Another issue that bedevils data center managers is phase balancing. In most AC-powered systems, power is delivered via three phases or legs (A, B, and C phases), each 120° out of phase with the others. At some point (usually the PDU), a neutral conductor is synthesized so that single-phase currents can run from one of these legs to neutral. In a properly balanced system, there will be equal loading on the A leg, the B leg, and the C leg. If the phases are not properly balanced, there are several bad things that can occur, including the following:
  • The neutral point will shift towards the heaviest load, lowering the voltage to the equipment on that line, resulting in premature equipment failure and undervoltage-related errors
  • An imbalanced load may cause excess current to flow over specific conductors, which can then overheat
  • Breakers or other overcurrent mechanisms may trip
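
To make the imbalance effect concrete, here is a minimal Python sketch of how the current on the shared neutral grows as the three legs drift apart. It assumes a wye system with phase-to-neutral loads that all have the same power factor (so the three currents sit exactly 120° apart), and it ignores harmonic currents. The function name and the amp values are hypothetical, chosen only for illustration.

```python
import cmath
import math

def neutral_current(i_a, i_b, i_c):
    """Return the neutral current magnitude in amps for three phase currents.

    Assumption: every load has the same power factor, so the phase
    currents are separated by exactly 120 degrees. Harmonics are ignored.
    """
    angles_deg = (0.0, -120.0, 120.0)
    # The neutral carries the phasor (vector) sum of the three phase currents.
    total = sum(
        cmath.rect(magnitude, math.radians(angle))
        for magnitude, angle in zip((i_a, i_b, i_c), angles_deg)
    )
    return abs(total)

# Hypothetical loads: equal legs cancel, unequal legs do not.
balanced = neutral_current(30.0, 30.0, 30.0)
skewed = neutral_current(40.0, 30.0, 20.0)
print(f"balanced: {balanced:.2f} A, skewed: {skewed:.2f} A")
```

With equal loading the three currents cancel and the neutral carries essentially nothing; with the skewed loading in this example it carries roughly 17 A.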

Phase imbalance can occur when administrators do not follow a rigorous process of rotating servers across the three phases, so that each phase feeds every third server. Additionally, shifting workloads could cause some servers to be more heavily utilized than others--and phase balancing is almost certainly not a factor considered in allocating applications to specific servers. An even more pernicious issue can arise with systems employing redundant power supplies, such as blade servers: in an attempt to maximize efficiency, management software may shut down certain power supplies to maximize load on the remaining ones--all without considering the impact on phase balancing when the load is no longer shared equally among all power supplies.
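
A rough Python sketch of why careful plugging alone is not enough: servers are assigned to phases in strict A, B, C rotation, and the per-phase load is then compared at idle and under a shifted workload. The wattages are made-up numbers, and the imbalance metric (largest deviation from the mean, as a percentage of the mean) is one common convention that I am assuming here, not something taken from Google's design.

```python
def phase_loads(server_watts):
    """Sum the load per phase when servers are plugged in A, B, C, A, B, C order."""
    loads = {"A": 0.0, "B": 0.0, "C": 0.0}
    for index, watts in enumerate(server_watts):
        loads["ABC"[index % 3]] += watts
    return loads

def imbalance_percent(loads):
    """Largest deviation from the mean phase load, as a percentage of the mean."""
    mean = sum(loads.values()) / len(loads)
    return 100.0 * max(abs(value - mean) for value in loads.values()) / mean

# Hypothetical rack of six servers, all idle at the same draw.
idle = phase_loads([200.0] * 6)

# Same wiring, but the scheduler happens to load the two servers on phase A.
busy = phase_loads([350.0, 200.0, 200.0, 350.0, 200.0, 200.0])

print(idle, imbalance_percent(idle))
print(busy, imbalance_percent(busy))
```

The wiring is identical in both cases; only the workload placement changes, and in this example the busy case ends up with phase A carrying 700 W against 400 W on each of the other two.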

Data centers that employ conventional PDUs don't generally have these issues (or have them at lesser severity), since the PDUs and their transformers are usually designed to handle significant phase imbalances without creating problems.

Additional considerations with the Google battery-backed server approach:

  • Acid risks from thousands of individual tiny batteries (e.g., cracked cases in thinner-walled batteries)
  • Shorting risks from batteries that can deliver thousands of amps of current for a short period
  • More items to monitor, or higher risks of silent failures (albeit with smaller failure domains) when you most need the batteries

This is a complex issue. I'm not convinced that Google has determined the optimal solution, but kudos to them for finally being willing to publicly discuss some of what they consider to be best practices. Collectively, we can learn bits and pieces from different sources that could end up delivering more efficient services.

--kb
