Thursday, April 9, 2009

More on Google's Battery-backed Servers

As noted in Evaluating Google's Battery-backed Server Approach, Google's recently disclosed practice of putting VRLA batteries on every server has a number of benefits, but it also has quite a few drawbacks.

One of the drawbacks not discussed in the prior post is a set of issues related to power transients and harmonics. With a conventional data center, there are multiple levels of power transformation and isolation between the individual server and the grid. Power usually comes in at high or medium voltage to a transformer and comes out as low voltage (<600 V) before going to a UPS and a PDU.

In an effort to improve efficiency and reduce capital costs, facility managers are looking at removing some of these isolation layers. This is fine to a certain extent. After all, there are a lot of small businesses that run one or two servers on their own, and there aren't major problems with them. In those cases, however, there are usually relatively few computers hooked together on the same side of the electrical transformer that provides power to the building. This transformer provides isolation from building to building (or zone to zone in some installations).

When you scale up into a large data center, however, you get thousands and thousands of servers in the same building. If you remove those extra layers of isolation, the burden of providing that isolation falls to the power supplies in the individual servers. If servers use traditional AC power supplies, issues like phase balancing and power factor correction across all the separate power supplies become more of an interdependent problem.
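
To put a rough number on the power factor piece: for the same real power, a lower power factor means more line current, and with thousands of supplies on a feed that current adds up. Here's a minimal sketch in Python; the server count, per-server wattage, and 208 V feed are illustrative assumptions, not anything specific to Google's design.

    # Sketch: line current drawn by a bank of server power supplies at a
    # given power factor. All input numbers are illustrative assumptions.

    def line_current(real_power_w, voltage_v, power_factor):
        """RMS line current for a single-phase load: I = P / (V * PF)."""
        return real_power_w / (voltage_v * power_factor)

    servers = 1000           # hypothetical servers on one feed
    watts_per_server = 300   # assumed average draw per server
    voltage = 208            # assumed single-phase feed voltage

    for pf in (1.0, 0.95, 0.80):
        amps = line_current(servers * watts_per_server, voltage, pf)
        print(f"PF {pf:.2f}: {amps:,.0f} A for the same {servers * watts_per_server / 1000:.0f} kW")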

The issues can be helped or hurt depending on what's nearby. Servers without isolation near an aluminum smelter, sawmill, subway, or steel mill may see wide fluctuations in their power quality, which can result in unexplained errors.
I've seen cases with marginal power feeds where individual racks of servers seem to work fine, but the aggregate load when all servers are operating causes enough of a voltage sag that some servers occasionally don't work right. Let me tell you, those are a real pain to diagnose.
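
A crude way to reason about that failure mode is plain Ohm's law on the shared feeder: each rack's contribution to the voltage drop is small, but the aggregate isn't. The feeder resistance, per-rack current, and tolerance limit in this sketch are invented for illustration.

    # Sketch: voltage sag on a shared feeder as more racks come online.
    # All numbers are illustrative assumptions, not measurements.

    feeder_resistance_ohms = 0.025   # assumed source + wiring resistance
    nominal_volts = 208
    amps_per_rack = 30               # assumed steady-state draw per rack
    min_ok_volts = 196               # assumed lower limit the PSUs tolerate

    for racks in (1, 5, 10, 15, 20):
        sag = racks * amps_per_rack * feeder_resistance_ohms   # V_drop = I_total * R
        volts_at_rack = nominal_volts - sag
        status = "ok" if volts_at_rack >= min_ok_volts else "marginal"
        print(f"{racks:2d} racks: {volts_at_rack:6.1f} V at the rack ({status})")

Each rack alone looks fine; it's only the combined draw that pushes things out of spec, which is exactly why these problems hide until everything is running at once.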

On the other hand, if you're somebody like Google or Microsoft who can locate data centers in places like The Dalles, Oregon or Quincy, Washington that are just a stone's throw from major hydroelectric dams or other sources of power, perhaps you can rely on nice clean power all the time.

External power quality may be the least of a data center manager's problems, however. The big concern with eliminating the intermediate isolation is that transients and other power line problems from one power supply can affect the operation of adjacent systems, and these can build up to significant levels if fault isolation and filtering are not provided.

Another issue that bedevils data center managers is the issue with phase balancing. In most AC-powered systems, power is delivered via three phases or legs (A, B, and C phases), each 120° out of phase with each other. At some point (usually the PDU), a neutral conductor is synthesized so that single-phase currents can run from one of these legs to neutral. In a properly balanced system, there will be equal loading on the A leg, the B leg, and the C leg. If the phases are not properly balanced, there are several bad things that can occur, including the following:
  • The neutral point will shift towards the heaviest load, lowering the voltage to the equipment on that line, resulting in premature equipment failure and undervoltage-related errors
  • An imbalanced load may cause excess current to flow over specific conductors, causing them to overheat (see the sketch after this list)
  • Breakers or other overcurrent mechanisms may trip
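
To see how imbalance turns into excess current on specific conductors, here's a minimal sketch that treats each leg as a resistive load and sums the three phase currents as phasors (0°, -120°, +120°); whatever doesn't cancel has to return on the neutral. The amp figures are made up for illustration.

    import cmath
    import math

    # Sketch: neutral current produced by an imbalanced three-phase load.
    # Assumes purely resistive loads; the per-phase amps are illustrative.

    def neutral_current(i_a, i_b, i_c):
        """Magnitude of the phasor sum of the three line currents."""
        total = (cmath.rect(i_a, 0)
                 + cmath.rect(i_b, -2 * math.pi / 3)
                 + cmath.rect(i_c, +2 * math.pi / 3))
        return abs(total)

    print(neutral_current(30, 30, 30))   # balanced: ~0 A returns on the neutral
    print(neutral_current(40, 30, 20))   # imbalanced: ~17 A returns on the neutral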

Phase imbalance can occur when network administrators do not follow a rigorous process of rotating servers across the three phases as they are plugged in. Additionally, shifting workloads can cause some servers to be more heavily utilized than others, and phase balancing is almost certainly not a factor considered when allocating applications to specific servers. An even more pernicious issue can arise with systems employing redundant power supplies, such as blade servers: in an attempt to maximize efficiency, management software may shut down certain power supplies to maximize load on the remaining ones, without considering the impact on phase balancing when the load is no longer shared equally among all power supplies.
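
One mitigation is to make phase assignment a bookkeeping exercise rather than a memory exercise: track per-phase load and put each new server (or each active power supply) on the lightest leg. The sketch below uses hypothetical server names and nameplate wattages; real balancing would want measured current, which is exactly what shifting workloads and PSU management can silently change.

    # Sketch: greedy least-loaded-phase assignment for new loads.
    # Server names and wattages are hypothetical.

    phase_load_w = {"A": 0.0, "B": 0.0, "C": 0.0}

    def assign_phase(watts):
        """Put the new load on whichever phase currently carries the least power."""
        phase = min(phase_load_w, key=phase_load_w.get)
        phase_load_w[phase] += watts
        return phase

    for name, watts in [("web-01", 250), ("web-02", 250), ("db-01", 600),
                        ("db-02", 600), ("cache-01", 180)]:
        print(f"{name} -> phase {assign_phase(watts)}")

    print("per-phase load (W):", phase_load_w)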

Data centers that employ conventional PDUs don't generally have these issues (or have them at lesser severity), since the PDUs and their transformers are usually designed to handle significant phase imbalances without creating problems.

Additional considerations with the Google battery-backed server approach:

  • Acid risks from thousands of individual small batteries (e.g., cracked cases in thinner-walled batteries)
  • Shorting risks from batteries that can deliver thousands of amps of current for a short period
  • More items to monitor, or higher risks of silent failures (albeit with smaller failure domains) when you most need the batteries

This is a complex issue. I'm not convinced that Google has determined the optimal solution, but kudos to them for finally being willing to publicly discuss some of what they consider to be best practices. Collectively, we can learn bits and pieces from different sources that could end up delivering more efficient services.

--kb

Sunday, March 1, 2009

Sealed Containers: Reality or Myth?

One of the interesting debates for those looking at containerized data centers is whether or not containerized data centers need to be serviceable in the field. Different products on the market today take different approaches:
  • The Sun Modular Datacenter (nee "Blackbox") provides front and rear access to each rack by mounting the racks sideways and using a special tool to slide racks into the center aisle for servicing.
  • The Rackable ICE Cube provides front access to servers, but the setup doesn't lend itself to rear access to the servers.
  • HP's Performance-Optimized Datacenter (POD) takes an alternative approach: there's a wide service aisle on the front, but you need to go outside the container to get to the back side of the racks via external doors.

Some industry notables have advocated even more drastic service changes: James Hamilton (formerly with Microsoft, now with Amazon) was one of the early proponents of containerized data centers, and he has suggested that containerized data centers could be sealed, without the need for end-users to service the hardware. The theory is that it's cheaper to leave the failed servers in the rack, up until the point that so many servers have failed that the entire container is shipped back to the vendor for replacement.

How reasonable is this?

Prior to the advent of containers, fully configured racks (cabinets) were the largest unit of integration typically used in data centers, and they remain the highest level of integrated product used in most data centers today. How many data centers seal these integrated cabinets and never open the cabinet door throughout the life of the equipment inside? This is perhaps the best indicator as to whether a sealed container really matches existing practices.

We had looked at the "fail in place" model at the company where I work, but it was difficult for managers to accept that some number of servers in a rack could simply stay failed. As long as fixing the hardware is cheaper than buying a new server (or the equipment is under warranty), most finance people and managers want to see every server in a rack functional.
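
For what it's worth, the comparison those finance people are making is easy to sketch. Every count and cost below is a placeholder assumption, not a figure from Google, Microsoft, or my employer.

    # Sketch: expected lifetime cost of field service vs. fail in place
    # (over-provisioning up front to cover expected failures).
    # All inputs are placeholder assumptions.

    servers = 2000
    server_cost = 2000.0          # assumed purchase cost per server ($)
    annual_failure_rate = 0.04    # assumed annualized failure rate
    life_years = 3.0
    field_repair_cost = 500.0     # assumed cost per field repair ($)

    expected_failures = servers * annual_failure_rate * life_years

    service_cost = expected_failures * field_repair_cost
    overprovision_cost = expected_failures * server_cost

    print(f"expected failures over {life_years:.0f} years: {expected_failures:.0f}")
    print(f"field service: ${service_cost:,.0f}   fail in place: ${overprovision_cost:,.0f}")

Under these made-up numbers, servicing in the field wins by a wide margin; the comparison only flips when the cost of an individual repair approaches the cost of a server, which is really what the sealed-container debate comes down to.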

What do you think? Do you see people keeping cabinets sealed in data centers today? Does fail in place make sense to you?

Tuesday, February 17, 2009

Server Cost Adders for Higher-temp Operation

Numerous industry notables, including Microsoft's Christian Belady, have been advocating the operation of data centers at higher ambient temperatures. Reducing or eliminating the cooling plant could yield considerable savings. But what does it take to build servers designed to operate at these higher temperatures?
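
Before getting to the cost adders, a rough feel for why people chase this, using PUE as shorthand for cooling overhead; the IT load, PUE values, and electricity price are invented for illustration.

    # Sketch: annual savings from trimming cooling overhead, expressed as a
    # PUE improvement. All inputs are illustrative assumptions.

    it_load_kw = 1000.0      # assumed average IT load
    pue_before = 1.7         # assumed PUE with a full chiller plant
    pue_after = 1.3          # assumed PUE with economizers / warmer setpoints
    price_per_kwh = 0.07     # assumed electricity price ($/kWh)

    hours_per_year = 8760
    kwh_saved = it_load_kw * hours_per_year * (pue_before - pue_after)
    print(f"~{kwh_saved:,.0f} kWh/year saved, roughly ${kwh_saved * price_per_kwh:,.0f}")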

As mentioned in a previous post, telecommunications equipment is typically designed to meet the NEBS standards (55°C maximum ambient). Cost adders for NEBS equipment include the following:
  • Higher temperature integrated circuits (ICs). Commercial-grade ICs are generally rated to 70°C; higher ambient temperatures could force the use of extended-temperature components (see the sketch after this list).
  • Heat sink costs. Higher temperatures often drive more expensive heat sink materials (e.g., copper rather than aluminum) and more use of heat sinks on components that don't need them at lower temperatures. For example, some servers need heat spreaders on DIMMs in order to be rated for operation at higher temperatures.
  • Corrosive gases tolerance. Telecommunications equipment generally needs to pass tests to ensure reliability in the presence of corrosive gases, including high sulfur-content air. Before dismissing this requirement, consider the case of air-side economizers: if you're bringing in outside air, do you need to worry about contaminants in the air, such as diesel exhaust from nearby trucks or from diesel generators?
  • Wider humidity range. Most NEBS equipment is designed for a wider range of allowable humidity exposure than most data center equipment. The broader use of economizers might make a wider humidity range desirable for data centers.
  • Flame tests. NEBS flame tests may be overkill for most data center equipment, in part because most data centers have sprinklers or other fire suppression controls (unlike telecom central offices, which do not have sprinklers).
  • Shake and vibe tests. NEBS equipment is generally tested to seismic Zone 4 earthquake requirements. These tests could just as well apply to data center equipment, but they go beyond what most data center equipment is validated against.
  • Materials selection. The use of V0-rated plastics and HF-1 or better foams in data center equipment is not necessarily a cost adder if designed in up front, but it can add appreciable expense if retrofits have to be made after-the-fact.
  • Air filters. Data center equipment generally doesn't need air filters, so these can be eliminated.
  • Long life. This actually encompasses two aspects: extended availability of certain components and long-life reliability. Telecom products often require the availability of the same components for 5-7 years, much longer than typical data center products. Similarly, telecom products often are designed to meet usable lifetimes that are much longer than most data center refresh cycles.
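
To make the first bullet a bit more concrete, here's a minimal sketch of how raising inlet temperature eats into a commercial-grade part's rating, using the usual steady-state estimate (junction temperature ≈ ambient at the part + θJA × power). The chassis rise, thermal resistance, and part power are assumptions, not datasheet values.

    # Sketch: ambient at the component and estimated junction temperature as
    # the inlet temperature rises. Thermal values are illustrative assumptions.

    commercial_ambient_limit_c = 70.0   # typical commercial-grade IC rating
    chassis_rise_c = 15.0               # assumed air heating between inlet and the part
    theta_ja_c_per_w = 6.0              # assumed junction-to-ambient resistance (°C/W)
    part_power_w = 4.0                  # assumed dissipation of the part

    for inlet_c in (25, 35, 45, 55):
        ambient_at_part = inlet_c + chassis_rise_c
        junction = ambient_at_part + theta_ja_c_per_w * part_power_w
        headroom = commercial_ambient_limit_c - ambient_at_part
        print(f"inlet {inlet_c}°C: ambient at the part {ambient_at_part:.0f}°C "
              f"(headroom {headroom:+.0f}°C), junction ≈ {junction:.0f}°C")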

Which of these attributes are needed for equipment in data centers with higher temperatures? What other attributes are needed for higher temps?

--kb