Cooling failures and overheated servers have even worse consequences than power failures in most mission-critical data centers. A well-maintained uninterruptible power supply should keep servers operating until generators kick in, power is restored or an orderly shutdown occurs. But in today's world of high-density hardware and elevated operating temperatures, a cooling failure -- even with supposedly redundant air conditioners -- can cause server crashes in seconds. Use data center temperature monitoring to avoid the hot spots that lead to early hardware failures and unexplained data errors.
How data center hot spots occur
Hot spots are insidious; they can creep up on you unnoticed until equipment starts to fail or strange data anomalies appear. If you add or move equipment without real knowledge of the room's cooling capacities, hot spots can occur. In nearly every data center, cooling capacity varies from one location in the room to another and at different heights along the rack. Since hot spots usually develop slowly, they can easily go unrecognized until they become serious.
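Because hot spots tend to build slowly, a trend check on logged inlet temperatures can surface one before it becomes serious. The following is a minimal sketch, not a vendor tool: it fits a least-squares line to daily readings from one probe and estimates how long until a limit is reached. The function names, the sample data and the 32 degrees Celsius limit used in the example are assumptions chosen for illustration.

```python
def slope_per_day(samples):
    """Least-squares slope of temperature over time.

    samples: list of (day, inlet_temp_c) pairs from a single probe.
    Returns degrees Celsius per day.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx


def days_until_limit(samples, limit_c):
    """Rough linear estimate of days until the latest reading hits limit_c.

    Returns None when the temperature is flat or falling.
    """
    slope = slope_per_day(samples)
    if slope <= 0:
        return None
    latest_temp = samples[-1][1]
    return max(0.0, (limit_c - latest_temp) / slope)


# Hypothetical daily readings from the top probe of one rack.
history = [(0, 24.0), (1, 24.2), (2, 24.4), (3, 24.6), (4, 24.8)]
print(f"trend: {slope_per_day(history):.2f} C/day")
print(f"days to 32 C at this rate: {days_until_limit(history, 32.0):.0f}")
```

A straight-line fit is only a rough guide; a sudden equipment move or cooling change will not follow the earlier trend.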
Find hot spots through data center temperature monitoring
The easiest and least expensive way to find data center hot spots is with temperature-indicating blanking panels. The multi-colored strips on these panels are heat-sensitive, and provide a visual indication of inlet air temperatures. Mount them near the top, middle and bottom of each rack, or at least in every other rack. Alternatively, mount temperature probes in front of hardware, close to the top, middle and bottom of racks. If you can only afford one per rack, put it in front of the most vulnerable hardware, which is usually the highest server in the rack.
Data center temperature and humidity probes are available as add-ons to smart rack power distribution units, as individual wireless devices and as part of some data center infrastructure management systems. All three offer software options that give real-time graphical displays of temperature conditions throughout the room. Ultimately, you should combine these readouts with computational fluid dynamics (CFD) air flow modeling, which allows you to verify cooling adequacy by simulating the proposed new installation before equipment is even installed.
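Whatever the source of the readings, the basic check is the same: compare each inlet temperature against a threshold and list the worst offenders first. The sketch below assumes readings have already been exported from the monitoring system into simple records. The `Reading` type, the rack names and the 27 degrees Celsius warning level are illustrative assumptions; 32 degrees Celsius is used as the critical limit.

```python
from dataclasses import dataclass

WARN_C = 27.0   # assumed warning threshold; adjust to your own policy
LIMIT_C = 32.0  # assumed critical limit for inlet air


@dataclass(frozen=True)
class Reading:
    rack: str
    position: str   # "top", "middle" or "bottom"
    inlet_c: float  # inlet air temperature in degrees Celsius


def classify(reading):
    """Return 'critical', 'warning' or 'ok' for one probe reading."""
    if reading.inlet_c > LIMIT_C:
        return "critical"
    if reading.inlet_c > WARN_C:
        return "warning"
    return "ok"


def find_hot_spots(readings):
    """Return (reading, status) pairs that are not 'ok', hottest first."""
    flagged = [(r, classify(r)) for r in readings]
    flagged = [(r, status) for r, status in flagged if status != "ok"]
    return sorted(flagged, key=lambda pair: pair[0].inlet_c, reverse=True)


# Hypothetical readings; real values would come from the probes.
readings = [
    Reading("A01", "bottom", 21.5),
    Reading("A01", "top", 28.4),
    Reading("B07", "top", 33.1),
]

for r, status in find_hot_spots(readings):
    print(f"{r.rack} {r.position}: {r.inlet_c:.1f} C ({status})")
```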
Many data centers invest in redundant cooling units but don't actually have redundant cooling; sometimes it's just poor design. Some computer room air conditioning units are installed with insufficient knowledge of how air really moves in a data center, which makes cooling conditions even worse. In modern designs, redundant units run simultaneously with normal units, but at reduced speed, so you don't realize added servers are stealing redundant capacity until a cooling unit fails or is turned off for maintenance.
Thankfully, servers can tolerate a higher operating temperature for several days with little negative effect. ASHRAE's allowable thermal envelope goes up to 32 degrees Celsius (89.6 degrees Fahrenheit) in emergencies, but marginal redundancy -- combined with poorly planned computing hardware additions -- can cause serious overheating and thermal shutdowns shortly after a cooling unit quits.
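One simple way to notice that added servers are eating into redundant capacity is to compare the room's heat load against the cooling that remains if the largest unit is lost. The sketch below does only that arithmetic. The function names and the kilowatt figures are made up for illustration, and a room-level total says nothing about how air is actually distributed, so a positive margin here is a necessary condition rather than proof that redundancy works.

```python
def capacity_with_one_failed(unit_kw):
    """Total cooling capacity (kW) left after losing the largest unit."""
    if len(unit_kw) < 2:
        raise ValueError("redundancy needs at least two cooling units")
    return sum(unit_kw) - max(unit_kw)


def redundancy_margin_kw(unit_kw, it_load_kw):
    """Spare capacity (kW) with one unit down; negative means no redundancy."""
    return capacity_with_one_failed(unit_kw) - it_load_kw


# Hypothetical room: three 30 kW units, sized so that two can carry the load.
units = [30.0, 30.0, 30.0]

for load in (55.0, 68.0):  # heat load before and after adding servers
    margin = redundancy_margin_kw(units, load)
    state = "redundant" if margin >= 0 else "NOT redundant"
    print(f"load {load:.0f} kW: margin {margin:+.0f} kW ({state})")
```

Rerunning a check like this before each hardware addition gives an early warning, but it does not replace an analysis of where the air actually goes.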
Prevent data center cooling failures
Some think the solution is to place redundant coolers next to normal coolers in a raised floor design, but that's not dependable. Because the air emanates from different locations, the air flow pattern differs depending on whether the normal unit, the redundant unit or both are running. This seemingly small difference causes data center temperature variations that can result in significant hot spots.
Thermal indicators are a good first step, but it's impractical to turn off cooling units every time hardware changes just to see what overheats. The best way to avoid problems, particularly in redundant designs, is to model the cooling with CFD, which creates a 3D model of the data center, including specific cooling systems and rack heat loads. The program uses this information to solve thousands of complex partial differential equations that describe the air flow. The model delivers both color-coded graphics and data tables showing air quantity, velocity, temperature and pressure at every point in the room, as well as under the floor in raised floor installations. It is then easy to see where extra cooling capacity exists and add new equipment there. It's also easy to fail a cooling unit in the model, rerun the computations, and see how well the redundancy works.