Fascinating FMEA: When a Recovery is Actually a Failure

“Failure” is a familiar word – one readily understood by the layperson. Something broke, and some broader system stopped functioning as a result. My pencil broke, so I can’t write words (I can’t hold the pencil nub). My car’s lug nuts sheared and the tire fell off, so it doesn’t work as a car anymore. There’s a hole in my shoe, and now if I step in a puddle, my foot gets wet.

Many engineering systems do not respond to component failure in such an obvious way, or they are not well-served by such casual analysis. Indeed, some events are serious failures even though they don’t seem like failures at all. The formalized study of this system-level behavior is termed Failure Mode and Effects Analysis (FMEA), and the codification of such study is critical to the design and optimization of mission-critical facilities.

The underlying idea of FMEA is to rigorously and comprehensively explore the systems-level effect of every plausible input and output of each system component. A system component may take input ‘A’ and generate output ‘B’. Without understanding the systemic consequence of ‘B’, it is impossible to know if a system-level failure has occurred.

Consider the following example. Recently, ESD performed commissioning services for a data center built by a leading U.S. telecommunications provider. This is a mission-critical data center that must provide nearly 100% uptime. Of course, redundant cooling systems are necessary – this data center has four cooling systems configured to allow one or more to fail depending on the load. However, due to plenum dimensional restrictions, no more than three cooling systems may be operational at one time. If all four units are powered on, the plenum is at risk of an overpressure event – and the plenum may fail.

Consider the nature of these cooling units. The units are not designed to order, but instead are off-the-shelf HVAC systems intended for use in critical environments. Suppose there is a power failure on a redundant supply circuit for a unit that has been commanded off. What is that cooling unit likely to do when power is restored? In most configurations, an HVAC system that lost power to a single component (perhaps it was down for maintenance?) will resume control based on the building automation system's commands once power is restored. This is a natural assumption, and it holds true for the vast majority of applications. When your house loses power, and the power comes back on, so do the lights. This is the expected behavior, and it is appropriate for that system. Unfortunately, this does not always hold true for equipment designed for mission-critical spaces, and it did not hold true for the cooling units discussed.

In the high-uptime data center example, the fourth unit had been commanded inactive by the control software, but that command was sent before the loss of power. Once power was restored, the unit did not look to the automation system for instructions but instead activated spuriously upon receiving regular standby power. At this point, all four cooling units were running, and this condition would have persisted until the next lead/lag schedule rotation occurred. Ordinarily, this would not be considered a failure. Left to themselves, these units are designed to start immediately after a power loss to recover environmental control as quickly as possible. However, in this specific mission-critical system, a “hot”-powered standby unit must not come online on its own when supply power is temporarily lost and then restored.
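The scenario can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `CoolingUnit` class, the `auto_start_on_restore` flag, and the unit names are hypothetical and do not represent the actual equipment or its control logic. Only the limit of three running units comes from the example.

```python
from dataclasses import dataclass

MAX_RUNNING_UNITS = 3  # plenum limit from the example: no more than three units at once


@dataclass
class CoolingUnit:
    name: str
    commanded_on: bool           # last command received from the automation system
    auto_start_on_restore: bool  # hypothetical flag: unit starts itself when power returns
    powered: bool = True
    running: bool = False

    def __post_init__(self):
        # A powered unit initially follows its last command.
        self.running = self.commanded_on and self.powered

    def lose_power(self):
        self.powered = False
        self.running = False

    def restore_power(self):
        self.powered = True
        # Either start unconditionally, or fall back to the last command.
        self.running = True if self.auto_start_on_restore else self.commanded_on


def simulate(auto_start: bool) -> int:
    """Return how many units run after the standby unit loses and regains power."""
    units = [
        CoolingUnit(f"CU-{i}", commanded_on=(i <= 3), auto_start_on_restore=auto_start)
        for i in range(1, 5)
    ]
    standby = units[3]  # CU-4, commanded off before the power loss
    standby.lose_power()
    standby.restore_power()
    return sum(u.running for u in units)


for auto_start in (False, True):
    count = simulate(auto_start)
    print(auto_start, count, "OVERPRESSURE RISK" if count > MAX_RUNNING_UNITS else "ok")
```

With `auto_start=False` the standby unit stays off and three units run; with `auto_start=True` all four run, which is the condition the plenum cannot tolerate.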

FMEA is a powerful framework to evaluate the effects of a unit failure. We consider, in every case, the effect of each individual unit's response on the entire system. This necessarily populates an entire set of failure trees (each rather like a family tree) of combined inputs, outputs and events. By studying and analyzing the overall system-level impact of any individual component event, we can better understand which component events may cause a system-level failure and which may be tolerated in operation.
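A minimal sketch of this kind of enumeration, applied to the four-unit example, might look like the following. The event names, the `MIN_RUNNING` load requirement, and the unit labels are assumptions made for illustration, and a real FMEA would cover far more components, events and combinations.

```python
from itertools import product

MAX_RUNNING = 3  # plenum limit from the example
MIN_RUNNING = 2  # hypothetical: units needed to carry the current load

# Hypothetical single-component events to evaluate for every unit.
EVENTS = ("trips_off", "starts_unexpectedly")


def apply_event(state, unit, event):
    """Return a copy of the system state after one unit experiences one event."""
    new_state = dict(state)
    new_state[unit] = (event == "starts_unexpectedly")
    return new_state


def system_effect(state):
    """Classify the system-level consequence of a set of unit states."""
    running = sum(state.values())
    if running > MAX_RUNNING:
        return "FAILURE: plenum overpressure risk"
    if running < MIN_RUNNING:
        return "FAILURE: insufficient cooling"
    return "tolerated"


def fmea_table(baseline):
    """Evaluate every (unit, event) pair against the baseline state."""
    return [
        (unit, event, system_effect(apply_event(baseline, unit, event)))
        for unit, event in product(baseline, EVENTS)
    ]


# Baseline: three units running, one on standby.
baseline = {"CU-1": True, "CU-2": True, "CU-3": True, "CU-4": False}
for row in fmea_table(baseline):
    print(row)
```

Under these assumptions, a running unit tripping off is tolerated, while the standby unit starting unexpectedly is the only single event flagged as a system-level failure.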

Project commissioning includes, but is not limited to, FMEA. This methodical and systematic evaluation seeks to find these blind spots, where a component unit responds to off-nominal input according to the behavior the component designer expected, but not necessarily in a way that would please the system-level designer.