HE.net, Fremont-2 outage?

Seth Mattinen sethm at rollernet.us
Wed Nov 4 20:28:41 UTC 2009


Joe Greco wrote:
> 
> Yup.  Related: "100% availability" is a marketing person's dream; it 
> sounds good in theory but is unattainable in practice, and is a reliable
> sign of non-100%-reliability.
> 
> The most common way to gain "100% availability" is to avoid testing
> under load.  This surely protects the equipment against a whole slew of
> failures in the less-used portions of your power systems, but also
> protects you from detecting them outside your Hour(s) Of Greatest Need.

Not testing under load is silly, IMHO. Does it work? Maybe. If something
strange happens during a load test, the test is attended, trouble is
expected, and utility is available to fall back on. Starting your
generator only proves it'll turn over and idle, not that it'll carry the
load all the way to the racks.

Some people may prefer a colo that never risks it and therefore never
does more than idle the genset to claim 100% uptime. Others may prefer
one that won't promise 100% everything but does load tests. I'd rather
have a test go wrong while utility is available than have the backup
fail with no utility, hoping the power comes back before the UPS dies
or the room cooks itself. Both extremes are available to choose from if
you do your research before picking a colo.


> And even for those who follow best practices...  You can inspect and 
> maintain things until you're blue in the face.  One day a contractor 
> will drop a wrench into a PDU or UPS or whatever and spectacular things
> will happen.  Or a battery develops a strange fault.
>
> You do live load testing, you'll lose now and then.  It's best to simply
> assume no single circuit is 100% reliable.  You should be able to get
> two circuits from separate power systems and the combination of the two
> should really closely approximate 100%, but even there...  it isn't.
> 

Separate power systems are overrated, especially if the fire department
ends up being involved for some reason. (Re: the infamous gas leak
story.) And of course with increased complexity comes increased risk of
failure and longer downtime to diagnose and repair. There is no perfect
balance.
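
To put rough numbers on the two-circuit argument, here is a minimal
sketch of the arithmetic. Every figure in it is made up for
illustration, and the function names and the common_mode parameter are
hypothetical. It assumes the two feeds fail independently, except for a
small chance of one event (say, a building evacuation) that takes out
both at once.

```python
HOURS_PER_YEAR = 8760  # 365 days; ignores leap years


def combined_availability(a, b, common_mode=0.0):
    """Availability of a load fed by two redundant circuits.

    a, b: availability of each circuit on its own, as a fraction (0..1).
          The two are assumed to fail independently of each other.
    common_mode: assumed probability of an event that drops both
          circuits together, regardless of how good each one is.
    """
    both_fail_independently = (1 - a) * (1 - b)
    return (1 - common_mode) * (1 - both_fail_independently)


def downtime_hours(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.999  # illustrative figure for one circuit
    ideal = combined_availability(single, single)
    shared = combined_availability(single, single, common_mode=0.0001)

    for label, value in (("one circuit", single),
                         ("two, independent", ideal),
                         ("two, shared risk", shared)):
        print(f"{label:18s} {value:.6f}  "
              f"{downtime_hours(value):.3f} h/yr")
```

With these made-up inputs, the pair is far better than either circuit
alone, but even a small shared risk ends up dominating the total, and
the result never reaches 100%.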

~Seth



