San Francisco Power Outage

Owen DeLong owen at delong.com
Wed Jul 25 01:01:52 UTC 2007



On Jul 24, 2007, at 4:57 PM, Patrick Giagnocavo wrote:

>
>
> On Jul 24, 2007, at 6:54 PM, Seth Mattinen wrote:
>>
>> I have a question: does anyone seriously accept "oh, power  
>> trouble" as a reason your servers went offline? Where's the  
>> generators? UPS? Testing said combination of UPS and generators?  
>> What if it was important? I honestly find it hard to believe  
>> anyone runs a facility like that and people actually *pay* for it.
>>
>
> Sad that the little Telcove DC here in Lancaster, PA, that Level3  
> bought a few months ago, has weekly full-on generator tests where  
> 100% of the load is transferred to the generator, while apparently  
> large DCs that are charging premium rates, do not.


I am not familiar with the operational details of 365 Main, but I
suspect that they, like most datacenters, probably do have weekly
generator and transfer test procedures.

However, there are lots of things that can go wrong that are not
covered by generators and transfer tests:

It is possible to cascade-fail a power distribution system in a number
of ways. It is possible for someone to connect things out of phase
during a maintenance procedure in such a way that everything is fine
until a transfer occurs, and then all hell breaks loose. (Ever seen
what happens when a large CRAC unit starts trying to run backwards
because the 3-phase rotation is out of order?)
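
To make the phase-rotation point concrete, here is a minimal sketch of
how swapping any two conductors of a 3-phase feed reverses the rotation
a motor sees. The function name and the angle values are illustrative
assumptions, not anything taken from 365 Main or a real metering API.

```python
def phase_sequence(angles_deg):
    """Classify rotation from the phase angles (degrees) of A, B, C.

    Assumes a balanced 3-phase set, so the A-to-B relationship alone
    determines the rotation. Hypothetical helper for illustration.
    """
    a, b, _c = angles_deg
    # In positive (ABC) rotation, phase B lags phase A by 120 degrees.
    lag_a_to_b = (a - b) % 360
    return "ABC" if lag_a_to_b < 180 else "ACB"


# Correctly wired feed: A at 0, B lagging by 120, C leading by 120.
normal = phase_sequence((0, -120, 120))

# Same feed with the B and C conductors swapped during maintenance.
swapped = phase_sequence((0, 120, -120))

print(normal, swapped)
```

A 3-phase motor fed the "ACB" set turns in the opposite direction,
which is why a swap can sit unnoticed on an idle path until a transfer
puts load on it.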

There are also things that can go wrong in the transfer process (like
putting the UPS and generators on the bus together while they are some
degrees out of phase).
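
As a rough sketch of why a few degrees matter: for two sources of equal
voltage and frequency, the voltage across the open tie breaker is
2 * V * sin(delta / 2), where delta is the phase angle between them.
The 480 V figure and the function name below are assumptions chosen for
illustration only.

```python
import math


def breaker_voltage(v_rms, angle_deg):
    """RMS voltage across an open breaker between two equal sources.

    Assumes both sources have the same magnitude and frequency and
    differ only by a fixed phase angle. Illustrative helper only.
    """
    return 2 * v_rms * math.sin(math.radians(angle_deg) / 2)


# Hypothetical 480 V bus at several phase offsets.
for angle in (0, 10, 60, 180):
    print(angle, round(breaker_voltage(480, angle), 1))
```

At 60 degrees the breaker closes onto a difference equal to the full
source voltage, and at 180 degrees onto twice that, with only the source
and bus impedances limiting the resulting current.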

Most of these things become far more likely and far harder to avoid as
the amount of power and the number of units in the system increase.

I'm not defending the situation at 365 Main. I don't have any first-hand
knowledge.  I'm just saying that the mere fact that they have been dark
for several hours today does not necessarily mean that they don't do
weekly full-on generator tests.

I have no idea what the root cause of today's outage is.  I will be
interested in hearing any actual details from any credible source,
but I'm betting that right now, any such credible source is a bit busy.

Owen



