FYI Netflix is down

AP NANOG nanog at armoredpackets.com
Mon Jul 2 15:41:00 UTC 2012


While I was working for a wireless telecom company, our primary 
datacenter was knocked off the power grid by weather. The generators 
kicked on and everything was fine, until one generator was struck by 
lightning and that same strike fried the control panel on the second 
one.  With no control panel on the second generator, we had no 
means of monitoring it for temp, fuel, input voltage (when it came 
back), output voltage, surge protection, or ultimately whether the 
generator had spiked to full voltage due to a regulator failure.  
Needless to say, we had to shut the second generator down for safety 
reasons.

While in the military I saw many generators struck by lightning as well.

I'm not saying Amazon was not at fault here, but I can see how this is 
possible, and it happens more frequently than one might think.

I hate to play devil's advocate here, but you as the customer should 
always have backups to your backups, and practice these fail-overs on a 
regular basis.  Otherwise you are the one at fault here, no one else...

-- 

Thank you,

Robert Miller
http://www.armoredpackets.com

Twitter: @arch3angel

On 7/2/12 11:01 AM, Dan Golding wrote:
>> -----Original Message-----
>> From: Todd Underwood [mailto:toddunder at gmail.com]
>>
>> scott,
>>
>>>> This was not a cascading failure.  It was a simple power outage
> Actually, it was a very complex power outage. I'm going to assume that what happened this weekend was similar to the event that happened at the same facility approximately two weeks ago (it's immaterial - the details are probably different, but it illustrates the complexity of a data center failure):
>
> Utility Power Failed
> First Backup Generator Failed (shut down due to a faulty fan)
> Second Backup Generator Failed (breaker coordination problem resulting in faulty trip of a breaker)
>
> In this case, it was clearly a cascading failure, although only limited in scope. The failure in this case also clearly involved people. There was one material failure (the fan), but the system should have been resilient enough to deal with it. The system should also have been resilient enough to deal with the breaker coordination issue (which should not have occurred), but was not. Data centers are not commodities. There is a way to engineer these facilities to be much more resilient. Not everyone's business model supports it.
>
> - Dan
>
>
>>>> Cascading failures involve interdependencies among components.
>>>
>>> Not always.  Cascading failures can also occur when there is zero
>>> dependency between components.  The simplest form of this is where one
>>> environment fails over to another, but the target environment is not
>>> capable of handling the additional load and then "fails" itself as a
>>> result (in some form or other, but frequently different to the mode
>>> of the original failure).
>>
>> indeed.  and that is an interdependency among components.  in
>> particular, it is a capacity interdependency.
>>
>>> Whilst the Amazon outage might have been a "simple" power outage, it's
>>> likely that at least some of the website outages caused were a
>>> combination of not just the direct Amazon outage, but also the flow-on
>>> effect of their redundancy attempting (but failing) to kick in -
>>> potentially making the problem worse than just the Amazon outage
>>> caused.
>>
>> i think you over-estimate these websites.  most of them simply have no
>> redundancy (and obviously have no tested, effective redundancy) and
>> were simply hoping that amazon didn't really go down that much.
>>
>> hope is not the best strategy, as it turns out.
>>
>> i suspect that randy is right though:  many of these businesses do not
>> promise perfect uptime and can survive these kinds of failures with
>> little loss to business or reputation.  twitter has branded its early
>> failures with a whale that not only didn't hurt it but helped endear the
>> service to millions.  when your service fits these criteria, why would
>> you bother doing the complicated systems and application engineering
>> necessary to actually have functional redundancy?
>>
>> it simply isn't worth it.
>>
>> t
>>
>>>    Scott



