FYI Netflix is down

Mon Jul 2 15:01:59 UTC 2012

> -----Original Message-----
> From: Todd Underwood [mailto:toddunder at gmail.com]
> 
> scott,
> 
> >>
> >> This was not a cascading failure.  It was a simple power outage

Actually, it was a very complex power outage. I'm going to assume that what happened this weekend was similar to the event that happened at the same facility approximately two weeks ago. (Whether it was is immaterial: the details are probably different, but that event illustrates the complexity of a data center failure.)

Utility Power Failed
First Backup Generator Failed (shut down due to a faulty fan)
Second Backup Generator Failed (breaker coordination problem resulting in faulty trip of a breaker)

In this case, it was clearly a cascading failure, although limited in scope. The failure in this case also clearly involved people. There was one material failure (the fan), but the system should have been resilient enough to deal with it. The system should also have been resilient enough to deal with the breaker coordination issue (which should not have occurred), but it was not. Data centers are not commodities. There is a way to engineer these facilities to be much more resilient, but not everyone's business model supports it.

- Dan

> >>
> >> Cascading failures involve interdependencies among components.
> >
> >
> > Not always.  Cascading failures can also occur when there is zero
> > dependency between components.  The simplest form of this is where
> > one environment fails over to another, but the target environment is
> > not capable of handling the additional load and then "fails" itself
> > as a result (in some form or other, but frequently different to the
> > mode of the original failure).
> 
> indeed.  and that is an interdependency among components.  in
> particular, it is a capacity interdependency.
> 
> > Whilst the Amazon outage might have been a "simple" power outage,
> > it's likely that at least some of the website outages caused were a
> > combination of not just the direct Amazon outage, but also the
> > flow-on effect of their redundancy attempting (but failing) to kick
> > in - potentially making the problem worse than just the Amazon
> > outage caused.
> 
> i think you over-estimate these websites.  most of them simply have no
> redundancy (and obviously have no tested, effective redundancy) and
> were simply hoping that amazon didn't really go down that much.
> 
> hope is not the best strategy, as it turns out.
> 
> i suspect that randy is right though:  many of these businesses do not
> promise perfect uptime and can survive these kinds of failures with
> little loss to business or reputation.  twitter has branded its early
> failures with a whale that not only didn't hurt it but helped endear the
> service to millions.  when your service fits these criteria, why would
> you bother doing the complicated systems and application engineering
> necessary to actually have functional redundancy?
> 
> it simply isn't worth it.
> 
> t
> 
> >
> >   Scott