FYI Netflix is down

Sat Jun 30 20:24:41 UTC 2012

scott,

>>
>> This was not a cascading failure.  It was a simple power outage
>>
>> Cascading failures involve interdependencies among components.
>
>
> Not always.  Cascading failures can also occur when there is zero dependency
> between components.  The simplest form of this is where one environment
> fails over to another, but the target environment is not capable of handling
> the additional load and then "fails" itself as a result (in some form or
> other, but frequently different to the mode of the original failure).

indeed.  and that is an interdependency among components.  in
particular, it is a capacity interdependency.
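
a minimal sketch of that capacity interdependency, under assumptions that are mine and not from any real deployment: the site names, capacities and load figures are hypothetical, and the total load is assumed to be shared evenly among whatever sites are still up.

```python
def cascade(capacities, total_load, initially_failed):
    """Return the set of sites that end up down.

    capacities: dict of site name -> max load it can carry (hypothetical units)
    total_load: total load, assumed to be split evenly across live sites
    initially_failed: sites lost to the original outage (e.g. a power failure)
    """
    down = set(initially_failed)
    while True:
        up = [name for name in capacities if name not in down]
        if not up:
            return down  # nothing left to fail over to
        share = total_load / len(up)
        # a surviving site fails if its share exceeds what it can carry
        overloaded = {name for name in up if capacities[name] < share}
        if not overloaded:
            return down  # survivors absorbed the load; no further cascade
        down |= overloaded


# the failover target is too small: the second failure follows from the first
undersized = cascade({"east": 100, "west": 60}, 120, {"east"})

# the failover target is provisioned for the full load: the outage stays contained
provisioned = cascade({"east": 100, "west": 150}, 120, {"east"})
```

the two sites share no code path, only load, which is exactly the coupling in question.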

> Whilst the Amazon outage might have been a "simple" power outage, it's
> likely that at least some of the website outages caused were a combination
> of not just the direct Amazon outage, but also the flow-on effect of their
> redundancy attempting (but failing) to kick in - potentially making the
> problem worse than just the Amazon outage caused.

i think you over-estimate these websites.  most of them simply have no
redundancy (and obviously have no tested, effective redundancy) and
were simply hoping that amazon didn't really go down that much.

hope is not the best strategy, as it turns out.

i suspect that randy is right though:  many of these businesses do not
promise perfect uptime and can survive these kinds of failures with
little loss to business or reputation.  twitter has branded its early
failures with a whale that not only didn't hurt it but helped endear
the service to millions.  when your service fits these criteria, why
would you bother doing the complicated systems and application
engineering necessary to actually have functional redundancy?

it simply isn't worth it.

t

>
>   Scott