Amazon diagnosis

Sun May 1 21:35:29 UTC 2011

> They apparently had a redundant primary network and, on top of that, a
> secondary network.  The secondary network, however, did not have the
> capacity of the primary network.
> Rather than failing over from the active portion of the primary network
> to the standby portion of the primary network, they inadvertently failed
> the entire primary network to the secondary.  This resulted in the
> secondary network reaching saturation and becoming unusable.
> There isn't anything that can be done to mitigate against human error.
> You can TRY, but as history shows us, it all boils down to the human that
> implements the procedure.  All the redundancy in the world will not do
> you an iota of good if someone explicitly does the wrong thing.  ...
> This looks like it was a procedural error and not an architectural
> problem.  

A sage sayeth sooth: 

      "For any 'fool-proof' system, there exists 
       a *sufficiently*determined* fool capable of
       breaking it."

It would seem that the validity of that has just been re-confirmed.  <wry grin>

It is worthy of note that it is considerably harder to protect against
accidental stupidity than it is to protect against intentional malice.
('malice' is _much_ more predictable, in general.  <wry grin>)

