gbonser at seven.com
Sun May 1 19:50:37 UTC 2011
> I am still waiting for proof that single points of failure can
> realistically be completely eliminated from any moderately complicated
> network environment / application. So far, I think murphy is still
> winning on this one.
> Good job by the AWS team however, I am sure your new procedures and
> processes will receive a shakeout again, and it will be interesting to
> see how that goes. I bet there will be more to learn along this road
> for us all.
From my reading of what happened, it looks like they didn't have a
single point of failure but ended up routing around their own
redundancy.
They apparently had a redundant primary network and, on top of that, a
secondary network. The secondary network, however, did not have the
capacity of the primary network.
Rather than failing over from the active portion of the primary network
to the standby portion of the primary network, they inadvertently failed
the entire primary network to the secondary. This resulted in the
secondary network reaching saturation and becoming unusable.
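The capacity mismatch described above can be sketched with a toy model.
All of the numbers below are invented for illustration; the actual
capacities of Amazon's networks were not disclosed:

```python
# Hypothetical capacity model of the failover described above.
# The figures are made up; only the relationship between them matters:
# the standby half of the primary is sized for the full load, while the
# secondary network is not.

def carried_fraction(traffic_gbps, target_capacity_gbps):
    """Fraction of offered traffic the target network can actually carry."""
    return min(1.0, target_capacity_gbps / traffic_gbps)

PRIMARY_ACTIVE = 100.0   # load on the active portion of the primary
PRIMARY_STANDBY = 100.0  # standby portion, sized to absorb that load
SECONDARY = 20.0         # lower-capacity secondary network

# Intended failover: active -> standby portion of the primary.
print(carried_fraction(PRIMARY_ACTIVE, PRIMARY_STANDBY))  # 1.0, no loss

# What happened: the entire primary load lands on the secondary,
# which saturates and drops most of the traffic.
print(carried_fraction(PRIMARY_ACTIVE, SECONDARY))        # 0.2
```

The point of the sketch is that the secondary is "up" in both cases; it
just cannot carry the load, which is exactly the saturated-but-present
failure mode discussed below.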
There isn't anything that can be done to fully mitigate human error.
You can TRY, but as history shows us, it all boils down to the human
who implements the procedure. All the redundancy in the world will not
do you an iota of good if someone explicitly does the wrong thing. In this
case it is my opinion that Amazon should not have considered their
secondary network to be a true secondary if it was not capable of
handling the traffic. A completely broken network might have been an
easier failure mode to handle than a saturated network (high packet loss
but the network is "there").
This looks like it was a procedural error and not an architectural
problem. They seem to have had standby capability on the primary
network and, from the way I read their statement, did not use it.