Amazon diagnosis

Sun May 1 20:32:07 UTC 2011

On Sun, May 01, 2011 at 12:50:37PM -0700, George Bonser wrote:
> 
> From my reading of what happened, it looks like they didn't have a
> single point of failure but ended up routing around their own
> redundancy.
> 
> They apparently had a redundant primary network and, on top of that, a
> secondary network.  The secondary network, however, did not have the
> capacity of the primary network.
> 
> Rather than failing over from the active portion of the primary network
> to the standby portion of the primary network, they inadvertently failed
> the entire primary network to the secondary.  This resulted in the
> secondary network reaching saturation and becoming unusable.
> 
> There isn't anything that can be done to mitigate against human error.
> You can TRY, but as history shows us, it all boils down to the human that
> implements the procedure.  All the redundancy in the world will not do
> you an iota of good if someone explicitly does the wrong thing.
>   [ ... ]
> 
> This looks like it was a procedural error and not an architectural
> problem.  They seem to have had standby capability on the primary
> network and, from the way I read their statement, did not use it.

The procedural error was putting all the traffic on the secondary
network.  They promptly recognized that error, and fixed it.  It's
certainly true that you can't eliminate human error.

The architectural problem is that they had insufficient error recovery
capability.  Initially, the system was trying to use a network that was
too small; that situation lasted for some number of minutes; it's no
surprise that the system couldn't operate under those conditions and
that isn't an indictment of the architecture.  However, after they put
it back on a network that wasn't too small, the service stayed
down/degraded for many, many hours.  That's an architectural problem. 
(And a very common one.  Error recovery is hard and tedious and more
often than not, not done well.)
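
To make the "short fault, long outage" point concrete, here is a toy
backlog model.  It is only a sketch under my own assumptions, not a
description of what actually happened inside AWS: the function name and
all of the numbers are made up for illustration.  The idea is that work
which piles up while the system is broken has to be drained afterwards
using whatever headroom is left over after normal load.

```python
def minutes_to_recover(outage_min, arrival_per_min, capacity_per_min):
    """Toy model: work that arrives during the outage queues up, then
    drains only as fast as spare capacity allows once service returns."""
    backlog = outage_min * arrival_per_min        # work queued during the outage
    spare = capacity_per_min - arrival_per_min    # headroom left for catch-up
    if spare <= 0:
        return float("inf")                       # the backlog never drains
    return backlog / spare


# Hypothetical numbers: a 30 minute fault on a system with 5% headroom.
print(minutes_to_recover(30, 100, 105))  # 600.0
```

With those made-up numbers, a 30 minute fault takes 10 hours to clear
once the network is back.  Recovery time depends on headroom, not on
how quickly the original mistake was corrected.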

Procedural error isn't the only way to get into that boat.  If the
wrong pair of redundant equipment in their primary network failed
simultaneously, they'd likely have found themselves in the same boat: a
short outage caused by a risk they accepted (loss of a pair of
redundant hardware), followed by a long outage (after they restored the
network) caused by insufficient recovery capability.

Their writeup suggests they fully understand these issues and are doing
the right thing by seeking to have better recovery capability.  They
spent one sentence saying they'll look at their procedures to reduce
the risk of a similar procedural error in the future, and then spent
paragraphs on what they are going to do to have better recovery should
something like this occur in the future.

(One additional comment, for whoever posted that NetFlix had a better
architecture and wasn't impacted by this outage.  It might well be that
NetFlix does have a better architecture and that might be why they
weren't impacted ... but there's also the possibility that they just
run in a different region.  Lots of entities with poor architecture
running on AWS survived this outage just fine, simply by not being in
the region that had the problem.)

     -- Brett