Amazon diagnosis

Paul Graydon paul at
Sun May 1 20:03:49 UTC 2011

On 5/1/2011 9:29 AM, Jeff Wheeler wrote:
> On Sun, May 1, 2011 at 2:18 PM, Andrew Kirch<trelane at>  wrote:
>> Sure they can, but as a thought exercise fully 2n redundancy is
>> difficult on a small scale for anything web facing.  I've seen a very
>> simple implementation for a website requiring 5 9's that consumed over
>> $50k in equipment, and this wasn't even geographically diverse.  I have
> What it really boils down to is this: if application developers are
> doing their jobs, a given service can be easy and inexpensive to
> distribute to unrelated systems/networks without a huge infrastructure
> expense.  If the developers are not, you end up spending a lot of
> money on infrastructure to make up for code, databases, and APIs which
> were not designed with this in mind.
> These same developers who do not design and implement services with
> diversity and redundancy in mind will fare little better with AWS than
> any other platform.  Look at Reddit, for example.  This is an
> application/service which is utterly trivial to implement in a cheap,
> distributed manner, yet they have failed to do so for years, and
> suffer repeated, long-duration outages as a result.  They probably buy
> a lot more AWS services than would otherwise be needed, and truly have
> a more complex infrastructure than such a simple service should.
> IT managers would do well to understand that a few smart programmers,
> who understand how all their tools (web servers, databases,
> filesystems, load-balancers, etc.) actually work, can often do more to
> keep infrastructure cost under control, and improve the reliability of
> services, than any other investment in IT resources.
If you want a perfect example of this, consider Netflix.  Their 
infrastructure runs on AWS and we didn't see any downtime with them 
throughout the entire affair.
One of the interesting things they've done to enforce reliability of
services is an in-house service called Chaos Monkey, whose sole purpose
is to randomly kill instances and services inside the infrastructure.
Courtesy of Chaos Monkey and the defensive programming it enforces, no
service has a hard dependency on another, so you will always get at
least some form of service.  For example, if the recommendation engine
dies, the application is smart enough to catch that and instead return
a list of the most popular movies, and so on.  There is an interesting
blog post from their Director of Engineering about what they learned on
their migration to AWS, including using less chatty APIs to reduce the
impact of typical AWS latency:
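To make the fallback idea concrete, here is a rough Python sketch of the
pattern described above.  It is not Netflix's code; every name in it
(fetch_recommendations, titles_for_user, MOST_POPULAR, the failure_rate
knob) is made up for illustration, and the "kill" is only a simulated
failure inside one process rather than a real instance being terminated.

```python
import random

# Hypothetical static fallback; a real service might serve a cached
# "most popular" list here instead.
MOST_POPULAR = ["Movie A", "Movie B", "Movie C"]


class RecommendationUnavailable(Exception):
    """Raised when the (hypothetical) recommendation backend cannot answer."""


def fetch_recommendations(user_id, failure_rate=0.0, rng=random):
    """Stand-in for a remote call to a recommendation service.

    failure_rate crudely imitates a Chaos-Monkey-style kill: with that
    probability the backend is treated as dead and the call raises.
    """
    if rng.random() < failure_rate:
        raise RecommendationUnavailable("recommendation backend is down")
    return [f"Pick for user {user_id} #{i}" for i in (1, 2, 3)]


def titles_for_user(user_id, failure_rate=0.0, rng=random):
    """Return personalised titles, degrading to the popular list on failure."""
    try:
        return fetch_recommendations(user_id, failure_rate, rng)
    except RecommendationUnavailable:
        # The dependency is gone, but the caller still gets a usable answer.
        return list(MOST_POPULAR)
```

Calling titles_for_user(42, failure_rate=1.0) always takes the fallback
path, while failure_rate=0.0 always returns the personalised list.
Running tests with a non-zero failure_rate is a cheap way to check that
callers never assume the dependency is up.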

