OT: Question/Netflix issues?
paul at paulgraydon.co.uk
Wed Mar 23 20:42:04 CDT 2011
On 03/23/2011 09:41 AM, sillywizard at rs4668.com wrote:
> "Lyndon Nerenberg (VE6BBM/VE7TFX)"<lyndon at orthanc.ca> wrote:
>>> Guess that move to Amazon EC2 wasn't such a good idea. First reddit,
>>> now netflix.
>> FWIW, at $DAYJOB we haven't been able to run out a pool of a couple of
>> dozen EC2 instances for more than two weeks (since last June) without
>> at least one of them going down. The same number of hardware servers
>> we ran ourselves in Peer1 ran for a couple of years with no unplanned
>> Amortized over five years, Peer1 colo + hardware is also cheaper than
>> the equivalent EC2 cost.
>> Hey everyone! Join the cloud, and stand in the pissing rain.
> Interesting, because we run 120 with almost no issues whatsoever (3 failures over the past 12 months, none of which caused downtime). I've never had an EBS volume fail in the 18 months we've used them. IMHO, the "issues" with the cloud are almost always at a layer above the infrastructure.
Reddit has routinely had EBS volumes either outright fail (2 major
outages in the last month/month and a half, both caused by several EBSs
vanishing), or show some not insignificant degradation in performance,
and it seems barely a month goes by when I don't hear someone on Twitter
talking about similar problems with their infrastructure. Most of the problems
I've heard about do seem to revolve around EBS, however, rather than
their other services. It may be just the nature of people to pick on
and shout about the biggest targets, but I'm reasonably sure almost all
the problems I hear about relating to cloud services revolve around
Amazon and rarely their competitors.
When it comes to other layers in the infrastructure probably one of the
most talked about problems is network latency between instances.
Netflix had to specifically re-engineer their platform because of it
(and other major users talk of similar changes). There is almost
certainly an argument to be made that the outcome of the forced
re-engineering is a good thing as it's generally boosting resilience,
but that it's been forced on them in such a way should surely also be
some cause for concern.
Reddit seem to be working hard to make their platform as resilient as
possible to the routine problems caused by the infrastructure. One of
their outgoing devs gave a pretty interesting read on the problems
they'd experienced with Amazon:
I absolutely do think cloud hosting / virtual servers have value and use
and shouldn't be underestimated or written off as a fad, but I'm also
not entirely convinced at the moment that Amazon is a vendor to
particularly trust with such services. I'd probably also argue that
anyone keeping their eggs in one basket and relying on a single vendor
for such services is taking a significant risk. There are plenty of
tools and libraries out there to help provide a standard API for rolling
out servers on different platforms. It seems crazy not to take
advantage of the flexibility the cloud offers to remove as many SPOFs as
possible.
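
As a rough illustration of the "standard API across vendors" idea, here
is a minimal Python sketch. Everything in it is an assumption made for
illustration: the Provider interface, FakeProvider and provision() are
hypothetical names, not the API of any real library, and the fake
providers only record hostnames in memory rather than talking to a
vendor. The point is just that if every vendor sits behind the same
interface, falling back from one to another is a short loop.

```python
from abc import ABC, abstractmethod


class ProvisionError(Exception):
    """Raised when a provider cannot start a server."""


class Provider(ABC):
    """Hypothetical vendor-neutral interface; not any real library's API."""

    name: str

    @abstractmethod
    def create_server(self, hostname: str) -> str:
        """Start a server and return an identifier for it."""


class FakeProvider(Provider):
    """In-memory stand-in for a real vendor driver (assumption for the sketch)."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy
        self.servers = []

    def create_server(self, hostname: str) -> str:
        if not self.healthy:
            raise ProvisionError(f"{self.name} is unavailable")
        self.servers.append(hostname)
        return f"{self.name}:{hostname}"


def provision(hostname, providers):
    """Try each provider in order and return the first successful server id."""
    errors = []
    for provider in providers:
        try:
            return provider.create_server(hostname)
        except ProvisionError as exc:
            # Remember why this vendor failed, then move on to the next one.
            errors.append(str(exc))
    raise ProvisionError("all providers failed: " + "; ".join(errors))
```

With a list such as [FakeProvider("vendor-a", healthy=False),
FakeProvider("vendor-b")], provision("web1", ...) skips the unavailable
vendor and starts the server on the second one. A real setup would swap
the fakes for drivers that wrap each vendor's own API.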