OT: Question/Netflix issues?

Thu Mar 24 01:42:04 UTC 2011

On 03/23/2011 09:41 AM, sillywizard at rs4668.com wrote:
> "Lyndon Nerenberg (VE6BBM/VE7TFX)"<lyndon at orthanc.ca>  wrote:
>
>>> Guess that move to Amazon EC2 wasn't such a good idea. First reddit,
>>> now netflix.
>>> http://techblog.netflix.com/2010/12/four-reasons-we-choose-amazons-cloud-as.html
>> FWIW, at $DAYJOB we haven't been able to run out a pool of a couple of
>> dozen EC2 instances for more than two weeks (since last June) without
>> at least one of them going down.  The same number of hardware servers
>> we ran ourselves in Peer1 ran for a couple of years with no unplanned
>> outages.
>>
>> Amortized over five years, Peer1 colo + hardware is also cheaper than
>> the equivalent EC2 cost.
>>
>> Hey everyone! Join the cloud, and stand in the pissing rain.
>>
>> --lyndon
>>
> Interesting, because we run 120 with almost no issues whatsoever (3 failures over the past 12 months, none of which caused downtime). I've never had an EBS volume fail in the 18 months we've used them. IMHO, the "issues" with the cloud are almost always at a layer above the infrastructure.
>
> --L
>
Reddit has routinely had EBS volumes either fail outright (2 major
outages in the last month to month and a half, both caused by several
EBS volumes vanishing) or show some not-insignificant degradation in
performance, and it seems barely a month goes by when I don't hear
someone on Twitter talking about similar problems with their
infrastructure.  Most of the problems I've heard about do seem to
revolve around EBS, however, rather than Amazon's other services.  It
may just be the nature of people to pick on and shout about the biggest
targets, but I'm reasonably sure almost all the problems I hear about
relating to cloud services involve Amazon and rarely their competitors.

http://highscalability.com/blog/2010/12/20/netflix-use-less-chatty-protocols-in-the-cloud-plus-26-fixes.html
When it comes to other layers in the infrastructure, probably one of the
most talked-about problems is network latency between instances.
Netflix had to specifically re-engineer their platform because of it
(and other major users talk of similar changes).  There is almost
certainly an argument to be made that the outcome of the forced
re-engineering is a good thing, as it's generally boosting resilience,
but that it's been forced on them in such a way should surely also be
some cause for concern.
Reddit seem to be working hard to make their platform as resilient as
possible to the routine problems caused by the infrastructure.  One of
their outgoing devs wrote a pretty interesting account of the problems
they'd experienced with Amazon:
http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_down_for_6_of_the_last_24_hours/c1l6ykx

I absolutely do think cloud hosting / virtual servers have value and use 
and shouldn't be underestimated or written off as a fad, but I'm also 
not entirely convinced at the moment that Amazon is a vendor to 
particularly trust with such services.  I'd probably also argue that
anyone keeping their eggs in one basket and relying on a single vendor 
for such services is taking a significant risk.  There are plenty of 
tools and libraries out there to help provide a standard API for rolling 
out servers on different platforms.  It seems crazy not to take 
advantage of the flexibility the cloud offers to remove as many SPOFs as 
possible.
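
To give a rough idea of what I mean by a standard API, here is a minimal
Python sketch.  Everything in it is hypothetical: Provider, FakeProvider
and spread() are names made up for illustration, and the "vendors" are
in-memory stand-ins rather than any real library's drivers.  It only
shows the shape of the idea, which is that once every vendor sits behind
the same interface, spreading servers across several of them is trivial.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import Protocol


class Provider(Protocol):
    """Hypothetical common interface every vendor driver would implement."""
    name: str

    def create_server(self, label: str) -> str: ...


@dataclass
class FakeProvider:
    """In-memory stand-in for a real vendor driver (illustration only)."""
    name: str
    servers: list = field(default_factory=list)

    def create_server(self, label: str) -> str:
        # A real driver would call the vendor's API here.
        server_id = f"{self.name}-{label}"
        self.servers.append(server_id)
        return server_id


def spread(providers, count, prefix="web"):
    """Create `count` servers, alternating between vendors round-robin."""
    if not providers:
        raise ValueError("need at least one provider")
    rotation = cycle(providers)
    return [next(rotation).create_server(f"{prefix}{i}") for i in range(count)]


if __name__ == "__main__":
    vendors = [FakeProvider("vendor-a"), FakeProvider("vendor-b")]
    print(spread(vendors, 4))
```

With two vendors and four servers, each vendor ends up holding two, so
losing either one outright still leaves half the pool running.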

Paul