FYI Netflix is down

Tue Jul 3 20:00:07 UTC 2012

Jon Lewis wrote:
> It seems like if you're going to outsource your mission critical
> infrastructure to "cloud" you should probably pick at least 2
> unrelated cloud providers and if at all possible, not outsource the
> systems that balance/direct traffic...and if you're really serious
> about it, have at least two of these setup at different facilities
> such that if the primary goes offline, the secondary takes over. If a
> cloud provider fails, you redirect to another.

Really, you need at least three independent providers: one primary
(A), one backup (B), and one "witness" to monitor the others for
failure. The witness site can of course be low-powered, as it is not
in the data plane of the applications, but just participates in the
control plane. In the event of a loss of communication, the majority
clique wins, and the isolated environments shut themselves down. This
is of course how any sane clustering setup has protected against
"split brain" scenarios for decades.
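
To make the majority rule concrete, here is a rough sketch in Python.
It is only an illustration, not any particular cluster manager's
logic. The site names "A", "B", "W", the action strings, and the
resolve() helper are all made up for the example, and it assumes a
clean partition, where every site in a group can reach every other
site in that group.

```python
from typing import Dict, List, Set

# Hypothetical site names: primary, backup, and witness.
PRIMARY, BACKUP, WITNESS = "A", "B", "W"
SITES = {PRIMARY, BACKUP, WITNESS}


def resolve(partition: List[Set[str]]) -> Dict[str, str]:
    """Decide what each site does, given groups of mutually reachable sites.

    Assumes a clean partition: the groups are disjoint and together
    cover all of SITES.
    """
    actions: Dict[str, str] = {}
    for group in partition:
        # A group wins only with a strict majority of all sites.
        has_quorum = len(group) > len(SITES) // 2
        for site in group:
            if not has_quorum:
                actions[site] = "shutdown"   # isolated minority fences itself
            elif site == WITNESS:
                actions[site] = "observe"    # control plane only, no data
            elif site == PRIMARY:
                actions[site] = "serve"
            else:
                # Backup takes over only if the primary is outside its group.
                actions[site] = "standby" if PRIMARY in group else "serve"
    return actions


if __name__ == "__main__":
    # Primary A is cut off; B and the witness still see each other.
    print(resolve([{"A"}, {"B", "W"}]))
```

With A isolated, B and W form the majority, so B serves and A shuts
itself down. With only two sites there is no way to tell "the other
side died" from "the link between us died", which is why the third
vote matters.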

Doing it the right way makes the cloud far less cost-effective and far
less "agile". Once you get it all set up just so, change becomes very
difficult. All the monitoring and fail-over/fail-back operations are
generally application-specific and provider-specific, so there's a lot
of lock-in. Tools like RightScale are a step in the right direction,
but they don't really touch the application layer. Relying on one also
means worrying about the availability of yet another provider!
-- 
RPM