2008.02.18 NANOG 42 Keynote talk--Amazon, and taming complex systems

Matthew Petach mpetach at netflight.com
Mon Feb 18 20:58:39 UTC 2008


Wow.  I just gotta say again--wow!  Kudos to Josh for pulling
in a very key pair of speakers on a very important topic for
all of us.  I stopped jotting down notes from the slides,
and focused on what they were saying very quickly, because
the lessons learned are so crucial for any network attempting
to scale.  Apologies in advance for typos, etc. that leapt in
while I was typing.

And definitely read through the presentation--I thought we
were doing well at rolling out datacenters in 5 weeks, but
these guys completely pwned us, rolling out datacenters
in two weeks!!

URL for talk is at
http://www.nanog.org/mtg-0802/keynote.html

Matt


2008.02.18 Amazon Keynote Talk

Josh Snowhorn introduces the keynote speaker
as the program committee member who got Amazon
to present.

Tom Killalea, VP of technology,
Dan Cohn, principal engineer

Earth's most customer-centric company
for the past 13 years

Consumers
Sellers
Developers

Consumers and Sellers
Provide a place where people can find and discover
anything they may want to buy online

800,000 sq foot warehouse in Nevada, have about
13 buildings spread around doing fulfillment.

Sortation devices make sure the right people
get the right products.

Software Developers

Web Scale computing services...
want to give people access to their resources;
free up developers from doing the heavy lifting
of launching a web service, and focus on the
interesting bits.

Don't deal with Muck, focus on APIs.

Let developers focus on delivering solutions.

Developers want a few key items:

Storage
Computing
Queues
Queries

bandwidth utilized (Amazon web services and website)
The orange line is the historical website traffic.
The blue line is web services; the web services
business is now a larger bandwidth consumer than
the internal websites.
They've been growing rapidly, but have also
increased the ratio of network devices to engineers.

Jim Gray
A planetary-scale distributed system operated
by a single part-time operator.

Can we provide infrastructure "muck" so the
network engineers don't have to worry about it?

Goal is to abstract it as much as possible.

How should we trade off consistency, availability,
and network partition tolerance (CAP)
Eric Brewer claims you can have at most 2 of these
invariants for any shared-data system

This is a challenge of tradeoffs; hard consistency
is nearly impossible in very large systems, so you
deal with versioning instead.
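
(A rough sketch of my own to illustrate the versioning idea;
the vector-clock scheme below is the textbook approach, not
necessarily what Amazon actually runs:)

  # Instead of forcing hard consistency, each write carries a vector
  # clock; a read that finds concurrent versions returns all of them
  # and lets the application reconcile.

  def dominates(a, b):
      """True if clock a is causally at or ahead of clock b on every node."""
      nodes = set(a) | set(b)
      return all(a.get(n, 0) >= b.get(n, 0) for n in nodes)

  def reconcile(versions):
      """Drop versions that are strictly older; keep concurrent siblings."""
      survivors = []
      for value, clock in versions:
          dominated = any(dominates(other, clock) and other != clock
                          for _, other in versions)
          if not dominated:
              survivors.append((value, clock))
      return survivors

  # Two replicas accepted writes concurrently; neither clock dominates,
  # so both versions survive and the application must merge them.
  print(reconcile([("cart-v1", {"A": 2, "B": 1}),
                   ("cart-v2", {"A": 1, "B": 2})]))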

Real-time dynamic dependency discovery?
Goal is to not have it be a static system.

Recovery-oriented computing?
Can you protect yourself from downstream damage?

Communications infrastructure that scales infinitely

Answers involve taking a holistic view


MAYA, Machine Anomaly Analyzer
maps server to remote service being called, showing
latency and health of the remote service
All of the content is scheduled by people, so
the dependency tree changes over time; you can't
track it statically, but you want to see
what's happening right at the moment.

They show a call to the main amazon page, and all
the tendrils are remote calls that have to happen
before the main page gets rendered.
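
(My own sketch of the general idea, assuming per-call records of
caller, callee, latency, and success; MAYA's actual data model
wasn't shown:)

  from collections import defaultdict

  # One record per remote call observed "right now"; names are made up.
  calls = [
      ("render-home", "recommendations", 42.0, True),
      ("render-home", "order-db",        11.0, True),
      ("recommendations", "dynamo",       8.0, True),
      ("render-home", "recommendations", 95.0, False),
  ]

  deps = defaultdict(list)
  for caller, callee, latency_ms, ok in calls:
      deps[(caller, callee)].append((latency_ms, ok))

  # Snapshot of the dependency tree as it exists at this moment, with
  # per-edge latency and health, rather than a static diagram.
  for (caller, callee), samples in deps.items():
      avg = sum(lat for lat, _ in samples) / len(samples)
      errors = sum(1 for _, ok in samples if not ok) / len(samples)
      print(f"{caller} -> {callee}: avg {avg:.1f} ms, errors {errors:.0%}")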

Simplicity, auto-configurability across the whole
infrastructure stack

Needs a different approach to engineering; put the
cycle together, share objectives across the different
engineering disciplines involved in building and
designing the system.

Other constraints
"Software above the level of a single device"
Tim O'Reilly

Client requests come in to page rendering components,
then request routing to aggregator services,
which have request routing to services such as
Dynamo, S3, order database, etc.

Applications with fault zones wider than a single
rack.
(melted servers shown)

fault zones wider than a single datacenter

Latency graphs--a significant issue

Orange is reads, red is writes;
average latency, and 99.9th-percentile latencies.
Very acute attention paid to both; look at the bad
as well as the average.
They aim at the 99.9th-percentile latency to see how
they do with convergence time and fast restoration.
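
(To make the point concrete, here's a quick sketch of mine with
synthetic numbers showing why the 99.9th percentile catches what
the average hides:)

  import random

  random.seed(1)
  latencies = [random.gauss(12, 2) for _ in range(10000)]     # normal requests
  latencies += [random.uniform(200, 400) for _ in range(20)]  # rare slow tail

  def percentile(data, p):
      s = sorted(data)
      idx = min(len(s) - 1, round(p / 100.0 * (len(s) - 1)))
      return s[idx]

  mean = sum(latencies) / len(latencies)
  print(f"mean  : {mean:.1f} ms")                         # barely moves above ~12 ms
  print(f"p99.9 : {percentile(latencies, 99.9):.1f} ms")  # lands in the slow tail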

Change management
maintenance windows (none)
latency considerations
availability considerations

With no maintenance windows, they have to be very sensitive
to latency and availability, and they're sensitive to even
minor perturbations.

Hire good people, but even good people are only good to 3 sigma.
Even good people are not so good with complex systems that scale
large and fast.

So, it's all about automation

self-configuring, self-verifying, n-scale automation for
network elements.

Want to make sure they don't have to have humans repeating
work; build blocks on each other.


APIs keep humans out of repetitive, error-prone processes.

Configuration of services that would typically involve
engineers is automated and invoked via APIs.
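
(As a made-up illustration of what "called via APIs" looks like; the
service name, endpoint, and fields here are hypothetical, not Amazon's
actual interface:)

  import json
  import urllib.request

  def request_vip(service_name, backend_ips, port):
      """Ask a hypothetical provisioning service for a load-balancer VIP
      instead of having an engineer configure it by hand."""
      payload = json.dumps({
          "service": service_name,
          "backends": backend_ips,
          "port": port,
      }).encode()
      req = urllib.request.Request(
          "https://netconfig.example.internal/v1/vips",  # hypothetical endpoint
          data=payload,
          headers={"Content-Type": "application/json"},
      )
      with urllib.request.urlopen(req) as resp:
          # The service validates the request, renders device config from
          # templates, and pushes it with no human in the loop.
          return json.load(resp)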

Think differently about policy

If it isn't built into the process,
   if you have to search for it,
   if it isn't auto-enforced ... it may as well not exist.

Systems verify and enforce policy
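
(A minimal sketch of my own of what policy verification can look like:
compare intended policy against observed device state and flag
variances; the data shapes here are invented:)

  policy = {
      "edge-router-1": {"ntp_servers": {"10.0.0.1", "10.0.0.2"},
                        "snmp_community": "not-public"},
  }

  observed = {
      "edge-router-1": {"ntp_servers": {"10.0.0.1"},
                        "snmp_community": "public"},
  }

  def find_variances(policy, observed):
      """Every mismatch either triggers auto-remediation or a policy update."""
      variances = []
      for device, intended in policy.items():
          actual = observed.get(device, {})
          for key, want in intended.items():
              have = actual.get(key)
              if have != want:
                  variances.append((device, key, want, have))
      return variances

  for device, key, want, have in find_variances(policy, observed):
      print(f"{device}: {key} is {have!r}, policy says {want!r}")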


Make scaling simple
simplicity isn't achievable as a passive goal
it is a force that must be actively applied

Network simplicity at scale
always anticipate the next order of magnitude of growth,
 even if it's a challenge

from a shell to a fully in-production datacenter in 2 weeks.
This is their challenge and goal.

Make room to tackle the big questions
eliminate mundane work.

What is mundane work?

Everything that can not be easily defined by heuristics
or templating

deployment of network devices

Operational problems too

Basically, automate everything that is normally limited
to human speed.

troubleshooting, mitigation, problem identification,
detail investigation, alarming, ticketing, workarounds,
repair,

Cool
but realistic?

Take what's implicit to network engineers, make it
explicit in policy and code.

How to start?

No need to throw out existing network or designs!

The network is the only authoritative resource that exists.

They take small parts, define the bits of policy, and
then write systems to check, and iterate.

Reconciliation of policy vs reality
creates a self-enforcing cycle
You find variances, you either update policy, or
enforce it.

Iterate on high-impact/high-frequency problems

continue cycle, it's a forcing function to get
people to understand impact of policy decisions
and how to define them network-wide.

Results in normalized, consistent, repeatable deployments

It can take multiple iterations to get to simple

case study--multicast at amazon
very heavy user of multicast internally, about 7 years now.

simple multicast based approach to make publishing and
subscribing topics of interest really easy
  perhaps too easy
we quickly had more (s,g) state than any other network
 in the world
By over 2x

Once they made subscription of data resources easy,
adoption took off like wildfire, and growth surpassed
what they anticipated.
It was challenging the forwarding and state-tracking
capabilities of the vendor hardware.

iterative design process for multicast infrastructure
PIM-SM (BSR)
PIM-SM (no SPT)
PIM-SM (static anycast RP)
PIM-SM (static anycast RP + MSDP)
 but all still require tracking s,g state which is
 too big
PIM-BIDIR
 each transition done on the live network, with no
 scheduled outages; very challenging.

1 year later, we got it to 'simple' again on BIDIR;
been using BIDIR in core for two or three years now.
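
(Back-of-the-envelope arithmetic of my own on why (s,g) state explodes
and why BIDIR helps; the numbers are invented, not Amazon's:)

  # PIM-SM keeps one (S,G) entry per active sender per group; PIM-BIDIR
  # keeps only (*,G) state per group, independent of the number of senders.
  groups = 5000              # pub/sub topics mapped to multicast groups
  senders_per_group = 40     # hosts publishing into each topic

  sm_entries = groups * senders_per_group   # (S,G) state grows with senders
  bidir_entries = groups                    # (*,G) state does not

  print(f"PIM-SM    : {sm_entries:,} mroute entries")
  print(f"PIM-BIDIR : {bidir_entries:,} mroute entries")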

Systems architecture as a part of holistic simplicity.
loosely homogeneous
 rows of largely identical racks
built as-needed
accessed through APIs
 a property makes a request when it needs a resource
On-demand assignment
On-demand release

GenericCapacityRequest
placed pseudo-randomly into slots in the datacenter
programmatic increases and decreases as required
requires host level application deployment architecture

thousands of applications doing this
leads to very interesting network flows if unconstrained

leads to complexity again if unconstrained.
DiscoveryService
NetworkLocalityService (what's the closest/best available
 resource of that type)
Answer is informed by definition of topology

NetworkLocalityService
Large-scale SOA
fragmented capacity allocation, just like any other component
need to provide a programmatic answer to the question
 of network locality
NetworkLocality(srcIP, candDestIP1, candDestIP2, etc)
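
(A hedged sketch of mine of what a call like that could do: order the
candidate destinations by latency tier using a topology definition; the
prefix-to-site mapping and tier rules below are invented:)

  import ipaddress

  # Hypothetical topology definition: prefix -> (datacenter, row)
  TOPOLOGY = {
      "10.1.0.0/16": ("dc-east", "row-3"),
      "10.2.0.0/16": ("dc-east", "row-7"),
      "10.9.0.0/16": ("dc-west", "row-1"),
  }

  def locate(ip):
      addr = ipaddress.ip_address(ip)
      for prefix, site in TOPOLOGY.items():
          if addr in ipaddress.ip_network(prefix):
              return site
      return ("unknown", "unknown")

  def latency_tier(src, dst):
      s, d = locate(src), locate(dst)
      if s == d:
          return 0      # same row
      if s[0] == d[0]:
          return 1      # same datacenter
      return 2          # cross-datacenter

  def network_locality(src_ip, *candidate_ips):
      """Candidates ordered best-first, i.e. lowest latency tier first."""
      return sorted(candidate_ips, key=lambda dst: latency_tier(src_ip, dst))

  print(network_locality("10.1.4.20", "10.9.0.5", "10.2.3.3", "10.1.0.9"))
  # -> ['10.1.0.9', '10.2.3.3', '10.9.0.5']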

Applications meet the network
application and network teams both work together to
figure out programmatically
whether it is the network or the application

Network engineers get to care about the application design,
they sit down with developers, figure out the interaction
model, the constraints, and how the design of the application
can work with the network; and likewise, how the network
can grow to work with the application needs.

They work with application developers, whether they work
for Amazon or not.

set Amazon's servers on fire, not yours!
www.acadweb.wwu.edu/dbrunner/

New York Times, single person working there on
putting all their historical articles online;
couldn't build the entire datacenter model,
got his boss's credit card,

Loaded newspapers from 1851 to 1922 into S3
Set up a Hadoop processing cluster on EC2
churned through all 11 million articles in just under
24 hours using 100 EC2 instances
Added 1.5TB of publicly usable content in 24 hours.

Spent $100 with Amazon to do it.

http://www.amazon.com/jobs/

Q: Randy Bush asks about the outage last Friday.
Well, dependency mapping is a hard case.
you need to do edge case mapping, backoff and
recovery; edge cases can cause other systems to
tumble down; it's challenging in SOA, where newer
applications being built mash together different
APIs.
A single failure had expanding effects at higher
levels; how do you make sure an edge case doesn't
affect the common platform?

Q: Anne Johnson, interested in the concept of locality;
how do you decide what the 'best' server is for a request?
There has to be more to it than just IP addresses,
right?  How do you define closest?
Best isn't necessarily the closest, 1/2 ms doesn't
matter as much; it's basically an ordered list
of available capacity with latency tiers.
Trying to figure out how to expose it to external
developers.
One thing that does look interesting is fault zones:
which pieces of infrastructure can fail together.
Some level of fate-sharing analysis; what pieces
share fate, which ones can be isolated.

Q: Todd Underwood, Renesys; Amazon is transitioning from
a company that sells items to a company that provides
services; is this something that has been planned
over time, or was it a recent decision?  For many
people, the network is just a cost center that spends
too much.
This wasn't part of the business plan, it wasn't
amazing prescience, it was that they'd built
amazing technology that only they were using;
for their ecommerce platform, they started
sharing useful primitives with outside entities,
very happy with how small companies are using
them, and shaping nature of the development.
Many of the early decisions on policy and design
were under assumption they were always going to
be internal; since then, have changed to make
sure tools can be externalized, with that thought
going into initial design.

Q: Dan Blair, Cisco, managing complex information
flows in hopefully simple ways from user
perspective; do they care about data flow
symmetry or asymmetry?
Depends on what level; symmetry is important
for some pieces, for anycast deployments for
example.  But symmetry matters more at the network
design level than at the flow level.
Perturbations happen both on small scale and
large scale, they tend to be localized, and mostly
symmetric.

Q: Dino, Cisco--do you see your edge cases increasing,
decreasing, are they on network, application, etc?
BIDIR example is radical example of simplification;
it's been stable for three years now.  Simplicity
is the first design principle they consider when
people get in the room.  Are the edge cases increasing?
They try to do rapid prototyping, provide minimal
functionality early; the goal is to iterate from there.
They don't release full-featured services initially;
customers influence them over time.
If you wanted to add new functionality, would you
add a new protocol, or bolt onto an existing protocol?
Would be evaluated on a case by case basis.
More protocols is not desirable and increases opex,
in general.  As they look at larger and larger systems,
they often do find a need for some additional protocols.

Centralization vs Distribution--which way do they lean?
They go for distributed systems, use clean, hardened
APIs, as self-describing as possible.

Thanks to everyone for the time, and thanks to Josh for
twisting their arms hard enough to come here.

30 minute break now.  Back here in 30 minutes.  NOT lunch!
PGP key signing now, Joel will handle it, follow him to
get your keys signed!


