2009.10.20 NANOG47 Day 2 notes, morning sessions

Matthew Petach mpetach at netflight.com
Tue Oct 20 12:08:41 CDT 2009

Here's my notes from this morning's sessions.  :)

Off to lunch now!


2009.10.20 NANOG day 2 notes, first half

Dave Meyer kicks things off at 0934 hours
Eastern time.

Survey!  Fill it out!

Cathy Aaronson will start off with a remembrance
of Abha Ahuja.  She mentored, chaired working
groups, she helped found the net-grrls group;
she was always in motion, always writing software
to help other people.  She always had a smile, always
had lots to share with people.
If you buy a tee shirt, Cathy will match the donation.

John Curran is up next, chairman of ARIN
Thanks to NANOG SC and Merit for the joint meeting;
Add your operator perspective!
Vote today in the NRO number council election!
You can vote with your NANOG registration email.

Join us tonight for open policy hour (this room)
and happy hour (rotunda)

Participate in tomorrow's IPv6 panel discussion
and the rest of the ARIN meeting.

You can also talk to the people at the election
help desk.

During the open policy hour, they'll discuss the
policies currently on the table.

And please join in the IPv6 panel tomorrow!

If you can, stay for the ARIN meeting, running
through Friday.

This includes policy for allocation of ASN blocks
to RIRs
Allocation of IPv4 blocks to RIRs
Open access to IPv6 (make barriers even lower)
IPv6 multiple discrete networks (if you have non
 connected network nodes)
Equitable IPv4 run-out (what happens when the free
 pool gets smaller and smaller!)

Tomorrow's Joint NANOG panel
 IPv6--emerging success stories
Whois RESTful web service
Lame DNS testing
Use of ARIN templates
 consultation process ongoing now; do we want to
 maintain email-based access for all template types?

Greg Hankins is up next for 40GbE and 100GbE
standards update--IEEE P802.3ba

Lots of activity to finalize the new standards specs
 many changes in 2006-2008 as objectives first developed
After draft 1.0, less news to report as task force
 started comment resolution and began work towards the
 final standard
 Finished draft 2.2 in August, dotting i's and crossing t's
 Working towards sponsor ballot and draft 3.0
On schedule for delivery in June 2010

Copper interface moved from 10 meters to 7 meters.
100m on multimode,
added 125m on OM4 fiber, slightly better grade.

CFP is the module people are working towards as
a standard.

Timeline slide--shows the draft milestones that
IEEE must meet.  It's actually hard to get hardware
out the door based around standards definitions.
If you do silicon development and you jump in too
fast, the standard can change under you; but if you
wait too long, you won't be ready when the standard
is fully ratified.
July 2009, Draft 2 (2.2), no more technical changes,
so MSAs have gotten together and started rolling
out pre-standard cards into market.

Draft 3.0 is big next goal, it goes to ballot for
approval for final standards track.
After Draft 3.0, you'll see people start ramping
up for volume production.

Draft 2.x will be technically complete for WG ballot

tech spec finalized
first gen pre-standard components have hit market
technology demonstrations and forums

New media modules:
QSFP modules
created for high density short reach interfaces
 (came from Infiniband)
Used for 40GBASE-CR4 and 40GBASE-SR4

CXP modules
proposed for infiniband and 100GE
12 channels
100GbE uses 10 of 12 channels
used for 100GBASE-CR10

CFP Modules
long reach apps
big package
used for 40GBASE-SR4/LR4 and 100GBASE-SR10/LR4/ER4
about twice the size of a Xenpak

100G and 40G options for it.

MPO/MTP cable
multi-fiber push-on
high-density fiber option
12 fiber MPO uses 8 fibers
 24 fiber MPO cable, uses 20 fibers
this will make cross connects a challenge

Switches and Routers
several vendors working on pre-standard cards,
you saw some at beer and gear last night.
Alcatel, Juniper

First gen tech will be somewhat expensive and
low density
 geared for those who can afford it initially and
 really need it.
 Nx10G LAG may be more cost effective
 higher speed interfaces will make 10GbE denser and cheaper
Density improves as vendors develop higher capacity
 systems to use these cards
  density requires > 400Gbps/slot for 4x100GbE ports
Cost will decrease as new technology becomes feasible.

Future meetings
September 2009, Draft 2.2 comment resolution
Nov 2009 plenary
 Nov 15-20, Atlanta
 Draft 3.0 and sponsor ballot


You have to go to meeting to get password for the
draft, unfortunately.

Look at your roadmap for next few years
get timelines from your vendors
 optical gear, switches, routers
 server vendors
 transport and IP transit providers, IXs
figure out what is missing and ask for it
 will it work with your optical systems
 what about your cabling infrastructure
 40km 40GbE
 Ethernet OAM
 Jumbo frames?

There's no 40km offering now; if you need it,
start asking for it!

Demand for other interfaces
 standard defines a flexible architecture, enables
 many implementations as technology changes
Expect more MSAs as tech develops and becomes cost effective
 serial signalling spec
 duplex MMF spec
 25Gbps signalling for 100GbE backplane and copper
Incorporation of Energy Efficient Ethernet (P802.3az)
to reduce energy consumption during idle times.

Traffic will continue to increase
Need for TbE is already being discussed by network operators
Ethernet will continue to evolve as network requirements change

Question, interesting references.

Dani Roisman, PeakWeb
RSTP to MST spanning tree migration in a live datacenter

Had to migrate from a Per-vlan RSTP to MST on a
highly utilized network
So, minimal impact to a live production network
define best practices for MST deployment that will
 yield maximal stability and future flexibility
Had minimal reference material to base this on

The focus here is on real-world migration details;
read white papers and vendor docs for the specs of each protocol

The environment:
managed hosting facility
needed flexibility of any vlan to any server, any rack
each customer has own subnet, own vlan

Dual-uplinks from top-of-rack switches to core.

High number of STP logical port instances
using rapid pvst on core
VLAN count × trunk interface count = logical port instances
Too many spanning tree instances for layer 3 core switch
concerns around CPU utilization, memory, other resource
exhaustion at the core.
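The scaling arithmetic behind that concern can be sketched quickly (vlan and port counts below are illustrative, not from the talk):

```python
# Rough arithmetic behind the "too many STP instances" concern.
# All numbers are made up for illustration.
vlans = 500            # one VLAN per hosted customer
trunk_ports = 48       # trunk interfaces carrying all VLANs on a core switch

# Per-VLAN RSTP: every VLAN runs its own instance on every trunk.
rpvst_logical_ports = vlans * trunk_ports
print(rpvst_logical_ports)   # 24000 logical port instances

# MST: all VLANs mapped into a couple of instances.
mst_instances = 2
mst_logical_ports = mst_instances * trunk_ports
print(mst_logical_ports)     # 96 logical port instances
```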

Vendor support: per-vlan STP
Cisco: per-vlan is the default config, cannot switch
to single-instance STP
foundry/brocade offers per vlan mode to interoperate
 with cisco
Juniper MX and EX offers vstp to interoperate
Force10 FTOS

Are we too spoiled with per-vlan spanning tree?
don't need per-vlan spanning tree, don't want to
utilize alternate path during steady-state since
we want to guarantee 100% capacity during
failure scenario

collapse from per-vlan to single-instance STP
Migrate to standards-based 802.1s MSTP
(multiple spanning tree--but really going to fewer
spanning trees!)

MST introduces new configuration complexity
all switches within region must have same
vlan-to-mst mapping

means any vlan or mst change must be done
universally to all devices in site.

issues with change control; all rack switches
must be touched when making single change.

Do they do one MST that covers all vlans?
Do they pre-create instances?
do all vendors support large instance numbers?
No, some only support instances 1-16

Had to do migration with zero downtime if possible
Used a lab environment with some L3 and L2 gear

Found a way to get it down to one STP cycle of 45 seconds

Know your roots!  Set cores to "highest" STP priority
(lowest value)

Set rack switches to lower-than-default to ensure
they never become root.
Start from roots, then work your way down.
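The "highest priority is lowest value" rule can be sketched as a toy root election (values mirror the 8192/16384 cores in these notes; the rack-switch value is illustrative):

```python
# Toy sketch of STP root election: the bridge with the numerically
# lowest priority wins, so "highest priority" means lowest value.
# Rack switches get a lower-than-default priority (a *higher* number
# than the 32768 default) so they can never become root.
bridges = [
    ("core1", 8192),
    ("core2", 16384),
    ("rack1", 61440),   # illustrative lower-than-default priority
    ("rack2", 61440),
]
root = min(bridges, key=lambda b: b[1])
print(root[0])   # core1
```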

MSTP runs RSTP for backwards compatibility
choose VLAN groups carefully.

Instance numbering
some only support small number, 1-16

starting point
 all devices running 802.1w
 core 1 root at 8192
 core 2 root at 16384

You can pre-config all the devices with spanning
tree mapping, but they don't go live until final
command is entered
Don't use vlan 1!
set mst priority for your cores and rack switches.
don't forget MST 0!
vlan 1 hangs out in MST 0!

First network hit; when you change core 1 to
spanning mode mst

step 2, core2 moves to mst mode; brief blocking

step 3; rack switches, one at a time, go into
brief blocking cycle.

Ongoing maintenance
all new devices must be pre-configured with identical
 MST params
any vlan to instance mapping changes, do to core 1
no protocol for MST config propagation
 vtp follow-on?

MST adds config complexity
MST allows for great multi-vendor interoperability in
 a layer 2 datacenter
only deployed a few times--more feedback would be welcome

Leo Bicknell, ISC; he's done several; he points
half rack switches at one core, other half at
other core; that way in core failure, only half
of traffic sloshes; also, that way, with traffic
on both sides, failed links showed up much more quickly.
Having any device in any rack support any vlan
is a scaling problem.  Most sites end up going
to Layer3 on rack switches, which scales much better.
A: Running hot on both sides, 50/50 is good for
making sure both paths are working; active/
standby allows for hidden failures.  But
since they set up and then leave, they
needed to make sure what they leave behind is
simple for the customer to operate.
The Layer3 move is harder for managed hosting,
you don't know how many servers will want in a
given rack switch.

Q: someone else comes to mic, ran into same
type of issue.  They set up their network
to have no loops by design.
Each switch had 4x1G uplinks; but when they
had flapping, it tended to melt CPU.
Vendor pushed them towards Layer3, but they
needed flexibility for any to any.
They did pruning of vlans on trunk ports;
but they ended up with little "islands" of
MST where vlans weren't trunked up.
Left those as odd 'separate' root islands,
rather than trying to fix them.

A: So many services are built around broadcast
and multicast style topologies that it's hard
to move to Layer3, especially as virtualization
takes off; the ability to move instances around
the datacenter is really crucial for those
virtualized sites.

David Maltz, Microsoft Research
Datacenter challenges--building networks for agility

brief characterization of "mega" cloud datacenters
based on industry studies
traffic pattern characteristics in data centers
VL2--virtual layer 2
 network virtualization
 uniform high capacity

Cloud service datacenter
50k-200k servers
scale-out is paramount; some services have 10s of
 servers, others 10s of 1000s.
servers divided up among hundreds of services

Cost of servers dominates datacenter cost:
servers 45%, power infrastructure 25%,

maximize useful work per dollar spent
ugly secret: 10-30% CPU utilization considered "good"
 in datacenters
 servers not doing anything at all
 servers are purchased rarely (quarterly)
 reassigning servers is hard
 every tenant hoards servers
solution: more agility: any server, any service

Network diagram showing L3/L2 datacenter model
higher in datacenter, more expensive gear, designed
for 1+1 redundancy, scale-up model, higher in model
handles higher traffic levels.
Failure higher in model is more impactful.
10G off rack level, rack level 1G
Generally about 4,000 servers per L2 domain

network pod model keeps us from dynamically
growing/shrinking capacity
VLANs used to isolate properties from each other
IP addresses topologically determined by ARs
Reconfig of IPs and vlan trunks is painful,
 error-prone, and takes time.

No performance isolation (vlan is reachability
isolation only)
one service sending/receiving too much stomps on
other services

Less and less capacity available for each server
as you go to higher levels of network: 80:1 to 240:1

2 types of apps: inward facing (HPC) and outward
facing.  80% of traffic is internal traffic; data
mining, ad relevance, indexing, etc.

dynamic reassignment of servers and map/reduce
style computations means explicit TE is almost impossible

Did a detailed study of 1500 servers on 79 ToR switches

Look at every 5-tuple for every connection.

Most of the flows are 100 to 1000 bytes; lots
of bursty, small traffic.
But most bytes are part of flows that are 100MB
or larger.  Huge dichotomy not seen on internet
at large.
median of 10 flows per server to other servers.

how volatile is traffic?  cluster the traffic
matrices together.
If you use 40-60 clusters, you can cover a day's worth
of traffic.  More clusters give a better fit.
traffic patterns change nearly constantly.
80th percentile is 100s; 99th percentile is 800s

server to server traffic matrix; most of the
traffic is diagonal; servers that need to
communicate tend to be grouped to same
top of rack switch.

but off-rack communications slow down the
whole set of server communications.

Faults in datacenter:
high reliability near top of tree, hard to accomplish
 maintenance window, unpaired router failed.

0.3% of failure events knocked out all members of
a network redundancy group
 typically at lower layers of network, but not always

developers want network virtualization; want a model
where all their servers, and only their servers are
plugged into an ethernet switch.
Uniform high capacity
Performance isolation
Layer2 semantics
 flat addressing; any server use any IP address
 broadcast transmissions

VL2: distinguishing design principles
randomize to cope with volatility
separate names from locations
leverage strengths of end systems
build on proven network technology

what enables a new solution now?
programmable switches with high port density
Fast, cheap, flexible (broadcom, fulcrum)
 20 port 10G switch--one big chip with 240G
List price, $10k
 small buffers (2MB or 4MB packet buffers)
 small forwarding table; 10k FIB entries

flexible environment; general purpose network
processor you can control.

centralized coordination
 scale-out datacenters are not like enterprise networks
 centralized services already control/monitor health and
  role of each server (Autopilot)
 Centralized control of traffic

Clos network:
ToR connect to aggs, aggs connect to intermediate node
 switches; no direct cross connects.
The bisection bandwidth between each layer is the same,
 so there's no need for oversubscription
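The "no oversubscription" claim is just a capacity balance; a rough sketch with illustrative port counts (not the talk's actual numbers):

```python
# Sketch of the Clos sizing rule: if aggregate uplink capacity equals
# server-facing capacity at each layer, there is no oversubscription.
# All counts are illustrative.
tors = 100
servers_per_tor = 40
server_gbps = 1
server_capacity = tors * servers_per_tor * server_gbps   # 4000 Gb/s of demand

uplinks_per_tor = 4
uplink_gbps = 10
bisection = tors * uplinks_per_tor * uplink_gbps         # 4000 Gb/s of uplink

oversubscription = server_capacity / bisection
print(oversubscription)   # 1.0 -> full bisection bandwidth
```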

You only lose 1/n chunk of bandwidth for a single
box; so you can have automated reboot of a device
to try to bring it back if it wigs out.

Use valiant load balancing
every flow is bounced off a random intermediate switch
provably hotspot free for any admissible traffic matrix
works well in practice.

Use encapsulation on cheap dumb devices.
two headers; outer header is for intermediate switch,
 intermediate switch pops outer header, inner header
 directs packet to destination rack switch.
MAC-in-MAC works well.

leverage strength of endsystems
shim driver at NDIS layer, trap the ARP, bounce to
VL2 agent, look up central system, cache the lookup,
all communication to that dest no longer pays the
lookup penalty.

You add extra kernel drivers to network stack when
you build the VM anyhow, so it's not that crazy.
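The trap-and-cache idea can be sketched in a few lines (the directory contents and names here are hypothetical, purely to show the caching behavior described above):

```python
# Sketch of the VL2 agent idea: trap a lookup, ask the central
# directory once, cache the answer so later traffic to the same
# destination skips the lookup penalty.  Names are hypothetical.
directory = {"10.0.5.7": "tor-switch-42"}   # app address -> location
cache = {}
lookups = 0

def resolve(dest_ip):
    global lookups
    if dest_ip not in cache:
        lookups += 1                 # one trip to the central directory
        cache[dest_ip] = directory[dest_ip]
    return cache[dest_ip]

resolve("10.0.5.7")
resolve("10.0.5.7")
print(lookups)   # 1 -- the second call is served from the cache
```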

Applications work with application addresses
 AAs are flat names; infrastructure addresses invisible
 to apps

How to implement VLB while avoiding need to update
state to every host on every topology change?
many switches are optimized for uplink passthrough;
so it seems to be better to bounce *all* traffic
through intermediate switches, rather than trying
to short-circuit locally.
The intermediate switches all have same IP address,
so they all send to the same intermediate IP, it
picks one switch.
You get anycast+ECMP to get fast failover and good
valiant load balancing.
They've been growing this, and found nearly perfect
load balancing.
All-to-all shuffle of 500MB among 75 servers;
get within 94% of perfect balancing; the remaining
overhead is attributed to the extra encapsulation headers.
NICs aren't entirely full duplex; about 1.8Gb not 2Gb

Provides good performance isolation as well; as one
service starts up, it has no impact on a service
already running in steady state.

VLB does as well as adaptive routing (TE using
oracle) on datacenter traffic
 worst link is 20% busier with VLB; median is same.
And that's assuming perfect knowledge of future
traffic flows.

Related work:
wow that went fast!

Key to economical datacenters is agility!
 any server any service
 network is largest blocker
right network model to create is virtual layer 2
 per service
VL2 uses:
 name-location separation
 end systems

Q: Joe Provo--shim only applied to intra-datacenter
traffic; external traffic is *NOT* encapsulated?
A: Yes!

Q: This looks familiar to 802.1aq in IEEE; when you
did the test case, how many did you look at moving
across virtualized domains?
A: because they punt to centralized name system,
there is no limit to how often servers are switched,
or how many servers you use; you can have 10 servers
or 100,000 servers; they can move resources on 10ms
Scalability is how many servers can go into VL2 "vlan"
and update the information.
In terms of number of virtual layer 2 environments,
it's looking like 100s to 1000s.
IEEE is looking at MAC-in-MAC for silicon based benefits;
vlans won't scale, so they use the 802.1ah header, gives
them 16M possibility, use IS-IS to replace spanning tree.
Did they look at moving entire topologies, or just servers?
They don't want to move whole topology, just movement in
the leaves.

Todd Underwood, Google; separate tenants, all work for
the same company, but they all have different stacks,
no coordination among them.  this sounds like a
competing federation within the same company; why
does microsoft need this?
A: If you can handle this chaos, you can handle anything.
And in addition to hosting their own services, they
also do hosting of other outsourced services like
exchange and sharepoint.
Microsoft has hundreds of internal properties
Q: this punts on making the software side working
together, right?  Makes the network handle it at
the many to many layer.

Q: Dani, Peakweb--how often is the shim lookup happening,
is it start of every flow?
A: Yes, start of every flow; that works out well; you
could aggregate, have a routing table, but doing it
per dest flow works well.
Q: Is it all L3, or is there any spanning tree involved?
A: No, network is all L3.
Q: Did you look at Woven at all?
A: Their solution works to about 4,000 servers, but it
doesn't scale beyond that.

Break for 25 minutes now,

11:40 start time.  We'll pop in a few more lightning talks.

Somebody left glasses at beer and gear, reg desk has
them.  :)

Break now!

Vote for SC members!!

Next up, Mirjam Kuhne, RIPE NCC,
RIPE Labs, new initiative of RIPE NCC
First there was RIPE, the equivalent of NANOG,
then NCC came into existence to handle the
meeting coordination, registrar, handled mailing
lists, etc.

RIPE Labs is a website, and a platform and a tool
 for the community
You can test and evaluate new tools and prototypes
contribute new ideas

why RIPE labs?
faster, tighter innovation cycle
 provide useful prototypes to you earlier
 adapt to the changing environment more quickly
closer involvement of the community
 make feedback and suggestions faster and more direct


many of the talks here are perfect candidates for
material to post on labs, to get feedback from your
colleagues, get research results, post new findings.

How can it benefit you?
get involved, share information, discover others
working on similar issues, get more exposure.

Few rules:
 free and civil discussion between individuals
 anyone can read content
 register before contributing
no service guarantee
 content can disappear based on
  community feedback
  legal or abuse issues
  too few resources

What's on RIPE Labs?
DNS Lameness measurement tool
REX, the resource explorer
Intro to internet number resource database
IP address usage movies
16-bit ASN exhaustion data
NetSent next gen information service

Please take a look and participate!
mir at ripe.net or labs at ripe.net

Q: Cathy Aaronson notes that ISP security
BOF is looking for place to disseminate
information; but they should probably get
in touch with you!

Kevin Oberman is up next, from ESnet
DNSSEC Basics--don't fear the signer!
why you should sign your data sooner rather
 than later
 this is your one shot to experiment with signing
 when you can screw up and nobody will care!
 later, you screw up, you disappear from the net.

DNSSEC uses public crypto, similar to SSH
DNSSEC uses anchor trust system, NOT PKI!  No certs!
Starts at root, and traces down.

Root key is well known
Root knows net key
net knows es key
es key signs *.es.net

Perfect time to test and experiment without fear.

Once you publish keys, and people validate, you
don't want to experiment and fail--you will disappear from the net.
signing your information has no impact.
Only when you publish your keys will it have impact.

It is REALLY getting closer!
Root will be signed 2010

Org and Gov are signed now
com and net should be signed 2011
Multiple ccTLDs are signed; .se led the way,
 and have lots of experience; only once did they
 disappear, and that was due to missing dot in
 config file; not at all DNSSEC related.
Registration issues still being worked on
 transfers are of particular concern
 an unhappy losing registrar could hurt you!

Until your parent is ready
Develop signing policies and procedures
test, test, and test some more
 key re-signing
 key rolls
 management tools
find out how to transfer the initial key to your parent
 (when parent decides)
 this is a trust issue--are you really "big-bank.com"

If you're brave
 you can test validation (very few doing it--test on
 internal server first!!) -- if this breaks, your
 users will hurt (but not outside world)
You can give your public keys to the DLV (or ITARs)
  this can hurt even more!
(DLV is automated, works with BIND out of box, it's
simpler, but you can chose which way to go)

What to sign?
Forward zone is big win
 reverse zone has less value
may not want to sign some or all reverse or forward zones

signing involves 2 types of keys
 ZSK and KSK: the ZSK signs zone data, the KSK is the key you send to the parent
keys need to be rolled regularly
if all keys and signatures expire, you lose all access to your zone

use two active keys
 data resigned by 2 newest keys
sign at short intervals compared to expiration to
 allow time to fix things.
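The point of short signing intervals is margin; a sketch of the arithmetic (the 30-day validity and weekly re-sign interval are illustrative values, not ESnet's policy):

```python
# Sketch of the "re-sign well before expiry" rule: with 30-day
# signature validity and weekly re-signing, a broken signing run
# leaves weeks of margin before the zone goes dark.
# Both intervals are illustrative.
from datetime import date, timedelta

validity = timedelta(days=30)      # RRSIG lifetime
resign_every = timedelta(days=7)   # re-signing interval

signed = date(2009, 10, 1)
expires = signed + validity
next_resign = signed + resign_every

margin = expires - next_resign
print(margin.days)   # 23 days to notice and repair a failed re-sign
```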

new keys require parent to be notified.
KSKs are 'safe', kept off the network (rotate annually)

Wait for BIND 9.7, it'll make your life much easier.

There are commercial shipping products out there.

Make sure there are at least 2 people who can
run it, in case one gets hit by a bus.

Read NIST SP800-81
SP800-81r1 is out for comment now
Read BIND admin reference manual.

Once in a lifetime opportunity!!

Arien Vijin, AMS-IX
an MPLS/VPLS based internet exchange
(started off as a coax cable between routers)
then became Cisco 5500 switch, AMSIX version 1,
then 2001 went to Foundry switches at gig, version 2,
version 3 has optical switching

AMS-IX version 3 vs AMS-IX version 4
June 2009 version 3
 six sites, 2 with core switches in middle
 two star networks
E, FE, GE, N*GE connections on BI-15K or RX8 switches
N*10GE connections resiliently connected on switching
 platform (MLX16 or MLX32)
two separate networks, one active at any moment in time
selection of active network by VSRP
 inactive network switch blocks ports to prevent loops
photonic switch basically flips from one network to the
other network.

Network had some scaling problems at the end.
Until now, they could always just buy bigger
boxes in the core to handle traffic.

Summer of 2009, they realized there was no sign of
a bigger switch on the horizon to replace the core.

core switches fully utilized with 10GE ports
 limits ISL upgrade
 no other switches on market
platform failover introduces short link flap on all
 10GE customer ports--this leads to BGP flaps
 with more 10G customers this becomes more of an issue

AMSIX version 4 requirements
scale to 2x port count
keep resilience in platform, but reduce impact on
 failover (photonic switch layer)
increase amount of 10G customer ports on access switches
 more local switching
migrate to single architecture platform
 reduce management overhead
use future-proof platform that supports 40GE and 100GE
 2010/2011 fully standardized

They moved to 4 central core switches, all meshed
together; every edge switch has 4 links, one to each
Photonic switch for 10G members, to have redundancy
for customers.

MPLS/VPLS-based peering platform
scaling of core switches by adding extra switches in the core
 4 LSPs between each pair of access switches
 primary and secondary (backup) paths defined

 bfd for fast detection of link failures
RSVP-TE signalled LSPs over predefined paths
 primary/secondary paths defined
VPLS instance per vlan
 static defined VPLS peers (LDP signalled)
 load balanced over parallel LSPs over all core routers
Layer 2 ACLs instead of port security
 manual adjustment for now
 (people have to call with new MAC addresses)

Now they're P/PE routers, not core and access
switches.  ^_^;

Resilience is handled by LSP switchover from
primary to secondary path; totally transparent
to access router.
If whole switch breaks down, photonic switch
is used to flip all customers to the secondary network.
So, they can only run switches at 50% to allow
for photonic failover of traffic.

How to migrate the platform without customer impact?

Build new version of photonic switch control daemon (PSCD)
 No VSRP traps, but LSP state in MPLS cloud
develop configuration automation
 describe network in XML, generate configuration from this
Move non MPLS capable access switches behind MPLS
routers and PXC as a 10GE customer connection
Upgrade all non MPLS capable 10GE access switches to
 Brocade MLX hardware
Define migration scenario with no customer impact

2 colocation sites only for simplicity
double L2 network
VSRP for master/slave selection and loop protection

Move GE access behind PXC

Migrate one half to MPLS/VPLS network

Use PXC to move traffic to MPLS/VPLS network, test
for several weeks.

After six weeks, did the second half of the network.

Now, two separate MPLS/VPLS networks.

Waited for traffic on all backbone links to drop
below 50%; split uplinks to hit all the core P
devices; at that point, traffic then began using
paths through all 4 P router cores.

Traffic load balancing over multiple core switches
 solves scaling issues in the core
Increased stability of the platform
 Backbone failures are handled in the MPLS cloud and
 not seen at the access level
Access switch failures are handled by PXC for single
 pair of switches only

Operational experience
BFD instability
 High LC CPU load caused BFD timeouts
 resolved by increasing timers
Bug: ghost tunnels
 double "up" event for LSP path
 results in unequal load balancing
 should be fixed in next patch release

multicast replication
 replication done on ingress PE, not in core
 only uses 1st link of aggregate of 1st LSP
 with PIM-SM snooping traffic is balanced over multiple
 links but has serious bugs
 bugfixes and load balancing fixes scheduled for future
  code releases.

RIPE TTM boxes used to measure delay through the fabric,
GPS timestamps.
Enormous amounts of jitter in the fabric, delays up to
40ms in the fabric.

Attempts from TTM, send 2 packets per minute, with some
 entropy change (source port changes)
VPLS CAM age out after 60s
for 24-port aggregates, traffic often passes a port
 without programming (CPU learning), high delay
does not affect real-world traffic, hopefully
will look to change CAM timing

packet is claustrophobic?
customer stack issue

increased stability
 backbone failures handled by MPLS (not seen by customers)
 access switch failures handled for a single pair of
  switches now
easier debugging of customer ports
 swap to a different port using the Glimmerglass
config generation
 absolute necessity due to large size MPLS/VPLS configs

Scalability (future options)
 bigger core
 more ports

Some issues were found, but nothing that materially
 impacted customer traffic
Traffic load-sharing over multiple links is good.

Q: did anything change for gigE access customers,
or are they still homed to one switch?
A: nothing changed for gigE customers; glimmerglass
is single-mode optical only, and they're too
expensive for cheap GigE ports.
no growth in 1G ports; no more FE ports; it's
really moving to a 10G only fabric.

RAS and Avi are up next
Future of Internet Exchange Points

Brief recap of history of exchange points
0th gen--throw cable over wall; PSI and Sprint
conspire to bypass ANS; third network wanted in,
MAE-East was born

1st commercial gen: FDDI, ethernet; multi-access,
had head of line blocking issues.

2nd gen: ATM exchange points, from AADS/PBNAP to
the MAEs, peermaker

3rd gen: GigE exchange points, mostly nonblocking
internal switches, PAIX, rise of Equinix, LINX,

4th gen: 10G exchange points, upgrades, scale-out
 of existing IXes through 2 or 3 revs of hardware

Modern exchange points are almost exclusively
ethernet based; cheap, no ATM headaches
10GE and Nx10GE have been primary growth for years.
Primarily flat L2 VLAN
 IX has IP block (/24 or so)
 each member router gets 1 IP
 any member can talk to any other via L2
 some broadcast (ARP) traffic is needed
  well policed

Large IX topology (LINX), running 8x10G or 16x10G
trunks between locations

What's the problem?
L2 networks are easy to disrupt
forwarding loops easy to create
broadcast storms easy to create, no TTL
takes down not only exchange point, but overwhelms
 peering router control plane as well
today we work around these issues by
 locking down port to single MAC
  hard coded, or learn single MAC only
 single directly connected router port allowed
 careful monitoring of member traffic with sniffers
 good IXes have well trained staff for rapid responses

most routers have poor L2 stat tracking
options in use:
Netflow from member router
 no MAC layer info, can't do inbound traffic
 some platforms can't do netflow well at all
SFlow from member routers or from IX operator
 still sampled, off by 5% or more
MAC accounting from member router
 not available on vast majority of platforms today
None integrate well with provider 95th percentile
 billing systems
IXs are a poor choice for delivering billed services
If you can't bill, you can't sell services over the exchange.
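For reference, the 95th-percentile billing computation those stats would need to feed, under one common convention (sample values are made up):

```python
# Sketch of 95th-percentile billing: sort the 5-minute samples,
# discard the top 5%, bill on the highest remaining sample.
# Sample values (Mb/s) are illustrative.
samples_mbps = [100, 120, 90, 400, 110, 95, 105, 115, 98, 102,
                101, 99, 97, 103, 250, 104, 106, 108, 96, 94]

ranked = sorted(samples_mbps)
cut = int(len(ranked) * 0.95)   # 19 of 20 samples kept
billable = ranked[cut - 1]
print(billable)                 # 250 -- the 400 Mb/s spike is discarded
```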

Anyone can talk to anyone else
vulnerable to traffic injection
poor accounting options make this hard to detect.
 when detected, easy to excuse
less security available for selling paid transit
Vulnerable to Denial of Service attacks
 can even be delivered from the outside world if
 the IX IP block is announced (as is frequently the case)
Vulnerable to traffic interception, ARP/CAM manipulation

difficult to scale and debug large layer 2 networks
redundancy provided through spanning-tree or similar
 backup-path protocols
large portions of network placed into blocking mode to
 provide redundancy.

poor controls over traffic rates and or QoS
difficult to manage multi-router redundancy
multiple routers see the same IX/24 in multiple places
creates an "anycast" effect to the peer next-hops
can result in blackholing if there is an IX segmentation
or if there is an outage which doesn't drop link state.

Other issues:
inter-network jumbo-frames support is difficult
no ability to negotiate per-peer MTU
almost impossible to find a common acceptable MTU for all members

service is constrained to IP only between two routers
 can't use for L2 transport handoff

Avi talks about shared broadcast domain architecture
on the exchange points today.

Alternative is to use point to point virtual circuits,
like the ATM exchanges.

Adds overhead to setup process
adds security, accountability advantages

Under ethernet, you can do vlans using 802.1q
handoff multiple virtual circuit vlans.

Biggest issue is limited VLAN ID space
limited to 4096 possible IDs--12-bit ID space
vlan stacking can scale this in transport
but VLANs in this are global across system
Means a 65 member exchange would completely
fill up the VLAN ID space with a full mesh.
Traditional VLAN rewrites don't help either.
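A sketch of the full-mesh arithmetic (how fast the 12-bit space runs out depends on reserved IDs and VCs per member pair, but the shape is clear):

```python
# A full mesh of point-to-point VCs among n members needs n*(n-1)/2
# VLAN IDs under flat 802.1Q, against a 12-bit (4096-value) ID space.
# MPLS pseudowire IDs are 32-bit, so the ceiling effectively vanishes.
def full_mesh_vcs(n):
    return n * (n - 1) // 2

print(full_mesh_vcs(65))   # 2080 VCs for a 65-member exchange
print(full_mesh_vcs(91))   # 4095 -- past the 12-bit VLAN space
print(2 ** 32)             # 4294967296 pseudowire IDs under MPLS
```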

Now, the exchange also has to be arbiter of all
the VLANs used on the exchange.
Many customers use layer3 switch/routers, so the
vlan may be global across the whole device.

To get away from broadcast domain without using
strict vlans, we need to look at something else.

MPLS as transport rather than Ethernet
solves vlan scaling problems
MPLS pseudowire IDs are 32 bits; 4 billion VCs
VLAN ID not carried with the packet, used only on handoffs
VLAN IDs not a shared resource anymore
Solves VLAN ID conflict problems
 members choose vlan ID per VC handoff
 no requirements for vlan IDs to match on each end
solves network scaling problems
 using MPLS TE far more flexible than L2 protocols
 allows the IX to build more complex topologies,
  interconnect more locations, and more efficiently
  utilize resources.

The idea is to move the exchange from L2 to L3 to
scale better, give more flexibility, and do better
debugging.  You can get better stats, you can do
parallel traffic handling for scaling and redundancy,
and you see link errors when they happen, they aren't
masked by blocked ports.

 each virtual circuit would be isolated and secure
 no mechanism for a third party to inject or sniff traffic
 significantly reduced DOS potential
 Most provide SNMP measurement for vlan subints
 Members can accurately measure traffic on each VC
 without "guesstimation"
 capable of integrating with most billing systems.

Now you can start thinking about selling transport
over exchange points, for example

Takes the exchange point out of the middle of the
traffic accounting process.

 with more accountability and security, you can offer
 paid services
 support for "bandwidth on demand" now possible
 no longer constrained to IP-only or one-router-only
  can be used to connect transport circuits, SANs, etc.
 JumboFrame negotiation possible, since MTU is per VC

Could interconnect with existing metro transport
 Use Q-in-Q vlan stacking to extend the network onto
  third party infrastructures
 imagine a single IX platform to service thousands of participants

Could auto-negotiate VC setup using a web portal

Current exchanges mostly work
 with careful engineering to protect the L2 core
 with limited locations and chassis
 with significant redundancy overhead
 for IP services only
A new kind of exchange point would be better
 could transform a "peering only" platform into a
 new "ecosystem" to buy and sell services on.

Q: Arien from AMS-IX asks about MTU--why does it matter?
A: it's for the peer ports on both sides.
Q: they offer private interconnects at AMS-IX, but nobody
 wants to do that, they don't want to move to a tagged
 port.  They like having a single vlan, single IP to
 talk to everyone.
A: The reason RAS doesn't do it is that it's limited in
 scale, you have to negotiate the vlan IDs with each side;
 there's a slow provisioning cycle for it; it needs to
 have same level of speed as what we're used to on IRC.
 Need to eliminate the fees associated with the VLAN
 setup, to make it more attractive.
 It'll burn IPs as well (though for v6, that's not so much
 of an issue)
 Having people peer with the route-server is also useful
 for people who don't speak the language who use the
 route servers to pass routes back and forth.
The question of going outside amsterdam came up, but
the members forbade it, so that it wouldn't compete with
other transit and transport providers.
But within a metro location, it could open more locations
to participate on the exchange point.

The challenge in doing provisioning to many locations
is something that there is a business model for within
the metro region.

Anything else, fling your questions at lunch; return at
1430 hours!

LUNCH!! And Vote!  And fill out your survey!!
