2008.02.19 NANOG 42 100G forwarding challenges

Tue Feb 19 22:08:53 UTC 2008

Wow.  It's good to be back on IPv4.  Apologies for the multiple
copies of the previous set of notes, gmail got very, very
confused about going through the NAT-PT gateway it
seems.  Next time I'll try sending them from a different
mail provider.  ^_^;;

As pointed out, the i-triple-e has more than 2 e's in
it...thanks for pointing out my goof, I'll make sure
I count my e's more carefully in the future.  :)

Matt

2008.02.19 100G forwarding challenges panel

Ted Seely/Igor Gashinsky, moderators

Joel Goergen -- force 10
Dave Tsiang  -- cisco
Aris Wong -- foundry

Why are we here?

We needed 100G a year ago
Standard won't be ready until 2010
Challenges to building hardware still.

Scale
IX operators need it for scaling the
exchange points
16x10G LAGS already being planned.

Slide showing high-performance computing
clusters that need 8x2x10G connectivity.

Joel from force10 goes first.
VP of technology, chief scientist
100G forwarding architectural challenges

OIF meeting in Brussels, 2005, with Alcatel,
Force10, Marconi, and a few others.
Was unable to convince the people that by
2007 people would require speeds greater
than 10G LAG groups.

It's been an uphill struggle, and if it
weren't for the participation of end users
to help systems companies, peers at Cisco,
Foundry, Force10, would never have been able to
convince manufacturers of components that
high speed optics, SERDES would be needed.

Looking at where we are today vs July 2005,
there's full agreement that the project needs
to happen, but is lacking agreement on the
commonality for moving forward.  Please, give
your input to your vendors so we can get
802.3ba correct.

Chassis design even comes into play when planning
for 100G speeds.
 lower system BER
 connectors
 N+1 switch fabric
 reduced EMI
 clean power routing architecture
 thermal and cooling
 cable management
Must meet regulatory standards.

Those of us here need to really speak up now
to make sure that the requirements get put in
place first.

Backplane/channel signalling--most backplanes
now handle differential signalling up to 6.25G;
that's great, up to a point.
No longer A+B fabric, now N+1 fabrics.
You can't cool a simple A+B adequately with dozens
of 6.25G channels into a single card.
So now, it's distributed switch fabrics with a
portion of the traffic on each.
spread out the signalling from multiple linecards
to multiple fabrics.

In 2 years, no advances in backplane signalling
to move us beyond 6.25G point.
They're packing more and more ports onto each
card to reduce the price per port to where
market will bear.
Single 100G blade can be done now, but nobody
buys single port gigE cards; there's already
40G OC768, but that's single port of SONET.
To amortize costs on gig side to get it into
price point market wants.
Need 25G optical/electrical signalling on the
backplane in order to hit the targets people
want.
He and his teams, and competitors are all
working on 25G backplane signalling; you'll
see specs being proposed in next 4 weeks; do
read them and ask questions now!

System BER--he and cisco have been struggling
with component vendors; 10^-12 isn't a
reasonable thing to build into chassis; need
10^-15 built-in, and 10^-17 for testing
conditions.

Right now, 802.3ba still lists 10^-12 as the
target rate; at 100G, that's a significant
amount of packet drops, which isn't acceptable;
they're pushing internally to hit 10^-17.
The math is out there for testing those BERs.
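
To put those BER targets in perspective, here is a rough sketch of
how often a single bit error shows up on a 100G link at each rate.
It assumes a flat 100 Gb/s with no encoding overhead and
independent, evenly spread errors; the function name is made up for
illustration and is not from the slides.

```python
# Mean time between bit errors on a link running at a given BER.
LINE_RATE_BPS = 100e9  # assumption: 100 Gb/s, ignoring encoding overhead


def seconds_between_errors(ber, line_rate_bps=LINE_RATE_BPS):
    """Average seconds between single-bit errors for a given bit error rate."""
    return 1.0 / (ber * line_rate_bps)


for ber in (1e-12, 1e-15, 1e-17):
    secs = seconds_between_errors(ber)
    print(f"BER {ber:.0e}: one bit error every {secs:,.0f} s "
          f"({secs / 3600:,.2f} h)")
```

At 10^-12 that works out to roughly one errored bit every 10
seconds on a fully loaded port; at 10^-15 it is on the order of
hours, and at 10^-17 on the order of days.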

Only one connector out there that works for 25G
right now; hopefully the technology will be
finished and ready by 2009/2010 timeframe.

People need to be concerned about these key
items when talking to their vendors!

Reduce thickness on backplane to reduce signal
issues, get better signal integrity.

Currently, if you use 3 or 6G signalling, backplane
will be 1/2" thick.
at 12 or 25G, backplane will be .25" thick.
Cost drops almost 200% at that point, which allows
you to shift costs to memory, for example.

Reduced EMI--not sure how 100G will perform
through agency approvals.  Will be SERDES,
clock, optics, and even feature related (more
features means more things happening on the
microprocessor, more emissions).

Power; in 2003/2004, as systems vendor, could
have significant noise on 3.3V, 2.5V, 1.5V
supplies, and things would generally work.
Today, you need very tight tolerances or you'll
blow the BER.

Thermal issues; we're adding 25% more power *in*
to try to do more; so energy efficient components
will be crucial.

Cable management will be crucial; we'll be looking
at 400+ gig per blade; how do we handle the cabling
for that?

Current copper 10G PHY takes 10-15W already; you
can't get same density as with fiber PHY that
takes 3W.

High speed will force narrow interfaces to get higher
signalling speeds through the SERDES paths.
You'll probably see 10x10 before 4x25, but 4x25 is what
will be needed to hit the densities needed on the
linecards.
By 2012, will probably hit the 4x25 channels
to reach real density levels.

We all need to stay involved, and keep giving our
feedback!

Density has to move to 400G to 500G per blade for
economics to work out.
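
A quick sketch of why the lane speed matters for those densities:
counting how many lanes a blade needs at each signalling rate.  This
assumes raw lane rates with no encoding overhead or fabric speedup,
and counts one direction only; the helper name is hypothetical.

```python
import math


def lanes_needed(blade_gbps, lane_gbps):
    """Lanes per direction needed to carry blade_gbps at lane_gbps each."""
    return math.ceil(blade_gbps / lane_gbps)


for blade in (400, 500):
    for lane in (6.25, 10, 25):
        print(f"{blade}G blade at {lane}G per lane: "
              f"{lanes_needed(blade, lane)} lanes")
```

Under those assumptions a 400G blade needs 64 lanes at 6.25G but
only 16 at 25G, and a single 100G port is 10 lanes at 10G or 4 at
25G, which is the 10x10 vs 4x25 choice mentioned above.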

Memory, and memory technologies will need to speed
up to touch packets at those line rates.
A huge amount of memory will be needed to handle
packets going through the box; cut-through and store
and forward will have to merge in the future.

Packet forwarding challenges at 100G
David Tsiang, Cisco.

Linecard perspective.
it's 2.5x our last jump at 40G
more scalability
 global tables growing larger
 potential for prefix explosion as v4 runs out
 address reselling->fragmentation
 IPv6
 Growth of VPN usage

More flexibility as Internet deals with evolution
 (eg v4-v6 transition, LISP, pt-mpt MPLS)

We'll need additional complexity as we go through
these transition periods; people will also want
future-proofing on the boxes they buy.
How many pps can you really handle?

Why programmability?
simple forwarding isn't so simple anymore.

Trade off of performance vs scalability.
Allow DRAM memory to be populated with more routes
to continue to scale.

Faster convergence in cases of routing changes.
Lots of indirection, allow for pointer changes
when updates happen, for faster convergence
but slower lookups.
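
A minimal sketch of the indirection idea, not Cisco's actual data
structures: many prefixes point at one shared next-hop object, so a
routing change is a single write instead of one write per prefix,
at the price of an extra pointer chase on every lookup.  The class
names and interface names are invented, and the lookup is an
exact-match dictionary rather than a real longest-prefix match.

```python
class NextHop:
    """Shared, mutable forwarding target that many prefixes can point at."""

    def __init__(self, interface):
        self.interface = interface


class Fib:
    """Toy forwarding table: prefix -> pointer to a NextHop object."""

    def __init__(self):
        self.routes = {}

    def add_route(self, prefix, next_hop):
        self.routes[prefix] = next_hop

    def lookup(self, prefix):
        # Two steps: find the route entry, then follow the pointer.
        return self.routes[prefix].interface


primary = NextHop("te1/1")
fib = Fib()
for prefix in ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"):
    fib.add_route(prefix, primary)

# Routing change: one write repoints every dependent prefix.
primary.interface = "te2/1"
print(fib.lookup("198.51.100.0/24"))
```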

End up with high bandwidth, highly flexible,
but also highly complex.

So from 40G to 100G, we have 2.5x MIPS, Memory,
TPS, Memory BW, FF-MHz,
aiming to keep same power profile
 no forklift upgrades to the datacenter
Mitigations
 silicon advances (110nm -> 90nm -> 65nm)
 lower voltages, capacitance, leakage current
 more efficient memory technologies (dram, sram, tcam)
 more efficient design (terminations, power supply design)

His focus is "can it fit in the power envelope"?

ASIC technology
ASIC 110nm at 1.2v vs 65nm at 1.0V
P(d) = V^2 * C * f
capacitance goes down as voltage goes down
clock input capacitance 26% less
data input capacitance 50% less
Overall, get .45 multiplier; so, trying for 2.5x,
it's 2.5 x .45 = 113% of the old power, a bit over.
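
The arithmetic behind that, as a rough sketch.  The voltages and the
.45 overall multiplier are from the talk; the implied capacitance
ratio is back-calculated here and is an inference, not a number
from the slides.  Frequency is held constant for the per-unit
comparison.

```python
def dynamic_power_ratio(v_old, v_new, cap_ratio, freq_ratio=1.0):
    """Ratio of new to old dynamic power under P = V^2 * C * f."""
    return (v_new / v_old) ** 2 * cap_ratio * freq_ratio


voltage_term = (1.0 / 1.2) ** 2   # 110nm at 1.2V -> 65nm at 1.0V
per_unit_multiplier = 0.45        # overall figure quoted in the talk
# Inferred: what capacitance ratio makes the formula give 0.45?
implied_cap_ratio = per_unit_multiplier / voltage_term
scaled_power = 2.5 * per_unit_multiplier  # 2.5x the work at .45x each

print(f"voltage term:      {voltage_term:.3f}")
print(f"implied cap ratio: {implied_cap_ratio:.3f}")
print(f"100G vs 40G power: {scaled_power:.1%}")
```

The implied capacitance ratio lands between the 26% and 50%
reductions quoted for clock and data inputs, so the numbers hang
together.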

Leakage current issues (man, these slides are dense,
download them and read them on your own)

Fewer gates per Gbps gives the same dynamic power, but less
static (leakage) power per Gbps.

increase ASIC freq from 250MHz to 400MHz, gets you up
by 126% on ASIC

Memory technologies
40G, FCRAM, 332Mbps/pin
100G RLDRAM-II, 800Mbps/pin
DRAM power gain is 2.5 x .55 = 138%

TCAM
11.25W at 40G equiv Lups vs 7.5W at 100G equiv Lups
73% reduction in TCAM power (better TCAM cell design)
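
Checking the memory numbers with the same kind of scaling.  This is
one reading of the slide: the 40G-era TCAM power is scaled up 2.5x
to the 100G lookup rate and then compared against the 7.5W figure.
That interpretation is an assumption, though it does reproduce the
73%.

```python
SCALE = 100 / 40  # 2.5x bandwidth step from 40G to 100G

# DRAM: .55 per-unit multiplier from the notes, times 2.5x the work.
dram_relative_power = SCALE * 0.55

# TCAM: scale the 40G-era part to 100G lookup rates, then compare.
tcam_40g_scaled_w = 11.25 * SCALE
tcam_100g_w = 7.5
tcam_reduction = 1 - tcam_100g_w / tcam_40g_scaled_w

print(f"DRAM power vs 40G design: {dram_relative_power:.1%}")
print(f"40G TCAM scaled to 100G:  {tcam_40g_scaled_w:.3f} W")
print(f"TCAM power reduction:     {tcam_reduction:.0%}")
```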

Other power savers
More efficient SI design
 internal terminations vs external Thevenin terminations
 on memory lines
Integrated serdes on ASICS
replacement of some SRAMs with SDD
efficient power supply design
 discrete designs optimized per load zone decreases power
 loss through DC-DC converter
Integration of service processor function (10W)

End result is that they're able to roughly maintain the
same power profile for the new ports.
Mostly due to TCAM power reductions.
ASIC power went up, had to play tricks to offset it.

Silicon advantages get us 90% of the way to 100G
challenge, but silicon is lagging behind bandwidth
growth curve.

The remaining 10% comes from more efficient designs
which have limited reusability and repeatability;
over time, it's going to suck more power, and put
out more heat.

John Burger and Aris Wong, Foundry Networks
100G challenges, QoS, Policer issues

Agenda
key components
challenges
potential solutions.

Key components required for 100G
optics, phy/mac
packet processor
traffic manager, switch fabric interface
system backplane

everybody needs high speed memory, high speed fabric.
speed ups along those pathways are unavoidable.

CMOS ASIC process technology
supporting chips all need to run faster to support 100G
65nm down to 45nm technology will allow higher speed
chips to do faster signalling with less power.

various decisions that need to be made after packet
arrives; header lookups launched, and rewrites are
done, traffic manager does scheduling, shaping,
and buffer management.

The policer and QoS system needs to function much
faster in 100G environment;
QoS at 100G, needs to manage flows faster
has to be able to prioritize and assign drop precedence
as the packets go past.
scheduler has to be able to handle the forwarding.

Policer
 typically dual leaky bucket algorithm
 decisions
  forward, drop packet
  mark green, yellow,

Slide showing example of dual leaky bucket model.
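
The slide itself isn't reproduced here, so as a stand-in, a minimal
sketch of a dual-bucket policer.  It is written as a token-bucket
two-rate three-color marker in the style of RFC 2698 (color-blind
mode), which is an assumption about what the slide showed; the class
name and parameters are illustrative only.  Red would typically be
the drop decision.

```python
class TwoRatePolicer:
    """Color-blind two-rate three-color marker in the style of RFC 2698."""

    def __init__(self, cir_bps, cbs_bytes, pir_bps, pbs_bytes):
        self.cir = cir_bps / 8.0    # committed rate, bytes per second
        self.pir = pir_bps / 8.0    # peak rate, bytes per second
        self.cbs = cbs_bytes        # committed burst size
        self.pbs = pbs_bytes        # peak burst size
        self.tc = float(cbs_bytes)  # committed bucket starts full
        self.tp = float(pbs_bytes)  # peak bucket starts full
        self.last = 0.0             # time of the previous update, seconds

    def _refill(self, now):
        # Add tokens for the elapsed time, capped at each burst size.
        elapsed = now - self.last
        self.last = now
        self.tc = min(self.cbs, self.tc + elapsed * self.cir)
        self.tp = min(self.pbs, self.tp + elapsed * self.pir)

    def mark(self, size_bytes, now):
        self._refill(now)
        if self.tp < size_bytes:
            return "red"      # exceeds peak rate
        self.tp -= size_bytes
        if self.tc < size_bytes:
            return "yellow"   # within peak, exceeds committed rate
        self.tc -= size_bytes
        return "green"        # within committed rate


policer = TwoRatePolicer(cir_bps=8000, cbs_bytes=1500,
                         pir_bps=16000, pbs_bytes=3000)
print([policer.mark(1500, now=0.0) for _ in range(3)])
```

Every packet needs a refill and up to two compare-and-subtract
steps on shared state, which is where the timing pressure at 100G
comes from.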

challenges at 100GE
packet rate can be up to 150Mpps per port
dual leaky bucket algorithm can be compute intensive,
and the timing is critical to implement
scheduling is harder too.
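
Where the roughly 150Mpps figure comes from, as a sketch: minimum
size 64-byte Ethernet frames plus the 20 bytes of preamble, start
delimiter, and inter-frame gap each one occupies on the wire.  The
function name is made up for illustration.

```python
ETHERNET_OVERHEAD_BYTES = 20  # 7 preamble + 1 SFD + 12 inter-frame gap


def max_packets_per_second(line_rate_bps, frame_bytes=64):
    """Worst-case packet rate for back-to-back frames of a given size."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return line_rate_bps / bits_on_wire


pps = max_packets_per_second(100e9)
ns_per_packet = 1e9 / pps
print(f"{pps / 1e6:.1f} Mpps, {ns_per_packet:.2f} ns per packet")
```

That is about 148.8 Mpps, or a budget of 6.72 ns per packet for
lookup, policing, and scheduling decisions on a single port.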

technology node at 65nm or below is needed to ease
timing challenges

divide and conquer approach, multiple instantiations
of policers, but brings coherency challenges

how about pipelining instructions?

Customers have been speaking up pointing out they need
100G to keep up with network growth
challenges in forwarding architectures
 higher packet rate
 higher bandwidth

Thank you!

Questions between panelists

Q: system bit error rates; 100GE PHY as well.
You want to push vendor to better raw error
rates; what about error correction, retransmission,
other ways to get to better system error rate.
A: Going from 10^-12 to 10^-14, issue is more for
component vendor than system vendor; adds time,
adds cost to delivery.
Comes down to time and costs, may need to reduce
10^-17 goal for now.
Possibly forward error correction (FEC).
But FEC adds 8-20% to bandwidth requirement, which
means more heat, more power, etc.
DFE, multitap transmitters, currently DFE
requires 6 taps, which means latency; for
store and forward switch, adds more and more
delay into the unit.
So BER, FEC, DFE, there's tradeoffs; for DFE,
it would take 16+ taps, which adds way more
delays; BER of 10^-15 really is the way to
go.
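
A small sketch of what the 8-20% FEC overhead mentioned above means
for the signalling rate.  It assumes the overhead is expressed as a
fraction of the payload rate, which the notes don't specify; the
function name is hypothetical.

```python
def line_rate_with_fec(payload_gbps, overhead_fraction):
    """Raw line rate needed to carry the payload plus FEC overhead."""
    return payload_gbps * (1.0 + overhead_fraction)


for overhead in (0.08, 0.20):
    raw = line_rate_with_fec(100.0, overhead)
    print(f"{overhead:.0%} FEC overhead -> {raw:.0f} Gb/s on the wire")
```

Under that assumption a 100G payload needs 108 to 120 Gb/s of raw
signalling, which is the extra heat and power being weighed against
a better raw BER.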

Q: estimate for 25G signalling?
A: he's putting 25% of his resources to get 25G
signalling; the stepping stone is first to agree
on objective, then agree on steps to take.
If we can agree this year, can get the specs
nailed down and start getting components rolled
by 2010, which will allow linecards out by 2012
timeframe.

Igor asks how far away are we from boxes shipping
that won't cost $10M each?
A: Draft 1.0 is Nov 2008 currently, may go into
2009.  Once the draft is done, they can start working
on how to meet the spec on the hardware side.
The challenge is to get those first generation
linecards out without breaking the bank on them
or making them cost a bundle.
Get 40G out there, reasonably priced without a
lot of features; then focus on expanding
features and density out to 2014.
May do a pre-standard version to get feet wet
first.

Q: What do you have in mind after 100G--are we
at the end of what silicon can handle?
A: They're looking at 45nm, 35nm, they should be
able to go to 250G before running out of silicon
steam--that's on the forwarding side.  Will need
to figure out how to cool it, and how to cool
the box around it, which goes back to the
datacenter side.

Q: Anton Kapella, 5 nines, scaling rates higher
and higher, binary encoding frequency is getting
more challenging, can we do multilevel encoding to
get better bit rates, what about going right to
optics?
A: F10 labs, work shows that they'll be able to
extend serdes to 75G per differential pair through
coding and signalling methods, without OFDM, but
using multilevel coding.  PAM4, PAM8, QAM as
it's been used... PAM4 has issues; should be 1/log2(n);
but PAM4 isn't scaling to the math at the moment.
Not attractive for the next gen coding scheme.
lucent/alcatel doing duobinary; full pulse
transmission, 25 tap dfe at tx and rx side.
A few weeks from now, a new encoding will be
published with bandwidth of 1/n, similar to
NRZ.  Could implement on current backplane
technologies, which is important; it's very
expensive to replace backplanes, etc.
On optical side--bell labs did optical backplanes
in 1998 and 2000; but the manufacturability of
them is really tough.  deterministic jitter, optical
power, etc; still need to do optical to electrical
conversion, even if you do it right on die; hoping
to hold off on needing to do that until 2016.
In multichassis, using optics to distribute switch
fabrics out, they're expensive, hard to manufacture,
if you can do it electrically stick with that for now.
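
On the 1/log2(n) point in the answer above: with n signal levels
each symbol carries log2(n) bits, so the symbol rate (and roughly
the bandwidth needed) drops by that factor.  A sketch of the ideal
case only; as noted, PAM4 in practice wasn't living up to this math.
The function name is made up for illustration.

```python
import math


def symbol_rate_gbaud(bit_rate_gbps, levels):
    """Ideal symbol rate for a given bit rate and number of signal levels."""
    return bit_rate_gbps / math.log2(levels)


for name, levels in (("NRZ", 2), ("PAM4", 4), ("PAM8", 8)):
    print(f"{name}: 25 Gb/s needs "
          f"{symbol_rate_gbaud(25, levels):.2f} GBd")
```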

Thanks to all the panelists, and it's break time now!