ATM (was Re: too many routes)
Sean M. Doran
smd at clock.org
Fri Sep 12 03:58:01 UTC 1997
Cool, I love being talked down to by old guys. It's
refreshing and doesn't happen nearly frequently enough.
I'm almost at a loss to figure out where to begin with the
scattershot flamefest you sent. Almost. Let's start
> Lets see you allocate an ESF B8ZS Clear Channel T1 over
PDH is dead.
POTS is only alive because you can emulate PDH still, and
extracting a single DS0 from SDH is easy, and because the
POTS user interface is well understood by a very large
installed user base. I don't expect POTS as perceived by
the end-user to change much over time.
End-to-end POTS is already dying. Worldcom is making a
big deal over relatively simple technology which shuffles
fax traffic over the Internet. There goes a lot of
long-haul POTS right there. Deutsche Telekom is tight
with Vocaltec and already has a tariff for
voice-over-the-Internet. It's crusty in implementation
because you end up dialling a local access number in DT
land and talk to another dialler on the remote end which
makes a more local phone call. However, there are neat
plans for SS7 and neater plans for doing clever things
with interpreting DTMF.
There is a local distribution plant problem; however, there
are a number of people working on aggregating up local
access lines into VC11/VC12, dropping that into a
POP-in-a-box at STM-16 and pulling out STM-16c to a big
crunchy IP router. In this model POTS and historical
telco voice and data schemes become services rather than
However, emulating the incredibly ugly phone network is
secondary to enabling the evolution of more and more
complicated and interesting applications; it's also less
cost-effective for the moment than running parallel
networks but converting away from PDH (which, remember, is
BTW, your formatting sucks to the point that your note is
unreadable and unquotable without fmt(1) or fill-paragraph.
> Ahhh, tag switching, I am on that particular holy grail
You might want to examine my comments on the mpls list at
some point. "Holy Grail"? No. That was Noel. I know we
look alike and stuff, but he sees a great deal of promise
in MPLS while I am somewhat sceptical about the
implementation and utility in practice.
>How many parallel paths have you ran on
>layer 3 ? Ever watched the variability ? ( * shiver * )
>Now, tell me parallel paths on IP are smooth with todays
Hum, not more than six hours ago I believe I was
telling Alan Hannan about the various interim survival
techniques in the migration path from 7k+SSP -> decent
I guess you must have me beat experientially.
All I ever did was sit down with pst and tli and hack and
slash at the 10.2-viktor caching scheme to try to get
traffic to avoid moving over to the stabler of the two
lines between ICM-DC and RENATER.
Oh that and helping beat on CEF/DFIB packet-by-packet load
balancing before my last retirement.
So unfortunately I'm really not in a position to comment
on today's technology or the variability of parallel paths
with Cisco routers using any of the forwarding schemes
from fast to cbus to route-cache to flow to fib. (To be
honest I never really figured out wtf optimum was :) ).
The reason that you see "strange" or at least "unsmooth"
load balancing along parallel paths is that except with
fib and slow switching cisco had always forwarded packets
towards the same destination out the same interface, and
load balancing was performed by assigning (upon a cache
fault) particular destinations to particular interfaces.
(Leon was invented to blow away cached entries so that
over time prefixes would slosh about from one interface to
another as they were re-demand-filled into the cache.
Points if you know who Leon and Viktor are. Hint: they're
both as dead as PDH.)
With CEF these days you can load-balance on a per-packet
basis. This has the side effect that you cannot guarantee
that packets will remain in sequence if the one-way delay
across the load-balanced paths is off by more than about
half a packet transmission time. However, you also get
much more even link utilization and no ugly
cache/uncache/recache at frequent intervals (which really
sucks because unfortunately you have to push a packet
through the slow path at every recache).
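
As a rough sketch of the difference between the two approaches
(the traffic mix, link names and packet sizes below are made up
for illustration, and this is a toy model, not how any cisco
forwarding path is actually coded):

```python
from collections import Counter
from itertools import cycle


def per_destination(packets, links):
    """Pin each destination to a link on its first packet (a cache fill)."""
    cache = {}
    load = Counter({link: 0 for link in links})
    next_link = cycle(links)
    for dst, size in packets:
        if dst not in cache:
            cache[dst] = next(next_link)  # cache fault: pick the next link
        load[cache[dst]] += size
    return dict(load)


def per_packet(packets, links):
    """Round-robin every packet across the links, ignoring destination."""
    load = Counter({link: 0 for link in links})
    next_link = cycle(links)
    for _dst, size in packets:
        load[next(next_link)] += size
    return dict(load)


def can_reorder(delay_skew_s, packet_bits, rate_bps):
    """True if the one-way delay skew exceeds half a packet transmission time."""
    return delay_skew_s > 0.5 * packet_bits / rate_bps


# Hypothetical mix: one heavy destination and three light ones.
packets = [("A", 1500)] * 90 + [("B", 1500)] * 4 + [("C", 1500)] * 3 + [("D", 1500)] * 3
links = ["link0", "link1"]

print(per_destination(packets, links))  # {'link0': 139500, 'link1': 10500}
print(per_packet(packets, links))       # {'link0': 75000, 'link1': 75000}
```

The per-destination version is lopsided whenever a few
destinations dominate; the per-packet version is even but pays
for it with possible reordering, which is what can_reorder()
checks.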
So anyway, as I was saying, I'm ignorant about such
>Audio sounds great with lots of variability
So if you aren't holding a full-duplex human-to-human
conversation you introduce delay on the receiver side
proportional to something like the 95th percentile and
throw away outliers. If you're holding a full-duplex
long-distance human-to-human conversation you can use POTS
(which is dying but which will live on in emulation) and
pay lots of money or you can use one of a number of rather
clever VON member packages and pay a lot less money but put
up with little nagging problems. For local or toll-free
stuff, to expect better price-performance from an end-user
perspective now would require taking enormous doses of
> > You want bounded delay on some traffic profiles that
> > approach having hard real time requirements. (Anything
> > that has actual hard real time requirements has no
> > business being on a statistically multiplexed network, no
> > matter what the multiplexing fabric is).
> Such as voice? Why do you think SDM was created in the first place?
> Or do you mean like a military application, 2ms to respond to a nuke....
> That is when channel priorities come into play.
I wasn't around for the invention of statistical muxing,
but I'm sure there are some people here who could clarify
with first-hand knowledge (and I'll take email from you,
thanks :) ). If it was created for doing voice, I'm going
to be surprised, because none of the voice literature I've
ever looked at was anything but circuit-modeled with TD
muxing of DS0s because that is how God would design a
Um, ok, why is it my day for running into arguments about
real time. Hmm...
"Real time" events are those which must be responded to
by a deadline, otherwise the value of the response decays.
In most real time applications, the decay curve varies
substantially with the value of a response dropping to
zero after some amount of time. This is "soft real
time". "Hard real time" is used when the decay curve is
vertical, that is, if the deadline is passed the response
to the event is worthless or worse.
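
One way to picture the two decay curves, as a minimal sketch:
the linear decay and the grace_ms parameter are assumptions
chosen for illustration (real soft real time curves come in
many shapes), and the hard case is simplified to "worthless"
rather than "worse".

```python
def hard_real_time_utility(latency_ms, deadline_ms):
    """Vertical decay curve: full value before the deadline, none after."""
    return 1.0 if latency_ms <= deadline_ms else 0.0


def soft_real_time_utility(latency_ms, deadline_ms, grace_ms):
    """Full value until the deadline, then (assumed) linear decay to zero."""
    if latency_ms <= deadline_ms:
        return 1.0
    late_by = latency_ms - deadline_ms
    return max(0.0, 1.0 - late_by / grace_ms)


# A response 5 ms late against a 10 ms deadline:
print(hard_real_time_utility(15, 10))      # 0.0
print(soft_real_time_utility(15, 10, 10))  # 0.5
```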
There are very few hard real time things out there.
Anything that is truly in need of hard real time response
should not be done on a statmuxed network or on a wide PDH
network (especially not since PDH is dead in large part
because the propagation delay is inconsistent and
unpredictable thanks to bitstuffing) unless variance in
propagation delay is less than the window for servicing
the hard real time event.
Soft real time things can be implemented across a wide
variety of unpredictable media depending on the window
available to service the real time events and the slope of
the utility decay function.
For instance, interactive voice and video have a number of
milliseconds leeway before a human audience will notice
lag. Inducing a delay to avoid missing the end of the
optimal window for receiving in-sequence frames or blobs
of compressed voice data is wise engineering, particularly
if the induced delay is adjusted to avoid it itself
leading to loss of data utility.
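
A minimal sketch of that kind of receiver-side induced delay,
assuming a nearest-rank percentile over recently observed
one-way delays; the sample delays are invented, and a real
implementation would adapt the budget over time rather than
compute it once.

```python
def playout_delay(delays_ms, percentile=95):
    """Nearest-rank percentile of the observed one-way delays."""
    ordered = sorted(delays_ms)
    # Integer ceiling of percentile * n / 100, avoiding float rounding.
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def schedule(delays_ms, percentile=95):
    """Split frames into those played out in time and outliers thrown away."""
    budget = playout_delay(delays_ms, percentile)
    played = [d for d in delays_ms if d <= budget]
    dropped = [d for d in delays_ms if d > budget]
    return budget, played, dropped


# Hypothetical delay samples in ms: mostly tight, one ugly outlier.
samples = [40] * 10 + [45] * 5 + [50] * 4 + [300]
budget, played, dropped = schedule(samples)
print(budget, len(played), dropped)  # 50 19 [300]
```

The receiver holds every frame until 50 ms after it was sent,
so everything but the 300 ms straggler plays out in sequence.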
> However, I have NEVER failed to get the bandwith
> "promised" in our nets.
Sure, the problem is with mixing TCP and other window-based
congestion control schemes which rely on implicit feedback
with a rate-based congestion control scheme, particularly
when it relies on explicit feedback. The problem is
exacerbated when the former overlaps the latter, such that
only a part of the path between transmitter and receiver
is congestion controlled by the same rate-based explicit
What happens is that in the presence of transient
congestion, unless timing is very tightly synchronized
(Van Jacobson has some really entertaining rants about
this) the "outer loop" will react by either hovering
around the equivalent of the CIR or by filling the pipe
until the rate based mechanism induces queue drops.
In easily observable pathological cases there is a stair
step or vacillation effect resembling an old TCP sawtooth
pattern rather than the much nicer patterns you get from a
modern TCP with FT/FR/1323 timestamps/SACK.
In other words your goodput suffers dramatically.
> But, doesn't that same thing happen when you over-run the receiving
> router ?????
Yes, and with OFRV's older equipment the lack of decent
buffering (where decent output buffering is, per port,
roughly the bandwidth * delay product across the network)
was obvious as bandwidth * delay products increased.
With this now fixed in modern equipment and WRED
available, the implicit feedback is not so much dropped
packets as delayed ACKs, which leads to a much nicer
subtractive slow-down by the transmitter, rather than a
multiplicative backing off.
So, in other words, in a device properly designed to
handle large TCP flows, you need quite a bit of buffering
and benefit enormously from induced early drops.
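
For a sense of scale, here is the bandwidth * delay arithmetic
as a small sketch. Treating "delay" as a 70 ms round-trip time
is an assumption for illustration, as are the example link
rates.

```python
def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth * delay product in bytes (bits/s * ms / 8000)."""
    return bandwidth_bps * rtt_ms // 8000


# Assumed example links, each with a 70 ms round-trip time.
print(bdp_bytes(44_736_000, 70))   # DS3:  391440 bytes
print(bdp_bytes(155_520_000, 70))  # OC-3: 1360800 bytes
```

So a port at OC-3 rates wants on the order of a megabyte or
more of output buffering under these assumptions.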
As a consequence, when the path between transmitter and
receiver uses proper, modern routers, buffer overruns
should never happen in the face of transient congestion.
Unfortunately this is easily seen with many popular
rate-based congestion-control schemes as they react to
Finally another ABR demon is in the decay of the rate at
which a VS is allowed to send traffic, which in the face
of bursty traffic (as one tends to see with most TCP-based
protocols) throttles goodput rather dramatically. Having
to wait an RTT before an RM cell returns tends to produce
unfortunate effects, and the patch around this is to try
to adjust the scr contract to some decent but low value
and assure that there is enough buffering to allow a VS's
burst to wait to be serviced and hope that this doesn't
worsen the bursty pattern by bunching up a lot of data
until an RM returns allowing the queue to drain suddenly.
> Ahhh.. We await the completion, and proper interaction of RM, ILMI,
> and OAM.
> These will, (and in some cases already DO), provide that information
> back to the router/tag switch.
> Now do they use it well ?????
> That is a different story....
The problem is that you need the source to slow
transmission down, and the only mechanism to do that is to
delay ACKs or induce packet drops. Even translating
FECN/BECN into source quench or a drop close to the source
is unhelpful since the data in flight will already lead to
feedback which will slow down the source.
The congestion control schemes are essentially
> > Delay across any fabric of any decent size is largely
> > determined by the speed of light.
> Where in the world does this come from in the industry.
> Maybe I am wrong, but Guys, do the math. The typical run across the
> North American Continent
> is timed at about 70ms. This is NOT being limited by the speed of
That would be round-trip time.
> Light can travel around the world 8 times in 1 second. This means it
> can travel
> once around the world (full trip) in ~ 120 ms. Milliseconds, not
> So, why does one trip across North america take 70ms...
Light is slower in glass.
> Hint, it is not the speed of light. Time is incurred encoding, decoding,
> and routing.
Kindly redo your calculation with a decent speed of light
value. Unfortunately there is no vacuum between something
in NYC and something in the SF Bay area.
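
A back-of-the-envelope version of that calculation, as a
sketch. The refractive index of 1.47 is a typical figure for
silica fibre, 4130 km is an approximate great-circle distance
between the two cities, and the 1.5 route factor is a pure
guess at how much longer real fibre routes are than the great
circle; all three are assumptions, not measurements.

```python
C_VACUUM_KM_S = 299_792.458  # speed of light in vacuum


def one_way_ms(distance_km, refractive_index=1.47):
    """One-way propagation delay through glass, in milliseconds."""
    speed_km_s = C_VACUUM_KM_S / refractive_index
    return distance_km / speed_km_s * 1000


great_circle_km = 4130  # approximate NYC to SF
route_factor = 1.5      # hypothetical: fibre does not follow the great circle

rtt_ideal = 2 * one_way_ms(great_circle_km)
rtt_routed = 2 * one_way_ms(great_circle_km * route_factor)
print(f"great-circle RTT in glass: {rtt_ideal:.1f} ms")
print(f"RTT with route factor:     {rtt_routed:.1f} ms")
```

Under these assumptions propagation alone accounts for most of
a 70 ms round trip before any queueing or forwarding delay is
counted.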
> BTW this (70ms median across the US) comes from a
> predominantly ATM network. Actually, I am quoting
Oh now THERE's a reliable source. "Hi my name is Frank
and ATM will Just Work. Hi my name is Warren and ATM is
fantastic.". Ugh. (kent bait kent bait kent bait)
> > Therefore, unless ABR
> > is deliberately inducing queueing delays, there is no way
> > your delay can be decreased when you send lots of traffic
> > unless the ATM people have found a way to accelerate
> > photons given enough pressure in the queues.
> More available bandwidth = quicker transmission.
> Ie: at 1000kb/s available, how long does it take to transmit 1000kb ? 1
> Now, at 2000kb/s available, how long does it take ? 1/2 second.
> What were you saying ?
At higher bandwidths bits are shorter not faster.
Repeat that several times.
Whether you are signalling at 300bps or at
293875983758917538924372589bps, the start of the first bit
arrives at the same time.
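
The same point as arithmetic, in a small sketch: total delay is
propagation plus serialization, and only the second term
depends on the line rate. The 4000 km path, the 200,000 km/s
signal speed and the 1500-byte packet are assumed round
numbers.

```python
def first_bit_arrival_s(distance_km, speed_km_s=200_000):
    """When the start of the first bit arrives; independent of line rate."""
    return distance_km / speed_km_s


def last_bit_arrival_s(distance_km, size_bits, rate_bps, speed_km_s=200_000):
    """Propagation delay plus the time needed to clock the packet out."""
    return distance_km / speed_km_s + size_bits / rate_bps


distance_km = 4000
packet_bits = 1500 * 8

for name, rate_bps in [("T1", 1_544_000), ("OC-3", 155_520_000)]:
    first = first_bit_arrival_s(distance_km)
    last = last_bit_arrival_s(distance_km, packet_bits, rate_bps)
    print(f"{name}: first bit {first * 1000:.2f} ms, last bit {last * 1000:.2f} ms")
```

More bandwidth shrinks the gap between first and last bit; it
does nothing to the 20 ms the first bit spends in transit.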
> Why do you think you have "centi"-second delays in the first place.
Because photons and electrons are slow in glass and copper.
> I would check yours, but I find time for a packet to
> cross a router > backplane to be < 1ms, route >
> determination in a traditional router can take up to
> 20 ms (or more), > and slightly less than a 1 ms, >
> if it is in cache. When I said cross a backplane, I
> meant "From > hardware ingress to egress", ie to be
You are still stuck thinking of routers as things which
demand-fill a cache by dropping a packet through a slow
path. This was an artefact of OFRV's (mis)design, and the
subject of many long and interesting rants by Dennis
Ferguson on this list a couple of years ago.
Modern routers simply don't do this, even the ones from OFRV.
> traceroute to cesium.clock.org (22.214.171.124), 30 hops max, 40 byte
> 6 core2-fddi3-0.san-francisco.yourtransit.net (-.174.56.2) 567 ms
> 154 ms
> 292 ms
> >>>>>>>>>>>>>> Tell me this is a speed of light issue.
> >>>>>>>>>>>>>> From the FDDI to the HSSI on the same router.
This has nothing to do with the router's switching or
route lookup mechanism. Router requirements allow routers
to be selective in generating ICMP messages, and cisco's
implementation on non-CEF routers will hand the task of
generating ICMP time exceededs, port unreachables and echo
replies to the main processor, which gets to the task as a
low priority when it's good and ready. If the processor
is doing anything else at the time you get rather long
delays in replies, and if it's busy enough to start doing
SPD you get nothing.
This gets talked about quite frequently on the NANOG
list. I suggest you investigate the archives. I'm sure
Michael Dillon can point you at them. He's good at that.
> PING cesium.clock.org (126.96.36.199): 56 data bytes
> 64 bytes from 188.8.131.52: icmp_seq=0 ttl=243 time=93 ms
> 64 bytes from 184.108.40.206: icmp_seq=1 ttl=243 time=78 ms
> 64 bytes from 220.127.116.11: icmp_seq=2 ttl=243 time=79 ms
> 64 bytes from 18.104.22.168: icmp_seq=3 ttl=243 time=131 ms
> 64 bytes from 22.214.171.124: icmp_seq=4 ttl=243 time=78 ms
> 64 bytes from 126.96.36.199: icmp_seq=5 ttl=243 time=81 ms
> 64 bytes from 188.8.131.52: icmp_seq=6 ttl=243 time=75 ms
> 64 bytes from 184.108.40.206: icmp_seq=7 ttl=243 time=93 ms
> Nice and stable, huh. If this path were ATM switched (Dorian, I will
> respond to you in another post)
> it would have settled to a stable latency.
There is extraordinary congestion in the path between your
source and cesium.clock.org, and cesium is also rather
busy being CPU bound on occasion. There is also a
spread-spectrum radio link between where it lives (the
land of Vicious Fishes) and "our" ISP (Toad House), and
some of the equipment involved in bridging over that is
If you were ATM switching over that link you would see the
same last-hop variability because of that physical level
It works great for IP though, and I quite happily am
typing this at an emacs thrown up onto my X display in
Scandinavia across an SSH connection.
> Flow switching does a route determination once per flow, after that
> the packets are switched down a predetermined path "The Flow". Hence the
> term "flow switching". This reduces the variability of
> the entire flow.
Um, no it doesn't. As with all demand-cached forwarding
schemes you have to process a packet heavily when you have
a cache miss. Darren Kerr did some really neat things to
make it less disgusting than previous demand-cached
switching schemes emanating out of OFRV, particularly
with respect to gleaning lots of useful information out of
the side-effects of a hand-tuned fast path that was
designed to account for all the header processing one
Flow switching does magic matching of related packets to
cache entries which describe the disposition of the packet
that in previous caching schemes could only be determined
by processing individual packets to see if they matched
various access lists and the like. Its principal neat
feature is that less per-packet processing means more pps
MPLS is conceptually related, btw.
Flow switching does not improve queueing delays or speed
up photons and electrons, however, nor does it worsen
them, therefore the effect of flow switching on
variability of normal traffic is nil.
Flow switching has mechanisms cleverer than Leon the
Cleaner to delete entries from the cache and consequently
there are much reduced odds of a cache fault during a
long-lived flow that is constantly sending at least
occasional traffic. You may see this as reducing
variability. I see it as fixing an openly-admitted design flaw.
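
A toy comparison of the two cache-cleaning styles. The idle
timeout here is a hypothetical stand-in for "cleverer
mechanisms" (the actual flow-switching expiry logic is not
described above), and the tick values are invented.

```python
def slow_path_hits_periodic_flush(arrivals, flush_interval):
    """Cache entry is blown away every flush_interval ticks, used or not."""
    hits = 0
    cached_epoch = None
    for t in arrivals:
        epoch = t // flush_interval
        if cached_epoch != epoch:  # entry was flushed since the last packet
            hits += 1              # this packet takes the slow path
            cached_epoch = epoch
    return hits


def slow_path_hits_idle_timeout(arrivals, idle_timeout):
    """Cache entry is removed only after idle_timeout ticks without traffic."""
    hits = 0
    last_seen = None
    for t in arrivals:
        if last_seen is None or t - last_seen > idle_timeout:
            hits += 1
        last_seen = t
    return hits


# A long-lived flow sending one packet every 5 ticks for 600 ticks.
arrivals = list(range(0, 600, 5))
print(slow_path_hits_periodic_flush(arrivals, flush_interval=60))  # 10
print(slow_path_hits_idle_timeout(arrivals, idle_timeout=30))      # 1
```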
> However, I should also point out that much of your
> argument is based in TCP. Most multimedia
> (Voice/Audio/Video) content does not focus on TCP, but
> UDP/Multicast. What does your slow start algorithm get
> you then ?
WRED and other admissions control schemes are being
deployed that will penalize traffic that is out of
profile, i.e., that doesn't behave like a reasonable TCP
behaves. Most deployed streaming technologies have taken
beatings from ISPs (EUNET, for example with CUSEEME,
Vocaltec and Progressive with a wide range of ISPs) and
have implemented congestion avoidance schemes that closely
mimic TCP's, only in some cases there is no retransmission
> PS MAC Layer Switching, and ATM switching are apples and oranges.
> Although, one could be used to do the other.
> (Told you Dorian)
P.S.: You have some very entertaining and unique expansions
of a number of acronyms in general and relating to
ATM in particular. I'm curious how you expand "PDH".
Personally I favour "Pretty Damn Historical",
although other epithets come to mind.
P.P.S.: It's dead.