Ungodly packet loss rates

Wed Oct 23 17:47:27 UTC 1996

Independent of the fact that the traceroute shows that packets are not
taking the most direct route (both Best and BBN are at Mae-West), your
complaint about high loss is valid.

This is an industry problem, but the problem is not as universal as
you may think.  Merit has been measuring loss NAP to NAP.  We've been
looking at this data and trying to provide useful summaries.  It's kind
of drafty since Merit's data collection is not so good, but you can
get a vague idea of the conditions from the page:

  (This page is temporary - don't point to it.)
  http://www.brookfield.ans.net/ans/netnow/netnow.html

We're not sure of the accuracy of this data or the statistical
validity.  In particular 1) Merit's method of storage doesn't provide
markers between data sets, 2) cumulative probabilities of 20 packet
samples are fairly worthless for looking at the probability of losses
below 5%, since a single lost packet out of 20 is already 5%.  (The
perl programs are all in the same directory so if you see any bugs
please tell me. :)

\begin{aside} 
Probability of 1% to 4% loss could be estimated by assuming data sets
are uncorrelated.  For example, the chance of 2.5% loss or more P(2.5)
is (1 - ((1 - P(5))**2)) * (1 - P(10)), or in English the estimated
probability that one of two samples is zero times the probability that
the other point is no worse than 5%.  Or something like that.  Apply
Bayes' theorem of joint probabilities.  But is the data uncorrelated?
\end{aside}
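
As a rough sketch of the arithmetic in the aside, and only under the
independence assumption questioned there: one 20 packet sample can
report loss only in 5% steps, so pooling two samples gives 40 packets
and a 2.5% step.  The function names and the input values are made up
for illustration; they are not Merit's data or the perl programs.

```python
def sample_resolution(packets_per_sample: int) -> float:
    """Smallest nonzero loss fraction a single sample can report."""
    return 1.0 / packets_per_sample


def p_any_loss_in_pair(p5: float) -> float:
    """Chance that at least one of two independent samples shows loss.

    p5 is the per-sample probability of 5% loss or more, i.e. of at
    least one lost packet in 20.  At least one loss in the pooled 40
    packets is the same as 2.5% loss or more over the pair.
    """
    return 1.0 - (1.0 - p5) ** 2


def aside_estimate(p5: float, p10: float) -> float:
    """The expression exactly as written in the aside (not a derivation)."""
    return (1.0 - (1.0 - p5) ** 2) * (1.0 - p10)


# Hypothetical inputs, chosen only to show the arithmetic.
p5, p10 = 0.2, 0.1
print("resolution of one sample:", sample_resolution(20))
print("P(any loss in two samples):", p_any_loss_in_pair(p5))
print("aside's expression:", aside_estimate(p5, p10))
```

If the samples are correlated (loss tends to come in bursts), the
first factor overstates how much two samples tell you beyond one.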

Curtis

ps- wrt what's being done about it, this is an attempt to provide
insights into loss rates as per your point #2 but more for the purpose
of meaningful measure than any kind of guarantee.

   2. Develop meaningful quality-of-service standards that can be used
      to guarantee reasonable performance in terms of end-to-end drop
      rates, delays, and downtime.

Though our loss is among the lowest, we're adding circuits to bring it
back so it is zero most of the time, where it used to be and where it
belongs.  I guess this is the reason for the "state of the Internet"
segment of NANOG tomorrow.  Other providers can speak for themselves.

pps- It's been a long enough thread already.  Sorry to add to the
thread.  :(

------- Forwarded Message

To: noc at tlg.net, help at uunet.uu.net, nanog at merit.net, ops at bbnplanet.com
Cc: barb at velvet.com, ianp at darktower.demon.co.uk
From: jbash at velvet.com
Subject: Ungodly packet loss rates
Date: 	Mon, 21 Oct 1996 10:42:00 -0700
Message-Id: <96Oct21.104256-0700pdt.18972-3+3 at blue.velvet.com>
Sender: owner-nanog at merit.edu

[Resent... I stupidly used the wrong address for the NANOG list]

This is being sent to the "help-line" addresses of several Internet
providers because they're not providing what I consider appropriate
service. It's being sent to the NANOG mailing list because it
represents what I believe to be an industry-wide problem.

I'm just a lowly end user, and perhaps I shouldn't intrude into the
councils of the Wise and the Great, but this is just a bit
ridiculous. Attached is a traceroute from my home machine to the
system I'm trying to work on over a TELNET session. It looks like
there's about a 40-percent overall round-trip loss rate, most or all
of it apparently introduced in the Alternet and BBN Planet backbones.
This is not a transient condition; it's been going on for at least
several days, and similar things happen all the time.

I think we can all agree that a 40-percent loss rate isn't an
acceptable level of service in an IP network. It's certainly making it
annoying and frustrating for me to try to work. It's also driving up
the load on the network by provoking retransmissions. A corporate
internal network running at that loss rate would probably be considered
to be in collapse.

I pay TLGnet (now Best) an agreed-upon amount of money every month,
nominally in exchange for a reasonable level of Internet service. I
think that part of TLGnet's obligation under that arrangement is to
contract for reliable backbone service. Likewise, the other end of the
path (Cisco Systems, for whom I am emphatically not, *not*, *NOT*
speaking here) pays BBN what I suspect to be a very large amount of
money indeed for DS3 service, presumably in the expectation that most
of the packets that go into the DS3 will come out of the network
somewhere. Alternet presumably has agreements with both TLGnet and
BBN. That puts everybody on the hook.

I fully understand that it's difficult to provide reliable service in
an exponentially-growing network. I'm aware that everybody's already
using the fastest lines they can get, and connecting the fastest
routers to them. I know that links are being added. I appreciate that
both lines and equipment are very expensive, and that adding lines
serves to complicate an already amazingly complex router configuration
situation. I understand that cash-flow issues (as well as
convincing-the-bean-counters issues) are involved. I sympathize...

... but the fact remains that I'm not getting the level of service I
think I'm entitled to, nor are other end users. Not only that, but if
the level of service gets any lower, the Net will become so painful
to use that I'll start wondering why I bother. While reducing my Net
use might be good for my mental health, I don't think anybody wants to
see users abandoning the Net because of poor service.

So, what's to be done about it? Assuming that all technical means are
being pursued, and from what I've seen on various mailing lists I
believe they probably are, the only thing left is a management
fix. May I make the probably-sacrilegious suggestion that the industry
as a whole, and the providers I've mentioned in particular, show
greater concern for the quality of service provided, and
specifically--

   1. Stop taking on new customers (or other traffic sources) until
      existing customers can be provided with an appropriate level of
      service.

   2. Develop meaningful quality-of-service standards that can be used
      to guarantee reasonable performance in terms of end-to-end drop
      rates, delays, and downtime.

   3. Reexamine both pricing levels and the Internet pricing model, to
      make sure that there's enough money available to fund a usable
      level of service.

Yes, this means giving up some business. That's one of the costs of
honoring your agreements... and of not alienating an entire generation
of customers.

Thank you for your attention. Although I usually scan at least the
subject lines of messages sent to the NANOG list, I'm temporarily
without access to the news server on which I ordinarily read the
list. For the next few days, I won't be able to answer replies not sent
to me directly.

					-- J. Bashinski

blue% traceroute -a -q 25 -Q champagne.cisco.com
traceroute to checkpoint-sj.cisco.com (171.69.10.37), 30 hops max, 40 byte packets
 1  tongue.velvet.com (206.14.77.65)  (2.8 ms/3.6 ms(+-0.9 ms)/15.8 ms) 25/25 (100.00%)
 2  tlg-cust-link.tlg.net (140.174.151.93)  (39.0 ms/47.5 ms(+-10.3 ms)/134.8 ms) 25/25 (100.00%)
 3  mae-west.tlg.net (198.32.136.22)  (40.7 ms/45.6 ms(+-9.2 ms)/57.0 ms) 25/25 (100.00%)
 4  905.Hssi3-0.GW1.SCL1.ALTER.NET (137.39.133.89)  (43.9 ms/49.1 ms(+-9.9 ms)/60.8 ms) 25/25 (100.00%)
 5  Fddi0-0.CR1.SCL1.Alter.Net (137.39.19.5)  (43.5 ms/48.8 ms(+-9.8 ms)/57.0 ms) 25/25 (100.00%)
 6  Hssi3-0.San-Jose3.CA.Alter.Net (137.39.100.1) * *  (45.3 ms/52.5 ms(+-11.1 ms)/77.6 ms) 23/25 (92.00%)
 7  Fddi0-0.San-Jose6.CA.Alter.Net (137.39.27.12) *  (44.4 ms/49.6 ms(+-10.2 ms)/60.8 ms) 24/25 (96.00%)
 8  Hssi1-0.Palo-Alto2.CA.ALTER.NET (137.39.101.162)  (46.7 ms/52.6 ms(+-10.5 ms)/61.6 ms) 25/25 (100.00%)
 9  Fddi1-0.Palo-Alto3.CA.Alter.Net (137.39.47.7) * * * * *  (57.9 ms/82.0 ms(+-21.6 ms)/248.5 ms) 20/25 (80.00%)
10  decwrl.bbnplanet.net (198.32.176.5) * * * * * *  (50.4 ms/60.8 ms(+-14.0 ms)/68.9 ms) 19/25 (76.00%)
11  paloalto-br1.bbnplanet.net (4.0.1.57) * * * * * *  (52.7 ms/63.6 ms(+-14.7 ms)/82.3 ms) 19/25 (76.00%)
12  paloalto-cisco.bbnplanet.net (131.119.0.196) * * * * * * *  (57.0 ms/66.2 ms(+-15.7 ms)/80.7 ms) 18/25 (72.00%)
13  * 131.119.26.10 (131.119.26.10) * * * * *  (54.8 ms/65.6 ms(+-15.1 ms)/75.6 ms) 19/25 (76.00%)
14  sj-wall-2.cisco.com (192.31.7.34) * * * * *  (48.2 ms/77.2 ms(+-18.4 ms)/151.9 ms) 20/25 (80.00%)
15  * * * * * sj-eng-corp2.cisco.com (198.92.1.130) * * * * *  (58.5 ms/70.3 ms(+-18.2 ms)/81.2 ms) 15/25 (60.00%)
16  * eng-atm-gw2.cisco.com (171.69.4.129) * * * * * * * * *  (61.9 ms/67.1 ms(+-17.4 ms)/78.7 ms) 15/25 (60.00%)
17  sj-eng-corp1.cisco.com (171.69.5.10) * * * * * * *  (52.2 ms/64.9 ms(+-15.4 ms)/81.8 ms) 18/25 (72.00%)
18  checkpoint-sj.cisco.com (171.69.10.37) * * * * * * * * * * * * * * * *  (58.1 ms/78.6 ms(+-30.0 ms)/202.2 ms) 9/25 (36.00%)

------- End of Forwarded Message
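
For anyone who wants per-hop loss figures out of a trace like the one
above, here is a minimal sketch.  It assumes hop lines end in
"received/sent (percent)" exactly as in that output; the regular
expression and the names parse_hops and SAMPLE are illustrative, not
part of any traceroute distribution.  The sample text is three hop
lines copied from the trace.

```python
import re

# Hop lines look like:
#  " 6  host (1.2.3.4) * *  (min/avg(+-dev)/max) 23/25 (92.00%)"
HOP_RE = re.compile(
    r"^\s*(?P<hop>\d+)\s+.*?(?P<host>\S+) \((?P<addr>\d+\.\d+\.\d+\.\d+)\)"
    r".*?(?P<recv>\d+)/(?P<sent>\d+) \(\d+\.\d+%\)\s*$"
)


def parse_hops(text):
    """Return a list of (hop, host, received, sent, loss_fraction)."""
    hops = []
    for line in text.splitlines():
        m = HOP_RE.match(line)
        if m is None:
            continue  # header line, or anything not in the assumed format
        recv, sent = int(m.group("recv")), int(m.group("sent"))
        hops.append(
            (int(m.group("hop")), m.group("host"), recv, sent, 1.0 - recv / sent)
        )
    return hops


SAMPLE = """\
traceroute to checkpoint-sj.cisco.com (171.69.10.37), 30 hops max, 40 byte packets
 1  tongue.velvet.com (206.14.77.65)  (2.8 ms/3.6 ms(+-0.9 ms)/15.8 ms) 25/25 (100.00%)
13  * 131.119.26.10 (131.119.26.10) * * * * *  (54.8 ms/65.6 ms(+-15.1 ms)/75.6 ms) 19/25 (76.00%)
18  checkpoint-sj.cisco.com (171.69.10.37) * * * * * * * * * * * * * * * *  (58.1 ms/78.6 ms(+-30.0 ms)/202.2 ms) 9/25 (36.00%)
"""

for hop, host, recv, sent, loss in parse_hops(SAMPLE):
    print(f"{hop:2d} {host:30s} {recv}/{sent} loss {loss:.0%}")
```

Note that the percentage printed in the trace is the fraction of
probes answered, so loss is one minus that.  Figures at intermediate
hops also depend on how readily each router answers probes, so the
last hop is the most direct indication of round-trip loss.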