Ungodly packet loss rates

Tue Oct 22 00:47:00 UTC 1996

[Quotes mercilessly reordered]

I'm amazed at the attitude I'm getting from this list. You are,
collectively, in the business of running a large network. I am a
paying user of that network. The network is not delivering appropriate
performance, as measured most importantly by the time I and others
spend waiting around for characters to echo, Web pages to display, and
whatnot. These waits are long far more often than they've historically
been, and far more often than a reasonable person might expect.

Although my immediate complaint is prompted by a specific incident,
such incidents are so common as to constitute a continuing, pervasive
pattern. Because of the structure of the network, this pattern affects
customers of all providers, not just the immediately responsible
ones. Although many problems do exist at user sites, it's clear that
many problems also exist within the network itself.

So I complain, and suggest that you should look into reducing network
growth to a level you can really manage, and setting standards of
performance for yourselves and one another.

Do you say "Yes, that's a good idea"? No. Do you say "No, that won't
work because <x>"? No. Do you say "We think we have a handle on the
problem, and you can expect it to go away soon"? No. Do you say "We
don't think we can make the problem go away no matter what we do, so
we'll try to do a better job of explaining the expected level of
service to new users (and to old users who are losing the level of
service they've been used to)"? No. Do you refer me to some existing
document, prepared either by my own ISP or by NANOG or some other
group, describing the quality of service I'm to expect, and point out
to me that what I'm asking for is more than it guarantees? No.

As far as I can tell, nobody's acknowledged that there's a problem.
You really seem to believe that the quality of service provided over
the Internet as a whole, as opposed to within any particular
provider's network, is acceptable.

What I hear is "Quit whining", or in one case, "Quit whining, idiot".

mrbill> No, I believe the person who recommended that suggested you shop around
mrbill> for the best provider *to start out with*, not bitch, whine, and moan
mrbill> when your connection is not 100% perfect through the one you
mrbill> currently have.

I think there's a big difference between complaining about a
connection "not [being] 100% perfect" and complaining about a huge
packet loss rate making a path (and indeed all paths between me and at
least one very major network) nearly unusable. There's even more of a
difference between complaining about a single incident of such a loss
rate and complaining about a pervasive pattern of such incidents.

Are you saying that I should accept bursty periods of 10-second
character echo times, continuing for 4 or 5 days? I'm sorry, but that
sort of congestion inside a network backbone demonstrates gross
overload. It takes a lot to drive a network to that point in the
presence of TCP congestion avoidance, even with lots of short
connections.

Are you suggesting that I find a provider that never gives me a path
through a congested network? I'm sorry, but given the number of
congested networks out there, and how quickly the congestion moves
around, and the plain fact that some sites are connected *via*
congested networks, I don't believe that's possible.

I also think it's unreasonable to expect users to choose their providers
based on which sites they're communicating with. Users should be able
to expect acceptable levels of service to any site (yes, provided
that site itself has adequate capacity). ISPs are in the business
of providing usable service, not providing the service it's convenient
for them to provide.

Take my own case. I didn't get this connection to let me talk to
Cisco; I already had facilities for that. I got it for general access
to various random stuff on the Net. Unless it gives me usable
connectivity to the *whole* Net (including Cisco, but only
incidentally), it's not doing what I bought it for... and it's not
doing what the people I bought it from sell it for, either.

If I were going to put really heavy demands on the network, I could
see being told I needed to connect somewhere close to my target.
That's not what's going on here; we're talking about a TELNET
connection. At a more basic level, if the Net can't be made usable
for at least Web access from almost anywhere to almost anywhere, then
what's the point of building it at all?

mrbill> I don't see where a temporary network problem such as you describe
mrbill> should result in a message being sent to the various ISPs and the
mrbill> NANOG list.

You misunderstand my point. The message wasn't really about the
immediate problem; that was merely an example.

A problem with my own stuff caused me to really rely on services I've
been paying for for a long time. When I started using those services
for serious interactive work, they failed me, and they continued to
fail me for several days. I was reminded of how bad things on the Net
at large really were, and motivated to investigate what was going on
in this particular case.

Having established to a reasonable degree of certainty that the
problem isn't on my end and isn't on Cisco's end, and that the problem
has gone on for several days, I feel justified in complaining to the
ISPs involved.

As far as the question of the problem being temporary, well, yes, it's
temporary. Everything is temporary. You and I are decidedly temporary.
If "temporary" in this case were 10 seconds, I'd agree with you. 4
days is, however, a ridiculously long lifetime for a double-digit drop
rate in a major network backbone. When was the last time you saw a
significant part of the telephone network become almost unusable for 4
days?

Having seen similar problems all too often in the past, and having
heard complaints about such problems from other users, I feel
justified in recommending that an industry group, presumably concerned
with quality of service, consider the matter.

The issue isn't this particular failure. The issue is the industry's
inability to manage the network appropriately. If this were an isolated
incident, it would be acceptable, if annoying. The fact is, however,
that some large part of the network is either down or degraded almost
all the time. I believe the reason is that the network is being grown
faster than the industry can coordinate properly.

Go Web surfing. Count the number of sites you can't reach when you
*know* that the problem isn't local overloading at either end of the
connection. Count the number of stalls you get when you're loading the
pages that *do* work. Do you really consider that an appropriate level
of service? Now multiply the annoyance factor by 10, and you'll get
an idea of what it's like for interactive users.

mrbill> My suggestion:  quit bitching and wait for your FR connection to be
mrbill> restored,

I beg your pardon, but I think I'm entitled to "bitch" whenever a
service I'm paying for isn't being delivered in a satisfactory way.
I assure you that I'd expect my provider to complain very loudly
if I stopped paying my bills on time.

mrbill> or reconfigure your current equipment (if you work at Cisco,
mrbill> it shouldn't be TOO hard).

Regardless of how hard it may or may not be, I shouldn't have to do
it. I've paid for a service that *should*, if it were working
properly, save me from having to do it. Your opinion as to whether I
really need that service is irrelevant... and amazingly arrogant.

In this case, I'd have to either take down network services that some
friends of mine depend on, or come up with another computer. Doing
one or the other is the only way I can maintain the air gap between
Cisco and the Internet.

Now, on technical issues (and my mistakes thereon):

mikedoug>  How in the hell can you expect a 100% success rate over (1) a slow
mikedoug>  modem link, and (2) to *ANY* site on the world.  Hell, do you have
mikedoug>  any *CLUE*--I know you don't--how many sites on the net have servers
mikedoug>  behind 28.8 links???  How great a packet loss do you expect when you
mikedoug>  access them??  Is that provider dependent???  *ANY* site--really?

Sigh. I have to admit that my language was wrong. When I said "any
point" (I did not say "any site"), I meant "the edge of any ISP's
network". Any IP path with a double-digit loss rate (or, generally,
any single link with, say, a 5 percent loss rate) is grossly
overloaded, but I can only hold ISPs responsible for capacity planning
out to the edges of their own networks. In the present case, most of
the loss is being introduced in the middle of Alternet's DS3 backbone.
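
To put rough numbers on the difference between a per-link loss rate
and a whole-path loss rate, here is a minimal sketch in Python. It
assumes every link drops packets independently of the others, which
is a simplification (real congestive loss tends to be bursty and
correlated), and the function names are made up for illustration.

```python
from math import prod


def path_loss(link_losses):
    """Loss rate along a path, ASSUMING each link drops independently.

    link_losses is a sequence of per-link loss rates in [0, 1].
    A packet survives the path only if it survives every link.
    """
    return 1.0 - prod(1.0 - p for p in link_losses)


def links_until(threshold, per_link_loss):
    """Smallest number of identical links whose combined loss
    reaches the given threshold (both arguments are fractions)."""
    if not 0.0 < per_link_loss < 1.0:
        raise ValueError("per_link_loss must be strictly between 0 and 1")
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must be in [0, 1)")
    count = 0
    survive = 1.0  # probability a packet gets through the links so far
    while 1.0 - survive < threshold:
        survive *= 1.0 - per_link_loss
        count += 1
    return count
```

Under that independence assumption, two links at 5 percent each give
a path loss of 9.75 percent, and a third such link takes the path to
about 14.3 percent, which is one way to see why a 5 percent link and a
double-digit path are roughly the same kind of trouble.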

On a well-managed network, I can and should expect a loss rate just
slightly above the rate intrinsic to TCP's congestion control, given
that the data traffic is overwhelmingly TCP. I don't know what the
intrinsic rate is, but--

  1. I'd be pretty confident in guessing it's less than 5 percent.

  2. It's a *lot* less than 40 percent. It's a lot less than 20 percent.

  3. It doesn't create gross degradation of interactive service.

As I realized shortly after I sent my message, 1 percent really
*isn't* a reasonable expectation for a TCP/IP loss rate, since TCP
uses packet loss as its congestion-control feedback signal, and will
force the loss rate along any path above 1 percent. My only excuse for this
error is that the networks I used to work with were either run in
uncongested mode (not as uncommon as you might think), or were not
pure IP networks. At the time, most hosts had even worse congestion
response than they have now, and you had to overengineer the network
if you wanted it to work right.

As for the rest...

jbash> > It doesn't look to me as though the loss is being introduced at the
jbash> > NAPS. If you look at the trace, you'll see that significant loss
jbash> > starts to appear within Alternet, well after MAE-west. It looks as
jbash> > though more loss appears inside BBN's network, although it's difficult
jbash> > to tell because of the already large Alternet loss.

mrbill> Traceroute is *not* a good tool to diagnose packet loss problems.
mrbill> I've had traceroute tell me that a packet loss problem was between
mrbill> two points 3-4 hops "out", when actually it was with the T-1 at 
mrbill> my site, the "first hop" in the trace.

emv> Traceroute is less useful a tool than you think in the face of congestive
emv> loss.  Routers can and do selectively prioritize the queueing packets
emv> based on their type, and if I were a network operator I would have no
emv> hesitation about dropping traceroute or ping packets to low priority.

Unfortunately, traceroute is what's available. Ed's point about
priority queues (and fair queues, and whatever else is out there this
week) is a good one, and I withdraw the assertion that the loss rate
is 40 percent; obviously I can't really trust the absolute loss rates
I get from ping and traceroute. Again, I plead rustiness (or maybe
complete obsolescence)... my real-world experience predates useful
priority queueing.

The TCP connection itself reports about a 20 percent retransmission
rate in one direction, and that may be a more reasonable estimate of
the actual loss than the 40 percent I get from ping and traceroute.

Given enough probes, however, traceroute should still show
discontinuities in packet loss at congestion points. I think I was
doing enough probes... 25 per hop, and the trace I sent wasn't the
only one I took.
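
A minimal sketch of that kind of check, in Python. The hop names and
probe counts below are invented purely for illustration and are not
taken from any real trace; the 15-point jump threshold is likewise an
arbitrary assumption. The sketch also computes the binomial standard
error of each estimate, since with only 25 probes per hop a small
difference between adjacent hops means very little.

```python
from math import sqrt


def std_error(sent, lost):
    """Binomial standard error of the loss estimate lost/sent."""
    p = lost / sent
    return sqrt(p * (1.0 - p) / sent)


def find_jumps(hops, min_jump=0.15):
    """Return hops where loss rises sharply over the previous hop.

    hops is a list of (name, probes_sent, probes_lost) in path order.
    Each result is (name, previous_rate, rate). min_jump is an
    arbitrary, assumed threshold expressed as a fraction.
    """
    jumps = []
    previous = 0.0
    for name, sent, lost in hops:
        rate = lost / sent
        if rate - previous >= min_jump:
            jumps.append((name, previous, rate))
        previous = rate
    return jumps


# Invented example data: 25 probes per hop, loss appearing at hop-3.
example = [
    ("hop-1", 25, 0),
    ("hop-2", 25, 1),
    ("hop-3", 25, 6),
    ("hop-4", 25, 7),
]

for name, before, after in find_jumps(example):
    print(f"{name}: loss {before:.0%} -> {after:.0%}")
```

With these made-up numbers only hop-3 is flagged. At 6 losses out of
25 probes the standard error is about 8.5 percentage points, so a jump
is only worth taking seriously if it is large, persists at the hops
beyond it, and shows up again in repeated traces. None of this gets
around routers deprioritizing their own replies; it only helps locate
where the loss changes, not how large it really is.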

In fact, I now have confirmation that most, or maybe all, of my loss
is (or maybe was... loss is down quite a bit as I write this) being
caused by a major overload on a link inside Alternet's
backbone. Apparently some kind of routing reconfiguration (possibly by
a third party) at MAE-west dumped a lot of traffic into an Alternet
DS3 that wasn't overloaded before.

None of which is really relevant to the basic problem, which is that
this service level makes interactive sessions nearly unusable, and even Web
access a bit painful... regardless of where the drops happen.

				-- J. Bashinski