Ungodly packet loss rates

Tue Oct 22 01:19:50 UTC 1996

Okay.  I'll hopefully solve all your problems in one paragraph, vs. your
bible of a well-written critique of the Internet:

You get what you pay for or you pay what you get for.  Since either you or
Cisco made a substandard choice, obviously one of you needs to seek legal
action immediately.  I can, for a fee of course, recommend a good attorney.
If you realized that by paying less you should obviously get less, as most
people understand, then you wouldn't have bitched and moaned to the entire
NANOG community.  Of course, being the good Samaritan that I am, I can end
your troubles with a free dial-up account on our network, whereby I make
no guarantees of the performance, and of course this account would be free.
The only agreement I attach to this deal is that you take this thread off-line
and to the courts.

Rob
Exodus Communications Inc.

> [Quotes mercilessly reordered]
> 
> I'm amazed at the attitude I'm getting from this list. You are,
> collectively, in the business of running a large network. I am a
> paying user of that network. The network is not delivering appropriate
> performance, as measured most importantly by the time I and others
> spend waiting around for characters to echo, Web pages to display, and
> whatnot. This time is long far more often than it's historically
> been, and far more often than a reasonable person might expect.
> 
> Although my immediate complaint is prompted by a specific incident,
> such incidents are so common as to constitute a continuing, pervasive
> pattern. Because of the structure of the network, this pattern affects
> customers of all providers, not just the immediately responsible
> ones. Although many problems do exist at user sites, it's clear that
> many problems also exist within the network itself.
> 
> So I complain, and suggest that you should look into reducing network
> growth to a level you can really manage, and setting standards of
> performance for yourselves and one another.
> 
> Do you say "Yes, that's a good idea"? No. Do you say "No, that won't
> work because <x>"? No. Do you say "We think we have a handle on the
> problem, and you can expect it to go away soon"? No. Do you say "We
> don't think we can make the problem go away no matter what we do, so
> we'll try to do a better job of explaining the expected level of
> service to new users (and to old users who are losing the level of
> service they've been used to)"? No. Do you refer me to some existing
> document, prepared either by my own ISP or by NANOG or some other
> group, describing the quality of service I'm to expect, and point out
> to me that what I'm asking for is more than it guarantees? No.
> 
> As far as I can tell, nobody's acknowledged that there's a problem.
> You really seem to believe that the quality of service provided over
> the Internet as a whole, as opposed to within any particular
> provider's network, is acceptable.
> 
> What I hear is "Quit whining", or in one case, "Quit whining, idiot".
> 
> mrbill> No, I believe the person who recommended that suggested you shop around
> mrbill> for the best provider *to start out with*, not bitch, whine, and moan
> mrbill> when your connection is not 100% perfect through the one you 
> mrbill> currently have.
> 
> I think there's a big difference between complaining about a
> connection "not [being] 100% perfect" and complaining about a huge
> packet loss rate making a path (and indeed all paths between me and at
> least one very major network) nearly unusable. There's even more of a
> difference between complaining about a single incident of such a loss
> rate and complaining about a pervasive pattern of such incidents.
> 
> Are you saying that I should accept bursty periods of 10-second
> character echo times, continuing for 4 or 5 days? I'm sorry, but that
> sort of congestion inside a network backbone demonstrates gross
> overload. It takes a lot to drive a network to that point in the
> presence of TCP congestion avoidance, even with lots of short
> connections.
> 
> Are you suggesting that I find a provider that never gives me a path
> through a congested network? I'm sorry, but given the number of
> congested networks out there, and how quickly the congestion moves
> around, and the plain fact that some sites are connected *via*
> congested networks, I don't believe that's possible.
> 
> I also think it's unreasonable to expect users to choose their providers
> based on which sites they're communicating with. Users should be able
> to expect acceptable levels of service to any site (yes, provided
> that site itself has adequate capacity). ISPs are in the business
> of providing usable service, not providing the service it's convenient
> for them to provide.
> 
> Take my own case. I didn't get this connection to let me talk to
> Cisco; I already had facilities for that. I got it for general access
> to various random stuff on the Net. Unless it gives me usable
> connectivity to the *whole* Net (including Cisco, but only
> incidentally), it's not doing what I bought it for... and it's not
> doing what the people I bought it from sell it for, either.
> 
> If I were going to put really heavy demands on the network, I could
> see being told I needed to connect somewhere close to my target.
> That's not what's going on here; we're talking about a TELNET
> connection. At a more basic level, if the Net can't be made usable
> for at least Web access from almost anywhere to almost anywhere, then
> what's the point of building it at all?
> 
> mrbill> I don't see where a temporary network problem such as you describe
> mrbill> should result in a message being sent to the various ISPs and the
> mrbill> NANOG list.
> 
> You misunderstand my point; the message wasn't really about the
> immediate problem; that was merely an example.
> 
> A problem with my own stuff caused me to really rely on services I've
> been paying for for a long time. When I started using those services
> for serious interactive work, they failed me, and they continued to
> fail me for several days. I was reminded of how bad things on the Net
> at large really were, and motivated to investigate what was going on
> in this particular case.
> 
> Having established to a reasonable degree of certainty that the
> problem isn't on my end and isn't on Cisco's end, and that the problem
> has gone on for several days, I feel justified in complaining to the
> ISPs involved.
> 
> As far as the question of the problem being temporary, well, yes, it's
> temporary. Everything is temporary. You and I are decidedly temporary.
> If "temporary" in this case were 10 seconds, I'd agree with you. 4
> days is, however, a ridiculously long lifetime for a double-digit drop
> rate in a major network backbone. When was the last time you saw a
> significant part of the telephone network become almost unusable for 4
> days?
> 
> Having seen similar problems all too often in the past, and having
> heard complaints about such problems from other users, I feel
> justified in recommending that an industry group, presumably concerned
> with quality of service, consider the matter.
> 
> The issue isn't this particular failure. The issue is the industry's
> inability to manage the network appropriately. If this were an isolated
> incident, it would be acceptable, if annoying. The fact is, however,
> that some large part of the network is either down or degraded almost
> all the time. I believe that the reason for that is that the network
> is being grown at a faster rate than the industry can coordinate
> properly.
> 
> Go Web surfing. Count the number of sites you can't reach when you
> *know* that the problem isn't local overloading at either end of the
> connection. Count the number of stalls you get when you're loading the
> pages that *do* work. Do you really consider that an appropriate level
> of service? Now multiply the annoyance factor by 10, and you'll get
> an idea of what it's like for interactive users.
> 
> mrbill> My suggestion:  quit bitching and wait for your FR connection to be
> mrbill> restored,
> 
> I beg your pardon, but I think I'm entitled to "bitch" whenever a
> service I'm paying for isn't being delivered in a satisfactory way.
> I assure you that I'd expect my provider to complain very loudly
> if I stopped paying my bills on time.
> 
> mrbill> or reconfigure your current equipment (if you work at Cisco,
> mrbill> it shouldn't be TOO hard).
> 
> Regardless of how hard it may or may not be, I shouldn't have to do
> it. I've paid for a service that *should*, if it were working
> properly, save me from having to do it. Your opinion as to whether I
> really need that service is irrelevant... and amazingly arrogant.
> 
> In this case, I'd have to either take down network services that some
> friends of mine depend on, or come up with another computer. Doing
> one or the other is the only way I can maintain the air gap between
> Cisco and the Internet.
> 
> Now, on technical issues (and my mistakes thereon):
> 
> mikedoug>  How in the hell can you expect a 100% success rate over (1) a slow
> mikedoug>  modem link, and (2) to *ANY* site in the world.  Hell, do you have
> mikedoug>  any *CLUE*--I know you don't--how many sites on the net have servers
> mikedoug>  behind 28.8 links???  How great a packet loss do you expect when you
> mikedoug>  access them??  Is that provider dependent???  *ANY* site--really?
> 
> Sigh. I have to admit that my language was wrong. When I said "any
> point" (I did not say "any site"), I meant "the edge of any ISP's
> network". Any IP path with a double-digit loss rate (or, generally,
> any single link with, say, a 5 percent loss rate) is grossly
> overloaded, but I can only hold ISPs responsible for capacity planning
> out to the edges of their own networks. In the present case, most of
> the loss is being introduced in the middle of Alternet's DS3 backbone.
> 
> On a well-managed network, I can and should expect a loss rate just
> slightly above the rate intrinsic to TCP's flow control, given that
> the data traffic is overwhelmingly TCP. I don't know what the
> intrinsic rate is, but--
> 
>   1. I'd be pretty confident in guessing it's less than 5 percent.
> 
>   2. It's a *lot* less than 40 percent. It's a lot less than 20 percent.
> 
>   3. It doesn't create gross degradation of interactive service.
> 
> As I realized shortly after I sent my message, 1 percent really
> *isn't* a reasonable expectation for a TCP/IP loss rate, since TCP
> uses packet loss as a flow-control feedback mechanism, and will force
> the loss rate along any path above 1 percent. My only excuse for this
> error is that the networks I used to work with were either run in
> uncongested mode (not as uncommon as you might think), or were not
> pure IP networks. At the time, most hosts had even worse congestion
> response than they have now, and you had to overengineer the network
> if you wanted it to work right.
> 
> As for the rest...
> 
> jbash> > It doesn't look to me as though the loss is being introduced at the
> jbash> > NAPs. If you look at the trace, you'll see that significant loss
> jbash> > starts to appear within Alternet, well after MAE-west. It looks as
> jbash> > though more loss appears inside BBN's network, although it's difficult
> jbash> > to tell because of the already large Alternet loss.
> 
> mrbill> Traceroute is *not* a good tool to diagnose packet loss problems.
> mrbill> I've had traceroute tell me that a packet loss problem was between
> mrbill> two points 3-4 hops "out", when actually it was with the T-1 at 
> mrbill> my site, the "first hop" in the trace.
> 
> emv> Traceroute is less useful a tool than you think in the face of congestive
> emv> loss.  Routers can and do selectively prioritize queued packets
> emv> based on their type, and if I were a network operator I would have no
> emv> hesitation about dropping traceroute or ping packets to low priority.
> 
> Unfortunately, traceroute is what's available. Ed's point about
> priority queues (and fair queues, and whatever else is out there this
> week) is a good one, and I withdraw the assertion that the loss rate
> is 40 percent; obviously I can't really trust the absolute loss rates
> I get from ping and traceroute. Again, I plead rustiness (or maybe
> complete obsolescence)... my real-world experience predates useful
> priority queueing.
> 
> The TCP connection itself reports about a 20 percent retransmission
> rate in one direction, and that may be a more reasonable estimate of
> the actual loss than the 40 percent I get from ping and traceroute.
> 
> Given enough probes, however, traceroute should still show
> discontinuities in packet loss at congestion points. I think I was
> doing enough probes... 25 per hop, and the trace I sent wasn't the
> only one I took.
> 
> In fact, I now have confirmation that most, or maybe all, of my loss
> is (or maybe was... loss is down quite a bit as I write this) being
> caused by a major overload on a link inside Alternet's
> backbone. Apparently some kind of routing reconfiguration (possibly by
> a third party) at MAE-west dumped a lot of traffic into an Alternet
> DS3 that wasn't overloaded before.
> 
> None of which is really relevant to the basic problem, which is that
> this service level makes interactive sessions nearly unusable, and even Web
> access a bit painful... regardless of where the drops happen.
> 
> 
> 				-- J. Bashinski
>
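
The quoted message argues that, with enough probes per hop (25 in that
case), traceroute should still show a discontinuity in loss at a congestion
point, even though the absolute loss figures can't be trusted when routers
deprioritize probe packets. A minimal sketch of that reasoning follows. The
per-hop loss counts are hypothetical, not taken from the trace discussed
here, and the function names are illustrative only.

```python
from math import sqrt

def loss_rates(sent, lost_per_hop):
    """Return the fraction of probes lost at each hop."""
    return [lost / sent for lost in lost_per_hop]

def largest_jump(rates):
    """Return (hop_index, increase) for the biggest rise in loss
    between one hop and the next (hop_index is zero-based)."""
    jumps = [(i, rates[i] - rates[i - 1]) for i in range(1, len(rates))]
    return max(jumps, key=lambda j: j[1])

def standard_error(rate, sent):
    """Binomial standard error of a loss estimate from `sent` probes."""
    return sqrt(rate * (1 - rate) / sent)

# Hypothetical data: probes lost out of 25 at each of six hops.
PROBES_PER_HOP = 25
lost = [0, 0, 1, 9, 10, 10]

rates = loss_rates(PROBES_PER_HOP, lost)
hop, jump = largest_jump(rates)
se = standard_error(rates[hop], PROBES_PER_HOP)

print(f"largest increase in loss: {jump:.2f}, arriving at hop {hop + 1}")
print(f"standard error of the loss estimate there: {se:.3f}")
```

With only 25 probes the per-hop estimate is coarse, so a step has to be
large relative to the standard error before it says much. What matters is
where the step occurs, not how big the reported loss is.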