Shady areas of TCP window autotuning?

Mon Mar 16 18:37:31 UTC 2009

On Mon, Mar 16, 2009 at 09:09:35AM -0500, Leo Bicknell wrote:
> The result is that if the vendor targeted 100ms of buffer you now
> have 400ms of buffer, and really bad lag.

Well, this is one of the reasons why I hate the fact that we're
effectively stuck in a 1500 MTU world. My customers are vastly
concerned with the quantity of data they can transmit per unit of
latency. You may be more familiar with this as "throughput".
Customers beat us operators and engineers up over it every day. TCP
window tuning does help that if you can manage the side effects. A
larger default layer 2 MTU (why we didn't change this when GE came
out, I will never understand) would help even more by reducing the
total number of frames necessary to transmit a packet across a given
wire.
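
To put rough numbers on that, here is a back-of-the-envelope sketch.
The function names, the 40-byte header figure (IPv4 plus TCP with no
options), and the sample values are my own assumptions for
illustration, not measurements from any real network. It shows the two
effects: a single TCP flow can carry at most one window per round
trip, and a bigger MTU means fewer frames for the same payload.

```python
import math

def window_limited_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on one TCP flow: one full window per round trip."""
    return window_bytes * 8 / rtt_seconds

def frames_needed(payload_bytes, mtu_bytes, header_bytes=40):
    """Frames needed to carry a payload at a given MTU.

    header_bytes=40 assumes IPv4 (20) + TCP (20) with no options;
    layer 2 overhead is ignored in this sketch.
    """
    per_frame = mtu_bytes - header_bytes
    return math.ceil(payload_bytes / per_frame)

if __name__ == "__main__":
    # Hypothetical case: 64 KB window over a 100 ms round trip.
    print(window_limited_throughput_bps(65535, 0.100))
    # Hypothetical case: 1 MB of payload at 1500 vs. 9000 byte MTU.
    print(frames_needed(1_000_000, 1500))
    print(frames_needed(1_000_000, 9000))
```

With those made-up inputs, the unscaled 64 KB window tops out at
roughly 5.2 Mbit/s across 100 ms no matter how fat the pipe is, and the
same megabyte takes 685 frames at a 1500 MTU versus 112 at 9000.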

> As network operators we have to get out of the mind set that "packet
> drops are bad"

Well, that's easier said than done and arguably not realistic. I got
started in this business when 1-3% packet loss was normal and
expected. As the network has grown, the expectation for 0% loss in all
cases has grown with it. You have to remember that in the early days,
the network itself was expected to guarantee data delivery. (e.g. X.25)
Then the network improved and that burden was cast on the host
devices. Well, technology has continued to improve to the point where
you literally can expect 0% packet loss in relatively confined
areas. (Say, Provider X in Los Angeles to user Y in San Jose.) But as
you go further afield, such as from LAX to Israel, expectations have
to change. Today, that mindset is not always there.

As you allude to, this has also bred applications that are almost
entirely intolerant of packet loss and extremely sensitive to
jitter. (VoIP people, are you listening?) Real time gaming is a great
example. Back in the days when 99% of us were on modems, any loss or
varying delay between the client and the user made the difference
between an enjoyable session and nothing but frustration and it was
often hit and miss. A congested or dirty link in the middle of the
path destroyed the user's experience. This is further compounded by
the increasingly international participation in some of these
services, which means that 24x7 requirements render the customers and
their users more and more sensitive to maintenance activities. (There
can be areas where there is no "after hours" in which to do this
stuff.) Add to this that, as media companies expand their use of the
network, customers have forced providers to write performance-based
metrics into their SLAs that, rather than simple uptime, now require
often arbitrary guarantees of latency and data loss, and you've got a
real problem for operations and engineering.

Techniques that can help improve network integrity are worth
exploring. The difficulty is in proving these techniques under a wide
array of circumstances, getting them properly adopted, and not having
vendors or customers arbitrarily break them because of improper
understanding, poor implementations, or bad configs. (PMTUD, anyone?)

Going forward, this sort of thing is going to be more and more
important and harder and harder to get right. I'm actually glad to see
this particular thread appear and will be quite interested in what
people have to say on the matter.

-Wayne

---
Wayne Bouchard
web at typo.org
Network Dude
http://www.typo.org/~web/