buffer bloat and packet pacing

Saku Ytti saku at ytti.fi
Thu Sep 3 10:56:50 UTC 2015


Hey,

In the past few years there's been a lot of talk about reducing buffer
depths, and many seem to think vendors are throwing memory onto the
chips just for the fun of it.

Let's look at a particularly pathological case. Assume the sender is a
CDN with a 40GE-connected server and the receiver is 10GE connected,
with 300ms of latency between them.

10Gbps * 300ms = 375MB is the window size the client needs to be able
to fill its pipe from the 40GE sender.
However, TCP does not normally pace packets inside the window, so the
40GE server will flood the window as fast as it can instead of limiting
itself to 10Gbps; optimally it'll send at line rate. The receiving side
can only serialise it out at 10GE, so the majority of that 375MB ends
up sitting in switch/router buffers on the sender side (rough numbers
sketched below).
If we can't buffer that, then the receiver cannot receive at 10Gbps,
as the window size will shrink. Is this a problem? What rate should you
expect to get, and at what latency? Contracts with customers usually
don't put any limits on the bandwidth achievable at a given latency,
and writing such limits down might make you appear inferior to your
competitors.
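
To put rough numbers on the above (just a back-of-the-envelope sketch
in Python; the rates and RTT are the example figures from this mail):

  # 10GE receiver, 40GE sender, 300ms RTT, full window sent as one burst.
  receiver_rate_bps = 10e9   # receiver bottleneck
  sender_rate_bps = 40e9     # sender line rate
  rtt_s = 0.300

  # Window needed to keep the 10GE pipe full over 300ms:
  bdp_bytes = receiver_rate_bps * rtt_s / 8
  print("needed window: %.0f MB" % (bdp_bytes / 1e6))          # ~375 MB

  # If the whole window is blasted out at 40Gbps, the 10GE bottleneck only
  # drains a quarter of it during the 75ms burst; the rest sits in a buffer:
  burst_time_s = bdp_bytes * 8 / sender_rate_bps
  queued_bytes = bdp_bytes - receiver_rate_bps * burst_time_s / 8
  print("peak queue without pacing: ~%.0f MB" % (queued_bytes / 1e6))  # ~281 MB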

Perhaps this is an unrealistic case; however, if you run the numbers
on much less pathological cases, you'll still end up with much larger
buffer needs than a large number of switch chips out there can satisfy.

Some new ones, like the JNPR QFX10k and Broadcom Jericho, come with
much larger buffers than their predecessors and should be able to deal
with what I hope are most practical cases.

Linux these days actually does have a bandwidth estimator for TCP
sessions, but by default it isn't used for anything; it's just exposed
for other layers to consume so they can do something about it. And I
believe that with 'tc' you can use it to pace packets inside a window
(a minimal sketch follows below).
QUIC and MinimaLT, AFAIK, do bandwidth estimation and packet pacing by default.
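
For what it's worth, on a Linux sender with a recent-ish kernel you can
already get this kind of pacing with the 'fq' qdisc plus the
SO_MAX_PACING_RATE socket option. A minimal sketch (my own, not anything
a CDN has confirmed doing; the 10Gbps figure is just the example
receiver rate from above):

  # Requires the fq qdisc on the egress interface, e.g.
  #   tc qdisc replace dev eth0 root fq
  # and kernel >= 3.13 for the socket option.
  import socket

  SO_MAX_PACING_RATE = 47           # Linux constant; value is bytes/sec
  pacing_rate = int(10e9 / 8)       # cap the flow at ~10Gbps

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, pacing_rate)
  # connect() / send as usual; the kernel releases packets no faster than
  # pacing_rate, so the window no longer hits the bottleneck as one burst.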

In a perfect world, we'd be done now. The receiver-side switch could do
with very small buffers; a few packets should suffice. However, if the
network itself is congested, the bandwidth estimate keeps sinking,
these well-behaved streams lose out to the aggressive TCP streams, and
you'll end up with 0bps estimates.
So perhaps the bandwidth estimator should be application aware and
never report a lower estimate than what is practical for the given
application, so that it can compete fairly with aggressive streams,
up to the required rate.
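
To sketch what I mean by that floor (entirely hypothetical; the function
name and the 25Mbps figure are made up for illustration):

  # Never pace below what the application actually needs, so the flow keeps
  # competing with aggressive streams up to that rate, but no further.
  def paced_rate_bps(estimate_bps, app_floor_bps=25e6):
      return max(estimate_bps, app_floor_bps)

  # Estimate crushed to 1Mbps by aggressive cross traffic -> still pace
  # (and compete) at the 25Mbps the application requires.
  print(paced_rate_bps(1e6))   # 25000000.0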

Information I'd love to have: how large do TCP session window sizes
peak at in a real network? Some CDN must be collecting these stats, and
I'd love to see rough statistics. <1% go over 100MB? 2% between 50MB
and 100MB? ... a few coarse brackets of the distribution of window
sizes from some CDN offering content download (GGC and OpenConnect are
not interesting, as they won't send large files).
Also, are some CDNs already implementing packet pacing inside the
window? If so, how? Do they have a lower limit on it?
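
Lacking that, here's roughly how one could sample it on a Linux sender
by scraping 'ss -tin' (my own rough sketch; cwnd*mss is only a proxy
for the peak window sizes I'm asking about, and the brackets are
arbitrary):

  import re
  import subprocess
  from collections import Counter

  out = subprocess.run(["ss", "-tin"], capture_output=True, text=True).stdout

  brackets = Counter()
  for mss, cwnd in re.findall(r"\bmss:(\d+).*?\bcwnd:(\d+)", out):
      window_mb = int(mss) * int(cwnd) / 1e6
      if window_mb > 100:
          brackets["over 100MB"] += 1
      elif window_mb > 50:
          brackets["50MB-100MB"] += 1
      else:
          brackets["under 50MB"] += 1

  print(brackets)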

Some related URLs:
https://lwn.net/Articles/645115/
https://lwn.net/Articles/564978/
http://www.ietf.org/proceedings/88/slides/slides-88-iccrg-6.pdf
http://www.ietf.org/proceedings/84/slides/slides-84-iccrg-2.pdf
-- 
  ++ytti


