best current practice: buffers

Toke Høiland-Jørgensen toke at toke.dk
Sat Dec 19 15:46:26 UTC 2020


Baldur Norddahl <baldur.norddahl at gmail.com> writes:

> Hello
>
> What is the best current practice for buffer size? For customer facing
> ports, core network ports and transit links?
>
> We have a buffer problem, discovered by a customer that moved their servers
> to a cloud service some distance away. That resulted in a drastically
> reduced transfer speed between their office and the cloud service. Nothing
> much could be done since we, like so many others, have switches with
> extremely fast port speeds (48x 10G, 4x 100G) but only a tiny shared 12 MB
> buffer.
>
> Now the time has come to upgrade that hardware to something that does have
> plenty of buffer capacity, so I am planning out what the settings should be.
>
> I have read this paper
> http://web.stanford.edu/class/cs244/papers/sizing-router-buffers-redux.pdf
> which claims not much buffer is needed at all. And I think they are
> completely wrong. In the paper they assume we are trying to get as much
> throughput as possible on a congested core or transit port. But we always
> make sure those are not congested, so that misses the mark completely. For
> core ports we are concerned about microbursts, and for customer ports we do
> care about a single TCP session being able to get the max throughput. The
> paper assumes there will be a lot of TCP sessions sharing the bandwidth,
> but that is not always the case with customer ports. It might be one guy
> downloading an ISO image, and he is the only one at the office.
>
> The common wisdom is to set the buffer size to one bandwidth-delay product,
> and also that buffers bigger than this are harmful. But that raises the
> question of what distance to tune for: Amsterdam is 10 ms away; the east
> coast of the USA is 100 ms.

There are a couple of trends here to be aware of: One is that the
proliferation of CDNs and localised clouds means that RTTs for a lot of
bandwidth-heavy traffic are quite low these days. The second is that
newer TCP congestion control algorithms such as BBR make heavy use of
packet pacing, which all but eliminates the microbursts of older TCPs.
BBR will run quite happily across a shallow-buffered link. Google is
using this pretty much across all their infrastructure, so that's one
major source of traffic (YouTube, gcloud, etc.) taken care of; not sure
about other CDNs, but I do believe several of them have at least been
experimenting with it...
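
For the unfamiliar, pacing just means the sender spaces packets out at
its estimated rate instead of blasting a whole window back-to-back. A
rough illustration (Python, with made-up numbers):

    # Inter-packet gap for a paced sender: packet size over pacing rate.
    def inter_packet_gap_s(pkt_bytes, pacing_rate_bps):
        return pkt_bytes * 8 / pacing_rate_bps

    # A 100 Mbit/s flow sends a 1500-byte packet every 120 microseconds,
    # so the switch queue never has to absorb a line-rate burst.
    print(inter_packet_gap_s(1500, 100e6))   # 0.00012

With nothing arriving in bursts, there's very little for a deep buffer
to absorb in the first place.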

What this means is that *if* buffer size is the only config knob you
have to twiddle, you're likely better off erring on the side of too
small a buffer rather than too big. The bufferbloat induced by
overbuffering is going to hurt your customers more than the occasional
too-shallow buffer's slight loss of single-flow legacy TCP performance
will.
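
To put rough numbers on the BDP question, here's a back-of-the-envelope
calculation (Python, using the two RTTs mentioned above) for a single
10G port:

    # Bandwidth-delay product: the buffer needed to keep one long-lived
    # legacy TCP flow at full rate across a given RTT.
    def bdp_bytes(rate_bps, rtt_s):
        return rate_bps * rtt_s / 8

    for label, rtt in [("Amsterdam, 10 ms", 0.010),
                       ("US east coast, 100 ms", 0.100)]:
        print(f"{label}: {bdp_bytes(10e9, rtt) / 1e6:.1f} MB")

    # Amsterdam, 10 ms: 12.5 MB
    # US east coast, 100 ms: 125.0 MB

Note that even the 10 ms figure is about the size of the 12 MB shared
buffer you mention, and sizing for the 100 ms path would mean an order
of magnitude more, with all the bloat that implies for the short-RTT
traffic sharing the port.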

> Also the new hardware (Juniper ACX710) does support more than one queue per
> port. Would it be possible to have that ISO download go into a queue for
> heavy streams and allow smaller streams to skip the line, so we do not see
> the downside of a heavy buffer (buffer bloat)? It is no longer just a
> simple matter of buffer size.

You're basically describing the FQ-CoDel algorithm (RFC8290) here: it
does per-flow queueing, and combines it with AQM on each queue +
automatic no-knobs prioritisation of short (and thus often
latency-sensitive) flows. It's really the gold standard, but
unfortunately it hasn't made it into big-iron routers yet.
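
In case it helps to see the mechanism, here's a much-simplified sketch
(Python) of the DRR++ scheduling FQ-CoDel is built on; the per-queue
CoDel step is stubbed out, and real implementations handle details like
hash collisions that are omitted here:

    from collections import deque, namedtuple

    Pkt = namedtuple("Pkt", ["flow_key", "size"])  # stand-in for a packet

    NUM_QUEUES = 1024
    QUANTUM = 1514    # bytes a queue may send per scheduling round

    queues = [deque() for _ in range(NUM_QUEUES)]
    credits = [0] * NUM_QUEUES
    on_list = [False] * NUM_QUEUES
    new_flows, old_flows = deque(), deque()   # queue indices to service

    def enqueue(pkt):
        i = hash(pkt.flow_key) % NUM_QUEUES   # flow -> queue
        queues[i].append(pkt)
        if not on_list[i]:                    # a fresh flow gets to
            on_list[i] = True                 # skip the line once
            credits[i] = QUANTUM
            new_flows.append(i)

    def dequeue():
        while new_flows or old_flows:
            lst = new_flows if new_flows else old_flows
            i = lst[0]
            if credits[i] <= 0:               # quantum spent: refill,
                credits[i] += QUANTUM         # demote to the old list
                lst.popleft()
                old_flows.append(i)
            elif not queues[i]:               # queue drained
                lst.popleft()
                if lst is new_flows:
                    old_flows.append(i)       # new flows demote on empty
                else:
                    on_list[i] = False        # old flows are forgotten
            else:
                pkt = queues[i].popleft()     # (real FQ-CoDel runs the
                credits[i] -= pkt.size        #  CoDel drop check here)
                return pkt
        return None

The nice part is that the "short flows skip the line" behaviour falls
out of two lists and a hash function; there's nothing to classify or
configure.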

> What are others doing to deliver the best possible performance to
> customers with regards to buffering?

As you note yourself above, just tuning the buffer size is a very blunt
tool which can't really be made to work for all scenarios. You really
want some kind of queue management algorithm enabled, i.e., an AQM which
will start dropping packets as the queue *starts* building up, possibly
combined with flow queueing.
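
For a sense of what dropping as the queue *starts* building looks like,
here's a stripped-down sketch of the CoDel control law (RFC8289, the
AQM half of the FQ-CoDel mentioned above); the real state machine has
more corner cases than this:

    import math

    TARGET = 0.005     # tolerate up to 5 ms of standing queueing delay
    INTERVAL = 0.100   # delay must persist this long before acting

    count = 0          # drops since delay last dipped below TARGET
    next_drop = None   # when the next drop is scheduled

    def on_dequeue(sojourn_time_s, now_s):
        """Return True if this packet should be dropped. The signal is
        how long the packet *sat* in the queue, not how many bytes are
        queued, so it needs no per-link-speed tuning."""
        global count, next_drop
        if sojourn_time_s < TARGET:
            count, next_drop = 0, None       # queue is fine again
            return False
        if next_drop is None:
            next_drop = now_s + INTERVAL     # arm: allow one interval
            return False
        if now_s >= next_drop:
            count += 1                       # drop faster and faster
            next_drop = now_s + INTERVAL / math.sqrt(count)
            return True
        return False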

Looking at the data sheet for that Juniper box, I unfortunately don't
see much on offer in this space. It lists the RED AQM, which can be
made to work, but requires specific tuning to the link speed (and will
cripple the link if set wrong). Theoretically, Juniper should be able to
implement PIE (RFC8033) as a firmware update leveraging their existing
RED machinery; you could ask them for that?
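
To illustrate why RED needs that per-link tuning: the usual rule of
thumb sizes the minimum threshold for the queueing delay you'll
tolerate at line rate, with the maximum at roughly three times that. A
hypothetical back-of-the-envelope (Python, not Juniper config syntax):

    # Rule-of-thumb RED thresholds: min_th holds the target queueing
    # delay's worth of bytes at line rate; max_th is ~3x min_th.
    def red_thresholds(rate_bps, target_delay_s=0.005):
        min_th = rate_bps * target_delay_s / 8   # bytes
        return int(min_th), int(3 * min_th)

    print(red_thresholds(10e9))   # 10G port: (6250000, 18750000)
    print(red_thresholds(1e9))    #  1G port: (625000, 1875000)

Apply the 10G numbers to a 1G port and the queue grows to ten times the
intended delay before RED even starts dropping; set the 1G numbers on a
10G port and you start dropping at half a millisecond of queue. Hence
"cripple the link if set wrong".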

As for flow queueing, the data sheet does mention Weighted Fair
Queueing, and you mention you can have more than one queue per port.
Configuring this with as many queues per port as you can, using
flow-based hashing to divide up traffic between them, would not be
unreasonable. It won't have the smart prioritisation of FQ-CoDel, but
even a standard round-robin scheme can help separate out elephant flows
from the rest of the traffic and alleviate the bad effects of
bufferbloat.
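
As a tiny illustration of the hashing part (Python; 8 queues per port
is an assumption here, use however many the hardware actually gives
you):

    import zlib

    NUM_HW_QUEUES = 8   # assumed queue count per port

    def queue_for(proto, src, sport, dst, dport):
        """Map a flow's 5-tuple onto a hardware queue."""
        key = f"{proto},{src}:{sport}->{dst}:{dport}".encode()
        return zlib.crc32(key) % NUM_HW_QUEUES

    # The ISO download and the office's interactive flows will usually
    # land in different queues, so the elephant can only bloat its own.
    print(queue_for("tcp", "192.0.2.10", 49152, "198.51.100.7", 443))
    print(queue_for("tcp", "192.0.2.11", 52000, "203.0.113.5", 22))

With the scheduler round-robining between queues, the interactive
traffic no longer waits behind the elephant's entire backlog, only
behind whatever each queue is allowed to send per round.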

-Toke

