923Mbits/s across the ocean
Richard A Steenbergen
ras at e-gerbil.net
Sun Mar 9 18:52:54 UTC 2003
On Sun, Mar 09, 2003 at 08:29:16AM -0800, Cottrell, Les wrote:
>
> > Strange. Why is that? RFC 1323 is widely implemented, although not
> > widely enabled (and for good reason: the timestamp option kills header
> > compression so it's bad for lower-bandwidth connections). My guess is
> > that the OS can't afford to throw around MB+ size buffers for every TCP
> > session so the default buffers (which limit the windows that can be
> > used) are relatively small and application programmers don't override
> > the default.
>
> Also as the OS's are shipped they come with small default maximum window
> sizes (I think Linux is typically 64KB and Solaris is 8K), and so one
> has to get the sysadmin with root privs to change this.
This is related to how the kernel/user model works in relation to TCP.
TCP itself happens in the kernel, but the data comes from userland through
the socket interface, so there is a "socket buffer" in the kernel which
holds data coming from and going to the application. TCP cannot release
data from its buffer until it has been acknowledged by the other side,
in case it needs to retransmit. This means TCP can have at most one
window of data in flight per round trip, so performance is limited by
the smaller of either the congestion window (determined by measuring
conditions along the path), or the send/recv window (determined by local
system resources).
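To put rough numbers on that limit, throughput can be bounded as window
size divided by round-trip time. The sketch below is only that arithmetic.
The 170 ms round-trip time is an assumed figure for a transoceanic path,
not a measurement from this thread, and the function names are made up
for illustration.

```python
def max_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on TCP throughput: one window of data per round trip."""
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(target_bps, rtt_seconds):
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return target_bps * rtt_seconds / 8


ASSUMED_RTT = 0.170  # seconds; hypothetical transoceanic round trip

if __name__ == "__main__":
    # What a 64KB window can carry over the assumed path.
    limit = max_throughput_bps(64 * 1024, ASSUMED_RTT)
    print(f"64KB window: {limit / 1e6:.2f} Mbit/s at most")

    # Window needed to sustain 923 Mbit/s over the same path.
    needed = window_needed_bytes(923e6, ASSUMED_RTT)
    print(f"923 Mbit/s needs about {needed / 2**20:.1f} MiB in flight")
```

Under that assumed RTT a 64KB window tops out at about 3 Mbit/s, while
filling the path at 923 Mbit/s takes a window on the order of 19 MB.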
However, you can't just blindly turn up your socket buffers to large
values and expect good results.
On the send side, the transmitting application is guaranteed to fill
the buffers immediately. (Ever seen a huge jump in speed at the beginning
of a transfer? That is the local buffer being filled, and the application
has no way to know whether this data is going out to the wire or just to
the kernel.) Then the network must drain the packets onto the wire,
sometimes very slowly (think about a dialup user downloading from your
GigE server).
Setting the socket buffers too high can potentially result in an
incredible waste of resources, and can severely limit the number of
simultaneous connections your server can support. This is precisely why
OS's cannot ship with huge default values, because what may be appropriate
for your one-user GigE connected box might not be appropriate for someone
else's 100BASE-TX web server (and guess which setup has more users :P).
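A minimal sketch of both halves of that trade-off: an application asking
for a bigger send buffer on its own socket, and the worst-case kernel
memory if every connection on a busy server fills its buffers. The helper
names, the 1MB request, and the 10,000-connection figure are assumptions
for illustration. The kernel is free to clamp or adjust the requested
size, so the code reports what it actually granted.

```python
import socket


def request_send_buffer(sock, size_bytes):
    """Ask for a send buffer size and return what the kernel reports back."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size_bytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)


def worst_case_buffer_memory(connections, sndbuf_bytes, rcvbuf_bytes):
    """Kernel memory tied up if every connection fills both of its buffers."""
    return connections * (sndbuf_bytes + rcvbuf_bytes)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        granted = request_send_buffer(s, 1 << 20)  # ask for 1MB
        print("requested 1048576, kernel reports", granted)
    finally:
        s.close()

    # Hypothetical busy server: 10,000 connections with 1MB buffers each way.
    total = worst_case_buffer_memory(10_000, 1 << 20, 1 << 20)
    print(f"worst case: {total / 2**30:.1f} GiB of socket buffers")
```

What is harmless on a one-user box becomes roughly 20 GB of potential
kernel memory on the hypothetical server.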
On the receive side, the socket buffers must be large enough to
accommodate all the data received between application read()'s, as well
as making sure they have enough available space to hold future data in the
event of a "gap" due to loss and the need for retransmission. However, if
the application fails to read() the data from the socket buffer, it will
sit there forever. Large socket buffers also open the server up to
malicious attack causing non-swappable kernel memory to consume all
available resources, either locally (by someone dumping data over lots of
connections, or running an application which intentionally fails to read
data from the socket buffer), or remotely (think someone opening a bunch
of rate limited connections from your "high speed server"). It can even be
unintentional, but just as bad (think a million confused dialup users
accidentally clicking on your high speed video stream).
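The "sits there forever" behaviour is easy to see locally. The sketch
below writes to a peer that never calls read() and counts how many bytes
the kernel accepts before the writer would block. It uses a local
socketpair() rather than a real TCP connection, and the function name and
the 64MB safety limit are assumptions for illustration. The number it
returns depends entirely on the local buffer settings.

```python
import socket


def bytes_queued_before_blocking(chunk_size=4096, safety_limit=64 * 1024 * 1024):
    """Write to a peer that never reads; return bytes accepted by the kernel."""
    writer, silent_reader = socket.socketpair()
    try:
        writer.setblocking(False)
        payload = b"x" * chunk_size
        queued = 0
        while queued < safety_limit:
            try:
                queued += writer.send(payload)
            except BlockingIOError:
                break  # buffers are full and nothing is draining them
        return queued
    finally:
        writer.close()
        silent_reader.close()


if __name__ == "__main__":
    print("kernel held", bytes_queued_before_blocking(), "unread bytes")
```

Every one of those bytes is kernel memory held on behalf of an application
that is not reading. Multiply by the buffer size and the number of
connections to see why large defaults are risky.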
Some of this can be worked around by implementing what is called
auto-tuning socket buffers. In this case, the kernel would limit the
amount of data allowed into the buffer, by looking at the TCP session's
observed congestion window. This allows you to define large send buffers
without applications connected to slow receivers sucking up unnecessary
resources. PSC has had example implementations for quite a while, and
recently FreeBSD even added this (sysctl net.inet.tcp.inflight_enable=1 as
of 4.7). Unfortunately, there isn't much you can do to prevent malicious
receive-side buffer attacks, short of limiting the overall max buffer
(FreeBSD implements this as an rlimit "sbsize").
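As a toy model of the auto-tuning idea (not the PSC or FreeBSD
implementation, whose internals are not described here), the amount of
data admitted into a send buffer can be scaled to the observed congestion
window and clamped between a floor and the configured maximum. The
function name, the 4KB floor, and the headroom factor of 2 are all
assumptions for illustration.

```python
def autotuned_send_limit(cwnd_bytes, max_sndbuf_bytes,
                         min_sndbuf_bytes=4096, headroom=2):
    """Bytes to admit into the send buffer for one connection.

    Scales with the congestion window so a slow receiver only ties up a
    small buffer, while a fast path can grow up to the configured maximum.
    """
    wanted = headroom * cwnd_bytes
    return max(min_sndbuf_bytes, min(max_sndbuf_bytes, wanted))


if __name__ == "__main__":
    MAX_SNDBUF = 16 * 1024 * 1024  # hypothetical configured maximum

    # Slow receiver: small congestion window, small buffer.
    print(autotuned_send_limit(4096, MAX_SNDBUF))

    # Fast long-haul path: large congestion window, capped at the maximum.
    print(autotuned_send_limit(20_000_000, MAX_SNDBUF))
```

The point of the clamp is that the large maximum only costs memory on the
connections that can actually use it.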
Of course, you need a few other things before you can start getting into
end to end gigabit speeds. If you're transferring a file, you probably
don't want to be reading it from disk via the kernel just to send it back
to the kernel again for transmission, so various things like sendfile()
and zero copy implementations help get you the performance you need
locally. Jumbo frames help too, but their real benefit is not the
simplistic "hey look there's 1/3rd the number of frames/sec" view that many
people see. The good stuff comes from techniques like page flipping, where
the NIC DMA's data into a memory page which can be flipped through the
system straight to the application, without copying it throughout. Some
day TCP may just be implemented on the NIC itself, with ALL work
offloaded, and the system doing nothing but receiving nice page-sized
chunks of data at high rates of speed. IMHO the 1500-byte MTU of Ethernet
will still continue to prevent good end to end performance like this for a
long time to come. But alas, I digress...
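For the sendfile() point earlier in this paragraph, here is a minimal
sketch of handing a file to the kernel for transmission instead of
reading it into userland first. It uses Python's socket.sendfile(), which
calls os.sendfile() where the platform provides it and otherwise falls
back to an ordinary read/send loop. The helper name is made up, and the
demo runs over a local socketpair() as a stand-in for a real TCP
connection.

```python
import os
import socket
import tempfile


def send_file_zero_copy(sock, path):
    """Send a whole file over a connected stream socket; return bytes sent."""
    with open(path, "rb") as f:
        # The kernel moves file data to the socket where os.sendfile() is
        # available; otherwise this degrades to a normal copy loop.
        return sock.sendfile(f)


if __name__ == "__main__":
    sender, receiver = socket.socketpair()
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello" * 100)
    try:
        sent = send_file_zero_copy(sender, tmp.name)
        received = receiver.recv(4096)
        print("sent", sent, "bytes, first read returned", len(received))
    finally:
        sender.close()
        receiver.close()
        os.unlink(tmp.name)
```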
--
Richard A Steenbergen <ras at e-gerbil.net> http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)