Thoughts on increasing MTUs on the internet

Simon Leinen simon at limmat.switch.ch
Fri Apr 13 23:39:22 UTC 2007


Ah, large MTUs.  Like many other "academic" backbones, we implemented
large MTUs (9192 bytes) on our backbone and 9000 bytes on some hosts.
See [1] for an illustration.  Here are *my* current thoughts on
increasing the Internet MTU beyond its current value of 1500 bytes.
(On the topic, see also [2] - a wiki page which is actually served
from a host with a 9000-byte MTU :-)

Benefits of >1500-byte MTUs:

Several benefits of moving to larger MTUs, say in the 9000-byte range,
were cited.  I don't find them too convincing anymore.

1. Fewer packets reduce work for routers and hosts.

   Routers:
 
   Most backbones seem to size their routers to sustain (near-)
   line-rate traffic even with small (64-byte) packets.  That's a good
   thing, because if networks were dimensioned to just work at average
   packet sizes, they would be pretty easy to DoS by sending floods of
   small packets.  So I don't see how raising the MTU helps much
   unless you also raise the minimum packet size - which might be
   interesting, but I haven't heard anybody suggest that.

   This should be true for routers and middleboxes in general,
   although there are certainly many places (especially firewalls)
   where pps limitations ARE an issue.  But again, raising the MTU
   doesn't help if you're worried about the worst case.  And I would
   like to see examples where it would help significantly even in the
   normal case.  In our network it certainly doesn't - we have Mpps to
   spare.
 
   Hosts:
 
   For hosts, filling each new generation of high-speed link with
   1500-byte packets has repeatedly been a struggle (Fast Ethernet in
   the nineties, GigE 4-5 years ago, 10GE today), due to the high rate
   of interrupts/context switches and internal bus crossings.
   Fortunately, tricks like polling instead of interrupts (Saku Ytti
   mentioned this), Interrupt Coalescence and Large-Send Offload have
   become commonplace these days.  These give most of the end-system
   performance benefits of large packets without requiring any support
   from the network.
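
   To put rough numbers on both points, here is a back-of-the-envelope
   sketch (my own arithmetic, in Python; the 10GE link speed and the
   standard Ethernet framing overhead are assumptions, not measurements
   from our network):

     # Packets per second needed to fill a 10 Gbit/s Ethernet link for
     # various frame sizes.  Each frame costs an extra 20 bytes on the
     # wire (preamble, start-of-frame delimiter, inter-frame gap).
     LINK_BPS = 10e9
     WIRE_OVERHEAD = 7 + 1 + 12   # bytes outside the frame itself

     for label, frame in (("64-byte minimum", 64),
                          ("1500-byte MTU", 1518),   # 1500 + 14 hdr + 4 FCS
                          ("9000-byte MTU", 9018)):
         pps = LINK_BPS / ((frame + WIRE_OVERHEAD) * 8)
         print(f"{label:>16}: {pps / 1e6:5.2f} Mpps")

     #  64-byte minimum: 14.88 Mpps  <- what forwarding engines are sized for
     #    1500-byte MTU:  0.81 Mpps
     #    9000-byte MTU:  0.14 Mpps

   The worst case (14.88 Mpps) is determined entirely by the minimum
   frame size; raising the MTU only lowers the packet rate of
   well-behaved bulk transfers, which is the easy part anyway.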

2. Fewer bytes (saved header overhead) free up bandwidth.

   Bulk TCP over Ethernet with a 1500-byte MTU is "only" about 94.2%
   efficient, while with a 9000-byte MTU it would be roughly 99%
   efficient.  While an improvement would certainly be nice, 94%
   already seems "good enough" to me.  (I'm ignoring the byte savings
   from fewer ACKs.  On the other hand, not all packets can grow
   sixfold - some transfers are small.)
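
   Spelled out, assuming 52 bytes of IP + TCP headers (including the
   timestamp option) and 38 bytes of Ethernet framing overhead - which
   is roughly how the 94.2% figure above comes about:

     # Payload efficiency of bulk TCP over Ethernet for two MTUs.
     ETH_FRAMING = 14 + 4 + 8 + 12      # header, FCS, preamble/SFD, gap
     IP_TCP_HEADERS = 20 + 20 + 12      # IP + TCP + timestamp option

     for mtu in (1500, 9000):
         payload = mtu - IP_TCP_HEADERS
         on_wire = mtu + ETH_FRAMING
         print(f"MTU {mtu}: {payload / on_wire:.1%} efficient")

     # MTU 1500: 94.1% efficient
     # MTU 9000: 99.0% efficient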

3. TCP runs faster.

   This boils down to two aspects (besides the effects of (1) and (2)):

   a) TCP reaches its "cruising speed" faster.

      Especially with LFNs (Long Fat Networks, i.e. paths with a large
      bandwidth*RTT product), it can take quite a long time until TCP
      slow-start has increased the window so that the maximum
      achievable rate is reached.  Since the window increase happens
      in units of MSS (~MTU), TCPs with larger packets reach this
      point proportionally faster.

      This is significant, but there are alternative proposals to
      solve this issue of slow ramp-up, for example HighSpeed TCP [3].

   b) You get a larger share of a congested link.

      I think this is true when a TCP with large packets shares a
      congested link with TCPs with small packets, and the packet loss
      probability isn't proportional to packet size.  In fact the
      large-packet connection can get a MUCH larger share (sixfold for
      9K vs. 1500) if the loss probability is the same for everybody
      (which it often will be, approximately) - see the sketch below.
      Some people consider this a fairness issue; others think it's a
      good incentive for people to upgrade their MTUs.
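
      Here is a rough quantitative sketch of that effect (my own
      illustration, using the well-known Mathis et al. approximation
      rate ~ (MSS/RTT) * 1.22/sqrt(p); the RTT and loss probability
      below are made-up example values, not measurements):

        import math

        RTT = 0.150   # seconds - a longish intercontinental path
        p = 1e-4      # packet loss probability, the same for both flows

        for mtu in (1500, 9000):
            mss = mtu - 40                 # IP + TCP headers, no options
            rate = (mss * 8 / RTT) * 1.22 / math.sqrt(p)
            print(f"MTU {mtu}: ~{rate / 1e6:.0f} Mbit/s")

        # MTU 1500: ~9 Mbit/s
        # MTU 9000: ~58 Mbit/s

      With the same loss probability for both flows, the achievable
      rate is simply proportional to the MSS - hence the roughly
      sixfold share of 9000-byte packets over 1500-byte ones.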

About the issues:

* Current Path MTU Discovery doesn't work reliably.

  Path MTU Discovery as specified in RFC 1191/1981 relies on ICMP
  messages to discover when a smaller MTU has to be used.  When these
  ICMP messages fail to arrive (or be sent), the sender will happily
  continue to send too-large packets into the blackhole.  This problem
  is very real.  As an experiment, try configuring an MTU < 1500 on a
  backbone link which has Ethernet-connected customers behind it.
  I bet that you'll receive LOUD complaints before long.

  Some other people mentioned that Path MTU Discovery has been refined
  with "blackhole detection" methods in some systems.  This is widely
  implemented, but usually not enabled by default (although it
  probably could be turned on with a "Service Pack").

  Note that a new Path MTU Discovery proposal was just published as
  RFC 4821 [4].  This is also supposed to solve the problem of relying
  on ICMP messages.

  Please, let's wait for these more robust PMTUD mechanisms to be
  universally deployed before trying to increase the Internet MTU.
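
  For anyone who wants to watch classic, ICMP-driven PMTUD (and its
  failure mode) from an application, here is a minimal Linux-only
  sketch in Python.  The destination name is just a placeholder, and
  the numeric option values are copied from <linux/in.h>, since
  Python's socket module does not export them all:

    import errno, socket

    IP_MTU_DISCOVER = 10   # from <linux/in.h>
    IP_PMTUDISC_DO = 2     # always set DF; rely on RFC 1191 PMTUD
    IP_MTU = 14            # getsockopt: path MTU currently cached

    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
    s.connect(("destination.example.net", 9))   # placeholder target

    payload = b"x" * 8972   # fills a 9000-byte IP packet (20 IP + 8 UDP)
    for _ in range(3):
        try:
            s.send(payload)
        except OSError as e:
            if e.errno != errno.EMSGSIZE:
                raise
            # The kernel learned - from an ICMP "fragmentation needed"
            # or the local interface MTU - that the datagram is too big.
            print("path MTU:", s.getsockopt(socket.IPPROTO_IP, IP_MTU))
            break

    # If the ICMP messages are filtered somewhere along the path,
    # EMSGSIZE never shows up here and the oversized packets silently
    # disappear - exactly the blackhole described above.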

* IP assumes a consistent MTU within a logical subnet.

  This seems to be a pretty fundamental assumption, and Iljitsch's
  original mail suggests that we "fix" this.  Umm, ok, I hope we don't
  miss anything important that makes use of this assumption.

  Seriously, I think it's illusory to try to change this for general
  networks, in particular large LANs.  It might work for
  exchange points or other controlled cases where the set of protocols
  is fairly well defined, but then exchange points have other options
  such as separate "jumbo" VLANs.

  For campus/datacenter networks, I agree that the consistent-MTU
  requirement is a big problem for deploying larger MTUs.  This is
  true within my organization - most servers that could use larger
  MTUs (NNTP servers, for example) live on the same subnet as servers
  that nobody will ever bother to upgrade.  The obvious solution is to
  build smaller subnets - for our test servers I usually configure a
  separate point-to-point subnet for each of their Ethernet interfaces
  (I don't trust this bridging magic anyway :-).
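
  Incidentally, if you want to script a check of which hosts on a
  subnet agree on their MTU, the locally configured MTU of an
  interface can be read programmatically.  A tiny Linux-only sketch in
  Python (the ioctl number is taken from <linux/sockios.h>, and "eth0"
  is just a placeholder interface name):

    import fcntl, socket, struct

    SIOCGIFMTU = 0x8921                    # from <linux/sockios.h>

    def if_mtu(ifname):
        """Return the configured MTU of a local interface."""
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            ifreq = struct.pack("16s16x", ifname.encode())
            res = fcntl.ioctl(s.fileno(), SIOCGIFMTU, ifreq)
            return struct.unpack("i", res[16:20])[0]
        finally:
            s.close()

    print(if_mtu("eth0"))                  # e.g. 1500 or 9000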

* Most edges will not upgrade anyway.

  On the slow edges of the network (residual modem users, exotic
  places, cellular data users etc.), people will NOT upgrade their MTU
  to 9000 bytes, because a single such packet would totally kill the
  VoIP experience (see the back-of-the-envelope sketch at the end of
  this point).  For medium-fast networks, large MTUs don't cause
  problems, but they don't help either.  So only a few super-fast
  edges have an incentive to do this at all.

  For core networks that do support large MTUs (like ours), this is
  frustrating because all our routers now probably carve their
  internal buffers for 9000-byte packets that never arrive.  Maybe
  we're wasting lots of expensive linecard memory this way?
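
  As for the VoIP point above, here is the serialization delay of a
  single 9000-byte packet at a few edge speeds (my own
  back-of-the-envelope arithmetic; the link speeds are illustrative):

    # Time a single 9000-byte packet occupies the wire.
    for name, bps in (("56k modem", 56e3), ("2 Mbit/s DSL", 2e6),
                      ("100 Mbit/s FE", 100e6), ("10GE", 10e9)):
        delay_ms = 9000 * 8 / bps * 1e3
        print(f"{name:>14}: {delay_ms:10.3f} ms")

    # 56k modem    : ~1.3 s   <- goodbye, VoIP
    # 2 Mbit/s DSL : ~36 ms
    # 100 Mbit/s FE: ~0.7 ms
    # 10GE         : ~7 microseconds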

* Chicken/egg

  As long as only a small minority of hosts supports >1500-byte MTUs,
  there is no incentive for anyone important to start supporting them.
  A public server supporting 9000-byte MTUs will be frustrated when it
  tries to use them.  The overhead (from attempted large packets that
  don't make it) and potential trouble will just not be worth it.
  This is a little similar to IPv6.

So I don't see large MTUs coming to the Internet at large soon.  They
probably make sense in special cases, maybe for "land-speed records"
and dumb high-speed video equipment, or for server-to-server stuff
such as USENET news.

(And if anybody out there manages to access [2] or http://ndt.switch.ch/
with 9000-byte MTUs, I'd like to hear about it :-)
-- 
Simon.

[1] Here are a few tracepaths (more or less traceroute with integrated
    PMTU discovery) from a host on our network in Switzerland.
    9000-byte packets make it across our national backbone (SWITCH),
    the European academic backbone (GEANT2), Abilene and CENIC in the
    US, as well as through AARnet in Australia (even over IPv6).  But
    the link from the last wide-area backbone to the receiving site
    inevitably has a 1500-byte MTU ("pmtu 1500").

: leinen at mamp1[leinen]; tracepath www.caida.org
 1:  mamp1-eth2.switch.ch (130.59.35.78)                    0.110ms pmtu 9000
 1:  swiMA1-G2-6.switch.ch (130.59.35.77)                   1.029ms 
 2:  swiMA2-G2-5.switch.ch (130.59.36.194)                  1.141ms 
 3:  swiEL2-10GE-1-4.switch.ch (130.59.37.77)               4.127ms 
 4:  swiCE3-10GE-1-3.switch.ch (130.59.37.65)               4.726ms 
 5:  swiCE2-10GE-1-4.switch.ch (130.59.36.209)              4.901ms 
 6:  switch.rt1.gen.ch.geant2.net (62.40.124.21)          asymm  7   4.429ms 
 7:  so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22)        asymm  8  12.551ms 
 8:  abilene-wash-gw.rt1.fra.de.geant2.net (62.40.125.18) asymm  9 105.099ms 
 9:  64.57.28.12 (64.57.28.12)                            asymm 10 121.619ms 
10:  kscyng-iplsng.abilene.ucaid.edu (198.32.8.81)        asymm 11 153.796ms 
11:  dnvrng-kscyng.abilene.ucaid.edu (198.32.8.13)        asymm 12 158.520ms 
12:  snvang-dnvrng.abilene.ucaid.edu (198.32.8.1)         asymm 13 180.784ms 
13:  losang-snvang.abilene.ucaid.edu (198.32.8.94)        asymm 14 177.487ms 
14:  hpr-lax-gsr1--abilene-LA-10ge.cenic.net (137.164.25.2) asymm 20 179.106ms 
15:  riv-hpr--lax-hpr-10ge.cenic.net (137.164.25.5)       asymm 21 185.183ms 
16:  hpr-sdsc-sdsc2--riv-hpr-ge.cenic.net (137.164.27.54) asymm 18 186.368ms 
17:  hpr-sdsc-sdsc2--riv-hpr-ge.cenic.net (137.164.27.54) asymm 18 185.861ms pmtu 1500
18:  cider.caida.org (192.172.226.123)                    asymm 19 186.264ms reached
     Resume: pmtu 1500 hops 18 back 19 
: leinen at mamp1[leinen]; tracepath www.aarnet.edu.au
 1:  mamp1-eth2.switch.ch (130.59.35.78)                    0.095ms pmtu 9000
 1:  swiMA1-G2-6.switch.ch (130.59.35.77)                   1.024ms 
 2:  swiMA2-G2-5.switch.ch (130.59.36.194)                  1.115ms 
 3:  swiEL2-10GE-1-4.switch.ch (130.59.37.77)               3.989ms 
 4:  swiCE3-10GE-1-3.switch.ch (130.59.37.65)               4.731ms 
 5:  swiCE2-10GE-1-4.switch.ch (130.59.36.209)              4.771ms 
 6:  switch.rt1.gen.ch.geant2.net (62.40.124.21)          asymm  7   4.424ms 
 7:  so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22)        asymm  8  12.536ms 
 8:  ge-3-3-0.bb1.a.fra.aarnet.net.au (202.158.204.249)   asymm  9  13.207ms 
 9:  so-0-1-0.bb1.a.sin.aarnet.net.au (202.158.194.145)   asymm 10 217.846ms 
10:  so-3-3-0.bb1.a.per.aarnet.net.au (202.158.194.129)   asymm 11 275.651ms 
11:  so-0-1-0.bb1.a.adl.aarnet.net.au (202.158.194.6)     asymm 12 293.854ms 
12:  so-0-1-0.bb1.a.adl.aarnet.net.au (202.158.194.6)     297.989ms pmtu 1500
13:  tiny-teddy.aarnet.edu.au (203.21.37.30)              asymm 12 297.462ms reached
     Resume: pmtu 1500 hops 13 back 12 
: leinen at mamp1[leinen]; tracepath6 www.aarnet.edu.au
 1?: [LOCALHOST]                      pmtu 9000
 1:  swiMA1-G2-6.switch.ch                      1.328ms 
 2:  swiMA2-G2-5.switch.ch                      1.703ms 
 3:  swiEL2-10GE-1-4.switch.ch                  4.529ms 
 4:  swiCE3-10GE-1-3.switch.ch                  5.278ms 
 5:  swiCE2-10GE-1-4.switch.ch                  5.493ms 
 6:  switch.rt1.gen.ch.geant2.net             asymm  7   5. 99ms 
 7:  so-7-2-0.rt1.fra.de.geant2.net           asymm  8  13.239ms 
 8:  ge-3-3-0.bb1.a.fra.aarnet.net.au         asymm  9  13.970ms 
 9:  so-0-1-0.bb1.a.sin.aarnet.net.au         asymm 10 218.718ms 
10:  so-3-3-0.bb1.a.per.aarnet.net.au         asymm 11 267.225ms 
11:  so-0-1-0.bb1.a.adl.aarnet.net.au         asymm 12 299. 78ms 
12:  so-0-1-0.bb1.a.adl.aarnet.net.au         298.473ms pmtu 1500
12:  www.ipv6.aarnet.edu.au                   292.893ms reached
     Resume: pmtu 1500 hops 12 back 12 

[2] PERT Knowledgebase article: http://kb.pert.geant2.net/PERTKB/JumboMTU

[3] RFC 3649, HighSpeed TCP for Large Congestion Windows, S. Floyd,
    December 2003

[4] RFC 4821, Packetization Layer Path MTU Discovery, M. Mathis,
    J. Heffner, March 2007


