Thoughts on increasing MTUs on the internet

Fred Baker fred at cisco.com
Fri Apr 13 23:55:55 UTC 2007


I agree with many of your thoughts. This is essentially the same
discussion we had when upgrading from the 576-byte common MTU of the
ARPANET to the 1500-byte MTU of Ethernet-based networks. Larger MTUs
are a good thing, but they are not a panacea. The biggest value in
practice is IMHO that the end systems deal with a lower interrupt
rate when moving the same amount of data. That said, some of those
asking about larger MTUs are asking for values so large that CRC
schemes lose their value in error detection, leaving them looking at
higher-layer FEC technologies to compensate. Given that there is an
equipment cost related to larger MTUs, I believe there is such a
thing as an MTU that is impractical.
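
A quick back-of-the-envelope calculation makes the interrupt-rate point
concrete.  This is only a sketch (Python; the 10GE link speed and the
Ethernet framing overhead are assumptions for illustration):

    LINK_BPS = 10e9   # assumed 10 Gbit/s link

    def line_rate_pps(frame_bytes):
        # Packets per second needed to fill the link; the extra 20 bytes
        # cover the Ethernet preamble and inter-frame gap.
        return LINK_BPS / ((frame_bytes + 20) * 8)

    for frame in (64, 1518, 9018):  # minimum frame, 1500-byte MTU, 9000-byte MTU
        print(f"{frame:5d}-byte frames: {line_rate_pps(frame) / 1e6:5.2f} Mpps")

That works out to roughly 14.88, 0.81 and 0.14 Mpps respectively: a
9000-byte MTU cuts per-packet work about sixfold for full-sized frames,
while the 64-byte worst case is unchanged.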

1500 byte MTUs in fact work. I'm all for 9K MTUs, and would recommend  
them. I don't see the point of 65K MTUs.

On Apr 14, 2007, at 7:39 AM, Simon Leinen wrote:

>
> Ah, large MTUs.  Like many other "academic" backbones, we implemented
> large (9192 bytes) MTUs on our backbone and 9000 bytes on some hosts.
> See [1] for an illustration.  Here are *my* current thoughts on
> increasing the Internet MTU beyond its current value, 1500.  (On the
> topic, see also [2] - a wiki page which is actually served on a
> 9000-byte MTU server :-)
>
> Benefits of >1500-byte MTUs:
>
> Several benefits of moving to larger MTUs, say in the 9000-byte range,
> have been cited.  I no longer find them very convincing; a rough
> numeric sketch of points 2 and 3a follows this list.
>
> 1. Fewer packets reduce work for routers and hosts.
>
>    Routers:
>
>    Most backbones seem to size their routers to sustain (near-)
>    line-rate traffic even with small (64-byte) packets.  That's a good
>    thing, because if networks were dimensioned to just work at average
>    packet sizes, they would be pretty easy to DoS by sending floods of
>    small packets.  So I don't see how raising the MTU helps much
>    unless you also raise the minimum packet size - which might be
>    interesting, but I haven't heard anybody suggest that.
>
>    This should be true for routers and middleboxes in general,
>    although there are certainly many places (especially firewalls)
>    where pps limitations ARE an issue.  But again, raising the MTU
>    doesn't help if you're worried about the worst case.  And I would
>    like to see examples where it would help significantly even in the
>    normal case.  In our network it certainly doesn't - we have Mpps to
>    spare.
>
>    Hosts:
>
>    For hosts, filling high-speed links with a 1500-byte MTU has
>    repeatedly been difficult (with Fast Ethernet in the nineties,
>    GigE 4-5 years ago, 10GE today), due to the high rate of
>    interrupts/context switches and internal bus crossings.
>    Fortunately, tricks such as polling instead of interrupts (Saku
>    Ytti mentioned this), Interrupt Coalescence and Large-Send Offload
>    have become commonplace these days.  These give most of the
>    end-system performance benefits of large packets without requiring
>    any support from the network.
>
> 2. Fewer bytes (saved header overhead) free up bandwidth.
>
>    TCP over Ethernet with a 1500-byte MTU is "only" about 94.2%
>    efficient, while with a 9000-byte MTU it would be about 99%
>    efficient.  While an improvement would certainly be nice, 94%
>    already seems "good enough" to me.  (I'm ignoring the byte savings
>    due to fewer ACKs.  On the other hand, not all packets will be able
>    to grow sixfold - some transfers are small.)
>
> 3. TCP runs faster.
>
>    This boils down to two aspects (besides the effects of (1) and (2)):
>
>    a) TCP reaches its "cruising speed" faster.
>
>       Especially with LFNs (Long Fat Networks, i.e. paths with a large
>       bandwidth*RTT product), it can take quite a long time until TCP
>       slow-start has increased the window so that the maximum
>       achievable rate is reached.  Since the window increase happens
>       in units of MSS (~MTU), TCPs with larger packets reach this
>       point proportionally faster.
>
>       This is significant, but there are alternative proposals to
>       solve this issue of slow ramp-up, for example HighSpeed TCP [3].
>
>    b) You get a larger share of a congested link.
>
>       I think this is true when a TCP-with-large-packets shares a
>       congested link with TCPs-with-small-packets, and the packet loss
>       probability isn't proportional to the size of the packet.  In
>       fact the large-packet connection can get a MUCH larger share
>       (sixfold for 9K vs. 1500) if the loss probability is the same
>       for everybody (which it often will be, approximately).  Some
>       people consider this a fairness issue; others think it's a good
>       incentive for people to upgrade their MTUs.
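>
> As a rough check of the arithmetic behind points 2 and 3a, here is a
> small sketch in Python.  The header sizes, the 1 Gbit/s x 100 ms
> example path and the idealized window growth are assumptions for
> illustration, not measurements:
>
>   import math
>
>   WIRE_OVERHEAD = 38    # Ethernet preamble 8 + header 14 + FCS 4 + gap 12
>   TCPIP_HEADERS = 52    # IPv4 20 + TCP 20 + timestamp option 12 (assumed)
>   BDP = int(1e9 * 0.1 / 8)   # example LFN: 1 Gbit/s x 100 ms = 12.5 MB
>
>   def efficiency(mtu):
>       # Fraction of the bits on the wire that are TCP payload.
>       return (mtu - TCPIP_HEADERS) / (mtu + WIRE_OVERHEAD)
>
>   def slow_start_rtts(mss):
>       # RTTs for the window to double from one MSS up to the path BDP.
>       return math.ceil(math.log2(BDP / mss))
>
>   def recovery_rtts(mss):
>       # RTTs to climb back from BDP/2 to BDP in congestion avoidance,
>       # growing one MSS per RTT; this is where the MSS matters most.
>       return math.ceil(BDP / (2 * mss))
>
>   for mtu in (1500, 9000):
>       mss = mtu - TCPIP_HEADERS
>       print(f"MTU {mtu}: {efficiency(mtu):.1%} efficient, "
>             f"{slow_start_rtts(mss)} RTTs of slow-start, "
>             f"{recovery_rtts(mss)} RTTs to regain full speed after a loss")
>
> On these assumptions the efficiency goes from about 94% to about 99%,
> slow-start only gets a few RTTs shorter (14 vs 11 here), and the really
> large factor is the linear recovery after a loss (roughly 4300 vs 700
> RTTs); that is where growth in units of MSS matters.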
>
> About the issues:
>
> * Current Path MTU Discovery doesn't work reliably.
>
>   Path MTU Discovery as specified in RFC 1191/1981 relies on ICMP
>   messages to discover when a smaller MTU has to be used.  When these
>   ICMP messages fail to arrive (or be sent), the sender will happily
>   continue to send too-large packets into the blackhole.  This problem
>   is very real.  As an experiment, try configuring an MTU < 1500 on a
>   backbone link which has Ethernet-connected customers behind it.
>   I bet that you'll receive LOUD complaints before long.
>
>   Some other people mention that Path MTU Discovery has been refined
>   with "blackhole detection" methods in some systems.  This is widely
>   implemented, but not enabled by default (although it probably could
>   be turned on with a "Service Pack").
>
>   Note that a new Path MTU Discovery proposal was just published as
>   RFC 4821 [4].  It avoids relying on ICMP messages by probing from
>   the packetization layer itself; a rough sketch of that idea from an
>   application's point of view follows this list of issues.
>
>   Please, let's wait for these more robust PMTUD mechanisms to be
>   universally deployed before trying to increase the Internet MTU.
>
> * IP assumes a consistent MTU within a logical subnet.
>
>   This seems to be a pretty fundamental assumption, and Iljitsch's
>   original mail suggests that we "fix" this.  Umm, ok, I hope we don't
>   miss anything important that makes use of this assumption.
>
>   Seriously, I think it's illusory to try to change this for
>   general networks, in particular large LANs.  It might work for
>   exchange points or other controlled cases where the set of protocols
>   is fairly well defined, but then exchange points have other options
>   such as separate "jumbo" VLANs.
>
>   For campus/datacenter networks, I agree that the consistent-MTU
>   requirement is a big problem for deploying larger MTUs.  This is
>   true within my organization - most servers that could use larger
>   MTUs (NNTP servers, for example) live on the same subnet as servers
>   that will never be upgraded.  The obvious solution is to build
>   smaller subnets - for our test servers I usually configure a
>   separate point-to-point subnet for each of their Ethernet interfaces
>   (I don't trust this bridging magic anyway :-).
>
> * Most edges will not upgrade anyway.
>
>   On the slow edges of the network (residual modem users, exotic
>   places, cellular data users etc.), people will NOT upgrade their MTU
>   to 9000 bytes, because a single such packet would totally kill the
>   VoIP experience (at 56 kbit/s, one 9000-byte packet takes roughly
>   1.3 seconds to serialize).  For medium-fast networks, large MTUs
>   don't cause problems, but they don't help either.  So only a few
>   super-fast edges have an incentive to do this at all.
>
>   For the core networks that support large MTUs (like we do), this is
>   frustrating because all our routers now probably carve their
>   internal buffers for 9000-byte packets that never arrive.
>   Maybe we're wasting lots of expensive linecard memory this way?
>
> * Chicken/egg
>
>   As long as only a small minority of hosts supports >1500-byte MTUs,
>   there is no incentive for anyone important to start supporting them.
>   A public server supporting 9000-byte MTUs will be frustrated when it
>   tries to use them.  The overhead (from attempted large packets that
>   don't make it) and potential trouble will just not be worth it.
>   This is a little similar to IPv6.
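>
> As a rough illustration of what RFC 4821-style probing looks like from
> an application, here is a sketch in Python.  It is Linux-specific, the
> numeric socket-option values and the cooperating echo service are
> assumptions, and a real implementation would retry each probe before
> concluding anything.  (For TCP, Linux has a net.ipv4.tcp_mtu_probing
> sysctl that does this kind of probing in the kernel.)
>
>   import socket
>
>   # Linux values from <linux/in.h>; not every platform's socket module
>   # exports them, hence the fallbacks.
>   IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
>   IP_PMTUDISC_PROBE = getattr(socket, "IP_PMTUDISC_PROBE", 3)
>
>   def probe_path_mtu(host, port, lo=1280, hi=9000, timeout=2.0):
>       """Binary-search the largest UDP datagram a cooperating echo
>       service answers, without trusting ICMP at all."""
>       s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
>       # Set DF on outgoing probes and ignore the kernel's cached path MTU.
>       s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_PROBE)
>       s.settimeout(timeout)
>       s.connect((host, port))
>       while lo < hi:
>           mid = (lo + hi + 1) // 2
>           try:
>               s.send(b"x" * (mid - 28))   # 28 = IPv4 + UDP header bytes
>               s.recv(65535)
>               lo = mid                    # a reply came back: this size fits
>           except (socket.timeout, OSError):
>               hi = mid - 1                # silence or local error: too big
>       return lo
>
>   # Hypothetical usage against an echo service you control:
>   # print(probe_path_mtu("echo.example.net", 7))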
>
> So I don't see large MTUs coming to the Internet at large soon.  They
> probably make sense in special cases, maybe for "land-speed records"
> and dumb high-speed video equipment, or for server-to-server stuff
> such as USENET news.
>
> (And if anybody out there manages to access [2] or http://ndt.switch.ch/
> with 9000-byte MTUs, I'd like to hear about it :-)
> -- 
> Simon.
>
> [1] Here are a few tracepaths (more or less traceroute with integrated
>     PMTU discovery) from a host on our network in Switzerland.
>     9000-byte packets make it across our national backbone (SWITCH),
>     the European academic backbone (GEANT2), Abilene and CENIC in the
>     US, as well as through AARnet in Australia (even over IPv6).  But
>     the link from the last wide-area backbone to the receiving site
>     inevitably has a 1500-byte MTU ("pmtu 1500").
>
> : leinen at mamp1[leinen]; tracepath www.caida.org
>  1:  mamp1-eth2.switch.ch (130.59.35.78)                    0.110ms pmtu 9000
>  1:  swiMA1-G2-6.switch.ch (130.59.35.77)                   1.029ms
>  2:  swiMA2-G2-5.switch.ch (130.59.36.194)                  1.141ms
>  3:  swiEL2-10GE-1-4.switch.ch (130.59.37.77)               4.127ms
>  4:  swiCE3-10GE-1-3.switch.ch (130.59.37.65)               4.726ms
>  5:  swiCE2-10GE-1-4.switch.ch (130.59.36.209)              4.901ms
>  6:  switch.rt1.gen.ch.geant2.net (62.40.124.21)          asymm  7   4.429ms
>  7:  so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22)        asymm  8  12.551ms
>  8:  abilene-wash-gw.rt1.fra.de.geant2.net (62.40.125.18) asymm  9 105.099ms
>  9:  64.57.28.12 (64.57.28.12)                            asymm 10 121.619ms
> 10:  kscyng-iplsng.abilene.ucaid.edu (198.32.8.81)        asymm 11 153.796ms
> 11:  dnvrng-kscyng.abilene.ucaid.edu (198.32.8.13)        asymm 12 158.520ms
> 12:  snvang-dnvrng.abilene.ucaid.edu (198.32.8.1)         asymm 13 180.784ms
> 13:  losang-snvang.abilene.ucaid.edu (198.32.8.94)        asymm 14 177.487ms
> 14:  hpr-lax-gsr1--abilene-LA-10ge.cenic.net (137.164.25.2) asymm 20 179.106ms
> 15:  riv-hpr--lax-hpr-10ge.cenic.net (137.164.25.5)       asymm 21 185.183ms
> 16:  hpr-sdsc-sdsc2--riv-hpr-ge.cenic.net (137.164.27.54) asymm 18 186.368ms
> 17:  hpr-sdsc-sdsc2--riv-hpr-ge.cenic.net (137.164.27.54) asymm 18 185.861ms pmtu 1500
> 18:  cider.caida.org (192.172.226.123)                    asymm 19 186.264ms reached
>      Resume: pmtu 1500 hops 18 back 19
> : leinen at mamp1[leinen]; tracepath www.aarnet.edu.au
>  1:  mamp1-eth2.switch.ch (130.59.35.78)                    0.095ms pmtu 9000
>  1:  swiMA1-G2-6.switch.ch (130.59.35.77)                   1.024ms
>  2:  swiMA2-G2-5.switch.ch (130.59.36.194)                  1.115ms
>  3:  swiEL2-10GE-1-4.switch.ch (130.59.37.77)               3.989ms
>  4:  swiCE3-10GE-1-3.switch.ch (130.59.37.65)               4.731ms
>  5:  swiCE2-10GE-1-4.switch.ch (130.59.36.209)              4.771ms
>  6:  switch.rt1.gen.ch.geant2.net (62.40.124.21)          asymm  7   4.424ms
>  7:  so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22)        asymm  8  12.536ms
>  8:  ge-3-3-0.bb1.a.fra.aarnet.net.au (202.158.204.249)   asymm  9  13.207ms
>  9:  so-0-1-0.bb1.a.sin.aarnet.net.au (202.158.194.145)   asymm 10 217.846ms
> 10:  so-3-3-0.bb1.a.per.aarnet.net.au (202.158.194.129)   asymm 11 275.651ms
> 11:  so-0-1-0.bb1.a.adl.aarnet.net.au (202.158.194.6)     asymm 12 293.854ms
> 12:  so-0-1-0.bb1.a.adl.aarnet.net.au (202.158.194.6)     297.989ms pmtu 1500
> 13:  tiny-teddy.aarnet.edu.au (203.21.37.30)              asymm 12 297.462ms reached
>      Resume: pmtu 1500 hops 13 back 12
> : leinen at mamp1[leinen]; tracepath6 www.aarnet.edu.au
>  1?: [LOCALHOST]                      pmtu 9000
>  1:  swiMA1-G2-6.switch.ch                      1.328ms
>  2:  swiMA2-G2-5.switch.ch                      1.703ms
>  3:  swiEL2-10GE-1-4.switch.ch                  4.529ms
>  4:  swiCE3-10GE-1-3.switch.ch                  5.278ms
>  5:  swiCE2-10GE-1-4.switch.ch                  5.493ms
>  6:  switch.rt1.gen.ch.geant2.net             asymm  7   5. 99ms
>  7:  so-7-2-0.rt1.fra.de.geant2.net           asymm  8  13.239ms
>  8:  ge-3-3-0.bb1.a.fra.aarnet.net.au         asymm  9  13.970ms
>  9:  so-0-1-0.bb1.a.sin.aarnet.net.au         asymm 10 218.718ms
> 10:  so-3-3-0.bb1.a.per.aarnet.net.au         asymm 11 267.225ms
> 11:  so-0-1-0.bb1.a.adl.aarnet.net.au         asymm 12 299. 78ms
> 12:  so-0-1-0.bb1.a.adl.aarnet.net.au         298.473ms pmtu 1500
> 12:  www.ipv6.aarnet.edu.au                   292.893ms reached
>      Resume: pmtu 1500 hops 12 back 12
>
> [2] PERT Knowledgebase article: http://kb.pert.geant2.net/PERTKB/JumboMTU
>
> [3] RFC 3649, HighSpeed TCP for Large Congestion Windows, S. Floyd,
>     December 2003
>
> [4] RFC 4821, Packetization Layer Path MTU Discovery, M. Mathis,
>     J. Heffner, March 2007


