RINA - scott whaps at the nanog hornets nest :-)

Sat Nov 6 21:21:51 UTC 2010

> It's perfectly safe to have the L2 networks in the middle support the
> largest MTU values possible (other than maybe triggering an obscure
> Force10 bug or something :P), so they could roll that out today and you
> probably wouldn't notice. The real issue is with the L3 networks on
> either end of the exchange, since if the L3 routers that are trying to
> talk to each other don't agree about their MTU values precisely, packets
> are blackholed. There are no real standards for jumbo frames out there,
> every vendor (and in many cases particular type/revision of hardware
> made by that vendor) supports a slightly different size. There is also
> no negotiation protocol of any kind, so the only way to make these two
> numbers match precisely is to have the humans on both sides talk to each
> other and come up with a commonly supported value.

That is not a new problem.  That is also true today with "last mile"
links (e.g. dialup) that support <1500 byte MTU.  What is different
today is RFC 4821 PMTU discovery, which deals with the "black holes".

RFC 4821 PMTUD is that "negotiation" that is "lacking".  It is there.
It is deployed.  It actually works.  No more relying on someone sending
the ICMP packets through in order for PMTUD to work!
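
As a rough illustration of the idea behind RFC 4821 style probing (not
the RFC's exact state machine, which leaves the search strategy to the
implementation), here is a minimal sketch.  The names probe_path_mtu and
make_path are hypothetical, the binary search is a simplification, and
send_probe stands in for "send a segment of this size and see whether it
gets acknowledged".  The starting size of 1280 is just an assumed value
that is known to get through.

```python
def probe_path_mtu(send_probe, low=1280, high=9000):
    """Find the largest size for which send_probe(size) succeeds.

    Assumes `low` is already known to work (hypothetical default) and
    that success is monotonic: if a size fits, every smaller size fits.
    No ICMP feedback is needed, only "was the probe acknowledged?".
    """
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always makes progress
        if send_probe(mid):
            low = mid                 # probe got through, try larger
        else:
            high = mid - 1            # probe was lost, try smaller
    return low


def make_path(path_mtu):
    """Simulated path: silently drops anything larger than path_mtu."""
    return lambda size: size <= path_mtu


if __name__ == "__main__":
    # A 9000 byte sender talking across a path with a 1500 byte bottleneck.
    print(probe_path_mtu(make_path(1500)))
```

The point is that the sender learns the usable size from what actually
gets delivered, so a filtered ICMP message does not leave it stuck.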

> There are two things that make this practically impossible to support at
> scale, even ignoring all of the grief that comes from trying to find a
> clueful human to talk to on the other end of your connection to a third
> party (which is a huge problem in and of itself):
> 
> #1. There is currently no mechanism on any major router to set multiple
> MTU values PER NEXTHOP on a multi-point exchange, so to do jumbo frames
> over an exchange you would have to pick a single common value that
> EVERYONE can support. This also means you can't mix and match jumbo and
> non-jumbo participants over the same exchange, you essentially have to
> set up an entirely new exchange point (or vlan within the same exchange)
> dedicated to the jumbo frame support, and you still have to get a common
> value that everyone can support. Ironically many routers (many kinds of
> Cisco and Juniper routers at any rate) actually DO support per-nexthop
> MTUs in hardware, there is just no mechanism exposed to the end user to
> configure those values, let alone auto-negotiate them.

Is there any gear connected to a major IX that does NOT support large
frames?  I am not aware of any manufactured today.  Even cheap D-Link
gear supports them.  I believe you would be hard-pressed to locate gear
that doesn't support them at any major IX.  Granted, it might require
changing a global config value and rebooting for it to take effect on
some vendors' gear.

http://darkwing.uoregon.edu/~joe/jumbo-clean-gear.html

> #2. The major vendors can't even agree on how they represent MTU sizes,
> so entering the same # into routers from two different vendors can
> easily result in incompatible MTUs. For example, on Juniper when you
> type "mtu 9192", this is INCLUSIVE of the L2 header, but on Cisco the
> opposite is true. So to make a Cisco talk to a Juniper that is
> configured 9192, you would have to configure mtu 9178. Except it's not
> even that simple, because now if you start adding vlan tagging the L2
> header size is growing. If you now configure vlan tagging on the
> interface, you've got to make the Cisco side 9174 to match the Juniper's
> 9192. And if you configure flexible-vlan-tagging so you can support
> q-in-q, you've now got to configure the Cisco side for 9170.

Again, the size of the MTU on the IX port doesn't change the size of the
packets flowing through that gear.  A packet sent from an end point with
an MTU of 1500 will be unchanged by the router change.  A flow to an end
point with <1500 MTU will also be adjusted down by PMTU Discovery just
as it is now when communicating with a dialup end point that might have
<600 MTU.  The only thing that is going to change from the perspective
of the routers is the communications originated by the router, which will
basically just be the BGP session.  When the TCP session is established
for BGP, each side advertises an MSS value derived from its own MTU,
which is the largest TCP segment it can accept.  The other unit will not
send a segment larger than this even if it has a larger MTU, so the side
with the smaller MTU sets the limit.  Just because the MTU is 9000
doesn't mean the router is going to aggregate 1500 byte packets flowing
through it into 9000 byte packets; it is going to pass them through
unchanged.
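
The MSS arithmetic for a router-originated TCP session can be sketched
as follows.  This is a simplified illustration: it assumes IPv4 and TCP
headers of 20 bytes each with no options, and the function names are
made up for the example.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def advertised_mss(ip_mtu):
    """MSS a host would advertise in its SYN for a given IP (L3) MTU."""
    return ip_mtu - IPV4_HEADER - TCP_HEADER


def effective_segment_size(mtu_a, mtu_b):
    """Each side is limited by the MSS the OTHER side advertised,
    so the smaller of the two values governs the session."""
    return min(advertised_mss(mtu_a), advertised_mss(mtu_b))


if __name__ == "__main__":
    # A 9000 byte router peering with a 1500 byte router.
    print(advertised_mss(9000), advertised_mss(1500))
    print(effective_segment_size(9000, 1500))
```

So a BGP session between a 9000 byte port and a 1500 byte port simply
runs with segments sized for the 1500 byte side.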

As for the configuration differences between units, how does that change
from the way things are now?  A person configuring a Juniper for 1500
byte packets already must know the difference as that quirk of including
the headers is just as true at 1500 bytes as it is at 9000 bytes.  Does
the operator suddenly become less competent with their gear when they
use a different value?  Also, a 9000 byte MTU would be a happy value
that practically everyone supports these days, including ethernet
adaptors on host machines.
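
For reference, the header arithmetic from the quoted example can be
written down in a few lines.  The overhead figures (14, 18, and 22
bytes) are simply the ones implied by the 9192/9178/9174/9170 numbers
quoted above; the names are hypothetical and actual behavior varies by
platform and software version, so treat this as a sketch and check your
own gear.

```python
# L2 header bytes counted by an "inclusive" MTU setting, per the quoted
# example: 14 untagged, +4 for one VLAN tag, +4 more for q-in-q.
L2_OVERHEAD = {"untagged": 14, "vlan": 18, "qinq": 22}


def l3_mtu_for(l2_inclusive_mtu, encapsulation):
    """Value to configure on a box whose MTU EXCLUDES the L2 header."""
    return l2_inclusive_mtu - L2_OVERHEAD[encapsulation]


def l2_inclusive_mtu_for(l3_mtu, encapsulation):
    """Value to configure on a box whose MTU INCLUDES the L2 header."""
    return l3_mtu + L2_OVERHEAD[encapsulation]


if __name__ == "__main__":
    for encap in ("untagged", "vlan", "qinq"):
        print(encap, l3_mtu_for(9192, encap))
    # The same rule applies at 1500 as at 9000.
    print(l2_inclusive_mtu_for(1500, "untagged"),
          l2_inclusive_mtu_for(9000, "untagged"))
```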

> As an operator who DOES fully support 9k+ jumbos on every internal link
> in my network, and as many external links as I can find clueful people
> to talk to on the other end to negotiate the correct values, let me just
> tell you this is a GIANT PAIN IN THE ASS. And we're not even talking
> about making sure things actually work right for the end user. Your IGP
> may not come up at all if the MTUs are misconfigured, but EBGP certainly
> will, even if the two sides are actually off by a few bytes. The maximum
> size of a BGP message is 4096 octets, and there is no mechanism to pad a
> message and try to detect MTU incompatibility, so what will actually
> happen in real life is the end user will try to send a big jumbo frame
> through and find that some of their packets are randomly and silently
> blackholed. This would be an utter nightmare to support and diagnose.

So the router doesn't honor the MSS value of the TCP stream?  That would
seem like a bug to me.  I am not suggesting we set everything to the
maximum that it will support because that is different for practically
every vendor.  I am suggesting that we pick a different "standard" value
for the "middle" of the internet of 9000 bytes which practically
everything made these days supports.  Yes, having everyone set theirs to
different values can make for different issues, but those go away if we
just pick one value, 9000, that everyone supports for the interfaces
between networks.  (You can use a larger MTU internally if you are doing
things like tunneling, which adds additional overhead, and you want to
maintain the original 9000 byte frame end to end.)

> Realistically I don't think you'll ever see even a serious attempt at
> jumbo frame support implemented in any kind of scale until there is a
> negotiation protocol and some real standards for the mtu size that must
> be supported, which is something that no standards body (IEEE, IETF,
> etc) has seemed inclined to deal with so far. Of course all of this is
> based on the assumption that path mtu discovery will work correctly once
> the MTU values ARE correctly configured on the L3 routers, which is a
> pretty huge assumption, given all the people who stupidly filter ICMP.
> Oh and even if you solved all of those problems, I could trivially DoS
> your router with some packets that would overload your ability to
> generate ICMP Unreach Needfrag messages for PMTUD, and then all your
> jumbo frame end users going through that router would be blackholed as
> well.

The ICMP filtration issue goes away with modern PMTUD that is now
supported in Windows, Solaris, Linux, MacOS, and BSD.  That is no longer
a problem for the end points.  And I would highly recommend that anyone
operating Linux systems in production run at least 2.6.32 with
/proc/sys/net/ipv4/tcp_mtu_probing set to either 1 (blackhole recovery)
or 2 (active PMTU discovery probes) in order to avoid the PMTUD problems
we already have on the Internet.
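
A quick way to see what a Linux box is currently doing is to read the
sysctl, which lives under net/ipv4 even though the setting governs TCP
generally.  The sketch below is only an illustration: the helper names
are made up, and the mode descriptions paraphrase the kernel's
ip-sysctl documentation.

```python
from pathlib import Path

SYSCTL = Path("/proc/sys/net/ipv4/tcp_mtu_probing")

MODES = {
    0: "disabled",
    1: "enabled only after a black hole is detected",
    2: "always enabled",
}


def describe(raw_value):
    """Translate the raw sysctl contents into a readable description."""
    return MODES.get(int(raw_value.strip()), "unknown")


def current_mode(path=SYSCTL):
    """Return the description for this host, or None if the file cannot
    be read or parsed (e.g. not Linux)."""
    try:
        return describe(path.read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print(current_mode())
```

Changing the value is a matter of writing 1 or 2 to the same file as
root, or setting net.ipv4.tcp_mtu_probing in your sysctl configuration
so that it survives a reboot.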

The MTU issue between routers is only a problem for the traffic
originated and terminated between those routers.  The MSS might not be
accurate if there is a tunnel someplace between the two routers that
reduces the effective MTU between them, but that is a matter of getting
router vendors to also support RFC 4821 themselves to detect and correct
that problem.  The tools are all there.

We have already been operating for quite some time with mixed MTU and
effective MTU sizes with tunneling and various "last mile" issues.  This
adds nothing new to the mix and offers greatly improved performance in
both the transactions across the network and from the gear itself in
reduced CPU consumption to move a given amount of traffic.

See, I told you it was a hornets' nest :)