RINA - scott whaps at the nanog hornets nest :-)

George Bonser gbonser at seven.com
Sat Nov 6 17:49:19 CDT 2010

> The only thing this adds is trial-and-error probing mechanism per
> to try and recover from the infinite blackholing that would occur if
> your ICMP is blocked in classic PMTUD. If this actually happened in
> scale, it would create a performance and overhead penalty that is far
> worse than the original problem you're trying to solve.

I ran into this very problem not long ago when attempting to reach a
server for a very large network.  Our Solaris hosts had no problem
transacting with the server.  Our Linux machines did have a problem, and
the behavior looked like a typical PMTU black hole.  It turned out that
"very large network" tunneled the connection inside their network,
reducing the effective MTU of the encapsulated packets, and blocked ICMP
from inside their net to the outside.  Changing the advertised MSS of
the connection to that server to 1380 allowed it to work

( ip route add <ip address> via <gateway> dev <device> advmss 1380 )

and that verified that the problem was an MTU black hole.  A little
reading revealed why Solaris wasn't having the problem but Linux did.
Setting the Linux ip_no_pmtu_disc sysctl to 1 resulted in the Linux
behavior matching the Solaris behavior.
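
For anyone who wants to check the arithmetic behind the 1380 figure, here
is a rough sketch.  It assumes IPv4 and TCP headers of 20 bytes each with
no options; the actual encapsulation overhead inside that network is not
known to me, so the "headroom" number is only what an advertised MSS of
1380 leaves free on a 1500 byte link.  The function names are made up for
illustration.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - TCP_HEADER


def packet_size_for_mss(mss: int) -> int:
    """Size of the IPv4 packet that carries a full segment of this MSS."""
    return mss + IPV4_HEADER + TCP_HEADER


default_mss = mss_for_mtu(1500)              # what a 1500 byte MTU host advertises
clamped_packet = packet_size_for_mss(1380)   # largest packet the peer sends us
headroom = 1500 - clamped_packet             # bytes left over for encapsulation

print(default_mss, clamped_packet, headroom)
```

Since the advertised MSS limits what the far end sends to us, the server
never emits a packet larger than 1420 bytes, which leaves 80 bytes for
whatever tunnel headers get added along the way.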

> Say you have two routers talking to each other over a L2 switched
> infrastructure (i.e. an exchange point). In order for PMTUD to work
> quickly and effectively, the two routers on each end MUST agree on the
> MTU value of the link between them. If router A thinks it is 9000, and
> router B thinks it is 8000, when router A comes along and tries to send
> a 8001 byte packet it will be silently discarded, and the only way to
> recover from this is with trial-and-error probing by the endpoints
> after they detect what they believe to be MTU blackholing. This is
> little more than a desperate ghetto hack designed to save the
> connection from complete disaster.

Correct. Devices on the same VLAN will need to use the same MTU.  And
why is that a problem?  That is just as true then as it is today.
Nothing changes.  All you are doing is changing from everyone using 1500
to everyone using 9000 on that VLAN.  Nothing else changes.  Why is that
any kind of issue?
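
To make the quoted 9000/8000 example concrete, here is a toy model of a
shared L2 segment.  It is only a sketch: the function name and the three
outcome strings are invented, and it assumes the receiver drops oversize
frames without telling anyone, which is the case described in the quote.

```python
def link_outcome(packet_size: int, sender_mtu: int, receiver_mtu: int) -> str:
    """Classify what happens to one packet crossing a shared L2 segment."""
    if packet_size > sender_mtu:
        # The sender knows the packet does not fit, so it can fragment or
        # return an ICMP error and classic PMTUD has something to work with.
        return "too big for sender"
    if packet_size > receiver_mtu:
        # The sender thinks the packet fits, the receiver disagrees, and
        # nothing on the wire reports the loss.
        return "silently dropped"
    return "delivered"


# Mismatched configuration from the quoted example.
print(link_outcome(8001, sender_mtu=9000, receiver_mtu=8000))
# Everyone on the segment agrees, whether on 1500 or on 9000.
print(link_outcome(1500, sender_mtu=1500, receiver_mtu=1500))
print(link_outcome(9000, sender_mtu=9000, receiver_mtu=9000))
```

The silent drop only shows up when the two ends disagree.  When everyone
on the segment uses the same value, the outcome is the same at 9000 as it
is at 1500.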

> The point where a protocol is needed is between router A and router B,
> so they can determine the MTU of the link, without needing to involve
> the humans in a manual negotiation process. 

When the TCP/IP connection is opened between the routers for a routing
session, they should each send the other an MSS value that says how
large a packet they can accept.  You already have that information
available. TCP provides that negotiation for directly connected

Again, nothing changes from the current method of operating. If I showed
up at a peering switch and wanted to use 1000 byte MTU, I would probably
have some problems.  The point I am making is that 1500 is a relic value
that hamstrings Internet performance and there is no good reason not to
use 9000 byte MTU at peering points (by all participants) since A: it
introduces no new problems and B: I can't find a vendor of modern gear
at a peering point that doesn't support it, though there may be some
ancient gear in use by some of the peers at some peering points.
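
As a rough illustration of what the larger MTU buys, the sketch below
compares 1500 and 9000 byte packets.  It assumes 40 bytes of IPv4 plus
TCP header per packet with no options, ignores link-layer framing, and
uses a hypothetical 1,000,000 byte transfer; none of those numbers come
from a measurement.

```python
import math

HEADERS = 40  # IPv4 (20) + TCP (20) bytes, assuming no options


def payload_efficiency(mtu: int) -> float:
    """Fraction of each full-size packet that is TCP payload."""
    return (mtu - HEADERS) / mtu


def packets_needed(total_bytes: int, mtu: int) -> int:
    """Full-size packets required to carry total_bytes of payload."""
    return math.ceil(total_bytes / (mtu - HEADERS))


for mtu in (1500, 9000):
    print(mtu, round(payload_efficiency(mtu), 4), packets_needed(1_000_000, mtu))
```

Under those assumptions the same transfer takes 685 packets at 1500
bytes and 112 packets at 9000 bytes, so every device in the path has
roughly one sixth as many packets to handle.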

I cannot think of a problem that changing from 1500 to 9000 as the
standard at peering points introduces.  It would also speed up the
loading of the BGP routes between routers at the peering points.  If Joe
Blow at home with a dialup connection with an MTU of 576 is talking to a
server at Y! with an MTU of 10 billion, changing a peering path from
1500 to 9000 bytes somewhere in the path is not going to change that
PMTU discovery one iota.  It introduces no problem whatsoever. It
changes nothing.
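
The reason is simply that the path MTU is the smallest MTU of any hop in
the path.  A minimal sketch follows; the hop lists are hypothetical, and
only the 576, 1500, 9000 and "10 billion" values come from the example
above.

```python
def path_mtu(hop_mtus):
    """The usable MTU of a path is the smallest MTU of any hop on it."""
    return min(hop_mtus)


# Dialup user (576) <-> peering point <-> server with an enormous MTU.
before = [576, 1500, 10_000_000_000]
after = [576, 9000, 10_000_000_000]
print(path_mtu(before), path_mtu(after))

# Two jumbo-capable networks whose only small hop is the peering point.
print(path_mtu([9000, 1500, 9000]), path_mtu([9000, 9000, 9000]))
```

For the dialup user the answer is 576 before and after.  The change only
matters for paths where the peering point was the smallest hop, and there
it raises the limit from 1500 to 9000.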

> If that doesn't prove my point about every vendor having their own
> definition of what # is and isn't supported, I don't know what does.
> Also, I don't know what exchanges YOU connect to, but I very clearly
> see a giant pile of gear on that list that is still in use today. :)

That is a list of 9000 byte clean gear.  The very bottom is the stuff
that doesn't support it.  Of the stuff that doesn't support it, how much
is connected directly to a peering point?  THAT is the bottleneck I am
talking about right now.  One step at a time.  Removing the bottleneck
at the peering points is all I am talking about.  That will not change
PMTU issues elsewhere and those will stand just exactly as they are
today without any change.  In fact it will ensure that there are *fewer*
PMTU discovery issues by being able to support a larger range of packets
without having to fragment them.

We *already* have SONET MTU of >4000 and this hasn't broken anything
since the invention of SONET.
