RINA - scott whaps at the nanog hornets nest :-)

Richard A Steenbergen ras at e-gerbil.net
Sat Nov 6 17:20:01 CDT 2010

On Sat, Nov 06, 2010 at 02:21:51PM -0700, George Bonser wrote:
> That is not a new problem.  That is also true today with "last 
> mile" links (e.g. dialup) that support <1500 byte MTU.  What is 
> different today is RFC 4821 PMTU discovery which deals with the "black 
> holes".
> RFC 4821 PMTUD is that "negotiation" that is "lacking".  It is there. 
> It is deployed.  It actually works.  No more relying on someone 
> sending the ICMP packets through in order for PMTUD to work!

The only thing this adds is a trial-and-error probing mechanism per flow, 
to try and recover from the infinite blackholing that would occur if 
your ICMP is blocked in classic PMTUD. If this actually happened at any 
scale, it would create a performance and overhead penalty that is far 
worse than the original problem you're trying to solve.

Say you have two routers talking to each other over a L2 switched 
infrastructure (i.e. an exchange point). In order for PMTUD to function 
quickly and effectively, the two routers on each end MUST agree on the 
MTU value of the link between them. If router A thinks it is 9000, and 
router B thinks it is 8000, when router A comes along and tries to send 
an 8001 byte packet it will be silently discarded, and the only way to 
recover from this is with trial-and-error probing by the endpoints after 
they detect what they believe to be MTU blackholing. This is little more 
than a desperate ghetto hack designed to save the connection from 
complete disaster.
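
To make the failure mode concrete, here is a minimal sketch of the 
9000/8000 mismatch above and the end-host probing needed to recover from 
it. Everything here is an assumption for illustration: link_delivers() 
is a toy model of the link, probe_path_mtu() is a hypothetical name, and 
a plain binary search stands in for whatever search strategy a real 
RFC 4821 implementation uses.

```python
def link_delivers(size, mtu_a=9000, mtu_b=8000):
    """Toy model: True if a packet of `size` bytes sent by router A reaches B.

    A will send anything up to its own MTU; B silently discards anything
    larger than what it believes the link MTU to be. No ICMP is generated.
    """
    return size <= mtu_a and size <= mtu_b


def probe_path_mtu(delivers, low=1500, high=9000):
    """Find the largest size that gets through by trial and error.

    Assumes `low` is known to work. Returns (mtu, probes_sent, probes_lost).
    In real life every lost probe costs the flow a timeout before the
    sender can conclude the packet was blackholed.
    """
    probes_sent = 0
    probes_lost = 0
    while low < high:
        mid = (low + high + 1) // 2
        probes_sent += 1
        if delivers(mid):
            low = mid
        else:
            probes_lost += 1
            high = mid - 1
    return low, probes_sent, probes_lost


if __name__ == "__main__":
    mtu, sent, lost = probe_path_mtu(link_delivers)
    print(f"usable MTU {mtu} found after {sent} probes, {lost} of them lost")
```

Each flow that hits the mismatch has to repeat some version of this 
search on its own, which is where the overhead comes from.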

The point where a protocol is needed is between router A and router B, 
so they can determine the MTU of the link, without needing to involve 
the humans in a manual negotiation process. Ideally this would support 
multi-point LANs over ethernet as well, so .1 could have an MTU of 9000, 
.2 could have an MTU of 8000, etc. And of course you have to make sure 
that you can actually PASS the MTU across the wire (if the switch in the 
middle can't handle it, the packet will also be silently dropped), so 
you can't just rely on the other side to tell you what size it THINKS it 
can support. You don't have a shot in hell of having MTUs negotiated 
correctly or PMTUD work well until this is done.
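
A rough sketch of what such a link-level negotiation could look like, 
per neighbor on a shared LAN. No such protocol exists in this form; the 
function names, the candidate size list, and the probe_ok() callback 
(which stands in for sending a padded frame and getting it echoed back) 
are all hypothetical. The point is only the two steps: take the smaller 
of the two claimed values, then verify it actually crosses the wire.

```python
# Illustrative fallback sizes to try, largest first. Not a standard list.
COMMON_SIZES = (9216, 9000, 8000, 4470, 1500)


def usable_mtu(local_mtu, advertised_mtu, probe_ok):
    """Return the largest verified MTU toward one neighbor, or None.

    Step 1: never trust more than the smaller of the two claims.
    Step 2: do not trust the claim either until a probe of that size
            has actually made it through whatever sits in the middle.
    """
    claimed = min(local_mtu, advertised_mtu)
    candidates = [claimed] + [s for s in COMMON_SIZES if s < claimed]
    for size in candidates:
        if probe_ok(size):
            return size
    return None  # nothing verified; fall back to defaults and tell a human


def build_neighbor_table(local_mtu, advertised, switch_limit):
    """Map each neighbor address to its verified MTU.

    `advertised` maps neighbor address -> MTU that neighbor claims.
    `switch_limit` models the largest frame the switch will pass.
    """
    table = {}
    for addr, adv in advertised.items():
        def probe_ok(size, adv=adv):
            # Simulated probe: must fit the neighbor and the switch.
            return size <= adv and size <= switch_limit
        table[addr] = usable_mtu(local_mtu, adv, probe_ok)
    return table


if __name__ == "__main__":
    claims = {"192.0.2.1": 9000, "192.0.2.2": 8000}
    print(build_neighbor_table(9000, claims, switch_limit=9216))
    print(build_neighbor_table(9000, claims, switch_limit=4470))
```

The second call shows the case where both routers agree on a value the 
switch in the middle cannot pass, so the claimed size fails verification 
and a smaller one is used instead.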

> Is there any gear connected to a major IX that does NOT support large 
> frames?  I am not aware of any manufactured today.  Even cheap D-Link 
> gear supports them.  I believe you would be hard-pressed to locate 
> gear that doesn't support it at any major IX.  Granted, it might 
> require the change of a global config value and a reboot for it to 
> take effect in some vendors.
> http://darkwing.uoregon.edu/~joe/jumbo-clean-gear.html

If that doesn't prove my point about every vendor having their own 
definition of what # is and isn't supported, I don't know what does. 
Also, I don't know what exchanges YOU connect to, but I very clearly see 
a giant pile of gear on that list that is still in use today. :)

> As for the configuration differences between units, how does that 
> change from the way things are now?  A person configuring a Juniper 
> for 1500 byte packets already must know the difference as that quirk 
> of including the headers is just as true at 1500 bytes as it is at 
> 9000 bytes.  Does the operator suddenly become less competent with 
> their gear when they use a different value?  Also, a 9000 byte MTU 
> would be a happy value that practically everyone supports these days, 
> including ethernet adaptors on host machines.

Everything defaults to 1500 today, so nobody has to do anything. Again, 
I'm actually doing this with people today on a very large network with 
lots of peers all over the world, so I have a little bit of experience 
with exactly what goes wrong. Nearly everyone who tries to figure out 
the correct MTU between vendors and with a third party network gets it 
wrong, at least some significant percentage of the time.

And honestly I can't even find an interesting number of people willing 
to turn on BFD, something with VERY clear benefits for improving failure 
detection time over an IX (for the next time Equinix decides to do one 
of their 10PM maintenances that causes hours of unreachability until 
hold timers expire :P). If the IX operators saw any significant demand 
they would have turned it on already.
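
For a sense of scale on the BFD point, a back-of-the-envelope 
comparison. The timer values below are assumptions picked for 
illustration, not anyone's actual configuration: a 300 ms BFD interval 
with a multiplier of 3, against a 180 second BGP hold time. BFD's 
detection time is roughly the negotiated interval times the detect 
multiplier.

```python
def bfd_detection_ms(interval_ms, multiplier):
    """Approximate BFD failure detection time in milliseconds."""
    return interval_ms * multiplier


def speedup_vs_hold_timer(hold_time_s, interval_ms, multiplier):
    """How many times faster BFD notices a dead peer than the hold timer."""
    return (hold_time_s * 1000) / bfd_detection_ms(interval_ms, multiplier)


if __name__ == "__main__":
    # Assumed example values, see the note above.
    detect = bfd_detection_ms(300, 3)
    ratio = speedup_vs_hold_timer(180, 300, 3)
    print(f"BFD detects failure in about {detect} ms, "
          f"{ratio:.0f}x faster than a 180 s hold timer")
```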

Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
