[NANOG] Microsoft.com PMTUD black hole?

Nathan Anderson/FSR nathana at fsr.com
Tue May 6 21:29:03 UTC 2008


Iljitsch van Beijnum wrote:

> A more common approach is to rewrite the MSS option in all TCP SYNs  

[snip]

Yeah, we do this now, but the software that we have been using for PPPoE 
termination, as well as for a huge portion of our clients (MikroTik 
RouterOS), doesn't do it correctly in my estimation.  When you flip on 
the automatic "change-tcp-mss" option, it rewrites the MSS in ALL SYNs 
passing through it, coming OR going.  This breaks communication with 
other hosts that actually have a SMALLER MSS than our PPPoE customers: 
our client gets a SYN+ACK from the remote host that we have rewritten 
to reflect a larger MSS than the remote host is capable of dealing 
with.  Because MikroTik rewrote both the SYNs generated by us and those 
received by us, our customer's host is now under the impression that 
the lowest MSS between the two hosts matches its own.

At least that's the best theory I've come up with.  We can write (and 
have written) custom IP manglers on the MikroTik boxes that only touch 
SYNs generated by our clients, and only when the MSS is larger than a 
certain value (in order to honor MSSes even lower than that allowed by 
their PPPoE gateway).  But it's a PITA to deal with.  I'd just rather 
everyone follow protocol. :-P  Although we can't always expect everyone 
to do it by the book, I don't think it is too much to ask that those who 
operate sizable networks that nearly everyone is required to interact 
with on a daily basis (read: Microsoft) act responsibly.
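
To make the difference between the two behaviours concrete, here is a 
minimal Python sketch.  It is not RouterOS syntax; the function names 
are hypothetical, and the 1492-byte PPPoE MTU is an assumed typical 
value that the email does not state.  The first function models what 
the email suspects the automatic option does, the second models the 
custom rule described above.

```python
# Hypothetical model; none of these names come from RouterOS.
PPPOE_MTU = 1492                       # assumed: 1500-byte Ethernet minus 8 bytes of PPPoE/PPP
IP_TCP_HEADERS = 40                    # 20-byte IPv4 header + 20-byte TCP header, no options
CEILING = PPPOE_MTU - IP_TCP_HEADERS   # 1452


def overwrite_mss(advertised: int, ceiling: int = CEILING) -> int:
    """Suspected behaviour: set the MSS unconditionally, in either direction."""
    return ceiling


def rewrite_syn_mss(advertised: int, from_client: bool, ceiling: int = CEILING) -> int:
    """Custom rule: touch only client SYNs, and only ever lower the MSS."""
    if from_client and advertised > ceiling:
        return ceiling
    return advertised


# A remote host that advertises a 536-byte MSS in its SYN+ACK:
print(overwrite_mss(536))                        # 1452, larger than the remote can handle
print(rewrite_syn_mss(536, from_client=False))   # 536, left alone
# A client SYN advertising the usual Ethernet MSS:
print(rewrite_syn_mss(1460, from_client=True))   # 1452
# A client that already advertises something smaller is honored:
print(rewrite_syn_mss(1400, from_client=True))   # 1400
```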

> All of this even went so far that the IETF came up with RFC 4821,  
> which will do path MTU discovery by correlating lost packets with  
> packet sizes to determine the path MTU rather than depend on ICMP  
> messages.

What's funny is that I ran my tests from a Windows XP host with the 
recently-released Service Pack 3 installed, which is supposed to 
activate Microsoft's "PMTUD Black Hole Router Detection" by default 
(available pre-SP3 but apparently not turned on without a registry 
change).  I haven't read up on exactly how it's supposed to work, but I 
think the basic idea is that if the TCP handshake completes properly 
but the host doesn't get a response beyond that, it will retry with 
lower and lower MSSes until it does.
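
That "basic idea" can be sketched as a simple step-down loop.  This is 
only an illustration of the guess above, not Microsoft's actual 
algorithm and not the RFC 4821 procedure; the function names, the 
fallback sizes, and the simulated 1492-byte path are all assumptions.

```python
from typing import Callable, Optional, Sequence

IP_TCP_HEADERS = 40  # 20-byte IPv4 header + 20-byte TCP header, no options


def find_working_mss(
    delivers: Callable[[int], bool],
    start_mss: int,
    fallbacks: Sequence[int] = (1452, 1400, 1280, 536),  # hypothetical step-down values
) -> Optional[int]:
    """Try the starting MSS, then each smaller fallback, until one gets through."""
    candidates = [start_mss] + [m for m in fallbacks if m < start_mss]
    for mss in candidates:
        if delivers(mss):
            return mss
    return None  # nothing got through; give up


def pppoe_path(mss: int) -> bool:
    """Simulated path that silently drops anything over 1492 bytes (assumed MTU)."""
    return mss + IP_TCP_HEADERS <= 1492


print(find_working_mss(pppoe_path, 1460))  # 1452
```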

However it works (or doesn't as the case may be), it didn't make a lick 
of difference.  I waited and waited for content to be delivered to me 
until eventually Microsoft's end sent me a TCP RST.

While I was poking at this, though, I had a thought...I believe most IP 
stacks keep a path MTU cache of some sort.  I know Windows does: if I 
send an ICMP echo request with DF set that is larger than the PPPoE 
gateway can handle, I get something similar to the following:

C:\Documents and Settings\nathana>ping 64.126.160.1 -f -l 1472

Pinging 64.126.160.1 with 1472 bytes of data:

Reply from 64.126.142.249: Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
[...]

Next time that I try the same thing, Windows doesn't even bother trying 
to send the packet.  It looks at its PMTU table for that IP, and already 
KNOWS it is too big:

C:\Documents and Settings\nathana>ping 64.126.160.1 -f -l 1472

Pinging 64.126.160.1 with 1472 bytes of data:

Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
[...]
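
For reference, the arithmetic behind the 1472-byte payload in those 
pings, as a small Python sketch.  The 1492-byte PPPoE MTU is an assumed 
typical value, not one quoted in the output above, and the helper names 
are made up for illustration.

```python
IP_HEADER = 20    # IPv4 header without options
ICMP_HEADER = 8   # ICMP echo header


def packet_size(ping_payload: int) -> int:
    """Size on the wire (at the IP layer) of an echo request with this payload."""
    return ping_payload + ICMP_HEADER + IP_HEADER


def max_ping_payload(path_mtu: int) -> int:
    """Largest 'ping -l' value that fits in one unfragmented packet."""
    return path_mtu - IP_HEADER - ICMP_HEADER


print(packet_size(1472))       # 1500, a full Ethernet-sized packet
print(max_ping_payload(1492))  # 1464, the most an assumed 1492-byte PPPoE link passes
```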

However, even after doing the same with www.msnbc.msn.com, so that 
Windows had the MSNBC entry in its PMTU cache (and with its IP set 
statically in my 'hosts' file so that Akamai/MS round-robin DNS doesn't 
screw with me during the test), when I tried to build a TCP connection 
to MSNBC from this same host, Windows still told the remote host it had 
a 1460 MSS.

Now, although that makes sense, in order to avoid issues like the one we 
are facing with Microsoft, would it not make _more_ sense for the stack 
to look at the PMTU cache first, and then adjust its own MSS just for 
connections to that one host?  Maybe even FIRST send out an MTU - 40 
ICMP packet to the host that we want to build a TCP connection with, to 
get an ICMP type 3 code 4 ("fragmentation needed") response from the 
router in-between with the smaller MTU?

That would put the burden of PMTUD on the host requesting the TCP 
session rather than on the one responding, but if hosts were "smarter" 
like this it seems to me it might smooth out some of these issues.  The 
remote end could be "broken" with respect to PMTUD but it wouldn't matter.
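
A rough Python sketch of the cache-first idea, assuming the stack keeps 
a per-destination PMTU table as a plain dictionary.  The function name, 
the table layout, and the example addresses are hypothetical; no real 
stack's API is being described here.

```python
from typing import Dict

IP_TCP_HEADERS = 40  # 20-byte IPv4 header + 20-byte TCP header, no options


def choose_mss(dest: str, pmtu_cache: Dict[str, int], interface_mtu: int = 1500) -> int:
    """Advertise an MSS based on the cached path MTU for this destination, if any."""
    path_mtu = min(interface_mtu, pmtu_cache.get(dest, interface_mtu))
    return path_mtu - IP_TCP_HEADERS


# Hypothetical cache: one destination already learned to sit behind a 1492-byte path.
cache = {"192.0.2.10": 1492}

print(choose_mss("192.0.2.10", cache))    # 1452, derived from the cached path MTU
print(choose_mss("198.51.100.7", cache))  # 1460, no entry so the interface MTU is used
```

With something like this, the remote end never sends a segment larger 
than the path can carry, so it does not matter whether ICMP makes it 
back to that host.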

Thoughts?

-- 
Nathan Anderson
First Step Internet, LLC
nathana at fsr.com



