[NANOG] Microsoft.com PMTUD black hole?
Nathan Anderson/FSR
nathana at fsr.com
Tue May 6 21:29:03 UTC 2008
Iljitsch van Beijnum wrote:
> A more common approach is to rewrite the MSS option in all TCP SYNs
[snip]
Yeah, we do this now, but the software that we have been using for PPPoE
termination as well as for a huge portion of our clients (MikroTik
RouterOS) doesn't do it correctly in my estimation when you flip on the
automatic "change-tcp-mss" option...it rewrites the MSS in ALL SYNs
passing through it, either coming OR going. This has the effect of
breaking communication with other hosts that actually have a SMALLER MSS
than our PPPoE customers since our client will get a SYN+ACK from the
remote host that we have rewritten to reflect a larger MSS than the
remote host is capable of dealing with. Because MikroTik rewrote both
the SYNs we generated and the ones we received, our customer's host
is now under the impression that the lowest MSS between the two hosts
matches its own.
At least that's the best theory I've come up with. We can write (and
have written) custom IP manglers on the MikroTik boxes that only touch
SYNs generated by our clients, and only when the MSS is larger than a
certain value (in order to honor MSSes even lower than that allowed by
their PPPoE gateway). But it's a PITA to deal with. I'd just rather
everyone follow protocol. :-P Although we can't always expect everyone
to do it by the book, I don't think it is too much to ask that those who
operate sizable networks that nearly everyone is required to interact
with on a daily basis (read: Microsoft) act responsibly.
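The rule described above (only touch SYNs from our clients, and only
ever lower the MSS) can be sketched roughly as follows. This is plain
Python to show the logic, not RouterOS syntax. The function name and the
1492-byte PPPoE MTU (hence the 1452 ceiling) are assumptions made for
illustration.

```python
# Assumed values for illustration: a 1492-byte PPPoE MTU and 40 bytes of
# IPv4 + TCP headers (no options).
PPPOE_MTU = 1492
IP_TCP_HEADERS = 40
MSS_CEILING = PPPOE_MTU - IP_TCP_HEADERS  # 1452


def clamp_mss(advertised_mss: int, from_client: bool,
              ceiling: int = MSS_CEILING) -> int:
    """Return the MSS value to leave in a SYN passing through the gateway.

    SYNs from remote hosts are passed through untouched, and a client's
    MSS is only ever lowered to the ceiling, never raised.
    """
    if not from_client:
        return advertised_mss
    return min(advertised_mss, ceiling)
```

With these assumed numbers, a client SYN advertising 1460 leaves with
1452, while a remote SYN+ACK advertising 536 is passed through as 536.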
> All of this even went so far that the IETF came up with RFC 4821,
> which will do path MTU discovery by correlating lost packets with
> packet sizes to determine the path MTU rather than depend on ICMP
> messages.
What's funny is that I ran my tests from a Windows XP host with the
recently-released Service Pack 3 installed, which is supposed to
activate Microsoft's "PMTUD Black Hole Router Detection" by default
(available pre-SP3 but apparently not turned on without a registry
change). I haven't read up on exactly how it's supposed to work, but I
think the basic idea is that if the TCP connection is negotiated
properly but it doesn't get a response beyond that, it will try lower
and lower MSSes until it does.
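To make that basic idea concrete, here is a minimal sketch of "try
lower and lower MSSes until something gets through". It is NOT
Microsoft's actual algorithm, which I have not verified. The candidate
sizes, the function names, and the simulated 1492-byte path are all
hypothetical.

```python
from typing import Callable, Iterable, Optional

# Hypothetical candidate sizes, largest first; real stacks pick their own.
CANDIDATE_MSS = (1460, 1452, 1400, 1300, 1200, 536)


def probe_mss(delivered: Callable[[int], bool],
              candidates: Iterable[int] = CANDIDATE_MSS) -> Optional[int]:
    """Return the first (largest) MSS for which a full-sized segment is
    reported as delivered, or None if nothing gets through."""
    for mss in candidates:
        if delivered(mss):
            return mss
    return None


def simulated_path(mss: int) -> bool:
    """Stand-in for a black-hole path with an assumed 1492-byte MTU:
    anything larger vanishes without an ICMP error."""
    return mss + 40 <= 1492


best = probe_mss(simulated_path)  # settles on 1452 for this simulated path
```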
However it works (or doesn't, as the case may be), it didn't make a lick
of difference. I waited and waited for content to be delivered to me
until eventually Microsoft's end sent me a TCP RST.
While I was poking at this, though, I had a thought...most IP stacks, I
believe, keep a path MTU cache of some sort. I know Windows does: if I
send an ICMP packet with DF set that is larger than the PPPoE gateway
can handle, I get something similar to the following:
C:\Documents and Settings\nathana>ping 64.126.160.1 -f -l 1472
Pinging 64.126.160.1 with 1472 bytes of data:
Reply from 64.126.142.249: Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
[...]
Next time that I try the same thing, Windows doesn't even bother trying
to send the packet. It looks at its PMTU table for that IP, and already
KNOWS it is too big:
C:\Documents and Settings\nathana>ping 64.126.160.1 -f -l 1472
Pinging 64.126.160.1 with 1472 bytes of data:
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
[...]
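For reference, the arithmetic behind the 1472-byte ping: the -l value is
the ICMP payload only, so the packet on the wire is 28 bytes larger.
A quick sketch, assuming an IPv4 header without options and a
1492-byte PPPoE MTU (the MTU figure is an assumption, and the helper
names are made up for illustration):

```python
IPV4_HEADER = 20  # bytes, no IP options
ICMP_HEADER = 8   # bytes, echo request header


def ping_packet_size(payload: int) -> int:
    """Total IPv4 packet size for a ping with the given payload length."""
    return payload + ICMP_HEADER + IPV4_HEADER


def max_ping_payload(path_mtu: int) -> int:
    """Largest ping payload that fits in path_mtu without fragmenting."""
    return path_mtu - IPV4_HEADER - ICMP_HEADER


full_size = ping_packet_size(1472)    # 1500: fills a standard Ethernet MTU
pppoe_limit = max_ping_payload(1492)  # 1464: the most an assumed PPPoE link passes
```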
However, even after doing this with www.msnbc.msn.com so that the
MSNBC entry was in the PMTU cache (and with its IP set statically in my
'hosts' file so that Akamai/MS round-robin DNS doesn't screw with me
during the test), when I tried to build a TCP connection to MSNBC from
this same host, Windows still told the remote host it had a 1460 MSS.
Now, although that makes sense, in order to avoid issues like the one we
are facing with Microsoft, would it not make _more_ sense for the stack
to look at the PMTU cache first, and then adjust its own MSS just for
connections to that one host? Maybe even FIRST send out an MTU - 40 ICMP
packet with DF set to the host that we want to build a TCP connection
with, to get an ICMP type 3 code 4 response from the router in-between
with the smaller MTU?
That would put the burden of PMTUD on the host requesting the TCP
session rather than on the one responding, but if hosts were "smarter"
like this it seems to me it might smooth out some of these issues. The
remote end could be "broken" with respect to PMTUD but it wouldn't matter.
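As a rough sketch of what I mean by a "smarter" host: pick the MSS for
each new connection from whatever the PMTU cache already knows about
that destination. The cache layout, the function name, and the
addresses below are hypothetical, and this is not how any particular
stack is implemented.

```python
from typing import Dict

DEFAULT_MTU = 1500   # assumed local Ethernet MTU
IP_TCP_HEADERS = 40  # IPv4 + TCP headers, no options


def mss_for(dest_ip: str, pmtu_cache: Dict[str, int],
            local_mtu: int = DEFAULT_MTU) -> int:
    """MSS to advertise in a SYN to dest_ip, honoring a cached path MTU
    for that destination if one exists."""
    path_mtu = min(local_mtu, pmtu_cache.get(dest_ip, local_mtu))
    return path_mtu - IP_TCP_HEADERS


# Hypothetical cache contents: one destination learned via an earlier
# "fragmentation needed" ICMP error.
cache = {"192.0.2.10": 1492}

known = mss_for("192.0.2.10", cache)      # 1452 instead of 1460
unknown = mss_for("198.51.100.7", cache)  # 1460, nothing cached
```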
Thoughts?
--
Nathan Anderson
First Step Internet, LLC
nathana at fsr.com