[NANOG] Microsoft.com PMTUD black hole?

Nathan Anderson/FSR nathana at fsr.com
Tue May 6 19:07:05 UTC 2008


Hello,

Has anyone else here seen problems with microsoft/msn/hotmail/live.com 
sites not performing PMTUD correctly?  We have, for a while now, had 
people on our network complain of poor microsoft.com reachability, and 
discovered we can work around the issue by rewriting the MSS on all TCP 
SYNs as they leave our network.

I recently watched the whole conversation between msn.com and a host on 
our network (with the MSS rewrite disabled), and if I'm reading it 
right, we are following PMTUD protocol correctly by sending back ICMP 
type 3 code 4, but all Microsoft hosts seem to ignore this and continue 
to send packets back to our host that are too large for the path.

I hope I'm wrong and that it is we who are doing something stupid, but 
after cruising Google for a while, I found a multitude of other 
complaints from people connected to other ISPs specifically about not 
being able to reach Microsoft web sites.  It seems crazy that MS could 
have PMTUD broken for so long with nobody ever raising a complaint to 
them directly, though, which makes me wonder if there is another answer 
here that I'm missing.

I sent the following message to a couple of addresses that I gleaned 
from ARIN WHOIS for the IP block in question and threw hostmaster in 
there just in case it went somewhere, but noc at microsoft.com appears to 
be defunct.  I have yet to receive acknowledgment of receipt from the 
other address.

Are there any microsoft.com admins that hang out here that can comment 
on this or get in touch with me, or is there perhaps someone on here 
with connections to the Microsoft NOC?

(BTW, I stripped the referenced libpcap attachment off of this message 
to the list just so that I wouldn't accidentally incur the wrath of 
NANOG...if y'all want to see it, I'm happy to post it.)

Thanks,

-- 
Nathan Anderson
First Step Internet, LLC
nathana at fsr.com

-------- Original Message --------
Subject: Microsoft/MSN/Live!/Hotmail behind blackhole router?
Date: Thu, 01 May 2008 19:00:46 -0700
From: Nathan Anderson/FSR <nathana at fsr.com>
To: hostmoaster at microsoft.com, noc at microsoft.com, iprrms at microsoft.com

To microsoft.com NOC admins:

I work for a regional ISP in the inland Pacific Northwest.  Many of our
customers' connections have MTUs of less than 1500, and we get routine
complaints from them that they have trouble reaching web sites that are
under your administration.

Usually we can fix the problem by "mangling" the TCP SYNs originating
from our customers and headed to the world to advertise a lower MSS;
however, we would rather not have to do that.  The fact that we are
REQUIRED to do this in order for your sites to be reachable by our
customers strongly suggests that either the servers that respond to HTTP
requests sent to www.microsoft/msn/hotmail/live.com are behind routers
that are blocking ALL ICMP traffic sent their way -- even ICMP type 3
code 4 (fragmentation needed, DF set), which is necessary in order for
Path MTU Discovery to work -- or the servers themselves are not listening to
the ICMP messages that we are sending their way when our routers are
forced to drop a packet sent by you which is too large to be forwarded
to a customer of ours.

I set up a test connection "on the bench" so to speak, and had our
router capture a copy of the conversation between our test client and
www.msnbc.msn.com and forward that conversation encapsulated in TZSP to
the same test client over a different interface.  The capture clearly
shows our test client establishing the TCP connection with MSNBC
(SYN/SYN+ACK/ACK), and then goes on to show MSNBC sending Ethernet
MTU-sized packets our way, which an intermediate router of ours drops,
responding with "packet too big, DF set."  Despite this, MSNBC continues
to retransmit the original packet with the same payload and the same size
back to us.  We continue to respond "packet too big, DF set," but the
MSNBC server never seems to get the message (literally).
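
For anyone unfamiliar with what the sender is supposed to act on, here
is a rough Python sketch of reading the next-hop MTU out of an ICMP
type 3 code 4 message, using the header layout from RFC 1191.  The
function name and the sample bytes are made up for illustration (the
checksum is left at zero and the MTU value is assumed); this is not
taken from the capture.

```python
import struct

ICMP_DEST_UNREACH = 3   # ICMP type: destination unreachable
ICMP_FRAG_NEEDED = 4    # ICMP code: fragmentation needed and DF set

def next_hop_mtu(icmp: bytes):
    """Return the next-hop MTU from a 'fragmentation needed' message, else None."""
    if len(icmp) < 8:
        return None
    # RFC 1191 layout: type, code, checksum, 16 unused bits, next-hop MTU
    icmp_type, code, _checksum, _unused, mtu = struct.unpack("!BBHHH", icmp[:8])
    if icmp_type == ICMP_DEST_UNREACH and code == ICMP_FRAG_NEEDED:
        return mtu
    return None

# Hypothetical header only: checksum not computed, MTU value assumed.
sample = struct.pack("!BBHHH", 3, 4, 0, 0, 1492)
print(next_hop_mtu(sample))
```

A sender that honors PMTUD would be expected to shrink its segments to
fit the reported MTU rather than retransmit at the original size.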

We see the same behavior with all sites across the board contained
within the 207.46.0.0/16 space, regardless of actual hostname/FQDN.

We also find this ironic considering that Microsoft published a Technet
article a few years back on black hole routers and the problems they
pose, found at http://technet.microsoft.com/en-us/library/bb878081.aspx
(which we can't read/access unless we are mangling the MSS).

We would appreciate it if Microsoft NOC admins would please look into
the matter and take the appropriate corrective action: allowing ICMP
type 3 code 4 messages through your routers/firewalls, and making sure
that your servers respond to them appropriately as defined in RFC 1191.

I have attached the capture we made of the conversation to this e-mail
message in libpcap format for your analysis.  The test client itself had
a 1500 MTU to a desktop router, which in turn had an MTU of 1492 on its
uplink to us.
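
As a quick illustration of the arithmetic behind the MSS workaround,
the following Python sketch computes the largest MSS that fits a given
path MTU.  It assumes plain 20-byte IPv4 and 20-byte TCP headers with no
options, and the function name is hypothetical; it is not how our
routers actually implement the rewrite.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options

def clamped_mss(advertised_mss: int, path_mtu: int) -> int:
    """Return the MSS to write into an outgoing SYN for the given path MTU."""
    largest_fit = path_mtu - IPV4_HEADER - TCP_HEADER
    # Never raise the value the client advertised; only lower it if needed.
    return min(advertised_mss, largest_fit)

# A 1500-MTU client advertises MSS 1460; the 1492-MTU uplink allows 1452.
print(clamped_mss(1460, 1492))  # 1452
```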

I am available to answer any additional clarifying questions you may have.

Thank you for your time and attention to this matter.

Regards,

-- 
Nathan Anderson
First Step Internet, LLC
nathana at fsr.com




