<div dir="ltr"><div dir="ltr">RFC 7094 (<a href="https://tools.ietf.org/html/rfc7094">https://tools.ietf.org/html/rfc7094</a>) describes the pitfalls and risks of using TCP with an anycast address. It does recognize that there are valid use cases for it, though.<br></div><div dir="ltr"><br></div><div>Specifically, section 3.1 says this:</div><div>>>></div><div><pre class="gmail-newpage" style="font-size:13.3333px;margin-top:0px;margin-bottom:0px;break-before:page;color:rgb(0,0,0)">   Most stateful transport protocols (e.g., TCP), without modification,
   do not understand the properties of anycast; hence, they will fail
   probabilistically, but possibly catastrophically, when using anycast
   addresses in the presence of "normal" routing dynamics.</pre><pre class="gmail-newpage" style="font-size:13.3333px;margin-top:0px;margin-bottom:0px;break-before:page;color:rgb(0,0,0)">...</pre><pre class="gmail-newpage" style="font-size:13.3333px;margin-top:0px;margin-bottom:0px;break-before:page;color:rgb(0,0,0)"><pre class="gmail-newpage" style="margin-top:0px;margin-bottom:0px;break-before:page">   This can lead
   to a protocol working fine in, say, a test lab but not in the global
   Internet.</pre></pre></div><div>>>></div><div><br></div><div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Nov 13, 2019 at 3:33 PM Warren Kumari <<a href="mailto:warren@kumari.net">warren@kumari.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Thu, Nov 14, 2019 at 12:25 AM Matt Corallo <<a href="mailto:nanog@as397444.net" target="_blank">nanog@as397444.net</a>> wrote:<br>

><br>

> This sounds like a bug on Cloudflare’s end (cause trying to do anycast TCP is... out of spec to say the least), not a bug in ECN/ECMP.<br>

<br>

Errrrrr. I really don't think that there is any sort of spec that<br>

covers that :-P<br>

<br>

Using Anycast for TCP is incredibly common - the DNS root servers for<br>

one obvious example.<br>

More TCP-centric well-known examples are Fastly and LinkedIn -<br>

LinkedIn in particular did a really good podcast on their experience<br>

with this.<br>

<br>

There is also a good NANOG talk from the ~2000s (?) on people using<br>

TCP anycast for long-lived flows (serving ISO files, which took a long<br>

time in those days), and how reliable it is - perhaps that's the talk<br>

Todd mentioned?<br>

<br>

W<br>

<br>

><br>

> > On Nov 13, 2019, at 11:07, Toke Høiland-Jørgensen via NANOG <<a href="mailto:nanog@nanog.org" target="_blank">nanog@nanog.org</a>> wrote:<br>

> ><br>

> > <br>

> >><br>

> >> Hello<br>

> >><br>

> >> I have a customer that believes my network has an ECN problem. We do<br>

> >> not, we just move packets. But how do I prove it?<br>

> >><br>

> >> Is there a tool that checks for ECN trouble? Ideally something I could<br>

> >> run on the NLNOG Ring network.<br>

> >><br>

> >> I believe it likely that it is the destination that has the problem.<br>

> ><br>

> > Hi Baldur<br>

> ><br>

> > I believe I may be that customer :)<br>

> ><br>

> > First of all, thank you for looking into the issue! We've been having<br>

> > great fun over on the ecn-sane mailing list trying to figure out what's<br>

> > going on. I'll summarise below, but see this thread for the discussion<br>

> > and debugging details:<br>

> > <a href="https://lists.bufferbloat.net/pipermail/ecn-sane/2019-November/000527.html" rel="noreferrer" target="_blank">https://lists.bufferbloat.net/pipermail/ecn-sane/2019-November/000527.html</a><br>

> ><br>

> > The short version is that the problem appears to come from a combination<br>

> > of the ECMP routing in your network, and Cloudflare's heavy use of<br>

> > anycast. Specifically, a router in your network appears to be doing ECMP<br>

> > by hashing on the packet header, *including the ECN bits*. This breaks<br>

> > TCP connections with ECN because the TCP SYN (with no ECN bits set) ends<br>

> > up taking a different path than the rest of the flow (which is marked as<br>

> > ECT(0)). When the destination is anycasted, this means that the data<br>

> > packets go to a different server than the SYN did. This second server<br>

> > doesn't recognise the connection, and so replies with a TCP RST. To fix<br>

> > this, simply exclude the ECN bits (or the whole TOS byte) from your<br>

> > router's ECMP hash.<br>
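<br>
The mechanism in the short version above can be illustrated with a small simulation. This is only a sketch under stated assumptions: SHA-256 stands in for whatever hash the router really uses, the two path labels are hypothetical, and the addresses are the ones from the capture later in this thread. It counts how many flows have their SYN (TOS 0x00) and their data packets (TOS 0x02) land on different paths, with and without the ECN bits in the hash key.<br>
<pre>
```python
import hashlib

# Hypothetical labels for the two equal-cost paths; not taken from any real router.
NEXT_HOPS = ["path-A", "path-B"]


def ecmp_next_hop(src, dst, sport, dport, proto, tos, hash_ecn):
    """Pick a next hop from a hash over the flow key.

    hash_ecn=True feeds the whole TOS byte (ECN bits included) into the hash,
    which is the behaviour suspected in the thread. hash_ecn=False keeps only
    the DSCP part, so ECN marking cannot change the chosen path.
    SHA-256 is an assumption used purely as a stand-in hash function.
    """
    dscp = tos // 4  # the ECN field is the low two bits of the TOS byte
    tos_part = tos if hash_ecn else dscp * 4
    key = f"{src}|{dst}|{sport}|{dport}|{proto}|{tos_part}".encode()
    digest = hashlib.sha256(key).digest()
    return NEXT_HOPS[digest[0] % len(NEXT_HOPS)]


def split_flows(hash_ecn, ports=range(10000, 11000)):
    """Count flows whose SYN (TOS 0x00) and data (TOS 0x02) take different paths."""
    count = 0
    for sport in ports:
        syn = ecmp_next_hop("10.42.3.130", "104.24.125.13", sport, 80, 6, 0x00, hash_ecn)
        data = ecmp_next_hop("10.42.3.130", "104.24.125.13", sport, 80, 6, 0x02, hash_ecn)
        if syn != data:
            count += 1
    return count


if __name__ == "__main__":
    print("flows split with ECN bits in the hash:", split_flows(True))
    print("flows split with ECN bits excluded:", split_flows(False))
```
</pre>
With the ECN bits excluded the count is zero by construction, since the SYN and the data packets then produce the same key. With them included, roughly half of the flows would be expected to split across two paths, which only becomes visible as resets when the paths lead to different anycast servers.<br>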

> ><br>

> > For a longer exposition, see below. You should be able to verify this<br>

> > from somewhere else in the network, but if there's anything else you<br>

> > want me to test, do let me know. Also, would you mind sharing the router<br>

> > make and model that does this? We're trying to collect real-world<br>

> > examples of network problems caused by ECN and this is definitely an<br>

> > interesting example.<br>

> ><br>

> > -Toke<br>

> ><br>

> ><br>

> ><br>

> > The long version:<br>

> ><br>

> > From my end I can see that I have two paths to Cloudflare; which is<br>

> > taken appears to be based on a hash of the packet header, as can be seen<br>

> > by varying the source port:<br>

> ><br>

> > $ traceroute -q 1 --sport=10000 104.24.125.13<br>

> > traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets<br>

> > 1  _gateway (10.42.3.1)  0.357 ms<br>

> > 2  <a href="http://albertslund-edge1-lo.net.gigabit.dk" rel="noreferrer" target="_blank">albertslund-edge1-lo.net.gigabit.dk</a> (185.24.171.254)  4.707 ms<br>

> > 3  <a href="http://customer-185-24-168-46.ip4.gigabit.dk" rel="noreferrer" target="_blank">customer-185-24-168-46.ip4.gigabit.dk</a> (185.24.168.46)  1.283 ms<br>

> > 4  <a href="http://te0-1-1-5.rcr21.cph01.atlas.cogentco.com" rel="noreferrer" target="_blank">te0-1-1-5.rcr21.cph01.atlas.cogentco.com</a> (149.6.137.49)  1.667 ms<br>

> > 5  <a href="http://netnod-ix-cph-blue-9000.cloudflare.com" rel="noreferrer" target="_blank">netnod-ix-cph-blue-9000.cloudflare.com</a> (212.237.192.246)  1.406 ms<br>

> > 6  104.24.125.13 (104.24.125.13)  1.322 ms<br>

> ><br>

> > $ traceroute -q 1 --sport=10001 104.24.125.13<br>

> > traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets<br>

> > 1  _gateway (10.42.3.1)  0.293 ms<br>

> > 2  <a href="http://albertslund-edge1-lo.net.gigabit.dk" rel="noreferrer" target="_blank">albertslund-edge1-lo.net.gigabit.dk</a> (185.24.171.254)  3.430 ms<br>

> > 3  <a href="http://customer-185-24-168-38.ip4.gigabit.dk" rel="noreferrer" target="_blank">customer-185-24-168-38.ip4.gigabit.dk</a> (185.24.168.38)  1.194 ms<br>

> > 4  <a href="http://10ge1-2.core1.cph1.he.net" rel="noreferrer" target="_blank">10ge1-2.core1.cph1.he.net</a> (216.66.83.101)  1.297 ms<br>

> > 5  <a href="http://be2306.ccr42.ham01.atlas.cogentco.com" rel="noreferrer" target="_blank">be2306.ccr42.ham01.atlas.cogentco.com</a> (130.117.3.237)  6.805 ms<br>

> > 6  149.6.142.130 (149.6.142.130)  6.925 ms<br>

> > 7  104.24.125.13 (104.24.125.13)  1.501 ms<br>

> ><br>

> ><br>

> > This is fine in itself. However, the problem stems from the fact that<br>

> > the ECN bits in the IP header are also included in the ECMP hash (-t<br>

> > sets the TOS byte; -t 1 ends up as ECT(1) on the wire and -t 2 is<br>

> > ECT(0)):<br>

> ><br>

> > $ traceroute -q 1 --sport=10000 104.24.125.13 -t 1<br>

> > traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets<br>

> > 1  _gateway (10.42.3.1)  0.336 ms<br>

> > 2  <a href="http://albertslund-edge1-lo.net.gigabit.dk" rel="noreferrer" target="_blank">albertslund-edge1-lo.net.gigabit.dk</a> (185.24.171.254)  6.964 ms<br>

> > 3  <a href="http://customer-185-24-168-46.ip4.gigabit.dk" rel="noreferrer" target="_blank">customer-185-24-168-46.ip4.gigabit.dk</a> (185.24.168.46)  1.056 ms<br>

> > 4  <a href="http://te0-1-1-5.rcr21.cph01.atlas.cogentco.com" rel="noreferrer" target="_blank">te0-1-1-5.rcr21.cph01.atlas.cogentco.com</a> (149.6.137.49)  1.512 ms<br>

> > 5  <a href="http://netnod-ix-cph-blue-9000.cloudflare.com" rel="noreferrer" target="_blank">netnod-ix-cph-blue-9000.cloudflare.com</a> (212.237.192.246)  1.313 ms<br>

> > 6  104.24.125.13 (104.24.125.13)  1.210 ms<br>

> ><br>

> > $ traceroute -q 1 --sport=10000 104.24.125.13 -t 2<br>

> > traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets<br>

> > 1  _gateway (10.42.3.1)  0.339 ms<br>

> > 2  <a href="http://albertslund-edge1-lo.net.gigabit.dk" rel="noreferrer" target="_blank">albertslund-edge1-lo.net.gigabit.dk</a> (185.24.171.254)  2.565 ms<br>

> > 3  <a href="http://customer-185-24-168-38.ip4.gigabit.dk" rel="noreferrer" target="_blank">customer-185-24-168-38.ip4.gigabit.dk</a> (185.24.168.38)  1.301 ms<br>

> > 4  <a href="http://10ge1-2.core1.cph1.he.net" rel="noreferrer" target="_blank">10ge1-2.core1.cph1.he.net</a> (216.66.83.101)  1.339 ms<br>

> > 5  <a href="http://be2306.ccr42.ham01.atlas.cogentco.com" rel="noreferrer" target="_blank">be2306.ccr42.ham01.atlas.cogentco.com</a> (130.117.3.237)  6.570 ms<br>

> > 6  149.6.142.130 (149.6.142.130)  6.888 ms<br>

> > 7  104.24.125.13 (104.24.125.13)  1.785 ms<br>

> ><br>

> ><br>
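<br>
For reference when reading the -t values, RFC 3168 puts the ECN field in the two low-order bits of the former TOS byte: binary 10 is ECT(0), 01 is ECT(1) and 11 is CE. The helper below is a minimal sketch (the function name is made up for illustration) that splits a TOS value into its DSCP and ECN parts.<br>
<pre>
```python
# ECN codepoints as defined in RFC 3168 (low two bits of the TOS byte).
ECN_CODEPOINTS = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def split_tos(tos):
    """Split a TOS byte into (DSCP value, ECN codepoint name)."""
    if tos not in range(256):
        raise ValueError("TOS must fit in one byte")
    dscp, ecn = divmod(tos, 4)  # upper six bits are DSCP, lower two are ECN
    return dscp, ECN_CODEPOINTS[ecn]


if __name__ == "__main__":
    for tos in (0, 1, 2, 3):
        print(f"traceroute -t {tos}:", split_tos(tos))
```
</pre>
A TOS value of 2 decodes to DSCP 0 with ECT(0), which agrees with the "tos 0x2,ECT(0)" annotation tcpdump prints in the capture further down.<br>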

> > So why is this a problem? The TCP SYN packet first needs to negotiate<br>

> > ECN, so it is sent without any ECN bits set in the header; after<br>

> > negotiation succeeds, the data packets will be marked as ECT(0). But<br>

> > because that becomes part of the ECMP hash, those packets will take<br>

> > another path. And since the destination is anycasted, that means they<br>

> > will also end up at a different endpoint. This second endpoint won't<br>

> > recognise the connection, and will reply with a TCP RST. This is clearly<br>

> > visible in tcpdump; notice the different TOS values, and that the RST<br>

> > packet has a different TTL than the SYN-ACK:<br>

> ><br>

> > 12:21:47.816359 IP (tos 0x0, ttl 64, id 25687, offset 0, flags [DF], proto TCP (6), length 60)<br>

> >    10.42.3.130.34420 > 104.24.125.13.80: Flags [SEW], cksum 0xf2ff (incorrect -> 0x0853), seq 3345293502, win 64240, options [mss 1460,sackOK,TS val 4248691972 ecr 0,nop,wscale 7], length 0<br>

> > 12:21:47.823395 IP (tos 0x0, ttl 58, id 0, offset 0, flags [DF], proto TCP (6), length 52)<br>

> >    104.24.125.13.80 > 10.42.3.130.34420: Flags [S.E], cksum 0x9f4a (correct), seq 1936951409, ack 3345293503, win 29200, options [mss 1400,nop,nop,sackOK,nop,wscale 10], length 0<br>

> > 12:21:47.823479 IP (tos 0x0, ttl 64, id 25688, offset 0, flags [DF], proto TCP (6), length 40)<br>

> >    10.42.3.130.34420 > 104.24.125.13.80: Flags [.], cksum 0xf2eb (incorrect -> 0x503e), seq 1, ack 1, win 502, length 0<br>

> > 12:21:47.823665 IP (tos 0x2,ECT(0), ttl 64, id 25689, offset 0, flags [DF], proto TCP (6), length 117)<br>

> >    10.42.3.130.34420 > 104.24.125.13.80: Flags [P.], cksum 0xf338 (incorrect -> 0xc1d4), seq 1:78, ack 1, win 502, length 77: HTTP, length: 77<br>

> >    GET / HTTP/1.1<br>

> >    Host: 104.24.125.13<br>

> >    User-Agent: curl/7.66.0<br>

> >    Accept: */*<br>

> ><br>

> > 12:21:47.825485 IP (tos 0x2,ECT(0), ttl 60, id 0, offset 0, flags [DF], proto TCP (6), length 40)<br>

> >    104.24.125.13.80 > 10.42.3.130.34420: Flags [R], cksum 0x3a65 (correct), seq 1936951410, win 0, length 0<br>

> ><br>

> ><br>
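<br>
The TTL comparison can also be done mechanically when there is more than one capture to look at. The following is a rough sketch, assuming the two-line-per-packet layout that "tcpdump -v" produced above; the regular expressions and function names are illustrative only, and the sample is the SYN-ACK and RST from the capture with the quote prefixes removed.<br>
<pre>
```python
import re

# SYN-ACK and RST copied from the capture above (mail quote prefixes stripped).
CAPTURE = """\
12:21:47.823395 IP (tos 0x0, ttl 58, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    104.24.125.13.80 > 10.42.3.130.34420: Flags [S.E], cksum 0x9f4a (correct), seq 1936951409, ack 3345293503, win 29200, options [mss 1400,nop,nop,sackOK,nop,wscale 10], length 0
12:21:47.825485 IP (tos 0x2,ECT(0), ttl 60, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    104.24.125.13.80 > 10.42.3.130.34420: Flags [R], cksum 0x3a65 (correct), seq 1936951410, win 0, length 0
"""

HEADER_RE = re.compile(r"tos (0x[0-9a-f]+).*?ttl (\d+)")
DETAIL_RE = re.compile(r"^\s+(\S+) > (\S+): Flags \[([^\]]+)\]")


def parse_packets(text):
    """Yield one dict per packet with its source, TCP flags, TOS and TTL."""
    pending = None  # (tos, ttl) from the most recent IP header line
    for line in text.splitlines():
        detail = DETAIL_RE.match(line)
        if detail and pending is not None:
            tos, ttl = pending
            yield {"src": detail.group(1), "flags": detail.group(3), "tos": tos, "ttl": ttl}
            pending = None
            continue
        header = HEADER_RE.search(line)
        if header:
            pending = (int(header.group(1), 16), int(header.group(2)))


def ttls_by_flags(text, src_prefix):
    """Map TCP flags to TTL for packets whose source starts with src_prefix."""
    return {p["flags"]: p["ttl"] for p in parse_packets(text) if p["src"].startswith(src_prefix)}


if __name__ == "__main__":
    ttls = ttls_by_flags(CAPTURE, "104.24.125.13.")
    print(ttls)
    if "R" in ttls and "S.E" in ttls and ttls["R"] != ttls["S.E"]:
        print("RST and SYN-ACK arrived with different TTLs")
```
</pre>
A TTL mismatch on its own is only a hint, since return paths can differ for other reasons, but combined with the TOS change it fits the explanation of a second anycast endpoint answering.<br>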

> > The fix is to stop hashing on the ECN bits when doing ECMP. You could<br>

> > keep hashing on the diffserv part of the TOS field if you want, but I<br>

> > think it would also be fine to just exclude the TOS field entirely from<br>

> > the hash.<br>

><br>

<br>

<br>

-- <br>

I don't think the execution is relevant when it was obviously a bad<br>

idea in the first place.<br>

This is like putting rabid weasels in your pants, and later expressing<br>

regret at having chosen those particular rabid weasels and that pair<br>

of pants.<br>

   ---maf<br>

</blockquote></div></div></div>