TCP and anycast (was Re: ECN)

Thu Nov 14 06:39:24 UTC 2019

RFC 7094 (https://tools.ietf.org/html/rfc7094) describes the pitfalls &
risks of using TCP with an anycast address.  It recognizes that there are
valid use cases for it, though.

Specifically, section 3.1 says this:
>>>

   Most stateful transport protocols (e.g., TCP), without modification,
   do not understand the properties of anycast; hence, they will fail
   probabilistically, but possibly catastrophically, when using anycast
   addresses in the presence of "normal" routing dynamics.

...

   This can lead
   to a protocol working fine in, say, a test lab but not in the global
   Internet.

>>>
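
One concrete way such a failure arises is the one described in the quoted thread below: an ECMP hash that covers the ECN bits. Here is a minimal Python sketch of that mechanism. It is an illustration under stated assumptions, not the behaviour of any particular router: the site names, the use of SHA-256 as the hash, and the helper names `ecmp_path` and `flow_splits` are all made up for the example.

```python
import hashlib

# Hypothetical anycast sites reachable over two equal-cost paths.
PATHS = ["anycast-site-A", "anycast-site-B"]


def ecmp_path(src, dst, sport, dport, proto, tos, include_ecn):
    """Pick a path from a hash over the header fields (illustrative only)."""
    # 0xFC keeps the six DSCP bits and clears the two ECN bits.
    tos_key = tos if include_ecn else tos & 0xFC
    key = f"{src}|{dst}|{sport}|{dport}|{proto}|{tos_key}".encode()
    digest = hashlib.sha256(key).digest()
    return PATHS[digest[0] % len(PATHS)]


def flow_splits(sport, include_ecn):
    """True if the SYN (TOS 0x00) and an ECT(0) data packet (TOS 0x02)
    of the same flow are hashed onto different paths."""
    flow = ("10.42.3.130", "104.24.125.13", sport, 80, 6)
    syn_path = ecmp_path(*flow, tos=0x00, include_ecn=include_ecn)
    data_path = ecmp_path(*flow, tos=0x02, include_ecn=include_ecn)
    return syn_path != data_path


ports = range(10000, 11000)
split_with_ecn = sum(flow_splits(p, include_ecn=True) for p in ports)
split_without_ecn = sum(flow_splits(p, include_ecn=False) for p in ports)

print(f"ECN bits in hash:     {split_with_ecn} of {len(ports)} flows split")
print(f"ECN bits masked out:  {split_without_ecn} of {len(ports)} flows split")
```

With the ECN bits masked out, the SYN and the data packets of a flow always produce the same hash input, so no flow can split. With them included, whether a given flow splits depends on the hash, which matches the "probabilistic" wording in the RFC.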

On Wed, Nov 13, 2019 at 3:33 PM Warren Kumari <warren at kumari.net> wrote:

> On Thu, Nov 14, 2019 at 12:25 AM Matt Corallo <nanog at as397444.net> wrote:
> >
> > This sounds like a bug on Cloudflare’s end (cause trying to do anycast
> > TCP is... out of spec to say the least), not a bug in ECN/ECMP.
>
> Errrrrr. I really don't think that there is any sort of spec that
> covers that :-P
>
> Using Anycast for TCP is incredibly common - the DNS root servers for
> one obvious example.
> More TCP centric well-known examples are Fastly and LinkedIn -
> LinkedIn in particular did a really good podcast on their experience
> with this.
>
> There is also a good NANOG talk from the ~2000s (?) on people using
> TCP anycast for long lived (serving ISO files, which were long-lived
> in those days) flows, and how reliable it is - perhaps that's the talk
> Todd mentioned?
>
> W
>
> >
> > > On Nov 13, 2019, at 11:07, Toke Høiland-Jørgensen via NANOG
> > > <nanog at nanog.org> wrote:
> > >
> > > 
> > >>
> > >> Hello
> > >>
> > >> I have a customer that believes my network has a ECN problem. We do
> > >> not, we just move packets. But how do I prove it?
> > >>
> > >> Is there a tool that checks for ECN trouble? Ideally something I could
> > >> run on the NLNOG Ring network.
> > >>
> > >> I believe it likely that it is the destination that has the problem.
> > >
> > > Hi Baldur
> > >
> > > I believe I may be that customer :)
> > >
> > > First of all, thank you for looking into the issue! We've been having
> > > great fun over on the ecn-sane mailing list trying to figure out what's
> > > going on. I'll summarise below, but see this thread for the discussion
> > > and debugging details:
> > >
> > > https://lists.bufferbloat.net/pipermail/ecn-sane/2019-November/000527.html
> > >
> > > The short version is that the problem appears to come from a combination
> > > of the ECMP routing in your network, and Cloudflare's heavy use of
> > > anycast. Specifically, a router in your network appears to be doing ECMP
> > > by hashing on the packet header, *including the ECN bits*. This breaks
> > > TCP connections with ECN because the TCP SYN (with no ECN bits set) end
> > > up taking a different path than the rest of the flow (which is marked as
> > > ECT(0)). When the destination is anycasted, this means that the data
> > > packets go to a different server than the SYN did. This second server
> > > doesn't recognise the connection, and so replies with a TCP RST. To fix
> > > this, simply exclude the ECN bits (or the whole TOS byte) from your
> > > router's ECMP hash.
> > >
> > > For a longer exposition, see below. You should be able to verify this
> > > from somewhere else in the network, but if there's anything else you
> > > want me to test, do let me know. Also, would you mind sharing the router
> > > make and model that does this? We're trying to collect real-world
> > > examples of network problems caused by ECN and this is definitely an
> > > interesting example.
> > >
> > > -Toke
> > >
> > >
> > >
> > > The long version:
> > >
> > > From my end I can see that I have two paths to Cloudflare; which is
> > > taken appears to be based on a hash of the packet header, as can be seen
> > > by varying the source port:
> > >
> > > $ traceroute -q 1 --sport=10000 104.24.125.13
> > > traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
> > > 1  _gateway (10.42.3.1)  0.357 ms
> > > 2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  4.707 ms
> > > 3  customer-185-24-168-46.ip4.gigabit.dk (185.24.168.46)  1.283 ms
> > > 4  te0-1-1-5.rcr21.cph01.atlas.cogentco.com (149.6.137.49)  1.667 ms
> > > 5  netnod-ix-cph-blue-9000.cloudflare.com (212.237.192.246)  1.406 ms
> > > 6  104.24.125.13 (104.24.125.13)  1.322 ms
> > >
> > > $ traceroute -q 1 --sport=10001 104.24.125.13
> > > traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
> > > 1  _gateway (10.42.3.1)  0.293 ms
> > > 2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  3.430 ms
> > > 3  customer-185-24-168-38.ip4.gigabit.dk (185.24.168.38)  1.194 ms
> > > 4  10ge1-2.core1.cph1.he.net (216.66.83.101)  1.297 ms
> > > 5  be2306.ccr42.ham01.atlas.cogentco.com (130.117.3.237)  6.805 ms
> > > 6  149.6.142.130 (149.6.142.130)  6.925 ms
> > > 7  104.24.125.13 (104.24.125.13)  1.501 ms
> > >
> > >
> > > This is fine in itself. However, the problem stems from the fact that
> > > the ECN bits in the IP header are also included in the ECMP hash (-t
> > > sets the TOS byte; -t 1 ends up as ECT(0) on the wire and -t 2 is
> > > ECT(1)):
> > >
> > > $ traceroute -q 1 --sport=10000 104.24.125.13 -t 1
> > > traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
> > > 1  _gateway (10.42.3.1)  0.336 ms
> > > 2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  6.964 ms
> > > 3  customer-185-24-168-46.ip4.gigabit.dk (185.24.168.46)  1.056 ms
> > > 4  te0-1-1-5.rcr21.cph01.atlas.cogentco.com (149.6.137.49)  1.512 ms
> > > 5  netnod-ix-cph-blue-9000.cloudflare.com (212.237.192.246)  1.313 ms
> > > 6  104.24.125.13 (104.24.125.13)  1.210 ms
> > >
> > > $ traceroute -q 1 --sport=10000 104.24.125.13 -t 2
> > > traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
> > > 1  _gateway (10.42.3.1)  0.339 ms
> > > 2  albertslund-edge1-lo.net.gigabit.dk (185.24.171.254)  2.565 ms
> > > 3  customer-185-24-168-38.ip4.gigabit.dk (185.24.168.38)  1.301 ms
> > > 4  10ge1-2.core1.cph1.he.net (216.66.83.101)  1.339 ms
> > > 5  be2306.ccr42.ham01.atlas.cogentco.com (130.117.3.237)  6.570 ms
> > > 6  149.6.142.130 (149.6.142.130)  6.888 ms
> > > 7  104.24.125.13 (104.24.125.13)  1.785 ms
> > >
> > >
> > > So why is this a problem? The TCP SYN packet first needs to negotiate
> > > ECN, so it is sent without any ECN bits set in the header; after
> > > negotiation succeeds, the data packets will be marked as ECT(0). But
> > > because that becomes part of the ECMP hash, those packets will take
> > > another path. And since the destination is anycasted, that means they
> > > will also end up at a different endpoint. This second endpoint won't
> > > recognise the connection, and reply with a TCP RST. This is clearly
> > > visible in tcpdump; notice the different TOS values, and that the RST
> > > packet has a different TTL than the SYN-ACK:
> > >
> > > 12:21:47.816359 IP (tos 0x0, ttl 64, id 25687, offset 0, flags [DF], proto TCP (6), length 60)
> > >    10.42.3.130.34420 > 104.24.125.13.80: Flags [SEW], cksum 0xf2ff (incorrect -> 0x0853), seq 3345293502, win 64240, options [mss 1460,sackOK,TS val 4248691972 ecr 0,nop,wscale 7], length 0
> > > 12:21:47.823395 IP (tos 0x0, ttl 58, id 0, offset 0, flags [DF], proto TCP (6), length 52)
> > >    104.24.125.13.80 > 10.42.3.130.34420: Flags [S.E], cksum 0x9f4a (correct), seq 1936951409, ack 3345293503, win 29200, options [mss 1400,nop,nop,sackOK,nop,wscale 10], length 0
> > > 12:21:47.823479 IP (tos 0x0, ttl 64, id 25688, offset 0, flags [DF], proto TCP (6), length 40)
> > >    10.42.3.130.34420 > 104.24.125.13.80: Flags [.], cksum 0xf2eb (incorrect -> 0x503e), seq 1, ack 1, win 502, length 0
> > > 12:21:47.823665 IP (tos 0x2,ECT(0), ttl 64, id 25689, offset 0, flags [DF], proto TCP (6), length 117)
> > >    10.42.3.130.34420 > 104.24.125.13.80: Flags [P.], cksum 0xf338 (incorrect -> 0xc1d4), seq 1:78, ack 1, win 502, length 77: HTTP, length: 77
> > >    GET / HTTP/1.1
> > >    Host: 104.24.125.13
> > >    User-Agent: curl/7.66.0
> > >    Accept: */*
> > >
> > > 12:21:47.825485 IP (tos 0x2,ECT(0), ttl 60, id 0, offset 0, flags [DF], proto TCP (6), length 40)
> > >    104.24.125.13.80 > 10.42.3.130.34420: Flags [R], cksum 0x3a65 (correct), seq 1936951410, win 0, length 0
> > >
> > >
> > > The fix is to stop hashing on the ECN bits when doing ECMP. You could
> > > keep hashing on the diffserv part of the TOS field if you want, but I
> > > think it would also be fine to just exclude the TOS field entirely from
> > > the hash.
> >
>
>
> --
> I don't think the execution is relevant when it was obviously a bad
> idea in the first place.
> This is like putting rabid weasels in your pants, and later expressing
> regret at having chosen those particular rabid weasels and that pair
> of pants.
>    ---maf
>
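
As a rough aid for reading the tcpdump capture quoted above, the following Python sketch splits each TOS byte into its DSCP and ECN parts (ECN codepoints as defined in RFC 3168) and compares the TTLs of the packets coming back from 104.24.125.13. The packet list is transcribed by hand from that capture; the helper names `split_tos` and `inbound_ttls` are made up for the example. Differing TTLs on inbound packets are only a hint that the replies came from different hosts or over different paths, not proof.

```python
# ECN codepoints per RFC 3168: the low two bits of the TOS byte.
ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def split_tos(tos):
    """Split a TOS byte into (DSCP value, ECN codepoint name)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("TOS must fit in one byte")
    return tos >> 2, ECN_NAMES[tos & 0b11]


# (direction, TCP flags, TOS, TTL), transcribed from the quoted capture.
# "out" is 10.42.3.130 -> 104.24.125.13, "in" is the reverse direction.
PACKETS = [
    ("out", "SEW", 0x00, 64),  # SYN, ECN not yet negotiated
    ("in", "S.E", 0x00, 58),   # SYN-ACK
    ("out", ".", 0x00, 64),    # ACK
    ("out", "P.", 0x02, 64),   # HTTP request, now marked ECT(0)
    ("in", "R", 0x02, 60),     # RST
]


def inbound_ttls(packets):
    """Return the distinct TTLs seen on inbound packets, sorted."""
    return sorted({ttl for direction, _, _, ttl in packets if direction == "in"})


for direction, flags, tos, ttl in PACKETS:
    dscp, ecn = split_tos(tos)
    print(f"{direction:>3} [{flags:<3}] tos={tos:#04x} dscp={dscp} ecn={ecn} ttl={ttl}")

ttls = inbound_ttls(PACKETS)
if len(ttls) > 1:
    print(f"inbound TTLs differ ({ttls}): replies may come from different hosts or paths")
```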