Service provider story about tracking down TCP RSTs

Garrett Skjelstad garrett at skjelstad.org
Sat Sep 1 20:03:38 UTC 2018


I would love this as a blog post to link folks that are not nanog members.

-Garrett

On Sat, Sep 1, 2018, 11:52 <frnkblk at iname.com> wrote:

> I want to share a little bit of our journey in tracking down the TCP RSTs
> that impacted some of our customers for almost ten weeks.
>
>
>
> Almost immediately after we turned up two new Arista border routers in
> late July we started receiving a trickle of complaints from customers
> regarding their inability to access certain websites (mostly B2B). All the
> packet captures showed the standard TCP SYN/SYN-ACK pair, then a TCP RST
> from the website after the client sent a TLS/SSL Client Hello. As the
> reports continued to come in, we built a Google Doc to keep track and it
> became clear that most of the sites were hosted by Incapsula/Imperva, but
> there were also a few by Sucuri and Fastly. Knowing that Incapsula provides
> DoS protection, we attempted to work with them (providing websites,
> source/destination IPs, traceroutes, and packet captures) to find out why
> their hosts were issuing our customers a TCP RST, but we made little
> progress. We moved some of the affected customers to different IP addresses
> but that didn’t resolve the issue. We also asked our customer to work with
> the website to see if they would be willing to open a ticket with
> Incapsula. In the meantime, customers were getting frustrated! They
> couldn’t visit Incapsula-hosted healthcare websites, financial firms,
> product dealers, etc. Over the weeks, a few of those customers
> purchased/borrowed different routers and some of those didn’t have website
> issues anymore. And more than a few of them discovered that the websites
> worked fine from home or their mobile phone/hotspot, but not from their
> Internet connection with us. You can guess where they were applying
> pressure! That said, we didn’t know why a small handful of companies, known
> for DoS protection, were issuing TCP RSTs to just some of our customers.
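>
> As an aside, this RST-after-Client-Hello signature is easy to pick out of
> a capture programmatically. Here is a minimal sketch using Python and
> scapy (the capture filename is a placeholder, and matching the TLS
> handshake by its first two record bytes is a simplification):
>
>     # flag flows where a RST arrives for a flow that just sent a
>     # TLS Client Hello (TCP payload starting with a 0x16 0x03 record)
>     from scapy.all import rdpcap, IP, TCP
>
>     hello_seen = set()
>     for p in rdpcap("capture.pcapng"):  # placeholder filename
>         if IP not in p or TCP not in p:
>             continue
>         flow = (p[IP].src, p[TCP].sport, p[IP].dst, p[TCP].dport)
>         if bytes(p[TCP].payload)[:2] == b"\x16\x03":
>             hello_seen.add(flow)
>         reverse = (p[IP].dst, p[TCP].dport, p[IP].src, p[TCP].sport)
>         if p[TCP].flags & 0x04 and reverse in hello_seen:  # RST bit set
>             print("RST after Client Hello:", reverse)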
>
>
>
> Earlier this week we received four or five more websites from yet another
> affected customer, but most of those were hosted by Fastly. By this time, we had
> been able to replicate the issue in our lab. Feeling desperate to make some
> tangible progress on this issue, I reached out to the Fastly NOC. In less
> than 12 hours they provided some helpful feedback, pointing out that a
> single traceroute to a Fastly site was hitting two of their POPs (they use
> anycast) and because they don’t sync state between POPs the second POP
> would naturally issue a TCP RST (sidebar: fascinating blog article on
> Fastly’s infrastructure here:
> https://www.fastly.com/blog/building-and-scaling-fastly-network-part-2-balancing-requests).
> In subsequent email exchanges, the Fastly NOC suggested that it appeared
> that we were “spraying flows” (that is, packets related to a single client
> session were egressing our network via different paths). Because Fastly is
> also present with us at an IX (though they weren’t advertising their
> anycast IPs at the time), they suggested that we look at how our traffic
> egresses our network (IX versus transit) and our routers’ outbound
> load-balancing/hashing schemes.
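>
> To make “spraying flows” concrete: ECMP picks an egress path by hashing
> packet header fields, and if one of those fields differs between packets
> of the same TCP session, the session is split across paths. A toy model
> in Python (my own simplification, not Arista’s actual hash function or
> field set; addresses are from documentation ranges):
>
>     # toy ECMP path selection: hash header fields to pick an uplink
>     import zlib
>
>     UPLINKS = ["transit-POP-A", "transit-POP-B"]  # hypothetical paths
>
>     def pick_uplink(src, dst, proto, ttl=None):
>         key = f"{src}|{dst}|{proto}"
>         if ttl is not None:
>             key += f"|{ttl}"
>         return UPLINKS[zlib.crc32(key.encode()) % len(UPLINKS)]
>
>     # with TTL in the key, the SYN (TTL 128) and the Client Hello
>     # (TTL 127) may hash to different uplinks, and hence to different
>     # anycast POPs; without it, the whole session takes one path
>     print(pick_uplink("203.0.113.10", "198.51.100.7", 6, ttl=128))
>     print(pick_uplink("203.0.113.10", "198.51.100.7", 6, ttl=127))
>     print(pick_uplink("203.0.113.10", "198.51.100.7", 6))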
>
>
>
> The IX turned out to be a red herring, so I turned my attention to our
> transit. Each of our border routers has two BGP sessions over two circuits
> to transit provider POP A and two BGP sessions over two circuits to transit
> provider POP B, for a total of four BGP sessions per border router and
> eight altogether. Starting with our core router, I
> confirmed that its ECMP hashing was consistent such that Fastly-bound
> traffic always went to border router 1 or border router 2. Then I looked at
> the ECMP hashing scheme on our border routers and noticed something unique
> – by default Arista also includes the TTL in the hash:
>
>
>
> IPv4 hash fields:
>
>    Source IPv4 Address is ON
>
>    Protocol is ON
>
>    Time-To-Live is ON
>
>    Destination IPv4 Address is ON
>
>
>
> Since the source and destination IPs and protocol weren’t changing,
> perhaps the TTL was not consistent? I opened the first packet trace in
> Wireshark and jackpot – the TTL value was 128 on the SYN but 127 on the
> TLS/SSL Client Hello. I adjusted the Arista’s load-balancing profile not to
> use TTL and immediately my MTR in the background changed and all the sites
> on the lab machine that couldn’t load before … were now loading.
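>
> For anyone who wants to check their own captures for this, the per-packet
> TTLs are trivial to dump with scapy (a sketch; the filename is a
> placeholder):
>
>     # print source, destination, and TTL for every TCP packet
>     from scapy.all import rdpcap, IP, TCP
>
>     for p in rdpcap("capture.pcapng"):  # placeholder filename
>         if IP in p and TCP in p:
>             print(f"{p[IP].src}:{p[TCP].sport} -> "
>                   f"{p[IP].dst}:{p[TCP].dport}  ttl={p[IP].ttl}")
>     # a well-behaved client shows one consistent TTL per direction; in
>     # our traces the SYN arrived with TTL 128, the Client Hello with 127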
>
>
>
> Fastly also pointed me to another article written by Joel Jaeggli (
> https://blog.apnic.net/2018/01/11/ipv6-flow-label-misuse-hashing/) that
> discusses IPv6 flow labels – we removed the flow label from the border
> routers’ IPv6 hash fields, too.
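>
> The IPv6 analogue of the TTL check is to confirm that hosts keep the flow
> label stable for the lifetime of a flow. A short sketch (scapy again,
> placeholder filename):
>
>     # flag flows whose IPv6 flow label changes mid-session
>     from collections import defaultdict
>     from scapy.all import rdpcap, IPv6, TCP
>
>     labels = defaultdict(set)
>     for p in rdpcap("capture-v6.pcapng"):  # placeholder filename
>         if IPv6 in p and TCP in p:
>             flow = (p[IPv6].src, p[TCP].sport, p[IPv6].dst, p[TCP].dport)
>             labels[flow].add(p[IPv6].fl)  # .fl is the 20-bit flow label
>     for flow, seen in labels.items():
>         if len(seen) > 1:
>             print("inconsistent flow labels:", flow, sorted(seen))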
>
>
>
> I reviewed the packet traces today and noticed that TTL values remained
> consistent at 128 *behind* the router CPE. In packet captures on the
> WAN interface of the router CPE I saw that the SYN remained at 128, but
> the TLS/SSL Client Hello was properly decremented to 127. So it appears
> that some router CPE (of a variety of makes and models) are doing
> something special to certain packets and not decrementing the TTL.
>
> This explains why:
>
>    - our customers had issues with all their devices behind their router
>    CPE
>    - the issue remained regardless of what public IP address their router
>    CPE obtained via DHCP or was assigned
>    - some customers who changed their router CPE didn’t have the issue
>    anymore – they got lucky with a router that doesn’t adjust/reset the TTL
>    - customers who used our managed Wi-Fi router did not see the issue –
>    that model apparently doesn’t manipulate the TTL, at least not
>    inconsistently.
>
>
>
> Lesson learned: review a device’s default hashing mechanism before putting
> it into production.
>
>
>
> For those interested, I have links to the packet traces below my
> signature, showing the inconsistent TTL values.
>
>
>
> Thanks again to the fantastic group of folks at the Fastly NOC who so ably
> pointed us in the right direction!
>
>
>
> Frank
>
>
>
> https://www.premieronline.net/~fbulk/example1.pcapng
>
> https://www.premieronline.net/~fbulk/example2.pcapng
>
> https://www.premieronline.net/~fbulk/example3.pcapng
>