Centurylink having a bad morning?

Tom Beecher beecher at beecher.cc
Mon Aug 31 21:24:37 UTC 2020


In this specific event, 3356 not withdrawing routes is certainly a
head-scratcher, and I'm sure that for many of us it's the thing we most
want a definitive answer on.

However, if a network has only 3356 as its upstream, it is 100% at the
mercy of 3356 at all times. Having a redundant AND diverse connection to a
second upstream ASN at least gives you some options. In this case, for
example, say you always ran a +2 prepend toward both 3356 and Acme. When
the 3356 event happens, you shut down your session to them. Some percentage
of your traffic that would have been faceplanting in/through 3356 now works
via Acme. Then you notice the non-withdrawal issue. You can then remove one
prepend toward Acme, or perhaps deagg strategically to try to get more
traffic away from the trouble.
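To make the prepend trick concrete, here is a minimal Python sketch of how a remote network that hears the prefix from both upstreams might pick a path. It assumes a heavily simplified decision process (highest local-pref, then shortest AS path; real BGP has more tie-breakers), and ASNs 64500 (the customer) and 64501 ("Acme") are hypothetical documentation ASNs. The stale 3356 route keeps the +2 prepend it was last announced with, so dropping one prepend toward Acme makes the Acme path shorter wherever AS-path length is the deciding step.

```python
from dataclasses import dataclass
from typing import List

MY_ASN = 64500           # hypothetical: the multihomed customer's ASN
ACME_ASN = 64501         # hypothetical: stand-in ASN for "Acme"
CENTURYLINK_ASN = 3356

@dataclass
class Route:
    via: str
    as_path: List[int]
    local_pref: int = 100

def as_path_from(upstream_asn: int, prepends: int) -> List[int]:
    """AS path as seen one hop beyond the upstream: the upstream's ASN,
    then the origin ASN once plus `prepends` extra copies."""
    return [upstream_asn] + [MY_ASN] * (1 + prepends)

def best_route(routes: List[Route]) -> Route:
    """Simplified selection: highest local-pref, then shortest AS path.
    Real BGP continues with origin, MED, eBGP vs iBGP, IGP cost, etc.;
    here an exact tie just returns the first route in the list."""
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path)))

# Steady state: +2 prepend toward both upstreams, so path lengths tie.
stale_3356 = Route("AS3356 (stale, never withdrawn)",
                   as_path_from(CENTURYLINK_ASN, 2))
acme_before = Route("Acme", as_path_from(ACME_ASN, 2))
print(len(stale_3356.as_path), len(acme_before.as_path))  # 4 4

# After noticing the non-withdrawal: drop one prepend toward Acme only.
acme_after = Route("Acme", as_path_from(ACME_ASN, 1))
print(best_route([stale_3356, acme_after]).via)  # Acme

# Local-pref still beats path length: a network that prefers routes
# learned from 3356 (e.g. a 3356 customer) keeps using the stale path.
stale_preferred = Route(stale_3356.via, stale_3356.as_path, local_pref=200)
print(best_route([stale_preferred, acme_after]).via)  # AS3356 (stale, never withdrawn)
```

The last case is why prepending only moves "some percentage" of traffic: any network whose policy ranks the 3356-learned route higher never gets as far as comparing path lengths.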

A redundant path to a different upstream at least gives you some potential
ways to work around a problem like this that you otherwise would not have.
It wouldn't be perfect, but options > no options.
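The "deagg strategically" option works because forwarding uses longest-prefix match: a more-specific announced only via Acme beats the stale aggregate 3356 keeps announcing, no matter how the paths compare on local-pref or AS-path length. A minimal sketch with Python's standard `ipaddress` module; the aggregate 10.10.0.0/22 is a hypothetical stand-in (private space, for illustration only), and the "RIB" is a toy dict mapping each prefix to the single path a remote router selected. Many networks filter IPv4 announcements longer than /24, so a /24 cannot usefully be deaggregated this way.

```python
import ipaddress
from typing import Dict, List, Optional

def deaggregate(prefix: str, new_len: int) -> List[str]:
    """Split an aggregate into equal more-specifics of length new_len."""
    net = ipaddress.ip_network(prefix)
    return [str(sub) for sub in net.subnets(new_prefix=new_len)]

def lookup(rib: Dict[str, str], addr: str) -> Optional[str]:
    """Longest-prefix match over a toy RIB mapping prefix -> selected path."""
    ip = ipaddress.ip_address(addr)
    best_len, best_via = -1, None
    for prefix, via in rib.items():
        net = ipaddress.ip_network(prefix)
        if ip in net and net.prefixlen > best_len:
            best_len, best_via = net.prefixlen, via
    return best_via

AGGREGATE = "10.10.0.0/22"  # hypothetical customer aggregate

# A remote router that selected the stale 3356 path for the aggregate.
rib = {AGGREGATE: "AS3356 (stale)"}
print(lookup(rib, "10.10.1.5"))  # AS3356 (stale)

# Announce the two /23s via Acme only; the stale /22 stays in the table.
for more_specific in deaggregate(AGGREGATE, 23):
    rib[more_specific] = "Acme"

print(deaggregate(AGGREGATE, 23))  # ['10.10.0.0/23', '10.10.2.0/23']
print(lookup(rib, "10.10.1.5"))    # Acme
```

The stale /22 is never removed here, mirroring the situation in the thread: you cannot make 3356 withdraw it, you can only out-specific it.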

On Mon, Aug 31, 2020 at 5:08 PM Warren Kumari <warren at kumari.net> wrote:

> On Mon, Aug 31, 2020 at 4:36 PM Tom Beecher <beecher at beecher.cc> wrote:
> >
> > Hopefully those customers learned the difference between redundancy and
> > diversity this weekend. :)
>
> I'm unclear how either solves things for many customers...
>
> If they had CenturyLink and AcmeNetworkWidgets, and announced the same
> network through both -- and their connection to CL went down, *but CL
> continues to announce / doesn't withdraw* they are still stuck, yes?
> (Unless they can deaggregate that is...)
> What am I missing?
>
> W
>
>
> >
> > On Mon, Aug 31, 2020 at 3:48 PM Eric Kuhnke <eric.kuhnke at gmail.com> wrote:
> >>
> >> There are a number of enterprise end-user customers of 3356 that
> >> have on-premises server rooms/hosting for their stuff. They spend a lot
> >> of money every month on a 'redundant' metro Ethernet circuit that takes
> >> diverse fiber paths from their business-park office building to the local
> >> CenturyLink/Level3 POP. But all that last-mile redundancy and failover
> >> ability doesn't do much for them when 3356 breaks its network at the BGP
> >> level.
> >>
> >> On Mon, Aug 31, 2020 at 9:36 AM Drew Weaver <drew.weaver at thenap.com> wrote:
> >>>
> >>> I also found the part where they mention that a lot of hosting
> >>> companies have only one uplink quizzical. I also found it a bit puzzling
> >>> that he comes pretty close to implying that it's CenturyLink's customers'
> >>> fault for not having multiple paths to Cloudflare that don't touch
> >>> CenturyLink. It could have just been poorly written.
> >>>
> >>> From: NANOG <nanog-bounces+drew.weaver=thenap.com at nanog.org> On Behalf Of Tom Beecher
> >>> Sent: Monday, August 31, 2020 9:26 AM
> >>> To: Hank Nussbacher <hank at interall.co.il>
> >>> Cc: NANOG <nanog at nanog.org>
> >>> Subject: Re: Centurylink having a bad morning?
> >>>
> >>> https://blog.cloudflare.com/analysis-of-todays-centurylink-level-3-outage/
> >>>
> >>> I definitely found Mr. Prince's writing about yesterday's events
> >>> fascinating.
> >>>
> >>> Verizon makes a mistake with BGP filters that allows a secondary
> >>> mistake (leaked "optimizer" routes) to propagate, and Mr. Prince takes
> >>> every opportunity to lob large chunks of granite at them over how
> >>> terrible they are.
> >>>
> >>> L3 allows an erroneous Flowspec announcement to cause massive global
> >>> connectivity issues, and Mr. Prince shrugs and says "Incidents happen."
> >>>
> >>> On Mon, Aug 31, 2020 at 1:15 AM Hank Nussbacher <hank at interall.co.il> wrote:
> >>>
> >>> On 30/08/2020 20:08, Baldur Norddahl wrote:
> >>>
> >>> https://blog.cloudflare.com/analysis-of-todays-centurylink-level-3-outage/
> >>>
> >>> Sounds like Flowspec possibly blocking tcp/179 might be the cause.
> >>>
> >>> But that is Cloudflare speculation.
> >>>
> >>> Regards,
> >>> Hank
> >>>
> >>> Caveat: The views expressed above are solely my own and do not express
> >>> the views or opinions of my employer
> >>>
> >>> An outage is what it is. I am not worried about outages. We have
> >>> multiple transits to deal with that.
> >>>
> >>> It is the continued announcing of prefixes after they have been withdrawn
> >>> by peers and customers that is the huge problem here. That is killing all
> >>> the effort and money I put into having redundancy. It is sabotage of my
> >>> network after I cut the ties. I do not want to be a customer of a provider
> >>> whose system will do that. Luckily, we do not currently have a contract,
> >>> and now they will have to convince me it is safe for me to make a contract
> >>> with them. If that is impossible, I guess I won't be getting a contract
> >>> with them.
> >>>
> >>> But I disagree that it would be impossible. They need to write a good
> >>> report explaining exactly what went wrong and how they changed the
> >>> design so that something like this cannot happen again. The basic design
> >>> of BGP is such that this should not happen easily, if at all. They did
> >>> something unwise. Did they build a route reflector based on a database or
> >>> something?
> >>>
> >>> Regards,
> >>>
> >>> Baldur
> >>>
> >>> On Sun, Aug 30, 2020 at 5:13 PM Mike Bolitho <mikebolitho at gmail.com> wrote:
> >>>
> >>> Exactly. And asking that they somehow prove this won't happen again is
> >>> impossible.
> >>>
> >>> - Mike Bolitho
> >>>
> >>> On Sun, Aug 30, 2020, 8:10 AM Drew Weaver <drew.weaver at thenap.com> wrote:
> >>>
> >>> I’m not defending them, but I am sure it isn’t intentional.
> >>>
> >>> From: NANOG <nanog-bounces+drew.weaver=thenap.com at nanog.org> On Behalf Of Baldur Norddahl
> >>> Sent: Sunday, August 30, 2020 9:28 AM
> >>> To: nanog at nanog.org
> >>> Subject: Re: Centurylink having a bad morning?
> >>>
> >>> How is that acceptable behaviour? I shall remember never to make a
> >>> contract with these guys until they can prove that they won't advertise my
> >>> prefixes after I pull them. Under any circumstances.
> >>>
> >>> On Sun, Aug 30, 2020, 15:14 Joseph Jenkins <joe at breathe-underwater.com> wrote:
> >>>
> >>> Finally got through on their support line and spoke to level 1. The
> >>> only thing the tech could say was that it was an issue with BGP route
> >>> reflectors and that it started at about 3am (Pacific). They were still
> >>> trying to isolate the issue. I've tried failing over my circuits, and no
> >>> go: the traffic just dies, as L3 won't stop advertising my routes.
> >>>
> >>> On Sun, Aug 30, 2020 at 5:21 AM Drew Weaver via NANOG <nanog at nanog.org> wrote:
> >>>
> >>> Hello,
> >>>
> >>> Woke up this morning to a bunch of reports of connectivity issues; I had
> >>> to shut down some Level3/CTL connections to get things back to normal.
> >>>
> >>> As of right now their support portal won’t load:
> >>> https://www.centurylink.com/business/login/
> >>>
> >>> Just wondering what others are seeing.
> >>>
>
> --
> I don't think the execution is relevant when it was obviously a bad
> idea in the first place.
> This is like putting rabid weasels in your pants, and later expressing
> regret at having chosen those particular rabid weasels and that pair
> of pants.
>    ---maf
>

