CenturyLink RCA?

Tom Beecher beecher at beecher.cc
Wed Jan 2 18:01:50 UTC 2019


My best parsing of that ticket, with some guesses:

- Infinera management card goes Really Bad, knocks out local waves, and
starts spewing garbage out onto the management network
- Management network propagates the garbage; other Infinera management
cards get it and fall into the same state, knocking down local waves and
re-spewing garbage.
- Backup tunnels, in place to ensure management network connectivity works
all the time, help propagate the garbage.
- They start getting into some devices via OOB, probably rebooting. Devices
come up ok, then this garbage traffic knocks them over again.
- They start pulling down the backup tunnels to stop the virus from
spreading, bouncing stuff again, putting filters on each device to drop the
garbage traffic.
- This starts to work, but then they hit other problems with linecards from
devices that were bounced.
- They also start hitting sites that they don't have functional OOB for,
and have to get someone driving out to manually get access into.
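
To make the cascade above a bit more concrete, here is a toy sketch in
Python. It is only an illustration of my reading of the ticket, not anything
CenturyLink published: the site names, the topology, and the spread()
function are all made up. Each node that receives the garbage falls over and
re-floods it to its management neighbors; a "filtered" node drops the
garbage and neither fails nor forwards it.

```python
from collections import deque


def spread(links, start, filtered=frozenset()):
    """Return the set of nodes knocked over by garbage flooding from `start`.

    links: iterable of (a, b) undirected management-network adjacencies.
    filtered: nodes that drop the garbage (they neither fail nor re-flood).
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    affected = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in neighbors.get(node, ()):
            if peer not in affected and peer not in filtered:
                affected.add(peer)
                queue.append(peer)
    return affected


# Hypothetical topology: a chain of four sites plus one backup tunnel.
primary = [("A", "B"), ("B", "C"), ("C", "D")]
backup = [("A", "D")]
all_links = primary + backup

# No mitigation: every site is hit.
print(sorted(spread(all_links, "A")))
# Filter on B only: the backup tunnel still carries the garbage around it.
print(sorted(spread(all_links, "A", filtered={"B"})))
# Backup tunnel pulled down and filter on B: the garbage stays at A.
print(sorted(spread(primary, "A", filtered={"B"})))
```

In this toy model, filtering one device is not enough while the backup
tunnel is up, which would match why they had to pull the tunnels and put
filters on each device.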

On Sun, Dec 30, 2018 at 8:45 AM Saku Ytti <saku at ytti.fi> wrote:

> Apologies for the URL, I do not know official source and I do not
> share the URL's sentiment.
> https://fuckingcenturylink.com/
>
> Can someone translate this to IP engineer? What actually happened?
> From my own history, I rarely recognise the problem I fixed from
> reading the public RCA. I hope CenturyLink will do better.
>
> Best guess so far that I've heard is
>
> a) CenturyLink runs global L2 DCN/OOB
> b) there was HW fault which caused L2 loop (perhaps HW dropped BPDU,
> I've had this failure mode)
> c) DCN had direct access to control-plane, and L2 congested
> control-plane resources causing it to deprovision waves
>
> Now of course this is entirely speculation, but intended to show what
> type of explanation is acceptable and can be used to fix things.
> Hopefully CenturyLink does come out with IP-engineering readable
> explanation, so that we may use it as leverage to support work in our
> own domains to remove such risks.
>
> a) do not run L2 DCN/OOB
> b) do not connect MGMT ETH (it is unprotected access to control-plane,
> it cannot be protected by CoPP/lo0 filter/LPTS etc.)
> c) do add in your RFP scoring item for proper OOB port (Like Cisco CMP)
> d) do fail optical network up
>
> --
>   ++ytti
>