Centurylink having a bad morning?

Saku Ytti saku at ytti.fi
Mon Aug 31 04:55:08 UTC 2020


On Sun, 30 Aug 2020 at 20:00, Baldur Norddahl <baldur.norddahl at gmail.com> wrote:

> Not really the point. BGP is designed such that if I take down the link, the prefixes MUST be withdrawn within reasonable time. The self healing aspect of the internet entirely depends on this. Clearly they have some kind of system that does not respect that by design. I am guessing they have something homebrewed going on with their route reflectors?

Add scale, and BGP implementations can take a lot of time to converge,
hours even. The best thing you can do is add contractual obligations,
so the people at your provider who agree with you have some ammo.
Instant is not on the table, I'm sure that is obvious; beyond that,
it's less than obvious what is good enough.

> It is like a plane. It is impossible to prove or even design a plane that can never fall out of the sky. But now we had a plane that crashed in a very bad way, so that plane (Centurylink) is grounded until they can prove that something like this can not happen again. Which means they need to redesign whatever the hell they have going on here.

Nothing ever works like this. It's naive to think any RCA leads to
something being fixed so that it can never happen again. The only thing
that can be affected is the frequency of an event; removing it is not on
the cards. And affecting frequency is usually more about belief than
something provable. In addition to MTBF, questions should be raised
about MTTR. Provable MTTR efforts are far more likely to exist than
provable MTBF efforts, but if we buy into the notion that it will never
happen again, because we are good, then no MTTR focus is needed: why fix
something that will never happen?
What if this outage had taken 5 minutes to solve?
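To put rough numbers on the MTBF vs MTTR point, here is a minimal Python
sketch using the common steady-state approximation
availability = MTBF / (MTBF + MTTR). The incident rate and repair times
are hypothetical, picked only for illustration (not CenturyLink data), to
compare halving how often an event happens against fixing it in 5 minutes.

```python
# Hypothetical illustration of MTBF vs MTTR; all numbers are invented.
HOURS_PER_YEAR = 365 * 24


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state fraction of time up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected yearly downtime implied by the availability figure."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR * 60


# (MTBF hours, MTTR hours) for three hypothetical scenarios.
BASELINE = (8760.0, 5.0)          # about one incident a year, 5 hours to fix
FEWER_INCIDENTS = (17520.0, 5.0)  # MTBF doubled, repair time unchanged
FASTER_FIX = (8760.0, 5.0 / 60)   # same frequency, fixed in 5 minutes

if __name__ == "__main__":
    for name, (mtbf, mttr) in [("baseline", BASELINE),
                               ("fewer incidents", FEWER_INCIDENTS),
                               ("5 minute fix", FASTER_FIX)]:
        print(f"{name}: {downtime_minutes_per_year(mtbf, mttr):.1f} min/year")
```

Under these made-up numbers, doubling MTBF roughly halves yearly downtime,
while a 5 minute MTTR cuts it by about two orders of magnitude, and unlike
MTBF the repair time is something you can actually drill and measure.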

-- 
  ++ytti


