[outages] Major Level3 (CenturyLink) Issues

Wed Sep 2 19:24:21 UTC 2020

On Wed, Sep 2, 2020 at 3:04 PM Vincent Bernat <bernat at luffy.cx> wrote:
>
>  ❦  2 September 2020 16:35 +03, Saku Ytti:
>
> >> I am not buying it. No normal implementation of BGP stays online,
> >> replying to heartbeats and accepting updates from eBGP peers, yet
> >> after 5 hours has failed to process withdrawals from customers.
> >
> > I can imagine writing a BGP implementation like this:
> >
> >  a) own queue for keepalives, which I always serve first, fully
> >  b) own queue for updates, which I serve second
> >  c) own queue for withdraws, which I serve last
>
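
To make the a/b/c ordering above concrete, here is a minimal sketch of
strict-priority queue service. It is only an illustration of the
hypothesis, not any vendor's actual BGP code; the function names, the
per-tick budget and the arrival rates are all made-up assumptions.

```python
from collections import deque

# Hypothetical message classes, served in strict priority order.
PRIORITY = ("keepalive", "update", "withdraw")

def serve(queues, budget):
    """Process up to `budget` messages, always taking from the
    highest-priority non-empty queue."""
    served = {kind: 0 for kind in PRIORITY}
    for _ in range(budget):
        for kind in PRIORITY:
            if queues[kind]:
                queues[kind].popleft()
                served[kind] += 1
                break
    return served

def simulate(ticks, budget, arrivals):
    """`arrivals` gives the number of messages of each kind per tick
    (invented numbers, purely for illustration)."""
    queues = {kind: deque() for kind in PRIORITY}
    total = {kind: 0 for kind in PRIORITY}
    for tick in range(ticks):
        for kind in PRIORITY:
            queues[kind].extend([tick] * arrivals[kind])
        for kind, n in serve(queues, budget).items():
            total[kind] += n
    backlog = {kind: len(q) for kind, q in queues.items()}
    return total, backlog

# Updates alone (12/tick) exceed the budget (10/tick), so the withdraw
# queue is never reached, while every keepalive is still answered.
total, backlog = simulate(
    ticks=100, budget=10,
    arrivals={"keepalive": 1, "update": 12, "withdraw": 3},
)
print(total)    # {'keepalive': 100, 'update': 900, 'withdraw': 0}
print(backlog)  # {'keepalive': 0, 'update': 300, 'withdraw': 300}
```

Under these assumed rates the session looks healthy from the outside
(keepalives answered, updates accepted) while withdraws starve
indefinitely.
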
> Or maybe, graceful restart configured without a timeout on IPv4/IPv6?
> The flowspec rule severed the BGP session abruptly, stale routes are
> kept due to graceful restart (except flowspec rules), BGP sessions are
> reestablished, but the flowspec rule is handled before reaching
> EoR and we loop from there.
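
The graceful restart loop described above can be sketched as a toy
model. Everything here is an assumption for illustration: the `run`
function, the message names, the documentation prefix, and counting
the stale timeout in session flaps rather than seconds. It is not how
any real implementation is structured.

```python
def run(cycles, stale_timeout=None,
        replay=("FLOWSPEC_RULE", "END_OF_RIB")):
    """Return the prefixes left in the RIB after `cycles` session flaps.

    stale_timeout: number of flaps a stale route may survive
                   (None = stale routes are never timed out).
    replay: order in which the peer's messages are handled after the
            session comes back. The customer prefix was withdrawn
            upstream, so it is never re-advertised.
    """
    rib = {"192.0.2.0/24": {"stale": False, "flaps": 0}}
    for _ in range(cycles):
        # Session drops abruptly: graceful restart keeps routes, marked stale.
        for route in rib.values():
            route["stale"] = True
            route["flaps"] += 1
        if stale_timeout is not None:
            rib = {p: r for p, r in rib.items()
                   if r["flaps"] <= stale_timeout}
        # Session re-establishes and the peer replays its state.
        for msg in replay:
            if msg == "FLOWSPEC_RULE":
                break  # the rule severs the session again
            if msg == "END_OF_RIB":
                # Only End-of-RIB flushes routes that are still stale.
                rib = {p: r for p, r in rib.items() if not r["stale"]}
    return sorted(rib)

print(run(cycles=5))                    # ['192.0.2.0/24']
print(run(cycles=5, stale_timeout=3))   # []
print(run(cycles=5, replay=("END_OF_RIB", "FLOWSPEC_RULE")))  # []
```

In this model the withdrawn prefix survives indefinitely only when
there is no stale timeout and the flowspec rule is always handled
before End-of-RIB.
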

... or all routes are fed into some magic route optimization box which
is designed to keep things more stable and take advantage of cisco's
"step-10" to suck more traffic, or....

The root issue here is that the *public* RFO is incomplete / unclear.
Something something flowspec something, blocked flowspec, no more
something does indeed explain that something bad happened, but not
what caused the lack of withdraws / cascading churn.
As with many interesting outages, I suspect that we will never get the
full story, and "Something bad happened, we fixed it and now it's all
better and will never happen ever again, trust us..." seems to be the
new normal for public postmortems...

W

> --
> Make sure your code "does nothing" gracefully.
>             - The Elements of Programming Style (Kernighan & Plauger)

-- 
I don't think the execution is relevant when it was obviously a bad
idea in the first place.
This is like putting rabid weasels in your pants, and later expressing
regret at having chosen those particular rabid weasels and that pair
of pants.
   ---maf