[outages] Major Level3 (CenturyLink) Issues
saku at ytti.fi
Wed Sep 2 07:15:46 UTC 2020
On Wed, 2 Sep 2020 at 10:00, Martijn Schmidt via NANOG <nanog at nanog.org> wrote:
> I suppose now would be a good time for everyone to re-open their Centurylink ticket and ask why the RFO doesn't address the most important defect, e.g. the inability to withdraw announcements even by shutting down the session?
The more work the BGP process has, the longer it takes to complete that
work. You could ask in your RFP/RFQ whether a provider will commit to a
specific convergence time, which would improve your position
contractually and might make you eligible for compensation or
termination of the contract. Realistically, though, every operator can
run into a situation where you see what most would agree are
pathologically long convergence times.
The more BGP sessions and RIB entries there are, the higher the
probability that these issues manifest. Perhaps protocol-level work can
be justified as well. BGP has no concept of initial convergence. If you
have a lot of peers, your initial convergence contains a massive amount
of useless work, because you keep changing the best route while you
keep receiving new best routes. The higher the scale, the more useless
work you do and the longer a period of stability you need to eventually
~converge. Devices that operators run in practice may require hours
during _normal operation_ to complete initial convergence.
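To illustrate the throwaway work described above, here is a minimal
sketch in Python. It is a toy model, not how any real BGP
implementation works: each peer offers one route per prefix with a
distinct preference, peers arrive in random order, and every time a
better route arrives the best route is replaced. The names
count_best_changes and simulate are made up for this example.

```python
import random


def count_best_changes(prefs_in_arrival_order):
    """Count best-route changes for one prefix as routes arrive one by one.

    Each element is the preference of the route offered by one peer;
    a higher value wins. Only the final best route is useful work.
    """
    changes = 0
    best = None
    for pref in prefs_in_arrival_order:
        if best is None or pref > best:
            best = pref
            changes += 1  # a new best route: reinstall and re-advertise
    return changes


def simulate(num_peers, num_prefixes, seed=0):
    """Total best-route changes over all prefixes (toy model, assumed inputs)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(num_prefixes):
        prefs = list(range(num_peers))
        rng.shuffle(prefs)  # random arrival order of the peers' routes
        total += count_best_changes(prefs)
    return total


if __name__ == "__main__":
    prefixes = 1000
    for peers in (2, 10, 100):
        total = simulate(peers, prefixes)
        # Exactly one selection per prefix is needed in the end.
        print(peers, "peers:", total, "changes,", total - prefixes, "thrown away")
```

In this model only one selection per prefix is ultimately needed, so
everything above num_prefixes is work that was done and then discarded.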
RFC7313 might show us a way to reduce the amount of useless work. You
might want to add a signal that initial convergence is done, and a
signal that no installation or best-path algorithm runs until all
routes are loaded. This would massively improve convergence at scale,
as you wouldn't do that throwaway work, which ultimately inflates your
work queue and pushes your useful work far into the future.
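A rough sketch of that idea in Python, under loud assumptions: ToyRib,
end_of_initial_load and replay are hypothetical names invented here,
the "signal" is just a method call rather than any real BGP message,
and best-path selection is reduced to "highest preference wins". It
only shows the difference between selecting on every update and
holding selection back until every expected peer has finished loading.

```python
class ToyRib:
    """Toy RIB that can defer best-path selection until all peers are loaded."""

    def __init__(self, expected_peers, defer=True):
        self.expected_peers = set(expected_peers)
        self.loaded_peers = set()
        self.defer = defer
        self.routes = {}    # prefix -> {peer: preference}
        self.best = {}      # prefix -> (peer, preference)
        self.installs = 0   # number of best-route installs performed

    def converged(self):
        return self.loaded_peers >= self.expected_peers

    def receive(self, peer, prefix, preference):
        self.routes.setdefault(prefix, {})[peer] = preference
        # Without deferral, or once initial load is over, select immediately.
        if not self.defer or self.converged():
            self._select(prefix)

    def end_of_initial_load(self, peer):
        """Hypothetical signal: this peer has sent its full table."""
        self.loaded_peers.add(peer)
        if self.defer and self.converged():
            for prefix in self.routes:
                self._select(prefix)  # one pass over fully loaded data

    def _select(self, prefix):
        winner = max(self.routes[prefix].items(), key=lambda kv: kv[1])
        if self.best.get(prefix) != winner:
            self.best[prefix] = winner
            self.installs += 1


def replay(defer):
    """Feed the same three-peer, two-prefix load into a ToyRib."""
    rib = ToyRib(["A", "B", "C"], defer=defer)
    for peer, pref in (("A", 10), ("B", 20), ("C", 30)):
        for prefix in ("192.0.2.0/24", "198.51.100.0/24"):
            rib.receive(peer, prefix, pref)
        rib.end_of_initial_load(peer)
    return rib.installs, rib.best


if __name__ == "__main__":
    print("select on every update:", replay(defer=False)[0], "installs")
    print("deferred until loaded: ", replay(defer=True)[0], "installs")
```

Both runs end with the same best routes. The deferred run performs one
install per prefix, while the other reinstalls each prefix every time
a better route shows up.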
The main thing I would ask as a customer is: how can we fix it faster
than 5h in the future? Did we lose access to the control plane? Could
we reasonably avoid losing it?