Famous operational issues
Daniel Karrenberg
dfk at ripe.net
Fri Feb 19 11:07:58 UTC 2021
On 16 Feb 2021, at 20:37, John Kristoff wrote:
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
My absolute top one happened in 1995. Traffic engineering was not a widely
used term then. A bright colleague, who will remain unnamed, decided that
he could make AS paths longer by repeating the same AS number more than
once. Unfortunately, the prevalent software on Cisco routers was not
resilient to such trickery and reacted with a reboot. This caused an
avalanche of yo-yo-ing routers. Think it through!
It took some time before that offending path could be purged from the
whole Internet; yes, we all roughly knew the topology and the players of
the BGP-speaking parts of it at that time. Luckily this happened
during the set-up for the Danvers IETF and co-ordination between major
operators was quick because most of their routing geeks happened to be
in the same room, the ‘terminal room’; remember those?
Since at the time I personally had no responsibility for operations any
more, I went back to pulling cables and crimping RJ45s.
Lessons: HW/SW mono-cultures are dangerous. Input testing is good
practice at all levels of software. Operational co-ordination is key in
times of crisis.
Daniel