Famous operational issues
job at fastly.com
Tue Feb 16 20:00:32 UTC 2021
On Tue, Feb 16, 2021 at 01:37:35PM -0600, John Kristoff wrote:
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
> Which examples would make up your top three?
This was a fantastic outage; one could really feel the tremors in the
far corners of the BGP default-free zone:
The experiment triggered a bug in some Cisco router models: affected
Ciscos would corrupt this specific BGP announcement ** ON OUTBOUND **.
Any peers of such Ciscos receiving this BGP update would (according to
the then-current RFCs) consider the BGP UPDATE corrupted, and would
subsequently tear down the BGP sessions with the Ciscos. Because the
corruption was not detected by the Ciscos themselves, whenever the
sessions came back online they'd reannounce the corrupted update,
causing another session teardown. Bounce ... Bounce ... Bounce ... at
global scale in both IBGP and EBGP! :-)
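
To make the bounce loop concrete, here is a toy Python sketch. It is not
real BGP code: the Update class, the "malformed" flag and both function
names are made up for illustration, and the strict receiver only models
the "malformed UPDATE means session reset" rule described above.

```python
from dataclasses import dataclass


@dataclass
class Update:
    prefix: str
    malformed: bool  # hypothetical stand-in for "corrupted on outbound"


def strict_receiver(update: Update) -> str:
    """Toy model of the older rule: a malformed UPDATE resets the session."""
    return "reset" if update.malformed else "accept"


def simulate_flaps(rounds: int) -> int:
    """Count resets when the sender replays the same corrupted UPDATE
    every time the session re-establishes."""
    resets = 0
    for _ in range(rounds):
        # The sender never notices its own corruption, so nothing changes
        # between attempts (192.0.2.0/24 is a documentation prefix).
        update = Update(prefix="192.0.2.0/24", malformed=True)
        if strict_receiver(update) == "reset":
            resets += 1
    return resets
```

Every re-establishment ends in another reset, so simulate_flaps(n)
returns n: the loop never converges on its own.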
Luckily the industry took these, and many other, lessons to heart: in
2015 the IETF published RFC 7606 ("Revised Error Handling for BGP UPDATE
Messages"), which specifies far more robust behaviour for BGP speakers.
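
One of the approaches RFC 7606 describes is "treat-as-withdraw": the
routes carried in a malformed UPDATE are handled as if they had been
withdrawn, while the session itself stays up. A very simplified Python
sketch of that idea follows. The dict-based RIB, the attrs_ok flag and
the function name are assumptions for illustration, not how any real
implementation is structured, and the RFC's per-attribute rules are not
modelled here.

```python
def handle_update_lenient(rib: dict, prefix: str, path: str, attrs_ok: bool) -> str:
    """Toy treat-as-withdraw: drop the affected route, keep the session."""
    if attrs_ok:
        rib[prefix] = path  # well-formed UPDATE: install or replace the route
        return "installed"
    # Malformed attributes: act as if the prefix had been withdrawn.
    # Nothing here tears down the session, so other routes are untouched.
    rib.pop(prefix, None)
    return "treat-as-withdraw"


rib = {"192.0.2.0/24": "AS64500", "198.51.100.0/24": "AS64501"}
result = handle_update_lenient(rib, "192.0.2.0/24", "AS64500", attrs_ok=False)
```

After the call, only the route for the affected prefix is gone; the
other entry remains in the table and no flap loop starts.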