Famous operational issues

Warren Kumari warren at kumari.net
Fri Feb 19 18:44:19 UTC 2021


At a previous company we had a large number of Foundry Networks layer-3
switches. They participated in our OSPF network and had a *really* annoying
bug. Every now and then one of them would get somewhat confused and would
corrupt its OSPF database (there seemed to be some pointer that would end
up off by one).

It would then cleverly realize that its LSDB was different from everyone
else's and so would flood the corrupt database to all the other OSPF
speakers. Some vendors would do a better job of sanity-checking the LSAs and
would ignore the bad ones; other vendors would install them... and now you
have different link-state databases on different devices, and OSPF becomes
unhappy.
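
To give a feel for what "sanity checking" means here: a Type-5 (AS-external)
LSA carries a destination network in its link-state ID and a network mask in
its body, and a receiver can at least verify that the pair is self-consistent
before installing it. A rough sketch of that idea in Python -- this is my
illustration of the concept, not any vendor's actual code:

    import ipaddress

    def plausible_external_lsa(lsid: str, mask: str) -> bool:
        """Rough consistency check on a Type-5 LSA: the mask should be a
        contiguous netmask, and the LSID (a network number) should have
        no host bits set under that mask."""
        m = int(ipaddress.IPv4Address(mask))
        i = int(ipaddress.IPv4Address(lsid))
        contiguous = m == 0 or (m | (m - 1)) == 0xFFFFFFFF
        return contiguous and (i & ~m & 0xFFFFFFFF) == 0

    # The LSID/mask pairs in the log excerpt below fail this check:
    print(plausible_external_lsa("0.9.32.5", "10.160.8.0"))       # False
    print(plausible_external_lsa("10.160.8.0", "255.255.255.0"))  # True (a sane pair)

A Cisco box that noticed the mismatch would log something like the following
and decline to install the route: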

Nov 24 22:23:53.633 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.9.32.5
Mask 10.160.8.0 from 10.178.255.252
NOTE: This route will not be installed in the routing table.

Nov 26 11:01:32.997 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3
Mask 10.2.153.0 from 10.178.255.252
NOTE: This route will not be installed in the routing table.

Nov 27 23:14:00.660 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3
Mask 10.2.153.0 from 10.178.255.252
NOTE: This route will not be installed in the routing table.

If you look at the output, you can see that there is garbage in the LSID
field, and the value that belongs there has shifted into the Mask field. I
also saw more extreme versions of the same bug: in my favorite example the
mask was 115.104.111.119 and further down there was 105.110.116.114 -- if
you take those octets as decimal numbers and look up their ASCII values, you
get "show" and "intr". I wrote a tool to scrape the bits from these errors
and ended up with a large amount of the box's CLI help text.
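
Decoding those octets is a one-liner. A minimal sketch in Python (the helper
name is mine; the two dotted strings are the ones quoted above):

    def dotted_to_ascii(dotted: str) -> str:
        """Read each dotted-"decimal" octet as an ASCII character code."""
        return "".join(chr(int(octet)) for octet in dotted.split("."))

    print(dotted_to_ascii("115.104.111.119"))  # -> show
    print(dotted_to_ascii("105.110.116.114"))  # -> intr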




Many years ago I worked for a small mom-and-pop ISP in New York State
(I was the only network/technical person there) -- it was a very
freewheeling place, and I built the network by doing whatever made sense at
the time.

One of my "favorite" customers (Joe somebody) was somehow related to the
owner of the ISP and was a gamer. This was back in the day when the gaming
magazines would give you useful tips like "Type 'tracert $gameserver' and
make sure that there are less than N hops".  Joe would call up tech
support, me, the owner, etc., and complain that there were N+3 hops and that
most of them were in our network. I spent a lot of time explaining packet
loss, latency, and so on, but couldn't shake his belief that hop count was
the only metric that mattered.

Finally, one night he called me at home well after midnight (no, I didn't
give him my home phone number; he looked me up in the phonebook!) to
complain that his gaming was suffering because it was "too many hops to get
out of your network". I finally snapped and built a static GRE tunnel from
the RAS box that he connected to, snaking it all over the network -- it was
a thing of beauty: it went through almost every device that we owned and
took the most convoluted path I could come up with. "Yay!", I figured, "now
I can demonstrate that latency is more important than hop count", and I went
to bed.
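
For flavor, one leg of it would have looked something like this (Cisco-style
syntax; every address, interface, and ACL number here is invented for
illustration, and the real thing wound through far more devices):

    interface Tunnel0
     description Joe's scenic route -- leg 1 of many
     ip address 192.0.2.1 255.255.255.252
     tunnel source Loopback0
     tunnel destination 198.51.100.7
    !
    ! Joe's RADIUS profile always handed him the same address, so an ACL
    ! plus policy routing could steer just his traffic into the tunnel
    ! (applied with "ip policy route-map JOE" on his dial-in interface):
    access-list 10 permit host 203.0.113.42
    route-map JOE permit 10
     match ip address 10
     set interface Tunnel0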

The next morning I get a call from him. He is ecstatic and wildly impressed
by how well the network is working for him now and how great his gaming
performance is. "Oh well," I think, "at least he is happy and will leave me
alone now." I don't document the purpose of this GRE tunnel anywhere, and
after some time I forget about it.

A few months later I am doing some routine cleanup work and stumble across
a weird-looking tunnel -- it's bizarre: it goes all over the place and is
all kinds of crufty -- there are static routes, policy routing, and strange
things being done on the RADIUS server to make sure some user always gets a
certain IP... I look through my pile of notes and old configs, find nothing,
and then decide to just yank it out.

That night I get an enraged call (at home again) from Joe *screaming* that
the network is all broken again because there are now way too many hops to
get out of the network and that people keep shooting him...

*What I learnt from this:*
1: Make sure you document everything (and no, the network isn't
documentation)
2: Gamers are weird.
3: Making changes to your network in anger provides short-term pleasure but
long-term pain.



On Fri, Feb 19, 2021 at 1:10 PM Andrew Gallo <akg1330 at gmail.com> wrote:

>
>
> On 2/16/2021 2:37 PM, John Kristoff wrote:
> > Friends,
> >
> > I'd like to start a thread about the most famous and widespread Internet
> > operational issues, outages or implementation incompatibilities you
> > have seen.
> >
> > Which examples would make up your top three?
>
>
> I don't believe I've seen this in any of the replies, but the AT&T
> cascading switch crashes of 1990 is a good one.  This link even has some
> pseudocode
> https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse
>
>

-- 
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra