Famous operational issues

brutal8z brutal8z at verizon.net
Wed Feb 24 01:52:47 UTC 2021


My war story.

At one of our major POPs in DC we had a row of 7513's, and one of them
had intermittent problems. I had replaced every piece of removable
card/part in it over time, and it kept failing. Even the vendor flew in
a team to the site to try to figure out what was wrong. It was finally
decided to replace the whole router (about 200lbs?). Being the local
field tech, that was my Job. On the night of the maintenance at 3am, the
work started. I switched off the rack power, which included a 2511
terminal server that was connected to half the routers in the row and
started to remove the router. A few minutes later I got a text, "You're
taking out the wrong router!" You can imagine the "Damn it, what have I
done?" feeling that runs through your mind and the way your heart stops
for a moment.

Okay, I wasn't taking out the wrong router. But unknown at the time,
terminal servers when turned off, had a nasty habit of sending a break
to all the routers it was connected to, and all those routers
effectively stopped. The remote engineer that was in charge saw the
whole POP go red and assumed I was the cause. I was, but not because of
anything I could have known about. I had to power cycle the downed
routers to bring them back on-line, and then continue with the
maintenance. A disaster to all involved, but the router got replaced.

I gave a very detailed account of my actions in the postmortem. It was
clear they knew I had turned off the wrong rack/router, and wasn't being
honest about it. I was adamant I had done exactly what I said, and even
swore I would fess up if I had error-ed, and always would, even if it
cost me the job. I rarely made mistakes, if any, so it was an easy thing
for me to say. For the next two weeks everyone that aware of the work
gave me the side eye.

About a week after that, the same thing happened to another field tech
in another state. That helped my case. They used my account to figure
out it was the TS that caused the problem. A few of them that had
questioned me harshly admitted to me my account helped them figure out
the cause.

And the worst part of this story? That router, completely replaced,
still had the same intermittent problem as before. It was a DC powered
POP, so they were all wired with the same clean DC power. In the end
they chalked it up to cosmic rays and gave up on it. I believe this
break issue was unique to the DC powered 2511's, and that we were the
first to use them, but I might be wrong on that.


On 2/16/21 2:37 PM, John Kristoff wrote:
> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps  the
> most notorious and likely to top many lists including mine.  So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session.  I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective.  I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.nanog.org/pipermail/nanog/attachments/20210223/b75fed87/attachment.html>


More information about the NANOG mailing list