<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<font face="Bookman Old Style">My war story.<br>
<br>
At one of our major POPs in DC we had a row of 7513's, and one of
them had intermittent problems. I had replaced every piece of
removable card/part in it over time, and it kept failing. Even the
vendor flew in a team to the site to try to figure out what was
wrong. It was finally decided to replace the whole router (about
200lbs?). Being the local field tech, that was my job. On the
night of the maintenance at 3am, the work started. I switched off
the rack power, which included a 2511 terminal server that was
connected to half the routers in the row, and started to remove the
router. A few minutes later I got a text, "You're taking out the
wrong router!" You can imagine the "Damn it, what have I done?"
feeling that runs through your mind and the way your heart stops
for a moment.<br>
<br>
Okay, I wasn't taking out the wrong router. But unknown to us at the
time, terminal servers, when turned off, had a nasty habit of
sending a break to all the routers they were connected to, and all
those routers effectively stopped. The remote engineer who was in
charge saw the whole POP go red and assumed I was the cause. I
was, but not because of anything I could have known about. I had
to power cycle the downed routers to bring them back on-line, and
then continue with the maintenance. A disaster to all involved,
but the router got replaced.<br>
<br>
I gave a very detailed account of my actions in the postmortem. It
was clear they believed I had turned off the wrong rack/router and
wasn't being honest about it. I was adamant I had done exactly
what I said, and even swore I would fess up if I had erred, and
always would, even if it cost me the job. I rarely made mistakes,
if any, so it was an easy thing for me to say. For the next two
weeks everyone who was aware of the work gave me the side eye.<br>
<br>
About a week after that, the same thing happened to another field
tech in another state. That helped my case. They used my account
to figure out it was the TS that caused the problem. A few of those
who had questioned me harshly admitted to me that my account helped
them figure out the cause.<br>
<br>
And the worst part of this story? That router, completely
replaced, still had the same intermittent problem as before. It
was a DC-powered POP, so they were all wired with the same clean
DC power. In the end they chalked it up to cosmic rays and gave up
on it. I believe this break issue was unique to the DC-powered
2511s, and that we were the first to use them, but I might be
wrong on that.<br>
</font><br>
<br>
<div class="moz-cite-prefix">On 2/16/21 2:37 PM, John Kristoff
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:20210216133735.3a79daef@p50.localdomain">
<pre class="moz-quote-pre" wrap="">Friends,
I'd like to start a thread about the most famous and widespread Internet
operational issues, outages or implementation incompatibilities you
have seen.
Which examples would make up your top three?
To get things started, I'd suggest the AS 7007 event is perhaps the
most notorious and likely to top many lists including mine. So if
that is one for you I'm asking for just two more.
I'm particularly interested in this as the first step in developing a
future NANOG session. I'd be particularly interested in any issues
that also identify key individuals that might still be around and
interested in participating in a retrospective. I already have someone
that is willing to talk about AS 7007, which shouldn't be hard to guess
who.
Thanks in advance for your suggestions,
John
</pre>
</blockquote>
<br>
</body>
</html>