Famous operational issues
adamkennedy at watchcomm.net
Wed Feb 24 04:24:14 UTC 2021
While we're talking about raid types...
A few acquisitions ago, between 2006 and 2010, I worked at a Wireless ISP in
Northern Indiana. Our CEO decided to sell Internet service to school
systems because the e-rate funding was too much to resist. He had the idea
to install towers on the schools and sell service off them while paying the
schools for roof rights. About two years into the endeavor, I wake up one
morning and walk to my car. Two FBI agents get out of an unmarked town car.
About an hour later, they let me go to the office where I found an entire
barrage of FBI agents. It was a full raid and not the kind you want to see.
Hard drives were involved and being made redundant, but the redundant
copies were labeled and placed into boxes that were carried out to SUVs
that were as dark as the morning coffee these guys drank. There were a lot
of drives; all of our servers were in our server room at the office, in
roughly five or six racks, each holding varying amounts of equipment.
After some questioning and assisting them in their cataloging adventure,
the agents left us with a ton of questions and just enough equipment to
keep the customers connected. The CEO became extremely paranoid at this
point.
He told us to prepare to move servers to a different building. He went into
a tailspin trying to figure out where he could hide the servers to keep
things going without the bank or FBI seizing the assets. He was extremely
worried the bank would close the office down. We started moving all network
routing around to avoid using the office as our primary DIA.
One morning I get into the office and we hear the words we've been
dreading: "We're moving the servers". The plan was to move them to a tower
site that had a decent-sized shack on site. Connectivity was decent, we had
a licensed 11 GHz microwave backhaul capable of about 155 Mbps. The site was
part of the old MCI microwave long-distance network in the 80s and 90s. It
had redundant air conditioners, a large propane tank, and a generator
capable of keeping the site alive for about three days. We were told not to
notify any customers, which became problematic because two customers had
servers colocated in our building. We consolidated the servers into three
racks and managed to get things prepared with a decent UPS in each rack.
The CEO decided to move the servers at nightfall to "avoid suspicion". Our
office was in an unsavory part of town; moving anything at night was
suspicious. So, under the cover of half-assed darkness, we loaded the racks
onto a flatbed truck and drove them 20 minutes to the tower. While we
unloaded the racks, an electrician we knew was wiring up the L5-20 outlets
for the UPS in each rack. We got the racks plugged in, servers powered up,
and then the two customers who had colocated equipment arrived. They got
their equipment powered up and all seemed OK.
Back at the office the next day we were told to gather our workstations and
start working from home. I've been working from home ever since and quite
enjoy it, but that's beside the point.
Summer starts and I tell the CEO we need to repair the AC units because
they are failing. He ignores it, claiming he doesn't want to spend money the
bank could take at any minute. About a month later, a nice hot summer day
rolls in and the AC units both die. I stumble upon an old portable AC unit
and put that at the site. Temperatures rise to 140°F ambient. Server
overheat alarms start going off, things start failing. Our colocation
customers are extremely upset. They pull their servers and drop service.
The heat subsides, CEO finally pays to repair one of the AC units.
Eventually, the company declares bankruptcy and goes into liquidation.
Luckily another WISP catches wind of it, buys the customers and assets, and
hires me. My happiest day that year was moving all the servers into a
better-suited home, a real data center. I don't know what happened to the
CEO, but I know that I'll never trust anything he has his hands in ever
again.
adamkennedy at watchcomm.net | 800-589-3837 x120
Watch Communications | www.watchcomm.net
3225 W Elm St, Suite A
Lima, OH 45805
On Tue, Feb 23, 2021 at 8:55 PM brutal8z via NANOG <nanog at nanog.org> wrote:
> My war story.
> At one of our major POPs in DC we had a row of 7513s, and one of them had
> intermittent problems. I had replaced every removable card and part in it
> over time, and it kept failing. Even the vendor flew in a team to the site
> to try to figure out what was wrong. It was finally decided to replace the
> whole router (about 200 lbs?). Being the local field tech, that was my job.
> On the night of the maintenance, at 3 AM, the work started. I switched
> off the rack power, which included a 2511 terminal server that was
> connected to half the routers in the row and started to remove the router.
> A few minutes later I got a text, "You're taking out the wrong router!" You
> can imagine the "Damn it, what have I done?" feeling that runs through your
> mind and the way your heart stops for a moment.
> Okay, I wasn't taking out the wrong router. But, unknown to us at the time,
> terminal servers, when turned off, had a nasty habit of sending a break to
> all the routers they were connected to, and all those routers effectively
> stopped. The remote engineer that was in charge saw the whole POP go red
> and assumed I was the cause. I was, but not because of anything I could
> have known about. I had to power cycle the downed routers to bring them
> back on-line, and then continue with the maintenance. A disaster to all
> involved, but the router got replaced.
> I gave a very detailed account of my actions in the postmortem. It was
> clear they believed I had turned off the wrong rack/router and wasn't being
> honest about it. I was adamant I had done exactly what I said, and even
> swore I would fess up if I had erred, and always would, even if it cost me
> the job. I rarely made mistakes, if any, so it was an easy thing for me to
> say. For the next two weeks, everyone who was aware of the work gave me the
> side eye.
> About a week after that, the same thing happened to another field tech in
> another state. That helped my case. They used my account to figure out that
> the terminal server was the cause, and a few of the people who had
> questioned me harshly admitted as much to me.
> And the worst part of this story? That router, completely replaced, still
> had the same intermittent problem as before. It was a DC-powered POP, so
> they were all wired with the same clean DC power. In the end they chalked
> it up to cosmic rays and gave up on it. I believe this break issue was
> unique to the DC-powered 2511s, and that we were the first to use them,
> but I might be wrong on that.
> On 2/16/21 2:37 PM, John Kristoff wrote:
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
> Which examples would make up your top three?
> To get things started, I'd suggest the AS 7007 event is perhaps the
> most notorious and likely to top many lists, including mine. So if
> that is one for you, I'm asking for just two more.
> I'm particularly interested in this as the first step in developing a
> future NANOG session. I'd be particularly interested in any issues
> that also identify key individuals who might still be around and
> interested in participating in a retrospective. I already have someone
> who is willing to talk about AS 7007, which shouldn't be hard to guess.
> Thanks in advance for your suggestions,