Famous operational issues

Warren Kumari warren at kumari.net
Tue Feb 23 23:56:08 UTC 2021


On Tue, Feb 23, 2021 at 5:14 PM Justin Streiner <streinerj at gmail.com> wrote:

> Beyond the widespread outages, I have so many personal war stories that
> it's hard to pick a favorite.
>
> My first job out of college in the mid-late 90s was at an ISP in
> Pittsburgh that I joined pretty early in its existence, and everyone did a
> bit of everything. I was hired to do sysadmin stuff, networking, pretty
> much whatever was needed. About a year after I started, we brought up a new
> mail system with an external RAID enclosure for the mail store itself.  One
> day, we saw indications that one of the disks in the RAID enclosure was
> starting to fail, so I scheduled a maintenance window to replace the disk
> and let the controller rebuild the data and integrate it back into the RAID
> set.  No big worries, right?
>
> It's Tuesday at about 2 AM.
>
> Well, the kernel on the RAID controller itself decided that when I pulled
> the failing drive would be a fine time to panic, and more or less turn
> itself into a bit-blender, and take all the mailstore down with it.  After
> a few hours of watching fsck make no progress on anything, in terms of
> trying to un-fsck the mailstore, we made the decision in consultation with
> the CEO to pull the plug on trying to bring the old RAID enclosure back to
> life, and focus on finding suitable replacement hardware and rebuild from
> scratch.  We also discovered that the most recent backups of the mailstore
> were over a month old :(
>
> I think our CEO ended up driving several hours to procure a suitable
> enclosure.  By the time we got the enclosure installed, filesystems built,
> and got whatever tape backups we had restored, and tested the integrity of
> the system, it was now Thursday around 8 AM. Coincidentally, that was the
> same day the company hosted a big VIP gathering (the mayor was there, along
> with lots of investors and other bigwigs), so I had to come back and put on
> a suit to hobnob with the VIPs after getting a total of 6 hours of sleep in
> about the previous 3 days.  I still don't know how I got home that night
> without wrapping my vehicle around a utility pole (due to being over-tired,
> not due to alcohol).
>
> Many painful lessons learned over that stretch of days, as often the case
> as a company grows from startup mode and builds more robust technology and
> business processes as a consequence of growth.
>

Oh, dear. RAID.... that triggered 2 stories.
1: I worked at a small ISP in Westchester, NY. One day I'm doing stuff, and
want to kill process 1742, so I type 'kill -9 1' ... and then, before
pressing enter, I get distracted by our "Cisco AGS+ monitor" (a separate
story). After I get back to my desk I unlock my terminal, and call over a
friend to show just how close I'd gotten to making something go Boom. He
says "Nah, BSD is cleverer than that. I'm sure the kill command has some
check in to stop you killing init.". I disagree. He disagrees. I disagree
again. He calls me stupid. I bet him a soda.
He proves his point by typing 'su; kill -9 1' in the window he's logged
into -- and our primary NFS server (with all of the user sites)
obediently kills off init, and all of the child processes.... we run over
to the front of the box and hit the power switch, while desperately looking
for a monitor and keyboard to watch it boot.
It does the BIOS checks, and then stops on the RAID controller, complaining
about the fact that there are *2* dead drives, and that the array is now
sad.....
This makes no sense. I can understand one drive not recovering from a power
outage, but 2 seems a bit unlikely, especially because the machine hadn't
been beeping or anything like that.... we try turning it off and on again a
few times, no change... We pull the machine out of the rack and rip the
cover off.
Sure enough, there is a RAID card - but the piezo-buzzer on it is, for some
reason, wrapped in a bunch of napkins, held in place with electrical tape.
I pull that off, and there is also some  paper towel jammed into the hole
in the buzzer, and bits of a broken pencil....

After replacing the drives, starting an rsync restore from a backup server
we investigate more....
...
it turns out that a few months ago(!) the machine had started beeping. The
night crew naturally found this annoying, and so they'd gone investigating
and discovered that it was this machine, and lifted the lid while still in
the rack. They traced the annoying noise to this small black thingie, and
made poked it until it stopped, thus solving the problem once and for
all.... yay!





2: I used to work at a company which was in one of the buildings next to
the twin-towers. For various clever reasons, they had their "datacenter" in
a corner of the office space... anyway, the planes hit, power goes out and
the building is evacuated - luckily no one is injured, but the entire
company/site is down. After a few weeks, my friend Joe is able to arrange
with a fire marshal to get access to the building so he can go and grab the
disks with all the data. The fire marshal and Joe trudge up the 15 flights
of stairs.... When they reach the suite, Joe discovers that the windows
where his desk was are blown in, there is debris everywhere, etc. He's
somewhat shaken by all this, but goes over to the datacenter area, pulls
the drives out of the Sun storage arrays, and puts them in his backpack.
They then trudge down the 15 flights of stairs, and Joe takes them home.
We've managed to scrounge up 3 identical (empty) arrays, and some servers,
and the plan is to temporarily run the service from his basement...

Anyway, I get a panic'ed call from Joe. He's got the empty RAID arrays.
He's got the servers. He's got a pile of 42 drives (3 enclosures, 14 drives
per enclosure). Unfortunately he completely didn't think to mark the order
of the drives, and now we have *no* idea which drives goes in which array,
nor in which slot in the array....

We spent some time trying to figure out how many ways you can arrange 42
things into 3 piles, and how long it would take to try all combinations....
I cannot remember the actual number, but it approached the lifetime of the
universe....
After much time and poking, we eventually worked out that the RAID
controller wrote a slot number at sector 0 on each physical drive, and it
became a solvable problem, but...


W


> jms
>
> On Tue, Feb 16, 2021 at 2:37 PM John Kristoff <jtk at dataplane.org> wrote:
>
>> Friends,
>>
>> I'd like to start a thread about the most famous and widespread Internet
>> operational issues, outages or implementation incompatibilities you
>> have seen.
>>
>> Which examples would make up your top three?
>>
>> To get things started, I'd suggest the AS 7007 event is perhaps  the
>> most notorious and likely to top many lists including mine.  So if
>> that is one for you I'm asking for just two more.
>>
>> I'm particularly interested in this as the first step in developing a
>> future NANOG session.  I'd be particularly interested in any issues
>> that also identify key individuals that might still be around and
>> interested in participating in a retrospective.  I already have someone
>> that is willing to talk about AS 7007, which shouldn't be hard to guess
>> who.
>>
>> Thanks in advance for your suggestions,
>>
>> John
>>
>

-- 
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.nanog.org/pipermail/nanog/attachments/20210223/d0d9911a/attachment.html>


More information about the NANOG mailing list