Mitigating human error in the SP
mirotrem at gmail.com
Tue Feb 2 15:14:10 UTC 2010
On Tue, Feb 2, 2010 at 9:09 AM, Paul Corrao <pcorrao at voxeo.com> wrote:
> Humans make errors.
> For your upper management to think they can build a foundation of reliability on the theory that humans won't make errors is self-deceiving.
> But that isn't where the story ends. That's where it begins. Your infrastructure, processes and tools should all be designed with that in mind so as to reduce or eliminate the impact that human error will have on the reliability of the service you provide to your customers.
> So, for the example you gave there are a few things that could be put in place. The first one, already mentioned by Chad, is that mission critical services should not be designed with single points of failure - that situation should be remediated.
> Another question to be asked - since this was provisioning work being done, and it was apparently being done on production equipment, could the work have been done at a time of day (or night) when an error would not have been as much of a problem?
As it stands now, businesses want to turn their services up when they
are in the office. We do all new turn-ups during the day; anything
requiring a roll or maintenance window is scheduled in the middle of
the night.
> You don't say how long the outage lasted, but given the reaction by your upper management, I would infer that it lasted for a while. That raises the next question. Who besides the engineer making the mistake was aware of the fact that work on production equipment was occurring? The reason this is important is because having the NOC know that work is occurring would give them a leg up on locating where the problem is once they get the trouble notification.
The actual error happened when someone was troubleshooting a turn-up
for a customer who has had their ethertype set wrong in the past. It
wasn't a provisioning problem so much as someone troubleshooting why
the service didn't come up with the customer. Ironically, the NOC was
on the phone when it happened, and the switch was rebooted almost
immediately, so the outage lasted 5 minutes.