Mitigating human error in the SP

JC Dill jcdill.lists at
Tue Feb 2 18:01:11 UTC 2010

Chadwick Sorrell wrote:
> This outage, of a high profile customer, triggered upper management to
> react by calling a meeting just days after.  Put bluntly, we've been
> told "Human errors are unacceptable, and they will be completely
> eliminated.  One is too many."

Good, Fast, Cheap - pick any two.  No, you can't have all three.

Here, Good is defined by your pointy-haired bosses as an 
impossible-to-achieve zero error rate.[1]  Attempting to achieve this is 
either going to cost $$$, or your operations speed (how long it takes 
people to do things) is going to drop like a rock.  Your first action 
should be to make sure upper management understands this so they can set 
the appropriate priorities on Good, Fast, and Cheap, and make the 
appropriate budget changes.

It's going to cost $$$ to hire enough people to have the staff necessary 
to double-check things in a timely manner, OR things are going to slow 
way down as the existing staff is burdened by necessary double-checking 
of everything and triple-checking of some things required to try to 
achieve a zero error rate.  They will also need to spend $$$ on software 
(to automate as much as possible) and testing equipment.  Even then, they 
will never actually achieve a zero error rate: it is an impossible task 
that no organization has ever achieved, no matter how much emphasis or 
money it pours into the effort (e.g. Windows vulnerabilities) or how 
important the outcome is (see the Challenger, Columbia, and Mars Climate 
Orbiter incidents).

When you put a $$$ cost on trying to achieve a zero error rate, 
pointy-haired bosses are usually willing to accept a normal error rate.  
Of course, they want you to try to avoid errors, and there are a lot of 
simple steps you can take in that effort (basic checklists, automation, 
testing) which have been mentioned elsewhere in this thread that will 
cost some money but not the $$$ that is required to try to achieve a 
zero error rate.  Make sure they understand that the budget they 
allocate for these changes will be strongly correlated with how Good 
(approaching a zero error rate) and Fast (quick operational responses to 
turn-ups and problems) the outcome of this initiative will be.



2. "What I need is a list of specific unknown problems we will 
encounter." (Lykes Lines Shipping)

6. "Doing it right is no excuse for not meeting the schedule." (R&D 
Supervisor, Minnesota Mining & Manufacturing/3M Corp.)
