Mitigating human error in the SP
JC Dill
jcdill.lists at gmail.com
Tue Feb 2 18:01:11 UTC 2010
Chadwick Sorrell wrote:
> This outage, of a high profile customer, triggered upper management to
> react by calling a meeting just days after. Put bluntly, we've been
> told "Human errors are unacceptable, and they will be completely
> eliminated. One is too many."
Good, Fast, Cheap - pick any two. No, you can't have all three.
Here, Good is defined by your pointy-haired bosses as an
impossible-to-achieve zero error rate.[1] Attempting to achieve this is
either going to cost $$$, or your operations speed (how long it takes
people to do things) is going to drop like a rock. Your first action
should be to make sure upper management understands this so they can set
the appropriate priorities on Good, Fast, and Cheap, and make the
appropriate budget changes.
It's going to cost $$$ to hire enough people to have the staff necessary
to double-check things in a timely manner, OR things are going to slow
way down as the existing staff is burdened by necessary double-checking
of everything and triple-checking of some things required to try to
achieve a zero error rate. They will also need to spend $$$ on software
(to automate as much as possible) and testing equipment. They will also
never actually achieve a zero error rate as this is an impossible task
that no organization has ever achieved, no matter how much emphasis or
money they pour into it (e.g. Windows vulnerabilities) or how important
the work is (see the Challenger, Columbia, and Mars Climate Orbiter
incidents).
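To put rough numbers on that trade-off, here is a minimal Python sketch. Every figure in it (a 5% base error rate, reviewers who miss 20% of the errors they look at, the hours per change and per check) is a made-up assumption for illustration, as are the function names. It also assumes each check is independent of the others, which real reviews rarely are.

```python
def residual_error_rate(base_rate: float, miss_rate: float, checks: int) -> float:
    """Chance that an error is made AND slips past every one of `checks` reviews.

    Assumes (hypothetically) that each review independently misses an
    error with probability `miss_rate`.
    """
    return base_rate * miss_rate ** checks


def staff_hours(changes: int, hours_per_change: float,
                hours_per_check: float, checks: int) -> float:
    """Total staff time when every change gets `checks` extra reviews."""
    return changes * (hours_per_change + checks * hours_per_check)


# Illustrative numbers only: 100 changes, 1 hour each, 0.5 hour per review.
for checks in range(4):
    rate = residual_error_rate(0.05, 0.2, checks)
    hours = staff_hours(100, 1.0, 0.5, checks)
    print(f"{checks} checks: residual error rate {rate:.4%}, staff hours {hours:.0f}")
```

Under these assumptions each added check shrinks the residual error rate but never takes it to zero, while the staff hours keep climbing with every check. That is the Good/Fast/Cheap squeeze in miniature.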
When you put a $$$ cost on trying to achieve a zero error rate,
pointy-haired bosses are usually willing to accept a normal error rate.
Of course, they want you to try to avoid errors, and there are a lot of
simple steps you can take in that effort (basic checklists, automation,
testing) which have been mentioned elsewhere in this thread that will
cost some money but not the $$$ that is required to try to achieve a
zero error rate. Make sure they understand that the budget they
allocate for these changes will strongly determine how Good (how close
to a zero error rate) and Fast (quick operational responses to turn-ups
and problems) the outcome of this initiative is.
jc
[1] http://www.godlessgeeks.com/LINKS/DilbertQuotes.htm
2. "What I need is a list of specific unknown problems we will
encounter." (Lykes Lines Shipping)
6. "Doing it right is no excuse for not meeting the schedule." (R&D
Supervisor, Minnesota Mining & Manufacturing/3M Corp.)
More information about the NANOG mailing list