Mitigating human error in the SP

Chadwick Sorrell mirotrem at
Tue Feb 2 19:28:44 CST 2010

Thanks for all the comments!

On Tue, Feb 2, 2010 at 1:01 PM, JC Dill <jcdill.lists at> wrote:
> Chadwick Sorrell wrote:
>> This outage, of a high profile customer, triggered upper management to
>> react by calling a meeting just days after.  Put bluntly, we've been
>> told "Human errors are unacceptable, and they will be completely
>> eliminated.  One is too many."
> Good, Fast, Cheap - pick any two.  No, you can't have all three.
>
> Here, Good is defined by your pointy-haired bosses as an
> impossible-to-achieve zero error rate.[1]  Attempting to achieve this is
> either going to cost $$$, or your operations speed (how long it takes people
> to do things) is going to drop like a rock.  Your first action should be to
> make sure upper management understands this so they can set the appropriate
> priorities on Good, Fast, and Cheap, and make the appropriate budget
> changes.
>
> It's going to cost $$$ to hire enough people to have the staff necessary to
> double-check things in a timely manner, OR things are going to slow way down
> as the existing staff is burdened by necessary double-checking of everything
> and triple-checking of some things required to try to achieve a zero error
> rate.  They will also need to spend $$$ on software (to automate as much as
> possible) and testing equipment.  They will also never actually achieve a
> zero error rate as this is an impossible task that no organization has ever
> achieved, no matter how much emphasis or money they pour into it (e.g.
> Windows vulnerabilities) or how important the task is (see Challenger,
> Columbia, and the Mars Climate Orbiter incidents).
>
> When you put a $$$ cost on trying to achieve a zero error rate,
> pointy-haired bosses are usually willing to accept a normal error rate.  Of
> course, they want you to try to avoid errors, and there are a lot of simple
> steps you can take in that effort (basic checklists, automation, testing)
> which have been mentioned elsewhere in this thread that will cost some money
> but not the $$$ that is required to try to achieve a zero error rate.  Make
> sure they understand that the budget they allocate for these changes will be
> strongly correlated to how Good (zero error rate) and Fast (quick
> operational responses to turn-ups and problems) the outcome of this
> initiative is.
>
> jc
> [1]
> 2. "What I need is a list of specific unknown problems we will encounter."
> (Lykes Lines Shipping)
> 6. "Doing it right is no excuse for not meeting the schedule." (R&D
> Supervisor, Minnesota Mining & Manufacturing/3M Corp.)
