Mitigating human error in the SP

Chadwick Sorrell mirotrem at gmail.com
Tue Feb 2 19:28:44 CST 2010


Thanks for all the comments!

On Tue, Feb 2, 2010 at 1:01 PM, JC Dill <jcdill.lists at gmail.com> wrote:
> Chadwick Sorrell wrote:
>>
>> This outage, of a high-profile customer, triggered upper management to
>> react by calling a meeting just days after.  Put bluntly, we've been
>> told "Human errors are unacceptable, and they will be completely
>> eliminated.  One is too many."
>
> Good, Fast, Cheap - pick any two.  No, you can't have all three.
>
> Here, Good is defined by your pointy-haired bosses as an
> impossible-to-achieve zero error rate.[1]  Attempting to achieve this is
> either going to cost $$$, or your operations speed (how long it takes people
> to do things) is going to drop like a rock.  Your first action should be to
> make sure upper management understands this so they can set the appropriate
> priorities on Good, Fast, and Cheap, and make the appropriate budget
> changes.
>
> It's going to cost $$$ to hire enough people to have the staff necessary to
> double-check things in a timely manner, OR things are going to slow way down
> as the existing staff is burdened by necessary double-checking of everything
> and triple-checking of some things required to try to achieve a zero error
> rate.  They will also need to spend $$$ on software (to automate as much as
> possible) and testing equipment.  They will also never actually achieve a
> zero error rate as this is an impossible task that no organization has ever
> achieved, no matter how much emphasis or money they pour into it (e.g.
> Windows vulnerabilities) or how important (see Challenger, Columbia, and the
> Mars Climate Orbiter incidents).
>
> When you put a $$$ cost on trying to achieve a zero error rate,
> pointy-haired bosses are usually willing to accept a normal error rate.  Of
> course, they want you to try to avoid errors, and there are a lot of simple
> steps you can take in that effort (basic checklists, automation, testing)
> which have been mentioned elsewhere in this thread that will cost some money
> but not the $$$ that is required to try to achieve a zero error rate.  Make
> sure they understand that the budget they allocate for these changes will be
> strongly correlated to how Good (zero error rate) and Fast (quick
> operational responses to turn-ups and problems) the outcome of this
> initiative will be.
>
> jc
>
> [1]  http://www.godlessgeeks.com/LINKS/DilbertQuotes.htm
>
> 2. "What I need is a list of specific unknown problems we will encounter."
> (Lykes Lines Shipping)
>
> 6. "Doing it right is no excuse for not meeting the schedule." (R&D
> Supervisor, Minnesota Mining & Manufacturing/3M Corp.)
>