Mitigating human error in the SP

Nick Hilliard nick at foobar.org
Tue Feb 2 13:51:25 UTC 2010


On 02/02/2010 02:21, Chadwick Sorrell wrote:
> This outage, of a high profile customer, triggered upper management to
> react by calling a meeting just days after.  Put bluntly, we've been
> told "Human errors are unacceptable, and they will be completely
> eliminated.  One is too many."

Leaving the PHB rhetoric aside for a few moments, this comes down to two
things: 1. cost vs. return and 2. realisation that service availability is
a matter of risk management, not a product bolt-on that you can install in
your operations department in a matter of days.

Pilot error can be substantially reduced by a variety of different things,
most notably good quality training, good quality procedures and
documentation, lab staging of all potentially service-affecting operations,
automation of lots of tasks, good quality change management control,
pre/post project analysis, and basic risk analysis of all regular procedures.

You'll note that all of these things cost time and money to develop,
implement and maintain; also, depending on the operational service model
which you currently use, some of them may dramatically affect operational
productivity one way or another.  This often leads to a significant
increase in staffing / resourcing costs in order to maintain similar levels
of operational service.  It also tends to lead to inflexibility at various
levels, which can have a knock-on effect in terms of customer expectation.

Other things which will help your situation from a customer interaction
point of view are rigorous use of maintenance windows and good
communications, so that customers understand that there are risks
associated with maintenance.

Your management is obviously pretty upset about this incident.  If they
want things to change, then they need to realise that reducing pilot error
is not just a matter of getting someone to bark at the tech people until
the problem goes away.  They need to be fully aware at all levels that risk
management of this sort is a major undertaking for a small company, and
that it needs their full support and buy-in.

Nick



