Mitigating human error in the SP
wavetossed at googlemail.com
Wed Feb 3 01:30:00 UTC 2010
> Automated config deployment / provisioning. And sanity checking
> before deployment.
Easy to say, not so easy to do. For instance, that incorrect port was identified
by a number or name. Theoretically, if an automated tool pulls the number/name
from a database and issues the command, then the error cannot happen. But how
does the number/name get into the database?
I've seen a situation where a human being enters that number, copying it from
another application screen. We hope that it is done by copy/paste all the
time but who knows? And even copy/paste can make mistakes if the selection
is done by mouse by someone who isn't paying enough attention.
But wait! How did the other application come up with that number for copying?
Actually, it was copy-pasted from yet a third application, and that application
got it by copy/paste from a spreadsheet.
It is easy to create a tangled mess of OSS applications that are glued together
by lots of manual human effort, creating numerous opportunities for human error.
So while I wholeheartedly support automation of network configuration, that is
not a magic bullet. You also need to pay attention to the whole process, the
whole chain of information flow.
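
One way to watch the whole chain, rather than only the last hop, is to compare the port identifier held by every system it passed through before anything is pushed. What follows is only a rough sketch under assumptions: the source names (spreadsheet, inventory, provisioning), the example port names, and the functions themselves are made up for illustration and do not refer to any real OSS product.

```python
def group_sources_by_port(port_by_source):
    """Group source names by the port identifier each one holds.

    port_by_source maps a (hypothetical) system name to the identifier
    stored there. Surrounding whitespace is stripped because it is a
    common and harmless copy/paste artifact.
    """
    groups = {}
    for source, port in port_by_source.items():
        groups.setdefault(port.strip(), []).append(source)
    return groups


def safe_to_deploy(port_by_source):
    """Return True only when every source agrees on one identifier."""
    return len(group_sources_by_port(port_by_source)) == 1


if __name__ == "__main__":
    chain = {
        "spreadsheet": "ge-0/0/1",
        "inventory": "ge-0/0/11",   # a slipped mouse selection, perhaps
        "provisioning": "ge-0/0/1",
    }
    if not safe_to_deploy(chain):
        for port, sources in group_sources_by_port(chain).items():
            print(f"{port!r} held by: {', '.join(sources)}")
```

A check like this does not tell you which copy is right. It only refuses to proceed when the copies disagree, which is the point at which a human should look.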
And there are other things that may be even more effective, such as hiding your
human errors. This is commonly called a "maintenance window" and it involves
an absolute ban on making any network change, no matter how trivial, outside
of a maintenance window. The human error can still occur but because it is
in a maintenance window, the customer either doesn't notice, or if it is planned
maintenance, they don't complain because they are expecting a bit of disruption
and have agreed to the planned maintenance window.
That only leaves break-fix work which is where the most skilled and trusted
engineers work on the live network outside of maintenance windows to fix
stuff that is seriously broken. It sounds like the event in the original posting
was something like that, but perhaps not, because this kind of break-fix work
should only be done when there is already a customer-affecting issue.
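
The policy described above (no changes outside the window, except break-fix work during a customer-affecting issue) could be enforced by a small gate in front of whatever tool pushes changes. This is a minimal sketch, not a real tool: the 02:00 to 05:00 window, the function names, and the customer_affecting_issue flag are all assumptions chosen for illustration.

```python
from datetime import datetime, time

# Hypothetical window: 02:00 to 05:00 local time.
WINDOW_START = time(2, 0)
WINDOW_END = time(5, 0)


def in_maintenance_window(now, start=WINDOW_START, end=WINDOW_END):
    """Return True if the datetime `now` falls inside the window."""
    t = now.time()
    if start <= end:
        return start <= t < end
    # Window wraps past midnight, e.g. 23:00 to 03:00.
    return t >= start or t < end


def may_apply_change(now, customer_affecting_issue=False):
    """Allow a change inside the window, or as break-fix during an outage."""
    return customer_affecting_issue or in_maintenance_window(now)


if __name__ == "__main__":
    noon = datetime(2010, 2, 3, 12, 0)
    print(may_apply_change(noon))                                 # outside window
    print(may_apply_change(noon, customer_affecting_issue=True))  # break-fix
```

The gate cannot judge whether an issue really is customer-affecting; someone trusted still has to set that flag, so the human is moved rather than removed.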
By the way, even break-fix changes can be, and should be, tested in a lab
environment before you push them onto the network.