[121971] in North American Network Operators' Group
Re: Mitigating human error in the SP
daemon@ATHENA.MIT.EDU (Michael Dillon)
Tue Feb 2 20:33:33 2010
In-Reply-To: <bb0e440a1002011828u6b68e5e6p9c8a45c620c245bb@mail.gmail.com>
Date: Wed, 3 Feb 2010 01:30:00 +0000
From: Michael Dillon <wavetossed@googlemail.com>
To: Suresh Ramasubramanian <ops.lists@gmail.com>
Cc: nanog@nanog.org
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org
> Automated config deployment / provisioning.  And sanity checking
> before deployment.
Easy to say, not so easy to do. For instance, that incorrect port was
identified by a number or name. Theoretically, if an automated tool pulls
the number/name from a database and issues the command, then the error
cannot happen. But how does the number/name get into the database?
I've seen a situation where a human being enters that number, copying it
from another application screen. We hope that it is done by copy/paste all
the time but who knows? And even copy/paste can make mistakes if the
selection is done by mouse by someone who isn't paying enough attention.
But wait! How did the other application come up with that number for
copying? Actually, it was copy-pasted from yet a third application, and
that application got it by copy paste from a spreadsheet.
It is easy to create a tangled mess of OSS applications that are glued
together by lots of manual human effort creating numerous opportunities
for human error.
So while I wholeheartedly support automation of network configuration,
that is not a magic bullet. You also need to pay attention to the whole
process, the whole chain of information flow.
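One way to watch the whole chain rather than just the last hop is to
compare the port identifier that each system holds for the same circuit
and refuse to deploy when they disagree. The following is only a rough
sketch of that idea: the function name, the system names, and the port
naming pattern are all made up for illustration, and a real naming scheme
will differ by vendor.

```python
import re

# Hypothetical port-name pattern (an assumption, not any vendor's rule).
PORT_PATTERN = re.compile(r"(ge|xe|et)-\d+/\d+/\d+")


def check_port_record(circuit_id, sources):
    """Return a list of problems found for one circuit.

    `sources` maps a system name (spreadsheet, inventory, provisioning
    database) to the port identifier that system holds for `circuit_id`.
    An empty list means every copy is well formed and they all agree.
    """
    problems = []
    for system, port in sources.items():
        cleaned = port.strip()
        # A sloppy mouse selection often drags in surrounding whitespace.
        if cleaned != port:
            problems.append(f"{system}: stray whitespace in {port!r}")
        # A truncated or mistyped value usually fails the naming pattern.
        if not PORT_PATTERN.fullmatch(cleaned):
            problems.append(f"{system}: {port!r} is not a valid port name")
    # Every system in the chain should hold the same value.
    distinct = {port.strip() for port in sources.values()}
    if len(distinct) > 1:
        problems.append(f"{circuit_id}: systems disagree: {sorted(distinct)}")
    return problems
```

A deployment tool could call this with the values pulled from each
application and stop if the returned list is not empty. It does not
prove the port is the right one, only that the copies are consistent.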
And there are other things that may be even more effective such as hiding
your human errors. This is commonly called a "maintenance window" and it
involves an absolute ban on making any network change, no matter how
trivial, outside of a maintenance window. The human error can still occur
but because it is in a maintenance window, the customer either doesn't
notice, or if it is planned maintenance, they don't complain because they
are expecting a bit of disruption and have agreed to the planned
maintenance window.
That only leaves break-fix work which is where the most skilled and trusted
engineers work on the live network outside of maintenance windows to fix
stuff that is seriously broken. It sounds like the event in the original
posting was something like that, but perhaps not, because this kind of
break-fix work should only be done when there is already a
customer-affecting issue.
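If changes go out through a tool, the rule above (nothing outside the
window unless it is break-fix) can be checked by the tool itself instead
of relying on people remembering it. A minimal sketch, assuming a daily
window of 02:00 to 05:00 local time; the window hours and function names
are invented for the example.

```python
from datetime import datetime, time

# Hypothetical window: 02:00 to 05:00 local time, every day.
WINDOW_START = time(2, 0)
WINDOW_END = time(5, 0)


def in_maintenance_window(now):
    """True if `now` falls inside the daily maintenance window."""
    return WINDOW_START <= now.time() < WINDOW_END


def may_push_change(now, break_fix=False):
    """Allow a change inside the window, or at any time for break-fix work."""
    return break_fix or in_maintenance_window(now)
```

For example, may_push_change(datetime(2010, 2, 3, 14, 0)) is refused,
while the same call with break_fix=True is allowed. Who is permitted to
set that flag is a process question that the code cannot answer.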
By the way, even break-fix changes can be, and should be, tested in a lab
environment before you push them onto the network.
--Michael Dillon