[121948] in North American Network Operators' Group
Re: Mitigating human error in the SP
daemon@ATHENA.MIT.EDU (Jared Mauch)
Tue Feb 2 12:34:29 2010
From: Jared Mauch <jared@puck.nether.net>
In-Reply-To: <e7667f301002011821ifcc5be5pf1017cb30b7ea6dc@mail.gmail.com>
Date: Tue, 2 Feb 2010 12:33:50 -0500
To: Chadwick Sorrell <mirotrem@gmail.com>
Cc: nanog@nanog.org
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org
We have solved 98% of this with standard configurations and templates.
Deviating from them requires management approval (an exception) after an
evaluation of the business risks.
Automating config building is not too hard, and features like
peer-groups (Cisco) and configuration groups (Juniper) make it easier.
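A rough sketch of the template idea, in Python. The template text, the
field names, and render_port() are all made up for illustration, and the
config syntax is only loosely IOS-like; treat it as a starting point,
not as anyone's production tooling.

```python
from string import Template

# Hypothetical standard template for a customer-facing port.
# The config lines are illustrative, not a vetted device config.
CUSTOMER_PORT = Template(
    "interface $ifname\n"
    " description CUST:$customer:$circuit_id\n"
    " no shutdown\n"
)

# The only inputs an engineer is allowed to supply (assumed field names).
REQUIRED_FIELDS = ("ifname", "customer", "circuit_id")


def render_port(params):
    """Render the standard template, rejecting missing or unexpected input."""
    missing = [f for f in REQUIRED_FIELDS if not params.get(f)]
    extra = sorted(set(params) - set(REQUIRED_FIELDS))
    if missing or extra:
        # Anything outside the template is a deviation and needs an exception.
        raise ValueError(f"missing={missing} extra={extra}")
    return CUSTOMER_PORT.substitute(params)


if __name__ == "__main__":
    print(render_port(
        {"ifname": "Gi0/1", "customer": "ACME", "circuit_id": "C123"}
    ))
```

The point of the strict field check is that the engineer fills in a
handful of values and never hand-types the config itself, so a request
that does not fit the template fails loudly instead of being improvised.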
If you go for the holy grail, you want something that takes into account
the following:
1) each phase in the provisioning/turn-up state
2) each phase in infrastructure troubleshooting (turn-up, temporary
outage/temporary testing, production)
3) automated pushing of config via load override/commit replace to your
config space.
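The three points above can be sketched roughly as follows, in Python.
The phase names come from the list; the transition rules, the
description tag, and the function names are assumptions for
illustration. The actual push (load override, commit replace, or
whatever your platform offers) is deliberately left out, since it
depends on the device and the tooling you already have.

```python
from enum import Enum


class Phase(Enum):
    TURN_UP = "turn-up"
    TESTING = "testing"
    PRODUCTION = "production"


# Assumed transition rules; a real workflow would define its own.
ALLOWED = {
    Phase.TURN_UP: {Phase.TESTING, Phase.PRODUCTION},
    Phase.TESTING: {Phase.TURN_UP, Phase.PRODUCTION},
    Phase.PRODUCTION: {Phase.TESTING},
}


def advance(current, target):
    """Return the new phase, refusing transitions the workflow forbids."""
    if target not in ALLOWED[current]:
        raise ValueError(f"{current.value} -> {target.value} not allowed")
    return target


def build_full_config(ports):
    """Build the complete config text from recorded per-port phases.

    The whole config is regenerated each time, so the result is suited
    to a replace-style push rather than an incremental hand edit.
    """
    lines = []
    for ifname in sorted(ports):
        lines.append(f"interface {ifname}")
        lines.append(f" description PHASE:{ports[ifname].value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    ports = {"ge-0/0/1": Phase.TURN_UP}
    ports["ge-0/0/1"] = advance(ports["ge-0/0/1"], Phase.PRODUCTION)
    print(build_full_config(ports))
```

Because the config is always derived from the recorded state, a change
made on the wrong port shows up as a difference between what the tool
generates and what is on the box, rather than going unnoticed.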
Obviously testing, etc. is important. I've found that whenever a human
is involved, mistakes happen. There is also the "software is imperfect"
mantra that should be repeated. I find vendors at times have demanding
customers who want perfection. Bugs happen and outages happen; the
question is how you respond to these risks.
If your process handles bugs, outages, etc. poorly, or you are
gridlocked on decisions, very bad things happen.
- Jared
On Feb 1, 2010, at 9:21 PM, Chadwick Sorrell wrote:
> Hello NANOG,
>
> Long time listener, first time caller.
>
> A recent organizational change at my company has put someone in charge
> who is determined to make things perfect. We are a service provider,
> not an enterprise company, and our business is doing provisioning work
> during the day. We recently experienced an outage when an engineer,
> troubleshooting a failed turn-up, changed the ethertype on the wrong
> port losing both management and customer data on said device. This
> isn't a common occurrence, and the engineer in question has a pristine
> track record.
>
> This outage, of a high profile customer, triggered upper management to
> react by calling a meeting just days after. Put bluntly, we've been
> told "Human errors are unacceptable, and they will be completely
> eliminated. One is too many."
>
> I am asking the respectable NANOG engineers....
>
> What measures have you taken to mitigate human mistakes?
>
> Have they been successful?
>
> Any other comments on the subject would be appreciated, we would like
> to come to our next meeting armed and dangerous.
>
> Thanks!
> Chad