[121948] in North American Network Operators' Group


Re: Mitigating human error in the SP

daemon@ATHENA.MIT.EDU (Jared Mauch)
Tue Feb 2 12:34:29 2010

From: Jared Mauch <jared@puck.nether.net>
In-Reply-To: <e7667f301002011821ifcc5be5pf1017cb30b7ea6dc@mail.gmail.com>
Date: Tue, 2 Feb 2010 12:33:50 -0500
To: Chadwick Sorrell <mirotrem@gmail.com>
Cc: nanog@nanog.org
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org

We have solved 98% of this with standard configurations and templates.

Deviating from them requires management approval/exception approval
after an evaluation of the business risks.

Automation of config building is not too hard, and certainly things like
peer-groups (Cisco) and regular groups (Juniper) make it easier.
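
As a rough illustration of template-driven config building, here is a
minimal Python sketch. The template text, the peer-group names, and the
render_neighbor helper are all assumptions made up for this example, not
anyone's production setup. It renders a Cisco-style BGP neighbor stanza
from one fixed template and rejects input that falls outside the
standard, which is where an exception approval would come in.

```python
import ipaddress
from string import Template

# Hypothetical standard template: per-customer lines only attach the
# neighbor to a shared peer-group, so routing policy lives in one place.
NEIGHBOR_TEMPLATE = Template(
    " neighbor $peer_ip remote-as $peer_as\n"
    " neighbor $peer_ip peer-group $peer_group\n"
    " neighbor $peer_ip description $description\n"
)

# Assumed names; anything else counts as a deviation from the standard.
ALLOWED_PEER_GROUPS = {"CUSTOMER-V4", "PEER-V4"}


def render_neighbor(peer_ip, peer_as, peer_group, description):
    """Render one neighbor stanza, refusing non-standard input."""
    ipaddress.ip_address(peer_ip)  # raises ValueError on a bad address
    if not 1 <= int(peer_as) <= 4294967295:
        raise ValueError(f"AS number out of range: {peer_as}")
    if peer_group not in ALLOWED_PEER_GROUPS:
        raise ValueError(f"non-standard peer-group: {peer_group}")
    if "\n" in description or "\r" in description:
        raise ValueError("description must be a single line")
    return NEIGHBOR_TEMPLATE.substitute(
        peer_ip=peer_ip,
        peer_as=int(peer_as),
        peer_group=peer_group,
        description=description,
    )


if __name__ == "__main__":
    print(render_neighbor("192.0.2.1", 64500, "CUSTOMER-V4", "ExampleCo"))
```

The point of the sketch is that the engineer supplies a handful of
values and never types the config lines by hand.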

If you go for the holy grail, you want something that takes into account
the following:

1) each phase in the provisioning/turn-up state
2) each phase in infrastructure troubleshooting (turn-up, temporary
outage/temporary testing, production)
3) automated pushing of config via load override/commit replace to your
config space.
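
The three points above could be sketched roughly as follows. This is
only an outline under stated assumptions: the Phase names, the
transition table, the port table layout, and the generic config lines
are all hypothetical and do not follow any particular vendor's syntax.
The idea is that each port carries its phase as data, phase changes are
checked against a policy, and the complete config is regenerated every
time so it can be pushed wholesale with load override/commit replace
(the push step itself is left out).

```python
from enum import Enum


class Phase(Enum):
    TURN_UP = "turn-up"
    TESTING = "testing"  # temporary outage / temporary testing
    PRODUCTION = "production"


# Assumed policy: phase changes allowed without an exception approval.
ALLOWED_TRANSITIONS = {
    Phase.TURN_UP: {Phase.TESTING, Phase.PRODUCTION},
    Phase.TESTING: {Phase.TURN_UP, Phase.PRODUCTION},
    Phase.PRODUCTION: {Phase.TESTING},
}


def change_phase(ports, port, new_phase):
    """Return a copy of the port table with one port moved to new_phase."""
    current = ports[port]["phase"]
    if new_phase not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(
            f"{port}: {current.value} -> {new_phase.value} needs an exception"
        )
    updated = {name: dict(attrs) for name, attrs in ports.items()}
    updated[port]["phase"] = new_phase
    return updated


def build_full_config(hostname, ports):
    """Build the whole device config from data, never a partial edit."""
    lines = [f"hostname {hostname}"]
    for name in sorted(ports):
        attrs = ports[name]
        lines.append(f"interface {name}")
        lines.append(f" description {attrs['phase'].value}: {attrs['customer']}")
        if attrs["phase"] is Phase.TURN_UP:
            lines.append(" shutdown")  # assumed: ports stay down until tested
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    table = {"ge-0/0/1": {"customer": "ExampleCo", "phase": Phase.TURN_UP}}
    table = change_phase(table, "ge-0/0/1", Phase.TESTING)
    print(build_full_config("edge1", table), end="")
```

Because the full file is rebuilt from the port table, a change made on
the wrong port shows up as a diff against the intended state before
anything is pushed.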

Obviously testing, etc. is important.  I've found that whenever a human
is involved, mistakes happen.  There is also the "software is imperfect"
mantra that should be repeated.  I find vendors at times have demanding
customers who want perfection.  Bugs happen and outages happen; the
question is how you respond to these risks.

If your process handles bugs, outages, etc. poorly, or you are
decision-gridlocked, very bad things happen.

- Jared

On Feb 1, 2010, at 9:21 PM, Chadwick Sorrell wrote:

> Hello NANOG,
>
> Long time listener, first time caller.
>
> A recent organizational change at my company has put someone in charge
> who is determined to make things perfect.  We are a service provider,
> not an enterprise company, and our business is doing provisioning work
> during the day.  We recently experienced an outage when an engineer,
> troubleshooting a failed turn-up, changed the ethertype on the wrong
> port losing both management and customer data on said device.  This
> isn't a common occurrence, and the engineer in question has a pristine
> track record.
>
> This outage, of a high profile customer, triggered upper management to
> react by calling a meeting just days after.  Put bluntly, we've been
> told "Human errors are unacceptable, and they will be completely
> eliminated.  One is too many."
>
> I am asking the respectable NANOG engineers....
>
> What measures have you taken to mitigate human mistakes?
>
> Have they been successful?
>
> Any other comments on the subject would be appreciated, we would like
> to come to our next meeting armed and dangerous.
>
> Thanks!
> Chad


