[121935] in North American Network Operators' Group
Re: Mitigating human error in the SP
daemon@ATHENA.MIT.EDU (Paul Corrao)
Tue Feb 2 09:10:32 2010
From: Paul Corrao <pcorrao@voxeo.com>
In-Reply-To: <20100202234629.7c7cf8cf@opy.nosense.org>
Date: Tue, 2 Feb 2010 09:09:30 -0500
To: Mark Smith <nanog@85d5b20a518b8f6864949bd940457dc124746ddc.nosense.org>
Cc: nanog@nanog.org
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org
Humans make errors.

For your upper management to think they can build a foundation of reliability on the theory that humans won't make errors is self-deceiving.

But that isn't where the story ends. That's where it begins. Your infrastructure, processes and tools should all be designed with that in mind, so as to reduce or eliminate the impact that human error will have on the reliability of the service you provide to your customers.
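
To make that concrete, here is a rough sketch (Python, with a made-up ticket format and names, so take it as an illustration rather than a recipe) of the kind of guard rail I mean: a small wrapper that refuses to push a change unless the device and port the engineer typed match the ones named in the change ticket. It won't stop every mistake, but it turns "fat-fingered the port" into an error message instead of an outage.

#!/usr/bin/env python3
# Hypothetical pre-change guard: refuse to run a provisioning change
# unless the device and port given on the command line match the ones
# named in the change ticket. The two-line ticket format (device on
# line 1, port on line 2) and the script name are invented.
import sys

def load_ticket(path):
    with open(path) as f:
        device, port = [line.strip() for line in f.readlines()[:2]]
    return device, port

def main():
    if len(sys.argv) != 4:
        sys.exit("usage: change-guard <ticket-file> <device> <port>")
    ticket, device, port = sys.argv[1:4]
    want_device, want_port = load_ticket(ticket)
    if (device, port) != (want_device, want_port):
        sys.exit("REFUSED: ticket names %s %s, you asked for %s %s"
                 % (want_device, want_port, device, port))
    # Only now hand off to whatever actually pushes the config.
    print("OK: %s %s matches the ticket, proceeding" % (device, port))

if __name__ == "__main__":
    main()
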
So, for the example you gave, there are a few things that could be put in place. The first one, already mentioned by Chad, is that mission-critical services should not be designed with single points of failure - that situation should be remediated.
Another question to be asked - since this was provisioning work, and it was apparently being done on production equipment, could the work have been done at a time of day (or night) when an error would not have been as much of a problem?
You don't say how long the outage lasted, but given the reaction from your upper management, I would infer that it lasted for a while. That raises the next question: who besides the engineer making the mistake was aware that work was occurring on production equipment? This matters because a NOC that knows work is in progress has a leg up on locating the problem once the trouble notification comes in.
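
Even something as crude as the following sketch helps there: before touching a production box, the engineer drops a timestamped "work starting" line somewhere the NOC actually watches, so the first place they look when the phone rings is the device somebody just touched. The log path and message format below are invented; in practice this would be your ticketing system, a chat channel, or the change calendar.

#!/usr/bin/env python3
# Hypothetical "heads-up" logger: record who is about to touch which
# production device, somewhere the NOC watches. The log path and
# message format are invented for illustration.
import getpass
import sys
from datetime import datetime, timezone

NOC_LOG = "/var/log/noc/active-work.log"

def announce(device, summary):
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    line = "%s %s WORK-START %s: %s\n" % (stamp, getpass.getuser(),
                                          device, summary)
    with open(NOC_LOG, "a") as f:
        f.write(line)
    print(line, end="")

if __name__ == "__main__":
    if len(sys.argv) < 3:
        sys.exit("usage: noc-announce <device> <summary ...>")
    announce(sys.argv[1], " ".join(sys.argv[2:]))
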
Paul
On Feb 2, 2010, at 8:16 AM, Mark Smith wrote:
> On Mon, 1 Feb 2010 21:21:52 -0500
> Chadwick Sorrell <mirotrem@gmail.com> wrote:
>
>> Hello NANOG,
>>
>> Long time listener, first time caller.
>>
>> A recent organizational change at my company has put someone in charge
>> who is determined to make things perfect. We are a service provider,
>> not an enterprise company, and our business is doing provisioning work
>> during the day. We recently experienced an outage when an engineer,
>> troubleshooting a failed turn-up, changed the ethertype on the wrong
>> port losing both management and customer data on said device. This
>> isn't a common occurrence, and the engineer in question has a pristine
>> track record.
>>
>
> Why didn't the customer have a backup link if their service was so
> important to them and indirectly your upper management? If your
> upper management are taking this problem that seriously, then your
> *sales people* didn't do their job properly - they should be ensuring
> that customers with high availability requirements have a backup link,
> or aren't led to believe that the single-point-of-failure service will
> be highly available.
>
>
>> This outage, of a high profile customer, triggered upper management to
>> react by calling a meeting just days after. Put bluntly, we've been
>> told "Human errors are unacceptable, and they will be completely
>> eliminated. One is too many."
>>
>
> If upper management don't understand that human error is a risk factor
> that can't be completely eliminated, then I suggest "self-eliminating"
> and find yourself a job somewhere else. The only way you'll avoid
> human error having any impact on production services is to not change
> anything - which pretty much means not having a job anyway ...
>
>
>> I am asking the respectable NANOG engineers....
>>
>> What measures have you taken to mitigate human mistakes?
>>
>> Have they been successful?
>>
>> Any other comments on the subject would be appreciated, we would like
>> to come to our next meeting armed and dangerous.
>>
>> Thanks!
>> Chad
>>
>