[121940] in North American Network Operators' Group
Re: Mitigating human error in the SP
daemon@ATHENA.MIT.EDU (Chadwick Sorrell)
Tue Feb 2 10:14:44 2010
In-Reply-To: <D54A839A-5CE0-4E4E-BE32-216F6AD88515@voxeo.com>
Date: Tue, 2 Feb 2010 10:14:10 -0500
From: Chadwick Sorrell <mirotrem@gmail.com>
To: Paul Corrao <pcorrao@voxeo.com>
Cc: nanog@nanog.org
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org
On Tue, Feb 2, 2010 at 9:09 AM, Paul Corrao <pcorrao@voxeo.com> wrote:
> Humans make errors.
>
> For your upper management to think they can build a foundation of
> reliability on the theory that humans won't make errors is self
> deceiving.
>
> But that isn't where the story ends. That's where it begins. Your
> infrastructure, processes and tools should all be designed with that
> in mind so as to reduce or eliminate the impact that human error will
> have on the reliability of the service you provide to your customers.
>
> So, for the example you gave there are a few things that could be put
> in place. The first one, already mentioned by Chad, is that mission
> critical services should not be designed with single points of
> failure - that situation should be remediated.
Agreed.
> Another question to be asked - since this was provisioning work being
> done, and it was apparently being done on production equipment, could
> the work have been done at a time of day (or night) when an error
> would not have been as much of a problem?
As it stands now, businesses want to turn their services up when they
are in the office. We do all new turn-ups during the day; anything
requiring a roll or maintenance window is scheduled in the middle of
the night.
> You don't say how long the outage lasted, but given the reaction by
> your upper management, I would infer that it lasted for a while. That
> raises the next question. Who besides the engineer making the mistake
> was aware of the fact that work on production equipment was occurring?
> The reason this is important is because having the NOC know that work
> is occurring would give them a leg up on locating where the problem is
> once they get the trouble notification.
The actual error happened when someone was troubleshooting a turn-up,
where in the past the customer in question has had their ethertype set
wrong. It wasn't a provisioning problem as much as someone
troubleshooting why it didn't come up with the customer. Ironically,
the NOC was on the phone when it happened, and the switch was rebooted
almost immediately and the outage lasted 5 minutes.
Chad