[121969] in North American Network Operators' Group
Re: Mitigating human error in the SP
daemon@ATHENA.MIT.EDU (Chadwick Sorrell)
Tue Feb 2 20:29:14 2010
In-Reply-To: <4B686867.9080602@gmail.com>
Date: Tue, 2 Feb 2010 20:28:44 -0500
From: Chadwick Sorrell <mirotrem@gmail.com>
To: nanog@nanog.org
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org
Thanks for all the comments!
On Tue, Feb 2, 2010 at 1:01 PM, JC Dill <jcdill.lists@gmail.com> wrote:
> Chadwick Sorrell wrote:
>>
>> This outage, of a high profile customer, triggered upper management to
>> react by calling a meeting just days after. Put bluntly, we've been
>> told "Human errors are unacceptable, and they will be completely
>> eliminated. One is too many."
>
> Good, Fast, Cheap - pick any two. No you can't have all three.
>
> Here, Good is defined by your pointy-haired bosses as an
> impossible-to-achieve zero error rate.[1] Attempting to achieve this is
> either going to cost $$$, or your operations speed (how long it takes
> people to do things) is going to drop like a rock. Your first action
> should be to make sure upper management understands this so they can set
> the appropriate priorities on Good, Fast, and Cheap, and make the
> appropriate budget changes.
>
> It's going to cost $$$ to hire enough people to have the staff necessary
> to double-check things in a timely manner, OR things are going to slow
> way down as the existing staff is burdened by necessary double-checking
> of everything and triple-checking of some things required to try to
> achieve a zero error rate. They will also need to spend $$$ on software
> (to automate as much as possible) and testing equipment. They will also
> never actually achieve a zero error rate as this is an impossible task
> that no organization has ever achieved, no matter how much emphasis or
> money they pour into it (e.g. Windows vulnerabilities) or how important
> (see Challenger, Columbia, and the Mars Climate Orbiter incidents).
>
> When you put a $$$ cost on trying to achieve a zero error rate,
> pointy-haired bosses are usually willing to accept a normal error rate.
> Of course, they want you to try to avoid errors, and there are a lot of
> simple steps you can take in that effort (basic checklists, automation,
> testing) which have been mentioned elsewhere in this thread that will
> cost some money but not the $$$ that is required to try to achieve a
> zero error rate. Make sure they understand that the budget they allocate
> for these changes will be strongly correlated to how Good (zero error
> rate) and Fast (quick operational responses to turn-ups and problems)
> the outcome of this initiative is.
>
> jc
>
> [1] http://www.godlessgeeks.com/LINKS/DilbertQuotes.htm
>
> 2. "What I need is a list of specific unknown problems we will encounter."
> (Lykes Lines Shipping)
>
> 6. "Doing it right is no excuse for not meeting the schedule." (R&D
> Supervisor, Minnesota Mining & Manufacturing/3M Corp.)