[44425] in North American Network Operators' Group
Re: Followup British Telecom outage reason
daemon@ATHENA.MIT.EDU (Ian Duncan)
Mon Nov 26 10:49:30 2001
Message-ID: <3C0263E9.497C3B3C@sympatico.ca>
Date: Mon, 26 Nov 2001 10:46:49 -0500
From: Ian Duncan <Ian.Duncan@sympatico.ca>
MIME-Version: 1.0
To: nanog@merit.edu
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Errors-To: owner-nanog-outgoing@merit.edu
Wandering off the subject of BT's misfortune ...
Sean Donelan wrote:
> On Mon, 26 Nov 2001, Christian Kuhtz wrote:
[...]
>
> > Faults will happen. And nothing matters as much as how you prepare for
> > when they do.
>
> Mean Time To Repair is a bigger contributor to Availability calculations
> than the Mean Time To Failure. It would be great if things never failed.
And Mean Time To Fault Detected (Accurately) is usually the biggest
sub-contributor within Repair but that's kinda your point.
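The point about MTTR dominating can be made concrete with the standard steady-state availability formula, A = MTTF / (MTTF + MTTR). A minimal sketch (the numbers are illustrative, not from the thread):

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up.

    A = MTTF / (MTTF + MTTR)
    """
    return mttf_hours / (mttf_hours + mttr_hours)

# Halving MTTR buys exactly as much availability as doubling MTTF:
base = availability(1000.0, 2.0)           # 1000/1002
faster_repair = availability(1000.0, 1.0)  # 1000/1001
longer_life = availability(2000.0, 2.0)    # 2000/2002 == 1000/1001
```

Since repair time is often easier to cut (better detection, as noted above) than failure time is to stretch, working the MTTR side of the fraction tends to pay off first.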
>
> But some people are making their systems so complicated chasing the Holy
> Grail of 100% uptime, they can't figure out what happened when it does
> fail.
Similar people pursue the creation of a perpetuum mobile. A strange and somewhat
congruent example I stumbled into recently is:
http://www.sce.carleton.ca/netmanage/perpetum.shtml.
Overall simplicity of the system, including its failure-detection mechanisms, and
real redundancy are the most reliable tools for availability. Of course, popping
just a few layers out, profit and politics are elements of most systems.
> Murphy's revenge: The more reliable you make a system, the longer it will
> take you to figure out what's wrong when it breaks.
Hmm.