[44415] in North American Network Operators' Group


Re: Followup British Telecom outage reason

daemon@ATHENA.MIT.EDU (Wayne E. Bouchard)
Sat Nov 24 15:09:34 2001

Date: Sat, 24 Nov 2001 13:08:37 -0700
From: "Wayne E. Bouchard" <web@typo.org>
To: "Neil J. McRae" <neil@DOMINO.ORG>
Cc: Sean Donelan <sean@donelan.com>, nanog@merit.edu
Message-ID: <20011124130837.A26006@typo.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20011124110521.34F8AED56@equinox.DOMINO.ORG>; from neil@DOMINO.ORG on Sat, Nov 24, 2001 at 11:05:20AM +0000
Errors-To: owner-nanog-outgoing@merit.edu


They probably did. The vendor probably did also. Of course, they can't
always simulate real network conditions. Nor can your own labs. Heck,
even a small deployment on 2 or 3 routers (out of, say, 200) can't
catch everything. It is a simple fact that some bugs don't show up
until it's too late.

And cascade failures occur more often than you might think (and not
necessarily from software). Remember the AT&T frame outage? Procedural
error. How about the Netcom outage of a few years ago? Someone
misplaced a '.*' if I remember correctly. Human error of the simplest
kind. I've had a data center go offline because someone slipped and
turned off one side of a large breaker box.

These things happen.

The challenge is to eliminate the ones you CAN control. And, IMO, the
industry is generally doing a good job of that.

I chalk this whole thing up to bad karma for BT.

-Wayne

On Sat, Nov 24, 2001 at 11:05:20AM +0000, Neil J. McRae wrote:
> 
> > 
> > 
> > BT is telling ISPs the reason for the multi-hour outage was
> > a software bug in the interface cards used in BT's core network.
> > BT installed a new version of the software.  When that didn't fix
> > the problem, they fell back to a previous version of the software.
> > 
> > BT didn't identify the vendor, but BT is identified as a "Cisco Powered
> > Network(tm)."  Non-BT folks believe the problem was with GSR interface
> > cards.  I can't independently confirm it.
> > 
> 
> I'd be surprised if it was the GSR, and in any case that doesn't
> absolve anyone. If it was a software issue, why wasn't the software
> properly tested? Why was such a critical upgrade rolled out across
> the entire network at the same time? It doesn't add up.
> 
> Neil.

---
Wayne Bouchard
web@typo.org
Network Engineer
http://www.typo.org/~web/resume.html
