[44413] in North American Network Operators' Group
Re: Followup British Telecom outage reason
daemon@ATHENA.MIT.EDU (Sean Donelan)
Sat Nov 24 14:14:36 2001
Date: Sat, 24 Nov 2001 14:16:38 -0500 (EST)
From: Sean Donelan <sean@donelan.com>
To: "Neil J. McRae" <neil@DOMINO.ORG>
Cc: nanog@merit.edu
In-Reply-To: <20011124110521.34F8AED56@equinox.DOMINO.ORG>
Message-ID: <Pine.GSO.4.40.0111241400160.3405-100000@clifden.donelan.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Errors-To: owner-nanog-outgoing@merit.edu
On Sat, 24 Nov 2001, Neil J. McRae wrote:
> I'd be surprised if it was the GSR, and in anycase that doesn't
> absolve anyone. If it was a software issue- why wasn't the software
> properly tested? Why was such a critical upgrade rolled out across
> the entire network at the same time? It doesn't add up.
It appears to be yet another CEF bug. If you want to use a GSR
you are stuck using some version of IOS with a CEF bug. The
question is which bug do you want. Each version of IOS has
a slightly different set. Several US network providers have also
been bitten by CEF bugs.
While trying to fix one set of bugs, BT upgraded their network.
I'm not sure if they were upgrading at 9am, or had upgraded
earlier and the bug finally came out under load at 9am.
When the BT network melted down, Cisco suggested installing a
different version of IOS, which had previously been tested. At
noon, BT found the new version had an even worse bug, sending packets
out the wrong interface. It wasn't until 2200 (13 hours later) that BT
and Cisco found a version of IOS which stabilized the network. "Stabilized,"
not fixed. The running version of IOS still has a bug, but it isn't
as severe.