[27026] in North American Network Operators' Group
Re: Worldcom and Qwest switch places
daemon@ATHENA.MIT.EDU (michael.dillon@gtsip.net)
Mon Feb 7 14:34:02 2000
Date: 7 Feb 2000 19:28:41 +0000
Message-ID: <20000207192841.4082.cpmta@c000.muc.cp.net>
Content-Type: text/plain
Content-Disposition: inline
Mime-Version: 1.0
To: nanog@merit.edu
From: michael.dillon@gtsip.net
Errors-To: owner-nanog-outgoing@merit.edu
On Sat, 05 February 2000, Sean Donelan wrote:
> Since Lucent equipment was also involved in the 10 days of Worldcom problems,
> is there a common root cause between the Worldcom's problems and Qwest's
> problems? Is there some lesson other providers should be learning from
> these events? Or is each service provider expected to learn and re-learn
> these lessons individually? Is there some network design decision engineers
> are getting wrong?
Lucent people told me that the Worldcom problem resulted from a software upgrade to Worldcom's Lucent switches that was done without a good fallback plan. Lucent engineers had recommended a different strategy, but Worldcom went ahead and did it their way. The software upgrade then triggered some kind of cascading problem that either affected the old code or travelled through the network, or both.
In other words, they created a problem as a side effect of the upgrade but didn't have a good strategy to contain or kill the problem, which propagated like some kind of living organism. Seems to me that we *HAVE* seen this type of problem before on the Internet, with things like the AS7007 routes which seemed to hang around parts of the net for days.
How do you plan to roll back to a known state when you can't simply backtrack or reverse your actions?
---
Michael Dillon Phone: +44 (20) 7769 8489
Mobile: +44 (79) 7099 2658
Director of Product Engineering, GTS IP Services
151 Shaftesbury Ave.
London WC2H 8AL
UK