[55711] in North American Network Operators' Group
Re: Cascading Failures Could Crash the Global Internet
daemon@ATHENA.MIT.EDU (Marshall Eubanks)
Sun Feb 9 12:44:15 2003
Date: Sun, 9 Feb 2003 12:35:27 -0500
Cc: "Stewart, William C (Bill), SALES" <billstewart@att.com>,
<nanog@trapdoor.merit.edu>
To: "Jack Bates" <jbates@brightok.net>
From: Marshall Eubanks <tme@multicasttech.com>
In-Reply-To: <003501c2d04d$046e6c80$43174241@jackdell>
Errors-To: owner-nanog-outgoing@merit.edu
Hello;
A packet-switched network can be engineered against cascading failures
in a way that is hard to do for a circuit-switched network. Every time you
see a random wait in a protocol, it's a good bet that the protocol writers
were trying to protect against the tight coupling that leads to cascading
failures.
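
To illustrate the random-wait idea, here is a minimal sketch of a jittered
exponential backoff. It is not taken from any particular protocol: the
function names, the base and cap parameters, and the "full jitter" choice
are all assumptions made for the example. The point is only that clients
which fail at the same instant do not all retry at the same instant.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a randomized wait in seconds before retry number `attempt` (0-based).

    Assumed scheme ("full jitter"): pick uniformly between 0 and
    min(cap, base * 2**attempt), so the window grows with each failure
    but the exact retry time is decoupled between clients.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def retry_times(n_clients, attempt, seed=0):
    """Hypothetical helper: sorted retry offsets for clients that failed together."""
    rng = random.Random(seed)  # seeded only to make the illustration repeatable
    return sorted(backoff_delay(attempt, rng=rng) for _ in range(n_clients))


if __name__ == "__main__":
    # 100 clients on their 4th retry spread out over an 8-second window
    # instead of hitting the recovering link in one synchronized burst.
    times = retry_times(100, attempt=3)
    print(f"first retry at {times[0]:.2f}s, last at {times[-1]:.2f}s")
```

Without the random term every client would come back at exactly the same
offset, which is the kind of tight coupling that lets one overload trigger
the next.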
Regards
Marshall Eubanks
On Sunday, February 9, 2003, at 10:07 AM, Jack Bates wrote:
>
> From: "Stewart, William C (Bill), SALES"
>
>>
>> I think the key is that the failures described in the paper
>> are caused by overload rather than other things -
>> too much demand for power blows out the generator,
>> and without it, the grid tries to get the power from the next
>> nearest generators, which overload and fail, and try to pull an
>> even larger amount from the _next_ nearest, etc.
>> So the bit about heterogeneity is probably referring to
>> the fact that some nodes are bigger or better-connected than others,
>> and are more likely to blow out a bunch of their neighbors when
>> they fail and shed a big load.
>>
>> That's not really how Internet systems usually fail.
>
> A prime example of this theory was the large network I was using back when
> IE5 first came out. They had one bad circuit, which overloaded an ATM circuit
> at another NAP, causing it to generate bit errors. Shutting down the second
> circuit overloaded both MAE circuits, effectively shutting down the network.
> However, it required manual intervention to create full failure; otherwise
> TCP would pull back to being useless, effectively killing all connections
> going that path, but not causing an issue with other paths until the manual
> intervention of shutting down the circuit.
>
> While in theory it was still a cascade failure, it was also poor
> planning/policy on the part of the network to not be able to compensate in
> case of failure. The information provided may be partially inaccurate and is
> only hearsay concerning actual outages and effects when various
> interventions were tried; no hard fact. Thus it could be taken as solely my
> conjecture and not actual fact.
>
> -Jack
>