[140097] in North American Network Operators' Group
RE: Amazon diagnosis
daemon@ATHENA.MIT.EDU (Robert Bonomi)
Sun May 1 17:35:33 2011
Date: Sun, 1 May 2011 16:35:29 -0500 (CDT)
From: Robert Bonomi <bonomi@mail.r-bonomi.com>
To: nanog@nanog.org
In-Reply-To: <5A6D953473350C4B9995546AFE9939EE0C9E3031@RWC-EX1.corp.seven.com>
> Subject: RE: Amazon diagnosis
> Date: Sun, 1 May 2011 12:50:37 -0700
> From: George Bonser <gbonser@seven.com>
>
> They apparently had a redundant primary network and, on top of that, a
> secondary network. The secondary network, however, did not have the
> capacity of the primary network.
>
> Rather than failing over from the active portion of the primary network
> to the standby portion of the primary network, they inadvertently failed
> the entire primary network to the secondary. This resulted in the
> secondary network reaching saturation and becoming unusable.
>
> There isn't anything that can be done to mitigate against human error.
> You can TRY, but as history shows us, it all boils down to the human that
> implements the procedure. All the redundancy in the world will not do
> you an iota of good if someone explicitly does the wrong thing. ...
>
> This looks like it was a procedural error and not an architectural
> problem.
A sage sayeth sooth:
"For any 'fool-proof' system, there exists
a *sufficiently determined* fool capable of
breaking it."
It would seem that the validity of that has just been re-confirmed. <wry grin>
It is worthy of note that it is considerably harder to protect against
accidental stupidity than it is to protect against intentional malice.
('malice' is _much_ more predictable, in general. <wry grin>)