[156206] in North American Network Operators' Group


Re: Heads-Up: GoDaddy Broke the Interwebs...

daemon@ATHENA.MIT.EDU (Jared Mauch)
Tue Sep 11 17:09:55 2012

From: Jared Mauch <jared@puck.nether.net>
In-Reply-To: <CAGFn2k0BaZiNrx2VXfwKJ6XUTKt7h5i1wXzM1nWojukFx9cKrw@mail.gmail.com>
Date: Tue, 11 Sep 2012 17:08:08 -0400
To: Rubens Kuhl <rubensk@gmail.com>
Cc: NANOG list <nanog@nanog.org>
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org


On Sep 11, 2012, at 4:53 PM, Rubens Kuhl <rubensk@gmail.com> wrote:

> That doesn't mean that their description of the internal error fits
> what happened

Any time I've seen a real RFO, it has taken more than 24 hours to
collect the data.  Sometimes you actually don't know what happened.
There's a reason for this comic:
http://www.dilbert.com/strips/comic/1999-08-04/
(the reboot cleared the problem).

I've seen many odd device behaviors that nobody could explain,
including the vendors.  Sometimes it takes a few years to understand
what happened.  I recall a case where, 2-3 years after a major outage,
someone made a minor comment about their architecture and a light
came on.

I welcome more information about mistakes/errors that we can all learn
from.  Sharing that information can be hard or uncomfortable at times,
but it can help others learn and avoid making the same mistakes.  I took
the recommendation of others and have started to read "Normal
Accidents".  Amazon link: http://tinyurl.com/9dc6x98

The whole multiple-failures problem really makes me concerned about
cascading system failures when things go wrong.

- Jared

