[92012] in North American Network Operators' Group


Re: Spain was offline

daemon@ATHENA.MIT.EDU (Peter Corlett)
Thu Aug 31 12:31:53 2006

In-Reply-To: <B9843A4B77B5AD46AF1C1756FB565E2C0140D3EF@mcmail01.ad.local>
From: Peter Corlett <abuse@cabal.org.uk>
Date: Thu, 31 Aug 2006 17:30:37 +0100
To: NANOG <nanog@merit.edu>
X-SA-Exim-Rcpt-To: nanog@merit.edu, abuse@cabal.org.uk
X-SA-Exim-Mail-From: abuse@cabal.org.uk
Errors-To: owner-nanog@merit.edu


On 31 Aug 2006, at 16:30, Joseph Jackson wrote:
> I wish the article had more info since I have been wondering how a
> software upgrade downed the entire zone.

Oh, loads of ways.

> Weren't there any backup servers?

Well, a quick poke suggests, assuming a reasonably traditional setup,
that ns1.nic.es is the master and that there are various slaves, not
necessarily directly under their control. ns1.nic.es appears to be
running BIND 9.3.2, and there are other versions running on the other
nameservers. So if it *was* a software update of BIND, it probably
wasn't rolled out everywhere at once.
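
For anyone wanting to repeat that sort of poke, one common approach is a
CHAOS-class TXT query for version.bind. The sketch below is only an
illustration of that technique, not necessarily how the version above
was obtained. The function names are made up for this example, it uses
nothing beyond the Python standard library, and many operators configure
their servers to hide or fake the version string, so treat any answer
with suspicion.

```python
import socket
import struct
from typing import Optional


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed DNS labels."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_version_query(query_id: int = 0x1234) -> bytes:
    """Build a version.bind query: QTYPE 16 (TXT), QCLASS 3 (CHAOS)."""
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    question = encode_name("version.bind") + struct.pack("!HH", 16, 3)
    return header + question


def skip_name(msg: bytes, pos: int) -> int:
    """Return the offset just past the (possibly compressed) name at pos."""
    while True:
        length = msg[pos]
        if length == 0:
            return pos + 1
        if length & 0xC0 == 0xC0:  # compression pointer is two bytes long
            return pos + 2
        pos += 1 + length


def parse_version_reply(msg: bytes) -> Optional[str]:
    """Return the first TXT string of the first answer, or None."""
    _id, _flags, qd, an, _ns, _ar = struct.unpack("!HHHHHH", msg[:12])
    pos = 12
    for _ in range(qd):
        pos = skip_name(msg, pos) + 4  # skip QTYPE and QCLASS
    if an == 0:
        return None
    pos = skip_name(msg, pos)
    rtype, _rclass, _ttl, rdlen = struct.unpack("!HHIH", msg[pos:pos + 10])
    pos += 10
    if rtype != 16 or rdlen == 0:
        return None
    txt_len = msg[pos]
    return msg[pos + 1:pos + 1 + txt_len].decode("ascii", "replace")


def query_version(server: str, timeout: float = 3.0) -> Optional[str]:
    """Send the query over UDP to port 53 and return the reported version."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_version_query(), (server, 53))
        reply, _addr = sock.recvfrom(4096)
    return parse_version_reply(reply)
```

Calling query_version() with the address of each nameserver in the NS
set and comparing the results would show whether they all run the same
build. A timeout or a None result only means the server declined to say.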

OTOH, I can believe that somebody broke a Perl script critical to
generating the zone, and that it rolled out a valid, but empty,
zonefile which the secondaries faithfully replicated. Not that I've
watched cascading DNS failures at too many places with bits of crufty
Perl, oh no...
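
A cheap guard against that particular failure is to refuse to publish a
freshly generated zonefile that has shrunk dramatically compared with
the one currently being served. What follows is a minimal sketch under
stated assumptions, not anything nic.es is known to run: the function
names and the 10% threshold are invented for illustration, and the
record count is a crude line-based heuristic rather than a real zone
parser.

```python
def count_records(zone_text: str) -> int:
    """Roughly count records in a zonefile.

    Counts non-blank lines that are not comments or $-directives.
    Multi-line records (e.g. a parenthesised SOA) are over-counted,
    which is tolerable because old and new files are counted alike.
    """
    count = 0
    for line in zone_text.splitlines():
        stripped = line.split(";", 1)[0].strip()  # drop trailing comments
        if stripped and not stripped.startswith("$"):
            count += 1
    return count


def safe_to_publish(old_zone: str, new_zone: str,
                    max_shrink: float = 0.10) -> bool:
    """Return False if the new zone is empty or shrank by more than max_shrink.

    max_shrink is a hypothetical tunable: 0.10 allows the zone to lose
    up to 10% of its records between consecutive builds.
    """
    old_n = count_records(old_zone)
    new_n = count_records(new_zone)
    if new_n == 0:
        return False  # never publish a zone with no records at all
    if old_n == 0:
        return True   # nothing to compare against (first build)
    return new_n >= old_n * (1.0 - max_shrink)
```

A generator script would call safe_to_publish() just before bumping the
serial and reloading the master, and page a human instead of publishing
when it returns False. A zone containing only the SOA and NS records is
still syntactically valid, which is exactly why a syntax check alone
would not catch it.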

Actually, it amazes me that this sort of thing doesn't happen more  
often.

> Did they not test the upgrade beforehand?  I know I'd lose my
> job if I upgraded our DNS servers all at once without testing.

It's Europe, it's harder to fire people. There's probably a bit of  
scapegoating and shooting of messengers going on, but it's quite  
likely that the root cause is a general process failure that's not  
attributable to a single individual.


