[34080] in North American Network Operators' Group
Re: How common is lack of DNS server diversity?
daemon@ATHENA.MIT.EDU (Eric A. Hall)
Sat Jan 27 21:03:59 2001
Message-ID: <3A737D18.82388720@ehsco.com>
Date: Sat, 27 Jan 2001 17:59:52 -0800
From: "Eric A. Hall" <ehall@ehsco.com>
MIME-Version: 1.0
To: deeann mikula <deeann@telerama.com>
Cc: Charles Scott <cscott@gaslightmedia.com>,
	Brian <bri@sonicboom.org>, nanog@merit.edu
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Errors-To: owner-nanog-outgoing@merit.edu
> i experienced this exact same thing, and it was the secondary ns that
> NT was "fixating" on when making queries.  (the secondary was up and
> down for a few weeks until a new one was shipped out--yes, off-site
> and off-AS ;)
> 
> i had a VERY hard time explaining to NT professionals that their email
> to our domains shouldn't be bouncing, and that 99% of the internet
> could get mail to our domains just fine with one operating nameserver.
> i also didn't have any proof that NT didn't do The Right Thing, and no
> one wanted to help me prove it by hanging on the phone with me after
> complaining that "your nameservers are down."  is this misbehavior of
> NT documented anywhere?  is it fixable?  i don't know d*ck about NT,
> but i'd love to be able to at least suggest a fix and give someone a
> URL.
The DNS resolver for normal run-of-the-mill lookups handles failover
properly. If anything, it is too ambitious. The algorithm suggested in RFC
1035 is to "wait 5 seconds" for a timeout before trying another server,
while with WinSock-2 resolvers, the timeout threshold is one second, and
then multiple unique queries are sent shotgun-fashion to ALL of the other
servers simultaneously. The aggressiveness level is a matter of
administrative taste: when a query is for a name in a slow remote zone,
the shotgun approach is annoying. When the server is kaput, five seconds
can be too long.
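To make the trade-off concrete, here is a rough Python sketch of the two strategies. It is a toy model, not the actual WinSock-2 or BIND code: the function names, the `rtt` table of simulated response times (None meaning the server is down), and the rule that a late answer from the first server is still accepted are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Outcome:
    server: Optional[str]   # which server answered, or None if nobody did
    elapsed: float          # seconds until the answer (or until giving up)
    queries_sent: int       # how many queries went on the wire


def sequential_failover(servers: List[str],
                        rtt: Dict[str, Optional[float]],
                        timeout: float = 5.0) -> Outcome:
    """Ask one server at a time, waiting `timeout` before trying the next."""
    elapsed = 0.0
    for sent, name in enumerate(servers, start=1):
        delay = rtt.get(name)                 # None means the server is down
        if delay is not None and delay <= timeout:
            return Outcome(name, elapsed + delay, sent)
        elapsed += timeout                    # waited the full timeout for nothing
    return Outcome(None, elapsed, len(servers))


def shotgun_failover(servers: List[str],
                     rtt: Dict[str, Optional[float]],
                     timeout: float = 1.0) -> Outcome:
    """Ask the first server; on timeout, ask ALL the others at once.

    Assumes `servers` holds at least one name.
    """
    first, others = servers[0], servers[1:]
    delay = rtt.get(first)
    if delay is not None and delay <= timeout:
        return Outcome(first, delay, 1)
    # Timeout expired: fire at every remaining server simultaneously.
    # Assumption: a late answer from the first server is still accepted.
    arrivals = []
    if delay is not None:
        arrivals.append((delay, first))
    for name in others:
        d = rtt.get(name)
        if d is not None:
            arrivals.append((timeout + d, name))
    if not arrivals:
        # Assumption: give up one more timeout after the burst.
        return Outcome(None, 2 * timeout, len(servers))
    when, name = min(arrivals)                # earliest answer wins
    return Outcome(name, when, len(servers))


if __name__ == "__main__":
    servers = ["ns1", "ns2", "ns3"]
    dead_primary = {"ns1": None, "ns2": 0.2, "ns3": 0.3}
    slow_zone = {"ns1": 3.0, "ns2": 3.0, "ns3": 3.0}
    for label, table in (("dead primary", dead_primary), ("slow zone", slow_zone)):
        print(label, sequential_failover(servers, table), shotgun_failover(servers, table))
```

With a dead first server the shotgun model gets its answer several seconds sooner. With a slow but healthy zone it sends three queries where one would have done and gets the answer no faster, which is the annoying case.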
The NT4 DNS server is not this aggressive when it does failover queries
against remote zones. It waits a few seconds for responses to come back
and even ignores ICMP Destination Unreachable (Port Unreachable) errors
(generated when the DNS service is administratively down but the host is
still running). Note that ignoring ICMP errors is not uncommon; the stock
Linux resolver also does it, while Solaris and a few others do the right
thing.
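For anyone who wants to see what "the right thing" looks like from user space, the following Python sketch shows one common way a client can notice a Port Unreachable: connect() the UDP socket, so that the kernel reports the ICMP error on a later send or recv. This is only an illustration and not how any of the resolvers above are written. The function name `probe_udp` and the one-byte dummy payload are made up, and the exact exception varies by operating system.

```python
import socket


def probe_udp(host: str, port: int, timeout: float = 1.0) -> str:
    """Send one datagram and classify what comes back.

    Returns "answered", "refused" (an ICMP Port Unreachable was reported),
    or "timeout" (nothing came back, or the ICMP error was dropped).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        # A connected UDP socket lets the kernel hand ICMP errors back to us.
        s.connect((host, port))
        try:
            s.send(b"\x00")      # dummy payload, not a real DNS query
            s.recv(512)
            return "answered"
        except (ConnectionRefusedError, ConnectionResetError):
            # Typically ECONNREFUSED on Unix-like systems and
            # WSAECONNRESET on Windows.
            return "refused"
        except socket.timeout:
            return "timeout"
```

A resolver that treats "refused" as an immediate cue to move on does not have to sit through the full timeout when the name server process is down but the box is still up.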
Anyway, it is possible to get into a situation where the DNS resolver on a
WinSock-2 system aggressively fails out while the local DNS server is still
searching for an answer. In truth everything is doing what it is supposed
to do; it is just that the resolver sometimes does it too fast.
-- 
Eric A. Hall                                        http://www.ehsco.com/
Internet Core Protocols          http://www.oreilly.com/catalog/coreprot/