[26407] in Athena Bugs
9.3.18 Solaris: bind loses google
daemon@ATHENA.MIT.EDU (John Hawkinson)
Thu Mar 31 09:23:57 2005
Date: Thu, 31 Mar 2005 09:23:09 -0500 (EST)
Message-Id: <200503311423.j2VEN98A019721@portnoy.mit.edu>
To: bugs@mit.edu
From: John Hawkinson <jhawk@mit.edu>
X-Spam-Score: -2.4
X-Spam-Flag: NO
Errors-To: bugs-bounces@mit.edu
Hi.
[ Is the bind-doc package installed somewhere on Athena, or is it
just untarred in the source locker? ]
Folk have regularly been reporting intermittant problems resolving
www.google.com over zephyr for the past week or so. I finally turned
on named debugging to try to track it down. I was not terribly
successful, but have enough to be puzzled.
The symptoms seem to be that one gets either NXDOMAIN or SERVFAIL
when trying to resolve www.google.com, and it either goes away
after a few minutes, or it goes away instantly if you restarted named.
One imagines this has something to do with google's complicated
CNAME structure:
;; ANSWER SECTION:
www.google.com. 786 IN CNAME www.l.google.com.
www.l.google.com. 186 IN A 64.233.187.104
www.l.google.com. 186 IN A 64.233.187.99
;; AUTHORITY SECTION:
l.google.com. 86286 IN NS a.l.google.com.
l.google.com. 86286 IN NS b.l.google.com.
l.google.com. 86286 IN NS d.l.google.com.
Logs/dumps are in /mit/jhawk/google-dns:
[portnoy!jhawk] ~/google-dns> ls -l
total 184
-rw-r--r-- 1 jhawk mit 34771 Mar 31 09:13 named.run
-rw-r--r-- 1 jhawk mit 51426 Mar 31 09:13 named_dump.db
-rw-r--r-- 1 jhawk mit 6912 Mar 31 09:14 querylog
[portnoy!jhawk] ~/google-dns> cd
The querylog is uninsightful, with stuff like this:
09:08:10 client 127.0.0.1#47363: query: www.google.com IN A
09:08:15 client 127.0.0.1#47364: query: www.google.com.MIT.EDU IN A
09:08:15 client 127.0.0.1#47365: query: www.google.com IN A
09:08:20 client 127.0.0.1#47366: query: www.google.com.MIT.EDU IN A
09:08:20 client 127.0.0.1#47367: query: www.google.com IN A
Similarly, the dumpfile is unremarkable:
; glue
google.com. 172781 NS ns1.google.com.
172781 NS ns2.google.com.
172781 NS ns3.google.com.
172781 NS ns4.google.com.
; glue
l.google.com. 86381 NS a.l.google.com.
86381 NS b.l.google.com.
86381 NS c.l.google.com.
86381 NS d.l.google.com.
; glue
a.l.google.com. 86381 A 216.239.53.9
; glue
b.l.google.com. 86381 A 64.233.179.9
; glue
c.l.google.com. 86381 A 64.233.161.9
; glue
d.l.google.com. 86381 A 64.233.183.9
; glue
ns1.google.com. 172781 A 216.239.32.10
; glue
ns2.google.com. 172781 A 216.239.34.10
; glue
ns3.google.com. 172781 A 216.239.36.10
; glue
ns4.google.com. 172781 A 216.239.38.10
; authanswer
www.google.com. 881 CNAME www.l.google.com.
It's odd, I guess, that the CNAME is held authoritatively
and yet there's no A record. This does suggest pehraps
the problem lies in the nameservers for the l.google.com domain.
I had querylogging on prior to the start of this problem,
but I only turned on tracing after it. (Perhaps that
is part of the problem in getting good data here?).
Tracing wrote to /etc/named.run, basically stuff like this:
Mar 31 09:09:30.481 client 127.0.0.1#47375: query: www.google.com IN A
Mar 31 09:09:30.482 createfetch: www.l.google.com A
That was at level 1. Bumping up to level 5, it's more detailed,
but not apparently more helpful:
Mar 31 09:11:03.259 client 127.0.0.1#47388: UDP request
Mar 31 09:11:03.259 client 127.0.0.1#47388: using view '_default'
Mar 31 09:11:03.259 client 127.0.0.1#47388: request is not signed
Mar 31 09:11:03.259 client 127.0.0.1#47388: recursion available: approved
Mar 31 09:11:03.259 client 127.0.0.1#47388: query
Mar 31 09:11:03.259 client 127.0.0.1#47388: query: www.google.com IN A
Mar 31 09:11:03.260 client 127.0.0.1#47388: query (cache) approved
Mar 31 09:11:03.260 client 127.0.0.1#47388: replace
Mar 31 09:11:03.260 clientmgr @1fe450: createclients
Mar 31 09:11:03.260 clientmgr @1fe450: recycle
Mar 31 09:11:03.260 createfetch: www.l.google.com A
Mar 31 09:11:03.260 fctx 241a78: create
Mar 31 09:11:03.260 fctx 241a78: join
Mar 31 09:11:03.260 fetch 24a650 (fctx 241a78): created
Mar 31 09:11:03.260 client @26bb68: udprecv
Mar 31 09:11:03.260 fctx 241a78: start
Mar 31 09:11:03.260 fctx 241a78: try
Mar 31 09:11:03.260 fctx 241a78: cancelqueries
Mar 31 09:11:03.260 fctx 241a78: getaddresses
Mar 31 09:11:03.260 dns_adb_destroyfind on find 287120
Mar 31 09:11:03.260 dns_adb_destroyfind on find 287120
Mar 31 09:11:03.260 dns_adb_destroyfind on find 287120
Mar 31 09:11:03.261 dns_adb_destroyfind on find 287120
Mar 31 09:11:03.261 fctx 241a78: no addresses
Mar 31 09:11:03.261 fctx 241a78: done
Mar 31 09:11:03.261 fctx 241a78: stopeverything
Mar 31 09:11:03.261 fctx 241a78: cancelqueries
Mar 31 09:11:03.261 fctx 241a78: sendevents
Mar 31 09:11:03.261 fetch 24a650 (fctx 241a78): destroyfetch
Mar 31 09:11:03.261 fctx 241a78: shutdown
Mar 31 09:11:03.261 client 127.0.0.1#47388: error
Mar 31 09:11:03.261 client 127.0.0.1#47388: send
Mar 31 09:11:03.261 client 127.0.0.1#47388: sendto
Mar 31 09:11:03.261 client 127.0.0.1#47388: senddone
Mar 31 09:11:03.261 client 127.0.0.1#47388: next
Mar 31 09:11:03.261 client 127.0.0.1#47388: endrequest
Mar 31 09:11:03.261 fctx 241a78: doshutdown
Mar 31 09:11:03.261 fctx 241a78: stopeverything
Mar 31 09:11:03.261 fctx 241a78: cancelqueries
Mar 31 09:11:03.261 fctx 241a78: destroy
Going up to 'trace 9' it doesn't do much better.
That's about the extent of my debugging at this point.
I've restarted named and left 'trace 9' running.
--jhawk