[26407] in Athena Bugs

home help back first fref pref prev next nref lref last post

9.3.18 Solaris: bind loses google

daemon@ATHENA.MIT.EDU (John Hawkinson)
Thu Mar 31 09:23:57 2005

Date: Thu, 31 Mar 2005 09:23:09 -0500 (EST)
Message-Id: <200503311423.j2VEN98A019721@portnoy.mit.edu>
To: bugs@mit.edu
From: John Hawkinson <jhawk@mit.edu>
X-Spam-Score: -2.4
X-Spam-Flag: NO
Errors-To: bugs-bounces@mit.edu

Hi.

[ Is the bind-doc package installed somewhere on Athena, or is it
just untarred in the source locker? ]

    Folk have regularly been reporting intermittant problems resolving
www.google.com over zephyr for the past week or so. I finally turned
on named debugging to try to track it down. I was not terribly
successful, but have enough to be puzzled.

    The symptoms seem to be that one gets either NXDOMAIN or SERVFAIL
when trying to resolve www.google.com, and it either goes away
after a few minutes, or it goes away instantly if you restarted named.

    One imagines this has something to do with google's complicated
CNAME structure:

;; ANSWER SECTION:
www.google.com.         786     IN      CNAME   www.l.google.com.
www.l.google.com.       186     IN      A       64.233.187.104
www.l.google.com.       186     IN      A       64.233.187.99
;; AUTHORITY SECTION:
l.google.com.           86286   IN      NS      a.l.google.com.
l.google.com.           86286   IN      NS      b.l.google.com.
l.google.com.           86286   IN      NS      d.l.google.com.


	Logs/dumps are in /mit/jhawk/google-dns:

[portnoy!jhawk] ~/google-dns> ls -l
total 184
-rw-r--r--   1 jhawk    mit        34771 Mar 31 09:13 named.run
-rw-r--r--   1 jhawk    mit        51426 Mar 31 09:13 named_dump.db
-rw-r--r--   1 jhawk    mit         6912 Mar 31 09:14 querylog
[portnoy!jhawk] ~/google-dns> cd

The querylog is uninsightful, with stuff like this:

09:08:10 client 127.0.0.1#47363: query: www.google.com IN A
09:08:15 client 127.0.0.1#47364: query: www.google.com.MIT.EDU IN A
09:08:15 client 127.0.0.1#47365: query: www.google.com IN A
09:08:20 client 127.0.0.1#47366: query: www.google.com.MIT.EDU IN A
09:08:20 client 127.0.0.1#47367: query: www.google.com IN A

Similarly, the dumpfile is unremarkable:

; glue
google.com.             172781  NS      ns1.google.com.
                        172781  NS      ns2.google.com.
                        172781  NS      ns3.google.com.
                        172781  NS      ns4.google.com.
; glue
l.google.com.           86381   NS      a.l.google.com.
                        86381   NS      b.l.google.com.
                        86381   NS      c.l.google.com.
                        86381   NS      d.l.google.com.
; glue
a.l.google.com.         86381   A       216.239.53.9
; glue
b.l.google.com.         86381   A       64.233.179.9
; glue
c.l.google.com.         86381   A       64.233.161.9
; glue
d.l.google.com.         86381   A       64.233.183.9
; glue
ns1.google.com.         172781  A       216.239.32.10
; glue
ns2.google.com.         172781  A       216.239.34.10
; glue
ns3.google.com.         172781  A       216.239.36.10
; glue
ns4.google.com.         172781  A       216.239.38.10
; authanswer
www.google.com.         881     CNAME   www.l.google.com.


It's odd, I guess, that the CNAME is held authoritatively
and yet there's no A record. This does suggest pehraps
the problem lies in the nameservers for the l.google.com domain.

I had querylogging on prior to the start of this problem,
but I only turned on tracing after it. (Perhaps that
is part of the problem in getting good data here?).

Tracing wrote to /etc/named.run, basically stuff like this:

Mar 31 09:09:30.481 client 127.0.0.1#47375: query: www.google.com IN A
Mar 31 09:09:30.482 createfetch: www.l.google.com A

That was at level 1. Bumping up to level 5, it's more detailed,
but not apparently more helpful:

Mar 31 09:11:03.259 client 127.0.0.1#47388: UDP request
Mar 31 09:11:03.259 client 127.0.0.1#47388: using view '_default'
Mar 31 09:11:03.259 client 127.0.0.1#47388: request is not signed
Mar 31 09:11:03.259 client 127.0.0.1#47388: recursion available: approved
Mar 31 09:11:03.259 client 127.0.0.1#47388: query
Mar 31 09:11:03.259 client 127.0.0.1#47388: query: www.google.com IN A
Mar 31 09:11:03.260 client 127.0.0.1#47388: query (cache) approved
Mar 31 09:11:03.260 client 127.0.0.1#47388: replace
Mar 31 09:11:03.260 clientmgr @1fe450: createclients
Mar 31 09:11:03.260 clientmgr @1fe450: recycle
Mar 31 09:11:03.260 createfetch: www.l.google.com A
Mar 31 09:11:03.260 fctx 241a78: create
Mar 31 09:11:03.260 fctx 241a78: join
Mar 31 09:11:03.260 fetch 24a650 (fctx 241a78): created
Mar 31 09:11:03.260 client @26bb68: udprecv
Mar 31 09:11:03.260 fctx 241a78: start
Mar 31 09:11:03.260 fctx 241a78: try
Mar 31 09:11:03.260 fctx 241a78: cancelqueries
Mar 31 09:11:03.260 fctx 241a78: getaddresses
Mar 31 09:11:03.260 dns_adb_destroyfind on find 287120
Mar 31 09:11:03.260 dns_adb_destroyfind on find 287120
Mar 31 09:11:03.260 dns_adb_destroyfind on find 287120
Mar 31 09:11:03.261 dns_adb_destroyfind on find 287120
Mar 31 09:11:03.261 fctx 241a78: no addresses
Mar 31 09:11:03.261 fctx 241a78: done
Mar 31 09:11:03.261 fctx 241a78: stopeverything
Mar 31 09:11:03.261 fctx 241a78: cancelqueries
Mar 31 09:11:03.261 fctx 241a78: sendevents
Mar 31 09:11:03.261 fetch 24a650 (fctx 241a78): destroyfetch
Mar 31 09:11:03.261 fctx 241a78: shutdown
Mar 31 09:11:03.261 client 127.0.0.1#47388: error
Mar 31 09:11:03.261 client 127.0.0.1#47388: send
Mar 31 09:11:03.261 client 127.0.0.1#47388: sendto
Mar 31 09:11:03.261 client 127.0.0.1#47388: senddone
Mar 31 09:11:03.261 client 127.0.0.1#47388: next
Mar 31 09:11:03.261 client 127.0.0.1#47388: endrequest
Mar 31 09:11:03.261 fctx 241a78: doshutdown
Mar 31 09:11:03.261 fctx 241a78: stopeverything
Mar 31 09:11:03.261 fctx 241a78: cancelqueries
Mar 31 09:11:03.261 fctx 241a78: destroy

Going up to 'trace 9' it doesn't do much better.

That's about the extent of my debugging at this point.
I've restarted named and left 'trace 9' running.

--jhawk

home help back first fref pref prev next nref lref last post