[453] in arla-drinkers
Re: another arlad crash on netbsd
daemon@ATHENA.MIT.EDU (Love)
Sun Jan 3 18:01:12 1999
From: Love <lha@stacken.kth.se>
To: Ken Raeburn <raeburn@raeburn.org>
Cc: arla-drinkers@stacken.kth.se
Subject: Re: another arlad crash on netbsd
References: <199901030822.DAA14282@kr-pc.cygnus.com>
Mime-Version: 1.0 (generated by tm-edit 7.106)
Content-Type: text/plain; charset=US-ASCII
Date: 03 Jan 1999 23:51:38 +0100
In-Reply-To: Ken Raeburn's message of Sun, 3 Jan 1999 03:22:06 -0500 (EST)
Message-ID: <amu2y82cn9.fsf@hummel.e.kth.se>
Lines: 109
X-Mailer: Gnus v5.5/Emacs 20.2
Sender: owner-arla-drinkers@stacken.kth.se
Precedence: bulk
Ken Raeburn <raeburn@raeburn.org> writes:
> I was running a "du" across a modem line (ppp) that probably had a
> bunch of other traffic as well (mail & news downloads, X11), and when
> I went to look at the output, after some numbers for the first many
> directories, I saw a lot of "network is down" messages for individual
> files, then:
>
> du: ./.mh/save/1610: Network is down
> du: ./.mh/save/1616: Network is down
> du: ./.mh/save/.mh_sequences: Network is down
> 751 ./.mh/save
> du: ./.mh/Zephyr: Operation not supported by device
> du: ./.mh/ANSI_C: Not a directory
> du: ./.mh/tcl: Not a directory
>
> The "not a directory" stuff seems to come up when arlad isn't running,
> so I'm guessing that that's the point when it crashed, and the
> "network is down" came from having a heavy load on the ppp link, but
> I'm just guessing.
The "Operation not supported by device" and "Not a directory" errors come from
the dead vnode that xfs created when arlad died.
What probably happened is that your fileserver's responses got through to
arlad too late, so arlad considered the fileserver down. Later, when
arlad cleaned the cache, the cleaner thread never checked the return value
of conn_get(), which can be NULL.
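To make the failure mode concrete, here is a minimal sketch of the check that the cleaner path was missing. All the names (struct conn, fake_conn_get, fake_give_up_callbacks, throw_entry) are made-up stand-ins and not the real arlad definitions; the only point is that the connection lookup can return NULL and the caller has to cope with that.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for arlad's connection type. */
struct conn {
    int alive;
};

/* Hypothetical stand-in for conn_get(): NULL means "server considered down". */
struct conn *
fake_conn_get (int server_up)
{
    static struct conn c = { 1 };

    return server_up ? &c : NULL;
}

/* Hypothetical stand-in for the give-up-callbacks RPC; 0 means success. */
int
fake_give_up_callbacks (struct conn *c)
{
    return c->alive ? 0 : -1;
}

/*
 * Drop a cache entry. Returns 1 if the server was told about it,
 * 0 if that step was skipped because there was no connection.
 */
int
throw_entry (int server_up)
{
    struct conn *conn = fake_conn_get (server_up);
    int ret;

    if (conn == NULL)
        return 0;       /* server down: skip the RPC instead of crashing */

    ret = fake_give_up_callbacks (conn);
    if (ret)
        fprintf (stderr, "give up callbacks failed: %d\n", ret);
    return 1;
}
```

Calling throw_entry (0) models the case from the report: the server is marked down, the lookup hands back NULL, and the entry is dropped without touching the connection.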
I guess there needs to be some way to tell arla to retry forever when
you sit behind a slow link and really want your files. That should probably
not be too hard to add.
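One way such a knob could look, purely as a sketch: a retry wrapper where a limit of 0 means "keep trying forever". Nothing like this exists in arla as far as this mail says; connect_with_retry, connect_fn and flaky_connect are invented for illustration, and a real daemon would presumably wait inside its own thread package rather than call sleep().

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical callback type: returns a connection, or NULL on failure. */
typedef void *(*connect_fn) (void *arg);

/*
 * Call try_connect until it succeeds. max_tries == 0 means retry forever;
 * otherwise give up and return NULL after max_tries failed attempts.
 * delay_secs is the pause between attempts (0 for none).
 */
void *
connect_with_retry (connect_fn try_connect, void *arg,
                    unsigned max_tries, unsigned delay_secs)
{
    unsigned tries = 0;

    for (;;) {
        void *conn = try_connect (arg);

        if (conn != NULL)
            return conn;
        tries++;
        if (max_tries != 0 && tries >= max_tries)
            return NULL;
        if (delay_secs > 0)
            sleep (delay_secs);
    }
}

/* Demo callback: fails until the counter in *arg has run down to zero. */
void *
flaky_connect (void *arg)
{
    int *failures_left = arg;

    if (*failures_left > 0) {
        (*failures_left)--;
        return NULL;
    }
    return arg;
}
```

With max_tries set to 0 the caller never sees NULL, which is the "I really want my files" behaviour; with a finite limit the caller still has to check for NULL.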
The patch below should fix this problem with arlad dying (but not the
retry stuff).
> And entry->host does correspond to the host holding the volume I was
> examining. (However, using Transarc "fs whereis" on the volume after
> restarting arlad, I get a backwards IP address printed out,
> "30.0.185.18" when it should presumably be "18.185.0.30" or
> "cronos.mit.edu". Perhaps AFS and Arla are using different byte
> orders for that datum.)
We always assume that data is in network byte order when passed between
arla and fs. The documentation does not say anything about the byte order.
I guess you could use arla's fs instead.
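The reversed address is what you would expect if one side swaps the four bytes and the other does not. A small sketch, using the address from your mail; format_quad and swap32 are helper names made up for this example and are not taken from arla or Transarc's fs.

```c
#include <stdint.h>
#include <stdio.h>

/* Write addr as a dotted quad, most significant byte first. */
void
format_quad (uint32_t addr, char *buf, size_t len)
{
    snprintf (buf, len, "%u.%u.%u.%u",
              (unsigned) ((addr >> 24) & 0xff),
              (unsigned) ((addr >> 16) & 0xff),
              (unsigned) ((addr >> 8) & 0xff),
              (unsigned) (addr & 0xff));
}

/* Reverse the byte order of a 32-bit value. */
uint32_t
swap32 (uint32_t x)
{
    return ((x & 0x000000ffU) << 24)
         | ((x & 0x0000ff00U) << 8)
         | ((x & 0x00ff0000U) >> 8)
         | ((x & 0xff000000U) >> 24);
}
```

Formatting 0x12B9001E gives "18.185.0.30"; formatting swap32 (0x12B9001E) gives "30.0.185.18", which matches the output you saw.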
Love
Index: fcache.c
===================================================================
RCS file: /usr/local/cvsroot/arla/arlad/fcache.c,v
retrieving revision 1.176
diff -u -w -u -w -r1.176 fcache.c
--- fcache.c 1999/01/03 05:25:39 1.176
+++ fcache.c 1999/01/03 22:26:11
@@ -449,6 +449,7 @@
FS_SERVICE_ID, fs_probe, ce);
cred_free (ce);
+ if (conn != NULL) {
fids.len = cbs.len = 1;
fids.val = &entry->fid.fid;
cbs.val = &entry->callback;
@@ -457,6 +458,7 @@
if (ret)
arla_warn (ADEBFCACHE, ret, "RXAFS_GiveUpCallBacks");
}
+ }
volcache_free (entry->volume);
entry->volume = NULL;
/* entry->inode = 0;*/
@@ -1417,6 +1419,8 @@
conn = conn_get (entry->fid.Cell, entry->host, afsport,
FS_SERVICE_ID, fs_probe, ce);
cred_free (ce);
+
+ if (conn != NULL) {
fids.len = cbs.len = 1;
fids.val = &entry->fid.fid;
cbs.val = &entry->callback;
@@ -1427,6 +1431,7 @@
arla_warn (ADEBFCACHE, ret, "RXAFS_GiveUpCallBacks");
}
}
+ }
return 0; /* XXX */
}
@@ -1457,7 +1462,9 @@
conn = conn_get (entry->fid.Cell, entry->host, afsport,
FS_SERVICE_ID, fs_probe, ce);
+ cred_free (ce);
+ if (conn != NULL) {
ret = RXAFS_FetchStatus (conn->connection,
&entry->fid.fid,
&status,
@@ -1470,7 +1477,7 @@
rx_HostOf(rx_PeerOf (conn->connection)),
ce->cred);
conn_free (conn);
- cred_free (ce);
+ }
}
}
return 0; /* XXX */