[17401] in Athena Bugs
Re: Solaris clients with filesystem corruption
daemon@ATHENA.MIT.EDU (Greg Hudson)
Thu Dec 2 07:52:55 1999
Message-Id: <199912021252.HAA05719@small-gods.mit.edu>
To: Garry Zacheiss <zacheiss@MIT.EDU>
cc: bugs@MIT.EDU
In-Reply-To: Your message of "Wed, 01 Dec 1999 22:11:40 EST."
<199912020311.WAA93287@oliver.mit.edu>
Date: Thu, 02 Dec 1999 07:52:47 -0500
From: Greg Hudson <ghudson@MIT.EDU>
> At the time, I speculated that this was a syncconf bug, and I still
> think it seems likely.
After some inspection, I think you're wrong about it being a bug in
syncconf, but right that syncconf is encouraging this behavior.
The general sequence of events is this:

  * Public machine boots and runs syncconf.

  * syncconf moves aside the four network configuration files
    and writes out new versions.  There is a short window
    between moving them aside and writing out new versions
    during which a reboot would hose us (/etc/hostname.hme0
    doesn't exist), but we're not getting bitten by that as far
    as we know.

  * Under standard FFS semantics, the inode and directory
    entry for each new file are written synchronously out to
    disk, so on disk the four files are created with size 0.
    Since the files are not fsync()'d during writing, their
    contents are not synchronously written to disk.  So there
    is another window, of indefinite length, during which an
    unclean reboot will hose us (/etc/hostname.hme0 has zero
    length).
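
For illustration, here is a minimal C sketch of one way to close both
windows described above: write the new contents to a temporary file,
fsync() it, and only then rename() it over the old file. The function
name write_file_atomically and the ".new" suffix are hypothetical and
are not part of syncconf; treat this as a sketch under those
assumptions, not as the actual fix.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `len` bytes of `data` so that
 * an unclean reboot leaves either the old file or the complete new one.
 * Returns 0 on success, -1 on failure. */
int write_file_atomically(const char *path, const char *data, size_t len)
{
    char tmp[1024];
    size_t off = 0;
    int fd, n;

    assert(path != NULL && data != NULL);

    /* Put the temporary file in the same directory so that rename()
     * stays within one filesystem.  The ".new" suffix is arbitrary. */
    n = snprintf(tmp, sizeof(tmp), "%s.new", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;

    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may write fewer bytes than requested, so loop. */
    while (off < len) {
        ssize_t w = write(fd, data + off, len - off);
        if (w < 0) { close(fd); unlink(tmp); return -1; }
        off += (size_t)w;
    }

    /* Force the contents to disk before the new name becomes visible. */
    if (fsync(fd) < 0) { close(fd); unlink(tmp); return -1; }
    if (close(fd) < 0) { unlink(tmp); return -1; }

    /* rename() replaces the old file in a single step, so `path`
     * always names either the old version or the new one. */
    if (rename(tmp, path) < 0) { unlink(tmp); return -1; }
    return 0;
}
```

A caller would use it as, for example,
write_file_atomically("/etc/hostname.hme0", name, strlen(name)). On a
filesystem that does not write directory updates synchronously, one
would probably also want to fsync() the containing directory after the
rename; that step is left out here.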
I can significantly reduce the problem by making syncconf create new
versions of the files and move them into place only if they differ
from the old versions. I'm not sure whether I can fix the problem
entirely without writing C code, since I don't know how to get a shell
script to fsync() a file.
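
The compare-then-move approach could look roughly like the following C
sketch, which also fsync()s the new file before moving it (the step
that is awkward from a shell script). The names files_differ and
install_if_changed are hypothetical. Opening the already-written file
just to fsync() it assumes that fsync() flushes data written earlier by
another process; that holds on common Unix systems but is an assumption
here, not something verified against syncconf's environment.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Returns 1 if the two files differ (or `b` cannot be opened),
 * 0 if they are byte-for-byte identical, -1 if `a` cannot be opened. */
int files_differ(const char *a, const char *b)
{
    FILE *fa = fopen(a, "rb");
    FILE *fb;
    int ca, cb;

    if (fa == NULL)
        return -1;
    fb = fopen(b, "rb");
    if (fb == NULL) { fclose(fa); return 1; }

    /* Read both files in step until a mismatch or end of file. */
    do {
        ca = getc(fa);
        cb = getc(fb);
    } while (ca == cb && ca != EOF);

    fclose(fa);
    fclose(fb);
    return ca != cb;
}

/* Returns 1 if `candidate` was moved over `target`, 0 if the two were
 * identical and `candidate` was discarded, -1 on error. */
int install_if_changed(const char *candidate, const char *target)
{
    int fd, differ;

    assert(candidate != NULL && target != NULL);

    differ = files_differ(candidate, target);
    if (differ < 0)
        return -1;
    if (differ == 0)
        return unlink(candidate) < 0 ? -1 : 0;  /* nothing to install */

    /* Get the candidate's contents onto disk before the rename. */
    fd = open(candidate, O_WRONLY);
    if (fd < 0)
        return -1;
    if (fsync(fd) < 0) { close(fd); return -1; }
    if (close(fd) < 0)
        return -1;

    return rename(candidate, target) < 0 ? -1 : 1;
}
```

When nothing has changed, the existing file is never touched, so
neither window opens at all; when something has changed, the file is
replaced only after its contents have been flushed.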
Incidentally, yesterday I gave cluster (well, Lou and Chris)
instructions on how to fix machines with this problem without
reinstalling them.