[1955] in SIPB_Linux_Development
YACCB (Yet Another Cache Corruption Bug)
daemon@ATHENA.MIT.EDU (Jeffrey Hutzelman)
Wed Dec 10 17:09:53 1997
Date: Wed, 10 Dec 1997 17:09:15 -0500 (EST)
From: Jeffrey Hutzelman <jhutz+@cmu.edu>
Reply-To: Jeffrey Hutzelman <jhutz+@cmu.edu>
To: linux-afs-bugs@MIT.EDU
Cc: cg2v+@andrew.cmu.edu, linux-dev@MIT.EDU, afs-suckers@dementia.org
In transaction [1975] of linux-afs, Emil Sit <sit@MIT.EDU> wrote:
> I have a machine whose video card is not particularly stable and will
> sometime lock the machine when graphics programs mess with the color
> map. Anyway, it crashed, and then /afs/athena/user/n (and possibly
> other things) went away. This sounds to me like AFS cache corruption.
> I nuked /usr/vice/cache and rebooted and things worked well again.
We've seen a similar problem occasionally, even when the machine
hadn't crashed. The symptoms are basically the same - a directory
appears to be empty, or a volume appears to be offline, even though
it isn't. Once the problem shows up, it persists across a reboot
unless the cache is cleared (removing the CacheItems file is sufficient;
it is almost never necessary to actually remove the V* files).
However, for the empty-directory case, an 'fs flush' of the affected
directory was sufficient to make it go away (that doesn't work for
mount points, since the volume can't be accessed).
This hasn't happened very often, and we've seen similar problems
before on other platforms, so we've basically been ignoring it.
However, on Monday, I stumbled upon a way to reproduce the problem.
We use an automated mechanism (built around depot) to install software
on our machines. I was attempting to upgrade my machine on Monday,
and it kept losing with the problem I described above. The difference
this time was that even if I cleared the cache and rebooted, the
problem would come back if I tried the upgrade again. Thus, I now
had a consistent, if unwieldy, way to reproduce the problem.
Yesterday, we did some work and found out some interesting things. It
seems that cache chunks were (apparently) randomly being truncated
to 0 length when they shouldn't be. So, the cache manager would
think it had a 17-byte mount point cached, but the chunk contained
no data. Obviously, it wasn't able to access a volume named by the
empty string! Further investigation showed that, every once in a
great while, the cache manager would send a bogus fetch-data request
to the fileserver (correct Fid, but bogus offset and length).
It would get an equally bogus response, which would result in the
0-length cache chunk.
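To illustrate the symptom just described, here is a minimal C sketch of a
consistency check that flags chunks whose recorded length is nonzero but
whose data file is empty. The struct and function names are hypothetical
and are not taken from the AFS cache manager; the real CacheItems layout
is not modeled here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of one cache chunk: what the index says the chunk
 * holds versus how many bytes are actually in the chunk file. */
struct chunk_record {
    long expected_len;  /* length recorded by the cache manager */
    long actual_len;    /* size of the on-disk chunk file */
};

/* A chunk shows the symptom when data is expected but the file is empty,
 * e.g. a 17-byte mount point whose chunk was truncated to 0 bytes. */
static int chunk_is_truncated(const struct chunk_record *c)
{
    assert(c != NULL);
    return c->expected_len > 0 && c->actual_len == 0;
}

/* Count how many chunks in the array show the symptom. */
static size_t count_truncated(const struct chunk_record *chunks, size_t n)
{
    size_t i, bad = 0;
    for (i = 0; i < n; i++) {
        if (chunk_is_truncated(&chunks[i]))
            bad++;
    }
    return bad;
}
```

A check like this only detects the damage after the fact; it says nothing
about why the bogus fetch request was sent.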
We haven't been able to track down the reason for the bogus request.
We added some logging code, and the values are correct at least as
far as the call to StartRXAFS_FetchData, but the wrong values go out
on the wire. So, I don't have a fix for this.
What I _do_ have is a workaround. The bogus length returned by the
fileserver is apparently always negative. So, we made a change to
UFS_CacheFetchProc (in afs/afs_cache.c) so that if it gets a
negative data length from the fileserver, it returns VMOVED. This
tricks GetDCache into retrying the fetch, which generally succeeds
(in the short time since we came up with this workaround, I haven't
seen it fail yet). Existing code in afs_Analyze ensures that this
retry will happen at most 5 times, rather than looping forever if
there happens to be a bogus fileserver out there.
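The idea behind the workaround can be sketched in standalone C roughly as
follows. This is not the actual change to UFS_CacheFetchProc: the
function names, the fake server, and the numeric value used for the
VMOVED stand-in are all assumptions made for illustration, and the bound
of 5 is treated here as a limit on total attempts.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the AFS VMOVED error code (value assumed for this sketch). */
#define SKETCH_VMOVED 111
/* Mirrors the "at most 5 times" bound described for afs_Analyze. */
#define SKETCH_MAX_ATTEMPTS 5

/* Hypothetical fetch callback: stores the length the server reported. */
typedef int (*fetch_fn)(void *ctx, long *length_out);

/* The workaround's check: a negative length is reported as VMOVED so
 * that the caller retries instead of caching a 0-length chunk. */
static int check_fetch_length(long length)
{
    return (length < 0) ? SKETCH_VMOVED : 0;
}

/* Bounded retry loop standing in for GetDCache plus the retry limit. */
static int fetch_with_retry(fetch_fn fetch, void *ctx,
                            long *length_out, int *attempts_out)
{
    int attempts = 0;
    int code = SKETCH_VMOVED;

    assert(fetch != NULL && length_out != NULL);
    while (attempts < SKETCH_MAX_ATTEMPTS) {
        attempts++;
        code = fetch(ctx, length_out);
        if (code == 0)
            code = check_fetch_length(*length_out);
        if (code != SKETCH_VMOVED)
            break;              /* success, or an error we do not retry */
    }
    if (attempts_out != NULL)
        *attempts_out = attempts;
    return code;
}

/* Fake server for the sketch: sends a bogus negative length for the
 * first few replies, then a plausible 17-byte length. */
struct fake_server {
    int bogus_replies_left;
};

static int fake_fetch(void *ctx, long *length_out)
{
    struct fake_server *s = ctx;
    if (s->bogus_replies_left > 0) {
        s->bogus_replies_left--;
        *length_out = -1;
    } else {
        *length_out = 17;
    }
    return 0;
}
```

With a server that misbehaves only occasionally, the loop succeeds on a
later attempt; with a server that always returns a negative length, it
gives up after the fifth attempt and the error reaches the caller.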
Since this is really only a workaround and not a fix, I'm kind of hesitant
to commit it to the CVS tree. On the other hand, it does eliminate a
case of cache corruption that was biting me pretty badly a couple of
days ago. I'd be interested in hearing people's opinions on this...
-- Jeffrey T. Hutzelman (N3NHS) <jhutz+@cmu.edu>
Systems Programmer, SCS Research Computing Facility
President and System Administrator, CMU Computer Club
Carnegie Mellon University - Pittsburgh, PA