[98] in athena10
Re: Corruption bug affecting debathenificator
daemon@ATHENA.MIT.EDU (Greg Hudson)
Wed Feb 27 17:19:47 2008
From: Greg Hudson <ghudson@MIT.EDU>
To: athena10@mit.edu
In-Reply-To: <200802262159.m1QLxSc7014383@outgoing.mit.edu>
Content-Type: text/plain
Date: Wed, 27 Feb 2008 17:19:16 -0500
Message-Id: <1204150756.5862.21.camel@error-messages.mit.edu>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit

Didn't turn up anything suspicious in my LVM configuration. Some more
thoughts before I give up for the evening:

On Tue, 2008-02-26 at 16:59 -0500, ghudson@MIT.EDU wrote:
> In at least the _ to ^ case, the change was represented in the
> diff, which may indicate that the corruption is not happening
> during the copy to /tmp. (If it were happening during the copy
> to /tmp then it wouldn't show up in the diff and the binary build
> would have succeeded. I think.)

Thinking about this case harder, the simplest explanation is that one of
the other files in $tmpdir besides the orig tarball is being corrupted.
So, the scenario appears to be:

  1. Create a bunch of files in /tmp.
  2. Run schroot -b to create an schroot session.
  3. Some of the time, one of the new files has had one byte
     decremented, always at offset 156 within a 1024-byte block (but a
     different block each time).
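
As an illustration of the symptom in step 3, here is a minimal Python
sketch that compares a known-good copy of a file against a suspect copy
and reports where each differing byte falls within its 1024-byte block.
The name byte_diffs and its arguments are hypothetical; nothing like it
exists in schroot or the debathenificator. For reference, the _ to ^
change quoted above is a decrement by one (0x5F to 0x5E).

```python
def byte_diffs(good: bytes, bad: bytes, block_size: int = 1024):
    """Return a list of (offset, block, offset_in_block, delta) tuples.

    One tuple is produced per differing byte. delta is bad - good, so a
    byte that was decremented by one shows up as -1. Only the common
    prefix is compared if the two inputs differ in length.
    """
    diffs = []
    for offset, (a, b) in enumerate(zip(good, bad)):
        if a != b:
            block, within = divmod(offset, block_size)
            diffs.append((offset, block, within, b - a))
    return diffs


def file_byte_diffs(good_path: str, bad_path: str, block_size: int = 1024):
    """Same comparison, reading both (hypothetical) paths from disk."""
    with open(good_path, "rb") as g, open(bad_path, "rb") as b:
        return byte_diffs(g.read(), b.read(), block_size)
```

A result such as (3228, 3, 156, -1) would mean a single byte in block 3,
at offset 156 within that block, decremented by one, which is the
pattern described above.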
One possibility is that something schroot -b is doing is opening files
in /tmp and altering them. That seems kind of unlikely.

A second possibility is that something schroot -b is doing is causing
kernel memory corruption in the page cache. One way to test this
hypothesis would be to drop the page cache and see whether the file
remains corrupted when re-read from disk. (Right now the only way I know
of to drop the page cache is to reboot the machine, which cleans out
/tmp, but I can either inhibit that cleanup or move $tmpdir to a
different location.) I can narrow
down the kernel operation causing the corruption by instrumenting
schroot with periodic calls to a function which verifies the md5sums of
the newly created files. Creation of an LVM snapshot is the most exotic
kernel operation and thus the most likely candidate.
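
The md5sum check described above could look roughly like the following
Python sketch. It is only an approximation run from outside the process:
schroot itself is not written in Python, so real instrumentation would
have to be added to its own source. The names md5_of, snapshot and
changed_files are hypothetical.

```python
import hashlib
import os


def md5_of(path):
    """Return the hex md5 digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(directory):
    """Record a baseline digest for every regular file under directory."""
    sums = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            sums[path] = md5_of(path)
    return sums


def changed_files(baseline):
    """Return the sorted paths whose current digest differs from baseline.

    A file that can no longer be read is also reported as changed.
    """
    changed = []
    for path, digest in baseline.items():
        try:
            if md5_of(path) != digest:
                changed.append(path)
        except OSError:
            changed.append(path)
    return sorted(changed)
```

The idea would be to call snapshot on $tmpdir right after the files are
created, then call changed_files after each suspect operation; the first
call that returns a non-empty list brackets the operation at fault.
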
If it is kernel memory corruption, I can either stop using LVM snapshots
(helloooo slow tarball schroots) or try to track down the problem in the
kernel code. Fun either way.