[864] in SIPB_Linux_Development
ext2 files getting unlinked under 1.1.7[23] and other things
daemon@ATHENA.MIT.EDU (Erik Nygren)
Tue Dec 20 16:51:43 1994
To: linux-kernel@vger.rutgers.edu, linux-afs-bugs@MIT.EDU
Cc: warlord@MIT.EDU, yonah@MIT.EDU, jered@MIT.EDU, linux-dev@MIT.EDU
Date: Tue, 20 Dec 1994 16:51:30 -0500
From: Erik Nygren <nygren@MIT.EDU>
Hello,
A number of machines here at MIT, all running 1.1.7[23] with Linux AFS,
have shown ext2 file corruption. This is a BadThing[tm]. The
corruption seems to occur in very localized areas on the disk. In the
most recent case on my machine, I was able to look the files up in the
Slackware package list from my install and found that the lost files
form one contiguous chunk of that list:
usr/sbin/routed
usr/sbin/rpc.bwnfsd
usr/sbin/rpc.mountd
usr/sbin/rpc.nfsd <-- start of lossage
usr/sbin/rpc.pcnfsd
usr/sbin/rpc.portmap
usr/sbin/rpc.ugidd
usr/sbin/rwhod
usr/sbin/sendlog
usr/sbin/tcpd
usr/sbin/in.ftpd.NetBSD <-- end of lossage
usr/sbin/showmount
usr/sbin/in.rexecd
Oddly, /usr/sbin/syslogd, /usr/sbin/klogd, and possibly other files
also went away. By my count this is about 150K, but I think it's the
number of files that matters. The previous time I noticed this, a
chunk of my dotfiles disappeared from my homedir. Those files were
all quite small, and about the same number got trashed. The files
don't actually get unlinked at the time. Rather, when fsck gets run on
the partition, often because the maximal mount count has been reached,
it sees the files as not properly unlinked (I don't have a record
of the exact message) and proceeds to unlink them. A large block of
bitmap differences was also recorded. I've only seen this problem on
root partitions. In another instance, I noticed that some of
my dotfiles were not accessible. I did an ls -l and saw:
....
-rw-r--r-- 1 nygren root 438 Feb 8 1994 .Xmodmap
-rwxr-xr-x 1 nygren root 11773 Nov 19 15:29 .Xresources*
-rw-r--r-- 1 nygren root 0 Feb 8 1994 .addressbook
c--s-wS-wx 20527 13125 17735 10, 58 Aug 15 2011 .aliases
c---rwx--- 8250 8250 8275 73, 84 Oct 23 2014 .anyone
-rw-r--r-- 1 nygren root 1160 Feb 8 1994 .anyone.bak
-rw-r--r-- 1 nygren root 6684 Nov 22 22:25 .anyone.who
cr-s--Sr-- 12115 16971 17225 75, 32 Dec 9 2009 .bash_history
-rw-r--r-- 1 nygren root 57 Feb 8 1994 .bash_logout
d--Sr----T 12845 19781 17232 959661356 Nov 7 2006 .bash_profile/
dr-x-wS-w- 19759 11600 20047 1163013170 Mar 23 2005 .bashrc/
-rw-r--r-- 1 nygren root 5 Feb 8 1994 .cmdf_pid
-rw-r--r-- 1 nygren root 81 Dec 10 01:22 .comicsrc
-rw-r--r-- 1 nygren root 0 Dec 4 23:42 .crefrc
-rw-r--r-- 1 nygren root 474 Aug 30 16:30 .dayplan
-rw-r--r-- 1 nygren root 89 Aug 30 16:21 .dayplan.bak
-rw------- 1 nygren root 0 Aug 30 16:30 .dayplan.priv
-rw------- 1 nygren root 0 Aug 30 16:21 .dayplan.priv.bak
-rw-r--r-- 1 nygren root 724 Dec 9 17:44 .doomrc
d--x--S--x 17751 16717 20304 540680259 Oct 27 2014 .emacs/
-rw-r--r-- 1 nygren root 3204 Apr 7 1994 .emacs.old
-rw-r--r-- 1 nygren root 3511 Jun 23 04:12 .emacs.sgi
dr-s--Sr-T 17952 8275 14880 1330121274 Jul 24 23:53 .emacs~/
-rwxr-xr-x 1 nygren root 1062 Jan 28 1994 .environment*
-rw-r--r-- 1 nygren root 62 Oct 29 02:06 .eolcr
-rw-r--r-- 1 nygren root 15 Feb 8 1994 .forward
-rw-r--r-- 1 nygren root 13054 Sep 8 02:14 .fvwmrc
....
I don't think .bash_history will work all that well as
a setuid character device.... :-/ Anyway,
things stayed like this for a short while and then
suddenly returned to normal! I have no idea what caused
this sudden change of heart, as I didn't reboot or fsck or
anything. The inode numbers of these files are in the 6432[0-9] range
(e.g. .anyone is 64322).
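For readers unfamiliar with how ls renders the mode word, here is a minimal sketch (not the actual ls source) that turns a mode into the ten-character string seen in the listing above. The names mode_string and triplet are made up for this illustration. It shows why a trashed inode whose mode happens to have the character-device type and the setuid bit set appears as "cr-s--Sr--".

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <sys/stat.h>

/* Render one rwx triplet. `special` is the setuid, setgid, or sticky bit
 * belonging to that triplet: it is shown as `lo` when the execute bit is
 * also set and as `up` when it is not. (Hypothetical helper.) */
static void triplet(char *out, int r, int w, int x, int special,
                    char lo, char up)
{
    out[0] = r ? 'r' : '-';
    out[1] = w ? 'w' : '-';
    out[2] = special ? (x ? lo : up) : (x ? 'x' : '-');
}

/* Fill buf (11 bytes, including the terminating NUL) with an ls-style
 * mode string. Only the file types seen in the listing are handled. */
void mode_string(mode_t m, char buf[11])
{
    buf[0] = S_ISDIR(m) ? 'd' :
             S_ISCHR(m) ? 'c' :
             S_ISBLK(m) ? 'b' :
             S_ISREG(m) ? '-' : '?';
    triplet(buf + 1, m & S_IRUSR, m & S_IWUSR, m & S_IXUSR, m & S_ISUID, 's', 'S');
    triplet(buf + 4, m & S_IRGRP, m & S_IWGRP, m & S_IXGRP, m & S_ISGID, 's', 'S');
    triplet(buf + 7, m & S_IROTH, m & S_IWOTH, m & S_IXOTH, m & S_ISVTX, 't', 'T');
    buf[10] = '\0';
}
```

Since the mode, owner, size, and date fields of the affected entries all look like noise, the inode contents themselves were presumably garbage at that point, not just the directory entries.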
Some messages from my syslog:
.....
Dec 11 19:13:18 foundation kernel: Unable to handle kernel NULL pointer dereference at virtual address c0000004
Dec 11 19:13:18 foundation kernel: current->tss.cr3 = 007d4000, Xr3 = 007d4000
Dec 11 19:13:18 foundation kernel: *pde = 00102027
Dec 11 19:13:18 foundation kernel: *pte = 00000027
Dec 11 19:13:26 foundation kernel: EXT2-fs error (device 3/7): ext2_find_entry: bad directory entry: rec_len is smaller than minimal
Dec 11 19:15:05 foundation syslogd: exiting on signal 15
Dec 11 19:19:22 foundation kernel: EXT2-fs warning (device 3/69): ext2_free_inode: bit already cleared for inode 64335
Dec 11 19:19:22 foundation kernel: EXT2-fs warning (device 3/69): ext2_free_inode: bit already cleared for inode 64335
....
(those are from the first case where my dotfiles went away).
Another user reported on another machine:
....
Dec 15 01:34:33 vorlon kernel: EXT2-fs error (device 8/2): ext2_find_entry: bad directory entry: directory entry across blocks
Dec 15 01:34:56 vorlon last message repeated 2 times
Dec 15 01:50:09 vorlon kernel: EXT2-fs error (device 8/2): ext2_find_entry: bad directory entry: directory entry across blocks
.....
Dec 15 02:02:05 vorlon kernel: EXT2-fs error (device 8/2): ext2_find_entry: bad directory entry: rec_len is smaller than minimal
Dec 15 02:02:14 vorlon last message repeated 2 times
Dec 15 02:07:21 vorlon kernel: EXT2-fs error (device 8/2): ext2_find_entry: bad directory entry: rec_len is smaller than minimal
Dec 15 02:07:30 vorlon last message repeated 2 times
Dec 15 02:11:43 vorlon kernel: EXT2-fs error (device 8/2): ext2_find_entry: bad directory entry: rec_len is smaller than minimal
Dec 15 02:12:14 vorlon kernel: EXT2-fs error (device 8/2): ext2_readdir: bad directory entry: rec_len is smaller than minimal
Dec 15 02:18:13 vorlon kernel: EXT2-fs error (device 8/2): ext2_find_entry: bad directory entry: rec_len is smaller than minimal
Dec 15 02:18:15 vorlon kernel: EXT2-fs error (device 8/2): ext2_find_entry: bad directory entry: rec_len is smaller than minimal
.....
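To make the two error strings in these logs more concrete, here is a rough sketch of the kind of sanity check that would produce them. It is modelled on the messages, not copied from the kernel. The struct layout, the REC_LEN macro, and the name check_dir_entry are assumptions for illustration, and the two middle messages are illustrative wording that does not come from the logs.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed on-disk layout: a fixed 8-byte header followed by the name.
 * (Hypothetical struct; the real ext2 definition may differ.) */
struct dirent_hdr {
    uint32_t inode;     /* inode number, 0 if the slot is unused */
    uint16_t rec_len;   /* distance in bytes to the next entry */
    uint16_t name_len;  /* length of the name that follows */
};

#define HDR_SIZE 8u
/* Record length needed for an n-byte name, rounded up to a multiple of 4. */
#define REC_LEN(n) ((HDR_SIZE + (n) + 3u) & ~3u)

/* Returns NULL if the entry at byte `offset` inside a directory block of
 * `block_size` bytes looks sane, otherwise a short description of what
 * is wrong with it. */
const char *check_dir_entry(const struct dirent_hdr *de,
                            size_t offset, size_t block_size)
{
    if (de->rec_len < REC_LEN(1u))
        return "rec_len is smaller than minimal";
    if (de->rec_len % 4u != 0)
        return "rec_len is not a multiple of 4";       /* illustrative text */
    if (de->rec_len < REC_LEN(de->name_len))
        return "rec_len is too small for the name";    /* illustrative text */
    if (offset + de->rec_len > block_size)
        return "directory entry across blocks";
    return NULL;
}
```

Under these assumptions, both messages in the logs mean that the rec_len field of a directory entry held a value that cannot be right, which fits directory blocks having been overwritten with unrelated data.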
A number of other strange things have been noticed which may be
causing problems. The TIME_WAIT problem still seems to occur
even with the patch to 1.1.73 Linus sent out yesterday.
Some messages from my syslog:
Dec 20 01:09:11 foundation kernel: Trying to free free memory (002fd000): memory probably corrupted
Dec 20 01:09:11 foundation kernel: PC = 0011a935
Dec 20 14:40:40 foundation kernel: Trying to free free memory (006ec000): memory probably corrupted
Dec 20 14:40:40 foundation kernel: PC = 0011a935
0011a8bc t _free_one_table
0011a9cc T _clear_page_tables
This still happens under 1.1.73. My machine is VESA local bus and the
video card uses a linear aperture in super high mem, but I've gotten
this message even in a console without X running.
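The two symbol lines quoted above are there to place the reported PC. A minimal sketch of that lookup follows, assuming a table sorted by address that holds only the two entries from this message (a real symbol map has many more). The names ksym, symtab, and lookup_pc are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct ksym {
    uint32_t    addr;   /* start address of the symbol */
    const char *name;
};

/* Only the two entries quoted in this message, sorted by address. */
static const struct ksym symtab[] = {
    { 0x0011a8bcu, "_free_one_table" },
    { 0x0011a9ccu, "_clear_page_tables" },
};

/* Return the symbol with the highest start address that is <= pc, or
 * NULL if pc lies below every symbol in the table. If `offset` is not
 * NULL and a symbol is found, store pc minus the symbol's address. */
const struct ksym *lookup_pc(uint32_t pc, uint32_t *offset)
{
    const struct ksym *best = NULL;
    size_t i;

    for (i = 0; i < sizeof symtab / sizeof symtab[0]; i++) {
        if (symtab[i].addr > pc)
            break;              /* table is sorted, nothing later can match */
        best = &symtab[i];
    }
    if (best != NULL && offset != NULL)
        *offset = pc - best->addr;
    return best;
}
```

For PC = 0011a935 this yields _free_one_table plus 0x79, so the "Trying to free free memory" message is triggered from inside _free_one_table.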
Note that when Linux AFS starts up (by getting loaded with insmod),
the kernel grows quickly and pushes the limits of
memory (possibly due to aggressive buffering?), and kmalloc
starts failing:
Dec 20 14:46:57 foundation kernel: Starting AFS cache scan...Couldn't get a free page.....
Dec 20 14:46:57 foundation kernel: osi_Alloc: kmalloc returned NULL allocing 84000
The "Couldn't get a free page" message is sometimes printed by kmalloc
before it returns NULL. Linux AFS checks for
this and deals with it, but I've looked through other portions
of the kernel which do *NOT* deal with kmalloc returning
NULL. This is VERY bad, as kmalloc can return NULL and
many things in the kernel do a kmalloc and then dereference
the result immediately without checking.
It may be that kmalloc returns NULL to some other parts of
the kernel at bootup and then bad things start happening.
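The pattern being asked for is simple. Here is a minimal userspace sketch with malloc standing in for kmalloc; the structure and the function name cache_entry_new are hypothetical and do not come from the kernel or from Linux AFS.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical structure, for illustration only. */
struct cache_entry {
    int  id;
    char name[32];
};

/* Allocate and initialise an entry. Returns NULL if the allocation
 * fails, so the caller can back out instead of crashing.
 * malloc() stands in for kmalloc() here. */
struct cache_entry *cache_entry_new(int id, const char *name)
{
    struct cache_entry *e = malloc(sizeof *e);

    if (e == NULL)          /* check BEFORE the first dereference */
        return NULL;

    e->id = id;
    strncpy(e->name, name, sizeof e->name - 1);
    e->name[sizeof e->name - 1] = '\0';   /* strncpy may not terminate */
    return e;
}
```

The code being complained about is the same thing without the NULL test, in which case the first store goes through a null pointer.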
The ext2 lossage has been noticed enough times that
it might not be the random occurrence I would expect
a failed kmalloc to cause. Machines this has been noticed
on range from old ISA machines with SCSI to VLB machines
with IDE to PCI bus machines.
I've been having problems ever since 1.1.61 with memory corruption,
etc. (I made a report of them here a few weeks ago). Things
seemed to have gotten mostly better until this problem, which
seems to have started in 1.1.72. This problem is most likely
connected in some way with Linux AFS, but other parts of the
kernel may be involved. If anyone needs more info, please ask
and I'll try to provide it. Any ideas?
Thanks,
Erik Nygren
___________________________________________________________________________
Erik Nygren \ \ \ Massachusetts Institute of Technology
450 Memorial Drive \ \ \ Email: nygren@mit.edu Voice: 617/225-9297
Cambridge, MA 02139 \ \ \ http://www.mit.edu:8001/people/nygren/home.html