[864] in SIPB_Linux_Development


ext2 files getting unlinked under 1.1.7[23] and other things

daemon@ATHENA.MIT.EDU (Erik Nygren)
Tue Dec 20 16:51:43 1994

To: linux-kernel@vger.rutgers.edu, linux-afs-bugs@MIT.EDU
Cc: warlord@MIT.EDU, yonah@MIT.EDU, jered@MIT.EDU, linux-dev@MIT.EDU
Date: Tue, 20 Dec 1994 16:51:30 -0500
From: Erik Nygren <nygren@MIT.EDU>


Hello,

A number of machines here at MIT, all running 1.1.7[23] with Linux AFS,
have shown ext2 file corruption.  This is a BadThing[tm].  File
corruption seems to occur in very localized areas on the disk.  In the
most recent case on my machine, I was able to look the files up in the
Slackware package list from my install and discovered that the lost
files formed a contiguous chunk of that list:

usr/sbin/routed
usr/sbin/rpc.bwnfsd
usr/sbin/rpc.mountd
usr/sbin/rpc.nfsd     <-- start of lossage
usr/sbin/rpc.pcnfsd
usr/sbin/rpc.portmap
usr/sbin/rpc.ugidd
usr/sbin/rwhod
usr/sbin/sendlog
usr/sbin/tcpd
usr/sbin/in.ftpd.NetBSD  <-- end of lossage
usr/sbin/showmount
usr/sbin/in.rexecd

Oddly, /usr/sbin/syslogd and /usr/sbin/klogd and possibly other files
also went away.  By my count this is about 150K, but I think it's the
number of files that matters.  The previous time I noticed this, a
chunk of my dotfiles disappeared from my homedir.  Those files were
all quite small and about the same number got trashed.  The problem
isn't that they really get unlinked.  When fsck gets run on the
partition, often because the maximal mount count has been reached, it
sees the files as being not properly unlinked (I don't have a record
of the exact message) and proceeds to unlink them.  A large
block of bitmap differences was also recorded.  I've only seen this
problem on root partitions.  In another instance, I noticed that some of
my dotfiles were not accessible.  I did an ls -l and saw:

....
-rw-r--r--   1 nygren   root          438 Feb  8  1994 .Xmodmap
-rwxr-xr-x   1 nygren   root        11773 Nov 19 15:29 .Xresources*
-rw-r--r--   1 nygren   root            0 Feb  8  1994 .addressbook
c--s-wS-wx 20527 13125    17735     10,  58 Aug 15  2011 .aliases
c---rwx--- 8250 8250     8275      73,  84 Oct 23  2014 .anyone
-rw-r--r--   1 nygren   root         1160 Feb  8  1994 .anyone.bak
-rw-r--r--   1 nygren   root         6684 Nov 22 22:25 .anyone.who
cr-s--Sr-- 12115 16971    17225     75,  32 Dec  9  2009 .bash_history
-rw-r--r--   1 nygren   root           57 Feb  8  1994 .bash_logout
d--Sr----T 12845 19781    17232    959661356 Nov  7  2006 .bash_profile/
dr-x-wS-w- 19759 11600    20047    1163013170 Mar 23  2005 .bashrc/
-rw-r--r--   1 nygren   root            5 Feb  8  1994 .cmdf_pid
-rw-r--r--   1 nygren   root           81 Dec 10 01:22 .comicsrc
-rw-r--r--   1 nygren   root            0 Dec  4 23:42 .crefrc
-rw-r--r--   1 nygren   root          474 Aug 30 16:30 .dayplan
-rw-r--r--   1 nygren   root           89 Aug 30 16:21 .dayplan.bak
-rw-------   1 nygren   root            0 Aug 30 16:30 .dayplan.priv
-rw-------   1 nygren   root            0 Aug 30 16:21 .dayplan.priv.bak
-rw-r--r--   1 nygren   root          724 Dec  9 17:44 .doomrc
d--x--S--x 17751 16717    20304    540680259 Oct 27  2014 .emacs/
-rw-r--r--   1 nygren   root         3204 Apr  7  1994 .emacs.old
-rw-r--r--   1 nygren   root         3511 Jun 23 04:12 .emacs.sgi
dr-s--Sr-T 17952 8275     14880    1330121274 Jul 24 23:53 .emacs~/
-rwxr-xr-x   1 nygren   root         1062 Jan 28  1994 .environment*
-rw-r--r--   1 nygren   root           62 Oct 29 02:06 .eolcr
-rw-r--r--   1 nygren   root           15 Feb  8  1994 .forward
-rw-r--r--   1 nygren   root        13054 Sep  8 02:14 .fvwmrc
....

I don't think .bash_history will work all that well as
a setuid character device....  :-/  Anyways, 
this stayed like this for a short while and then
suddenly returned to normal!  I have no idea what caused
this sudden change of heart as I didn't reboot or fsck or
anything.  The inode numbers of these are in the 6432[0-9] range
(ie .anyone is 64322).
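
For readers puzzling over listings like the one above, here is a
minimal sketch of how an inode's 16-bit mode word turns into the
ls -l string.  It assumes the usual Unix octal layout (file type in
the top four bits, then setuid/setgid/sticky, then rwx triples).  The
function name mode_string is made up for illustration, and this is
not ls's actual source.  It shows why garbage written over an inode
can present a dotfile as a setuid character device.

```c
#include <assert.h>
#include <stddef.h>

/* Render a Unix mode word the way ls -l does.
 * buf must have room for 10 characters plus the terminating NUL. */
void mode_string(unsigned int mode, char buf[11])
{
    static const char rwx[] = "rwxrwxrwx";
    int i;

    assert(buf != NULL);

    /* The top four bits select the file type. */
    switch (mode & 0170000) {
    case 0040000: buf[0] = 'd'; break;  /* directory */
    case 0020000: buf[0] = 'c'; break;  /* character device */
    case 0060000: buf[0] = 'b'; break;  /* block device */
    case 0100000: buf[0] = '-'; break;  /* regular file */
    case 0120000: buf[0] = 'l'; break;  /* symbolic link */
    case 0010000: buf[0] = 'p'; break;  /* FIFO */
    case 0140000: buf[0] = 's'; break;  /* socket */
    default:      buf[0] = '?'; break;  /* not a valid type */
    }

    /* Nine permission bits, most significant first. */
    for (i = 0; i < 9; i++)
        buf[1 + i] = (mode & (0400u >> i)) ? rwx[i] : '-';

    /* setuid, setgid and sticky replace the execute column:
     * lower case if execute is also set, upper case if not. */
    if (mode & 04000) buf[3] = (mode & 0100) ? 's' : 'S';
    if (mode & 02000) buf[6] = (mode & 0010) ? 's' : 'S';
    if (mode & 01000) buf[9] = (mode & 0001) ? 't' : 'T';

    buf[10] = '\0';
}
```

Under these assumptions a mode word of 026504 (octal) renders as
"cr-s--Sr--", the string shown for .bash_history, while an ordinary
0100644 renders as "-rw-r--r--".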

Some messages from my syslog:

.....
Dec 11 19:13:18 foundation kernel: Unable to handle kernel NULL pointer dereference at virtual address c0000004
Dec 11 19:13:18 foundation kernel: current->tss.cr3 = 007d4000, Xr3 = 007d4000
Dec 11 19:13:18 foundation kernel: *pde = 00102027
Dec 11 19:13:18 foundation kernel: *pte = 00000027
Dec 11 19:13:26 foundation kernel: EXT2-fs error (device 3/7): ext2_find_entry: bad directory entry: rec_len is smaller than minimal
Dec 11 19:15:05 foundation syslogd: exiting on signal 15
Dec 11 19:19:22 foundation kernel: EXT2-fs warning (device 3/69): ext2_free_inode: bit already cleared for inode 64335
Dec 11 19:19:22 foundation kernel: EXT2-fs warning (device 3/69): ext2_free_inode: bit already cleared for inode 64335
....



(Those are from the first case, where my dotfiles went away.)
Another user reported on another machine:


....
Dec 15 01:34:33 vorlon kernel: EXT2-fs error (device 8/2): ext2_find_entry: bad directory entry: directory entry across blocks
Dec 15 01:34:56 vorlon last message repeated 2 times
Dec 15 01:50:09 vorlon kernel: EXT2-fs error (device 8/2): ext2_find_entry: bad directory entry: directory entry across blocks
.....
Dec 15 02:02:05 vorlon kernel: EXT2-fs error (device 8/2): ext2_find_entry: bad directory entry: rec_len is smaller than minimal
Dec 15 02:02:14 vorlon last message repeated 2 times
Dec 15 02:07:21 vorlon kernel: EXT2-fs error (device 8/2): ext2_find_entry: bad directory entry: rec_len is smaller than minimal
Dec 15 02:07:30 vorlon last message repeated 2 times
Dec 15 02:11:43 vorlon kernel: EXT2-fs error (device 8/2): ext2_find_entry: bad directory entry: rec_len is smaller than minimal
Dec 15 02:12:14 vorlon kernel: EXT2-fs error (device 8/2): ext2_readdir: bad directory entry: rec_len is smaller than minimal
Dec 15 02:18:13 vorlon kernel: EXT2-fs error (device 8/2): ext2_find_entry: bad directory entry: rec_len is smaller than minimal
Dec 15 02:18:15 vorlon kernel: EXT2-fs error (device 8/2): ext2_find_entry: bad directory entry: rec_len is smaller than minimal
.....
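
The two complaints in these logs ("rec_len is smaller than minimal"
and "directory entry across blocks") are sanity checks on ext2
directory records.  Below is a rough userspace sketch of checks of
that kind.  It is not the kernel's code: it assumes an 8-byte record
header (inode, rec_len, name_len) followed by the name and padded to
a multiple of 4, and the names check_dir_entry, enum dir_err and
DIR_REC_LEN are made up for illustration.  The order of the checks
and the extra cases beyond the two logged messages are assumptions
as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest record that can hold a name of the given length:
 * 8 header bytes plus the name, rounded up to a multiple of 4. */
#define DIR_REC_LEN(name_len) (((size_t)(name_len) + 8 + 3) & ~(size_t)3)

enum dir_err {
    DIR_OK,
    DIR_REC_TOO_SMALL,       /* "rec_len is smaller than minimal" */
    DIR_REC_UNALIGNED,       /* rec_len is not a multiple of 4 */
    DIR_REC_SHORT_FOR_NAME,  /* rec_len cannot hold name_len bytes */
    DIR_ACROSS_BLOCKS,       /* "directory entry across blocks" */
    DIR_INODE_OUT_OF_BOUNDS  /* inode number beyond the inode count */
};

/* Validate one directory record that starts `offset` bytes into a
 * directory block of `block_size` bytes. */
enum dir_err check_dir_entry(uint32_t inode, uint16_t rec_len,
                             uint16_t name_len, size_t offset,
                             size_t block_size, uint32_t inodes_count)
{
    assert(offset < block_size);

    if (rec_len < DIR_REC_LEN(1))
        return DIR_REC_TOO_SMALL;
    if (rec_len % 4 != 0)
        return DIR_REC_UNALIGNED;
    if (rec_len < DIR_REC_LEN(name_len))
        return DIR_REC_SHORT_FOR_NAME;
    if (offset + rec_len > block_size)
        return DIR_ACROSS_BLOCKS;
    if (inode > inodes_count)
        return DIR_INODE_OUT_OF_BOUNDS;
    return DIR_OK;
}
```

With these assumptions, a record whose rec_len field has been
overwritten with a value below 12 trips the first check, and a
16-byte record starting 1016 bytes into a 1024-byte block trips the
"across blocks" check.  Either is what one would expect to see if
random data landed in a directory block.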


A number of other strange things have been noticed which may be
causing problems.  The TIME_WAIT problem still seems to occur 
even with the patch to 1.1.73 Linus sent out yesterday.
Some messages from my syslog:

Dec 20 01:09:11 foundation kernel: Trying to free free memory (002fd000): memory probably corrupted
Dec 20 01:09:11 foundation kernel: PC = 0011a935

Dec 20 14:40:40 foundation kernel: Trying to free free memory (006ec000): memory probably corrupted
Dec 20 14:40:40 foundation kernel: PC = 0011a935

0011a8bc t _free_one_table
0011a9cc T _clear_page_tables

This still happens under 1.1.73.  My machine is VESA local bus and the
video card uses a linear aperture in super high mem, but I've gotten
this message even in a console without X running.
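
The two symbol addresses quoted above bracket the reported PC.  As a
small sketch of that lookup, the code below finds the last symbol at
or below a given address in a sorted table.  The struct ksym type,
the ksym_lookup function and the two-entry table are illustrative
only, and the sketch assumes the two quoted symbols are adjacent in
the kernel's symbol map.

```c
#include <assert.h>
#include <stddef.h>

struct ksym {
    unsigned long addr;
    const char *name;
};

/* The two entries quoted from the symbol map, in ascending order. */
static const struct ksym example_syms[] = {
    { 0x0011a8bcUL, "_free_one_table" },
    { 0x0011a9ccUL, "_clear_page_tables" },
};

/* Return the last symbol whose address is <= pc, or NULL if pc lies
 * below the first entry.  The table must be sorted by address. */
const struct ksym *ksym_lookup(const struct ksym *tab, size_t n,
                               unsigned long pc)
{
    const struct ksym *best = NULL;
    size_t i;

    assert(tab != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (tab[i].addr > pc)
            break;
        best = &tab[i];
    }
    return best;
}
```

Looking up 0x0011a935 in this table gives _free_one_table, at an
offset of 0x79 bytes from the start of that function.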

Note that when Linux AFS starts up (by getting loaded with insmod),
the kernel increases in size quickly and pushes the limits of
memory (possibly due to aggressive buffering?) and kmalloc
fails:

Dec 20 14:46:57 foundation kernel: Starting AFS cache scan...Couldn't get a free page.....
Dec 20 14:46:57 foundation kernel: osi_Alloc: kmalloc returned NULL allocing 84000

The "Couldn't get a free page" message is sometimes printed by kmalloc
before it returns NULL.  Linux AFS checks for
this and deals, but I've looked through other portions
of the kernel which do *NOT* deal with kmalloc returning
NULL.  This is VERY bad, as many things in the kernel do a
kmalloc and then dereference the result immediately without
checking.  It may be that kmalloc returns NULL to some other
part of the kernel at bootup and then bad things start happening.
The ext2 lossage has been noticed enough times that
it might not be the random occurrence I would expect
this to cause.  Machines this has been noticed
on range from old ISA machines with SCSI to VLB machines
with IDE to PCI bus machines.
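
To make the kmalloc point concrete, here is a minimal userspace
sketch of the check-before-use pattern.  Nothing in it is kernel or
AFS code: struct cache_entry, cache_entry_init and failing_alloc are
hypothetical names, and the allocator is passed in as a function
pointer standing in for kmalloc so that the failure path can be
exercised on purpose.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct cache_entry {
    size_t len;
    char *data;
};

/* Allocate and zero a buffer for the entry.  `alloc` stands in for
 * kmalloc.  Returns 0 on success or -ENOMEM if the allocation
 * failed, in which case the entry is left untouched. */
int cache_entry_init(struct cache_entry *e, size_t len,
                     void *(*alloc)(size_t))
{
    char *p;

    assert(e != NULL && alloc != NULL);

    p = alloc(len);
    if (p == NULL)          /* check BEFORE touching the memory */
        return -ENOMEM;

    memset(p, 0, len);
    e->data = p;
    e->len = len;
    return 0;
}

/* An allocator that always fails, to simulate memory exhaustion. */
void *failing_alloc(size_t n)
{
    (void)n;
    return NULL;
}
```

Calling cache_entry_init(&e, 84000, failing_alloc) returns -ENOMEM
and leaves e alone, whereas code that skipped the NULL check would
go straight on to write through a null pointer.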

I've been having problems with memory corruption, etc. ever since
1.1.61 (I reported them here a few weeks ago).  They seemed to
have gotten mostly better until this problem, which
seems to have started in 1.1.72.  This problem is most likely
connected in some way with Linux AFS, but other parts of the
kernel may be involved.  If anyone needs more info, please ask
and I'll try to provide it.  Any ideas?

	Thanks,
	Erik Nygren

___________________________________________________________________________
Erik Nygren        \ \ \  Massachusetts Institute of Technology
450 Memorial Drive  \ \ \  Email: nygren@mit.edu  Voice: 617/225-9297
Cambridge, MA 02139  \ \ \  http://www.mit.edu:8001/people/nygren/home.html
