[4865] in Hotline Meeting


joker.mit.edu has a bad RD53

daemon@ATHENA.MIT.EDU (patl@ATHENA.MIT.EDU)
Thu May 2 14:26:11 1991

From: patl@ATHENA.MIT.EDU
To: hotline@ATHENA.MIT.EDU
Cc: mjhostet@ATHENA.MIT.EDU
Date: Thu, 02 May 91 14:24:37 EDT


------- Forwarded Message

Received: by ATHENA-PO-1.MIT.EDU (5.45/4.7) id AA03053; Tue, 30 Apr 91 00:23:42 EDT
Received: from TISIPHONE.MIT.EDU by ATHENA.MIT.EDU with SMTP
	id AA19514; Tue, 30 Apr 91 00:23:36 EDT
Received: by tisiphone.MIT.EDU (5.61/4.7) id AA03407; Tue, 30 Apr 91 00:23:33 -0400
Date: Tue, 30 Apr 91 00:23:33 -0400
From: David Krikorian <dkk@ATHENA.MIT.EDU>
Message-Id: <9104300423.AA03407@tisiphone.MIT.EDU>
To: patl@ATHENA.MIT.EDU
Subject: Joker
Reply-To: dkk@mit.edu
Home: 47 Lake St., Arlington, MA 02174, (617) 646-9289
Office: MIT Bldg. E40-358A, (617) 253-8651, 258-8736 (fax)


Joker is having some serious problems on ra2.  Here are some relevant
syslog entries:
- ------------

Aug 27 18:40:06 joker vmunix: ra0c: hard error sn0 OFFLINE
[...]
Sep 12 20:30:33 joker vmunix: ra2c: hard error sn2496 ra2c: hard error sn2496 
Sep 16 17:40:32 joker vmunix: ra2c: hard error sn2480 
[...]
Sep 17 15:38:58 joker vmunix: ra2c: hard error sn2496 
Feb 10 23:20:12 joker vmunix: ra2c: hard error sn16496 <4>uda0: soft error, <4>unknown error, unit 2, format 011, event 024
[...]
Mar  9 17:32:32 joker vmunix: ra2c: hard error sn16496 
Apr 23 00:20:24 joker vmunix: ra2c: hard error sn2880 
Apr 23 00:20:25 joker vmunix: ra2c: hard error sn2880 
Apr 28 16:16:35 joker vmunix: ra2c: hard error sn36212 
[...]

[The disk-checking I did (see below) generated further errors on all
these blocks: 1440, 2640, 4160, 6144, 6208, 16192, 20288, 24240,
36212, 38464, 50496, 64576, 72768]
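
A minimal sketch of how the failing block numbers could be collected
from syslog lines like the ones quoted above.  It assumes the format
shown ("<partition>: hard error sn<number>"); the function name
hard_errors_by_partition and the sample list are hypothetical, not part
of any existing tool.

```python
import re
from collections import defaultdict

# Assumed format, taken from the excerpts above, e.g.
#   "ra2c: hard error sn2496"
HARD_ERROR = re.compile(r"\b(ra\d+[a-h]): hard error sn(\d+)")


def hard_errors_by_partition(lines):
    """Return {partition: sorted list of unique numbers reported after 'sn'}."""
    found = defaultdict(set)
    for line in lines:
        # findall, because one syslog line can carry the same report twice
        for partition, number in HARD_ERROR.findall(line):
            found[partition].add(int(number))
    return {part: sorted(numbers) for part, numbers in found.items()}


# A few of the lines quoted in the message, used as sample input.
sample = [
    "Aug 27 18:40:06 joker vmunix: ra0c: hard error sn0 OFFLINE",
    "Sep 12 20:30:33 joker vmunix: ra2c: hard error sn2496 ra2c: hard error sn2496 ",
    "Sep 16 17:40:32 joker vmunix: ra2c: hard error sn2480 ",
    "Feb 10 23:20:12 joker vmunix: ra2c: hard error sn16496 <4>uda0: soft error, "
    "<4>unknown error, unit 2, format 011, event 024",
    "Apr 23 00:20:24 joker vmunix: ra2c: hard error sn2880 ",
    "Apr 28 16:16:35 joker vmunix: ra2c: hard error sn36212 ",
]

result = hard_errors_by_partition(sample)
for partition, numbers in result.items():
    print(partition, numbers)
```

Grouping by partition also makes it easy to see whether more than one
drive is reporting errors, which bears on whether the disk or the
controller is the likelier culprit.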

- ------------
I ran icheck to determine which inodes (i.e., files) contained the bad
blocks, and got:
- ------------

joker# /etc/icheck -b 2480 2496 16496 36212 /dev/rra2c
/dev/rra2c:
2480 arg; frag 0 of 8, inode=386, class=logical data block 8
2496 arg; frag 0 of 8, inode=387, class=logical data block 0
16496 arg; frag 0 of 8, inode=1952, class=logical data block 18
9470 dup frag; inode=3904, class=logical data block 0
9470 dup frag; inode=3904, class=logical data block 0
9526 dup frag; inode=3905, class=logical data block 0
9526 dup frag; inode=3905, class=logical data block 0
[...]
9590 dup frag; inode=3912, class=logical data block 0
9391 dup frag; inode=3913, class=logical data block 0
9624 dup block; inode=3914, class=logical data block 0
9619 dup frag; inode=3915, class=logical data block 0
9619 dup frag; inode=3915, class=logical data block 0
[...]
9688 dup frag; inode=3931, class=logical data block 0
36212 arg; frag 0 of 1, inode=6727, class=logical data block 0
files   2523 (r=2410,d=90,b=0,c=0,sl=23)
used   26682 (i=35,ii=0,b=2609,f=5530)
free   31114 (b=3833,f=450)
missing 7194

- ------------
The bad blocks reported in the logs are in inodes 386, 387, 1952 and
6727.  I think fsck should clean up the "dup frag" and "dup block"
entries in the middle of the icheck output.  When you reboot, your RVD
disk isn't getting fsck'd.  *After* the hard errors are dealt with
properly (not before!), the RVD on ra2 should be mounted exclusive and
fsck'd.
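
The block-to-inode step can be sketched as follows.  This assumes the
"arg" line layout shown in the icheck output above; blocks_to_inodes is
a hypothetical helper, and only the "arg" lines are read (the "dup"
lines are ignored).

```python
import re

# Assumed layout of the lines icheck prints for blocks named with -b, e.g.
#   "2480 arg; frag 0 of 8, inode=386, class=logical data block 8"
ARG_LINE = re.compile(r"^(\d+) arg; frag \d+ of \d+, inode=(\d+),")


def blocks_to_inodes(icheck_output):
    """Map each requested block number to the inode reported as owning it."""
    mapping = {}
    for line in icheck_output.splitlines():
        match = ARG_LINE.match(line.strip())
        if match:
            mapping[int(match.group(1))] = int(match.group(2))
    return mapping


# Sample input copied from the icheck output quoted in the message.
sample = """\
/dev/rra2c:
2480 arg; frag 0 of 8, inode=386, class=logical data block 8
2496 arg; frag 0 of 8, inode=387, class=logical data block 0
16496 arg; frag 0 of 8, inode=1952, class=logical data block 18
9470 dup frag; inode=3904, class=logical data block 0
36212 arg; frag 0 of 1, inode=6727, class=logical data block 0
"""

inodes = blocks_to_inodes(sample)
# Inode numbers in the form they would be passed on a command line.
ncheck_args = " ".join(str(i) for i in sorted(set(inodes.values())))
print(ncheck_args)
```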

I then ran ncheck to figure out what filenames were associated with
the bad blocks:
- ------------

joker# /etc/ncheck -i 386 387 1952 6727 /dev/rra2c
/dev/rra2c:
ncheck: read error 4160
ncheck: read error 6208
ncheck: read error 14400
ncheck: read error 16192
ncheck: read error 20288
ncheck: read error 38464
ncheck: read error 64576
ncheck: read error 72768
[almost the same 9 messages, again]
386     ???//lib/runtime.bin
387     ???//lib/runtime.com
ncheck: read error 4160
ncheck: read error 6208
1952    /CLU/exe/link
ncheck: read error 14400
ncheck: read error 16192
ncheck: read error 20288
6727    ???//matlab.m
ncheck: read error 38464
ncheck: read error 50496
ncheck: read error 64576
ncheck: read error 72768

- ------------
One of the known bad blocks is in /CLU/exe/link.  (The filesystem root
'/' should be the root of the RVD, not the root of Joker's
filesystem.)  Two others are in runtime.bin and runtime.com (in a
directory named lib/ somewhere), and the fourth is in a file named
matlab.m.  The "???" indicates that ncheck couldn't tell what the
parent directory was.  The rest of the "read error" messages indicate,
as you might expect, that something is very wrong with your ra2 or the
hard disk controller.  Given the hard errors on ra0 earlier on, I'd
suspect the controller first, though even if it is just the
controller, it has probably already caused some filesystem damage
which needs to be fixed.  If the disk is at fault, the hard errors
could be cleared with /tp/rqbads, but the disk would probably need to
be replaced soon anyway.
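
A rough sketch of separating the two kinds of ncheck output discussed
above: inode-to-name lines (noting which names start with "???", i.e.
have an unresolved parent directory) and "read error" lines.  The line
formats are assumed from the transcript, and parse_ncheck is a
hypothetical helper.

```python
import re

# Assumed result line: inode number, whitespace, then a path.
NAME_LINE = re.compile(r"^(\d+)\s+(\S.*)$")
READ_ERROR_PREFIX = "ncheck: read error "


def parse_ncheck(output):
    """Return ({inode: (path, parent_known)}, sorted unique read-error blocks)."""
    names = {}
    read_errors = set()
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith(READ_ERROR_PREFIX):
            read_errors.add(int(line[len(READ_ERROR_PREFIX):]))
            continue
        match = NAME_LINE.match(line)
        if match:
            path = match.group(2)
            # "???" at the start means the parent directory was not found
            names[int(match.group(1))] = (path, not path.startswith("???"))
    return names, sorted(read_errors)


# Sample input abridged from the ncheck output quoted in the message.
sample = """\
/dev/rra2c:
ncheck: read error 4160
ncheck: read error 6208
386     ???//lib/runtime.bin
387     ???//lib/runtime.com
ncheck: read error 4160
1952    /CLU/exe/link
6727    ???//matlab.m
ncheck: read error 38464
"""

names, read_errors = parse_ncheck(sample)
for inode, (path, parent_known) in names.items():
    print(inode, path, "" if parent_known else "(parent directory unknown)")
print("read errors at:", read_errors)
```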

The longer this problem goes unfixed, the worse it's going to get.

You may want to forward this message to hotline.


------- End of Forwarded Message

