From: patl@ATHENA.MIT.EDU
To: hotline@ATHENA.MIT.EDU
Cc: mjhostet@ATHENA.MIT.EDU
Date: Thu, 02 May 91 14:24:37 EDT

------- Forwarded Message

Received: by ATHENA-PO-1.MIT.EDU (5.45/4.7) id AA03053; Tue, 30 Apr 91 00:23:42 EDT
Received: from TISIPHONE.MIT.EDU by ATHENA.MIT.EDU with SMTP id AA19514; Tue, 30 Apr 91 00:23:36 EDT
Received: by tisiphone.MIT.EDU (5.61/4.7) id AA03407; Tue, 30 Apr 91 00:23:33 -0400
Date: Tue, 30 Apr 91 00:23:33 -0400
From: David Krikorian <dkk@ATHENA.MIT.EDU>
Message-Id: <9104300423.AA03407@tisiphone.MIT.EDU>
To: patl@ATHENA.MIT.EDU
Subject: Joker
Reply-To: dkk@mit.edu
Home: 47 Lake St., Arlington, MA 02174, (617) 646-9289
Office: MIT Bldg. E40-358A, (617) 253-8651, 258-8736 (fax)

Joker is having some serious problems on ra2. Here are some relevant syslogs:

- ------------
Aug 27 18:40:06 joker vmunix: ra0c: hard error sn0 OFFLINE
[...]
Sep 12 20:30:33 joker vmunix: ra2c: hard error sn2496
ra2c: hard error sn2496
Sep 16 17:40:32 joker vmunix: ra2c: hard error sn2480
[...]
Sep 17 15:38:58 joker vmunix: ra2c: hard error sn2496
Feb 10 23:20:12 joker vmunix: ra2c: hard error sn16496
<4>uda0: soft error, <4>unknown error, unit 2, format 011, event 024
[...]
Mar 9 17:32:32 joker vmunix: ra2c: hard error sn16496
Apr 23 00:20:24 joker vmunix: ra2c: hard error sn2880
Apr 23 00:20:25 joker vmunix: ra2c: hard error sn2880
Apr 28 16:16:35 joker vmunix: ra2c: hard error sn36212
[...]
[The disk-checking I did (see below) generated further errors on all
these blocks: 1440, 2640, 4160, 6144, 6208, 16192, 20288, 24240, 36212,
38464, 50496, 64576, 72768]
- ------------

I ran icheck to determine which inodes (ie: files) contained the bad
blocks, and got:

- ------------
joker# /etc/icheck -b 2480 2496 16496 36212 /dev/rra2c
/dev/rra2c:
2480 arg; frag 0 of 8, inode=386, class=logical data block 8
2496 arg; frag 0 of 8, inode=387, class=logical data block 0
16496 arg; frag 0 of 8, inode=1952, class=logical data block 18
9470 dup frag; inode=3904, class=logical data block 0
9470 dup frag; inode=3904, class=logical data block 0
9526 dup frag; inode=3905, class=logical data block 0
9526 dup frag; inode=3905, class=logical data block 0
[...]
9590 dup frag; inode=3912, class=logical data block 0
9391 dup frag; inode=3913, class=logical data block 0
9624 dup block; inode=3914, class=logical data block 0
9619 dup frag; inode=3915, class=logical data block 0
9619 dup frag; inode=3915, class=logical data block 0
[...]
9688 dup frag; inode=3931, class=logical data block 0
36212 arg; frag 0 of 1, inode=6727, class=logical data block 0
files 2523 (r=2410,d=90,b=0,c=0,sl=23)
used 26682 (i=35,ii=0,b=2609,f=5530)
free 31114 (b=3833,f=450)
missing 7194
- ------------

The bad blocks (reported in the logs) are in inodes 386, 387, 1952 and
6727. I think fsck should clean up all the cruft in the middle of the
icheck output. When you reboot, your RVD disk isn't getting fsck'd.
*After* the hard errors are dealt with properly (not before!) the RVD
on ra2 should be mounted exclusive and fsck'd.
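
The block-to-inode mapping read off the icheck output above can also be extracted mechanically. What follows is a minimal Python sketch and not part of the original message: the sample text is abridged from the quoted output, and the function name bad_block_inodes and the regular expression are assumptions made for illustration, written only against the line format shown here.

```python
import re

# Abridged sample copied from the icheck output quoted in the message.
ICHECK_OUTPUT = """\
2480 arg; frag 0 of 8, inode=386, class=logical data block 8
2496 arg; frag 0 of 8, inode=387, class=logical data block 0
16496 arg; frag 0 of 8, inode=1952, class=logical data block 18
9470 dup frag; inode=3904, class=logical data block 0
36212 arg; frag 0 of 1, inode=6727, class=logical data block 0
"""

# "arg" lines report the blocks that were passed with -b on the command
# line; "dup" lines are unrelated duplicate-fragment reports and are skipped.
ARG_LINE = re.compile(r"^(\d+) arg; frag \d+ of \d+, inode=(\d+),")


def bad_block_inodes(text):
    """Return {block_number: inode_number} for every 'arg' line in text."""
    mapping = {}
    for line in text.splitlines():
        match = ARG_LINE.match(line.strip())
        if match:
            mapping[int(match.group(1))] = int(match.group(2))
    return mapping


mapping = bad_block_inodes(ICHECK_OUTPUT)
inodes = sorted(set(mapping.values()))
print(mapping)
print(inodes)
```

The sorted inode list is what would be handed to ncheck -i in the next step.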

I then ran ncheck to figure out what filenames were associated with the
bad blocks:

- ------------
joker# /etc/ncheck -i 386 387 1952 6727 /dev/rra2c
/dev/rra2c:
ncheck: read error 4160
ncheck: read error 6208
ncheck: read error 14400
ncheck: read error 16192
ncheck: read error 20288
ncheck: read error 38464
ncheck: read error 64576
ncheck: read error 72768
[almost the same 9 messages, again]
386 ???//lib/runtime.bin
387 ???//lib/runtime.com
ncheck: read error 4160
ncheck: read error 6208
1952 /CLU/exe/link
ncheck: read error 14400
ncheck: read error 16192
ncheck: read error 20288
6727 ???//matlab.m
ncheck: read error 38464
ncheck: read error 50496
ncheck: read error 64576
ncheck: read error 72768
- ------------

One of the known bad blocks is in /CLU/exe/link. (The filesystem root
'/' should be the root of the RVD, not the root of Joker's filesystem.)
Others are in runtime.bin and runtime.com (in a directory named lib/
somewhere), and another named matlab.m. The "???" indicates that
ncheck couldn't tell what the parent directory was.

The rest of the "read error" messages indicate, as you might expect,
that something is very wrong with your ra2 or the hard disk controller.
From the hard errors on ra0 earlier on, I'd suspect the controller
first, though even if it is just the controller, it probably already
caused some filesystem damage which needs to be fixed. If the disk is
at fault, the hard errors could be cleared with /tp/rqbads, but the
disk would probably need to be replaced soon, anyway.

The longer this problem is present, the worse it's going to get.

You may want to forward this message to hotline.

------- End of Forwarded Message
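
The ncheck output quoted in the message mixes inode-to-name lines with "read error" lines, and the message reads "???" as a parent directory ncheck could not resolve. As a rough Python sketch, not part of the original message, the two kinds of line can be separated as below. The sample is abridged from the quoted output, and parse_ncheck and both regular expressions are hypothetical names, written only against the line format shown in the message.

```python
import re

# Abridged sample copied from the ncheck output quoted in the message.
NCHECK_OUTPUT = """\
ncheck: read error 4160
ncheck: read error 6208
386 ???//lib/runtime.bin
387 ???//lib/runtime.com
ncheck: read error 4160
ncheck: read error 6208
1952 /CLU/exe/link
ncheck: read error 14400
6727 ???//matlab.m
ncheck: read error 38464
"""

ERROR_LINE = re.compile(r"^ncheck: read error (\d+)$")
NAME_LINE = re.compile(r"^(\d+)\s+(\S+)$")


def parse_ncheck(text):
    """Return ({inode: path}, {block numbers that gave read errors})."""
    names = {}
    read_errors = set()
    for line in text.splitlines():
        line = line.strip()
        error = ERROR_LINE.match(line)
        if error:
            read_errors.add(int(error.group(1)))
            continue
        name = NAME_LINE.match(line)
        if name:
            names[int(name.group(1))] = name.group(2)
    return names, read_errors


names, read_errors = parse_ncheck(NCHECK_OUTPUT)
# Paths starting with "???" have an unresolved parent directory.
unresolved = sorted(i for i, path in names.items() if path.startswith("???"))
print(names)
print(unresolved)
print(sorted(read_errors))
```

Collecting the read-error blocks in a set removes the repeats, which makes it easier to compare them against the sector numbers in the syslog entries.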