[981] in Hotline Meeting
nfs server crashing
daemon@ATHENA.MIT.EDU (Joe Harrington)
Wed Jun 27 23:14:16 1990
Date: Wed, 27 Jun 90 23:11:01 -0400
From: Joe Harrington <jh@ATHENA.MIT.EDU>
To: hotline@ATHENA.MIT.EDU, olc@ATHENA.MIT.EDU, bugs@ATHENA.MIT.EDU
Cc: jh@ATHENA.MIT.EDU
Reply-To: jh@ATHENA.MIT.EDU
Hotline told me to send this to olc before I found the machine check
panics, Marc Horowitz said to send it to bugs, and machine crashes on
NFS servers ought to be seen by hotline, so it is going to all three.
I have been able to crash (cause to reboot) our RT nfs server
earth.mit.edu with a normal user program similar to dd. The only I/O
it does is with the read and write system calls, and all it does is
read a block of data and write out the first portion of that block.
The program (pickout) is called by the control shell script below
(/earth/u4/jh/saturn/lcgen/bin/dro2fits), which runs pickout from the
same directory and pipes its output to dd. pickout's source is in
/earth/u4/jh/saturn/lcgen/src/pickout/, and should be world-readable
(it's very short). I'll be very happy to talk to someone to
replicate/isolate the bug, as it has brought my work to a halt.
For obvious reasons, I haven't checked to see if it crashes vax
servers like themis.
for i in ${*}; do
    dro=images/dro/oc${i}.dro
    head=images/headers/oc${i}.head.fits
    fits=images/fits/oc${i}.fits
    # start the output file from a copy of the header
    /bin/cp $head $fits
    /bin/chmod 644 $fits
    # append the kept part of each block, byte-swapped
    bin/pickout -header 512 -blocksize 1536 -keep 1488 < $dro |\
        /bin/dd bs=1488 conv=swab >> $fits
done
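For readers without access to the source, the pickout step in the loop
above can be approximated as follows. This is only a sketch in Python
of the behavior described (skip a header, then keep the first part of
each fixed-size block). The real pickout source is not reproduced
here, and the function name, argument names, and the handling of a
short final block are all assumptions.

```python
import io


def pickout(src, dst, header=512, blocksize=1536, keep=1488):
    """Copy the first `keep` bytes of every `blocksize`-byte block.

    Hypothetical stand-in for the pickout program: `src` and `dst` are
    binary file-like objects. Returns the number of blocks read.
    """
    if keep > blocksize:
        raise ValueError("keep must not exceed blocksize")

    src.read(header)  # discard the leading header bytes

    blocks = 0
    while True:
        block = src.read(blocksize)
        if not block:  # end of input
            break
        # Assumption: a short final block is truncated to `keep` bytes
        # like any other block rather than being treated as an error.
        dst.write(block[:keep])
        blocks += 1
    return blocks


if __name__ == "__main__":
    # Tiny in-memory demonstration with made-up sizes.
    demo_in = io.BytesIO(b"HHHH" + b"abcdef" + b"ghijkl")
    demo_out = io.BytesIO()
    pickout(demo_in, demo_out, header=4, blocksize=6, keep=4)
    print(demo_out.getvalue())  # b'abcdghij'
```

With the values from the script (-header 512 -blocksize 1536 -keep
1488), each 1536-byte input block contributes 1488 bytes to the pipe,
which matches the bs=1488 given to dd.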
The error the program gets is:
write: I/O error
3840+0 records in
3840+0 records out
bin/pickout: I/O error
error occured while reading block 3842.
The total file is about 30 megabytes.
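As a rough cross-check of these numbers against the script's
parameters, the byte counts can be worked out as below. The sketch
assumes that each dd record corresponds to exactly one 1536-byte block
read by pickout, and that "about 30 megabytes" refers to the input
.dro file; neither is confirmed by the report.

```python
HEADER = 512        # -header value from the script
BLOCKSIZE = 1536    # -blocksize value from the script
KEEP = 1488         # -keep value, also dd's bs
DD_RECORDS = 3840   # "3840+0 records in" reported by dd

# Bytes that passed through dd before the write error.
bytes_through_dd = DD_RECORDS * KEEP

# Input bytes pickout must have consumed to produce that much output
# (assumption: one dd record per pickout block).
input_consumed = HEADER + DD_RECORDS * BLOCKSIZE

# Assumption: the ~30 MB figure is the size of the input file.
APPROX_TOTAL = 30 * 10**6
fraction = input_consumed / APPROX_TOTAL

print(bytes_through_dd)    # 5713920
print(input_consumed)      # 5898752
print(round(fraction, 2))  # 0.2
```

Under those assumptions the failure happens roughly a fifth of the way
through the file, after a little under 6 megabytes of I/O.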
The first three error lines come from dd. Next comes the read error
for the next block (I don't understand why the numbers aren't
sequential; I'm looking into it). On the server, /usr/adm/messages
records several exceptions and machine check panics:
Jun 27 21:58:21 earth vmunix: trap: mcs_pcs=0x342<DATA-ADDR,UNKNOWN,MACHINE-CHECK,IO-TRAP>, info=0x0, iar=0x40d0, ics_cs=0x810048<CHKSTOP-MASK>
Jun 27 21:58:21 earth vmunix: CSR=220000ff
Jun 27 21:58:21 earth vmunix: R0 ... R3 00000280 1ffff3d4 f40876b0 e102867c
Jun 27 21:58:21 earth vmunix: R4 ... R7 cc01df02 0501b302 37016402 5a017002
Jun 27 21:58:21 earth vmunix: R8 ... Rb 3200e400 50ffd5ff 66fc47fd 7ffd36ff
Jun 27 21:58:21 earth vmunix: Rc ... Rf b1fc2cfd 2bfcdefd dcfca0fd affcfafc
Jun 27 21:58:21 earth vmunix: exception stack(1): lm r10,e1028698
Jun 27 21:58:21 earth vmunix: REP1=4ffffae SER=8<PAGE-FAULT> SEAR=b69698 TRAR=b61800 TCR=105f<TLIPT> HAT/IPT=5f
Jun 27 21:58:21 earth vmunix: panic: machine check
Jun 27 21:58:21 earth vmunix: Inited (monochrome) screen
Jun 27 21:58:21 earth vmunix: Athena 4.3BSD UNIX (ATHENA) #6-4-40: Wed Dec 13 03:10:12 EST 1989
[....]
Jun 27 22:42:58 earth vmunix: trap: mcs_pcs=0x342<DATA-ADDR,UNKNOWN,MACHINE-CHECK,IO-TRAP>, info=0x0, iar=0x40d0, ics_cs=0x810048<CHKSTOP-MASK>
Jun 27 22:42:58 earth vmunix: CSR=220000ff
Jun 27 22:42:58 earth vmunix: R0 ... R3 00000280 1ffff3d4 f40876b0 e10e467c
Jun 27 22:42:58 earth vmunix: R4 ... R7 db01f302 0901b402 4e018402 79019602
Jun 27 22:42:58 earth vmunix: R8 ... Rb 52000901 60ffe1ff 74fc51fd 84fd47ff
Jun 27 22:42:58 earth vmunix: Rc ... Rf a6fc39fd 1ffccafd e3fcb7fd a2fcf5fc
Jun 27 22:42:58 earth vmunix: exception stack(1): lm r10,e10e4698
Jun 27 22:42:58 earth vmunix: REP1=4ffffdb SER=8<PAGE-FAULT> SEAR=b69698 TRAR=b96000 TCR=105f<TLIPT> HAT/IPT=5f
Jun 27 22:42:58 earth vmunix: panic: machine check
Jun 27 22:42:58 earth vmunix: syncing disks... 7 7 5 5 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 done
Jun 27 22:42:58 earth vmunix:
Jun 27 22:42:58 earth vmunix: dumping to dev 101, offset 20814
Jun 27 22:42:58 earth vmunix: dump succeededInited (monochrome) screen
Jun 27 22:42:58 earth vmunix: Athena 4.3BSD UNIX (ATHENA) #6-4-40: Wed Dec 13 03:10:12 EST 1989
--jh--