[3325] in Release_7.7_team
ow
daemon@ATHENA.MIT.EDU (Garry Zacheiss)
Thu Jun 6 07:39:20 2002
Message-Id: <200206061139.HAA09037@w20-575-115.mit.edu>
To: ops@MIT.EDU, release-team@MIT.EDU
Date: Thu, 06 Jun 2002 07:39:17 -0400
From: Garry Zacheiss <zacheiss@MIT.EDU>
We've had some serious issues with various Athena machines this
evening. To be precise, to the best of our knowledge, all cluster O2s
panicked, as did most of the private ones we could think of (Camilla's and
my O2s both panicked, as did the one in the SIPB office, and hypodorian,
the SGI license server). Interestingly, the SGI longjobs slaves did not
panic, nor did one of the 2 SGI build machines.
Additionally, we've had many 8.4 Suns panic this evening, with
the panic string "kernel heap corruption detected". In
particular, we had all 3 SIPB cell AFS servers panic, as did many 8.4
private workstations. However, no 8.4 Suns that are ops-maintained
servers have panicked. One quasi-ops-maintained 8.2 Sun (memory-ultra)
has panicked, though.
The data we've collected so far suggests something related to
AFS. The 2 SGIs I've examined so far (via savecore and icrash) were
reactivating when they panicked, and were running "attach" as well as
other processes that touch AFS (local-netscape and local-menus on one,
config_afs on the other). The Suns we've examined didn't appear to have
any reason to be touching AFS, although it's possible they all had
system packs attached.
Additional relevant data is that kolya mentioned earlier this
evening that the panic we saw on the SIPB AFS servers reminded him of
one he's debugged on www, caused by a client traversing a symlink with a
null character in its contents. This is believable, but I'm confused
as to why we would be seeing it all at once; it either points to
corruption in some AFS volume (seems odd that it wouldn't affect 9.0
Suns, since they should still be running Transarc clients with the
relevant bug) or a network attack of some sort with similar symptoms.
Thus far, we are unaware of any 9.0 or 9.1 Suns or Linux-Athena
machines of any vintage that have crashed this evening in a way that
seems related to this problem. The AFS client doesn't seem obviously at
fault, because the AFS client in the 9.0 release for both Solaris and
Irix was built from the same source at approximately the same time.
We're also unaware of any machine having panicked twice, so it's possible
this isn't a total epidemic; however, it would be nice to have a better
idea of why it's occurring.
Camilla, Jonathon, and I spent much of this early morning
examining this, but haven't reached any firm conclusions yet; we spent
most of our time gathering the information above. It would be greatly
appreciated if someone, possibly multiple someones, made examining this
their top priority while we're sleeping.
Included below is a crash report from one of the crashed SGIs.
A crash dump could be made available as well. Crash dumps from Suns can
be found on the SIPB AFS servers. Greg and kolya should be able to
retrieve them, as should Jonathon, Camilla, and I, if they're deemed
useful. The crash dump on memory-ultra can be retrieved by Larry Stone
or any ops full timer.
We've put up an OLC motd indicating the problem, and will be
sending mail to cluster-services and cfyi shortly.
=======================
ICRASH CORE FILE REPORT
=======================
SYSTEM:
system name: IRIX
release: 6.5 (6.5.7m)
node name: w20-575-116
version: 01200533
machine name: IP32
GENERATED ON:
Thu Jun 6 05:08:54 2002
TIME OF CRASH:
1023340939 Thu Jun 6 01:22:19 2002
PANIC STRING:
PANIC: KERNEL FAULT
NAMELIST:
/var/adm/crash/unix.0 [CREATE TIME: Thu Jun 6 05:07:45 2002]
COREFILE:
/var/adm/crash/vmcore.0.comp [CREATE TIME: Thu Jun 6 05:08:03 2002]
================
COREFILE SUMMARY
================
The system was brought down due to an internal panic.
===========
PUTBUF DUMP
===========
<CE=33
<6>
Dumping to /hw/node/io/pci/1/scsi_ctlr/0/target/1/lun/0/disk/partition/1/block at block 0, space: 0x20000 pages
<6>Dumping low memory...<6>
<6>Dumping static kernel pages...<6>.<6>.<6>.<6>.ll athena.mit.edu is back up (multi-homed address; other same-host interfaces may still be down)
<6>afs: Lost contact with file server 18.7.15.68 in cell athena.mit.edu (all multi-homed ip addresses down for the server)
<6>afs: Lost contact with file server 18.7.15.68 in cell athena.mit.edu (all multi-homed ip addresses down for the server)
<6>afs: file server 18.7.15.68 in cell athena.mit.edu is back up (multi-homed address; other same-host interfaces may still be down)
<6>afs: file server 18.7.15.68 in cell athena.mit.edu is back up (multi-homed address; other same-host interfaces may still be down)
<4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
<6>afs: Waiting for busy volume 537112115 (user.chiscanu) in cell athena.mit.edu
<6>afs: failed to store file (13)
<4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
<4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
<4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
<4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
<4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
<4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
<4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
<4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
<6>afs: setting clock back 2 seconds (via 18.179.0.30 in cell athena.mit.edu).
<6>afs: setting clock back 2 seconds (via 18.179.0.30 in cell athena.mit.edu).
<6>afs: setting clock ahead 2 seconds (via 18.179.0.30 in cell athena.mit.edu).
<6>afs: setting clock ahead 2 seconds (via 18.179.0.30 in cell athena.mit.edu).
<4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
<0>PANIC: KERNEL FAULT
PC: 0x8000593c ep: 0xffffc568
EXC code:128, `Software detected SEGV '
Bad addr: 0x4d204000, cause: 0x8cˆ
===========
CPU SUMMARY
===========
CPU 0 was in kernel mode running the command 'attach'
STACK TRACE:
===============================================================================
STACK TRACE FOR UTHREAD 0x81f43400 (attach, PID=2245906):
1 dumpsys[../os/vmdump.c: 283, 0x801c6f3c]
2 syncreboot[../os/printf.c: 1557, 0x80196bac]
3 icmn_err_tag[../os/printf.c: 577, 0x80195608]
4 cmn_err[../os/printf.c: 151, 0x80194a18]
5 panicregs[../os/trap.c: 254, 0x8015d028]
6 k_trap[../os/trap.c: 560, 0x8015d140]
7 trap[../os/trap.c: 730, 0x8015d76c]
8 VEC_trap[../ml/LOCORE/vec_trap.s: 62, 0x8000accc]
r0/zero:0000000000000000 r1/at:fffffffffffffffc r2/v0:0000000000000001
r3/v1:0000000000000015 r4/a0:000000004d205449 r5/a1:ffffffff820e7cc4
r6/a2:0000000000000000 r7/a3:000000004d2054cd r8/t0:0000000000030000
r9/t1:0000000000050000 r10/t2:ffffffffb4000018 r11/t3:0000000000008604
r12/t4:00ffffffffffffff r13/t5:00ffffffffffffff r14/t6:00ffffffffffffff
r15/t7:00ffffffffffffff r16/s0:ffffffff814f1240 r17/s1:ffffffffffffccb0
r18/s2:0000000000000000 r19/s3:000000007fff2fc8 r20/s4:0000000000000021
r21/s5:0000000000000000 r22/s6:0000000000000000 r23/s7:0000000000000000
r24/t8:ffffffffb2dfabb7 r25/t9:ffffffff81ca3198 r26/k0:ffffffffffffccb0
r27/k1:ffffffffffffcb28 r28/gp:ffffffff80480218 r29/sp:ffffffffffffc6d0
r30/s8:ffffffff8196e000 r31/ra:ffffffff801ac7ac EPC:ffffffff8000593c
CAUSE=8, SR=8703, BADVADDR=4d204000
9 bcmp[../ml/usercopy.s: 747, 0x8000593c]
10 crcmp[../os/cred.c: 190, 0x801ac7a4]
11 nfs3_access[../fs/nfs/nfs3_vnops.c: 1905, 0x802e0f30]
12 nfs3lookup[../fs/nfs/nfs3_vnops.c: 2400, 0x802e1df8]
13 nfs3_lookup[../fs/nfs/nfs3_vnops.c: 2295, 0x802e1c74]
14 lookuppn[../os/lookup.c: 211, 0x801a0478]
15 lookupname[../os/lookup.c: 70, 0x801a01b8]
16 access[../os/vncalls.c: 1526, 0x80187e94]
17 syscall[../os/trap.c: 2802, 0x8015fed0]
18 systrap[../ml/LOCORE/systrap.s: 314, 0x80016168]
r0/zero:0000000000000000 r1/at:000000000fb55708 r2/v0:0000000000000409
r3/v1:000000000000005e r4/a0:000000007fff2550 r5/a1:0000000000000009
r6/a2:0000000000000002 r7/a3:000000000000002f r8/t0:0000000000000000
r9/t1:000000003cfef150 r10/t2:00000000100b09a9 r11/t3:00000000100b09a0
r12/t4:00000000000fb557 r13/t5:00000000000fb557 r14/t6:00000000000fb557
r15/t7:00000000000fb557 r16/s0:000000007fff2567 r17/s1:000000007fff2a80
r18/s2:0000000000001000 r19/s3:000000007fff2fc8 r20/s4:000000007fff2bc0
r21/s5:0000000000000000 r22/s6:0000000000000000 r23/s7:0000000000000000
r24/t8:000000003cfe0000 r25/t9:0000000000fe0000 r26/k0:0000000000000000
r27/k1:000000000012371e r28/gp:000000000fb5ac44 r29/sp:000000007fff2550
r30/s8:0000000000000000 r31/ra:000000000fa46780 EPC:000000000fa44688
CAUSE=8, SR=ffffffff84008733, BADVADDR=fb55710
===============================================================================
=======================
CRASH SUMMARY FOR CPU 0
=======================
The command 'attach' was running.
1 dumpsys[../os/vmdump.c: 283, 0x801c6f3c]
2 syncreboot[../os/printf.c: 1557, 0x80196bac]
3 icmn_err_tag[../os/printf.c: 577, 0x80195608]
4 cmn_err[../os/printf.c: 151, 0x80194a18]
5 panicregs[../os/trap.c: 254, 0x8015d028]
6 k_trap[../os/trap.c: 560, 0x8015d140]
7 trap[../os/trap.c: 730, 0x8015d76c]
8 VEC_trap[../ml/LOCORE/vec_trap.s: 62, 0x8000accc]
r0/zero:0000000000000000 r1/at:fffffffffffffffc r2/v0:0000000000000001
r3/v1:0000000000000015 r4/a0:000000004d205449 r5/a1:ffffffff820e7cc4
r6/a2:0000000000000000 r7/a3:000000004d2054cd r8/t0:0000000000030000
r9/t1:0000000000050000 r10/t2:ffffffffb4000018 r11/t3:0000000000008604
r12/t4:00ffffffffffffff r13/t5:00ffffffffffffff r14/t6:00ffffffffffffff
r15/t7:00ffffffffffffff r16/s0:ffffffff814f1240 r17/s1:ffffffffffffccb0
r18/s2:0000000000000000 r19/s3:000000007fff2fc8 r20/s4:0000000000000021
r21/s5:0000000000000000 r22/s6:0000000000000000 r23/s7:0000000000000000
r24/t8:ffffffffb2dfabb7 r25/t9:ffffffff81ca3198 r26/k0:ffffffffffffccb0
r27/k1:ffffffffffffcb28 r28/gp:ffffffff80480218 r29/sp:ffffffffffffc6d0
r30/s8:ffffffff8196e000 r31/ra:ffffffff801ac7ac EPC:ffffffff8000593c
CAUSE=8, SR=8703, BADVADDR=4d204000
9 bcmp[../ml/usercopy.s: 747, 0x8000593c]
10 crcmp[../os/cred.c: 190, 0x801ac7a4]
11 nfs3_access[../fs/nfs/nfs3_vnops.c: 1905, 0x802e0f30]
12 nfs3lookup[../fs/nfs/nfs3_vnops.c: 2400, 0x802e1df8]
13 nfs3_lookup[../fs/nfs/nfs3_vnops.c: 2295, 0x802e1c74]
14 lookuppn[../os/lookup.c: 211, 0x801a0478]
15 lookupname[../os/lookup.c: 70, 0x801a01b8]
16 access[../os/vncalls.c: 1526, 0x80187e94]
17 syscall[../os/trap.c: 2802, 0x8015fed0]
18 systrap[../ml/LOCORE/systrap.s: 314, 0x80016168]
r0/zero:0000000000000000 r1/at:000000000fb55708 r2/v0:0000000000000409
r3/v1:000000000000005e r4/a0:000000007fff2550 r5/a1:0000000000000009
r6/a2:0000000000000002 r7/a3:000000000000002f r8/t0:0000000000000000
r9/t1:000000003cfef150 r10/t2:00000000100b09a9 r11/t3:00000000100b09a0
r12/t4:00000000000fb557 r13/t5:00000000000fb557 r14/t6:00000000000fb557
r15/t7:00000000000fb557 r16/s0:000000007fff2567 r17/s1:000000007fff2a80
r18/s2:0000000000001000 r19/s3:000000007fff2fc8 r20/s4:000000007fff2bc0
r21/s5:0000000000000000 r22/s6:0000000000000000 r23/s7:0000000000000000
r24/t8:000000003cfe0000 r25/t9:0000000000fe0000 r26/k0:0000000000000000
r27/k1:000000000012371e r28/gp:000000000fb5ac44 r29/sp:000000007fff2550
r30/s8:0000000000000000 r31/ra:000000000fa46780 EPC:000000000fa44688
CAUSE=8, SR=ffffffff84008733, BADVADDR=fb55710
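
For anyone comparing several of these reports, the frame lines in the stack trace above can be pulled out mechanically. What follows is a minimal Python sketch and is not part of the original report. The regular expression and the helper names parse_frames and faulting_path are assumptions, based only on the line layout visible in this one report; other icrash versions may format frames differently. It treats the frames listed after VEC_trap as the code that was running when the fault was taken, which matches this report (the PC 0x8000593c is the bcmp frame).

```python
import re

# Assumed frame layout, taken from this report only, e.g.
#   "9 bcmp[../ml/usercopy.s: 747, 0x8000593c]"
FRAME_RE = re.compile(
    r"^\s*(\d+)\s+(\w+)\[(\S+):\s*(\d+),\s*(0x[0-9a-fA-F]+)\]\s*$"
)

def parse_frames(report_text):
    """Return (index, function, source_file, line, pc) tuples in report order."""
    frames = []
    for line in report_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            idx, func, src, lineno, pc = m.groups()
            frames.append((int(idx), func, src, int(lineno), int(pc, 16)))
    return frames

def faulting_path(frames):
    """Function names listed after the trap vector frame (hypothetical helper).

    Falls back to all frames if no VEC_trap frame is present.
    """
    names = [f[1] for f in frames]
    start = names.index("VEC_trap") + 1 if "VEC_trap" in names else 0
    return names[start:]

# A short excerpt of the trace above, used as sample input.
SAMPLE = """\
7 trap[../os/trap.c: 730, 0x8015d76c]
8 VEC_trap[../ml/LOCORE/vec_trap.s: 62, 0x8000accc]
r0/zero:0000000000000000 r1/at:fffffffffffffffc r2/v0:0000000000000001
9 bcmp[../ml/usercopy.s: 747, 0x8000593c]
10 crcmp[../os/cred.c: 190, 0x801ac7a4]
"""

if __name__ == "__main__":
    print(" <- ".join(faulting_path(parse_frames(SAMPLE))))
```

Run over the saved report from each crashed machine, this would give one call chain per machine, making it easier to see whether they all faulted along the same path.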