[3325] in Release_7.7_team


daemon@ATHENA.MIT.EDU (Garry Zacheiss)
Thu Jun 6 07:39:20 2002

Message-Id: <200206061139.HAA09037@w20-575-115.mit.edu>
To: ops@MIT.EDU, release-team@MIT.EDU
Date: Thu, 06 Jun 2002 07:39:17 -0400
From: Garry Zacheiss <zacheiss@MIT.EDU>

	We've had some serious issues with various Athena machines this
evening.  To be precise, to the best of our knowledge, all cluster O2s
panicked, as did most of the private ones we could think of (Camilla and
my O2s both panicked, as did the one in the SIPB office, and hypodorian,
the SGI license server).  Interestingly, the SGI longjobs slaves did not
panic, nor did one of the 2 SGI build machines.  

        Additionally, we've had many 8.4 Suns panic this evening as
well, with the panic string "kernel heap corruption detected".  In
particular, we had all 3 SIPB cell AFS servers panic, as did many 8.4
private workstations.  However, no 8.4 Suns that are ops-maintained
servers have panicked.  One quasi-ops maintained 8.2 Sun (memory-ultra)
has panicked, however.

	The data we've collected so far suggests something related to
AFS.  The 2 SGIs I've examined so far (via savecore and icrash) were
reactivating when they panicked, and were running "attach" as well as
other processes that touch AFS (local-netscape and local-menus on one,
config_afs on the other).  The Suns we've examined didn't appear to have
any reason to be touching AFS, although it's possible they all had
system packs attached.

        Additional relevant data is that kolya mentioned earlier this
evening that the panic we saw on the SIPB AFS servers reminded him of
one he's debugged on www, caused by a client traversing a symlink with a
null character in its contents.  This is believable, but I'm confused
as to why we would be seeing it all at once; it either points to
corruption in some AFS volume (seems odd that it wouldn't affect 9.0
Suns, since they should still be running Transarc clients with the
relevant bug) or a network attack of some sort with similar symptoms.
	
	Thus far, we are unaware of any 9.0 or 9.1 Suns or Linux-Athena
machines of any vintage that have crashed this evening in a way that
seems related to this problem.  The AFS client doesn't seem obviously at
fault, because the AFS client in the 9.0 release for both Solaris and
Irix was built from the same source at approximately the same time.
We're also unaware of any machine having panicked twice, so it's possible
this isn't a total epidemic; however, it would be nice to have a better
idea of why it's occurring.

        Camilla, Jonathon, and I spent much of this early morning
examining this, but haven't reached any firm conclusions yet; we spent
most of our time gathering the information above.  It would be greatly
appreciated if someone, possibly multiple someones, made examining this
their top priority while we're sleeping.

        Included below is a crash report from one of the crashed SGIs.
A crash dump could be made available as well.  Crash dumps from Suns can
be found on the SIPB AFS servers.  Greg and kolya should be able to
retrieve them, as should Jonathon, Camilla, and I, if they're deemed
useful.  The crash dump on memory-ultra can be retrieved by Larry Stone
or any ops full timer.

	We've put up an OLC motd indicating the problem, and will be
sending mail to cluster-services and cfyi shortly.

=======================
ICRASH CORE FILE REPORT
=======================

SYSTEM:
    system name:    IRIX
    release:        6.5 (6.5.7m)
    node name:      w20-575-116
    version:        01200533
    machine name:   IP32

GENERATED ON:
    Thu Jun  6 05:08:54 2002

TIME OF CRASH:
    1023340939 Thu Jun  6 01:22:19 2002
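
The TIME OF CRASH field gives both a raw epoch value and a formatted local time, and it can be handy to confirm the two agree when lining up crash times across machines. A minimal sketch of that check follows; it assumes the machine was reporting US Eastern Daylight Time (UTC-4), which matches the -0400 offset in the mail headers but is not stated in the report itself.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the crashed machine reported local time in US Eastern
# Daylight Time (UTC-4), matching the -0400 offset in the mail headers.
EDT = timezone(timedelta(hours=-4), "EDT")

crash_epoch = 1023340939  # "TIME OF CRASH" epoch value from the report

# Convert the epoch value to local (EDT) and UTC wall-clock times.
crash_local = datetime.fromtimestamp(crash_epoch, tz=EDT)
crash_utc = crash_local.astimezone(timezone.utc)

print(crash_local.strftime("%a %b %d %H:%M:%S %Y %Z"))
print(crash_utc.isoformat())
```

Under that assumption the local time comes out as 01:22:19 on Thursday, June 6, 2002, which agrees with the formatted value in the report.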

PANIC STRING:
    PANIC: KERNEL FAULT

NAMELIST:
    /var/adm/crash/unix.0 [CREATE TIME: Thu Jun  6 05:07:45 2002]

COREFILE:
    /var/adm/crash/vmcore.0.comp [CREATE TIME: Thu Jun  6 05:08:03 2002]

================
COREFILE SUMMARY
================

    The system was brought down due to an internal panic.

===========
PUTBUF DUMP
===========
	<CE=33
    <6>
    Dumping to /hw/node/io/pci/1/scsi_ctlr/0/target/1/lun/0/disk/partition/1/block at block 0, space: 0x20000 pages
    <6>Dumping low memory...<6>
    <6>Dumping static kernel pages...<6>.<6>.<6>.<6>.ll athena.mit.edu is back up (multi-homed address; other same-host interfaces may still be down)
    <6>afs: Lost contact with file server 18.7.15.68 in cell athena.mit.edu (all multi-homed ip addresses down for the server)
    <6>afs: Lost contact with file server 18.7.15.68 in cell athena.mit.edu (all multi-homed ip addresses down for the server)
    <6>afs: file server 18.7.15.68 in cell athena.mit.edu is back up (multi-homed address; other same-host interfaces may still be down)
    <6>afs: file server 18.7.15.68 in cell athena.mit.edu is back up (multi-homed address; other same-host interfaces may still be down)
    <4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
    <6>afs: Waiting for busy volume 537112115 (user.chiscanu) in cell athena.mit.edu
    <6>afs: failed to store file (13)
    <4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
    <4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
    <4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
    <4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
    <4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
    <4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
    <4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
    <4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
    <6>afs: setting clock back 2 seconds (via 18.179.0.30 in cell athena.mit.edu).
    <6>afs: setting clock back 2 seconds (via 18.179.0.30 in cell athena.mit.edu).
    <6>afs: setting clock ahead 2 seconds (via 18.179.0.30 in cell athena.mit.edu).
    <6>afs: setting clock ahead 2 seconds (via 18.179.0.30 in cell athena.mit.edu).
    <4>WARNING: ARP: got MAC address on ec for BCAST IP address 0.0.0.0
    
    <0>PANIC: KERNEL FAULT
    PC: 0x8000593c ep: 0xffffc568
    EXC code:128, `Software detected SEGV '
    Bad addr: 0x4d204000, cause: 0x8cˆ

===========
CPU SUMMARY
===========

  CPU 0 was in kernel mode running the command 'attach'

STACK TRACE:

===============================================================================
STACK TRACE FOR UTHREAD 0x81f43400 (attach, PID=2245906):

 1 dumpsys[../os/vmdump.c: 283, 0x801c6f3c]
 2 syncreboot[../os/printf.c: 1557, 0x80196bac]
 3 icmn_err_tag[../os/printf.c: 577, 0x80195608]
 4 cmn_err[../os/printf.c: 151, 0x80194a18]
 5 panicregs[../os/trap.c: 254, 0x8015d028]
 6 k_trap[../os/trap.c: 560, 0x8015d140]
 7 trap[../os/trap.c: 730, 0x8015d76c]
 8 VEC_trap[../ml/LOCORE/vec_trap.s: 62, 0x8000accc]
     r0/zero:0000000000000000   r1/at:fffffffffffffffc   r2/v0:0000000000000001
       r3/v1:0000000000000015   r4/a0:000000004d205449   r5/a1:ffffffff820e7cc4
       r6/a2:0000000000000000   r7/a3:000000004d2054cd   r8/t0:0000000000030000
       r9/t1:0000000000050000  r10/t2:ffffffffb4000018  r11/t3:0000000000008604
      r12/t4:00ffffffffffffff  r13/t5:00ffffffffffffff  r14/t6:00ffffffffffffff
      r15/t7:00ffffffffffffff  r16/s0:ffffffff814f1240  r17/s1:ffffffffffffccb0
      r18/s2:0000000000000000  r19/s3:000000007fff2fc8  r20/s4:0000000000000021
      r21/s5:0000000000000000  r22/s6:0000000000000000  r23/s7:0000000000000000
      r24/t8:ffffffffb2dfabb7  r25/t9:ffffffff81ca3198  r26/k0:ffffffffffffccb0
      r27/k1:ffffffffffffcb28  r28/gp:ffffffff80480218  r29/sp:ffffffffffffc6d0
      r30/s8:ffffffff8196e000  r31/ra:ffffffff801ac7ac     EPC:ffffffff8000593c
          CAUSE=8, SR=8703, BADVADDR=4d204000
 9 bcmp[../ml/usercopy.s: 747, 0x8000593c]
10 crcmp[../os/cred.c: 190, 0x801ac7a4]
11 nfs3_access[../fs/nfs/nfs3_vnops.c: 1905, 0x802e0f30]
12 nfs3lookup[../fs/nfs/nfs3_vnops.c: 2400, 0x802e1df8]
13 nfs3_lookup[../fs/nfs/nfs3_vnops.c: 2295, 0x802e1c74]
14 lookuppn[../os/lookup.c: 211, 0x801a0478]
15 lookupname[../os/lookup.c: 70, 0x801a01b8]
16 access[../os/vncalls.c: 1526, 0x80187e94]
17 syscall[../os/trap.c: 2802, 0x8015fed0]
18 systrap[../ml/LOCORE/systrap.s: 314, 0x80016168]
     r0/zero:0000000000000000   r1/at:000000000fb55708   r2/v0:0000000000000409
       r3/v1:000000000000005e   r4/a0:000000007fff2550   r5/a1:0000000000000009
       r6/a2:0000000000000002   r7/a3:000000000000002f   r8/t0:0000000000000000
       r9/t1:000000003cfef150  r10/t2:00000000100b09a9  r11/t3:00000000100b09a0
      r12/t4:00000000000fb557  r13/t5:00000000000fb557  r14/t6:00000000000fb557
      r15/t7:00000000000fb557  r16/s0:000000007fff2567  r17/s1:000000007fff2a80
      r18/s2:0000000000001000  r19/s3:000000007fff2fc8  r20/s4:000000007fff2bc0
      r21/s5:0000000000000000  r22/s6:0000000000000000  r23/s7:0000000000000000
      r24/t8:000000003cfe0000  r25/t9:0000000000fe0000  r26/k0:0000000000000000
      r27/k1:000000000012371e  r28/gp:000000000fb5ac44  r29/sp:000000007fff2550
      r30/s8:0000000000000000  r31/ra:000000000fa46780     EPC:000000000fa44688
          CAUSE=8, SR=ffffffff84008733, BADVADDR=fb55710
===============================================================================
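
For comparing traces from several crashed machines, it may help to reduce each one to a simple call chain. The sketch below is only an illustration: the regular expression is a guess at the frame format based on the lines above, not something taken from icrash documentation, and `parse_frames` is a hypothetical helper name. It uses frames 9 through 16 copied from the trace above.

```python
import re

# Frames 9-16 copied from the stack trace above.
TRACE = """\
 9 bcmp[../ml/usercopy.s: 747, 0x8000593c]
10 crcmp[../os/cred.c: 190, 0x801ac7a4]
11 nfs3_access[../fs/nfs/nfs3_vnops.c: 1905, 0x802e0f30]
12 nfs3lookup[../fs/nfs/nfs3_vnops.c: 2400, 0x802e1df8]
13 nfs3_lookup[../fs/nfs/nfs3_vnops.c: 2295, 0x802e1c74]
14 lookuppn[../os/lookup.c: 211, 0x801a0478]
15 lookupname[../os/lookup.c: 70, 0x801a01b8]
16 access[../os/vncalls.c: 1526, 0x80187e94]
"""

# Assumed frame layout: "<n> <func>[<source file>: <line>, <pc>]".
FRAME_RE = re.compile(
    r"^\s*(\d+)\s+(\w+)\[(\S+):\s*(\d+),\s*(0x[0-9a-f]+)\]\s*$"
)


def parse_frames(text):
    """Return (depth, function, source, line, pc) for each frame line."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:  # register dump lines do not match and are skipped
            depth, func, src, lineno, pc = m.groups()
            frames.append((int(depth), func, src, int(lineno), int(pc, 16)))
    return frames


frames = parse_frames(TRACE)
# The trace lists the innermost frame first, so reverse it to read
# the chain from the system call down to the faulting function.
call_chain = " -> ".join(f[1] for f in reversed(frames))
print(call_chain)
```

Read this way, the chain runs from `access` through the `nfs3_*` lookup routines into `crcmp` and `bcmp`. The source files for frames 11 through 13 are under `../fs/nfs/`, which may be worth keeping in mind when comparing this trace against the others.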

=======================
CRASH SUMMARY FOR CPU 0
=======================

 The command 'attach' was running.
 1 dumpsys[../os/vmdump.c: 283, 0x801c6f3c]
 2 syncreboot[../os/printf.c: 1557, 0x80196bac]
 3 icmn_err_tag[../os/printf.c: 577, 0x80195608]
 4 cmn_err[../os/printf.c: 151, 0x80194a18]
 5 panicregs[../os/trap.c: 254, 0x8015d028]
 6 k_trap[../os/trap.c: 560, 0x8015d140]
 7 trap[../os/trap.c: 730, 0x8015d76c]
 8 VEC_trap[../ml/LOCORE/vec_trap.s: 62, 0x8000accc]
     r0/zero:0000000000000000   r1/at:fffffffffffffffc   r2/v0:0000000000000001
       r3/v1:0000000000000015   r4/a0:000000004d205449   r5/a1:ffffffff820e7cc4
       r6/a2:0000000000000000   r7/a3:000000004d2054cd   r8/t0:0000000000030000
       r9/t1:0000000000050000  r10/t2:ffffffffb4000018  r11/t3:0000000000008604
      r12/t4:00ffffffffffffff  r13/t5:00ffffffffffffff  r14/t6:00ffffffffffffff
      r15/t7:00ffffffffffffff  r16/s0:ffffffff814f1240  r17/s1:ffffffffffffccb0
      r18/s2:0000000000000000  r19/s3:000000007fff2fc8  r20/s4:0000000000000021
      r21/s5:0000000000000000  r22/s6:0000000000000000  r23/s7:0000000000000000
      r24/t8:ffffffffb2dfabb7  r25/t9:ffffffff81ca3198  r26/k0:ffffffffffffccb0
      r27/k1:ffffffffffffcb28  r28/gp:ffffffff80480218  r29/sp:ffffffffffffc6d0
      r30/s8:ffffffff8196e000  r31/ra:ffffffff801ac7ac     EPC:ffffffff8000593c
          CAUSE=8, SR=8703, BADVADDR=4d204000
 9 bcmp[../ml/usercopy.s: 747, 0x8000593c]
10 crcmp[../os/cred.c: 190, 0x801ac7a4]
11 nfs3_access[../fs/nfs/nfs3_vnops.c: 1905, 0x802e0f30]
12 nfs3lookup[../fs/nfs/nfs3_vnops.c: 2400, 0x802e1df8]
13 nfs3_lookup[../fs/nfs/nfs3_vnops.c: 2295, 0x802e1c74]
14 lookuppn[../os/lookup.c: 211, 0x801a0478]
15 lookupname[../os/lookup.c: 70, 0x801a01b8]
16 access[../os/vncalls.c: 1526, 0x80187e94]
17 syscall[../os/trap.c: 2802, 0x8015fed0]
18 systrap[../ml/LOCORE/systrap.s: 314, 0x80016168]
     r0/zero:0000000000000000   r1/at:000000000fb55708   r2/v0:0000000000000409
       r3/v1:000000000000005e   r4/a0:000000007fff2550   r5/a1:0000000000000009
       r6/a2:0000000000000002   r7/a3:000000000000002f   r8/t0:0000000000000000
       r9/t1:000000003cfef150  r10/t2:00000000100b09a9  r11/t3:00000000100b09a0
      r12/t4:00000000000fb557  r13/t5:00000000000fb557  r14/t6:00000000000fb557
      r15/t7:00000000000fb557  r16/s0:000000007fff2567  r17/s1:000000007fff2a80
      r18/s2:0000000000001000  r19/s3:000000007fff2fc8  r20/s4:000000007fff2bc0
      r21/s5:0000000000000000  r22/s6:0000000000000000  r23/s7:0000000000000000
      r24/t8:000000003cfe0000  r25/t9:0000000000fe0000  r26/k0:0000000000000000
      r27/k1:000000000012371e  r28/gp:000000000fb5ac44  r29/sp:000000007fff2550
      r30/s8:0000000000000000  r31/ra:000000000fa46780     EPC:000000000fa44688
          CAUSE=8, SR=ffffffff84008733, BADVADDR=fb55710
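
One more check that might be worth doing on dumps like this: at the fault in `bcmp`, register a0 holds 0x4d205449, which does not look like the other kernel addresses in the dump (those begin with 0x8). The sketch below simply prints the bytes of that value as characters; `as_ascii` is a hypothetical helper, and the big-endian byte order is an assumption based on the machine being an IP32 running IRIX. Whether printable bytes here mean anything is speculation; it could be coincidence, or it could be consistent with a pointer having been overwritten by string data.

```python
def as_ascii(value, width=4):
    """Render the low `width` bytes of a register value as characters.

    Bytes outside the printable ASCII range are shown as '.'.
    Assumption: big-endian byte order, as on IP32 (MIPS) under IRIX.
    """
    raw = value.to_bytes(width, "big")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


a0 = 0x4d205449        # first argument register at the fault in bcmp
badvaddr = 0x4d204000  # faulting address reported in the panic message

print(repr(as_ascii(a0)))
print(repr(as_ascii(badvaddr)))
```

The first value decodes to the four characters `M TI`. Running the same check on the equivalent registers in the other SGI dump, and on the corrupted heap regions in the Sun dumps, would show whether this is a pattern or just noise.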




