[3036] in testers


home	help	back	first	fref	pref	prev	next	nref	lref	last	post
8.0F sun4m: portnoy panic in mutex_exit() called from afs_UFSWrite()

daemon@ATHENA.MIT.EDU (John Hawkinson)
Sat Jul 27 13:31:32 1996

Date: Sat, 27 Jul 1996 13:27:15 -0400
To: testers@MIT.EDU, afsdev@MIT.EDU
Cc: sipb-staff@MIT.EDU, bug-afs@MIT.EDU
From: John Hawkinson <jhawk@MIT.EDU>


[cc-ing bug-afs in accordance w/ jweiss' latest recommendation]

I will update portnoy to 8.0I when I finish analysis for this
crash. I haven't seen any response to the previous afs-related
crash, it'd be nice if it could be submitted to Transarc if no
one is going to do any more local analysis.

kretch once again succeeded in panicing portnoy; this time he
had just sent a zephyr to himself (his zephyr logger), and the
machine paniced.

Given that this is 8.0F, this machine was not running the new
(fix for vnode caching problem) afs module, but rather the
older one (3.4a.patches.1 derivative?).

Here's the message buffer output:

BAD TRAP: type=7 rp=f055aa04 addr=0 mmu_fsr=0 rw=0
zwgc: Memory address alignment
pid=20882, pc=0xfc411df4, sp=0xf055aa50, psr=0x44010c4, context=0
g1-g7: ffffffb8, 44010e4, 2810, 3010, 200, 1, fc23aa40
Begin traceback... sp = f055aa50
Called from fc4219e8, fp=f055ab28, args=fc50a400 31 35 0 0 fc522be0
Called from fc424324, fp=f055abc0, args=fc367f70 0 0 95 fc50a400 fc245000
Called from fc423ef8, fp=f055ac40, args=fc50a458 95 0 95 95 0
Called from fc425154, fp=f055acd0, args=2000 0 18000 f01c622c 0 1
Called from fc4230ac, fp=f055ad88, args=fc50a400 0 0 95 f4cd0000 0
Called from f00dfe88, fp=f055ade8, args=fc50a400 f055ae48 0 fc0e9680 8058 0
Called from f00772dc, fp=f055aef0, args=f055afe0 f055af50 2 0 0 95
Called from f00777ac, fp=f055af58, args=a 72d84 95 1000 f00dfb24 fc3a05a0
Called from ef5dd410, fp=effff0d8, args=568b0 f6b 1000 72d84 95 73d84
End traceback...
panic: Memory address alignment
syncing file systems... [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] done
 2753 static and sysmap kernel pages
   54 dynamic kernel data pages
  112 kernel-pageable pages
    0 segkmap kernel pages
    0 segvn kernel pages
  338 current user process pages
 3257 total pages (3257 chunks)

dumping to vp fc1e0e1c, offset 72480
WARNING: /iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd@3,0 (sd3):
	SCSI transport failed: reason 'reset': retrying command


3257 total pages, dump succeeded


This "SCSI transport failed" problem is someone disconcerting, as it
has happened the last two times (including this time) that portnoy has
crashed.  Perhaps it's indicative of a bug in the 2.4 complete_panic()
routine?  OTOH, this may be normal -- Sun BugID 1176508, an otherwise
unrelated crash report+analysis has such an error, and there's no
commentary in the analysis that suggests it's cause for worry -- the author
seems to have ignored it. Similar for Bugid 1179488. Still, it would be nice
to get verification from Sun.


Anyhow, here's the stack trace:

[portnoy!jhawk] /var/crash/portnoy# adb -k unix.6 vmcore.6
physmem 1e6b
$c
complete_panic(0xf0048bb0,0xf055a8fc,0xf055a788,0x1dbcd18,0x0,0x1) + f8
do_panic(?) + 1c
vcmn_err(0xf015f8c4,0xf055a8fc,0xf055a8fc,0x2e,0xffeec000,0x3)
cmn_err(0x3,0xf015f8c4,0xf015fc00,0x21,0x21,0xf0152400) + 1c
die(0x7,0xf055aa04,0x0,0x0,0x0,0xf015f8c4) + 78
trap(0x7,0xf055aa04,0xf017d36c,0x0,0x0,0xfc449c00) + 28c
sys_trap(?) + 1d4
mutex_exit(0xfc55aa40,0x95,0xe11d52ff,0x95,0xfc55aa48,0x1)
afs_UFSWrite(0xfc50a400,0x31,0x35,0x0,0x0,0xfc522be0) + 480
afs_ustrategy(0xfc367f70,0x0,0x0,0x95,0xfc50a400,0xfc245000) + 1a8
afs_putapage(0xfc50a458,0x95,0x0,0x95,0x95,0x0) + 208
afs_putpage(?) + 2ec
segmap_release(0x2000,0x0,0x18000,0xf01c622c,0x0,0x1)
afs_nfsrdwr(0xfc50a400,0x0,0x0,0x95,0xf4cd0000,0x0) + d3c
afs_vmwrite(0xfc50a400,0xf055ae48,0x0,0xfc0e9680,0x8058,0x0) + 6c
rw(0xf055afe0,0xf055af50,0x2,0x0,0x0,0x95) + 2f8
syscall_ap(0xa) + 6c

I don't really understand the Solaris kernel's mutex locking
primitives and semantics, so I'm walking in even murkier waters than I
usually am doing this sort of analysis.

Trap 7 is certainly "memory address allignment".

It seems peculiar that mutex_exit() doesn't have an offset -- this
implies that there was something wrong with one of the args to it as
called by afs_UFSWrite (?).

Here's the disassembly prior to the call to mutex_exit()

afs_UFSWrite+0x468/8i
afs_UFSWrite+0x468:
                call    afsio_trim
                nop
                ld      [%fp - 0x1c], %i3
                st      %i3, [%fp - 0x44]
                sethi   %hi(0xfc44a400), %o0
                or      %o0, 0x3bc, %o0          ! afs_global_lock
                call    mutex_exit
                nop

At this point, afs_UFSWrite() (in afs_vnodeops.c) is calling
AFS_GUNLOCK (line 2490), which is defined in osi.h:

#define AFS_GUNLOCK()   mutex_exit(&afs_global_lock);

So this is consistent, and mutex_exit() takes one parameter.
Shockingly enough, mutex_exit() is documented in mutex(9F)
[I summarize here]:

SYNOPSIS
     #include <sys/ksynch.h>
     void mutex_exit(kmutex_t *mp)
ARGUMENTS
     mp             Pointer to a kernel mutex lock (kmutex_t).
DESCRIPTION
     mutex_exit() releases  a  mutex  and  will  unblock  another
     thread if any are blocked on the mutex.
BUGS
     There is currently no product support for  looking  at  lock
     statistics.


There's an inconsistancy thatI don't understand, which is that mutex_exit()
is supposedly called with a pointer to afs_global_lock, however the value
of afs_global_lock and the parameter to mutex_exit() do not agree, nor does
dereferencing the 1st parameter to mutex_exit() help:

mutex_exit(0xfc55aa40,0x95,0xe11d52ff,0x95,0xfc55aa48,0x1)

0xfc55aa40/X
0xfc55aa40:     fc245000
fc245000/X
0xfc245000:     69742773

afs_global_lock=X
                fc44a7bc
afs_global_lock/X
afs_global_lock:
afs_global_lock:                0

It's possible that I'm just misunderstanding something obvious, or there maybe
stack corruption, or there may be serious problem (?). I'd check it's value in
the running kernel, but I don't know how to compute the offset of a symbol
inside a loadable module ("modinfo|grep afs" and add the address within
/kernel/fs/afs to the offset? How to find the address of afs_global_lock within
/kernel/fs/afs? ??), and I didn't have the forsight to boot kadb after the
crash.

Looking at the argument to mutex_exit (since afs_global_lock isn't
valid, anyhow, being 0):

0xfc55aa40$<mutex
0xfc55aa40:     owner
                ff848a00
0xfc55aa40:     lock    type    waiters
                0       0       95

<sys/mutex.h> clarifies that type 0 is MUTEX_ADAPTIVE ("The basic mutex type is
MUTEX_ADAPTIVE, and this is expected to be used in almost all of the kernel...
        MUTEX_ADAPTIVE = 0,     /* spin if owner is running */
"), and crash concurs:

> mutex 0xfc55aa40 
 waiters 95   lock 0
        type: MUTEX_ADAPTIVE    owner ff848a00

I'm rather unclear as to just what "owner" is. Attempting to
examine it seems to be a lost cause:

ff848a00/X
 
data address not found

so perhaps this is irrelvent. I guess it could be an index
into a table, but this seems unlikely from its value, and
given that /usr/kvm/lib/adb/mutex already does weird stuff
to obtain it:

./"owner"n;((*.&~ff)%8)|e0000000=X

It may also be pretinent that it does not show up
in the "mutextable" output from crash; crash also suggests
the address is invalid and not just swapped out (I think):

> vtop ff848a00
 VIRTUAL  PHYSICAL
ff848a00 not mapped


Perhaps I've been barking up the wrong tree, though. The actual
PC that caused the trap per the message buffer was 0xfc411df4,
or afs_UFSWrite+0x490. Disassembling again from where we left off:

afs_UFSWrite+0x480:             call    mutex_exit
afs_UFSWrite+0x484:             nop
afs_UFSWrite+0x488:             ld      [%fp - 0x58], %o0
afs_UFSWrite+0x48c:             ld      [%o0 + 0x4], %o1
afs_UFSWrite+0x490:             ld      [%o1 + 0x14], %o2
afs_UFSWrite+0x494:             ld      [%o2 + 0x58], %g1
afs_UFSWrite+0x498:             ld      [%fp - 0x58], %o3
afs_UFSWrite+0x49c:             ld      [%o3 + 0x4], %o0
afs_UFSWrite+0x4a0:             mov     0x1, %o1
afs_UFSWrite+0x4a4:             jmpl    %g1, %o7

This corresponds to:

        VOP_RWLOCK(tfile->vnode, 1);

Where VOP_RWLOCK is defined in <sys/vnode.h> as:

#define VOP_RWLOCK(vp, w) (*(vp)->v_op->vop_rwlock)(vp, w)

So that's:

	(*(tfile->vnode)->v_op->vop_rwlock)(tfile->vnode, 1)

Checking the registers at the time of the crash:

f055aa04$<regs
0xf055aa04:     psr             pc              npc
                44010c4         fc411df4        fc411df8
0xf055aa10:     y               g1              g2              g3
                0               ffffffb8        44010e4         2810
0xf055aa20:     g4              g5              g6              g7
                3010            200             1               fc23aa40
0xf055aa30:     o0              o1              o2              o3
                fc55aa40        95              e11d52ff        95
0xf055aa40:     o4              o5              o6              o7
                fc55aa48        1               f055aa50        fc411de4

It appears we loaded tfile->vnode properly (unless register windows
are confusing me and we actually lost earlier?) as 0xfc55aa40. Except
the vnode structure doesn't really look valid:

0xfc55aa40$<vnode
0xfc55aa40:     flag    refcnt          vfsmnt           vop
                0       0               3a81            VIK_WINMASK+0x1607
0xfc55aa58:     vfsp            stream          pages
                ffffffff        ffffffff        ffffffff
0xfc55aa64:     type            rdev    data
                -1              ffffffff        ffffffff
0xfc55aa70:     filocks
                ffffffff

In any case, the value of o1 is consistent with o0+4:
fc55aa40+4/X
0xfc55aa44:     95

And certainly adding 14 to that and dereferencing it would give you an
unaligned access :-).

What's going on here? I guess all I can only conclude that stack corruption
looks more likely.

At this point I'm pretty much out of ideas. Wandering up the stack
trace, one might think one could interpret the args
to afs_putapage() (int afs_putpage(vp, off, len, flags, cred)).

The vnode pointer seems corrupt, unfortunately:

afs_putapage(0xfc50a458,0x95,0x0,0x95,0x95,0x0) + 208

0xfc50a458$<vnode
0xfc50a458:     flag    refcnt          vfsmnt           vop
                0       838482084       7351            LEDPATCNT+0xd
0xfc50a470:     vfsp            stream          pages
                81800100        20000           0
0xfc50a47c:     type            rdev    data
                0               0               5192
0xfc50a488:     filocks
                f7

The other functions don't seem to have interesting arguments, and I don't know
what rw() does. syscall_ap(0xa) presumably means syscall(10), which is
SYS_unlink. Maybe not. Go figure.

Have fun.

I'm still hanging onto the two crash dumps from kretch's previous crash. I'll
hang on to this one as well. If we keep seeing crashes, I won't be able to hang
on to too many... We should be good for a while, tho:

[portnoy!jhawk] /var/crash/portnoy> df -k .
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/dsk/c0t3d0s6     307982   71598  205594    26%    /var
[portnoy!jhawk] /var/crash/portnoy> ls -ls
total 115410
   2 -rw-r--r--   1 root     other          2 Jul 27 11:47 bounds
2448 -rw-r--r--   1 root     root     1243192 Jul 20 23:13 unix.4
2496 -rw-r--r--   1 root     root     1268216 Jul 20 23:15 unix.5
2576 -rw-r--r--   1 root     root     1303524 Jul 27 11:47 unix.6
40896 -rw-r--r--   1 root     root     20914176 Jul 20 23:13 vmcore.4
40896 -rw-r--r--   1 root     root     20914176 Jul 20 23:15 vmcore.5
26096 -rw-r--r--   1 root     root     13352960 Jul 27 11:46 vmcore.6

--jhawk
  (who wonders why portnoy only crashes on weekends)

home	help	back	first	fref	pref	prev	next	nref	lref	last	post
[3036] in testers

8.0F sun4m: portnoy panic in mutex_exit() called from afs_UFSWrite()

daemon@ATHENA.MIT.EDU (John Hawkinson)Sat Jul 27 13:31:32 1996

daemon@ATHENA.MIT.EDU (John Hawkinson)
Sat Jul 27 13:31:32 1996