[3036] in testers
8.0F sun4m: portnoy panic in mutex_exit() called from afs_UFSWrite()
daemon@ATHENA.MIT.EDU (John Hawkinson)
Sat Jul 27 13:31:32 1996
Date: Sat, 27 Jul 1996 13:27:15 -0400
To: testers@MIT.EDU, afsdev@MIT.EDU
Cc: sipb-staff@MIT.EDU, bug-afs@MIT.EDU
From: John Hawkinson <jhawk@MIT.EDU>
[cc-ing bug-afs in accordance w/ jweiss' latest recommendation]
I will update portnoy to 8.0I when I finish analysis for this
crash. I haven't seen any response to the previous afs-related
crash, it'd be nice if it could be submitted to Transarc if no
one is going to do any more local analysis.
kretch once again succeeded in panicing portnoy; this time he
had just sent a zephyr to himself (his zephyr logger), and the
machine paniced.
Given that this is 8.0F, this machine was not running the new
(fix for vnode caching problem) afs module, but rather the
older one (3.4a.patches.1 derivative?).
Here's the message buffer output:
BAD TRAP: type=7 rp=f055aa04 addr=0 mmu_fsr=0 rw=0
zwgc: Memory address alignment
pid=20882, pc=0xfc411df4, sp=0xf055aa50, psr=0x44010c4, context=0
g1-g7: ffffffb8, 44010e4, 2810, 3010, 200, 1, fc23aa40
Begin traceback... sp = f055aa50
Called from fc4219e8, fp=f055ab28, args=fc50a400 31 35 0 0 fc522be0
Called from fc424324, fp=f055abc0, args=fc367f70 0 0 95 fc50a400 fc245000
Called from fc423ef8, fp=f055ac40, args=fc50a458 95 0 95 95 0
Called from fc425154, fp=f055acd0, args=2000 0 18000 f01c622c 0 1
Called from fc4230ac, fp=f055ad88, args=fc50a400 0 0 95 f4cd0000 0
Called from f00dfe88, fp=f055ade8, args=fc50a400 f055ae48 0 fc0e9680 8058 0
Called from f00772dc, fp=f055aef0, args=f055afe0 f055af50 2 0 0 95
Called from f00777ac, fp=f055af58, args=a 72d84 95 1000 f00dfb24 fc3a05a0
Called from ef5dd410, fp=effff0d8, args=568b0 f6b 1000 72d84 95 73d84
End traceback...
panic: Memory address alignment
syncing file systems... [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] [5] done
2753 static and sysmap kernel pages
54 dynamic kernel data pages
112 kernel-pageable pages
0 segkmap kernel pages
0 segvn kernel pages
338 current user process pages
3257 total pages (3257 chunks)
dumping to vp fc1e0e1c, offset 72480
WARNING: /iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd@3,0 (sd3):
SCSI transport failed: reason 'reset': retrying command
3257 total pages, dump succeeded
This "SCSI transport failed" problem is someone disconcerting, as it
has happened the last two times (including this time) that portnoy has
crashed. Perhaps it's indicative of a bug in the 2.4 complete_panic()
routine? OTOH, this may be normal -- Sun BugID 1176508, an otherwise
unrelated crash report+analysis has such an error, and there's no
commentary in the analysis that suggests it's cause for worry -- the author
seems to have ignored it. Similar for Bugid 1179488. Still, it would be nice
to get verification from Sun.
Anyhow, here's the stack trace:
[portnoy!jhawk] /var/crash/portnoy# adb -k unix.6 vmcore.6
physmem 1e6b
$c
complete_panic(0xf0048bb0,0xf055a8fc,0xf055a788,0x1dbcd18,0x0,0x1) + f8
do_panic(?) + 1c
vcmn_err(0xf015f8c4,0xf055a8fc,0xf055a8fc,0x2e,0xffeec000,0x3)
cmn_err(0x3,0xf015f8c4,0xf015fc00,0x21,0x21,0xf0152400) + 1c
die(0x7,0xf055aa04,0x0,0x0,0x0,0xf015f8c4) + 78
trap(0x7,0xf055aa04,0xf017d36c,0x0,0x0,0xfc449c00) + 28c
sys_trap(?) + 1d4
mutex_exit(0xfc55aa40,0x95,0xe11d52ff,0x95,0xfc55aa48,0x1)
afs_UFSWrite(0xfc50a400,0x31,0x35,0x0,0x0,0xfc522be0) + 480
afs_ustrategy(0xfc367f70,0x0,0x0,0x95,0xfc50a400,0xfc245000) + 1a8
afs_putapage(0xfc50a458,0x95,0x0,0x95,0x95,0x0) + 208
afs_putpage(?) + 2ec
segmap_release(0x2000,0x0,0x18000,0xf01c622c,0x0,0x1)
afs_nfsrdwr(0xfc50a400,0x0,0x0,0x95,0xf4cd0000,0x0) + d3c
afs_vmwrite(0xfc50a400,0xf055ae48,0x0,0xfc0e9680,0x8058,0x0) + 6c
rw(0xf055afe0,0xf055af50,0x2,0x0,0x0,0x95) + 2f8
syscall_ap(0xa) + 6c
I don't really understand the Solaris kernel's mutex locking
primitives and semantics, so I'm walking in even murkier waters than I
usually am doing this sort of analysis.
Trap 7 is certainly "memory address allignment".
It seems peculiar that mutex_exit() doesn't have an offset -- this
implies that there was something wrong with one of the args to it as
called by afs_UFSWrite (?).
Here's the disassembly prior to the call to mutex_exit()
afs_UFSWrite+0x468/8i
afs_UFSWrite+0x468:
call afsio_trim
nop
ld [%fp - 0x1c], %i3
st %i3, [%fp - 0x44]
sethi %hi(0xfc44a400), %o0
or %o0, 0x3bc, %o0 ! afs_global_lock
call mutex_exit
nop
At this point, afs_UFSWrite() (in afs_vnodeops.c) is calling
AFS_GUNLOCK (line 2490), which is defined in osi.h:
#define AFS_GUNLOCK() mutex_exit(&afs_global_lock);
So this is consistent, and mutex_exit() takes one parameter.
Shockingly enough, mutex_exit() is documented in mutex(9F)
[I summarize here]:
SYNOPSIS
#include <sys/ksynch.h>
void mutex_exit(kmutex_t *mp)
ARGUMENTS
mp Pointer to a kernel mutex lock (kmutex_t).
DESCRIPTION
mutex_exit() releases a mutex and will unblock another
thread if any are blocked on the mutex.
BUGS
There is currently no product support for looking at lock
statistics.
There's an inconsistancy thatI don't understand, which is that mutex_exit()
is supposedly called with a pointer to afs_global_lock, however the value
of afs_global_lock and the parameter to mutex_exit() do not agree, nor does
dereferencing the 1st parameter to mutex_exit() help:
mutex_exit(0xfc55aa40,0x95,0xe11d52ff,0x95,0xfc55aa48,0x1)
0xfc55aa40/X
0xfc55aa40: fc245000
fc245000/X
0xfc245000: 69742773
afs_global_lock=X
fc44a7bc
afs_global_lock/X
afs_global_lock:
afs_global_lock: 0
It's possible that I'm just misunderstanding something obvious, or there maybe
stack corruption, or there may be serious problem (?). I'd check it's value in
the running kernel, but I don't know how to compute the offset of a symbol
inside a loadable module ("modinfo|grep afs" and add the address within
/kernel/fs/afs to the offset? How to find the address of afs_global_lock within
/kernel/fs/afs? ??), and I didn't have the forsight to boot kadb after the
crash.
Looking at the argument to mutex_exit (since afs_global_lock isn't
valid, anyhow, being 0):
0xfc55aa40$<mutex
0xfc55aa40: owner
ff848a00
0xfc55aa40: lock type waiters
0 0 95
<sys/mutex.h> clarifies that type 0 is MUTEX_ADAPTIVE ("The basic mutex type is
MUTEX_ADAPTIVE, and this is expected to be used in almost all of the kernel...
MUTEX_ADAPTIVE = 0, /* spin if owner is running */
"), and crash concurs:
> mutex 0xfc55aa40
waiters 95 lock 0
type: MUTEX_ADAPTIVE owner ff848a00
I'm rather unclear as to just what "owner" is. Attempting to
examine it seems to be a lost cause:
ff848a00/X
data address not found
so perhaps this is irrelvent. I guess it could be an index
into a table, but this seems unlikely from its value, and
given that /usr/kvm/lib/adb/mutex already does weird stuff
to obtain it:
./"owner"n;((*.&~ff)%8)|e0000000=X
It may also be pretinent that it does not show up
in the "mutextable" output from crash; crash also suggests
the address is invalid and not just swapped out (I think):
> vtop ff848a00
VIRTUAL PHYSICAL
ff848a00 not mapped
Perhaps I've been barking up the wrong tree, though. The actual
PC that caused the trap per the message buffer was 0xfc411df4,
or afs_UFSWrite+0x490. Disassembling again from where we left off:
afs_UFSWrite+0x480: call mutex_exit
afs_UFSWrite+0x484: nop
afs_UFSWrite+0x488: ld [%fp - 0x58], %o0
afs_UFSWrite+0x48c: ld [%o0 + 0x4], %o1
afs_UFSWrite+0x490: ld [%o1 + 0x14], %o2
afs_UFSWrite+0x494: ld [%o2 + 0x58], %g1
afs_UFSWrite+0x498: ld [%fp - 0x58], %o3
afs_UFSWrite+0x49c: ld [%o3 + 0x4], %o0
afs_UFSWrite+0x4a0: mov 0x1, %o1
afs_UFSWrite+0x4a4: jmpl %g1, %o7
This corresponds to:
VOP_RWLOCK(tfile->vnode, 1);
Where VOP_RWLOCK is defined in <sys/vnode.h> as:
#define VOP_RWLOCK(vp, w) (*(vp)->v_op->vop_rwlock)(vp, w)
So that's:
(*(tfile->vnode)->v_op->vop_rwlock)(tfile->vnode, 1)
Checking the registers at the time of the crash:
f055aa04$<regs
0xf055aa04: psr pc npc
44010c4 fc411df4 fc411df8
0xf055aa10: y g1 g2 g3
0 ffffffb8 44010e4 2810
0xf055aa20: g4 g5 g6 g7
3010 200 1 fc23aa40
0xf055aa30: o0 o1 o2 o3
fc55aa40 95 e11d52ff 95
0xf055aa40: o4 o5 o6 o7
fc55aa48 1 f055aa50 fc411de4
It appears we loaded tfile->vnode properly (unless register windows
are confusing me and we actually lost earlier?) as 0xfc55aa40. Except
the vnode structure doesn't really look valid:
0xfc55aa40$<vnode
0xfc55aa40: flag refcnt vfsmnt vop
0 0 3a81 VIK_WINMASK+0x1607
0xfc55aa58: vfsp stream pages
ffffffff ffffffff ffffffff
0xfc55aa64: type rdev data
-1 ffffffff ffffffff
0xfc55aa70: filocks
ffffffff
In any case, the value of o1 is consistent with o0+4:
fc55aa40+4/X
0xfc55aa44: 95
And certainly adding 14 to that and dereferencing it would give you an
unaligned access :-).
What's going on here? I guess all I can only conclude that stack corruption
looks more likely.
At this point I'm pretty much out of ideas. Wandering up the stack
trace, one might think one could interpret the args
to afs_putapage() (int afs_putpage(vp, off, len, flags, cred)).
The vnode pointer seems corrupt, unfortunately:
afs_putapage(0xfc50a458,0x95,0x0,0x95,0x95,0x0) + 208
0xfc50a458$<vnode
0xfc50a458: flag refcnt vfsmnt vop
0 838482084 7351 LEDPATCNT+0xd
0xfc50a470: vfsp stream pages
81800100 20000 0
0xfc50a47c: type rdev data
0 0 5192
0xfc50a488: filocks
f7
The other functions don't seem to have interesting arguments, and I don't know
what rw() does. syscall_ap(0xa) presumably means syscall(10), which is
SYS_unlink. Maybe not. Go figure.
Have fun.
I'm still hanging onto the two crash dumps from kretch's previous crash. I'll
hang on to this one as well. If we keep seeing crashes, I won't be able to hang
on to too many... We should be good for a while, tho:
[portnoy!jhawk] /var/crash/portnoy> df -k .
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c0t3d0s6 307982 71598 205594 26% /var
[portnoy!jhawk] /var/crash/portnoy> ls -ls
total 115410
2 -rw-r--r-- 1 root other 2 Jul 27 11:47 bounds
2448 -rw-r--r-- 1 root root 1243192 Jul 20 23:13 unix.4
2496 -rw-r--r-- 1 root root 1268216 Jul 20 23:15 unix.5
2576 -rw-r--r-- 1 root root 1303524 Jul 27 11:47 unix.6
40896 -rw-r--r-- 1 root root 20914176 Jul 20 23:13 vmcore.4
40896 -rw-r--r-- 1 root root 20914176 Jul 20 23:15 vmcore.5
26096 -rw-r--r-- 1 root root 13352960 Jul 27 11:46 vmcore.6
--jhawk
(who wonders why portnoy only crashes on weekends)