[15598] in Athena Bugs
8.1.9 sun4: ws hanging lossage
daemon@ATHENA.MIT.EDU (John Hawkinson)
Thu Oct 23 02:18:42 1997
Date: Thu, 23 Oct 1997 02:18:38 -0400
To: bugs@MIT.EDU
From: John Hawkinson <jhawk@MIT.EDU>
w20-575-33 is a sparc 4 that's been seeing a higher-than-average
number of "sparc 4 hangs".
A couple times this week, users have reported new problems on it --
that they've tried to logout and found it doesn't work -- clicking
dash's Logout and saying Yes seems to do nothing.
Since the machine happened to have a serial console, I did some
more investigation.
It appears there are a bunch of spinning processes. Here's a dump:
w20-575-33# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 0 0 0 21:06:23 ? 0:01 sched
root 1 0 0 21:06:26 ? 0:00 /etc/init -v
root 2 0 0 21:06:26 ? 0:00 pageout
root 3 0 1 21:06:26 ? 0:29 fsflush
root 269 1 0 21:08:09 console 0:00 /usr/bin/login
root 153 1 0 21:07:41 ? 0:00 /usr/sbin/syslogd
root 400 1 0 21:24:46 ? 0:00 /etc/athena/dm /etc/athena/login/config ttyp0 console
root 104 1 0 21:07:22 ? 0:01 /etc/afsd -stat 2000 -dcache 800 -daemons 4 -volumes 70
root 103 1 0 21:07:22 ? 0:00 /etc/afsd -stat 2000 -dcache 800 -daemons 4 -volumes 70
root 105 1 0 21:07:22 ? 0:00 /etc/afsd -stat 2000 -dcache 800 -daemons 4 -volumes 70
root 106 1 0 21:07:22 ? 0:00 /etc/afsd -stat 2000 -dcache 800 -daemons 4 -volumes 70
root 107 1 0 21:07:22 ? 0:00 /etc/afsd -stat 2000 -dcache 800 -daemons 4 -volumes 70
root 108 1 0 21:07:22 ? 0:00 /etc/afsd -stat 2000 -dcache 800 -daemons 4 -volumes 70
root 109 1 0 21:07:22 ? 0:00 /etc/afsd -stat 2000 -dcache 800 -daemons 4 -volumes 70
root 110 1 0 21:07:27 ? 0:01 /etc/afsd -stat 2000 -dcache 800 -daemons 4 -volumes 70
root 132 1 0 21:07:39 ? 0:01 /usr/sbin/in.named
root 117 1 0 21:07:31 ? 0:00 /usr/sbin/rpcbind
root 260 1 0 21:08:08 ? 0:02 /usr/sbin/vold
root 137 1 0 21:07:40 ? 0:00 /usr/sbin/inetd -s
root 268 1 0 21:08:09 ? 0:00 /usr/lib/saf/sac -t 300
root 163 1 0 21:07:48 ? 0:00 /usr/sbin/cron
root 173 1 0 21:07:49 ? 0:00 /usr/lib/utmpd
root 255 1 0 21:08:07 ? 0:01 /etc/athena/zhm
root 244 1 0 21:08:05 ? 0:00 /etc/athena/inetd
daemon 249 1 0 21:08:06 ? 0:02 /etc/athena/snmpd
root 273 268 0 21:08:12 ? 0:00 /usr/lib/saf/ttymon
junekim 421 1 0 21:28:37 ttyp0 0:03 mwm
root 401 400 0 21:24:47 ? 0:26 X -fp /usr/openwin/lib/X11/fonts/75dpi,/usr/openwin/lib/X11/fonts/misc :0 -noba
junekim 429 1 0 21:28:39 ttyp0 0:01 zwgc
junekim 504 434 0 23:39:46 ttyp0 0:00 /bin/sh /usr/athena/bin/end_session
junekim 434 1 0 21:28:40 ttyp0 0:02 dash -default -logout
junekim 491 434 0 23:39:25 ttyp0 0:00 /bin/sh /usr/athena/bin/end_session
root 767 766 0 00:37:01 ? 0:00 /etc/athena/desync
junekim 443 442 0 21:28:43 ttyp0 0:00 from -v
junekim 582 434 0 23:40:39 ttyp0 0:00 xterm
root 766 163 0 00:37:00 ? 0:00 sh -c /etc/athena/desync; /usr/lib/sendmail -q
root 413 409 0 0:00 <defunct>
junekim 467 409 0 21:37:01 ttyp0 0:00 kdestroy
junekim 409 400 0 21:24:52 ttyp0 0:01 /bin/sh /etc/athena/login/Xsession 1
junekim 608 434 0 23:40:55 ttyp0 0:00 /bin/sh /usr/athena/bin/end_session
junekim 566 434 0 23:40:29 ttyp0 0:00 /bin/sh /usr/athena/bin/end_session
junekim 631 421 0 23:41:40 ? 0:00 tcsh -c xterm &
root 668 269 0 23:47:00 console 0:01 -tcsh
junekim 592 434 0 23:40:47 ? 0:00 xterm
junekim 643 592 0 23:42:03 pts/1 0:00 tcsh
root 870 869 0 00:45:00 ? 0:00 /etc/athena/desync 39600
junekim 664 631 0 23:46:53 ? 0:00 xterm
junekim 442 1 0 21:28:43 ttyp0 0:00 /bin/athena/tcsh -f /usr/athena/lib/init/xsession
junekim 514 434 0 23:39:57 ttyp0 0:00 /bin/sh /usr/athena/bin/end_session
junekim 677 664 0 23:48:28 pts/2 0:00 tcsh
root 869 163 0 00:45:00 ? 0:00 sh -c /etc/athena/desync 39600; . /etc/athena/rc.conf; /etc/athena/gettime -s $
root 908 668 0 00:53:59 console 0:02 /mit/sunsoft_v5.1/SUNWspro/bin/dbx /usr/athena/bin/kdestroy 467
root 1012 668 2 01:14:36 console 0:00 ps -ef
w20-575-33#
kdestroy certainly stands out as being bogus. dbx tells us:
(/mit/sunsoft_v5.1/SUNWspro/bin/dbx) where
=>[1] _libc_write(0x2, 0x27cc0, 0x13, 0xef6932bc, 0xef6932bc, 0xef6932bc), at 0xef639590
[2] _fwrite_unlocked(0x27cc0, 0x0, 0x13, 0x13, 0x0, 0x4), at 0xef66ecd8
[3] _dowrite(0x27cc0, 0x13, 0x3f158, 0xeffffc84, 0xef6945d4, 0xef6945d8), at 0xef6607f8
[4] _doprnt(0x27cd3, 0x0, 0x0, 0x13, 0x27cc0, 0x0), at 0xef660a48
[5] _fprintf(0x3f158, 0x27cc0, 0x27c00, 0x0, 0xeffffd64, 0x0), at 0xef66ac44
[6] main(0x1, 0xeffffe04, 0xeffffe0c, 0x3ef20, 0x1, 0x0), at 0x13528
and it's trying to fprintf(stderr, "Tickets destroyed.\n"); this
seems pretty strange. It's stuck trying to do a system call trap
and it's not returning:
(/mit/sunsoft_v5.1/SUNWspro/bin/dbx) examine $pc-10/11i
0xef639568: waitid+0x0028: or %o5, 0x178, %o5 ! 0x00054d78
0xef63956c: waitid+0x002c: add %o5, %o7, %o5
0xef639570: waitid+0x0030: mov %g1, %o7
0xef639574: waitid+0x0034: ld [%o5 + 0xdf8], %o5
0xef639578: waitid+0x0038: jmp %o5
0xef63957c: waitid+0x003c: nop
0xef639580: waitid+0x0040: jmp %o7 + 0x8
0xef639584: waitid+0x0044: nop
0xef639588: _libc_write : st %o0, [%sp + 0x44]
0xef63958c: _libc_write+0x0004: mov 0x4, %g1
0xef639590: _libc_write+0x0008: ta 0x8
w20-575-33# grep SYS_write /usr/include/sys/syscall.h
#define SYS_write 4
What's it trying to write to?:
(/mit/sunsoft_v5.1/SUNWspro/bin/dbx) examine 0x3f158/10x
0x0003f158: __iob+0x0020: 0x0000 0x0000 0xef69 0x32bc 0xef69 0x32bc 0x060
Write to stderr, which is:
fgw20-575-33# /usr/proc/bin/pfiles 467
467: kdestroy
Current rlimit: 64 file descriptors
0: S_IFCHR mode:0666 dev:32,24 ino:7618 uid:0 gid:3 rdev:13,2
O_RDONLY
1: S_IFCHR mode:0620 dev:32,24 ino:7767 uid:1021 gid:7 rdev:26,0
O_RDWR
2: S_IFCHR mode:0620 dev:32,24 ino:7767 uid:1021 gid:7 rdev:26,0
O_RDWR
4: S_IFREG mode:0664 dev:32,29 ino:26457 uid:0 gid:2 size:2528
O_RDONLY
5: S_IFREG mode:0644 dev:32,30 ino:18883 uid:0 gid:101 size:432
O_RDWR
6: S_IFREG mode:0644 dev:32,30 ino:18884 uid:0 gid:101 size:4464
O_RDWR
7: S_IFREG mode:0644 dev:32,30 ino:18884 uid:0 gid:101 size:4464
O_RDWR
w20-575-33# find /dev -follow -inum 7767 -ls
7767 0 crw--w---- 1 junekim tty 26, 0 Oct 23 00:51 /dev/ttyp0
Not sure why it wouldn't be writable.
mycroft suggested looking at /dev/ptyp0:
w20-575-33# fuser /dev/ptyp0
/dev/ptyp0: 400o
w20-575-33# ps -fp400
UID PID PPID C STIME TTY TIME CMD
root 400 1 0 21:24:46 ? 0:00 /etc/athena/dm /etc/athena/login/config ttyp0 console
Other processes interested in ttyp0 are:
w20-575-33# fuser /dev/ttyp0
/dev/ttyp0: 664o 631o 608o 592o 582o 566o 514o 504o 491o 467o 443o 442o 434o 429o 421o 409o
w20-575-33# ps -fp 664,631,608,592,582,566,514,504,491,467,443,442,434,429,421,409
UID PID PPID C STIME TTY TIME CMD
junekim 421 1 0 21:28:37 ttyp0 0:03 mwm
junekim 429 1 0 21:28:39 ttyp0 0:01 zwgc
junekim 504 434 0 23:39:46 ttyp0 0:00 /bin/sh /usr/athena/bin/end_session
junekim 434 1 0 21:28:40 ttyp0 0:02 dash -default -logout
junekim 491 434 0 23:39:25 ttyp0 0:00 /bin/sh /usr/athena/bin/end_session
junekim 443 442 0 21:28:43 ttyp0 0:00 from -v
junekim 582 434 0 23:40:39 ttyp0 0:00 xterm
junekim 467 409 0 21:37:01 ttyp0 0:00 kdestroy
junekim 409 400 0 21:24:52 ttyp0 0:01 /bin/sh /etc/athena/login/Xsession 1
junekim 608 434 0 23:40:55 ttyp0 0:00 /bin/sh /usr/athena/bin/end_session
junekim 566 434 0 23:40:29 ttyp0 0:00 /bin/sh /usr/athena/bin/end_session
junekim 631 421 0 23:41:40 ? 0:00 tcsh -c xterm &
junekim 592 434 0 23:40:47 ? 0:00 xterm
junekim 664 631 0 23:46:53 ? 0:00 xterm
junekim 442 1 0 21:28:43 ttyp0 0:00 /bin/athena/tcsh -f /usr/athena/lib/init/xsession
junekim 514 434 0 23:39:57 ttyp0 0:00 /bin/sh /usr/athena/bin/end_session
Which is pretty much all of her processes.
I looked at dm and dash, and concluded nothing; the former seemed to
be operating normally and the later was inscrutable (But did not
actually seem to be running exec()).
Dunno how to account for this lossage...
I decided to force a core dump and leave it around in case anyone
wanted to look, on the off chance that that it'd be useful, since
it is one of the problem SS4s.
kadb[0]: $q
Type 'go' to resume
ok sync
panic: zero
syncing file systems... 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 done
2583 static and sysmap kernel pages
49 dynamic kernel data pages
112 kernel-pageable pages
0 segkmap kernel pages
0 segvn kernel pages
0 current user process pages
2744 total pages (2744 chunks)
dumping to vp f5bc0e1c, offset 265304
2744 total pages, dump succeeded
rebooting...
Resetting ...
...
Go figure.
--jhawk