[15598] in Athena Bugs

home help back first fref pref prev next nref lref last post

8.1.9 sun4: ws hanging lossage

daemon@ATHENA.MIT.EDU (John Hawkinson)
Thu Oct 23 02:18:42 1997

Date: Thu, 23 Oct 1997 02:18:38 -0400
To: bugs@MIT.EDU
From: John Hawkinson <jhawk@MIT.EDU>


w20-575-33 is a sparc 4 that's been seeing a higher-than-average
number of "sparc 4 hangs".

A couple times this week, users have reported new problems on it --
that they've tried to logout and found it doesn't work -- clicking
dash's Logout and saying Yes seems to do nothing.

Since the machine happened to have a serial console, I did some
more investigation.

It appears there are a bunch of spinning processes. Here's a dump:

w20-575-33# ps -ef
     UID   PID  PPID  C    STIME TTY      TIME CMD
    root     0     0  0 21:06:23 ?        0:01 sched
    root     1     0  0 21:06:26 ?        0:00 /etc/init -v
    root     2     0  0 21:06:26 ?        0:00 pageout
    root     3     0  1 21:06:26 ?        0:29 fsflush
    root   269     1  0 21:08:09 console  0:00 /usr/bin/login
    root   153     1  0 21:07:41 ?        0:00 /usr/sbin/syslogd
    root   400     1  0 21:24:46 ?        0:00 /etc/athena/dm /etc/athena/login/config ttyp0 console
    root   104     1  0 21:07:22 ?        0:01 /etc/afsd -stat 2000 -dcache 800 -daemons 4 -volumes 70
    root   103     1  0 21:07:22 ?        0:00 /etc/afsd -stat 2000 -dcache 800 -daemons 4 -volumes 70
    root   105     1  0 21:07:22 ?        0:00 /etc/afsd -stat 2000 -dcache 800 -daemons 4 -volumes 70
    root   106     1  0 21:07:22 ?        0:00 /etc/afsd -stat 2000 -dcache 800 -daemons 4 -volumes 70
    root   107     1  0 21:07:22 ?        0:00 /etc/afsd -stat 2000 -dcache 800 -daemons 4 -volumes 70
    root   108     1  0 21:07:22 ?        0:00 /etc/afsd -stat 2000 -dcache 800 -daemons 4 -volumes 70
    root   109     1  0 21:07:22 ?        0:00 /etc/afsd -stat 2000 -dcache 800 -daemons 4 -volumes 70
    root   110     1  0 21:07:27 ?        0:01 /etc/afsd -stat 2000 -dcache 800 -daemons 4 -volumes 70
    root   132     1  0 21:07:39 ?        0:01 /usr/sbin/in.named
    root   117     1  0 21:07:31 ?        0:00 /usr/sbin/rpcbind
    root   260     1  0 21:08:08 ?        0:02 /usr/sbin/vold
    root   137     1  0 21:07:40 ?        0:00 /usr/sbin/inetd -s
    root   268     1  0 21:08:09 ?        0:00 /usr/lib/saf/sac -t 300
    root   163     1  0 21:07:48 ?        0:00 /usr/sbin/cron
    root   173     1  0 21:07:49 ?        0:00 /usr/lib/utmpd
    root   255     1  0 21:08:07 ?        0:01 /etc/athena/zhm
    root   244     1  0 21:08:05 ?        0:00 /etc/athena/inetd
  daemon   249     1  0 21:08:06 ?        0:02 /etc/athena/snmpd
    root   273   268  0 21:08:12 ?        0:00 /usr/lib/saf/ttymon
 junekim   421     1  0 21:28:37 ttyp0    0:03 mwm
    root   401   400  0 21:24:47 ?        0:26 X -fp /usr/openwin/lib/X11/fonts/75dpi,/usr/openwin/lib/X11/fonts/misc :0 -noba
 junekim   429     1  0 21:28:39 ttyp0    0:01 zwgc
 junekim   504   434  0 23:39:46 ttyp0    0:00 /bin/sh /usr/athena/bin/end_session
 junekim   434     1  0 21:28:40 ttyp0    0:02 dash -default -logout
 junekim   491   434  0 23:39:25 ttyp0    0:00 /bin/sh /usr/athena/bin/end_session
    root   767   766  0 00:37:01 ?        0:00 /etc/athena/desync
 junekim   443   442  0 21:28:43 ttyp0    0:00 from -v
 junekim   582   434  0 23:40:39 ttyp0    0:00 xterm
    root   766   163  0 00:37:00 ?        0:00 sh -c /etc/athena/desync; /usr/lib/sendmail -q
    root   413   409  0                   0:00 <defunct>
 junekim   467   409  0 21:37:01 ttyp0    0:00 kdestroy
 junekim   409   400  0 21:24:52 ttyp0    0:01 /bin/sh /etc/athena/login/Xsession 1 
 junekim   608   434  0 23:40:55 ttyp0    0:00 /bin/sh /usr/athena/bin/end_session
 junekim   566   434  0 23:40:29 ttyp0    0:00 /bin/sh /usr/athena/bin/end_session
 junekim   631   421  0 23:41:40 ?        0:00 tcsh -c xterm &
    root   668   269  0 23:47:00 console  0:01 -tcsh
 junekim   592   434  0 23:40:47 ?        0:00 xterm
 junekim   643   592  0 23:42:03 pts/1    0:00 tcsh
    root   870   869  0 00:45:00 ?        0:00 /etc/athena/desync 39600
 junekim   664   631  0 23:46:53 ?        0:00 xterm
 junekim   442     1  0 21:28:43 ttyp0    0:00 /bin/athena/tcsh -f /usr/athena/lib/init/xsession
 junekim   514   434  0 23:39:57 ttyp0    0:00 /bin/sh /usr/athena/bin/end_session
 junekim   677   664  0 23:48:28 pts/2    0:00 tcsh
    root   869   163  0 00:45:00 ?        0:00 sh -c /etc/athena/desync 39600; . /etc/athena/rc.conf; /etc/athena/gettime -s $
    root   908   668  0 00:53:59 console  0:02 /mit/sunsoft_v5.1/SUNWspro/bin/dbx /usr/athena/bin/kdestroy 467
    root  1012   668  2 01:14:36 console  0:00 ps -ef
w20-575-33# 

kdestroy certainly stands out as being bogus. dbx tells us:

(/mit/sunsoft_v5.1/SUNWspro/bin/dbx) where
=>[1] _libc_write(0x2, 0x27cc0, 0x13, 0xef6932bc, 0xef6932bc, 0xef6932bc), at 0xef639590
  [2] _fwrite_unlocked(0x27cc0, 0x0, 0x13, 0x13, 0x0, 0x4), at 0xef66ecd8
  [3] _dowrite(0x27cc0, 0x13, 0x3f158, 0xeffffc84, 0xef6945d4, 0xef6945d8), at 0xef6607f8
  [4] _doprnt(0x27cd3, 0x0, 0x0, 0x13, 0x27cc0, 0x0), at 0xef660a48
  [5] _fprintf(0x3f158, 0x27cc0, 0x27c00, 0x0, 0xeffffd64, 0x0), at 0xef66ac44
  [6] main(0x1, 0xeffffe04, 0xeffffe0c, 0x3ef20, 0x1, 0x0), at 0x13528

and it's trying to fprintf(stderr, "Tickets destroyed.\n"); this
seems pretty strange. It's stuck trying to do a system call trap
and it's not returning:

(/mit/sunsoft_v5.1/SUNWspro/bin/dbx) examine $pc-10/11i 
0xef639568: waitid+0x0028:      or      %o5, 0x178, %o5          ! 0x00054d78
0xef63956c: waitid+0x002c:      add     %o5, %o7, %o5
0xef639570: waitid+0x0030:      mov     %g1, %o7
0xef639574: waitid+0x0034:      ld      [%o5 + 0xdf8], %o5
0xef639578: waitid+0x0038:      jmp     %o5
0xef63957c: waitid+0x003c:      nop     
0xef639580: waitid+0x0040:      jmp     %o7 + 0x8
0xef639584: waitid+0x0044:      nop     
0xef639588: _libc_write       : st      %o0, [%sp + 0x44]
0xef63958c: _libc_write+0x0004: mov     0x4, %g1
0xef639590: _libc_write+0x0008: ta      0x8

w20-575-33# grep SYS_write /usr/include/sys/syscall.h
#define SYS_write       4

What's it trying to write to?:

(/mit/sunsoft_v5.1/SUNWspro/bin/dbx) examine 0x3f158/10x
0x0003f158: __iob+0x0020:        0x0000 0x0000 0xef69 0x32bc 0xef69 0x32bc 0x060

Write to stderr, which is:

fgw20-575-33# /usr/proc/bin/pfiles 467
467:    kdestroy
  Current rlimit: 64 file descriptors
   0: S_IFCHR mode:0666 dev:32,24 ino:7618 uid:0 gid:3 rdev:13,2
      O_RDONLY
   1: S_IFCHR mode:0620 dev:32,24 ino:7767 uid:1021 gid:7 rdev:26,0
      O_RDWR
   2: S_IFCHR mode:0620 dev:32,24 ino:7767 uid:1021 gid:7 rdev:26,0
      O_RDWR
   4: S_IFREG mode:0664 dev:32,29 ino:26457 uid:0 gid:2 size:2528
      O_RDONLY
   5: S_IFREG mode:0644 dev:32,30 ino:18883 uid:0 gid:101 size:432
      O_RDWR
   6: S_IFREG mode:0644 dev:32,30 ino:18884 uid:0 gid:101 size:4464
      O_RDWR
   7: S_IFREG mode:0644 dev:32,30 ino:18884 uid:0 gid:101 size:4464
      O_RDWR

w20-575-33# find /dev -follow -inum 7767 -ls
 7767    0 crw--w----  1 junekim  tty       26,   0 Oct 23 00:51 /dev/ttyp0

Not sure why it wouldn't be writable.
mycroft suggested looking at /dev/ptyp0:

w20-575-33# fuser /dev/ptyp0
/dev/ptyp0:      400o
w20-575-33# ps -fp400
     UID   PID  PPID  C    STIME TTY      TIME CMD
    root   400     1  0 21:24:46 ?        0:00 /etc/athena/dm /etc/athena/login/config ttyp0 console

Other processes interested in ttyp0 are:

w20-575-33# fuser /dev/ttyp0
/dev/ttyp0:      664o     631o     608o     592o     582o     566o     514o     504o     491o     467o     443o     442o     434o     429o     421o     409o
w20-575-33# ps -fp 664,631,608,592,582,566,514,504,491,467,443,442,434,429,421,409
     UID   PID  PPID  C    STIME TTY      TIME CMD
 junekim   421     1  0 21:28:37 ttyp0    0:03 mwm
 junekim   429     1  0 21:28:39 ttyp0    0:01 zwgc
 junekim   504   434  0 23:39:46 ttyp0    0:00 /bin/sh /usr/athena/bin/end_session
 junekim   434     1  0 21:28:40 ttyp0    0:02 dash -default -logout
 junekim   491   434  0 23:39:25 ttyp0    0:00 /bin/sh /usr/athena/bin/end_session
 junekim   443   442  0 21:28:43 ttyp0    0:00 from -v
 junekim   582   434  0 23:40:39 ttyp0    0:00 xterm
 junekim   467   409  0 21:37:01 ttyp0    0:00 kdestroy
 junekim   409   400  0 21:24:52 ttyp0    0:01 /bin/sh /etc/athena/login/Xsession 1 
 junekim   608   434  0 23:40:55 ttyp0    0:00 /bin/sh /usr/athena/bin/end_session
 junekim   566   434  0 23:40:29 ttyp0    0:00 /bin/sh /usr/athena/bin/end_session
 junekim   631   421  0 23:41:40 ?        0:00 tcsh -c xterm &
 junekim   592   434  0 23:40:47 ?        0:00 xterm
 junekim   664   631  0 23:46:53 ?        0:00 xterm
 junekim   442     1  0 21:28:43 ttyp0    0:00 /bin/athena/tcsh -f /usr/athena/lib/init/xsession
 junekim   514   434  0 23:39:57 ttyp0    0:00 /bin/sh /usr/athena/bin/end_session

Which is pretty much all of her processes.

I looked at dm and dash, and concluded nothing; the former seemed to
be operating normally and the later was inscrutable (But did not
actually seem to be running exec()).

Dunno how to account for this lossage...

I decided to force a core dump and leave it around in case anyone
wanted to look, on the off chance that that it'd be useful, since
it is one of the problem SS4s.

kadb[0]: $q
Type  'go' to resume
ok sync
panic: zero
syncing file systems... 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 done
 2583 static and sysmap kernel pages
   49 dynamic kernel data pages
  112 kernel-pageable pages
    0 segkmap kernel pages
    0 segvn kernel pages
    0 current user process pages
 2744 total pages (2744 chunks)

dumping to vp f5bc0e1c, offset 265304
2744 total pages, dump succeeded
rebooting...
Resetting ... 
...

Go figure.

--jhawk

home help back first fref pref prev next nref lref last post