[3014] in testers

8.0C: savecore lossage

daemon@ATHENA.MIT.EDU (John Hawkinson)
Sun Jul 21 00:05:44 1996

Date: Sun, 21 Jul 1996 00:05:29 -0400
To: testers@MIT.EDU
From: John Hawkinson <jhawk@MIT.EDU>


Saving crash dumps doesn't seem to work quite right.

I booted portnoy single user after a crash and edited
/etc/init.d/sysetup to uncomment the crash dump stuff at the end:

##
## Default is to not do a savecore
##
if [ ! -d /var/crash/`uname -n` ]
then mkdir -p /var/crash/`uname -n`
fi
                echo 'checking for crash dump...\c '
savecore /var/crash/`uname -n`
                echo ''

I then started to boot multiuser. Savecore ran and copied
the dump from the swap area and then produced 3 errors:

savecore: Warning: can't find 'ufs' in module path: /kernel /usr/kernel
savecore: Warning: can't find 'ip' in module path: /kernel /usr/kernel
savecore: Warning: can't find 'udp' in module path: /kernel /usr/kernel

At first I thought this was because those kernel modules were in /afs,
hence my other bug report about loading afs by hand. It turns out
this wasn't true. Actually, they're in /kernel/drv and /kernel/fs.

Once those errors showed up, I immediately hit L1-A and booted single
user again, fscked, and tried to re-run savecore with -v (documented
option for verbosity). This was unsuccessful, as savecore reported
it had already been run.

The BSD savecore has traditionally allowed you to
override this behavior. On the assumption that the Solaris
savecore was the same, I stringsed it:

[portnoy!jhawk] /kernel> strings /bin/savecore | head -2
vdf:
%s: %m

Sure enough, the first returned string looks like a getopt
string. Using -vd caused savecore to ignore the fact
that a dump was already extracted and try again; unfortunately
it overwrote /var/crash/portnoy/vmcore.4. I'm not sure if it
did this because the original savecore did not properly
update the bounds file (perhaps because of the warnings?), or
because -d decides to ignore the bounds file somehow (then why didn't
it write .0?). I would like someone to:

	1.	Check the Solaris sources and clarify just
		what -d does.
	2.	Complain to Sun (low priority) that -d is not documented.
		Perhaps the most effective way would be to send them
		a patch to savecore.1m; I'll be happy to write up reasonable
		wording if someone else does (1) [doing (1) is hard for me :-)]
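
For anyone unfamiliar with getopt strings, here is a minimal sketch of
what "vdf:" would mean if it really is one: -v and -d are plain flags,
and the colon says -f takes an argument. The function name parse_opts
and its variables are made up for illustration; this says nothing about
what savecore actually does with -d or -f.

```shell
# Hypothetical helper (not part of savecore): parse arguments the way a
# "vdf:" getopt string implies.
parse_opts() {
    verbose=0
    dflag=0
    farg=
    OPTIND=1                      # reset so the function can be called again
    while getopts "vdf:" opt "$@"; do
        case "$opt" in
            v) verbose=1 ;;       # flag, no argument
            d) dflag=1 ;;         # flag, no argument
            f) farg=$OPTARG ;;    # the trailing colon means -f takes a value
            *) return 1 ;;        # unknown option
        esac
    done
    shift $((OPTIND - 1))         # whatever is left is the operand list
    echo "verbose=$verbose d=$dflag f=$farg rest=$*"
}
```

Calling it as "parse_opts -vd /var/crash/portnoy" shows both flags set
and the directory left over as the only operand.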


Anyhow, further investigation showed that the problem seemed to be that
savecore was seeing references to kernel modules as "ufs", "ip", and "udp",
and it was searching for those modules in /kernel and /usr/kernel
(the module path given in the warnings above).
The verbose output produced by -vd indicates that most kernel modules
loaded are referenced by relative paths including the subdirs:

# savecore -vd /var/crash/portnoy
System went down at Sat Jul 20 22:53:39 1996
Saving 5103 pages of image in vmcore.0
  5103 pages saved.
Modules loaded at the time of crash:
        /kernel/unix            fs/specfs               misc/swapgeneric    
        sched/TS                sched/TS_DPTBL          ufs                 
        drv/rootnex             drv/options             drv/dma             
        drv/sbus                drv/iommu               drv/sad             
        drv/pseudo              drv/sd                  misc/scsi           
        drv/esp                 fs/procfs               sys/c2audit         
        misc/strplumb           drv/clone               ip                  
        drv/tcp                 udp                     drv/icmp            

So I think the bug here is that somehow those modules were loaded
with a different search path than that used by savecore.
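
To pick the odd ones out of a listing like the one above without
eyeballing it, something along these lines should do. It is only a
sketch: the function name bare_modules is made up, and it assumes the
listing follows a line starting with "Modules loaded" and that module
names are separated by whitespace, as in the output I got.

```shell
# Hypothetical filter: read "savecore -v" style output on stdin and print
# every module name that has no directory component.
bare_modules() {
    awk '
        /^Modules loaded/ { in_list = 1; next }   # listing starts after this line
        in_list {
            for (i = 1; i <= NF; i++)
                if ($i !~ /\//)                   # no slash: bare name
                    print $i
        }
    '
}
```

Feeding it the saved output (for example "bare_modules < savecore.out",
where savecore.out is whatever file you captured the output in) prints
one bare name per line.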

The obvious workaround (which I used) was to make symlinks:

# pwd
/kernel
# ls -l ip udp ufs
lrwxrwxrwx   1 root     root           6 Jul 20 23:14 ip -> drv/ip
lrwxrwxrwx   1 root     root           7 Jul 20 23:14 udp -> drv/udp
lrwxrwxrwx   1 root     root           6 Jul 20 23:14 ufs -> fs/ufs
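
The same workaround as a small sketch, in case someone needs to repeat
it on another machine. The function name link_bare_module is made up,
and it only looks in the drv and fs subdirectories, since that is where
the three modules lived here; other systems may need more. Try it on a
scratch copy before pointing it at a real /kernel.

```shell
# Hypothetical helper: make ROOT/NAME a relative symlink to the module
# found under ROOT/drv or ROOT/fs, mirroring the links shown above.
link_bare_module() {
    root=$1
    name=$2
    if [ -e "$root/$name" ]; then
        return 0                          # already resolvable, leave it alone
    fi
    for sub in drv fs; do
        if [ -f "$root/$sub/$name" ]; then
            ln -s "$sub/$name" "$root/$name"   # relative link, e.g. ip -> drv/ip
            return 0
        fi
    done
    echo "no $name under $root/drv or $root/fs" >&2
    return 1
}
```

For example, "link_bare_module /kernel ip" would create the ip -> drv/ip
link listed above if it is not already there.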

I think the correct fix is to have the modules loaded with the full
path relative to /kernel. I'm not sure how to accomplish this --
kernel(1m) suggests /etc/system might be pertinent, but this doesn't
seem to actually be the case (though perhaps a better workaround could
be installed there).

The output of "sysdef" seems instructive. The Loadable Objects
section begins:

*
* Loadable Objects
*
unix
ufs
ip
udp
strmod/arp
drv/arp
drv/arp
...

which seems telling; something is wrong with /kernel/unix.

Further staring at Intro(9s) is not very helpful, so
I'll stop here. Hopefully someone can figure this out.

--jhawk
