[2352] in Release_7.7_team

LARGE 8.4 os verification bug

daemon@ATHENA.MIT.EDU (Garry Zacheiss)
Sun Jul 16 18:00:59 2000

Message-Id: <200007162200.SAA29623@mary-kay-commandos.mit.edu>
To: miki@MIT.EDU, bugs@MIT.EDU
cc: cluster-services@MIT.EDU, release-team@MIT.EDU
Date: Sun, 16 Jul 2000 18:00:43 -0400
From: Garry Zacheiss <zacheiss@MIT.EDU>

	In recent hotline transactions, there've been a few veiled
references to a "single user error" that required reinstalling the
workstation (see hotline[51387] and hotline[51365], for example).
Yesterday, while reading through hotline, I noticed the reports and
realized they all occurred on machines running 8.4; at around the same
time, the problem occurred on w20-575-28, and I decided to investigate.

	What I found is that the OS verification on public 8.4 Suns is
clobbering /etc/name_to_major with a broken version, leaving the machine
in an unbootable state.  Specifically, /etc/name_to_major is an
architecture-specific file, and differs between sun4u and sun4m
machines.  The entry that's biting us is the one for the "sysmsg"
device.  On a sun4m machine running 8.4:

mary-kay-commandos.mit.edu# grep sysmsg /etc/name_to_major
sysmsg 9
mary-kay-commandos.mit.edu# ls -l /devices/pseudo/sys*
crw-------   1 root     sys        9,  1 Jul 11 16:16 /devices/pseudo/sysmsg@0:msglog
crw-------   1 root     sys        9,  0 Jul 16 17:53 /devices/pseudo/sysmsg@0:sysmsg

On a sun4u machine:

contents-vnder-pressvre.mit.edu# grep sysmsg /etc/name_to_major
sysmsg 31
contents-vnder-pressvre.mit.edu# ls -l /devices/pseudo/sys*
crw-------   1 root     sys       31,  1 Jul  5 14:13 /devices/pseudo/sysmsg@0:msglog
crw-------   1 root     sys       31,  0 Jun 30 19:34 /devices/pseudo/sysmsg@0:sysmsg

The AFS copy of name_to_major in the OS pack is appropriate to neither
platform:

[zacheiss@mary-kay-commandos] /afs/athena.mit.edu/system/sun4x_57/os/etc> grep sysmsg name_to_major
sysmsg 97

       On PUBLIC=true machines, the AFS copy is getting copied to local
disk, with the result that the machine fails to boot on the next
reboot, because the major number of the device nodes doesn't match the
sysmsg entry in /etc/name_to_major.
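
       The mismatch described above can be checked mechanically.  What
follows is only a minimal sketch, not part of the release: the function
names (major_for, node_major, check_major) are made up for illustration,
and it assumes "ls -l" prints the major number of a character device as
a field followed by a comma, as in the transcripts above.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names exist in the Athena release.

# Print the major number a name_to_major-style file assigns to a driver.
# Usage: major_for DRIVER FILE
major_for() {
    awk -v drv="$1" '$1 == drv { print $2; exit }' "$2"
}

# Print the major number from one line of "ls -l" output for a device
# node, e.g. "crw-------   1 root     sys        9,  1 Jul 11 ...".
# Assumption: the major is the first field that starts with digits
# followed by a comma.
# Usage: node_major LS_LINE
node_major() {
    printf '%s\n' "$1" | awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+,/) { sub(/,.*/, "", $i); print $i; exit }
    }'
}

# Succeed (exit status 0) only if FILE has an entry for DRIVER and it
# equals the major number seen on the device node.
# Usage: check_major DRIVER FILE LS_LINE
check_major() {
    want=$(major_for "$1" "$2")
    have=$(node_major "$3")
    [ -n "$want" ] && [ "$want" = "$have" ]
}
```

On a workstation this might be called as
check_major sysmsg /etc/name_to_major "$(ls -l /devices/pseudo/sysmsg@0:sysmsg)"
with a nonzero exit status meaning the file and the device node disagree.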

       The correct short-term answer would seem to be to add
name_to_major to the exceptions file for the OS verification, and in
the long term to look into some way of correctly verifying this file.
Since this problem pretty thoroughly clobbers PUBLIC machines running
8.4, a fix for it should go out before we go public.
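
       To illustrate the idea of an exception, here is a rough sketch of
a verification step that skips listed files.  It is not the actual
verification code: the function names and the one-path-per-line
exceptions format are assumptions made for the example, and it only
reports what it finds rather than copying anything.

```shell
#!/bin/sh
# Hypothetical sketch; the real exceptions file format may differ.

# Succeed if RELPATH appears as a whole line in the exceptions list.
# Assumed format: one path per line, relative to the OS root.
# Usage: is_excepted RELPATH EXCEPTIONS_FILE
is_excepted() {
    grep -Fqx -- "$1" "$2"
}

# Report what a verification pass would do with one file:
#   "skip PATH"    if the file is listed as an exception,
#   "ok PATH"      if the local copy matches the reference copy,
#   "differs PATH" otherwise.
# Usage: verify_one RELPATH REFERENCE_ROOT LOCAL_ROOT EXCEPTIONS_FILE
verify_one() {
    if is_excepted "$1" "$4"; then
        echo "skip $1"
    elif cmp -s "$2/$1" "$3/$1"; then
        echo "ok $1"
    else
        echo "differs $1"
    fi
}
```

With etc/name_to_major in the list, the file would be reported as
skipped and left alone even though the local and AFS copies differ.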

       cluster-services: If you're encountering a large number of
machines that are failing to boot in the same way while we're in a
release cycle, we'd all appreciate it if you verified whether or not
the machines were running the new release, and reported the problem.
Had we gone public with this problem still present in the release, it
would've been very bad.

Garry

