[2353] in Release_7.7_team

Re: LARGE 8.4 os verification bug

daemon@ATHENA.MIT.EDU (Kyle Pope)
Mon Jul 17 09:31:02 2000

Message-Id: <200007171330.JAA24143@melbourne-city-street.MIT.EDU>
Date: Mon, 17 Jul 2000 09:32:55 -0400
To: Garry Zacheiss <zacheiss@mit.edu>, miki@mit.edu, bugs@mit.edu
From: Kyle Pope <ndpope@MIT.EDU>
Cc: cluster-services@mit.edu, release-team@mit.edu
In-Reply-To: <200007162200.SAA29623@mary-kay-commandos.mit.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"

Garry, 
	We did not verify which version of the release was running
on the machines we reinstalled, but we have had about 30 instances
of the "single-user" mode errors.  The greatest concentration was in the
Bldg 37 cluster.  Systems 21-40 were all down last week.  All machines were
Suns - Sparc5, Ultra5, Ultra10.

	If we know what the Bldg 37 cluster was running, that would answer 
the question of what release is affected.

			Kyle Pope
			ndpope@mit.edu
			MIT Hardware Services
			hardserv@mit.edu

At 06:00 PM 7/16/2000 -0400, Garry Zacheiss wrote:
>	In recent hotline transactions, there've been a few veiled
>references to a "single user error" that required reinstalling the
>workstation (see hotline[51387] and hotline[51365], for examples).
>Yesterday, I was reading through hotline and noticed the reports, and
>realized they all occurred on machines running 8.4; at around the same
>time, the problem occurred on w20-575-28, and I decided to investigate.
>
>	What I found is that the OS verification on public 8.4 Suns is
>clobbering /etc/name_to_major with a broken version, leaving the machine
>in an unbootable state.  Specifically, /etc/name_to_major is an
>architecture specific file, and differs between sun4u and sun4m
>machines.  The entry in question that's biting us is the "sysmsg"
>device.  On a sun4m machine running 8.4:
>
>mary-kay-commandos.mit.edu# grep sysmsg /etc/name_to_major
>sysmsg 9
>mary-kay-commandos.mit.edu# ls -l /devices/pseudo/sys*
>crw-------   1 root     sys        9,  1 Jul 11 16:16 /devices/pseudo/sysmsg@0:msglog
>crw-------   1 root     sys        9,  0 Jul 16 17:53 /devices/pseudo/sysmsg@0:sysmsg
>
>On a sun4u machine:
>
>contents-vnder-pressvre.mit.edu# grep sysmsg /etc/name_to_major
>sysmsg 31
>contents-vnder-pressvre.mit.edu# ls -l /devices/pseudo/sys*
>crw-------   1 root     sys       31,  1 Jul  5 14:13 /devices/pseudo/sysmsg@0:msglog
>crw-------   1 root     sys       31,  0 Jun 30 19:34 /devices/pseudo/sysmsg@0:sysmsg
>
>The AFS copy of name_to_major in the OS pack is appropriate to neither
>platform:
>
>[zacheiss@mary-kay-commandos] /afs/athena.mit.edu/system/sun4x_57/os/etc> grep sysmsg name_to_major
>sysmsg 97
>
>       On PUBLIC=true machines, the AFS copy is getting copied to local
>disk, with the result that the machine fails to boot on the next
>reboot, because the major number of the device nodes doesn't match the
>sysmsg entry in /etc/name_to_major.
>
>       The correct short term answer to this would seem to be to add
>name_to_major to the exceptions file for the OS verifications, and in
>the long term look into some way of correctly verifying this file.
>Since this problem pretty thoroughly clobbers PUBLIC machines running
>8.4, a fix to this should go out before we go public.
>
>       cluster-services: If you're encountering a large number of
>machines that are failing to boot in the same way while we're in a
>release cycle, we'd all appreciate if you verified whether or not the
>machines were running the new release, and reported the problem.  Had we
>gone public with this problem still present in the release, it would've
>been very bad.
>
>Garry
> 
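
The mismatch Garry describes (the sysmsg major number in name_to_major
disagreeing with the major number of the existing device nodes) could be
checked for with something like the sketch below.  This is only an
illustration, not part of the release's OS verification: the function
names are hypothetical, the "driver major" line format is assumed from
the grep output above, and the comment handling is a precaution rather
than a documented feature of the file.  The sample values are the ones
shown in the transcripts.

```python
import os
import stat


def parse_name_to_major(text):
    """Parse name_to_major-style text ("driver major" per line) into a dict."""
    table = {}
    for line in text.splitlines():
        # Assumption: tolerate blank lines and '#' comments if they appear.
        line = line.split("#", 1)[0].strip()
        parts = line.split()
        if len(parts) < 2:
            continue
        table[parts[0]] = int(parts[1])
    return table


def node_major(path):
    """Return the major number of an existing character or block device node."""
    st = os.stat(path)
    if not (stat.S_ISCHR(st.st_mode) or stat.S_ISBLK(st.st_mode)):
        raise ValueError(f"{path} is not a device node")
    return os.major(st.st_rdev)


def find_mismatches(table, observed):
    """Return {driver: (major in table, major on disk)} where the two differ."""
    return {
        name: (table.get(name), major)
        for name, major in observed.items()
        if table.get(name) != major
    }


# Values from the transcripts: the sun4m device nodes use major 9,
# while the AFS copy of name_to_major lists sysmsg as 97.
afs_copy = parse_name_to_major("sysmsg 97\n")
local_copy = parse_name_to_major("sysmsg 9\n")
observed_sun4m = {"sysmsg": 9}

print(find_mismatches(afs_copy, observed_sun4m))    # {'sysmsg': (97, 9)}
print(find_mismatches(local_copy, observed_sun4m))  # {}
```

On a live machine, the observed table could be built by calling
node_major() on a node such as /devices/pseudo/sysmsg@0:sysmsg before
the candidate file is copied into place.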