[2353] in Release_7.7_team
Re: LARGE 8.4 os verification bug
daemon@ATHENA.MIT.EDU (Kyle Pope)
Mon Jul 17 09:31:02 2000
Message-Id: <200007171330.JAA24143@melbourne-city-street.MIT.EDU>
Date: Mon, 17 Jul 2000 09:32:55 -0400
To: Garry Zacheiss <zacheiss@mit.edu>, miki@mit.edu, bugs@mit.edu
From: Kyle Pope <ndpope@MIT.EDU>
Cc: cluster-services@mit.edu, release-team@mit.edu
In-Reply-To: <200007162200.SAA29623@mary-kay-commandos.mit.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Garry,
We had not verified which version of the release was running
on the machines we have reinstalled, but we have had about 30 instances
of the "single-user" mode errors. The greatest concentration was in the
Bldg 37 cluster. Systems 21-40 were all down last week. All machines were
Suns - Sparc5, Ultra5, Ultra10.
If we know what the Bldg 37 cluster was running, that would answer
the question of what release is affected.
Kyle Pope
ndpope@mit.edu
MIT Hardware Services
hardserv@mit.edu
At 06:00 PM 7/16/2000 -0400, Garry Zacheiss wrote:
> In recent hotline transactions, there've been a few veiled
>references to a "single user error" that required reinstalling the
>workstation (see hotline[51387] and hotline[51365], for examples).
>Yesterday, I was reading through hotline and noticed the reports, and
>realized they all occurred on machines running 8.4; at around the same
>time, the problem occurred on w20-575-28, and I decided to investigate.
>
> What I found is that the OS verification on public 8.4 Suns is
>clobbering /etc/name_to_major with a broken version, leaving the machine
>in an unbootable state. Specifically, /etc/name_to_major is an
>architecture specific file, and differs between sun4u and sun4m
>machines. The entry in question that's biting us is the "sysmsg"
>device. On a sun4m machine running 8.4:
>
>mary-kay-commandos.mit.edu# grep sysmsg /etc/name_to_major
>sysmsg 9
>mary-kay-commandos.mit.edu# ls -l /devices/pseudo/sys*
>crw------- 1 root sys 9, 1 Jul 11 16:16 /devices/pseudo/sysmsg@0:msglog
>crw------- 1 root sys 9, 0 Jul 16 17:53 /devices/pseudo/sysmsg@0:sysmsg
>
>On a sun4u machine:
>
>contents-vnder-pressvre.mit.edu# grep sysmsg /etc/name_to_major
>sysmsg 31
>contents-vnder-pressvre.mit.edu# ls -l /devices/pseudo/sys*
>crw------- 1 root sys 31, 1 Jul 5 14:13 /devices/pseudo/sysmsg@0:msglog
>crw------- 1 root sys 31, 0 Jun 30 19:34 /devices/pseudo/sysmsg@0:sysmsg
>
>The AFS copy of name_to_major in the OS pack is appropriate to neither
>platform:
>
>[zacheiss@mary-kay-commandos] /afs/athena.mit.edu/system/sun4x_57/os/etc> grep sysmsg name_to_major
>sysmsg 97
>
> On PUBLIC=true machines, the AFS copy is getting copied to local
>disk, with the result that the machine fails to boot on the next
>reboot, because the major number of the device nodes doesn't match the
>sysmsg entry in /etc/name_to_major.
>
> The correct short term answer to this would seem to be to add
>name_to_major to the exceptions file for the OS verifications, and in
>the long term look into some way of correctly verifying this file.
>Since this problem pretty thoroughly clobbers PUBLIC machines running
>8.4, a fix to this should go out before we go public.
>
> cluster-services: If you're encountering a large number of
>machines that are failing to boot in the same way while we're in a
>release cycle, we'd all appreciate it if you verified whether or not the
>machines were running the new release, and reported the problem. Had we
>gone public with this problem still present in the release, it would've
>been very bad.
>
>Garry
>
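
The mismatch Garry describes (the major number on the sysmsg device nodes disagreeing with the sysmsg entry in /etc/name_to_major) could be checked with something like the sketch below. This is only an illustration, not part of the release or of the OS verification: the function names are hypothetical, and it assumes name_to_major holds one whitespace-separated "driver major" pair per line, as in the grep output quoted above.

```python
import os
import stat


def parse_name_to_major(text):
    """Parse 'driver major' lines into a dict (assumed file layout)."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank or malformed lines rather than guessing at them.
        if len(fields) >= 2 and fields[1].isdigit():
            table[fields[0]] = int(fields[1])
    return table


def node_major(path):
    """Return the major number of an existing device node."""
    st = os.stat(path)
    if not (stat.S_ISCHR(st.st_mode) or stat.S_ISBLK(st.st_mode)):
        raise ValueError(path + " is not a device node")
    return os.major(st.st_rdev)


def find_mismatches(table, nodes):
    """Compare observed majors against the parsed table.

    nodes maps a device node path to (driver name, observed major).
    Returns (path, driver, observed, expected) for each disagreement;
    expected is None when the driver has no entry in the table.
    """
    bad = []
    for path, (driver, observed) in sorted(nodes.items()):
        expected = table.get(driver)
        if expected != observed:
            bad.append((path, driver, observed, expected))
    return bad


if __name__ == "__main__":
    # Values taken from the sun4m transcript and the AFS copy quoted above.
    afs_copy = parse_name_to_major("sysmsg 97\n")
    sun4m_nodes = {
        "/devices/pseudo/sysmsg@0:msglog": ("sysmsg", 9),
        "/devices/pseudo/sysmsg@0:sysmsg": ("sysmsg", 9),
    }
    for path, driver, observed, expected in find_mismatches(afs_copy, sun4m_nodes):
        print(path, driver, "node major", observed, "table says", expected)
```

On a live machine, the observed majors would come from node_major() on the nodes under /devices/pseudo instead of the hard-coded values. Run against the AFS copy with the sun4m nodes, both sysmsg nodes are flagged (9 on disk versus 97 in the table), which is the condition that leaves the machine unbootable.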