[3857] in testers
cluster walk
daemon@ATHENA.MIT.EDU (Jonathon Weiss)
Thu Jul 30 02:43:12 1998
From: Jonathon Weiss <jweiss@MIT.EDU>
To: testers@MIT.EDU, cluster-services@MIT.EDU
Date: Thu, 30 Jul 1998 02:43:03 EDT
Tonight I did a cluster walk with Lou, mwhitson, and mvsilis.
I started with a list of machines that were not running 8.2 according
to snmp (or, in the case of Indys, were not running IRIX 6.2).
We found several classes of problems:
Problem #0:
Symptoms:
Machine is still running 8.1, but has been logged into (either
remotely or locally) fairly constantly since the release.
Suspected cause:
Machines that are in use will not take the update, and won't even
start their 4 hour desynchronization.
Solution:
The machine will update itself after it is idle for long enough.
What it affected:
We ran into 5 or 10 machines (mix of sun and SGI) in this state.
Problem #1:
Symptoms:
Machine is still running 8.1. When you log in, you get a message in
the console that says there isn't enough space on the root partition
(less than 5M free) to do the update. The machine has the 8.2 packs
attached, so there may be some other errors in the console.
Suspected cause:
Stupid software, or stupid users leaving garbage on the root disk.
Solution:
Look for things that can be removed. Some good things to check are
/.mosaic* /.netscape /Mail and any actual files (i.e., not symlinks or
directories) in /dev. Note that some private workstation owners may
care about some of these things, so you probably want to give them
the job of finding the files to remove.
What it affected:
We found about 5 to 10 sun machines in this state.
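
For anyone who wants to script the problem #1 check, here is a rough
Python sketch. It is illustrative only and not something we ran on the
walk. The names free_bytes, regular_files and MIN_FREE_BYTES are made
up for the example, and using the unprivileged free-space figure
(f_bavail) is an assumption; the update's own space check may count
differently. It only reports candidates and never deletes anything,
since owners of private workstations may care about these files.

```python
import glob
import os
import stat

# The "less than 5M free" threshold from the console message.
MIN_FREE_BYTES = 5 * 1024 * 1024


def free_bytes(path="/"):
    """Bytes available to unprivileged users on the filesystem holding path."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


def regular_files(directory):
    """Sorted paths of plain files directly inside directory.

    Symlinks, subdirectories and device nodes are skipped, which matches
    the "actual files" wording above.
    """
    found = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        mode = os.lstat(path).st_mode  # lstat: do not follow symlinks
        if stat.S_ISREG(mode):
            found.append(path)
    return sorted(found)


if __name__ == "__main__":
    avail = free_bytes("/")
    print("root free: %.1f MB" % (avail / (1024.0 * 1024.0)))
    if avail < MIN_FREE_BYTES:
        print("below the 5M threshold; candidates to review:")
        for pattern in ("/.mosaic*", "/.netscape", "/Mail"):
            for path in glob.glob(pattern):
                print("  " + path)
        for path in regular_files("/dev"):
            print("  " + path)
```

Plain files in /dev are usually the result of something writing to a
device name that did not exist, so they are worth listing separately
from the dot-directories.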
Problem #2:
Symptoms:
No MOTD is displayed on the idle screen before login. There are
messages in the console about no hesinfo. Xlogin won't let you
log in, because the workstation hasn't activated properly. Logging
in in console mode (i.e., after a control-P) hangs for a minute and
then gives you a prompt. ls /afs may say connection timed out.
Suspected cause: none.
Solution:
Log in as root in console mode. Reboot the machine. Let it come all
the way back up and start xlogin. Log in to the machine as root
again, and reboot again. (After the first reboot the machine should
be in the problem #3 state).
What it affected:
We found 6 sun machines in this state.
Problem #3:
Symptoms:
When you log in a line like the following appears in the console:
Athena Workstation (sun4) Version Reboot Auto 8.2.8 Wed Jul 29 16:29:39 EDT 1998
rather than:
Athena Workstation (sun4) Version 8.2.8 Wed Jul 22 00:09:03 EDT 1998
Machine may complain about being in the middle of an update.
If it is a private machine, mkserv services may not have been configured.
Machine won't take future patch releases.
Machine is otherwise completely usable as a workstation.
Suspected cause:
Machine boots into state #2, and hence can't attach system packs.
Machine is then rebooted, but because there were no system packs
attached it fails to run the post reboot part of the update, which
normally occurs before the system packs are detached and re-attached
during the boot process.
Solution:
Log into the machine as root and reboot it.
What it affected:
We found 6 sun machines in this state.
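
The problem #3 symptom is easy to miss by eye, so here is a small
Python sketch of how the two version lines above could be told apart.
It is a guess at a useful check, not an existing Athena tool: the
function name classify and the labels it returns are invented for the
example, and it assumes a healthy line always has the version number
immediately after the word "Version".

```python
import re

# Matches lines such as:
#   Athena Workstation (sun4) Version 8.2.8 Wed Jul 22 00:09:03 EDT 1998
VERSION_LINE = re.compile(
    r"^Athena Workstation \((?P<platform>[^)]+)\) Version (?P<rest>.+)$"
)


def classify(line):
    """Return 'ok', 'update-incomplete', or 'unrecognized' for a version line."""
    m = VERSION_LINE.match(line.strip())
    if m is None:
        return "unrecognized"
    first = m.group("rest").split()[0]
    # A healthy line has a dotted version number right after "Version".
    if re.fullmatch(r"\d+(\.\d+)*", first):
        return "ok"
    # Anything else (e.g. "Reboot Auto 8.2.8 ...") suggests the post-reboot
    # part of the update never ran.
    return "update-incomplete"


if __name__ == "__main__":
    for sample in (
        "Athena Workstation (sun4) Version Reboot Auto 8.2.8 Wed Jul 29 16:29:39 EDT 1998",
        "Athena Workstation (sun4) Version 8.2.8 Wed Jul 22 00:09:03 EDT 1998",
    ):
        print(classify(sample))
```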
Problem #4:
Symptoms:
Machine presents a console login prompt, instead of xlogin.
Logging in as root does not require a password.
Suspected cause:
Machine was interrupted in the middle of the update.
Solution:
No known solution, except re-installing. Bob, do you know anything
I don't?
What it affected:
We found 4 SGI Indys in this state.
Problem #5:
Symptoms:
Machine is still running 8.1. There is an error message in the
console about being unable to update sash, and requesting that you
re-install the machine. Machine is completely usable as an 8.1
workstation.
Suspected cause:
Sash partition on the disk is too small.
Solution:
Back up any important data that is on the machine and re-install.
What it affected:
We found 1 SGI Indy in this state.
We also found several machines with corrupt software, most of
which we re-installed. However, w20-575-83 and 84 (both Indys) were
in an especially weird state. We re-installed 84, but powered off 83,
in case someone (Bob?) wants to go look at it. We also found a couple
of hardware problems that Lou said he'd log into hotline.