[3857] in testers


cluster walk

daemon@ATHENA.MIT.EDU (Jonathon Weiss)
Thu Jul 30 02:43:12 1998

From: Jonathon Weiss <jweiss@MIT.EDU>
To: testers@MIT.EDU, cluster-services@MIT.EDU
Date: Thu, 30 Jul 1998 02:43:03 EDT


Tonight I did a cluster walk with Lou, mwhitson, and mvsilis.

I started with a list of machines that were not running 8.2 according
to snmp (or, in the case of Indys, were not running IRIX 6.2).

We found several classes of problems:

Problem #0:
Symptoms:
  Machine is still running 8.1, but has been logged into (either
  remotely or locally) fairly constantly since the release.

Suspected cause:
  Machines that are in use will not take the update, and won't even
  start their 4 hour desynchronization.

Solution:
  The machine will update itself after it is idle for long enough.

What it affected:
  We ran into 5 or 10 machines (mix of sun and SGI) in this state.



Problem #1:
Symptoms:
  The machine is still running 8.1.  When you log in you get a message
  in the console that says there isn't enough space on the root
  partition (less than 5M free) to do the update.  The machine has the
  8.2 packs attached, so there may be some other errors in the console.

Suspected cause:
  Stupid software, or stupid users leaving garbage on the root disk.

Solution:
  Look for things that can be removed.  Some good things to check are
  /.mosaic* /.netscape /Mail and any actual files (i.e., not symlinks
  or directories) in /dev.  Note that some private workstation owners
  may care about some of these things, so you probably want to give
  them the job of finding the files to remove.
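
  The check described above can be sketched in a few lines of Python.
  This is only an illustration, not a tool that exists on Athena: the
  function names regular_files and free_megabytes are made up here, and
  the sketch only lists candidates; it deletes nothing.  The 5M figure
  is the threshold quoted in the console message.

```python
import os
import stat

# Threshold quoted in the console message: less than 5M free blocks the update.
MIN_FREE_MB = 5


def regular_files(directory):
    """Return (path, size) for plain files directly inside `directory`.

    Symlinks, subdirectories, and device nodes are skipped, since only
    "actual files" left in a place like /dev are suspicious.
    """
    found = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        info = os.lstat(path)  # lstat, so a symlink is not followed
        if stat.S_ISREG(info.st_mode):
            found.append((path, info.st_size))
    return found


def free_megabytes(path="/"):
    """Return the space available to unprivileged users on `path`, in MB."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / (1024 * 1024)


if __name__ == "__main__":
    free = free_megabytes("/")
    print("free on /: %.1f MB (update needs %d)" % (free, MIN_FREE_MB))
    for path, size in regular_files("/dev"):
        print("%10d  %s" % (size, path))
```

  Run as root on the affected machine, this prints the free space on
  the root partition followed by any plain files sitting in /dev, which
  a person can then review before removing anything.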

What it affected:
  We found about 5 to 10 sun machines in this state.



Problem #2:  
Symptoms:
  No MOTD is displayed on the idle screen before login.  There are
  messages in the console about no hesinfo.  Xlogin won't let you
  log in, because the workstation hasn't activated properly.  Logging
  in in console mode (i.e., after a control-P) hangs for a minute and
  then gives you a prompt.  ls /afs may say connection timed out.

Suspected cause: none.

Solution: 
  Log in as root in console mode.  Reboot the machine.  Let it come all
  the way back up and start xlogin.  Log in to the machine as root
  again, and reboot again.  (After the first reboot the machine should
  be in the problem #3 state.)

What it affected:
  We found 6 sun machines in this state.



Problem #3:
Symptoms:
  When you log in a line like the following appears in the console:
  Athena Workstation (sun4) Version Reboot Auto 8.2.8 Wed Jul 29 16:29:39 EDT 1998
  rather than:
  Athena Workstation (sun4) Version 8.2.8 Wed Jul 22 00:09:03 EDT 1998
  Machine may complain about being in the middle of an update.
  If it is a private machine, mkserv services may not have been configured.
  Machine won't take future patch releases.
  Machine is otherwise completely usable as a workstation.
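
  For anyone checking many machines, the two console lines quoted above
  can be told apart mechanically.  The following Python sketch is based
  only on those two sample lines; the function name classify and the
  labels "mid-update" and "ok" are invented for illustration, and how
  the line is collected from each machine is left open.

```python
import re

# Shape of the console line quoted in the message, e.g.
#   Athena Workstation (sun4) Version 8.2.8 Wed Jul 22 00:09:03 EDT 1998
VERSION_LINE = re.compile(
    r"^Athena Workstation \((?P<arch>[^)]+)\) Version (?P<rest>.+)$"
)


def classify(line):
    """Return (state, version) for a version line, or None if unrecognized.

    state is "mid-update" when the line carries the "Reboot Auto" marker
    seen on problem #3 machines, and "ok" otherwise.
    """
    match = VERSION_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.group("rest").split()
    if fields[:2] == ["Reboot", "Auto"]:
        if len(fields) < 3:
            return None  # marker present but no version number after it
        return ("mid-update", fields[2])
    return ("ok", fields[0])
```

  A machine whose line classifies as "mid-update" is a candidate for the
  root login and reboot described in the solution below.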

Suspected cause:
  Machine boots into state #2, and hence can't attach system packs.
  Machine is then rebooted, but because there were no system packs
  attached it fails to run the post-reboot part of the update, which
  normally occurs before the system packs are detached and re-attached
  during the boot process.

Solution:
  Log into the machine as root and reboot it.

What it affected:
  We found 6 sun machines in this state.



Problem #4:
Symptoms:
  Machine presents a console login prompt, instead of xlogin.
  Logging in as root does not require a password.

Suspected cause:
  Machine was interrupted in the middle of the update.

Solution:
  No known solution, except re-installing.  Bob, do you know anything
  I don't?

What it affected:
  We found 4 SGI Indys in this state.



Problem #5:
Symptoms:
  Machine is still running 8.1.  There is an error message in the
  console about being unable to update sash, and requesting that you
  re-install the machine.  Machine is completely usable as an 8.1
  workstation.

Suspected cause:
  Sash partition on the disk is too small.

Solution:
  Back up any important data that is on the machine and re-install.

What it affected:
  We found 1 SGI Indy in this state.


We also found several machines with corrupt software, most of
which we re-installed.  However, w20-575-83 and 84 (both Indys) were
in an especially weird state.  We re-installed 84, but powered off 83,
in case someone (Bob?) wants to go look at it.  We also found a couple
of hardware problems that Lou said he'd log into hotline.
