[21691] in Athena Bugs


serious problems with 9.1.25 update

daemon@ATHENA.MIT.EDU (Tom Cavin)
Thu Mar 27 16:43:31 2003

MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <16003.28800.804496.636673@lap1-wccf.mit.edu>
Date: Thu, 27 Mar 2003 16:43:28 -0500
From: Tom Cavin <cavin@MIT.EDU>
To: Athena Bugs list <bugs@MIT.EDU>
CC: Tom Cavin <cavin@MIT.EDU>
In-Reply-To: <880E288B-6085-11D7-9557-00039347B6FE@mit.edu>


Hi,

This appeared on linux-help but Nick didn't post to bugs.

I looked at the bug Nick mentions, and it is marked as a duplicate of this
one:

    http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=69920

The gist of it seems to be that either the "noapic" boot option needs to be
used, or the tg3 driver needs to be patched.
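
For the "noapic" route, here is a minimal Python sketch of a check that the
option actually reached the running kernel after a reboot.  It assumes the
kernel command line is readable at /proc/cmdline; the helper names are
hypothetical and not part of any Athena tool.  On a LILO system the option
would normally be passed with an append="noapic" line in /etc/lilo.conf,
followed by rerunning /sbin/lilo, but that is untested on the machine in
question.

```python
def has_kernel_option(cmdline: str, option: str) -> bool:
    """Return True if `option` is a whole word on the kernel command line."""
    # Split on whitespace so "noapic" does not match inside another word.
    return option in cmdline.split()


def read_kernel_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's command line (Linux-specific path, assumed)."""
    with open(path) as handle:
        return handle.read().strip()
```

Calling has_kernel_option(read_kernel_cmdline(), "noapic") after the reboot
should return True if the option took effect.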

Has anyone looked at this?  Is there a quick fix I can apply from an AFS
locker?  Or do I need to tweak something else?

Other points of note:  The Athena system in question is a SCSI system that
is still running LILO instead of GRUB.  The RAIDking disk is assigned
/dev/sda if it is turned on, with the system disk as /dev/sdb.  Without the
RAIDking, the system disk is /dev/sda.  This causes problems in booting if
you don't pay attention.

The RAID is using ext2 file systems and hasn't yet been converted to ext3.
(That would be a big win here, but the system hasn't been reliable enough
to make the change yet.)

(I'm now going to finish reading the 69920 bug report and see if there are
other suggestions.)

Any pointers would be appreciated.  

Thanks,

	--Tom

Nicholas Knouf writes:
 > As a follow-up, I found this information regarding frequent crashes on 
 > Dell PowerEdge servers and kernel 2.4.18-18.xsmp, the very kernel we 
 > are using:
 > 
 > http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=78166
 > 
 > If this is the case, would people suggest downloading and compiling a 
 > more recent kernel?
 > 
 > Nick
 > 
 > On Thursday, March 27, 2003, at 01:40  pm, Nicholas Knouf wrote:
 > 
 > > Hi All,
 > >
 > > Yesterday we updated our main analysis server in lab from 9.1.11 to 
 > > 9.1.25.  The update completed successfully, and we were able to reboot 
 > > into 9.1.25.  However, since the update, the server locks up hard 
 > > approximately every 2 hours.  By "hard" I mean complete console 
 > > lockup: cannot switch to any text console, nothing, requiring a hard 
 > > reboot.
 > >
 > > Notes about the server:  the server is a Dell PowerEdge 2650 with dual 
 > > 2.4GHz Xeon processors with 4GB of RAM.  Athena is installed on a 
 > > local hard drive.  There is a RaidKing 460 720GB RAID (in RAID-5) 
 > > attached to the server over a SCSI 160 connection.  Partitions from 
 > > the RAID are served to a limited set of lab machines by NFS (since 
 > > knfs is not stable enough (at last check) for Linux and because we 
 > > found that AFS slowed things down dramatically).
 > >
 > > The logs don't show anything striking; the last log entry before each 
 > > reboot is of the form (I cannot give you the real log entry at the 
 > > moment; the server is rebooting, and because we have to fsck a 720GB 
 > > RAID, it'll take an hour and a half):
 > >
 > > Mar 27 13:00:00 pasque CROND[3392]: (root) CMD (/etc/athena/desync 
 > > 360; /etc/athena/reactivate > /dev/console 2>&1)
 > >
 > > After looking through the reactivate script, we cannot find anything 
 > > that stands out as the obvious source of the problem.  Running the 
 > > reactivate script from the command line as root doesn't cause the 
 > > system to lock up, up to about 16 runs (since we thought there might 
 > > be an issue with /var/athena/reactivate.count).
 > >
 > > There is nothing else in any other log file that is correlated with 
 > > the crash.  We get a new ksyms on each reboot, and each is slightly 
 > > different from the other (mostly in the 2nd decimal place of the CPU 
 > > speed).  However, there are changes in the "Physical Processor ID" 
 > > from one ksyms to the next (but nothing that can be correlated, even 
 > > with pre- and post-update).
 > >
 > > I'm completely stumped, and at this point in time, I would do almost 
 > > anything to get it back up and working again (including reverting back 
 > > to a previous release, if possible).
 > >
 > > Thanks,
 > >
 > > Nick Knouf
 > > Lab Manager, Kanwisher Lab
 > >
 > 
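
To compare the last log entry before each lockup, a small Python sketch that
splits a syslog line of the form Nick quotes into its fields and keeps the
final parseable entry.  The regular expression and the names SyslogEntry,
parse_syslog_line and last_entry are assumptions for illustration; the
pattern is based only on the single CROND line quoted above and may not
match every line in the real logs.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Pattern assumed from the one quoted line:
# "Mar 27 13:00:00 pasque CROND[3392]: (root) CMD (...)"
SYSLOG_RE = re.compile(
    r"^(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<program>[^\[:\s]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<message>.*)$"
)


class SyslogEntry(NamedTuple):
    timestamp: str
    host: str
    program: str
    pid: Optional[int]
    message: str


def parse_syslog_line(line: str) -> Optional[SyslogEntry]:
    """Parse one syslog line; return None if it does not match the pattern."""
    match = SYSLOG_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    pid = match.group("pid")
    return SyslogEntry(
        timestamp=match.group("timestamp"),
        host=match.group("host"),
        program=match.group("program"),
        pid=int(pid) if pid is not None else None,
        message=match.group("message"),
    )


def last_entry(lines: Iterable[str]) -> Optional[SyslogEntry]:
    """Return the last line that parses, i.e. the final entry before a crash."""
    last = None
    for line in lines:
        entry = parse_syslog_line(line)
        if entry is not None:
            last = entry
    return last
```

Running last_entry over the log as it stood before each hard reboot would
show whether the final entry is always the reactivate cron job or only
sometimes.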

-- 
Tom Cavin                                  Phone:  (617) 258 - 7806
Computer Operations Manager                Email:     cavin@mit.edu
MIT - Whitaker College Computer Facility          or tec@ai.mit.edu
