[2962] in Release_7.7_team


Re: Suspecting Major Problems:

daemon@ATHENA.MIT.EDU (andrew m. boardman)
Fri Sep 21 00:38:36 2001

Date: Fri, 21 Sep 2001 00:38:32 -0400
Message-Id: <200109210438.AAA05652@pothole.mit.edu>
From: "andrew m. boardman" <amb@MIT.EDU>
To: zacheiss@mit.edu
CC: larugsi@mit.edu, release-team@mit.edu, bmurphy@mit.edu, hardserv@mit.edu
In-reply-to: <200109210356.XAA19564@indian-burial-ground-pet-store.mit.edu>
	(message from Garry Zacheiss on Thu, 20 Sep 2001 23:56:28 -0400 (EDT))


>On Linux, it's a little bit more of an open issue.  ext2 (the
>filesystem Linux is using) is known to be less robust than we would
>like, and there isn't any built-in logging/journaling filesystem we can
>use.  Andrew, do you have any suggestions?

Nothing that's directly applicable.  I suspect we could switch to ext3
without too much pain, but it would be a huge change and not something
we could seriously consider until IAP, if even then.  The best solution
for now is probably to fix whatever is causing the crashes.

I do, however, suspect that the cause of the crashes isn't software but
users.  I've seen frustrated people hard-reset cluster machines when the
network was being flaky.  In a conversation with Jonathon and Lou
earlier, it was also suggested that users (particularly incoming frosh)
may see something that looks like the Microsoft boxes they are used to
abusing, so they just hit the power switch when they're done and think
nothing of it.

I'll head for W20 and see what I can find.  Looking at who (if anyone)
was logged in when the machine came down, and asking that person (if any)
what they did, might be interesting.  Especially if it's the same couple
of people all the time.
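
As a rough sketch of that check (not something that exists in the
release today): on Linux, last(1) marks a session that was still open
when the machine went down with "- crash" in place of a logout time.
The Python below assumes that output format and simply counts users per
crash-terminated session.  The function name and the sample usernames
are made up for illustration.

```python
import re
from collections import Counter

# Assumed last(1) line shape for a session cut short by a crash:
#   jdoe     tty1          Thu Sep 20 23:10 - crash  (00:45)
CRASH_LINE = re.compile(r"^(?P<user>\S+)\s+(?P<tty>\S+)\s+.*-\s+crash\b")

# Pseudo-users that last(1) prints for system events, not real logins.
PSEUDO_USERS = {"reboot", "shutdown", "runlevel"}


def users_logged_in_at_crash(last_output):
    """Count, per user, the sessions that ended because the machine went down."""
    counts = Counter()
    for line in last_output.splitlines():
        match = CRASH_LINE.match(line)
        if match and match.group("user") not in PSEUDO_USERS:
            counts[match.group("user")] += 1
    return counts


# Hypothetical sample data; real input would be the text printed by `last`.
SAMPLE = """\
jdoe     tty1                          Thu Sep 20 23:10 - crash  (00:45)
reboot   system boot  2.4.7            Thu Sep 20 23:56          (00:41)
asmith   tty1                          Thu Sep 20 20:01 - 20:30  (00:29)
jdoe     tty1                          Wed Sep 19 21:15 - crash  (01:02)
"""

if __name__ == "__main__":
    for user, n in users_logged_in_at_crash(SAMPLE).most_common():
        print(user, n)
```

Run against the wtmp history from a handful of the affected machines,
something like this would show quickly whether the same few names keep
turning up.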
