[4284] in Athena Bugs
Disk Server Problems Damage Student Opinion
daemon@ATHENA.MIT.EDU (Jonathan I. Kamens)
Wed Feb 21 19:03:21 1990
Date: Wed, 21 Feb 90 19:02:43 -0500
From: Jonathan I. Kamens <jik@PIT-MANAGER.MIT.EDU>
To: tldavis@ATHENA.MIT.EDU
Cc: bugs@ATHENA.MIT.EDU, lcomeau@HSTBME.MIT.EDU
In-Reply-To: bugs[4283]
Before responding to your message, let me point out that the "bugs"
mailing list is probably the wrong place to send it. In particular,
questions about operations problems (such as, "Why does this keep
happening?") should be addressed to the "hotline" mailing list (or to
their phone number, x3-1410), so that the people in charge of hardware
operations see them. Questions about what to do about the problems
(such as "Why can't csh suspend hung jobs?" or "Should hung
workstations be rebooted?") should be addressed to the Project Athena
consulting staff. You can contact the Project Athena Consultants
either by stopping by their office (11-115), by calling them on the
phone (x3-4435), or by asking a question in olc (type "olc" at the
athena% prompt).
In any case, I will attempt to respond to the questions raised in
your mail, since I happen to be a consultant in addition to my role as
a member of the Project Athena Quality Assurance Team (which deals
with all mail sent to "bugs").
   From: tldavis@ATHENA.MIT.EDU
   Date: Wed, 21 Feb 90 18:24:19 EST
   What is it that is happening when everyone starts to get AFS and
   NFS errors popping up and then suddenly everything freezes for 1 -
   20 minutes? This happens frequently at our cluster (E25-131), and
   I've seen it happen elsewhere as well.
Unfortunately, the MIT Campus Network has been having some problems
recently. The telecommunications group from Information Services is
working on fixing the problems, but until they succeed, there are
going to continue to be network outages.
The most problematic portion of the network right now is the gateway
to the building 11 subnet. Unfortunately, many important fileservers
and other service-providing machines are on that subnet, so when that
gateway goes down, lots of people notice.
   It is EXTREMELY frustrating when a server goes on the blink, mainly
   because every process reading or writing to it gets stuck in a DISK
   WAIT, from which there is absolutely no recovery for several
   minutes. Why can't csh KILL or at least SUSPEND (^Z) those
   processes while they are waiting for the disk server?
Believe me, it is extremely frustrating for us as well. We consider
it very important to make the service provided by Project Athena as
reliable as possible (indeed, that is our top priority), and several
problems such as these are a cause of much hand-wringing.
As for your specific question, there really is no way for csh to
kill or suspend such processes, because they are hung in kernel wait.
It is totally out of the hands of the shell process, and in the hands
of the kernel. Usually, the kernel will time out on the connection
after a few minutes (I think 5 is the current setting, but it may be
more), at which point the processes can be suspended or killed.
Therefore, if you type ^Z when you realize a process is hung, it may
take a few minutes, but it will eventually take effect.
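To illustrate why a ^Z typed at a hung process is not lost, here is a
minimal sketch in Python (Unix only). It is an analogy rather than a
reproduction: a real kernel disk wait cannot be set up on demand, so
the sketch uses a blocked signal mask as a stand-in for the period in
which the process cannot act on signals. It also sends SIGUSR1 instead
of the SIGTSTP that ^Z generates, so that the demonstration does not
stop itself. The function name is made up for this example.

```python
import signal
import threading


def pending_then_delivered():
    """Send a signal while it cannot be acted on, then let it through.

    Returns (was_pending, ran_while_blocked, ran_after_unblock).
    """
    delivered = []
    old_handler = signal.signal(
        signal.SIGUSR1, lambda signum, frame: delivered.append(signum)
    )

    # Stand-in (assumption) for "hung in kernel wait": the signal is
    # blocked, so it can be sent but not yet acted upon.
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
    try:
        # Direct the signal at this thread, much as the terminal
        # driver signals the foreground job when ^Z is typed.
        signal.pthread_kill(threading.get_ident(), signal.SIGUSR1)
        was_pending = signal.SIGUSR1 in signal.sigpending()
        ran_while_blocked = bool(delivered)
    finally:
        # Stand-in for the kernel timing out on the dead connection:
        # the pending signal is delivered once it is unblocked.
        signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGUSR1})

    ran_after_unblock = bool(delivered)
    signal.signal(signal.SIGUSR1, old_handler)
    return was_pending, ran_while_blocked, ran_after_unblock


if __name__ == "__main__":
    print(pending_then_delivered())
```

The signal is recorded as pending while the process cannot respond,
and its handler runs only after the block is lifted, which is the
same pattern as a ^Z taking effect once the kernel gives up on the
server.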
   I know that once I have a csh going, I can create my new shells
   with the -f (fast) option to avoid that deadly search of my home
   directory. I guess I'm going to have to add a "fast csh" to my
   window manager menu for such emergency exits.
You can probably also do "set path = ($athena_path)" to get only the
system packs in your path. Unless your RVD servers are also down,
the system pack directories should all work, and you will have the
full Athena set of commands available to you without having to go
through your home directory.
   Basically, I think Athena is great except for this one giant
   problem which is absolutely undocumented, as far as I can tell. My
   current answer to student inquiries is to say "The network or the
   file server is sick. Try again in a few minutes, and if it still
   doesn't work just come back later." Is this an appropriate answer?
   Often, students are unable even to logout. Should a workstation in
   this state be rebooted?
The Athena Consultants can provide information about hardware
problems, network outages, etc. if they are asked. You do have to
ask, though.
I hope this information helps.
Jonathan Kamens
Project Athena Quality Assurance