[488] in Moira
Re: Mailhub ALARM CONDITION
daemon@ATHENA.MIT.EDU (Theodore Ts'o)
Sat Nov 28 11:29:58 1992
Date: Sat, 28 Nov 92 11:29:36 EST
From: tytso@Athena.MIT.EDU (Theodore Ts'o)
To: hoffmann@MIT.EDU
Cc: root@tsx-11.MIT.EDU, network@MIT.EDU, bug-moira@Athena.MIT.EDU
In-Reply-To: hoffmann@MIT.EDU's message of Sat, 28 Nov 92 09:15:01 -0500,
I did some investigating in the various log files on athena-as-well.
According to the mqueue syslogs, it looks like it went catatonic shortly
after Wednesday, Nov 25, at 16:55:44. This would explain why the DCM
hung Wednesday night, as well as why the mailhub checker was hanging.
(I suspect that TCP connections were being established, and ICMP echoes
were being returned, but not much else.) I will look into putting some
timeout code in the mailhub checker, so we won't get caught this way
again.
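As a rough illustration of the kind of timeout meant here (a minimal sketch, not the actual mailhub checker; the function name check_mailhub and the 30-second default are assumptions made up for this example), the check below treats the host as healthy only if it sends an SMTP 220 greeting within the time limit. A machine that accepts the TCP connection and then says nothing is reported as down instead of hanging the checker.

```python
import socket


def check_mailhub(host, port=25, timeout=30.0):
    """Return True only if the server sends an SMTP 220 greeting in time.

    Hypothetical helper: the timeout applies separately to the connect
    and to the read of the greeting banner.
    """
    try:
        # create_connection gives up if the TCP handshake takes too long.
        with socket.create_connection((host, port), timeout=timeout) as conn:
            # The same limit applies to waiting for the banner, which is
            # the step that hangs when the host accepts connections but
            # does nothing else.
            conn.settimeout(timeout)
            banner = conn.recv(512)
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
    return banner.startswith(b"220")
```

A checker built this way would call check_mailhub("some-host") in its polling loop and raise the alarm on False, rather than blocking forever in the read.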
Presumably the reason why Moira didn't try to generate a new aliases.out
file is that a DCM was still trying to process athena-as-well, so it
didn't want to bash the aliases.out file out from under it. Perhaps there
should be some timeout code in the DCM processing, as well?
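One way such a timeout could work (a sketch only; this is not how the DCM actually tracks in-progress updates, and may_regenerate and its parameters are hypothetical names) is to record when an update to a host started and to stop honoring that in-progress state once it is older than some limit:

```python
import time


def may_regenerate(update_started_at, max_update_seconds, now=None):
    """Decide whether it is safe to regenerate an output file.

    update_started_at: epoch seconds when the in-progress update began,
        or None if no update is running (hypothetical bookkeeping).
    max_update_seconds: how long an update may run before it is
        considered hung and no longer blocks regeneration.
    """
    if update_started_at is None:
        return True
    if now is None:
        now = time.time()
    # An update older than the limit is assumed to be stuck.
    return (now - update_started_at) > max_update_seconds
```

For example, may_regenerate(900.0, 3600, now=1000.0) is False (the update is only 100 seconds old), while may_regenerate(900.0, 60, now=1000.0) is True. Picking the limit is the hard part: it has to be longer than any legitimate update would take.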
I checked /usr/adm/messages and the uerf logs on athena-as-well, and
couldn't find anything interesting. Unless we managed to get a crash
dump, I guess we'll just have to chalk this up to the joys and wonders
of relying on DEC system programmers. :-)
- Ted
P.S. The MX fallback probably both helped and hurt; if mailhub
processing had failed completely, we probably would have noticed before
going home for Thanksgiving. Then again, this was probably the best
time for us to have learned about this particular failure mode: the
load over Thanksgiving was probably much lighter, so it wouldn't have
put as much of a strain on the Vax, and fewer people would have been
using the mailhub, so hopefully nobody noticed any real service impact.