[7011] in Release_7.7_team


Nagios monitor for the clusters

daemon@ATHENA.MIT.EDU (Jonathon Weiss)
Fri Oct 15 16:37:21 2010

Date: Fri, 15 Oct 2010 16:38:25 -0400
Message-Id: <201010152038.o9FKcPFl013694@distraction.mit.edu>
To: acis-team@MIT.EDU, release-team@MIT.EDU
From: Jonathon Weiss <jweiss@MIT.EDU>


I've sent this out before, but I think some new people have joined
these lists since then.  We have a nagios installation that monitors
the cluster/dorm/etc workstations and printers.  This nagios
installation is set up not to send any notifications; it is meant to
be monitored only via the web.  We have a wiki page with links to some
of the most interesting nagios pages.

https://sowiki.mit.edu/wiki/index.php/Info:Cluster_Nagios

On the "Hosts that don't ping" page there are about 60 machines
listed.  I'm fairly certain that some of them don't actually exist
anymore.  If you let me know which ones don't exist, I'll delete (or
reserve) them in moira, and they'll fall out of monitoring.

The "Hosts that are up but have a serious problem" page lists things
that can be software problems, though some of them may be false
positives.  At least initially, I think this page will be more useful
to the release-team folks than to the acis-team folks.  Longer term we
may be able to make this more transparent to the acis folks.

The "Hosts that are up but have a mild problem" page is similar, but
with less important messages.

-- 

	Jonathon
