[33878] in North American Network Operators' Group
Re: Monitoring highly redundant operations
daemon@ATHENA.MIT.EDU (poptix@sleepybox.poptix.net)
Wed Jan 24 22:15:16 2001
Date: Wed, 24 Jan 2001 19:46:53 -0600 (CST)
From: <poptix@sleepybox.poptix.net>
To: Simon Lockhart <simonl@rd.bbc.co.uk>
Cc: <nanog@merit.edu>
In-Reply-To: <5552.980379833@sunf25>
Message-ID: <Pine.LNX.4.30.0101241939540.7664-100000@sleepybox.poptix.net>
On Wed, 24 Jan 2001, Simon Lockhart wrote:
>
> >But he does raise an interesting problem. How do you know if your
> >highly redundant, diverse, etc. system has a problem? With an ordinary
> >system it's easy. It stops working. In a highly redundant system you
> >can start losing critical components, but not be able to tell if
> >your operation is in fact seriously compromised, because it continues
> >to "work."
>
> Indeed. We currently monitor each part of our operation from a monitoring
> station on our network. Under certain conditions, this can give us both
> false positives and false negatives:
>
> - We've lost off-site routing. Our monitoring station can see all our
> nodes okay, so it thinks everything is fine, but no-one else can see them.
>
With our monitoring software we also check a few off-site links (our
interfaces on our uplinks' routers, and the router after each of those).
It tends to work well.
> - We've lost routing to just the part of our network with the monitoring
> station on. It reports that everything is down, when in fact stuff is
> working fine for serving the rest of the internet.
>
For that situation the software we use allows us to set dependencies, i.e.,
servers A, B & C depend on router Z; if router Z is down, assume servers A, B
& C are unreachable/down (but don't start spewing out alerts about them).

Unfortunately the software is MS based (Enterprise Monitor, now named IP
monitor iirc). I first came across it while working at Xerox. It resides on
the only MS box on our network (beyond customer machines, and yes, a Windows
monitoring box is kind of an oxymoron).
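
To make the dependency idea concrete, here is a rough sketch in Python. It is
not how the product above implements it; the host names, the DEPENDS_ON table
and the hosts_to_alert() helper are all made up for illustration.

```python
# Hypothetical dependency map: each host names the device it sits behind.
DEPENDS_ON = {
    "server-a": "router-z",
    "server-b": "router-z",
    "server-c": "router-z",
    "router-z": None,  # nothing upstream of it in this toy example
}


def hosts_to_alert(down, depends_on=DEPENDS_ON):
    """Return the down hosts that deserve an alert, sorted by name.

    A down host is suppressed when anything upstream of it is also down,
    because the upstream failure presumably explains it.
    """
    alerts = []
    for host in sorted(down):
        parent = depends_on.get(host)
        seen = {host}  # guards against a loop in a badly built map
        suppressed = False
        while parent is not None and parent not in seen:
            if parent in down:
                suppressed = True
                break
            seen.add(parent)
            parent = depends_on.get(parent)
        if not suppressed:
            alerts.append(host)
    return alerts


if __name__ == "__main__":
    # Router Z and everything behind it fail their checks: one alert only.
    print(hosts_to_alert({"router-z", "server-a", "server-b", "server-c"}))
    # Only server B fails: it is a real problem on its own.
    print(hosts_to_alert({"server-b"}))
```

The servers behind router Z are still recorded as down; they just don't page
anyone until the router itself has been dealt with.
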
> One way we plan to overcome these issues is to locate monitoring stations
> on other ISPs networks at random places on the internet. If you correlate
> the results from these multiple monitoring stations, then you get a better
> view of what the rest of the internet is seeing.
>
A kind of distributed monitoring system would be nice, or just having
people who agree to let you add your systems to their monitoring systems
(easily done with some software, not so easily with others). I also do this
to a small extent.
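
Correlating the results from several outside stations could look something
like the sketch below. It is only an illustration under my own assumptions:
the station names, the correlate() and local_view_disagrees() helpers and the
"up"/"down"/"partial" verdicts are invented here, not taken from any product.

```python
def correlate(external_reports):
    """Combine per-station results into one verdict.

    external_reports maps a station name to True if that station could
    reach the target, False if it could not.
    """
    votes = list(external_reports.values())
    if not votes:
        return "unknown"  # no outside view at all
    reachable = sum(1 for ok in votes if ok)
    if reachable == len(votes):
        return "up"
    if reachable == 0:
        return "down"
    return "partial"  # some of the internet sees it, some does not


def local_view_disagrees(local_ok, verdict):
    """True when the on-net monitoring station contradicts the outside view."""
    return (local_ok and verdict == "down") or (not local_ok and verdict == "up")


if __name__ == "__main__":
    # Off-site routing lost: the local station is happy, nobody else is.
    outside = {"isp-1": False, "isp-2": False, "isp-3": False}
    verdict = correlate(outside)
    print(verdict, local_view_disagrees(True, verdict))

    # Only the monitoring segment is cut off: outside stations all see the site.
    outside = {"isp-1": True, "isp-2": True, "isp-3": True}
    verdict = correlate(outside)
    print(verdict, local_view_disagrees(False, verdict))
```

A disagreement between the local station and the outside verdict is exactly
the false positive / false negative case Simon describes above.
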
Matthew S. Hallacy
XtraTyme Technologies
> Simon
> --
> Simon Lockhart | Tel: +44 (0)1737 839676
> Internet Engineering Manager | Fax: +44 (0)1737 839516
> BBC Internet Services | Email: Simon.Lockhart@bbc.co.uk
> Kingswood Warren,Tadworth,Surrey,UK | URL: http://support.bbc.co.uk/