[61752] in North American Network Operators' Group
Real network failure causes Was: What do you want your ISP to
daemon@ATHENA.MIT.EDU (Ian Mason)
Thu Sep 4 10:24:27 2003
Date: Thu, 04 Sep 2003 14:59:09 +0100
To: Rob Thomas <robt@cymru.com>,
Johannes Ullrich <jullrich@euclidian.com>
From: Ian Mason <nanog@ian.co.uk>
Cc: NANOG <nanog@merit.edu>
In-Reply-To: <Pine.GSO.4.56.0309031615380.28717@dragon.sauron.net>
Errors-To: owner-nanog-outgoing@merit.edu
At 22:30 03/09/2003, Rob Thomas wrote:
[snip]
>effects. We all know better. Bugs aren't restricted only to
>products from Redmond, typos happen, and the performance hit can
>be quite painful.
In my experience, more network downtime is caused by configuration errors
than by all other causes together.
The best diagnostic tool I've ever had is a script I cobbled together over
two hours one night. Once an hour, it simply collected all the router
configs across the network, did a 'diff' between the current and last
config, and if there were changes, emailed them to me, along with a TACACS+
log summary that showed who had logged into which router when.
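For anyone wanting to try the same thing, the diff-and-report step might look roughly like the Python sketch below. This is not the original script, only an illustration of the idea. It assumes the configs have already been fetched into dictionaries keyed by router name; the function names diff_configs and build_report are made up for the example, and the hourly scheduling, the config collection, the email delivery and the TACACS+ log summary are all left out.

```python
import difflib


def diff_configs(previous, current, router):
    """Return a unified diff of two config snapshots, or '' if unchanged."""
    lines = difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile=f"{router} (last)",
        tofile=f"{router} (current)",
        lineterm="",
    )
    return "\n".join(lines)


def build_report(last_snapshots, current_snapshots):
    """Collect diffs for every router whose config changed since the last run.

    Both arguments are hypothetical dicts mapping router name -> config text.
    A router with no earlier snapshot is reported as entirely new.
    """
    sections = []
    for router in sorted(current_snapshots):
        delta = diff_configs(
            last_snapshots.get(router, ""),
            current_snapshots[router],
            router,
        )
        if delta:
            sections.append(delta)
    return "\n\n".join(sections)
```

An empty report means nothing changed, so the mail can simply be skipped for that hour.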
Experience with this quickly taught me to check these summary change logs
whenever a problem was escalated to me. Most times the problem was related
to a config change, not an external cause. Further experience taught me to
look out for one particular engineer's name in the logs, but that's another
story.