[190811] in North American Network Operators' Group
Re: Operations task management software?
daemon@ATHENA.MIT.EDU (Lee)
Wed Jul 27 19:19:50 2016
X-Original-To: nanog@nanog.org
In-Reply-To: <51622BA9-0A59-4E0C-B5CB-518D53015D33@dino.hostasaurus.com>
From: Lee <ler762@gmail.com>
Date: Wed, 27 Jul 2016 19:19:45 -0400
To: David Hubbard <dhubbard@dino.hostasaurus.com>
Cc: "nanog@nanog.org" <nanog@nanog.org>
Errors-To: nanog-bounces@nanog.org
On 7/27/16, David Hubbard <dhubbard@dino.hostasaurus.com> wrote:
> Hi all, curious if anyone has recommendations on software that helps manage
> routine duties assigned to operations staff?
Have computers do the routine scut work - not people.
> For example, let’s say we have a P&P that says someone from the netops group
> must check that Rancid is successfully backing up all router configs
> bi-weekly.
You've got the source code for rancid, so change rancid-run to do something
like
LOGFILE=$LOGDIR/$GROUP.`date +%Y%m%d.%H%M%S`; export LOGFILE
change the
) >$LOGDIR/$GROUP.`date +%Y%m%d.%H%M%S` 2>&1
to
) >$LOGFILE 2>&1
and then in control_rancid do something like
grep "clogin error:" $LOGFILE | sort | uniq -c >$TMP.fail
if [ -s $TMP.fail ]; then
# got some output, mail the report
...
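A minimal sketch of how that check could be packaged, assuming the log
contains lines with the literal text "clogin error:" as in the grep above.
The function name summarize_failures, the $MAILTO variable, and the mail
invocation are illustrative assumptions, not part of rancid:

```shell
#!/bin/sh
# Sketch only: summarize_failures is a hypothetical helper, not rancid code.

# Write a count of each distinct "clogin error:" line found in $1 to $2.
# Returns 0 if at least one failure was found, non-zero otherwise.
summarize_failures() {
    logfile=$1
    report=$2
    grep "clogin error:" "$logfile" | sort | uniq -c >"$report"
    # -s is true only when the report exists and is non-empty
    [ -s "$report" ]
}

# Example wiring (commented out so the sketch has no side effects);
# MAILTO is an assumed variable holding the recipient address:
# if summarize_failures "$LOGFILE" "$TMP.fail"; then
#     mail -s "rancid failures for $GROUP" "$MAILTO" <"$TMP.fail"
# fi
```

Keeping the grep and the "is there anything to report" test in one function
means the same helper can be pointed at any log that marks failures with a
fixed string.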
Do the same type thing for checking on
> backup failures, backup internet circuit status, out of band interfaces, etc.
Automate the checks, put the scripts in crontab & mail out an
"OhNoes!" or "all clear" msg at the end. At which point you're left
with the problem of making sure the managers are looking at the emails
& making sure whatever problems are found actually get fixed :)
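One way the "run the checks from cron and mail a verdict" idea might look,
as a rough sketch. The function name run_checks, the script path, and the
mail address in the comments are all made up for illustration; each check
is assumed to be a command that exits non-zero on failure:

```shell
#!/bin/sh
# Sketch only: run_checks and the names in the comments are hypothetical.

# Run each argument as a shell command. Print a one-line verdict first,
# then one line per check. Returns non-zero if any check failed.
run_checks() {
    failed=0
    details=""
    for check in "$@"; do
        if sh -c "$check" >/dev/null 2>&1; then
            details="$details
ok:   $check"
        else
            failed=$((failed + 1))
            details="$details
FAIL: $check"
        fi
    done
    if [ "$failed" -eq 0 ]; then
        echo "all clear"
    else
        echo "OhNoes! $failed check(s) failed"
    fi
    # drop the leading blank line left by the accumulation above
    echo "$details" | sed '/^$/d'
    [ "$failed" -eq 0 ]
}

# Example crontab entry (assumed path and address), Mondays at 06:00:
# 0 6 * * 1 /usr/local/bin/netops-checks.sh 2>&1 | mail -s "netops checks" netops@example.com
```

Putting the verdict on the first line makes it easy to see at a glance
whether the rest of the mail needs reading.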
Regards,
Lee
> Ideally, it would send an email reminder to this pre-defined
> group of people saying hey, it’s Monday, someone needs to check this and
> come acknowledge the task as having been completed.  If that doesn’t occur,
> pre-defined manager X is notified on Tuesday.  If manager X doesn’t get
> someone to complete the task, director Y is notified, so on and so forth.
> Then, perhaps periodically it emails manager X anyway and says hey, it’s
> been three months, you need to audit netops to ensure they’re actually doing
> the Rancid audit and not just checking that it was done.  This could be
> applied to the staff who check on backup failures, backup internet circuit
> status, out of band interfaces, etc.
>
> A data center I looked at recently had QR code stickers on all of their
> infrastructure stuff and there were staff assigned to check and log certain
> displayed values each day.  The software would at least ensure they actually
> visited the equipment by requiring they scan the relevant QR code when in
> front of it.  So I figure something that does what I’m looking for properly
> already exists.
>
> Thanks,
>
> David
>