[190811] in North American Network Operators' Group


Re: Operations task management software?

daemon@ATHENA.MIT.EDU (Lee)
Wed Jul 27 19:19:50 2016

X-Original-To: nanog@nanog.org
In-Reply-To: <51622BA9-0A59-4E0C-B5CB-518D53015D33@dino.hostasaurus.com>
From: Lee <ler762@gmail.com>
Date: Wed, 27 Jul 2016 19:19:45 -0400
To: David Hubbard <dhubbard@dino.hostasaurus.com>
Cc: "nanog@nanog.org" <nanog@nanog.org>
Errors-To: nanog-bounces@nanog.org

On 7/27/16, David Hubbard <dhubbard@dino.hostasaurus.com> wrote:
> Hi all, curious if anyone has recommendations on software that helps manage
> routine duties assigned to operations staff?

Have computers do the routine scut work - not people.

> For example, let’s say we have a P&P that says someone from the netops group
> must check that Rancid is successfully backing up all router configs
> bi-weekly.

You've got the source code for rancid, so change rancid-run to do something like
  LOGFILE=$LOGDIR/$GROUP.`date +%Y%m%d.%H%M%S`; export LOGFILE
change the
  ) >$LOGDIR/$GROUP.`date +%Y%m%d.%H%M%S` 2>&1
to
  ) >$LOGFILE 2>&1

and then in control_rancid do something like
  grep "clogin error:" $LOGFILE | sort | uniq -c >$TMP.fail
  if [ -s $TMP.fail ]; then
     # got some output, mail the report
     ...
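
One way the elided part could look, as a rough sketch rather than anything taken from rancid itself: the function names, the MAILTO recipient and the overridable MAILER variable are all assumptions made up for illustration, and only the grep | sort | uniq -c pipeline and the "-s" test come from the snippet above.

```shell
#!/bin/sh
# Sketch only. summarize_failures, report_failures, MAILTO and MAILER are
# hypothetical names, not part of rancid.

# Print one line per distinct "clogin error:" message, prefixed by its count.
summarize_failures() {
    grep "clogin error:" "$1" | sort | uniq -c
}

# Write the summary to the report file ($2); if it is non-empty, mail it
# and return 1 so the caller knows something failed.
report_failures() {
    logfile=$1
    report=$2
    summarize_failures "$logfile" >"$report"
    if [ -s "$report" ]; then
        # got some output, mail the report (mail reads the body from stdin)
        ${MAILER:-mail} -s "rancid: clogin failures" "$MAILTO" <"$report"
        return 1
    fi
    return 0
}
```

Called as something like report_failures "$LOGFILE" "$TMP.fail", this keeps quiet when the log is clean and sends a counted list of failures otherwise.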

Do the same type of thing for checking on
> backup failures, backup internet circuit status, out of band interfaces, etc.

Automate the checks, put the scripts in crontab & mail out an
"OhNoes!" or "all clear" msg at the end.   At which point you're left
with the problem of making sure the managers are looking at the emails
& making sure whatever problems are found actually get fixed :)
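
A minimal sketch of that kind of wrapper, assuming each check is its own script that exits non-zero on failure; run_checks and the subject wording are hypothetical, not from any existing tool.

```shell
#!/bin/sh
# Sketch only. run_checks is a hypothetical helper: it runs every check
# command passed as an argument, collects the output of the ones that
# fail, and prints a subject line followed by the details.
run_checks() {
    failed=0
    details=""
    for check in "$@"; do
        # A check counts as failed when it exits non-zero.
        if ! out=$("$check" 2>&1); then
            failed=$((failed + 1))
            details="$details
== $check ==
$out"
        fi
    done
    if [ "$failed" -gt 0 ]; then
        echo "Subject: OhNoes! $failed check(s) failed"
        echo "$details"
        return 1
    fi
    echo "Subject: all clear"
}
```

Run from cron, the output can be piped to a mailer or left for cron's own mail delivery; either way a message goes out on every run, so a missing "all clear" is itself a warning sign.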

Regards,
Lee



>  Ideally, it would send an email reminder to this pre-defined
> group of people saying hey, it’s Monday, someone needs to check this and
> come acknowledge the task as having been completed.  If that doesn’t occur,
> pre-defined manager X is notified on Tuesday.  If manager X doesn’t get
> someone to complete the task, director Y is notified, so on and so forth.
> Then, perhaps periodically it emails manager X anyway and says hey, it’s
> been three months, you need to audit netops to ensure they’re actually doing
> the Rancid audit and not just checking that it was done.  This could be
> applied to the staff who check on backup failures, backup internet circuit
> status, out of band interfaces, etc.
>
> A data center I looked at recently had QR code stickers on all of their
> infrastructure stuff and there were staff assigned to check and log certain
> displayed values each day.  The software would at least ensure they actually
> visited the equipment by requiring they scan the relevant QR code when in
> front of it.  So I figure something that does what I’m looking for properly
> already exists.
>
> Thanks,
>
> David
>
