[190813] in North American Network Operators' Group

Re: Operations task management software?

daemon@ATHENA.MIT.EDU (Lee)
Wed Jul 27 20:20:33 2016

X-Original-To: nanog@nanog.org
In-Reply-To: <712E5359-5217-41AA-A779-F13DCE597537@dino.hostasaurus.com>
From: Lee <ler762@gmail.com>
Date: Wed, 27 Jul 2016 20:20:29 -0400
To: David Hubbard <dhubbard@dino.hostasaurus.com>
Cc: "nanog@nanog.org" <nanog@nanog.org>
Errors-To: nanog-bounces@nanog.org

On 7/27/16, David Hubbard <dhubbard@dino.hostasaurus.com> wrote:
> Full automation is planned but does not eliminate the need for the
> software.  Zero human auditing of fully automated processes and data
> collection are not acceptable to various certifying entities, the
> relevant auditors, the inevitably involved lawyers, and won’t pick up
> on bad data, like a bad thermometer or snmp counter that says a CRAC
> is 65 degrees when it’s really 90.  So I’m still going to need a
> management solution to the issue whether it’s to tell someone to do
> the work or to tell someone to check the automated work.

You have a ticketing system - right?  Create a cron job that creates a
ticket to check whatever.
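
A minimal sketch of that idea, assuming your ticketing system opens a
ticket for any mail sent to a queue address. The address, the script
path, the subject format and the function name below are all made up
for illustration; only sendmail's -t flag (take recipients from the
message headers) is standard.

```shell
#!/bin/sh
# Sketch only: assumes the ticketing system turns mail sent to TICKET_ADDR
# into a ticket. The address is a hypothetical placeholder.
TICKET_ADDR="${TICKET_ADDR:-tickets@example.com}"

# Print a minimal mail message asking a human to verify one automated check.
ticket_message() {
    check_name=$1
    printf 'To: %s\n' "$TICKET_ADDR"
    printf 'Subject: [audit] verify %s for %s\n' "$check_name" "$(date +%Y-%m-%d)"
    printf '\n'
    printf 'Please confirm by hand that "%s" is reporting sane data.\n' "$check_name"
}

# Example crontab entry (hypothetical path; 09:00 on the 1st and 15th,
# which only approximates a bi-weekly schedule):
#   0 9 1,15 * * /usr/local/bin/audit-ticket.sh "rancid config backups"

# Send only when called with an argument, so sourcing the file is harmless.
if [ $# -gt 0 ]; then
    ticket_message "$1" | /usr/sbin/sendmail -t
fi
```

The ticket then carries the audit trail: who looked, when, and what
they found.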

Regards,
Lee


>
> David
>
> On 7/27/16, 7:19 PM, "Lee" <ler762@gmail.com> wrote:
>
>     On 7/27/16, David Hubbard <dhubbard@dino.hostasaurus.com> wrote:
>     > Hi all, curious if anyone has recommendations on software that helps
>     > manage routine duties assigned to operations staff?
>
>     Have computers do the routine scut work - not people.
>
>     > For example, let’s say we have a P&P that says someone from the
>     > netops group must check that Rancid is successfully backing up all
>     > router configs bi-weekly.
>
>     You've got the source code for rancid, so change rancid-run to do
>     something like
>       LOGFILE=$LOGDIR/$GROUP.`date +%Y%m%d.%H%M%S`; export LOGFILE
>     change the
>       ) >$LOGDIR/$GROUP.`date +%Y%m%d.%H%M%S` 2>&1
>     to
>       ) >$LOGFILE 2>&1
>
>     and then in control_rancid do something like
>       grep "clogin error:" $LOGFILE | sort | uniq -c >$TMP.fail
>       if [ -s $TMP.fail ]; then
>          # got some output, mail the report
>          ...
>
>     Do the same type thing for checking on
>     > backup failures, backup internet circuit status, out of band
>     > interfaces, etc.
>
>     Automate the checks, put the scripts in crontab & mail out an
>     "OhNoes!" or "all clear" msg at the end.   At which point you're left
>     with the problem of making sure the managers are looking at the emails
>     & making sure whatever problems are found actually get fixed :)
>
>     Regards,
>     Lee
>
>
>
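
The log check quoted above (grep for "clogin error:", then report only
if something turned up) could be wrapped up roughly like this. It is a
sketch under assumptions, not rancid's own code: the function name, the
exit-status convention and the mail address in the comment are made up
for illustration, and $LOGFILE is the variable exported by the
rancid-run change suggested earlier in the thread.

```shell
#!/bin/sh
# Sketch only: summarise "clogin error:" lines from one rancid log file.
# clogin_failures is a hypothetical helper, not part of rancid.

# Print a count for each distinct "clogin error:" line in the log named by $1.
# Returns 0 if at least one error line was found, 1 if there were none.
clogin_failures() {
    summary=$(grep "clogin error:" "$1" | sort | uniq -c)
    [ -n "$summary" ] || return 1
    printf '%s\n' "$summary"
}

# Example use at the end of a run (address is a placeholder):
#   if report=$(clogin_failures "$LOGFILE"); then
#       printf '%s\n' "$report" | mail -s "OhNoes! rancid failures" netops@example.com
#   else
#       echo "no clogin errors" | mail -s "rancid all clear" netops@example.com
#   fi
```

Mailing on success as well as on failure means a missing message is
itself a signal that the job did not run.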
