[366] in Info-AFS_Redistribution

Re: SNMP for AFS Monitoring

daemon@ATHENA.MIT.EDU (Wallace Colyer)
Thu Oct 31 03:47:07 1991

Date: Wed, 30 Oct 1991 10:27:37 -0500 (EST)
From: Wallace Colyer <wally+@andrew.cmu.edu>
To: Info-AFS@transarc.com, Michael Allen Shiplett <walrus+@ifs.umich.edu>
In-Reply-To: <Added.Id3d5ZP0Bi81E3N09t@transarc.com>

Unfortunately there is no SNMP monitoring package for fileservers at
this time that I know of. 

We do various types of fileserver monitoring because we have had major
load problems in the past.  The following is what we do and plan to do
in some detail:

1) We have replaced scout with an X11/Motif based operations monitoring
tool called sentinel which uses the fsprobe library that scout uses. 
Sentinel has three modes, plus an event log:

	1)  The supervisor mode which is a small rectangle when everything is ok.

	2) The trouble window which opens on trouble events like down
fileservers and highlights those events.

	3) The nitpicker window which shows information like available disk
space, last restart time, last boot time.

	4) Sentinel keeps a log of all events.  There is a button to open the
log window.

At some point we plan to let the user click on a fileserver and get all
the information that fsprobe makes available.

Sentinel has proven to be essential to our support and operational
staff.  The output and format from scout was not readable across the
room.  Since sentinel opens a big window when there is a new problem
event, it forces people to notice, but it usually uses little or no
screen real estate.

2) fsprobe - fsprobe is a library supplied by Transarc, but there are no
general purpose programs that print the library's output.  Here is an
example of what it can give:

% fsprobe vice1.fs.andrew.cmu.edu

Server VICE1.FS.ANDREW.CMU.EDU:
        Value of probeOK for this server: 0
        CurrentMsgNumber:       0
        OldestMsgNumber:        0
        CurrentTime:    688834828       Wed Oct 30 10:00:28 1991
        BootTime:       687628706       Wed Oct 16 11:58:26 1991
        StartTime:      688551543       Sun Oct 27 03:19:03 1991
        CurrentConnections:     1733
        TotalViceCalls: 1996383
        TotalFetchs:    1193804
        FetchDatas:     268620
        FetchedBytes:   214713007
        FetchDataRate:  178333
        TotalStores:    168849
        StoreDatas:     107771
        StoredBytes:    0
        StoreDataRate:  0
        TotalRPCBytesSent:      0
        TotalRPCBytesReceived:  0
        TotalRPCPacketsSent:    0
        TotalRPCPacketsReceived:        0
        TotalRPCPacketsLost:    0
        TotalRPCBogusPackets:   0
        SystemCPU:      7519714
        UserCPU:        3401962
        NiceCPU:        0
        IdleCPU:        109699432
        TotalIO:        18939274
        ActiveVM:       5376
        TotalVM:        91812
        EtherNetTotalErrors:    0
        EtherNetTotalWrites:    0
        EtherNetTotalInterupts: 0
        EtherNetGoodReads:      0
        EtherNetTotalBytesWritten:      0
        EtherNetTotalBytesRead: 0
        ProcessSize:    2642
        WorkStations:   209
        ActiveWorkStations:     122
        Disk1: avail 226779     total 822252    name /vicepa
        Disk2: avail 219317     total 822252    name /vicepb
        Disk3: avail 205358     total 767611    name /vicepe
        Disk4: avail 194977     total 822252    name /vicepf

You can do whatever you wish with the output.  We have something that
keeps uptime stats on the fileservers from this.
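
As a rough illustration of the kind of post-processing this output allows,
here is a minimal Python sketch (not our actual tool) that pulls the
numeric fields out of text in the format shown above and derives uptimes
and disk usage.  It assumes BootTime is the machine boot time and
StartTime is the last fileserver restart; the function names are made up
for this example.

```python
import re

# A few lines copied from the fsprobe output shown above.
SAMPLE = """\
Server VICE1.FS.ANDREW.CMU.EDU:
        Value of probeOK for this server: 0
        CurrentTime:    688834828       Wed Oct 30 10:00:28 1991
        BootTime:       687628706       Wed Oct 16 11:58:26 1991
        StartTime:      688551543       Sun Oct 27 03:19:03 1991
        CurrentConnections:     1733
        Disk1: avail 226779     total 822252    name /vicepa
        Disk2: avail 219317     total 822252    name /vicepb
"""

DISK_RE = re.compile(r"Disk\d+: avail (\d+)\s+total (\d+)\s+name (\S+)")
FIELD_RE = re.compile(r"(\w+):\s+(\d+)")


def parse_fsprobe(text):
    """Return (stats, disks) parsed from fsprobe-style 'Name: value' lines."""
    stats, disks = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        m = DISK_RE.match(line)
        if m:
            disks.append({"avail": int(m.group(1)),
                          "total": int(m.group(2)),
                          "name": m.group(3)})
            continue
        m = FIELD_RE.match(line)
        if m:
            # Keep only the leading integer; the trailing date text is ignored.
            stats[m.group(1)] = int(m.group(2))
    return stats, disks


def uptimes(stats):
    """Seconds since machine boot and since the fileserver last started.

    Assumption: BootTime = machine boot, StartTime = fileserver restart.
    """
    now = stats["CurrentTime"]
    return {"machine": now - stats["BootTime"],
            "fileserver": now - stats["StartTime"]}


def percent_used(disk):
    """Percentage of a partition that is in use."""
    return 100.0 * (disk["total"] - disk["avail"]) / disk["total"]


if __name__ == "__main__":
    stats, disks = parse_fsprobe(SAMPLE)
    print(uptimes(stats))
    for d in disks:
        print(d["name"], round(percent_used(d), 1))
```

Logging the two uptime figures on every probe is enough to reconstruct
restart and reboot history later.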

3) An important measure of performance is the rx queues on the
various parts of the system.  The most important parts are:

	1) fileserver (port 7001)
	2) vldb (port 7003)

The kaserver and ptserver can cause problems if they get backlogged, but
since they see much less activity this is seen much more rarely.

The rxdebug command supplied by Transarc can be used to determine the
rx backlog.  The quick and dirty way to do it is:

% rxdebug vice1.fs.andrew.cmu.edu | grep Connection | wc -l
0

This is not 100% accurate, but it gives you a good estimate of
performance.  Any sustained number greater than 0 is a problem.  If it
ever gets to 100 you are in real trouble.  The rx queues have been a
very good explanation for workstation performance problems here at CMU.
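
For anyone who wants to poll this rather than run it by hand, a minimal
Python sketch of the same rule of thumb follows.  It only counts lines
containing "Connection", exactly as the grep pipeline does, and makes no
other assumption about the rxdebug output format.  The function names,
the status labels, and the idea of judging "sustained" from a list of
recent samples are assumptions for illustration.

```python
import subprocess


def run_rxdebug(host):
    """Run `rxdebug <host>` as in the example above and return its output."""
    result = subprocess.run(["rxdebug", host], capture_output=True,
                            text=True, check=True)
    return result.stdout


def count_connections(rxdebug_output):
    """Count lines containing 'Connection', like grep piped into a line count."""
    return sum(1 for line in rxdebug_output.splitlines()
               if "Connection" in line)


def classify(samples, trouble=100):
    """Classify a list of backlog counts from successive polls.

    Follows the rule of thumb in the text: any sustained number greater
    than 0 is a problem, and reaching 100 means real trouble.  'Sustained'
    is interpreted here as every sample in the window being non-zero.
    """
    if not samples:
        return "unknown"
    if max(samples) >= trouble:
        return "real trouble"
    if all(n > 0 for n in samples):
        return "sustained backlog"
    return "ok"


if __name__ == "__main__":
    # Placeholder text, not real rxdebug output.
    fake = "Connection placeholder 1\nsomething else\nConnection placeholder 2\n"
    print(count_connections(fake))
    print(classify([0, 3, 0]), classify([2, 5, 1]), classify([0, 120]))
```

A poller would call run_rxdebug() every so often, append
count_connections() to a short window of recent samples, and alert on
anything other than "ok".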

We currently have an ATK console which monitors and graphs these queues,
and we are working on a good system for monitoring the rx queues as well
as general response time.

4) Volume usage.  This is what you look at when fileservers have big
sustained rx backlogs.  Because of AMS we see a very high load on
some read-write volumes.  For example, we have a 1 meg volume, ams-ml.bb,
which gets over 300,000 accesses per day.  Vos will report the dayuse of
a volume.  We have modified vos to show the weekuse as well.  This
information is valid as long as the volume has not been moved.

This is what our vos will do:

% vos volinfo ams-lt.bb -format
name            ams-lt.bb
id              67120207
serv            128.2.10.10     VICE10.FS.ANDREW.CMU.EDU
part            /vicepe
status          OK
backupID        151045296
parentID        67120207
cloneID         0
inUse           Y
needsSalvaged   N
destroyMe       N
type            RW
creationDate    512335006       Thu Mar 27 14:16:46 1986
accessDate      0               Wed Dec 31 19:00:00 1969
updateDate      688835278       Wed Oct 30 10:07:58 1991
backupDate      618725514       Thu Aug 10 00:11:54 1989
copyDate        618725514       Wed Oct 23 09:34:27 1991
flags           0       (Optional)
diskused        1065
maxquota        75000
minquota        0       (Optional)
filecount       3827
dayUse          301216
weekUse         2105256 (Optional)
spare2          0       (Optional)
spare3          0       (Optional)

ams-lt.bb
        readWriteID 67120207    valid
        readOnlyID  0           invalid
        backUpID    151045296   valid
    number of sites -> 1
    server VICE10.FS.ANDREW.CMU.EDU partition /vicepe RW Site
        readWriteID 67120207    valid
        readOnlyID  0           invalid
        backUpID    151045296   valid
    number of sites -> 1
    server VICE10.FS.ANDREW.CMU.EDU partition /vicepe RW Site

Accesses are reads, writes, and stats that require contacting the fileservers.
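
The first part of that output is simple "key value" lines, so it is easy
to post-process.  Here is a minimal Python sketch that reads the fields
up to the first blank line and flags busy volumes.  Keep in mind that the
-format output comes from our modified vos, so a stock vos will differ.
The function names and the 100,000 accesses/day threshold are arbitrary
choices for this example.

```python
# A few lines copied from the volinfo output shown above.
SAMPLE = """\
name            ams-lt.bb
id              67120207
serv            128.2.10.10     VICE10.FS.ANDREW.CMU.EDU
part            /vicepe
diskused        1065
dayUse          301216
weekUse         2105256 (Optional)

ams-lt.bb
        readWriteID 67120207    valid
"""


def parse_volinfo(text):
    """Parse leading 'key value' lines; stop at the first blank line."""
    fields = {}
    for line in text.strip().splitlines():
        if not line.strip():
            break  # the VLDB summary that follows is not key/value
        parts = line.split()
        if len(parts) >= 2:
            fields[parts[0]] = parts[1]  # first value only
    return fields


def week_daily_average(fields):
    """Average accesses per day over the week, from weekUse."""
    return int(fields["weekUse"]) / 7


def is_hot(fields, threshold=100000):
    """True if dayUse exceeds an (arbitrary, site-specific) threshold."""
    return int(fields["dayUse"]) > threshold


if __name__ == "__main__":
    info = parse_volinfo(SAMPLE)
    print(info["name"], info["dayUse"], round(week_daily_average(info)),
          is_hot(info))
```

Comparing dayUse against the weekly average is a quick way to tell a
volume that is always busy from one that is just having a bad day.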

In systems with high use volumes, the load has to be spread amongst
fileservers.  We currently do this by hand, but plan to one day have an
automatic load balancing system in place.
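
To give an idea of what a first cut at automatic balancing might look
like, here is a minimal Python sketch of one simple heuristic: sort
volumes by dayUse and hand each one to the currently least-loaded
server.  This is not what we do by hand and not a design we have
settled on.  It ignores disk space, the cost of moving volumes, and the
fact that usage counts are only valid until a volume is moved.  The
server names and all volume names other than ams-lt.bb are made up.

```python
import heapq


def spread_volumes(day_use, servers):
    """Greedy placement: busiest volume first, onto the least-loaded server.

    day_use: dict mapping volume name -> accesses per day
    servers: list of server names
    Returns a dict mapping server name -> list of volume names.
    """
    placement = {s: [] for s in servers}
    # Heap of (current load, server name); the smallest load pops first.
    heap = [(0, s) for s in servers]
    heapq.heapify(heap)
    # Sort by load descending, then by name so ties are deterministic.
    for vol, load in sorted(day_use.items(), key=lambda kv: (-kv[1], kv[0])):
        current, server = heapq.heappop(heap)
        placement[server].append(vol)
        heapq.heappush(heap, (current + load, server))
    return placement


def server_loads(placement, day_use):
    """Total accesses per day assigned to each server."""
    return {s: sum(day_use[v] for v in vols) for s, vols in placement.items()}


if __name__ == "__main__":
    # Only ams-lt.bb and its dayUse come from the output above;
    # the rest are hypothetical.
    usage = {"ams-lt.bb": 301216, "vol.a": 120000,
             "vol.b": 90000, "vol.c": 60000}
    plan = spread_volumes(usage, ["vice1", "vice10"])
    print(plan)
    print(server_loads(plan, usage))
```

With one volume as hot as ams-lt.bb, the best any placement can do is
give it a server mostly to itself, which is what the heuristic ends up
doing here.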

---

There are other parts of the system that can be monitored and I would be
interested in hearing what others know.  

Much of what we have done we cannot release to people without AFS
source licenses.

We have not yet tried to resolve any licensing issues for outside
sites, but if there is a great deal of interest in any of what we have, I
can talk with Transarc and see what can be done to make it available,
either through Transarc or by other means.

-Wallace
