[366] in Info-AFS_Redistribution
Re: SNMP for AFS Monitoring
daemon@ATHENA.MIT.EDU (Wallace Colyer)
Thu Oct 31 03:47:07 1991
Date: Wed, 30 Oct 1991 10:27:37 -0500 (EST)
From: Wallace Colyer <wally+@andrew.cmu.edu>
To: Info-AFS@transarc.com, Michael Allen Shiplett <walrus+@ifs.umich.edu>
In-Reply-To: <Added.Id3d5ZP0Bi81E3N09t@transarc.com>
Unfortunately, as far as I know, there is no SNMP monitoring package for
fileservers at this time.
We do various types of fileserver monitoring because we have had major
load problems in the past. The following is what we do and plan to do
in some detail:
1) We have replaced scout with an X11/Motif based operations monitoring
tool called sentinel which uses the fsprobe library that scout uses.
Sentinel has three modes:
1) The supervisor mode which is a small rectangle when everything is ok.
2) The trouble window which opens on trouble events like down
fileservers and highlights those events.
3) The nitpicker window which shows information like available disk
space, last restart time, last boot time.
Sentinel also keeps a log of all events. There is a button to open the
log window.
At some point we plan to let the user click on a fileserver and get all
the information that fsprobe makes available.
Sentinel has proven to be essential to our support and operational
staff. The output and format from scout was not readable across the
room. Since sentinel opens a big window when there is a new problem
event, it forces people to notice, but it usually uses little or no
screen real estate.
2) fsprobe - fsprobe is a library supplied by Transarc, but there are no
general purpose programs that print the output of the library. Here is
an example of what it can give:
% fsprobe vice1.fs.andrew.cmu.edu
Server VICE1.FS.ANDREW.CMU.EDU:
Value of probeOK for this server: 0
CurrentMsgNumber: 0
OldestMsgNumber: 0
CurrentTime: 688834828 Wed Oct 30 10:00:28 1991
BootTime: 687628706 Wed Oct 16 11:58:26 1991
StartTime: 688551543 Sun Oct 27 03:19:03 1991
CurrentConnections: 1733
TotalViceCalls: 1996383
TotalFetchs: 1193804
FetchDatas: 268620
FetchedBytes: 214713007
FetchDataRate: 178333
TotalStores: 168849
StoreDatas: 107771
StoredBytes: 0
StoreDataRate: 0
TotalRPCBytesSent: 0
TotalRPCBytesReceived: 0
TotalRPCPacketsSent: 0
TotalRPCPacketsReceived: 0
TotalRPCPacketsLost: 0
TotalRPCBogusPackets: 0
SystemCPU: 7519714
UserCPU: 3401962
NiceCPU: 0
IdleCPU: 109699432
TotalIO: 18939274
ActiveVM: 5376
TotalVM: 91812
EtherNetTotalErrors: 0
EtherNetTotalWrites: 0
EtherNetTotalInterupts: 0
EtherNetGoodReads: 0
EtherNetTotalBytesWritten: 0
EtherNetTotalBytesRead: 0
ProcessSize: 2642
WorkStations: 209
ActiveWorkStations: 122
Disk1: avail 226779 total 822252 name /vicepa
Disk2: avail 219317 total 822252 name /vicepb
Disk3: avail 205358 total 767611 name /vicepe
Disk4: avail 194977 total 822252 name /vicepf
You can do whatever you wish with the output. We have something that
keeps uptime stats on the fileservers from this.
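
As a rough illustration of how uptime and disk figures could be pulled
out of output like the above, here is a minimal Python sketch. It is not
the tool we run. The function names are made up for the example, and it
assumes the "Name: value" and "DiskN: avail ... total ... name ..."
layout shown in the sample.

```python
import re

# Assumed layout of a disk line, taken from the sample output above.
DISK_RE = re.compile(r"Disk\d+:\s+avail\s+(\d+)\s+total\s+(\d+)\s+name\s+(\S+)")
# Assumed layout of a counter line: a single word, a colon, then an integer.
FIELD_RE = re.compile(r"(\w+):\s+(\d+)")


def parse_fsprobe(text):
    """Return (stats, disks) parsed from fsprobe-style text output."""
    stats, disks = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        disk = DISK_RE.match(line)
        if disk:
            avail, total, name = disk.groups()
            disks.append({"name": name, "avail": int(avail), "total": int(total)})
            continue
        field = FIELD_RE.match(line)
        if field:
            stats[field.group(1)] = int(field.group(2))
    return stats, disks


def uptime_seconds(stats):
    """Seconds since the fileserver process started (CurrentTime - StartTime)."""
    return stats["CurrentTime"] - stats["StartTime"]


def fraction_free(disk):
    """Fraction of a partition that is still available."""
    return disk["avail"] / disk["total"]
```

Feeding it the text of the sample above gives the time since the last
fileserver restart and the free fraction of each /vicep partition, which
could then be logged at each poll.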
3) An important measure of performance is the state of the rx queues in
the various parts of the system. The most important parts are:
1) fileserver (port 7001)
2) vldb (port 7003)
The kaserver and ptserver can cause problems if they get backlogged, but
since they see much less activity this is much more rarely seen.
The rxdebug command supplied by Transarc can be used to determine the
rx backlog. The quick and dirty way to do it is:
% rxdebug vice1.fs.andrew.cmu.edu | grep Connection | wc
0
This is not 100% accurate, but it gives you a good estimate of
performance. Any sustained number greater than 0 is a problem. If it
ever gets to 100 you are in real trouble. The rx queues have been a
very good explanation for workstation performance problems here at CMU.
We currently have an ATK console which monitors and graphs these queues,
but we are working on a good system for monitoring the rx queues as well
as general response time.
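
The quick and dirty check above could be automated along these lines.
This is only a sketch in Python, not our ATK console. The function names
are hypothetical, and the number of polls that counts as "sustained" is
left to the caller as an assumption. It counts lines containing
"Connection", in the spirit of the grep pipeline, and applies the two
rules of thumb given above.

```python
import subprocess


def count_backlog(rxdebug_output):
    """Count output lines that contain 'Connection', like the grep above."""
    return sum(1 for line in rxdebug_output.splitlines() if "Connection" in line)


def classify(samples, trouble=100):
    """Classify backlog counts from consecutive polls (oldest first).

    Rules of thumb from the text: reaching 100 at any time is real
    trouble, and a count that stays above 0 for every poll is a problem.
    """
    if samples and max(samples) >= trouble:
        return "real trouble"
    if samples and all(n > 0 for n in samples):
        return "sustained backlog"
    return "ok"


def poll(host):
    """Run 'rxdebug <host>' once and return the backlog estimate.

    Assumes rxdebug is on the PATH; invoked with the host only, as above.
    """
    result = subprocess.run(["rxdebug", host], capture_output=True, text=True)
    return count_backlog(result.stdout)
```

A caller would collect poll() results at a fixed interval and pass the
most recent few to classify(), so that a single momentary spike below
100 is not reported as a problem.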
4) Volume usage. This is what you look at when fileservers have big
sustained rx backlogs. Because of AMS we see a very high load on some
read-write volumes. For example, we have a 1 meg volume, ams-ml.bb,
which gets over 300,000 accesses per day. Vos will report the dayuse of
a volume. We have modified vos to show the weekuse as well. This
information is valid as long as the volume has not been moved.
This is what our vos will do:
% vos volinfo ams-lt.bb -format
name ams-lt.bb
id 67120207
serv 128.2.10.10 VICE10.FS.ANDREW.CMU.EDU
part /vicepe
status OK
backupID 151045296
parentID 67120207
cloneID 0
inUse Y
needsSalvaged N
destroyMe N
type RW
creationDate 512335006 Thu Mar 27 14:16:46 1986
accessDate 0 Wed Dec 31 19:00:00 1969
updateDate 688835278 Wed Oct 30 10:07:58 1991
backupDate 618725514 Thu Aug 10 00:11:54 1989
copyDate 618725514 Wed Oct 23 09:34:27 1991
flags 0 (Optional)
diskused 1065
maxquota 75000
minquota 0 (Optional)
filecount 3827
dayUse 301216
weekUse 2105256 (Optional)
spare2 0 (Optional)
spare3 0 (Optional)
ams-lt.bb
readWriteID 67120207 valid
readOnlyID 0 invalid
backUpID 151045296 valid
number of sites -> 1
server VICE10.FS.ANDREW.CMU.EDU partition /vicepe RW Site
readWriteID 67120207 valid
readOnlyID 0 invalid
backUpID 151045296 valid
number of sites -> 1
server VICE10.FS.ANDREW.CMU.EDU partition /vicepe RW Site
Accesses are reads, writes, and stats that require contacting the fileservers.
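
Output in this one-field-per-line form is easy to post-process. The
following Python sketch is an illustration only, with made-up function
names. It assumes the "key value" layout shown above, and the weekUse
field exists only in our modified vos. It extracts the counters and
compares the current day's use with the weekly average.

```python
def parse_volinfo(text):
    """Map the first word of each line to the rest of the line.

    Only the first occurrence of a key is kept, so the volume header
    fields win over any repeated keys in the VLDB part of the output.
    """
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            fields.setdefault(parts[0], parts[1].strip())
    return fields


def int_field(fields, key):
    """Return the leading integer of a field, ignoring trailing notes
    such as '(Optional)'."""
    return int(fields[key].split()[0])


def weekly_daily_average(fields):
    """Average accesses per day over the week, from weekUse."""
    return int_field(fields, "weekUse") / 7
```

For the volume shown above, dayUse is 301216 and the weekly average
works out to roughly 300,750 accesses per day, so the load on that
volume is steady rather than a one-day spike.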
In systems with high use volumes, the load has to be spread amongst
fileservers. We currently do this by hand, but plan to one day have an
automatic load balancing system in place.
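
To make the idea of spreading load concrete, here is one simple way it
could be sketched in Python: a greedy pass that places the busiest
volumes first, each on the currently least-loaded server. This is an
assumption for illustration, not the system we plan to build. The
function name is hypothetical, and the sketch ignores disk space, the
cost of moving volumes, and the fact that dayUse is only valid while a
volume has not been moved.

```python
import heapq


def spread_volumes(day_use, servers):
    """Greedily assign volumes to servers by descending daily accesses.

    day_use: dict of volume name -> accesses per day
    servers: list of server names
    Returns a dict of server name -> list of volume names.
    """
    # Min-heap of (accumulated load, server); the lightest server is on top.
    heap = [(0, s) for s in servers]
    heapq.heapify(heap)
    placement = {s: [] for s in servers}
    for vol, use in sorted(day_use.items(), key=lambda kv: kv[1], reverse=True):
        load, server = heapq.heappop(heap)
        placement[server].append(vol)
        heapq.heappush(heap, (load + use, server))
    return placement
```

A greedy pass like this does not guarantee the best possible split, but
it keeps any single very busy volume from sharing a server with other
heavy volumes when lighter servers are available.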
---
There are other parts of the system that can be monitored and I would be
interested in hearing what others know.
Much of what we have done we cannot release to people without AFS
source licenses.
We have not yet tried to resolve any licensing issues for outside
sites, but if there is a great deal of interest in any of what we have I
can talk with Transarc and see what can be done to make it available,
either through Transarc or by other means.
-Wallace