[1754] in WWW Security List Archive
Site Scanning & IP grabs
daemon@ATHENA.MIT.EDU (A. P. Harris)
Fri Mar 29 16:30:08 1996
To: "Brian W. Spolarich" <briansp@ans.net>
cc: KGANNON@dit.ie, www-security@ns2.rutgers.edu
In-reply-to: Your message of Thu, 21 Mar 1996 19:33:49 EST.
<Pine.SOL.3.91.960321193127.9027Y-100000@thebrain.aa.ans.net>
Date: Fri, 29 Mar 1996 11:25:19 -0600
From: "A. P. Harris" <apharris@onshore.com>
Errors-To: owner-www-security@ns2.rutgers.edu
[You ("Brian W. Spolarich")]
> Good spiders will ask for /robots.txt and find out what to do with
>themselves if they find it.
>
> Generally grepping for /robots.txt will give you a list of spiders that
>have found you.
Very true. In fact, on my server I've ScriptAliased /robots.txt to
the following little Perl script. This lets me grab a little more information
from the robot than the server logs by default, namely the address the
robot advertises in its From: header (HTTP_FROM).
--------------------------code snippet
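#!/usr/bin/perl
# Serve a permissive robots.txt and log who asked for it.
$Log = '/var/adm/httpd_robots';
# CGI environment variables worth recording for each robot.
@Interesting = ('HTTP_USER_AGENT', 'REMOTE_ADDR', 'REMOTE_HOST', 'HTTP_FROM');

# Send the robots.txt body first; an empty Disallow permits everything.
print "Content-type: text/plain\n\n";
print "User-agent: *\nDisallow:\n\n";

# Append one tab-separated line per request to the log.
open(LOG, ">>$Log") || die("Can't open $Log: $!\n");
print LOG '[' . localtime() . ']';
foreach $env (@Interesting) {
    # A robot that sends no From: header leaves HTTP_FROM undefined.
    $value = defined($ENV{$env}) ? $ENV{$env} : '';
    print LOG "\t$env=$value";
}
print LOG "\n";
close LOG;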
--------------------------end code snippet
Some of the lines the script has logged (I've wrapped long lines with '\'):
[Thu Feb 8 00:44:51 1996] HTTP_USER_AGENT=Scoutget 1.0 REMOTE_ADDR=206.\
101.96.35 REMOTE_HOST=seventeen.srv.lycos.com HTTP_FROM=
[Thu Feb 8 01:48:31 1996] HTTP_USER_AGENT=OTI_Spider/OTWR:002p116 libwww/\
2.17 REMOTE_ADDR=205.216.146.179 REMOTE_HOST=205.216.146.179 HTTP_FRO\
M=gregf@opentext.com
[Thu Feb 8 15:29:17 1996] HTTP_USER_AGENT=OTI_Spider/OTWR:002p116 libwww/\
2.17 REMOTE_ADDR=205.216.146.179 REMOTE_HOST=dialup-a.mv.opentext.com\
HTTP_FROM=gregf@opentext.com
[Sun Feb 11 03:00:29 1996] HTTP_USER_AGENT=CERN-LineMode/2.15 libwww/2.17\
REMOTE_ADDR=199.107.235.42 REMOTE_HOST=199.107.235.42 HTTP_FROM=vic@ap\
ollo.alphaspace.com
Interestingly, it seems that the Lycos robot doesn't send a From: header,
so HTTP_FROM comes up empty. Odd.
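
If you want a quick tally of which robots have stopped by, something along
these lines should do. This is only a minimal sketch: it assumes the log is
in the tab-separated NAME=value form the script above writes, with no
wrapping. The sub names parse_robot_line and summarize and the file name
robots_summary.pl are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one log line of the form
#   [timestamp]<TAB>NAME=value<TAB>NAME=value...
# into a hash reference; the timestamp is stored under 'time'.
# Returns nothing for lines that don't start with a [timestamp].
sub parse_robot_line {
    my ($line) = @_;
    chomp $line;
    my ($stamp, @fields) = split /\t/, $line;
    return unless defined $stamp && $stamp =~ /^\[(.*)\]$/;
    my %rec = (time => $1);
    foreach my $field (@fields) {
        # Split on the first '=' only; values may contain '=' themselves.
        my ($name, $value) = split /=/, $field, 2;
        $rec{$name} = defined $value ? $value : '';
    }
    return \%rec;
}

# Count requests per user agent and collect each agent's From addresses.
sub summarize {
    my (@lines) = @_;
    my (%count, %from);
    foreach my $line (@lines) {
        my $rec = parse_robot_line($line) or next;
        my $agent = $rec->{HTTP_USER_AGENT};
        $agent = '(none)' unless defined $agent && length $agent;
        $count{$agent}++;
        $from{$agent}{ $rec->{HTTP_FROM} } = 1
            if defined $rec->{HTTP_FROM} && length $rec->{HTTP_FROM};
    }
    return (\%count, \%from);
}

# Only report when a log file is named on the command line.
if (@ARGV) {
    my ($count, $from) = summarize(<>);
    foreach my $agent (sort { $count->{$b} <=> $count->{$a} || $a cmp $b }
                       keys %$count) {
        my $who = join(', ', sort keys %{ $from->{$agent} || {} });
        printf "%5d  %s  %s\n", $count->{$agent}, $agent, $who || '(no From)';
    }
}
```

Run it as "perl robots_summary.pl /var/adm/httpd_robots" and you get one line
per user agent, busiest first, with whatever From addresses it advertised.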
.....A. P. Harris...apharris@onShore.com...<URL:http://www.onShore.com/>