[4983] in WWW Security List Archive
Re: ROBOTS
daemon@ATHENA.MIT.EDU (Irving Reid)
Sat Apr 5 17:33:46 1997
To: Deep Summer - Home of Web Site Designs Extraordinare <frank@deepsummer.com>
cc: "'www-security@ns2.rutgers.edu'" <www-security@ns2.rutgers.edu>
In-reply-to: frank's message of "Tue, 01 Apr 1997 19:23:58 -0500".
<97Apr1.223658est.11652@janus.border.com>
From: "Irving Reid" <irving@border.com>
Date: Sat, 5 Apr 1997 13:40:37 -0500
Errors-To: owner-www-security@ns2.rutgers.edu
> If you want a copy it's at http://www.deepsummer.com/robots.txt
> (should be able to do a shift-click on it to retrieve). If
> not, let me know and I'll mail you a copy.
>
> Also, if anyone wants to take a peek at it and let me know if
> you see anything I might have done better then by all means
> do so.
>
> -frank
Here's an excerpt from your robots.txt file:
# /robots.txt for http://www.deepsummer.com
# comments to webmaster@deepsummer.com
User-agent: *
Disallow: /azure/
You've just given me the exact path name for a directory you don't want
the web crawlers to know about.
Stupid Net Trick #47: If you want to see things that people think are
hidden, look for "Disallow" lines in their robots.txt files.
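
To make the point concrete, here's a rough sketch of how little work that
trick takes (modern Python, and www.example.com is just a stand-in host):

import urllib.request

# Stand-in URL; any site's robots.txt is fetched the same way.
url = "http://www.example.com/robots.txt"

with urllib.request.urlopen(url) as resp:
    body = resp.read().decode("utf-8", errors="replace")

# Print each path the site asks robots to stay out of.
for line in body.splitlines():
    line = line.split("#", 1)[0].strip()   # drop comments
    if line.lower().startswith("disallow:"):
        path = line.split(":", 1)[1].strip()
        if path:
            print(path)
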
The right thing to do is deny _all_, and then explicitly allow the
files you want indexed. That way you don't leak information to nosy
people.
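
A minimal sketch of what that looks like (the "Allow" lines are an
extension beyond the original exclusion standard, so some robots will
just skip the whole site, which leaks nothing either; the allowed paths
here are made up):

User-agent: *
Allow: /index.html
Allow: /pages/
Disallow: /

The Allow lines go first so that robots which take the first matching
rule still see them before hitting the blanket Disallow.
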
- irving -
(a really nasty person might write a crawler/indexer that _only_
indexed pages reached from people's "Disallow" lines. I'm not sure if
I'm not nasty enough, or just too lazy...)