[5004] in WWW Security List Archive
RE: ROBOTS
daemon@ATHENA.MIT.EDU (DeepSummer-HomeofWebSiteDesignsExt)
Tue Apr 8 21:40:26 1997
From: Deep Summer - Home of Web Site Designs Extraordinare
<frank@deepsummer.com>
To: Irving Reid <irving@border.com>,
Deep Summer - Home of Web Site Designs Extraordinare <frank@deepsummer.com>,
"'Christopher Petrilli'" <petrilli@amber.org>
Cc: "'www-security@ns2.rutgers.edu'" <www-security@ns2.rutgers.edu>
Date: Tue, 8 Apr 1997 16:08:16 -0600
Errors-To: owner-www-security@ns2.rutgers.edu
In second reply to the 'Stupid Net Trick #47...', if I
understand the syntax correctly, one has to first do an
Allow: on all entities allowed, and then follow it with
a Disallow: '/'. If the Disallow: '/' is first, the bot
simply goes on to another site. Though I did modify my
robots.txt to reflect this (my site is rather small),
I'd find it ridiculous to do that for a large site. Can
you say 'M A I N T H E L L'? Read the spec (it's included
in a comment in my robots.txt) - I'm pretty certain this
is how the syntax works.
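
The ordering claim above can be checked against one real parser. This is
a minimal sketch using Python's standard urllib.robotparser, which
applies the first rule that matches. The paths /index.html and /designs/,
the host www.example.com, and the agent name ExampleBot are made up for
illustration and are not taken from any real robots.txt. Other parsers
may resolve Allow/Disallow conflicts differently (for example by longest
match), so treat this as one parser's behaviour, not as the spec.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule sets: same rules, different order.
ALLOW_FIRST = """\
User-agent: *
Allow: /index.html
Allow: /designs/
Disallow: /
"""

DISALLOW_FIRST = """\
User-agent: *
Disallow: /
Allow: /index.html
Allow: /designs/
"""


def build_parser(robots_txt):
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def check(robots_txt, path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under the given rules."""
    url = "http://www.example.com" + path  # placeholder host
    return build_parser(robots_txt).can_fetch(agent, url)


if __name__ == "__main__":
    for name, rules in (("allow first", ALLOW_FIRST),
                        ("disallow first", DISALLOW_FIRST)):
        for path in ("/index.html", "/designs/a.html", "/azure/"):
            print(name, path, check(rules, path))
```

With this parser, the allow-first file admits /index.html and anything
under /designs/ and blocks everything else, while the disallow-first
file blocks every path because Disallow: / matches before the Allow
lines are reached.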
In reply to Christopher, you hit the nail squarely on
the head. I'd find it totally absurd to consider using
robots.txt as a 'see only what I want you to see' peep-
hole. As I said in my last reply - I use it for one
purpose - to guide search engine bots in the task of
indexing the things I wish to be indexed, and to not
have them index things that would obscure my entries in
any search engines (lovely to find my site on a search
for 'This is an autoresponder test'... NOT.)
However, I think the main issue (it's fading) was to
do with how well robots.txt files work. For benevolent
bots, my logs indicate that robots.txt works wonderfully.
For evil bots (remember Arnold? Okay, so he was a
cybernetic organism...) there are other ways of dealing
with security that have nothing at all to do with
robots.txt. In hindsight (I see my chair) I think it
might have been best had I made that my two cents
in the first place rather than to issue anything
resembling philosophy, so I'll take the hit for that.
Anyway, I've done the dastardly deed of obscuring my
robots.txt now - mainly just to sate my curiosity on
proper syntax, but I still have to agree with Christopher
in that that sort of thinking can lead to lots of
problems. Especially if obscurity a la security is
carried into other realms of life (if I paint my
house like a police car I don't think it's going
to make it any less easy to rob).
-frank
----------
From: Christopher Petrilli[SMTP:petrilli@amber.org]
Sent: Monday, April 07, 1997 6:57 AM
To: Irving Reid; Deep Summer - Home of Web Site Designs Extraordinare
Cc: 'www-security@ns2.rutgers.edu'
Subject: Re: ROBOTS
In reply to Irving Reid at irving@border.com:
>Here's an excerpt from your robots.txt file:
>
> # /robots.txt for http://www.deepsummer.com
> # comments to webmaster@deepsummer.com
>
> User-agent: *
> Disallow: /azure/
>
>You've just given me the exact path name for a directory you don't want
>the web crawlers to know about.
>
>Stupid Net Trick #47: If you want to see things that people think are
>hidden, look for "Disallow" lines in their robots.txt files.
>
>The right thing to do is deny _all_, and then explicitly allow the
>files you want indexed. That way you don't leak information to nosy
>people.
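
The trick quoted above takes only a few lines to automate. A minimal
sketch, assuming the robots.txt text has already been fetched; the
function name disallowed_paths is made up for illustration, and the
sample text is the excerpt quoted in this message.

```python
def disallowed_paths(robots_txt):
    """Collect every non-empty Disallow value, ignoring comments."""
    paths = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            value = value.strip()
            if value:                          # an empty Disallow hides nothing
                paths.append(value)
    return paths


SAMPLE = """\
# /robots.txt for http://www.deepsummer.com
# comments to webmaster@deepsummer.com

User-agent: *
Disallow: /azure/
"""

if __name__ == "__main__":
    print(disallowed_paths(SAMPLE))  # ['/azure/']
```

A file that lists only Allow lines followed by Disallow: / yields just
'/' here, which is the point of the deny-all suggestion.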
Anyone who depends on robots.txt to give them "security" is getting what
they paid for. Simply obscuring your URL (i.e. security thru obscurity)
is silly, as I've been known to hunt around in web sites to find the real
pages to links that are broken. You should also turn off the directory
display.
If you don't want the general public, which is what a robot is, quite
honestly, to see something then put it behind a security domain and
require a user-id and password. We can argue about the strength of such
systems, but it's highly unlikely a robot can get past it.
Christopher
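
As a rough illustration of the "user-id and password" approach, here is
a minimal sketch of checking an HTTP Basic Authorization header in
Python. The credentials and the function names is_authorized and
make_header are hypothetical; a real server would normally rely on its
own access-control configuration and stored password hashes. Basic auth
only base64-encodes the credentials, so on its own it is no stronger
than the transport it travels over.

```python
import base64
import hmac

# Hypothetical credentials, hard-coded only to keep the sketch short.
EXPECTED_USER = "editor"
EXPECTED_PASSWORD = "correct horse"


def is_authorized(header):
    """Return True if `header` is a valid 'Basic <base64 user:pass>' value."""
    if not header:
        return False
    scheme, _, encoded = header.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:  # bad base64 or bad UTF-8
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # Constant-time comparisons to avoid leaking match length via timing.
    user_ok = hmac.compare_digest(user.encode(), EXPECTED_USER.encode())
    pass_ok = hmac.compare_digest(password.encode(), EXPECTED_PASSWORD.encode())
    return user_ok and pass_ok


def make_header(user, password):
    """Build the header value a client would send (for the demo below)."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


if __name__ == "__main__":
    print(is_authorized(make_header("editor", "correct horse")))
    print(is_authorized(None))  # a robot sending no credentials
```

A crawler that sends no Authorization header never gets past the check,
whatever its robots.txt manners.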