[4995] in WWW Security List Archive
Re: ROBOTS
daemon@ATHENA.MIT.EDU (Chris Newton)
Mon Apr 7 20:15:02 1997
Date: Mon, 7 Apr 97 12:00:25 PDT
From: chris@sandpiper.com (Chris Newton)
To: www-security@ns2.rutgers.edu
Errors-To: owner-www-security@ns2.rutgers.edu
Certainly you shouldn't rely on a robots.txt file to hide sensitive data,
but it seems to me that the robot protocol should let you specify which
partial URLs a robot may visit, as well as which ones it may not.
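Something along these lines is what I have in mind; note that the 'Allow'
line is hypothetical, since the spec I've been working from doesn't define
one:

User-agent: *
Allow: /public/
Disallow: /

i.e. shut robots out of everything by default, then open up only the trees
you actually want indexed.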
Has the standard been extended for this functionality? My robots.txt files
have been written to the 'standard' published at
http://info.webcrawler.com/mak/projects/robots/norobots.html
which makes no mention of an 'allow' statement. Is there a more up-to-date
spec anywhere?
chris
> From: "Irving Reid" <irving@border.com>
>
> > If you want a copy it's at http://www.deepsummer.com/robots.txt
> > (should be able to do a shift-click on it to retrieve). If
> > not, let me know and I'll mail you a copy.
> >
> > Also, if anyone wants to take a peek at it and let me know if
> > you see anything I might have done better, then by all means
> > do so.
> >
> > -frank
>
> Here's an excerpt from your robots.txt file:
>
> # /robots.txt for http://www.deepsummer.com
> # comments to webmaster@deepsummer.com
>
> User-agent: *
> Disallow: /azure/
>
> You've just given me the exact path name for a directory you don't want
> the web crawlers to know about.
>
> Stupid Net Trick #47: If you want to see things that people think are
> hidden, look for "Disallow" lines in their robots.txt files.
>
> The right thing to do is deny _all_, and then explicitly allow the
> files you want indexed. That way you don't leak information to nosy
> people.
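>
> A blanket deny, for instance, gives away nothing, because it doesn't
> have to name a single path:
>
> User-agent: *
> Disallow: /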
>
> - irving -
>
> (a really nasty person might write a crawler/indexer that _only_
> indexed pages reached from peoples' "Disallow" lines. I'm not sure if
> I'm not nasty enough, or just too lazy...)
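>
> The harvesting half would only be a few lines. A rough sketch in Python,
> using the site from the excerpt above (the function name and everything
> else is made up):
>
> import urllib.request
>
> def disallowed_paths(site):
>     # Fetch the site's robots.txt and return the URLs it asks robots
>     # to stay away from.
>     text = urllib.request.urlopen(site + "/robots.txt").read().decode("latin-1")
>     paths = []
>     for line in text.splitlines():
>         line = line.split("#", 1)[0].strip()      # drop comments
>         if line.lower().startswith("disallow:"):
>             path = line.split(":", 1)[1].strip()
>             if path:                              # an empty Disallow means "allow everything"
>                 paths.append(site + path)
>     return paths
>
> for url in disallowed_paths("http://www.deepsummer.com"):
>     print(url)    # exactly the pages the owner hoped nobody would look at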