[4962] in WWW Security List Archive

Re: ROBOTS

daemon@ATHENA.MIT.EDU (Robert P Cunningham)
Mon Mar 31 21:26:07 1997

Date: Mon, 31 Mar 97 11:50 WET
From: bob@lava.net (Robert P Cunningham)
To: jeffm@sgiserv3.aws.waii.com, www-security@ns2.rutgers.edu
Errors-To: owner-www-security@ns2.rutgers.edu


>Is there any threat caused by allowing web indexing robots to enter your site?
>...

No more than allowing browsers to enter your site.  Probably less.
In general, robots will not execute JavaScript or Java, and will
ignore image maps (and often framesets as well).  And they don't
POST anything.  Most robots will try not to trigger CGI programs
if they can help it.  Plus, all major indexing robots will obey a
robots.txt file in your server root.  That file gives you a great
deal of control to tell robots what they can visit on your site,
and what they cannot.
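
To make the robots.txt point concrete, here is a minimal sketch of how a
well-behaved robot might consult that file before fetching a page.  It uses
Python's standard urllib.robotparser module; the robots.txt contents, the
paths, and the "ExampleBot" agent name are made up for illustration and are
not taken from any real site or robot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the paths are assumptions for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network access

def allowed(path, agent="ExampleBot"):
    """Return True if the (hypothetical) robot may fetch this path."""
    return parser.can_fetch(agent, path)
```

With these rules, allowed("/index.html") is true, while anything under
/cgi-bin/ or /private/ is refused.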

Robots usually will not probe your site very deeply.  Different
robots have different cutoffs (and details are usually proprietary),
but going much deeper than 4 levels (more precisely: following a chain
of linked pages for that long) would be unusual.
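
A depth cutoff of this kind can be sketched as a breadth-first walk that
stops following links once a chain gets too long.  The link table below is
a made-up site, and the default of 4 simply echoes the rough figure above;
real robots keep their actual cutoffs to themselves.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/a/1/x"],
    "/a/1/x": ["/a/1/x/y"],
    "/a/1/x/y": ["/a/1/x/y/z"],
    "/b": [],
}

def crawl(start, links, max_depth=4):
    """Breadth-first walk that does not follow links past max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # cutoff reached: do not follow this page's links
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order
```

Starting from "/", the page five links away ("/a/1/x/y/z") is never
requested, while everything within four links is.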

There was a problem with some early robots, which would try to get
as much as possible, as quickly as possible, from sites, and that could
overload some servers.  But the current crop of robots--at least those
of the major search sites--are much better-behaved.  They will check
a few pages from your site, then take a break, then check a few more, etc.
(Actually, they're time-slicing between sites...).
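
The time-slicing idea can be sketched as a round-robin over per-site
queues: take a few pages from one site, move on to the next, and come back
later.  This is only an illustration under assumed names (time_slice,
batch, the example hosts); it does no fetching and says nothing about how
any particular search site schedules its robot.

```python
from collections import deque

def time_slice(pending, batch=2):
    """Yield (host, path) pairs, taking at most `batch` pages per host per turn."""
    hosts = deque((host, deque(paths)) for host, paths in pending.items())
    while hosts:
        host, paths = hosts.popleft()
        for _ in range(min(batch, len(paths))):
            yield host, paths.popleft()
        if paths:
            # This host takes a break until its next turn comes around.
            hosts.append((host, paths))
```

From the server's side this looks like a short burst of requests followed
by a pause, since the robot is busy with other sites in between.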

There were also some other problems having to do with circular links.
Most, probably all, of the current robots now avoid those obvious traps.
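
One plausible way to avoid a circular-link trap is to reduce every link to
a canonical absolute URL and remember which ones have been visited.  The
sketch below uses urljoin and urldefrag from Python's standard urllib.parse
module; the two-page site is made up, and this is not a description of what
any specific robot does.

```python
from urllib.parse import urljoin, urldefrag

# Hypothetical pair of pages that link back to each other.
PAGES = {
    "http://www.example.com/a.html": ["b.html"],
    "http://www.example.com/b.html": ["a.html#top", "./b.html"],
}

def canonical(base, href):
    """Resolve href against base and drop any #fragment."""
    return urldefrag(urljoin(base, href)).url

def walk(start, pages):
    """Visit each canonical URL once, even when links loop back."""
    visited = set()
    order = []
    stack = [start]
    while stack:
        url = stack.pop()
        if url in visited:
            continue  # already seen: this is where a loop gets cut
        visited.add(url)
        order.append(url)
        for href in pages.get(url, []):
            stack.append(canonical(url, href))
    return order
```

Here "a.html#top" and "./b.html" resolve to pages already visited, so the
walk ends after two pages instead of circling.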

