[86731] in North American Network Operators' Group

Re: STILL Paging Google...

daemon@ATHENA.MIT.EDU (MH)
Tue Nov 15 21:39:35 2005

Date: Wed, 16 Nov 2005 02:38:40 +0000 (UTC)
From: MH <malum@freeshell.org>
To: nanog@merit.edu
In-Reply-To: <437A83AC.8050805@elvey.com>
Errors-To: owner-nanog@merit.edu


Hi there,

Looking at your robots.txt... are you sure that is correct?

On the sites I host, robots.txt always has:

User-agent: *
Disallow: /

That goes in /htdocs, or wherever the httpd document root lives.  Thus 
far it has kept the spiders away.
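
If you want to sanity-check what a spec-following crawler will do with 
a given robots.txt, Python's standard library ships a parser.  A 
minimal sketch (the hostname is made up; note that urllib.robotparser 
implements the original robots.txt draft with plain prefix matching, so 
it will NOT understand Google's "*" wildcard extension):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# With "User-agent: *" / "Disallow: /" these should both print False.
print(rp.can_fetch("Googlebot", "http://www.example.com/index.php?id=3"))
print(rp.can_fetch("Googlebot", "http://www.example.com/"))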

Googlebot will also obey NOARCHIVE, NOFOLLOW, and NOINDEX directives 
placed in a meta tag inside the HTML head.
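
For reference, that meta tag normally looks like this (one line inside 
each page's <head>):

<meta name="robots" content="noindex,nofollow,noarchive">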

With the above in robots.txt I've had no problems thus far.  (A rough 
sketch for testing the wildcard patterns discussed in the quoted thread 
follows at the end of this message.)

-M.

> Still no word from google, or indication that there's anything wrong with the 
> robots.txt.  Google's estimated hit count is going slightly up, instead of 
> way down.
> Why am I bugging NANOG with this? Well, I'm sure if Googlebot keeps ignoring 
> my robots.txt file, thereby hammering the server and facilitating spam, 
> they're doing the same with a google other sites.  (Well, ok, not a google, 
> but you get my point.)

> The above page says that
> User-agent: Googlebot
> Disallow: /*?
> will block all standard-looking dynamic content, i.e. URLs with "?" in them.
>> 
>> 
>> On Mon, 14 Nov 2005, Matthew Elvey wrote:
>> 
>>> 
>>> Doh!  I had no idea my thread would require login/be hidden from general 
>>> view!  (A robots.txt info site had directed me there...)   It seems I fell 
>>> for an SEO scam... how ironic.  I guess that's why I haven't heard from 
>>> google...
>>> 
>>> Anyway, here's the page content (with some editing and paraphrasing):
>>> 
>>> Subject: paging google! robots.txt being ignored!
>>> 
>>> Hi. My robots.txt was put in place in August!
>>> But google still has tons of results that violate the file.
>>> 
>>> http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
>>> doesn't complain (other than about the use of google's nonstandard 
>>> extensions described at
>>> http://www.google.com/webmasters/remove.html )
>>> 
>>> The above page says that it's OK that
>>> 
>>> #per [[AdminRequests]]
>>> User-agent: Googlebot
>>> Disallow: /*?*
>>> 
>>> is last (after User-agent: *)
>>> 
>>> and seems to suggest that the syntax is OK.
>>> 
>>> I also tried
>>> 
>>> User-agent: Googlebot
>>> Disallow: /*?
>>> but it hasn't helped.
>>> 
>>> 
>>> 
>>> I asked google to review it via the automatic URL removal system 
>>> (http://services.google.com/urlconsole/controller).
>>> Result:
>>> URLs cannot have wild cards in them (e.g. "*"). The following line 
>>> contains a wild card:
>>> DISALLOW: /*?
>>> 
>>> How insane is that?
>>> 
>>> Oh, and while /*?* wasn't per their example, it was legal per their 
>>> syntax, same as /*? !
>>> 
>>> The site has around 35,000 pages, and I don't think a small robots.txt to 
>>> do what I want is possible without using the wildcard extension.
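
As mentioned above, here is a rough way to test what a wildcard-aware 
crawler might do with the patterns from the quoted thread.  Google has 
not published its exact matching code, so this only approximates the 
extension described on their removal page ("*" matches any run of 
characters, and a pattern need only match a prefix of the URL path); 
the URLs are made up:

import re

def blocked(path, pattern):
    # "*" matches any run of characters; the pattern need only match a
    # prefix of the path.  An approximation -- Google's actual matcher
    # is not public.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, path) is not None

# Both patterns from the thread should block dynamic URLs (a "?" in
# them) and leave static pages alone:
for path in ("/wiki/index.php?title=Foo", "/wiki/Main_Page"):
    for pattern in ("/*?", "/*?*"):
        print(pattern, path, blocked(path, pattern))

Which also bears out the point above: under these semantics /*? and 
/*?* behave identically.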
