
RE: Crawler Etiquette

Date: Thu, 24 Jan 2002 11:42:37 -0000
Message-ID: <E01E58B6EFE5BF43BA805A0B149823BB571907@liv-mis-ex-01.7global.com>
From: "Hunter, Jonathan" <JHunter@7global.com>
To: <deepak@ai.net>, <nanog@merit.edu>

Hi,

> 	a) Obey robots.txt files
> 	b) Allow network admins to automatically have their
> netblocks exempted on request
> 	c) Allow ISPs' caches to sync with it.

I don't know if this is already on your list, but I'd also suggest
"d) Rate-limiting of requests to a netblock/server". I haven't got any
references immediately to hand, but I do seem to recall a crawler
written in such a way that it remained "server-friendly" and would not
fire off too many requests too quickly.
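
For what it's worth, (a) plus (d) only amount to a few lines of code.
Here's a rough Python sketch of a "polite" fetch loop - the crawler
name, delay value and helper function are mine, purely for
illustration, not taken from any particular crawler:

    # Sketch: honour robots.txt and enforce a per-host delay.
    import time
    import urllib.robotparser
    from urllib.parse import urlparse
    from urllib.request import urlopen

    USER_AGENT = "example-crawler/0.1"  # hypothetical crawler name
    MIN_DELAY = 5.0                     # seconds between hits to one host

    robots = {}    # host -> cached RobotFileParser
    last_hit = {}  # host -> time of last request to that host

    def polite_fetch(url):
        host = urlparse(url).netloc
        # (a) Obey robots.txt, fetched and cached once per host.
        if host not in robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("http://%s/robots.txt" % host)
            rp.read()
            robots[host] = rp
        if not robots[host].can_fetch(USER_AGENT, url):
            return None  # disallowed; skip politely
        # (d) Rate-limit: wait until MIN_DELAY has passed for this host.
        wait = last_hit.get(host, 0) + MIN_DELAY - time.time()
        if wait > 0:
            time.sleep(wait)
        last_hit[host] = time.time()
        return urlopen(url).read()

A real crawler would obviously want per-host queues rather than
blocking sleeps, but the principle is the same.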

> ISPs who cache would have an advantage if they used the cache
> developed by this project to load their tables, but I do not
> know if there is an internet-wide WCCP or equivalent out there
> or if the improvement is worth the management overhead.

It may be worth having a quick look at http://www.ircache.net/ - there
is a database of known caches available through a WHOIS interface,
amongst other things.
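
If anyone wants to poke at that sort of interface programmatically,
WHOIS is just a one-line query over TCP port 43. A rough Python sketch
follows - the server name and query string are placeholders, so check
ircache.net itself for the real details:

    # Sketch: a generic WHOIS client (query string out, text back).
    import socket

    def whois_query(server, query, port=43):
        with socket.create_connection((server, port), timeout=10) as s:
            s.sendall((query + "\r\n").encode("ascii"))
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks).decode("ascii", "replace")

    # Hypothetical usage - substitute the real server and query:
    # print(whois_query("whois.ircache.net", "some-cache-query"))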

HTH,

Jonathan
