[129481] in North American Network Operators' Group


Re: yahoo crawlers hammering us

daemon@ATHENA.MIT.EDU (Bruce Williams)
Wed Sep 8 05:22:32 2010

In-Reply-To: <AANLkTi=AHmvFAP9ZcYHG0ZHpw3oM3f33w6jSibc5=zFA@mail.gmail.com>
From: Bruce Williams <williams.bruce@gmail.com>
Date: Wed, 8 Sep 2010 02:21:31 -0700
Cc: nanog@nanog.org
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org

> I *am* curious--what makes it any worse for a search engine like Google
> to fetch the file than any other random user on the Internet

Possibly because that other user is who the customer pays to have their
content delivered to?

Bruce Williams
-----------------------------------------------------------------------------
You can close your eyes to things you don't want to see, but you can't
close your heart to the things you don't want to feel.




On Wed, Sep 8, 2010 at 12:04 AM, Matthew Petach <mpetach@netflight.com> wrote:
> On Tue, Sep 7, 2010 at 1:19 PM, Ken Chase <ken@sizone.org> wrote:
>> So I guess I'm new at internets, as my colleagues told me, because I haven't
>> gone around to 30-40 systems I control (minus customer self-managed gear) and
>> installed a restrictive robots.txt everywhere to make the web less useful to
>> everyone.
>>
>> Does that really mean that a big outfit like Yahoo should be expected to
>> download stuff at high speed off my customers' servers? For varying values of
>> 'high speed', ~500K/s (4Mbps+) for a 3 gig file is kinda... a bit harsh.
>> Especially for an exe a user left exposed in a webdir, that's possibly (C)
>> software and shouldn't have been there (now removed by customer, some kinda
>> OS boot cd/toolset thingy).
>
> The large search engines like Google, Bing, and Yahoo do try to be good
> netizens: they avoid having multiple crawlers hit a given machine at the
> same time, and they put delays between requests to be nice to the CPU load
> and bandwidth of the machines; but I don't think any of the crawlers
> explicitly make efforts to slow down single-file fetches. Ordinarily, the
> transfer speed doesn't matter much for a single URL fetch, as it lasts a
> very short period of time, and then the crawler waits before doing another
> fetch from the same machine/same site, reducing the load on the machine
> being crawled. I doubt any of them rate-limit individual fetches, though,
> so you're likely to see more of an impact when serving up large single
> files like that.
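>
> (Roughly, that politeness logic looks like the sketch below -- stdlib-only
> Python, untested, with the bot name and the URL handling invented purely
> for illustration; no real crawler's scheduler is this simple:)
>
> import time
> import urllib.request
> import urllib.robotparser
> from urllib.parse import urlsplit
>
> def polite_fetch(urls, user_agent="ExampleBot", default_delay=5.0):
>     # Honor robots.txt Disallow rules and the non-standard Crawl-delay
>     # directive (which Yahoo's Slurp respected).  Note there is no
>     # throttling *within* a fetch: each transfer runs at full speed,
>     # which is why a single 3-gig file still hurts.
>     parts = urlsplit(urls[0])
>     rp = urllib.robotparser.RobotFileParser()
>     rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
>     rp.read()
>     delay = rp.crawl_delay(user_agent) or default_delay
>     for url in urls:
>         if not rp.can_fetch(user_agent, url):
>             continue  # skip disallowed paths entirely
>         urllib.request.urlopen(url).read()  # full-speed transfer
>         time.sleep(delay)  # politeness gap between requests, not within one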
>
> I *am* curious--what makes it any worse for a search engine like Google to
> fetch the file than any other random user on the Internet? In either case,
> the machine doing the fetch isn't going to rate-limit the fetch, so you're
> likely to see the same impact on the machine, and on the bandwidth.
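>
> (If the serving side actually cared about per-fetch speed, it would have to
> cap the transfer itself. An untested sketch of the idea in Python -- chunked
> writes with sleeps, roughly what nginx's limit_rate directive does for you;
> the names and numbers here are made up:)
>
> import time
>
> def send_throttled(src, dst, limit_bps=500 * 1024, chunk=64 * 1024):
>     # Copy src to dst, pausing after each chunk so throughput averages
>     # out to roughly limit_bps bytes per second.  Real servers do this
>     # in the event loop rather than with sleep().
>     while True:
>         data = src.read(chunk)
>         if not data:
>             break
>         dst.write(data)
>         time.sleep(len(data) / limit_bps)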
>
>> Is this expected/my own fault or what?
>
> Well...if you put a 3GB file out on the web, unprotected, you've got to
> figure at some point someone's going to stumble across it and download it
> to see what it is. If you don't want to be serving it, it's probably best
> to not put it up on an unprotected web server where people can get to it. ^_^;
>
> Speaking purely for myself in this matter, as a random user who sometimes
> sucks down random files left in unprotected directories, just to see what
> they are.
>
> Matt
> (now where did I put that antivirus software again...?)

