[32420] in Perl-Users-Digest

Perl-Users Digest, Issue: 3687 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Wed May 9 18:14:26 2012

Date: Wed, 9 May 2012 15:14:10 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Wed, 9 May 2012     Volume: 11 Number: 3687

Today's topics:
    Re: WWW::Mechanize and 3rd party APIs (Google) <justin.1203@purestblue.com>
    Re: WWW::Mechanize and 3rd party APIs (Google) (Randal L. Schwartz)
    Re: WWW::Mechanize and 3rd party APIs (Google) <ben@morrow.me.uk>
    Re: WWW::Mechanize and 3rd party APIs (Google) <justin.1203@purestblue.com>
    Re: WWW::Mechanize and 3rd party APIs (Google) <justin.1203@purestblue.com>
    Re: WWW::Mechanize and 3rd party APIs (Google) <ac.russell@live.com>
    Re: WWW::Mechanize and 3rd party APIs (Google) greymausg@mail.com
    Re: WWW::Mechanize and 3rd party APIs (Google) <*@eli.users.panix.com>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Tue, 8 May 2012 15:07:41 +0100
From: Justin C <justin.1203@purestblue.com>
Subject: Re: WWW::Mechanize and 3rd party APIs (Google)
Message-Id: <duln79-lpp.ln1@zem.masonsmusic.co.uk>

On 2012-05-03, Adam Russell <ac.russell@live.com> wrote:
> On 4/23/12 9:41 AM, Justin C wrote:
>> I'm trying to mechanize a site which returns data
>> depending on the location you enter into a form, and a
>> distance from that location (selected from a
>> drop-down). I was completing the form and submitting
>> it, as I would in a browser, but not getting the
>> result I expected.
>>
>> On further investigation it appears that the site
>> sends the location information to GoogleMaps
>> (JavaScript), which returns a location that gets
>> substituted into the form before it is submitted.
>>
>> Has anyone automated a site like this before? Can you
>> offer any suggestion as to how I interact with
>> Google's API?
> I have never interacted with Google's API at all. Randal's
> advice is certainly sound.
> I would like to add that, in general, I have had good advice
> with WWW::Scripter which subclasses WWW::Mechanize and allows
> for use of plugins to handle the JavaScript.
> Sadly, the available plugins seem to work best with only the more
> basic JavaScript out there. In my experience JE is super slow
> and the SpiderMonkey plugin is buggy. For more complicated sites
> I have had the best results with WWW::Mechanize::Firefox which, of
> course, requires you to have Firefox up and running with the MozRepl
> plugin. If you can work with those constraints I would just use that.

Thank you for this suggestion, I will take a look at
the documentation and see what I can do.

   Justin.

-- 
Justin C, by the sea.


------------------------------

Date: Tue, 08 May 2012 09:59:58 -0700
From: merlyn@stonehenge.com (Randal L. Schwartz)
To: Justin C <justin.1203@purestblue.com>
Subject: Re: WWW::Mechanize and 3rd party APIs (Google)
Message-Id: <86aa1i7gmp.fsf@red.stonehenge.com>

>>>>> "Justin" == Justin C <justin.1203@purestblue.com> writes:

Justin> The site we're hoping to scrape is, I'm not sure if I
Justin> mentioned, offering content based on a location
Justin> entered. IMO, if they're offering the data FOC, then
Justin> I should be able to access it, they're just making me
Justin> jump through hoops to do so. That's fair enough, I'm
Justin> using the site in a way they didn't intend, I don't
Justin> expect them to help me out!

And you still consider that ethical?

Shame on you.

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
See http://methodsandmessages.posterous.com/ for Smalltalk discussion


------------------------------

Date: Wed, 9 May 2012 00:21:47 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: WWW::Mechanize and 3rd party APIs (Google)
Message-Id: <bdmo79-2d82.ln1@anubis.morrow.me.uk>


Quoth merlyn@stonehenge.com (Randal L. Schwartz):
> >>>>> "Justin" == Justin C <justin.1203@purestblue.com> writes:
> 
> Justin> The site we're hoping to scrape is, I'm not sure if I
> Justin> mentioned, offering content based on a location
> Justin> entered. IMO, if they're offering the data FOC, then
> Justin> I should be able to access it, they're just making me
> Justin> jump through hoops to do so. That's fair enough, I'm
> Justin> using the site in a way they didn't intend, I don't
> Justin> expect them to help me out!
> 
> And you still consider that ethical?
> 
> Shame on you.

The ethical issues here are not at all clear-cut. Much of the commercial
Web wishes to be able to say 'You may look at this information in one
sort of program for free, but if you want to use a different sort of
program to do something interesting with it you have to pay'. It's not
the least bit clear to me that's a reasonable position to take from a
technical point of view, or one which will be socially beneficial in the
long run, no matter which side the law ends up coming down on.

That said: AIUI (and IANAL, obviously), the law is (or at least the
lawyers are) fairly firmly on the side of the site operators and their
Terms of Use, so it's wise to be careful about these things.

Ben



------------------------------

Date: Wed, 9 May 2012 12:04:50 +0100
From: Justin C <justin.1203@purestblue.com>
Subject: Re: WWW::Mechanize and 3rd party APIs (Google)
Message-Id: <ijvp79-uia.ln1@zem.masonsmusic.co.uk>

On 2012-05-08, Randal L. Schwartz <merlyn@stonehenge.com> wrote:
>>>>>> "Justin" == Justin C <justin.1203@purestblue.com> writes:
>
>Justin> The site we're hoping to scrape is, I'm not sure if I
>Justin> mentioned, offering content based on a location
>Justin> entered. IMO, if they're offering the data FOC, then
>Justin> I should be able to access it, they're just making me
>Justin> jump through hoops to do so. That's fair enough, I'm
>Justin> using the site in a way they didn't intend, I don't
>Justin> expect them to help me out!
>
> And you still consider that ethical?

If the site allows a user to collate the data manually by visiting the
pages one at a time, then I believe they're happy to share that data. I
just happen to be lazy, and don't want to do this the hard way. I'm not
trying to access anything the site isn't happy to share.

I don't consider it unethical, but I do consider that the site owners
weren't aware that someone may want to access the data differently. Also,
WRT the Google API, I'm not after anything relating to that, Google's
data/code is something the page 'serves up' and is just getting in the
way, it's not Google's data that I want.


> Shame on you.

I'm sorry we disagree.


   Justin.

-- 
Justin C, by the sea.


------------------------------

Date: Wed, 9 May 2012 16:58:45 +0100
From: Justin C <justin.1203@purestblue.com>
Subject: Re: WWW::Mechanize and 3rd party APIs (Google)
Message-Id: <lqgq79-k9f.ln1@zem.masonsmusic.co.uk>

On 2012-05-03, Adam Russell <ac.russell@live.com> wrote:
> On 4/23/12 9:41 AM, Justin C wrote:
>> I'm trying to mechanize a site which returns data
>> depending on the location you enter into a form, and a
>> distance from that location (selected from a
>> drop-down). I was completing the form and submitting
>> it, as I would in a browser, but not getting the
>> result I expected.
>>
>> On further investigation it appears that the site
>> sends the location information to GoogleMaps
>> (JavaScript), which returns a location that gets
>> substituted into the form before it is submitted.
>>
>> Has anyone automated a site like this before? Can you
>> offer any suggestion as to how I interact with
>> Google's API?
> I have never interacted with Google's API at all. Randal's
> advice is certainly sound.
> I would like to add that, in general, I have had good advice
> with WWW::Scripter which subclasses WWW::Mechanize and allows
> for use of plugins to handle the JavaScript.
> Sadly, the available plugins seem to work best with only the more
> basic JavaScript out there. In my experience JE is super slow
> and the SpiderMonkey plugin is buggy. For more complicated sites
> I have had the best results with WWW::Mechanize::Firefox which, of
> course, requires you to have Firefox up and running with the MozRepl
> plugin. If you can work with those constraints I would just use that.

Thank you for mentioning Mech::Firefox, I wasn't aware of it and,
for my skill-level at least, I'm finding it easier to extract the
information I want than with Mech alone.... I have just compared the
code I had with the new code, there is very little difference. I
don't think I'm using much that is Mech::Firefox specific, the main
difference is that I'm not trying to fill in every hidden field on
the form. I suppose that 'click-button' in Mech::FF is actually
running the JS in the page, where plain old Mech was not able.
Still, it's all good.
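
For readers following along, here is a minimal sketch of the kind of
WWW::Mechanize::Firefox script described above. It is not the code
from this thread. It assumes Firefox is already running with the
MozRepl add-on enabled, and the URL, form number, field names
(location, distance) and the '#search' selector are all hypothetical
placeholders.

```perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

# Assumption: Firefox is already running with the MozRepl add-on listening.
my $mech = WWW::Mechanize::Firefox->new();

# Hypothetical URL, form number, and field names; substitute the real ones.
$mech->get('http://www.example.com/find');
$mech->form_number(1);
$mech->field(location => 'Brighton');
$mech->select(distance => '10');

# The click happens inside the real browser, so the page's own JavaScript
# handlers run and can fill in any hidden fields before the form is sent.
$mech->click({ selector => '#search' });

# Print whatever page the browser ended up on.
print $mech->content;
```

If the page's script does its lookup asynchronously, the content may
need to be read after a short wait; that depends on the site and is
not handled in this sketch.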

Thank you all for the help. 

   Justin.

-- 
Justin C, by the sea.


------------------------------

Date: Wed, 09 May 2012 14:03:26 -0400
From: Adam Russell <ac.russell@live.com>
Subject: Re: WWW::Mechanize and 3rd party APIs (Google)
Message-Id: <b0972$4faab16f$813f0835$21142@news.eurofeeds.com>

> Thank you for mentioning Mech::Firefox, I wasn't aware of it and,
> for my skill-level at least, I'm finding it easier to extract the
> information I want than with Mech alone.... I have just compared the
> code I had with the new code, there is very little difference. I
> don't think I'm using much that is Mech::Firefox specific, the main
> difference is that I'm not trying to fill in every hidden field on
> the form. I suppose that 'click-button' in Mech::FF is actually
> running the JS in the page, where plain old Mech was not able.
> Still, it's all good.
>
> Thank you all for the help.
No problem! Glad I could help.
Not to stir the hornets' nest up any more but I am curious
how one would view something like Mech::Firefox in terms of a
site's TOS. Would it be any different than a more typical bot script?
In this case, a "true" browser is being used. Any interactions are
plainly available to any human user. While these same interactions
could be done via any script the difference may seem
substantial to a less technical person. Appearances may be everything.
As far as I can tell this still probably breaks the spirit
of any rules though. In the end that is probably what matters most.




------------------------------

Date: 9 May 2012 19:42:26 GMT
From: greymausg@mail.com
Subject: Re: WWW::Mechanize and 3rd party APIs (Google)
Message-Id: <slrnjqlejc.2ee.greymausg@gmaus.xxx>

On 2012-05-09, Justin C <justin.1203@purestblue.com> wrote:
> On 2012-05-08, Randal L. Schwartz <merlyn@stonehenge.com> wrote:
>>>>>>> "Justin" == Justin C <justin.1203@purestblue.com> writes:
>>
>>Justin> The site we're hoping to scrape is, I'm not sure if I
>>Justin> mentioned, offering content based on a location
>>Justin> entered. IMO, if they're offering the data FOC, then
>>Justin> I should be able to access it, they're just making me
>>Justin> jump through hoops to do so. That's fair enough, I'm
>>Justin> using the site in a way they didn't intend, I don't
>>Justin> expect them to help me out!
>>
>> And you still consider that ethical?
>
> If the site allows a user to collate the data manually by visiting the
> pages one at a time, then I believe they're happy to share that data. I
> just happen to be lazy, and don't want to do this the hard way. I'm not
> trying to access anything the site isn't happy to share.
>
> I don't consider it unethical, but I do consider that the site owners
> weren't aware that someone may want to access the data differently. Also,
> WRT the Google API, I'm not after anything relating to that, Google's
> data/code is something the page 'serves up' and is just getting in the
> way, it's not Google's data that I want.
>
>
>> Shame on you.
>
> I'm sorry we disagree.
>
>
>   Justin.



No big thing, just that Go*gle does not like people scraping
unauthorized, until recently[1], as long as you let them know, I found 
out when using a Scraper, maybe 5, 6 years ago.

[1] At least.


-- 
maus
 .
  .
 ...


------------------------------

Date: Wed, 9 May 2012 21:56:33 +0000 (UTC)
From: Eli the Bearded <*@eli.users.panix.com>
Subject: Re: WWW::Mechanize and 3rd party APIs (Google)
Message-Id: <eli$1205091729@qz.little-neck.ny.us>

In comp.lang.perl.misc, Justin C  <justin.1203@purestblue.com> wrote:
> If the site allows a user to collate the data manually by visiting the
> pages one at a time, then I believe they're happy to share that data. I
> just happen to be lazy, and don't want to do this the hard way. I'm not
> trying to access anything the site isn't happy to share.

That's something of an assumption. Why do the site owners want to
share the data? Is it because they get advertising revenue from
ads shown with the data or is it something else?

Consider carefully who pays for the site and what they hope to get
from its existence.

Acme Spices might have a site full of recipes using their exotic
spice blends. They almost certainly don't care in what way people
get stuff from their site because the site exists to promote sales
of a real world item that is the source of their income.

Acme Weather might have a site full of weather predictions and almanac
data. They likely care a whole lot about how people get the data
because web ads and upsell (say Acme Extreme Weather DVDs) are the
source of their income.

Meanwhile noaa.gov runs the National Weather Service and is charged
with producing weather reports for the US Government. They likely
don't care in what way people get their data because the more users
they have the more money they can ask Congress for.

Wikipedia has a site full of content. They are happy to share it
with the world, and rely on people who like their content donating.
They are not happy to share it with all random robots however,
because some place too much strain on their servers. Their robots.txt
is one of the largest I've seen. And they are actively hostile to
some bots. lwp-request cannot download anything, not even robots.txt.
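
As a rough illustration of how a script could check a site's
robots.txt rules before fetching anything, here is a small sketch
using the WWW::RobotRules module. The robots.txt text, the agent
names (BadBot, MyScraper) and the example.com URLs are made up for
the example and are not taken from Wikipedia or any real site; the
allowed_for helper is likewise hypothetical.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Hypothetical robots.txt content, embedded so the sketch runs offline.
my $robots_txt = <<'END';
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
END

# Hypothetical helper: returns 1 if $agent may fetch $url under the
# rules above, 0 otherwise.
sub allowed_for {
    my ($agent, $url) = @_;
    my $rules = WWW::RobotRules->new($agent);
    $rules->parse('http://www.example.com/robots.txt', $robots_txt);
    return $rules->allowed($url) ? 1 : 0;
}

for my $agent ('BadBot', 'MyScraper') {
    for my $path ('/recipes/curry', '/private/data') {
        my $url = "http://www.example.com$path";
        printf "%-10s %-16s %s\n", $agent, $path,
            allowed_for($agent, $url) ? 'allowed' : 'disallowed';
    }
}
```

Note that robots.txt is only advisory: a server can still refuse
requests by User-Agent regardless of what the file says, which is a
separate mechanism from the rules checked here.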

> I don't consider it unethical, but I do consider that the site owners
> weren't aware that someone may want to access the data differently.

If you aren't paying for it, you have to consider strongly that you
are the product, not the customer. The site operators are likely
more interested in the customer's interests.

> Also,
> WRT the Google API, I'm not after anything relating to that, Google's
> data/code is something the page 'serves up' and is just getting in the
> way, it's not Google's data that I want.

If it is hosted by Google, then it is probably Google using low priced
content to attract eyeballs to their advertising customers. Same thing
with Facebook, Yahoo, and a thousand other sites.

Elijah
------
used user-agent spoofing to check five different "clients" for wikipedia


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 3687
***************************************
