[22183] in Perl-Users-Digest
Perl-Users Digest, Issue: 4404 Volume: 10
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Tue Jan 14 18:10:43 2003
Date: Tue, 14 Jan 2003 15:10:13 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Tue, 14 Jan 2003 Volume: 10 Number: 4404
Today's topics:
Question about high performance spidering in perl <extendedpartition@NOSPAM.yahoo.com>
Re: Question about high performance spidering in perl <camerond@mail.uca.edu>
Re: Question about high performance spidering in perl <extendedpartition@NOSPAM.yahoo.com>
Re: Question about high performance spidering in perl <mark.seger@hp.com>
Re: Question about high performance spidering in perl <camerond@mail.uca.edu>
Re: Substitution of ampersand with a plus symbol stan@temple.edu
Re: Suggestions for counter <bilkay@xxxlocalnet.com>
Re: Suggestions for counter <camerond@mail.uca.edu>
Re: The "default thing" (Bruce McKenzie)
Re: Variable naming convention (Andrew Allaire)
Re: Variable naming convention (Stefan Adams)
Re: Why does this NOT work?? (ashok)
Re: Why does this NOT work?? (Tad McClellan)
Re: Why does this NOT work?? <bongie@gmx.net>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Tue, 14 Jan 2003 14:08:16 -0600
From: Extended Partition <extendedpartition@NOSPAM.yahoo.com>
Subject: Question about high performance spidering in perl
Message-Id: <j9q82v8qtrqanmgpfs3bb4jv0mrtcv91r1@4ax.com>
Hello Everyone,
I am looking at a project that aims to create a high performance
spider program to assist in internet searches. Actually, it's really a
combo program that combines spider and search agent technology
together if that makes any difference. The program will basically crawl
the web (starting at an arbitrary URI) and search pages for specific
target words. If it finds a target word or phrase on a page then it
will save that page to a db. If not, it moves to the next page. In
addition, it needs to link out from each page to every page the
current page links to. It needs to do this over and over and over. In
effect "crawl" the web. My question is NOT how to do this as I have a
good understanding of what it needs to do. My question rather is
"should I use Perl to do it"?
Theoretically, this program might be faced with crawling millions of
pages. And, while time is not an object, I would like to make this as
quick as possible. In an ideal setting the program will run either on
minicomputers or supercomputers and will have a high-speed
connection to the internet. The program will also run 24/7 so it can
cover as much ground as possible.
Now, I know that Perl is GREAT at text parsing and that's why I am
considering it for this project. But do you think it's a good choice
for a program of this scale? The program itself won't be massive
(perhaps a few thousand lines at most) but the recursive and
repetitive nature of its functionality might incur some overhead. What
do you think? I would like opinions, suggestions, etc.
Thanks!
Extended
------------------------------
Date: Tue, 14 Jan 2003 14:39:49 -0600
From: Cameron Dorey <camerond@mail.uca.edu>
Subject: Re: Question about high performance spidering in perl
Message-Id: <3E247595.4090308@mail.uca.edu>
Extended Partition wrote:
> Hello Everyone,
>
> I am looking at a project that aims to create a high performance
> spider program to assist in internet searches. [snip]
> "should I use Perl to do it"?
>
> Theoretically, this program might be faced with crawling millions of
> pages. And, while time is not an object, I would like to make this as
> quick as possible.
>
> Now, I know that Perl is GREAT at text parsing and that's why I am
> considering it for this project. But do you think it's a good choice
> for a program of this scale? The program itself won't be massive
> (perhaps a few thousand lines at most) but the recursive and
> repetitive nature of its functionality might incur some overhead. What
> do you think? I would like opinions, suggestions, etc.
When the internet is involved, it is almost always the slow step. Just
about anything you do in Perl on your end will take negligible time in
comparison. Get "Perl and LWP" and "Programming the Perl DBI" to help
get yourself up to speed with the details quickly.
Cameron
--
Cameron Dorey
Associate Professor of Chemistry
University of Central Arkansas
Phone: 501-450-5938
camerond@mail.uca.edu
------------------------------
Date: Tue, 14 Jan 2003 14:48:43 -0600
From: Extended Partition <extendedpartition@NOSPAM.yahoo.com>
Subject: Re: Question about high performance spidering in perl
Message-Id: <crt82vghnso6pqahlslc0pv2rkvgn1u99h@4ax.com>
<snip>
>> Now, I know that Perl is GREAT at text parsing and that's why I am
>> considering it for this project. But do you think it's a good choice
>> for a program of this scale? The program itself won't be massive
>> (perhaps a few thousand lines at most) but the recursive and
>> repetitive nature of its functionality might incur some overhead. What
>> do you think? I would like opinions, suggestions, etc.
>
>
>When the internet is involved, it is almost always the slow step. Just
>about anything you do in Perl on your end will take negligible time in
>comparison. Get "Perl and LWP" and "Programming the Perl DBI" to help
>get yourself up to speed with the details quickly.
Thanks, Cameron! Any suggestions on how to speed up the internet part or
will that pretty much remain an undefined variable?
Thanks!
Extended
------------------------------
Date: Tue, 14 Jan 2003 15:55:47 -0500
From: Mark Seger <mark.seger@hp.com>
Subject: Re: Question about high performance spidering in perl
Message-Id: <3E247953.A18F80B8@hp.com>
having done a similar type of thing, the way to 'speed up the internet' is to
avoid connecting to it as much as possible! in other words, since it's very
likely a number of pages will reference the same page, be sure to do as much
intelligent caching as possible, only visiting pages you've never been to
before.
-mark
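
The "only visit pages you've never been to" idea can be sketched with a
hash of seen URLs and a work queue. This is a minimal illustration, not
code from the thread: the %links_on table is a made-up in-memory link
graph standing in for real HTTP fetches, and crawl() is a hypothetical
helper name.

```perl
use strict;
use warnings;

# Hypothetical in-memory link graph standing in for real HTTP fetches;
# a real spider would fetch and parse each page instead.
my %links_on = (
    'http://example.com/'  => [ 'http://example.com/a', 'http://example.com/b' ],
    'http://example.com/a' => [ 'http://example.com/b', 'http://example.com/' ],
    'http://example.com/b' => [ 'http://example.com/a' ],
);

# Breadth-first crawl that visits each URL at most once.
# Returns the URLs in the order they would have been fetched.
sub crawl {
    my ($start, $graph) = @_;
    my %seen  = ($start => 1);    # URLs already queued or visited
    my @queue = ($start);
    my @visited;

    while (defined(my $url = shift @queue)) {
        push @visited, $url;      # the single fetch of $url would happen here
        for my $next (@{ $graph->{$url} || [] }) {
            next if $seen{$next}++;    # skip anything seen before
            push @queue, $next;
        }
    }
    return @visited;
}

print "$_\n" for crawl('http://example.com/', \%links_on);
```

For millions of pages the %seen hash would probably need to live on disk
(a tied DBM file or the database itself) rather than in memory.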
Extended Partition wrote:
> <snip>
> >> Now, I know that Perl is GREAT at text parsing and that's why I am
> >> considering it for this project. But do you think it's a good choice
> >> for a program of this scale? The program itself won't be massive
> >> (perhaps a few thousand lines at most) but the recursive and
> >> repetitive nature of its functionality might incur some overhead. What
> >> do you think? I would like opinions, suggestions, etc.
> >
> >
> >When the internet is involved, it is almost always the slow step. Just
> >about anything you do in Perl on your end will take negligible time in
> >comparison. Get "Perl and LWP" and "Programming the Perl DBI" to help
> >get yourself up to speed with the details quickly.
>
> Thanks, Cameron! Any suggestions on how to speed up the internet part or
> will that pretty much remain an undefined variable?
>
> Thanks!
> Extended
------------------------------
Date: Tue, 14 Jan 2003 16:23:25 -0600
From: Cameron Dorey <camerond@mail.uca.edu>
Subject: Re: Question about high performance spidering in perl
Message-Id: <3E248DDD.1050402@mail.uca.edu>
Extended Partition wrote:
> <snip>
> Thanks, Cameron! Any suggestions on how to speed up the internet part or
> will that pretty much remain an undefined variable?
Getting yourself a T-3 line might help a bit (but only a bit); get onto
Internet-2 and only interact with other sites on I-2; rent the office
right beside Google's computer and dig a tunnel to tap directly into
its database, avoiding the internet altogether.
Or take Mark Seger's advice which has already shown up on my newsfeed.
Cameron
--
Cameron Dorey
Associate Professor of Chemistry
University of Central Arkansas
Phone: 501-450-5938
camerond@mail.uca.edu
------------------------------
Date: 14 Jan 2003 20:53:31 GMT
From: stan@temple.edu
Subject: Re: Substitution of ampersand with a plus symbol
Message-Id: <b01tcb$k3o$1@cronkite.temple.edu>
Harald H.-J. Bongartz <bongie@gmx.net> wrote:
>
> The problem must be elsewhere. Are you working on the right string?
Yup, but I ran out of time so I just used an index call to find the
location in the string of the "&" character, then formed a new string
with the two parts of the old string before and after the location of
that character with a "+" concatenated in between. This works nicely, but
it's not an elegant solution.
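
For comparison, here is a sketch of the index/substr approach described
above next to a one-step translation with tr///. The code that originally
failed is not shown in this digest, so this only illustrates the two
techniques; both sub names are made up for the example.

```perl
use strict;
use warnings;

# Replace every "&" with "+" and return the new string.
# tr/// translates character for character, so no regex is needed.
sub amp_to_plus {
    my ($text) = @_;
    (my $copy = $text) =~ tr/&/+/;
    return $copy;
}

# The index/substr approach: splice a "+" in place of the first "&" only.
sub amp_to_plus_first {
    my ($text) = @_;
    my $pos = index($text, '&');
    return $text if $pos < 0;    # no "&" present
    return substr($text, 0, $pos) . '+' . substr($text, $pos + 1);
}

print amp_to_plus('a&b&c'), "\n";          # prints a+b+c
print amp_to_plus_first('a&b&c'), "\n";    # prints a+b&c
```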
------------------------------
Date: Tue, 14 Jan 2003 16:30:48 -0500
From: "Bill K." <bilkay@xxxlocalnet.com>
Subject: Re: Suggestions for counter
Message-Id: <20030114.162745.492067917.1641@xxxlocalnet.com>
In article <x73cnvmx8h.fsf@mail.sysarch.com>, "Uri Guttman"
<uri@stemsystems.com> wrote:
> ...
> but you are ignoring that major piece of advice. your code is not safe
> from file access collisions because it doesn't lock the file. this is a
> common failure of the kiddie counter scripts and rarely one of a
> 'professional' script. if you want to have a working counter regardless
> of the features, you need to do locking.
>
> so brushing off criticism like that is not doing you any good.
I'm not brushing anything off, except the advice to go find someone
else's program. Most of the stuff from "other people" I've seen is almost
as crappy as what I did myself, including glaring omissions like file
locking. As far as functionality goes, most are much crappier.
> the
> comments on the rest of the code are valid as well. it would behoove you
> to study and learn from them even if you are not a professional. you are
> trying to be a perl hacker so you might as well do it right.
True, but one person's definition of "right" isn't everyone's, and
"right" for a program tracking a corporation's inventory isn't
necessarily the level of "rightness" needed for a program tracking hits
on an insignificant web site. Thanks for your insight, though.
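
For readers wondering what the locking being discussed looks like, here is
a minimal sketch using flock. It is not the counter script from the
thread (which does not appear in this digest); bump_counter() and the
temporary file are assumptions made for the example.

```perl
use strict;
use warnings;
use Fcntl qw(:flock :seek O_RDWR O_CREAT);
use File::Temp qw(tempfile);

# Increment the number stored in $file while holding an exclusive lock,
# and return the new value. The file is created if it does not exist.
sub bump_counter {
    my ($file) = @_;

    # Open read/write WITHOUT truncating, so the old count is still
    # there when we finally hold the lock.
    sysopen(my $fh, $file, O_RDWR | O_CREAT)
        or die "Cannot open $file: $!";
    flock($fh, LOCK_EX) or die "Cannot lock $file: $!";

    my $count = <$fh>;
    $count = 0 unless defined $count;
    chomp $count;
    $count = 0 unless $count =~ /^\d+\z/;    # treat garbage as zero
    $count++;

    seek($fh, 0, SEEK_SET) or die "Cannot rewind $file: $!";
    truncate($fh, 0)       or die "Cannot truncate $file: $!";
    print {$fh} "$count\n" or die "Cannot write $file: $!";
    close($fh)             or die "Cannot close $file: $!";  # releases the lock
    return $count;
}

# Demonstration on a throwaway file that is removed at exit.
my (undef, $path) = tempfile(UNLINK => 1);
print bump_counter($path), "\n" for 1 .. 3;
```

The point of opening before locking and truncating only afterwards is
that a second process cannot read an empty file in between.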
------------------------------
Date: Tue, 14 Jan 2003 16:15:54 -0600
From: Cameron Dorey <camerond@mail.uca.edu>
Subject: Re: Suggestions for counter
Message-Id: <3E248C1A.1080109@mail.uca.edu>
Bill K. wrote:
> In article <x73cnvmx8h.fsf@mail.sysarch.com>, "Uri Guttman"
> <uri@stemsystems.com> wrote:
>
>
>>...
>>but you are ignoring that major piece of advice. your code is not safe
>>from file access collisions because it doesn't lock the file. this is a
>>common failure of the kiddie counter scripts and rarely one of a
>>'professional' script. if you want to have a working counter regardless
>>of the features, you need to do locking.
>>
>>so brushing off criticism like that is not doing you any good.
>>
>
> I'm not brushing anything off, except the advice to go find someone
> else's program. Most of the stuff from "other people" I've seen is almost
> as crappy as what I did myself, including glaring omissions like file
> locking. As far as functionality goes, most are much crappier.
If you had been lurking here for a while, you would have recognized
Tad's name as one who gives excellent advice (as does uri) and the
particular site Tad recommends as being one which many of the experts in
perl have contributed to. It was started as a direct result of many of
those "crappy" (this is probably too mild a word here) scripts out there
by people who wanted to do the community a BIG favor and give a robust
alternative to the bad coding out there. You might want to give it a
look and thank them for their advice, even if you don't use it now.
Cameron
--
Cameron Dorey
Associate Professor of Chemistry
University of Central Arkansas
Phone: 501-450-5938
camerond@mail.uca.edu
------------------------------
Date: 14 Jan 2003 11:28:39 -0800
From: mckenzie@bigmultimedia.com (Bruce McKenzie)
Subject: Re: The "default thing"
Message-Id: <bd848f76.0301141128.1584ec77@posting.google.com>
"Tassilo v. Parseval" <tassilo.parseval@post.rwth-aachen.de> wrote in message news:<b0152f$o8i$1@nets3.rz.RWTH-Aachen.DE>...
> Also sprach Bruce McKenzie:
>
> > I know this is risky, but let me try asking another way (I have been
> > using tied hashes and whatnot, but I don't handle the @_ array with
> > such concision).
> >
> > Is this sort of how it goes?
> > $bucks is tied, so when we say
> > $bucks = 45.00,
> >
> > we're saying something like "(using the methods defined in Centsible
> > class),
> > STORE($bucks, 45.00)"
>
> STORE is invoked, but with a different first argument:
>
> my $ret = tie $scalar, "Class";
> # scalar is now tied but behaves like an ordinary scalar
>
> $scalar = 45.00;
> # this invokes $ret->STORE(45.00);
> # which equals
> # Class::STORE($ret, 45.00)
>
> So the method used to implement an operation is not called with the tied
> variable itself but rather with the return value of tie(). This is in
> fact the variable holding an instance of 'Class'. Now you also see how
> tying relates to object-orientedness.
>
> > And, written less tersely, STORE becomes
> >
> > sub STORE {
> > # don't do confusing default thing ${ $_[0] } = $_[1] -- instead do
> > my ($self, $value) = @_; # $self is a ref to a scalar (self thinks :-)
> > $self = \$value; # $self now refs $value;
> > }
>
> Here $self is a copy of $ret from the above code. $self is not $scalar!
> If you do
>
> print $scalar;
>
> the following method is invoked:
>
> sub FETCH {
> my $self = shift;
> return $$self; # return what $self refers to
> }
>
> which would make the print-line equivalent to:
>
> print $ret->FETCH;
> # or
> print Class::FETCH($ret)
>
> Have you read 'perldoc perltie' already?
>
> Tassilo
Yes, but I think I'll understand it better next time. Thanks for the
lucid explanation.
Bruce
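
Putting the pieces of this exchange together, a complete tied scalar might
look like the sketch below. The class name Centsible is borrowed from the
question; the rounding behaviour is an assumption added so that STORE has
something visible to do. Note that STORE writes through the reference with
$$self, whereas assigning to $self itself (as in the quoted attempt) only
changes the local copy.

```perl
use strict;
use warnings;

# Illustrative class: a tied scalar that rounds every stored value
# to two decimal places.
package Centsible;

sub TIESCALAR {
    my ($class) = @_;
    my $value = 0;
    return bless \$value, $class;    # the object is a ref to a private scalar
}

sub STORE {
    my ($self, $value) = @_;
    $$self = sprintf('%.2f', $value);    # write through the reference
    return;
}

sub FETCH {
    my ($self) = @_;
    return $$self;                       # return what $self refers to
}

package main;

my $obj = tie my $bucks, 'Centsible';    # $obj is what STORE and FETCH receive
$bucks = 3.14159;                        # calls $obj->STORE(3.14159)
print "$bucks\n";                        # calls $obj->FETCH
```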
------------------------------
Date: 14 Jan 2003 12:59:59 -0800
From: Andrew.Allaire@na.teleatlas.com (Andrew Allaire)
Subject: Re: Variable naming convention
Message-Id: <6bdb91de.0301141259.7f0f3af5@posting.google.com>
falconflyr@snet.net (Pete) wrote in message news:<4ca21189.0301140802.153b57a3@posting.google.com>...
> Is there a way to dynamically define a set of variable names such that
> the name itself consists of alpha and numeric characters, but where
> the alpha portion remains the same and the numeric portion changes
> based on numbers in a loop or range? If the loop or range of numbers
> is say ($num=1, $num<=10, $num++), and the base variable name is
> $name, then I want my actual variable names to be like this: $name1,
> $name2, $name3, etc.
I think you are going against camel hair here. Why not use an array?
But just for the sake of an obscure exercise you could handle it like
this:
for (1..10) {
${'name' . $_} = $_ ;
}
print ( "name1 is $name1\n") ;
print ("name2 is $name2\n") ;
# and so forth
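
The array alternative suggested above might look like this sketch. The
symbolic-reference version relies on "no strict 'refs'" being in effect,
while this one runs under "use strict". The helper name build_names() is
made up for the example.

```perl
use strict;
use warnings;

# Build the values in one array instead of $name1 .. $name10.
# Index 0 is left unused so the indices match the numbers in the question.
sub build_names {
    my ($max) = @_;
    my @name;
    $name[$_] = $_ for 1 .. $max;
    return @name;
}

my @name = build_names(10);
print "name[1] is $name[1]\n";
print "name[2] is $name[2]\n";
```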
------------------------------
Date: 14 Jan 2003 13:28:40 -0800
From: stefan@borgia.com (Stefan Adams)
Subject: Re: Variable naming convention
Message-Id: <dcc927de.0301141328.223cdc0a@posting.google.com>
falconflyr@snet.net (Pete) wrote in message news:<4ca21189.0301140802.153b57a3@posting.google.com>...
> Is there a way to dynamically define a set of variable names such that
> the name itself consists of alpha and numeric characters, but where
> the alpha portion remains the same and the numeric portion changes
> based on numbers in a loop or range? If the loop or range of numbers
> is say ($num=1, $num<=10, $num++), and the base variable name is
> $name, then I want my actual variable names to be like this: $name1,
> $name2, $name3, etc.
It's called a hash. They're quite useful!!
for ( $num=1; $num<=10; $num++ ) {
$name{$num}=76
}
print $name{3};
It's not EXACTLY what you're looking for, but this is probably what you want.
Stefan
------------------------------
Date: 14 Jan 2003 12:28:13 -0800
From: ashokc@qualcomm.com (ashok)
Subject: Re: Why does this NOT work??
Message-Id: <c13619fd.0301141228.341906a9@posting.google.com>
Thank you all for pointing out my errors. I should have:
1. used "use struct;"
2. extracted the contents of a hash reference with "->" operator
3. used 'my' in several places.
4. a more meaningful subject for this post!! (next time)
Thanks again.
- ashok
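
The code that prompted this thread is not included in the digest, so the
following is only a generic sketch of the first three points: strict
enabled, variables declared with 'my', and a hash reference read with the
arrow operator. The sub name and field names are invented for the example.

```perl
use strict;      # catches undeclared variables and symbolic references
use warnings;

# Return a reference to a hash describing one record.
sub make_record {
    my ($name, $count) = @_;    # 'my' keeps these local to the sub
    return { name => $name, count => $count };
}

my $record = make_record('widget', 3);

# The arrow operator dereferences the hash reference.
print "$record->{name}: $record->{count}\n";
```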
------------------------------
Date: Tue, 14 Jan 2003 15:20:30 -0600
From: tadmc@augustmail.com (Tad McClellan)
Subject: Re: Why does this NOT work??
Message-Id: <slrnb28vou.8ft.tadmc@magna.augustmail.com>
ashok <ashokc@qualcomm.com> wrote:
> Thank you all for pointing out my errors. I should have:
>
> 1. used "use struct;"
               ^^^^^^ strict
> 4. a more meaningful subject for this post!! (next time)
You could have avoided those "should haves" if you'd seen
the Posting Guidelines that are posted here weekly:
http://mail.augustmail.com/~tadmc/clpmisc.shtml
--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas
------------------------------
Date: Tue, 14 Jan 2003 23:04:05 +0100
From: "Harald H.-J. Bongartz" <bongie@gmx.net>
Subject: Re: Why does this NOT work??
Message-Id: <1727314.ryalDC0rBX@nyoga.dubu.de>
ashok wrote:
> 1. used "use struct;"
'strict', that is. (Just a typo, I presume.)
Ciao,
Harald
--
Harald H.-J. Bongartz <bongie@gmx.net>
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"I stand by all the misstatements that I've made."
-- George W. Bush Jr.
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
The Perl-Users Digest is a retransmission of the USENET newsgroup
comp.lang.perl.misc. For subscription or unsubscription requests, send
the single line:
subscribe perl-users
or:
unsubscribe perl-users
to almanac@ruby.oce.orst.edu.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
To request back copies (available for a week or so), send your request
to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
where x is the volume number and y is the issue number.
For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending perl questions to the -request address, I don't have time to
answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V10 Issue 4404
***************************************