[22189] in Perl-Users-Digest
Perl-Users Digest, Issue: 4410 Volume: 10
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Wed Jan 15 18:05:46 2003
Date: Wed, 15 Jan 2003 15:05:09 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Wed, 15 Jan 2003 Volume: 10 Number: 4410
Today's topics:
Re: A Regular Problem <abigail@abigail.nl>
ActivePerl and file upload problem <gjewell@usdatalink.com>
Can anyone answer this? (Paulo)
Re: Can anyone answer this? (Walter Roberson)
CPAN install causes Winzip to Open and Error (Scott H)
Re: CPAN install causes Winzip to Open and Error (Ben Morrow)
DBI question with MySQL <nospam@nospam.org>
Re: DBI question with MySQL <nospam@nospam.org>
Re: DBI question with MySQL (Tad McClellan)
Forced switch from PERL to ASP/VBSCRIPT. Where do I beg <dharding@uiuc.edu>
Re: Forced switch from PERL to ASP/VBSCRIPT. Where do I <toms@dakcs.com>
Re: Forced switch from PERL to ASP/VBSCRIPT. Where do I <goldbb2@earthlink.net>
Re: Need Help (Jay Tilton)
Re: newbie: chapter 4 exercise Llama book (Randal L. Schwartz)
Re: Perl and Ruby (Daniel Berger)
Re: Perl and Ruby <goldbb2@earthlink.net>
Re: Problem with DBI <jeff@vpservices.com>
Re: Question about high performance spidering in perl <extendedpartition@NOSPAM.yahoo.com>
Re: Question about high performance spidering in perl <extendedpartition@NOSPAM.yahoo.com>
Re: Question about high performance spidering in perl <uri@stemsystems.com>
Re: Question about high performance spidering in perl <raffles2@att.net>
Re: Question about high performance spidering in perl <goldbb2@earthlink.net>
Re: Renaming files *.txt to 1234.txt (Stefan Adams)
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: 15 Jan 2003 22:56:01 GMT
From: Abigail <abigail@abigail.nl>
Subject: Re: A Regular Problem
Message-Id: <slrnb2bpt7.t5g.abigail@alexandra.abigail.nl>
Stephen Adam (00056312@brookes.ac.uk) wrote on MMMCDXXIV September
MCMXCIII in <URL:news:945bf980.0301141921.61f2384c@posting.google.com>:
?? Hi guys and girls,
??
?? I'm looking for a way of using a regular expression to find the nth
?? instance (or first if that's easier) of a string in an array and return
?? its position in that array. Here is the vague pseudocode:
??
?? $position = (@array =~ m/<P class=g>/);
??
?? Where $position does NOT hold a boolean value indicating whether the
?? string exists in the array but instead holds the value of its position.
my $N = 1; # Indicates _second_ instance.
my $RE = qr /<P class=g>/;
my $i;
my $p = (map {$_ -> [1]} grep {$$_ [0] =~ /$RE/} map {[$_, $i ++]} @array) [$N];
$p now contains the required index, or undef if none is found.
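For readers who find the map/grep/map chain hard to follow, here is a minimal sketch of the same idea that greps over the indices instead. The sub name nth_match_index and the sample @array are made up for illustration; $n is 0-based, as with $N above.

```perl
use strict;
use warnings;

# Return the index of the $n-th (0-based) element of @$list that
# matches $re, or undef if there are not that many matches.
sub nth_match_index {
    my ($re, $n, $list) = @_;
    my @hits = grep { $list->[$_] =~ $re } 0 .. $#$list;
    return $hits[$n];
}

# Hypothetical sample data.
my @array = ('<P class=x>', '<P class=g>one', 'text', '<P class=g>two');
my $p = nth_match_index(qr/<P class=g>/, 1, \@array);
print defined($p) ? "index $p\n" : "not found\n";   # prints "index 3"
```

Grepping over 0 .. $#array avoids the separate counter variable, at the cost of building the full list of matching indices first.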
Abigail
--
perl -wlpe '}$_=$.;{' file # Count the number of lines.
------------------------------
Date: Wed, 15 Jan 2003 21:33:37 GMT
From: "George Jewell" <gjewell@usdatalink.com>
Subject: ActivePerl and file upload problem
Message-Id: <RskV9.12083$Qr4.1195020@newsread1.prod.itd.earthlink.net>
Hello,
Our production web server (Win2K) has ActivePerl 5.8 installed - I have a
page which allows a user to upload a file to a folder. All works fine.
However, on my in-house test server (configured the same way), when trying
to upload the file, I get the following error message:
Couldn't open d:\inetpub\users\upload/cgi-lib.[xxxx].1 ( The xxxx is
usually a different 3 - 5 digit number.)
I've installed ActivePerl, and can run scripts from the command prompt, but
not through the browser.
Any ideas?
Thanks.
------------------------------
Date: 15 Jan 2003 08:14:24 -0800
From: threepio23@yahoo.com (Paulo)
Subject: Can anyone answer this?
Message-Id: <2a47eecc.0301150814.68e20b7d@posting.google.com>
What strings parse according to:
$phone =~ /^
(?:1-?)? # optional 1 or 1-
(?: # start alternate
\(
(\d{3}) # case of number enclosed in parentheses
\)
|
(\d{3}) # case of bare number
) # end alternate
-? # optional -
(\d{3})
-?
(\d{4})
$/x;
print "Parts are\n";
print "$1\n";
print "$2\n";
print "$3\n";
print "$4\n";
------------------------------
Date: 15 Jan 2003 16:28:10 GMT
From: roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson)
Subject: Re: Can anyone answer this?
Message-Id: <b0426q$fvj$1@canopus.cc.umanitoba.ca>
In article <2a47eecc.0301150814.68e20b7d@posting.google.com>,
Paulo <threepio23@yahoo.com> wrote:
:What strings parse according to:
:$phone =~ /^
: (?:1-?)? # optional 1 or 1-
[etc]
Urrr, you mean like "North American Phone numbers" ?
The longest strings that would match:
1-(405)-222-9876
You can omit all the punctuation from this, and you can omit the
leading 1. It would, though, not allow simply a 7 digit local number:
the area code is mandatory.
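A small sketch to check this by hand. The sub name parse_phone and the candidate strings are my own; the pattern is the one from the question. Note that only one of $1 and $2 is ever set, depending on which branch of the alternation matched, so the sketch picks whichever is defined.

```perl
use strict;
use warnings;

my $phone_re = qr/^
    (?:1-?)?          # optional 1 or 1-
    (?:
        \( (\d{3}) \) # area code in parentheses
      |
        (\d{3})       # bare area code
    )
    -?
    (\d{3})
    -?
    (\d{4})
$/x;

# Return (area, exchange, number) on a match, or an empty list.
sub parse_phone {
    my ($phone) = @_;
    my @m = $phone =~ $phone_re or return;
    my $area = defined $m[0] ? $m[0] : $m[1];
    return ($area, $m[2], $m[3]);
}

for my $candidate ('1-(405)-222-9876', '14052229876', '(405)222-9876',
                   '405-222-9876', '222-9876') {
    my @parts = parse_phone($candidate);
    printf "%-18s %s\n", $candidate, @parts ? join('-', @parts) : 'no match';
}
```

The first four candidates all come back as 405-222-9876; the seven-digit one reports no match.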
--
"Meme" is self-referential; memes exist if and only if the "meme" meme
exists. "Meme" is thus logically a meta-meme; but until the existence
of meta-memes is more widely recognized, "meta-meme" is not a meme.
-- A Child's Garden Of Memes
------------------------------
Date: 15 Jan 2003 11:34:38 -0800
From: shammond@att.com (Scott H)
Subject: CPAN install causes Winzip to Open and Error
Message-Id: <8086c70d.0301151134.1b15e84c@posting.google.com>
I need your help. OK, from the command prompt (Win2k box), I type
"perl -MCPAN -e shell", which brings me to the cpan> prompt. I then
type install LWP::UserAgent and hit <enter>. Every time, it pops up
Winzip like it wants me to add something to a zip file in
c:\perl\cpan\build\tmp\ dir. There is nothing to add nor is there
anything in the main winzip window (that opened). I also get a dialog
that says "winzip parameter validation error".
I'm wondering if this is due to my original configuration. I used the
c:\progra~1\winzip\winzip.exe file as the application for gzip, tar
and unzip config options.
Can you help?
Thanks,
Scott
------------------------------
Date: Wed, 15 Jan 2003 21:57:14 +0000 (UTC)
From: mauzo@mimosa.csv.warwick.ac.uk (Ben Morrow)
Subject: Re: CPAN install causes Winzip to Open and Error
Message-Id: <b04lfq$g1f$1@wisteria.csv.warwick.ac.uk>
shammond@att.com (Scott H) wrote:
>I need your help. OK, from the command prompt (Win2k box), I type
>"perl -MCPAN -e shell", which brings me to the cpan> prompt. I then
>type install LWP::UserAgent and hit <enter>. Every time, it pops up
>Winzip like it wants me to add something to a zip file in
>c:\perl\cpan\build\tmp\ dir. There is nothing to add nor is there
>anything in the main winzip window (that opened). I also get a dialog
>that says "winzip parameter validation error".
>
>I'm wondering if this is due to my original configuration. I used the
>c:\progra~1\winzip\winzip.exe file as the application for gzip, tar
>and unzip config options.
I guess from this that you built Perl yourself?
The solution to your problem is to rebuild Perl, specifying command-line
tools to handle those things. The tools you want are called (surprise) gzip,
tar and unzip, respectively. If you built Perl using Cygwin, then you can use
the cygwin version out of c:/cygwin/bin (or wherever): you may need to install
the packages with the cygwin package mangler. If you're using MinGW or VC then
you can get versions from http://unxutils.sourceforge.net/. I can't quite
work out from that page if it includes unzip or not: if it doesn't, there's a
link to info-zip further down the page, who have Win32 native versions.
Ben
------------------------------
Date: Wed, 15 Jan 2003 14:37:17 -0500
From: "Christian Caron" <nospam@nospam.org>
Subject: DBI question with MySQL
Message-Id: <b04d9d$bj64@nrn2.NRCan.gc.ca>
Hi all,
I'm executing this select statement:
my $query = $database_read->prepare("SELECT COUNT(*) FROM Requests WHERE
uniqueid=?");
$query->execute($username) or die $query->errstr, "\n";
my $data = $query->fetchall_arrayref or die $query->errstr, "\n";
$query->finish;
print "data = $data->[0][0]\n";
Everything works fine, but isn't there a better way to get my result (count)
than "$data->[0][0]"? If I print $data, I get a strange string (I guess it's
an array reference, as I requested it with "fetchall_arrayref"). Instead of
using "fetchall_arrayref", could I use something else, as I definitely know
it will return only one result, the count?
I read a bit of the DBI documentation on cpan.org, but I can't find a
place that lists all the possible methods.
Thanks!
Christian
------------------------------
Date: Wed, 15 Jan 2003 14:42:57 -0500
From: "Christian Caron" <nospam@nospam.org>
Subject: Re: DBI question with MySQL
Message-Id: <b04dk2$bi36@nrn2.NRCan.gc.ca>
"Christian Caron" <nospam@nospam.org> wrote in message
news:b04d9d$bj64@nrn2.NRCan.gc.ca...
> Everything works fine, but isn't there a better way to get my result
> (count) than "$data->[0][0]"? If I print $data, I get a strange string
> (I guess it's an array reference as I requested it by
> "fetchall_arrayref"). Instead of using "fetchall_arrayref", could I use
> something else as I definitely know it will return only one result,
> the count?
>
> I read a bit the cpan.org documentation page about DBI, but I can't find a
> place where they would list all the possible commands.
>
Sorry, I just read http://www.mysql.com/doc/en/Perl_DBI_Class.html and found
out four (only?) possible methods:
-fetchrow_array - Fetches the next row as an array of fields.
-fetchrow_arrayref - Fetches next row as a reference array of fields.
-fetchrow_hashref - Fetches next row as a reference to a hashtable.
-fetchall_arrayref - Fetches all data as an array of arrays.
Are there any others?
------------------------------
Date: Wed, 15 Jan 2003 14:50:51 -0600
From: tadmc@augustmail.com (Tad McClellan)
Subject: Re: DBI question with MySQL
Message-Id: <slrnb2bidb.c4l.tadmc@magna.augustmail.com>
Christian Caron <nospam@nospam.org> wrote:
> I'm executing this select statement:
>
> my $query = $database_read->prepare("SELECT COUNT(*) FROM Requests WHERE
> uniqueid=?");
> $query->execute($username) or die $query->errstr, "\n";
> my $data = $query->fetchall_arrayref or die $query->errstr, "\n";
> $query->finish;
> print "data = $data->[0][0]\n";
>
> Everything works fine, but isn't there a better way to get my result (count)
> than "$data->[0][0]"?
Use selectrow_array() in a scalar context:
perldoc DBI
"selectrow_array"
@row_ary = $dbh->selectrow_array($statement);
@row_ary = $dbh->selectrow_array($statement, \%attr);
@row_ary = $dbh->selectrow_array($statement, \%attr, @bind_values);
This utility method combines "prepare", "execute" and
"fetchrow_array" into a single call. If called in a list context, it
returns the first row of data from the statement. If called in a
scalar context, it returns the first field of the first row...
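Applied to the query in question, that might look like the sketch below. The sub name count_requests is made up, and $dbh is assumed to be an already connected DBI database handle (the original post calls it $database_read). The error message is built by concatenation because a method call such as "$dbh->errstr" does not interpolate inside a double-quoted string; only $dbh would.

```perl
use strict;
use warnings;

# Return the number of rows in Requests for one uniqueid.
# $dbh is assumed to be a connected DBI database handle.
sub count_requests {
    my ($dbh, $username) = @_;

    # Scalar context: first field of the first row, i.e. the count.
    my $count = $dbh->selectrow_array(
        'SELECT COUNT(*) FROM Requests WHERE uniqueid = ?',
        undef,          # no statement attributes
        $username,      # bind value for the placeholder
    );
    die 'count query failed: ' . $dbh->errstr . "\n"
        unless defined $count;
    return $count;
}
```

Usage would be along the lines of: print "data = ", count_requests($database_read, $username), "\n";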
--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas
------------------------------
Date: Wed, 15 Jan 2003 13:10:10 CST
From: Dan <dharding@uiuc.edu>
Subject: Forced switch from PERL to ASP/VBSCRIPT. Where do I begin?
Message-Id: <81iV9.12677$Vf3.133412@vixen.cso.uiuc.edu>
Due to a merger of university departments, I am potentially being forced
to change the way I do web development. I've always used PERL for my CGI
development. I'm now being told by the new department head that "he
doesn't want PERL running on any of his servers; it's too CPU-intensive"
so I must now do all coding in VBScript/ASP. Where do I start? (other
than finding a new job). Are there any "VB for PERL Aficionados" types
of books or resources? What do you recommend for language references
(downloadable/printable preferred)? How about "teach yourself" books?
Assume I've never had any exposure to Visual Basic.
Not only do I have to learn the language(s)/techniques, but I then have
to recode all existing applications. Yay.
I'm doing a bit of a panic/scramble dance here...
Thanks in advance,
-Dan
--
PLEASE NOTE: comp.infosystems.www.authoring.cgi is a
SELF-MODERATED newsgroup. aa.net and boutell.com are
NOT the originators of the articles and are NOT responsible
for their content.
HOW TO POST to comp.infosystems.www.authoring.cgi:
http://www.thinkspot.net/ciwac/howtopost.html
------------------------------
Date: Wed, 15 Jan 2003 14:15:52 CST
From: "Tom Shelton" <toms@dakcs.com>
Subject: Re: Forced switch from PERL to ASP/VBSCRIPT. Where do I begin?
Message-Id: <0fjV9.80$iw.115323@news.uswest.net>
Dan,
I think the problem is more with CGI than PERL... Have you checked out
Active State PERL? It allows you to use PerlScript as your language for ASP
development... That way, you can do the ASP thang, but continue to
leverage your PERL skills. Maybe you could at least check it out.
Tom Shelton
"Dan" <dharding@uiuc.edu> wrote in message
news:81iV9.12677$Vf3.133412@vixen.cso.uiuc.edu...
> Due to a merger of university departments, I am potentially being forced
> to change the way I do web development. I've always used PERL for my CGI
> development. I'm now being told by the new department head that "he
> doesn't want PERL running on any of his servers; it's too CPU-intensive"
> so I must now do all coding in VBScript/ASP. Where do I start? (other
> than finding a new job). Are there any "VB for PERL Aficionados" types
> of books or resources? What do you recommend for language references
> (downloadable/printable preferred)? How about "teach yourself" books?
> Assume I've never had any exposure to Visual Basic.
>
> Not only do I have to learn the language(s)/techniques, but I then have
> to recode all existing applications. Yay.
>
> I'm doing a bit of a panic/scramble dance here...
>
> Thanks in advance,
>
> -Dan
------------------------------
Date: Wed, 15 Jan 2003 14:59:58 CST
From: Benjamin Goldberg <goldbb2@earthlink.net>
Subject: Re: Forced switch from PERL to ASP/VBSCRIPT. Where do I begin?
Message-Id: <3E25CE2C.BD455CAB@earthlink.net>
Dan wrote:
>
> Due to a merger of university departments, I am potentially being
> forced to change the way I do web development. I've always used PERL
> for my CGI development. I'm now being told by the new department head
> that "he doesn't want PERL running on any of his servers; it's too
> CPU-intensive"
This is a foolish statement.
The overhead of CGI is the starting up of a new instance of a script for
each http request. It does not matter what language you write your CGI
programs in, you'll still have this overhead.
Changing to ASP allows your program to be run by an ASP interpreter
embedded into the web server.
Tell your administrator about mod_perl, which allows a perl interpreter
to be embedded into a web server.
--
$..='(?:(?{local$^C=$^C|'.(1<<$_).'})|)'for+a..4;
$..='(?{print+substr"\n !,$^C,1 if $^C<26})(?!)';
$.=~s'!'haktrsreltanPJ,r coeueh"';BEGIN{${"\cH"}
|=(1<<21)}""=~$.;qw(Just another Perl hacker,\n);
------------------------------
Date: Wed, 15 Jan 2003 22:55:42 GMT
From: tiltonj@erols.com (Jay Tilton)
Subject: Re: Need Help
Message-Id: <3e25e559.62737707@news.erols.com>
Dr P Singh <psinghp@emirates.net.ae> wrote:
: Jay Tilton wrote:
:
: > Dr P Singh <psinghp@emirates.net.ae> wrote:
: >
: > : (1) I get this warning message "Constant Subroutine emptyenum redefined at
: > : c:/site/lib/win32/ole/constant.pm line 65535. I get a lot of this message.
:
: The actual message is like this. I could not cut and paste so to type, missed the
: 'perl/' bit in the process.
:
: c:/perl/site/lib/win32/ole/constant.pm line 65535.
How about the "constant.pm" part? Did you mean to type "Const.pm"?
: > For a guess, from looking at the Printout method in Word help, the
: > "PrintToFile" and "OutputFileName" arguments look promising.
:
: I did try that but it pops up another window asking for file name.
This cannot be a unique problem. Adobe surely has anticipated the
need and has provided the means to get it done. PDFWriter and its API
are where efforts should be focused.
A language-specific newsgroup like clpm is not the ideal place to find
the solution, since Perl's involvement is only incidental. You will
need to find out what has to be done before writing code to do it, in
Perl or any other language.
A newsgroup like comp.text.pdf would be more likely to have readers
who have accomplished this (groups.google.com would be an excellent
resource to consult first), or who can put you on the trail of an even
more suitable newsgroup. If a Perl answer exists, great. If the
answer is in some other language, it can be done in Perl as well.
------------------------------
Date: 15 Jan 2003 10:32:39 -0800
From: merlyn@stonehenge.com (Randal L. Schwartz)
Subject: Re: newbie: chapter 4 exercise Llama book
Message-Id: <86el7emfg8.fsf@red.stonehenge.com>
>>>>> "Michael" == Michael Budash <mbudash@sonic.net> writes:
Michael> In article <3e246a62@news.sahara.com.sa>, Eri Mendz <eriz00@yahoo.com>
Michael> wrote:
>> hi to all,
>>
>> im stumped how to get total of numbers using <STDIN>. This is in
>> exercise 1 chapter 4 of Llama book 3rd ed:
Michael> my @barney = split /\s+/, $input;
That's all well and good, but not available to the student by chapter
4.
The answer is simply:
my $sum = total(<STDIN>);
or more explicitly:
my @lines = <STDIN>;
my $sum = total(@lines);
These are explained in the book. Sorry if the explanation is
insufficient -- we're always open to feedback.
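For anyone following along without the book to hand, here is a minimal sketch of what a total() subroutine for this exercise could look like. It is one possible solution, not the book's text. An in-memory filehandle stands in for STDIN so the example runs unattended; in the real exercise you would read <STDIN> as shown above.

```perl
use strict;
use warnings;

# Add up every number passed in and return the sum.
sub total {
    my $sum = 0;
    foreach my $number (@_) {
        $sum += $number;
    }
    return $sum;
}

# Stand-in for <STDIN>: three lines of made-up input.
my $input = "4\n8\n15\n";
open my $in, '<', \$input or die "cannot open in-memory input: $!";

my @lines = <$in>;          # list context: one element per line
my $sum   = total(@lines);
print "The total is $sum.\n";   # prints "The total is 27."
```

The trailing newline on each line does no harm here, since Perl ignores trailing whitespace when it converts a string to a number.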
print "Just another Perl [book] hacker,"
--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
------------------------------
Date: 15 Jan 2003 12:49:59 -0800
From: djberg96@hotmail.com (Daniel Berger)
Subject: Re: Perl and Ruby
Message-Id: <6e613a32.0301151249.3c9fd4fa@posting.google.com>
"Jürgen Exner" <jurgenex@hotmail.com> wrote in message news:<erZT9.3045$%V.517@nwrddc02.gnilink.net>...
> I just read a magazine article about Ruby and I was wondering if some
> Perlers have any experience with that language.
> How does it compare to Perl? Where are the individual strengths and
> weaknesses of each language when compared to each other?
>
> jue
The ultra quick version:
Ruby's strengths over Perl:
1) Cleaner syntax
2) Better OO model (and syntax)
3) Easier to write extensions (by far)
4) Threads work and are stable (as compared to Perl 5.6.1 and earlier)
5) Structured exceptions (ties into #2 I guess)
Perl's strengths over Ruby:
1) CPAN (although the RAA is coming along)
2) Faster (by about 5-10% - very rough estimate, depending on code)
3) More flexible when it comes to calling context (e.g. wantarray)
4) Easier to write with in procedural style, imo (although you can use
that style in Ruby, too)
Regards,
Dan
------------------------------
Date: Wed, 15 Jan 2003 16:44:24 -0500
From: Benjamin Goldberg <goldbb2@earthlink.net>
Subject: Re: Perl and Ruby
Message-Id: <3E25D638.3E86AAFD@earthlink.net>
Daniel Berger wrote:
[snip]
> The ultra quick version:
>
> Ruby's strengths over Perl:
>
> 1) Cleaner syntax
> 2) Better OO model (and syntax)
> 3) Easier to write extensions (by far)
> 4) Threads work and are stable (as compared to Perl 5.6.1 and earlier)
> 5) Structured exceptions (ties into #2 I guess)
>
> Perl's strengths over Ruby:
>
> 1) CPAN (although the RAA is coming along)
> 2) Faster (by about 5-10% - very rough estimate, depending on code)
> 3) More flexible when it comes to calling context (e.g. wantarray)
> 4) Easier to write with in procedural style, imo (although you can use
> that style in Ruby, too)
6) Lexical variables must be explicitly declared.
How good is Ruby's internationalization support (UTF-8 and such)?
Perl 5.8 has full Unicode support, and there's the Encode module; what
does Ruby have?
How good is Ruby's thread support compared to perl5.8's threads?
What kind of IO model does Ruby have? Is there anything like perlio's
layers?
--
$..='(?:(?{local$^C=$^C|'.(1<<$_).'})|)'for+a..4;
$..='(?{print+substr"\n !,$^C,1 if $^C<26})(?!)';
$.=~s'!'haktrsreltanPJ,r coeueh"';BEGIN{${"\cH"}
|=(1<<21)}""=~$.;qw(Just another Perl hacker,\n);
------------------------------
Date: Wed, 15 Jan 2003 09:55:19 -0800
From: Jeff Zucker <jeff@vpservices.com>
Subject: Re: Problem with DBI
Message-Id: <3E25A087.3010802@vpservices.com>
Jason Singleton wrote:
> Can't locate auto/DBI/prepare.al in @INC (@INC contains: C:/Perl/lib
> C:/Perl/site/lib . C:/Apache2/cgi-bin/WebOPAC/packages) at
> drivers/text.pl line 27 Compilation failed in require at
> C:/Apache2/cgi-bin/WebOPAC/webopac.cgi line 69.
It looks like DBI is not installed. Use PPM to install it if you are
using ActiveState.
--
Jeff
------------------------------
Date: Wed, 15 Jan 2003 10:14:14 -0600
From: Extended Partition <extendedpartition@NOSPAM.yahoo.com>
Subject: Re: Question about high performance spidering in perl
Message-Id: <9t1b2vs634d7b9k4d1cm7ttvdqi11eb5im@4ax.com>
>> I am looking at a project that aims to create a high performance
>> spider program to assist in internet searches. Actually, it's really a
>> combo program that combines spider and search agent technology
>> together if that makes any difference.
>
>This sounds like a bad idea. Do us all a favor and use google.
Well, it's a mandatory thing, so bad idea or not I'm probably going to
have to do it. And I have used Google. So do me a favor and don't make
assumptions :-)
>> The program will basically crawl
>> the web (starting at an arbitrary URI)
>
>Hopefully a URI internal to your organization, that stays there.
Unfortunately, no, it isn't. The URI could be ANY URI on the internet.
Like I said, it's arbitrary.
>> and search pages for specific
>> target words. If it finds a target word or phrase on a page then it
>> will save that page to a db.
>
>You had better save that page to a db whether it has the word or phrase
>or not, or you will get stuck in an infinite loop.
Good idea. I could just put the URI in a "visited" table and check
that before going out to a new page to see if it's been visited
before. But then that adds even more overhead to the program :-(
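For what it's worth, the bookkeeping itself can be very cheap. A minimal sketch, assuming the set of visited URIs fits in memory (a hash here; a database table with a unique index on the URI would be the persistent equivalent). The sub name first_visit and the example URIs are made up.

```perl
use strict;
use warnings;

my %visited;

# True the first time a URI is offered, false on every later call.
sub first_visit {
    my ($uri) = @_;
    return !$visited{$uri}++;
}

for my $uri (qw(http://example.com/ http://example.com/a http://example.com/)) {
    print first_visit($uri) ? "fetch $uri\n" : "skip  $uri\n";
}
```

A hash lookup costs far less than the HTTP request it saves, so the check should be a net win rather than extra overhead.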
>> If not, it moves to the next page. In
>> addition, it needs to link out from each page to every page the
>> current page links to.
>
>Someone is going to interpret this as a DOS attack and nuke you.
How would they think it's a DOS attack? Wouldn't it just look like a
spider crawling the page? Hmmm...another issue to consider I
suppose...
>> It needs to do this over and over and over. In
>> effect "crawl" the web. My question is NOT how to do this as I have a
>> good understanding of what it needs to do. My question rather is
>> "should I use Perl to do it"?
>
>If I were to do this in the first place, I'd use Perl and Mysql
>(or some fancier rdbms, if you have one.).
I'm certainly not set on MySQL as the backend db. But I'm wanting to
make this solution as cost effective to my client as possible. That's
the main reason for my choice of MySQL.
Thanks for your input,
Extended
------------------------------
Date: Wed, 15 Jan 2003 10:15:37 -0600
From: Extended Partition <extendedpartition@NOSPAM.yahoo.com>
Subject: Re: Question about high performance spidering in perl
Message-Id: <b72b2v8u3vkp9nvl8jrq08gjs9p8255ve9@4ax.com>
<snip>
>Why don't you use a package that's already been written - in Perl - to
>do what you want? Use Harvest-NG, a powerful open source spidering
>and indexing system. http://webharvest.sourceforge.net/ng/.
I considered that. But my client has some specialized requirements
that would require an almost total rewrite of an existing package. All
in all, it's easier to develop from scratch.
Extended
------------------------------
Date: Wed, 15 Jan 2003 17:31:01 GMT
From: Uri Guttman <uri@stemsystems.com>
Subject: Re: Question about high performance spidering in perl
Message-Id: <x77kd6l3qi.fsf@mail.sysarch.com>
>>>>> "EP" == Extended Partition <extendedpartition@NOSPAM.yahoo.com> writes:
EP> <snip>
>> Why don't you use a package that's already been written - in Perl - to
>> do what you want? Use Harvest-NG, a powerful open source spidering
>> and indexing system. http://webharvest.sourceforge.net/ng/.
EP> I considered that. But my client has some specialized requirements
EP> that would require an almost total rewrite of an existing package. All
EP> in all, it's easier to develop from scratch.
i have developed 2 major crawlers (one in c, the other in perl) and
there are many issues to deal with and it will take you longer to do
than you would think. the biggest issue usually is scaling more than
crawling logic. depending on how many sites you want to crawl in total
and how often you want to crawl them, you can choose many different
crawler architectures. also you say you need special processing which
can be the hardest part (it was in the perl crawler i did. WAY too many
special crunching rules). some of the processing steps may cause major
design changes in how the crawler works (e.g. rules to extract new urls,
revisit url frequency). i wouldn't recommend you tackle this without
getting experienced help before you get in too deep. the perl crawler i
did was a complete rewrite of a very bad program the client had done for
them. unfortunately most of the crawling and processing rules were
already encoded in that program and we had to painfully analyze it to
keep the same rules. had we started from a fresh design, the whole thing
would have been easier. so don't just code up this with the first perl
hacker you can hire. realize that it is a critical part of your system
and you should get it done correctly the first time. this is a
professional and courteous piece of advice for you. real world class
crawlers are not trivial.
uri
--
Uri Guttman ------ uri@stemsystems.com -------- http://www.stemsystems.com
----- Stem and Perl Development, Systems Architecture, Design and Coding ----
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
Damian Conway Perl Classes - January 2003 -- http://www.stemsystems.com/class
------------------------------
Date: Wed, 15 Jan 2003 17:39:01 GMT
From: "Ron Ruble" <raffles2@att.net>
Subject: Re: Question about high performance spidering in perl
Message-Id: <V0hV9.112212$hK4.9132696@bgtnsc05-news.ops.worldnet.att.net>
"Extended Partition" <extendedpartition@NOSPAM.yahoo.com> wrote in message news:j9q82v8qtrqanmgpfs3bb4jv0mrtcv91r1@4ax.com...
> Hello Everyone,
>
> I am looking at a project that aims to create a high performance
> spider program to assist in internet searches.
In addition to the replies you've already received,
check out the FAQs:
http://www.robotstxt.org/wc/guidelines.html
http://www.robotstxt.org/wc/robots.html
Particularly the ones about avoiding p***ing people
off. What you need to do is not only design your
algorithms for good performance from your standpoint,
but also from the server's standpoint.
A _lot_ of sites have banned all sorts of automated
search features like this, because of people who
didn't play by the rules. Don't make it worse.
From the Robot Writer's Guidelines, above:
"Walk, don't run
Make sure your robot runs slowly: although robots can handle
hundreds of documents per minute, this puts a large strain on
a server, and is guaranteed to infuriate the server maintainer.
Instead, put a sleep in, or if you're clever rotate queries
between different servers in a round-robin fashion."
Don't hammer any single server; distribute the load,
searching broadly rather than drilling down fast.
Also pay attention to the exclusions and announcing
yourself guidelines.
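As a rough illustration of the "put a sleep in" advice, here is a minimal sketch that works out how long to wait before touching a host again. The sub name, the ten-second gap and the host names are assumptions for the example, not values from the guidelines. (libwww-perl also ships LWP::RobotUA, which handles robots.txt and a per-host delay for you; this sketch only shows the arithmetic.)

```perl
use strict;
use warnings;

# How many seconds to wait before fetching from $host again, given a
# hash of host => time of last fetch and a minimum gap in seconds.
sub seconds_until_allowed {
    my ($last_fetch, $host, $now, $min_gap) = @_;
    return 0 unless exists $last_fetch->{$host};
    my $wait = $last_fetch->{$host} + $min_gap - $now;
    return $wait > 0 ? $wait : 0;
}

# Made-up timestamps: www.example.com was last fetched at t=1000.
my %last_fetch = ('www.example.com' => 1000);
print seconds_until_allowed(\%last_fetch, 'www.example.com',   1004, 10), "\n";  # 6
print seconds_until_allowed(\%last_fetch, 'other.example.org', 1004, 10), "\n";  # 0
```

In a real crawler you would sleep for the returned number of seconds (or, better, go and fetch from a different host in the meantime) and record time() in %last_fetch after each request.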
------------------------------
Date: Wed, 15 Jan 2003 14:16:11 -0500
From: Benjamin Goldberg <goldbb2@earthlink.net>
Subject: Re: Question about high performance spidering in perl
Message-Id: <3E25B37B.DBDC1646@earthlink.net>
Extended Partition wrote:
[snip]
> My question rather is "should I use Perl to do it"?
It depends entirely on your processing needs.
I'll assume that the 'web crawling' part is fairly standard, and can be
handled by any existing (C or perl) crawler.
How many perlish features are needed in identifying the presence of your
particular keywords?
Or, to be more specific... how hard would it be to write a parser in
"lex" or "flex" which does what you need?
> The program itself won't be massive (perhaps a few thousand lines at
> most)
Don't make guesses about size this early -- it's not uncommon to run
into numerous special cases, and the code for each of them makes your
program expand a few dozen lines. This adds up.
> but the recursive and repetitive nature of its functionality might incur
> some overhead.
A web crawler should NOT be recursive. That would be very poor design.
A recursive design would be something like the following pseudocode:
webcrawl(url) {
  x = fetch(url)
  foreach u in ( extract urls from x ) {
    webcrawl(u) unless already_fetched(u)
  }
}
That's just sooo wrong, since you'll have little control of the order
that pages are fetched, and the depth of recursion can get quite high.
A better design is:
webcrawl(url) {
  q = new priority_queue;
  q.enqueue(url);
  until( q.empty ) {
    x = fetch(q.dequeue);
    foreach u in ( extract urls from x ) {
      q.enqueue(u) unless already_fetched(u)
    }
  }
}
This allows you, through your priority queue implementation, to have
*complete* control over when urls are fetched, and it avoids recursion
of any depth.
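A runnable Perl sketch of the queue-based loop, under some loud assumptions: the %web hash and fetch_links() are hypothetical stand-ins for real HTTP fetching and link extraction, and a plain FIFO array stands in for the priority queue. It also marks a url as seen when it is enqueued rather than when it is fetched, so the same url cannot sit in the queue twice.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the network: url => links found on that page.
my %web = (
    'http://a.example/'  => ['http://a.example/1', 'http://b.example/'],
    'http://a.example/1' => ['http://a.example/'],
    'http://b.example/'  => ['http://a.example/1', 'http://c.example/'],
    'http://c.example/'  => [],
);

# Pretend to fetch a page and extract its links.
sub fetch_links {
    my ($url) = @_;
    return @{ $web{$url} || [] };
}

# Iterative crawl: returns the urls in the order they were fetched.
sub crawl {
    my ($start) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);
    my @order;
    while (@queue) {
        my $url = shift @queue;            # dequeue
        push @order, $url;
        for my $link (fetch_links($url)) {
            push @queue, $link unless $seen{$link}++;   # enqueue once
        }
    }
    return @order;
}

print "$_\n" for crawl('http://a.example/');
```

With the sample data this visits a.example/, a.example/1, b.example/ and c.example/ in that order, and the loop never nests no matter how deep the link structure goes. Swapping the shift for a smarter dequeue is where the priority policy would live.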
As others have said, you want to avoid hitting any single webserver too
many times in a short period. One way to do this is, each time you
fetch a url from a particular webserver, all other urls from that
particular server get a lower priority than any url from any other
webserver. At the same time, though, you can design your queue in such
a way that if you've just finished fetching a url from a server which
implements connection:keep-alive, you can extract from your queue
another url from that server, so as to keep using that connection.
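One way the "lower the priority of the server you just hit" rule could be sketched is below. The sub names and urls are made up, the host extraction is deliberately simplistic, and a linear scan stands in for a real priority queue; the keep-alive refinement is left out.

```perl
use strict;
use warnings;

# Extract the host part of an http(s) url (deliberately simplistic).
sub host_of {
    my ($url) = @_;
    return $url =~ m{^https?://([^/:]+)}i ? lc $1 : '';
}

# Tick at which a url's host was last hit, or -1 if never.
sub last_hit_of {
    my ($url, $last_hit) = @_;
    my $host = host_of($url);
    return exists $last_hit->{$host} ? $last_hit->{$host} : -1;
}

# Remove and return the queued url whose host was hit longest ago.
sub next_url {
    my ($queue, $last_hit) = @_;
    return unless @$queue;
    my $best = 0;
    for my $i (1 .. $#$queue) {
        $best = $i
            if last_hit_of($queue->[$i], $last_hit)
             < last_hit_of($queue->[$best], $last_hit);
    }
    my ($url) = splice @$queue, $best, 1;
    return $url;
}

my @queue = qw(http://a.example/1 http://a.example/2 http://b.example/1);
my (%last_hit, @order);
my $tick = 0;
while (defined(my $url = next_url(\@queue, \%last_hit))) {
    $last_hit{ host_of($url) } = ++$tick;   # record the hit
    push @order, $url;
}
print "$_\n" for @order;
```

Here the two a.example urls end up separated by the b.example one, even though they were queued back to back.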
--
$..='(?:(?{local$^C=$^C|'.(1<<$_).'})|)'for+a..4;
$..='(?{print+substr"\n !,$^C,1 if $^C<26})(?!)';
$.=~s'!'haktrsreltanPJ,r coeueh"';BEGIN{${"\cH"}
|=(1<<21)}""=~$.;qw(Just another Perl hacker,\n);
------------------------------
Date: 15 Jan 2003 08:18:38 -0800
From: stefan@borgia.com (Stefan Adams)
Subject: Re: Renaming files *.txt to 1234.txt
Message-Id: <dcc927de.0301150818.66aa9b88@posting.google.com>
"Paul Tomlinson" <rubberducky703@hotmail.com> wrote in message news:<b03hjr$kt805$1@ID-116287.news.dfncis.de>...
> Renaming files *.txt to 1234.txt
>
> I need to rename every .txt file in a directory to 1234.txt. I understand
> that this will cause some files to get overridden, but this is ok. Is there
> any one line expression that will do this sort of thing for me?
Some? I would think all except the last one...
Since we're in the Perl group...
$ perl -e 'foreach ( glob("*.txt") ) { rename $_, "1234.txt" }'
The way I'd accomplish that task...
$ for i in *.txt; do mv "$i" 1234.txt; done
You might want to apply some logic to this so that it sorts by last
modified time or something. That way you'd have at least some sort of
an idea of what will end up in that 1234.txt file.
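A sketch of that "sort by last modified time" idea, assuming you want the most recently modified file to be the survivor. The sub name rename_oldest_first is made up, and the demo runs in a throwaway temporary directory with invented file names and timestamps so it cannot touch real files.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Rename every .txt file in $dir to $target, oldest first, so the most
# recently modified file is the one left standing. Returns the order used.
sub rename_oldest_first {
    my ($dir, $target) = @_;
    opendir my $dh, $dir or die "opendir $dir: $!";
    my @names = grep { /\.txt\z/ && $_ ne $target } readdir $dh;
    closedir $dh;
    my @ordered = map  { $_->[1] }
                  sort { $a->[0] <=> $b->[0] }
                  map  { [ (stat "$dir/$_")[9], $_ ] } @names;
    for my $name (@ordered) {
        rename "$dir/$name", "$dir/$target" or die "rename $name: $!";
    }
    return @ordered;
}

# Demo: three files whose ages (in seconds) are set by hand.
my $dir = tempdir(CLEANUP => 1);
my $now = time;
my %age = ('a.txt' => 300, 'b.txt' => 100, 'c.txt' => 200);
for my $name (sort keys %age) {
    open my $fh, '>', "$dir/$name" or die "create $name: $!";
    print {$fh} $name;
    close $fh;
    utime $now - $age{$name}, $now - $age{$name}, "$dir/$name";
}
print join(' ', rename_oldest_first($dir, '1234.txt')), "\n";
```

The files are renamed in the order a.txt, c.txt, b.txt, so 1234.txt ends up holding what used to be b.txt, the newest of the three.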
Or, better yet, delete all of the .txt files except the one that you
want to eventually become 1234.txt.
Since we're in the Perl group...
$ perl -e 'foreach ( grep { !/^keeper\.txt$/ } glob("*.txt") ) { unlink $_ } rename "keeper.txt", "1234.txt"'
The way I'd accomplish that task...
$ mv keeper.txt keeper && rm *.txt && mv keeper 1234.txt
Much love.
Stefan
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
The Perl-Users Digest is a retransmission of the USENET newsgroup
comp.lang.perl.misc. For subscription or unsubscription requests, send
the single line:
subscribe perl-users
or:
unsubscribe perl-users
to almanac@ruby.oce.orst.edu.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
To request back copies (available for a week or so), send your request
to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
where x is the volume number and y is the issue number.
For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending perl questions to the -request address, I don't have time to
answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V10 Issue 4410
***************************************