Perl-Users Digest, Issue: 1605 Volume: 11
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Tue Jun 3 11:09:49 2008
Date: Tue, 3 Jun 2008 08:09:14 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Tue, 3 Jun 2008 Volume: 11 Number: 1605
Today's topics:
Counting lines in big number of files - in parallel. hadzio@gmail.com
Re: Counting lines in big number of files - in parallel <RedGrittyBrick@SpamWeary.foo>
Re: Counting lines in big number of files - in parallel hadzio@gmail.com
Re: Counting lines in big number of files - in parallel <1usa@llenroc.ude.invalid>
Re: Counting lines in big number of files - in parallel <simon.chao@fmr.com>
Re: Counting lines in big number of files - in parallel <tzz@lifelogs.com>
Re: Counting lines in big number of files - in parallel <someone@example.com>
Re: HTML::Tokerparser / parsing <p>,<br> <mislam@nospam.ciuc.edu>
Re: HTML::Tokerparser / parsing <p>,<br> <mislam@nospam.ciuc.edu>
Re: Perl grep and Perl 4 <noreply@gunnar.cc>
Re: Perl grep and Perl 4 fourfour2@gmail.com
Re: Perl grep and Perl 4 <devnull4711@web.de>
Re: Perl grep and Perl 4 <tadmc@seesig.invalid>
Re: Perl grep and Perl 4 (Randal L. Schwartz)
Re: Perl grep and Perl 4 <1usa@llenroc.ude.invalid>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Tue, 3 Jun 2008 02:14:26 -0700 (PDT)
From: hadzio@gmail.com
Subject: Counting lines in big number of files - in parallel.
Message-Id: <c4857a09-1348-4b5c-8b42-affc273f949d@k37g2000hsf.googlegroups.com>
Hi,
I have the following issue. I have a directory with 25000 text files
in it (about 1-10000 lines in each file). I have a Perl script that
generates some reports for me, and this script needs to count the
number of lines in each of these 25000 files (one count per file).
That is not difficult: I iterate over the directory and count the
number of lines using "wc -l" as follows:
open (WCCOUNT, "cat $file_to_read | wc -l |");
$file_number_of_lines = <WCCOUNT>;
chomp($file_number_of_lines);
close(WCCOUNT);
But the above sequential counting is very slow (2-3 hours). My server
is quite powerful (72 CPUs and fast filesystems), so I would like to
run the counting in parallel (e.g. counting 72 files at the same
time). So the questions are:
1) Is it possible to run the above command (cat ... | wc -l) in the
background (the same way as with & in the shell) and receive the
results when it is finished?
2) Is it possible to implement 1) without threads?
3) I could write the above code using system() instead of open(), but
the issue is the same: how to do it in parallel?
Or maybe someone has a better idea how to count the number of
lines in each of 25000 files? Maybe someone can recommend some other
solution. Thank you in advance.
Regards
Pawel
------------------------------
Date: Tue, 03 Jun 2008 12:18:40 +0100
From: RedGrittyBrick <RedGrittyBrick@SpamWeary.foo>
Subject: Re: Counting lines in big number of files - in parallel.
Message-Id: <48452891$0$10631$fa0fcedb@news.zen.co.uk>
hadzio@gmail.com wrote:
> Hi,
>
> I have the following issue. I have a directory with 25000 text files
> in it (about 1-10000 lines in each file). I have a Perl script that
> generates some reports for me, and this script needs to count the
> number of lines in each of these 25000 files (one count per file).
> That is not difficult: I iterate over the directory and count the
> number of lines using "wc -l" as follows:
>
> open (WCCOUNT, "cat $file_to_read | wc -l |");
> $file_number_of_lines = <WCCOUNT>;
> chomp($file_number_of_lines);
> close(WCCOUNT);
>
1) Useless use of cat
"cat $file_to_read | wc -l |" starts two processes (plus shell etc)
I'd use "wc -l $file_to_read"
2) I suspect you are invoking this 25000 times; it would be much more
efficient to invoke it once, like this:
my %filelines;
open (my $fh, '-|', 'wc -l *.txt')
    or die "can't open wc because $!";
while (<$fh>) {
    chomp;
    # wc prints the count first, then the filename
    my ($lines, $filename) = split;
    next if $filename eq 'total';   # skip wc's grand-total line
    $filelines{$filename} = $lines;
    # or do something else with $filename & $lines
    # to avoid iterating over a hash later
}
close $fh;
Untested - caveat emptor.
> But the above sequential counting is very slow (2-3 hours). My server
> is quite powerful (72 CPUs and fast filesystems), so I would like to
> run the counting in parallel (e.g. counting 72 files at the same
> time). So the questions are:
> 1) Is it possible to run the above command (cat ... | wc -l) in the
> background (the same way as with & in the shell) and receive the
> results when it is finished?
Yes
> 2) Is it possible to implement 1) without threads?
Yes, you might use processes. In either case, use a limited size pool
(e.g. the 72 you suggested) and queueing. 25000 threads or 25000
processes would be silly.
> 3) I could write the above code using system() instead of open(), but
> the issue is the same: how to do it in parallel?
There are CPAN modules for this.
>
> Or maybe someone has a better idea how to count the number of
> lines in each of 25000 files? Maybe someone can recommend some other
> solution. Thank you in advance.
I'd let wc do them all at once and see if that is fast enough. It will
certainly be faster than invoking cat and wc 25000 times.
If you are checking for file content changes I'd use stat instead to
check mtime, at least as a first step.
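A rough sketch of that stat idea (untested, in the spirit of the above; %last_mtime and needs_recount are illustrative names, not from any posted code):

```perl
use strict;
use warnings;

# Remember each file's mtime; only files whose mtime changed since
# the last check need recounting.
my %last_mtime;   # illustrative cache: $last_mtime{$file} = mtime

sub needs_recount {
    my ($file) = @_;
    my $mtime = (stat $file)[9];          # element 9 of stat() is mtime
    defined $mtime or die "can't stat '$file': $!";
    my $changed = !defined $last_mtime{$file}
               || $last_mtime{$file} != $mtime;
    $last_mtime{$file} = $mtime;          # record for next time
    return $changed;
}
```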
--
RGB
------------------------------
Date: Tue, 3 Jun 2008 04:45:18 -0700 (PDT)
From: hadzio@gmail.com
Subject: Re: Counting lines in big number of files - in parallel.
Message-Id: <75c7fc7a-c0c9-4b39-afb1-5a0e42149446@x35g2000hsb.googlegroups.com>
Hi,
Thank you for these remarks:
> 1) Useless use of cat
> "cat $file_to_read | wc -l |" starts two processes (plus shell etc)
> I'd use "wc -l $file_to_read"
My command returns a value in a format easier to process ;)
> 2) I suspect you are invoking this 25000 times; it would be much more
> efficient to invoke it once, like this:
>
> my %filelines
> open (my $fh, '-|', 'wc -l *.txt')
Takes almost the same time. Invoking the command is not an issue compared
to the time spent counting lines.
> I'd let wc do them all at once and see if that is fast enough. It will
> certainly be faster than invoking cat and wc 25000 times.
No remarkable difference.
Regards
Pawel
------------------------------
Date: Tue, 03 Jun 2008 13:18:28 GMT
From: "A. Sinan Unur" <1usa@llenroc.ude.invalid>
Subject: Re: Counting lines in big number of files - in parallel.
Message-Id: <Xns9AB25EACD12B2asu1cornelledu@127.0.0.1>
hadzio@gmail.com wrote in news:75c7fc7a-c0c9-4b39-afb1-5a0e42149446
@x35g2000hsb.googlegroups.com:
> Hi,
>
> Thank you for these remarks:
>
>> 1) Useless use of cat
>> "cat $file_to_read | wc -l |" starts two processes (plus shell etc)
>> I'd use "wc -l $file_to_read"
>
> My command returns a value in a format easier to process ;)
perldoc perlvar
HANDLE->input_line_number(EXPR)
$INPUT_LINE_NUMBER
$NR
$. Current line number for the last filehandle accessed.
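In other words, a pure-Perl counter built on $., sketched here (untested; count_lines is an illustrative name):

```perl
use strict;
use warnings;

# Count lines without shelling out: read to EOF and take $., the
# line number of the last read on the handle.
sub count_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "can't open '$path': $!";
    1 while <$fh>;      # discard the lines; $. advances per read
    my $n = $. || 0;    # capture it before close() resets $.
    close $fh;
    return $n;
}
```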
>> 2) I suspect you are invoking this 25000 times; it would be much more
>> efficient to invoke it once, like this:
I suspect running through each file and recording the line number for that
file will be much faster. On the other hand, the files have to be read,
making this IO bound. The number of CPUs you have is pretty much
irrelevant while the number of different physical hard drives over which
the files are spread is.
If you try with a few line counters running in parallel, they may get into
each others' way because of contention for the same physical hard drive.
So, let's say, on average 5,000 lines per file, 80 characters per line and
25,000 files. That's roughly 10GB of data that have to be read for this
processing to be done.
You know, wc, at least on my system, can process multiple files at a time.
For 10,000 files with 1 - 10,000 lines of 80 characters each:
timethis wc -l file*.txt > linecounts.txt
TimeThis : Command Line : wc -l file*.txt
TimeThis : Start Time : Tue Jun 03 08:41:53 2008
TimeThis : End Time : Tue Jun 03 08:45:57 2008
TimeThis : Elapsed Time : 00:04:04.437
That was 4GB in 4 minutes.
Can we speed that up?
My guess is no. At least not by much.
So it took 2-3 hours huh? Using 72 CPUs huh? Maybe you should have first
read the man page for wc.
Bummer.
Sinan
--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)
comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
------------------------------
Date: Tue, 3 Jun 2008 06:33:03 -0700 (PDT)
From: nolo contendere <simon.chao@fmr.com>
Subject: Re: Counting lines in big number of files - in parallel.
Message-Id: <aeeac9f4-4a6d-4cbe-b2ee-dabfa282d08e@79g2000hsk.googlegroups.com>
On Jun 3, 7:45 am, had...@gmail.com wrote:
> Hi,
>
> Thank you for these remarks:
>
> > 1) Useless use of cat
> > "cat $file_to_read | wc -l |" starts two processes (plus shell etc)
> > I'd use "wc -l $file_to_read"
>
> My command returns a value in a format easier to process ;)
>
> > 2) I suspect you are invoking this 25000 times; it would be much more
> > efficient to invoke it once, like this:
>
> >     my %filelines
> >     open (my $fh, '-|', 'wc -l *.txt')
>
> Takes almost the same time. Invoking the command is not an issue compared
> to the time spent counting lines.
>
> > I'd let wc do them all at once and see if that is fast enough. It will
> > certainly be faster than invoking cat and wc 25000 times.
>
> No remarkable difference.
>
Try something like this. It does what you ask, but I'm not sure it does
what you want (i.e. I don't know if this will be faster than Sinan's
solution). You can test and let us know :-). I'm sure you can figure
out how to sum the numbers; here I just print them.
#!/usr/bin/perl
use strict; use warnings;
use Parallel::ForkManager;
$|++;
# should be I/O bound, so num_cpus doesn't matter so much
# can tune this number
my $max_procs = 72;
my $pm = new Parallel::ForkManager( $max_procs );
chomp( my $somedir = `pwd` );
opendir DIR, $somedir or die "can't opendir '$somedir': $!";
while ( my $f = readdir DIR ) {
    next if $f =~ m/^\.\.?$/;
    next if -d $f;
    $pm->start and next;
    print `wc -l $f`;
    $pm->finish;
}
closedir DIR;
$pm->wait_all_children;
------------------------------
Date: Tue, 03 Jun 2008 09:20:08 -0500
From: Ted Zlatanov <tzz@lifelogs.com>
Subject: Re: Counting lines in big number of files - in parallel.
Message-Id: <86lk1m8usn.fsf@lifelogs.com>
On Tue, 3 Jun 2008 04:45:18 -0700 (PDT) hadzio@gmail.com wrote:
>> 2) I suspect you are invoking this 25000 times; it would be much more
>> efficient to invoke it once, like this:
>>
>> my %filelines
>> open (my $fh, '-|', 'wc -l *.txt')
h> Takes almost the same time. Invoking the command is not an issue compared
h> to the time spent counting lines.
Yes, and running IO-bound processes in parallel is only likely to be
faster if you have multiple I/O devices. There are exceptions with a
single disk but it's pretty unlikely you'll hit them. If you really
think seeking all over the disk will be faster, you can try
Tie::ShareLite to hold your list of files in a hash, pointing to -2
values (we'll use -2 for "available" and -1 for "in flight"). Then
start any number of worker processes and each one does a
    lock hash
    get available (value == -2) file name and set value to -1
    unlock hash
    ...count lines...
    lock hash
    set file value to number of lines
    unlock hash
when it needs the next file to count. The advantage here is that you
can start 1 or 1000 processes, and no special coordination is needed.
If a process dies you need to clean up, so when no processes are
running, you look for -1 values in the hash and set them back to -2.
This is unlikely to be needed if you're just counting words.
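That loop might be sketched like so in a single process, with an ordinary hash standing in for the Tie::ShareLite-tied one and the lock/unlock steps left as comments (claim_next and record_count are illustrative names, not from any real module):

```perl
use strict;
use warnings;

# -2 = available, -1 = in flight, >= 0 = final line count, as above.
my %files = map { $_ => -2 } qw(a.txt b.txt c.txt);

sub claim_next {
    my ($h) = @_;
    # lock hash (with Tie::ShareLite this would be a real lock)
    for my $f (sort keys %$h) {
        next unless $h->{$f} == -2;
        $h->{$f} = -1;          # mark in flight
        # unlock hash
        return $f;
    }
    # unlock hash
    return undef;               # nothing left to claim
}

sub record_count {
    my ($h, $f, $n) = @_;
    # lock hash
    $h->{$f} = $n;              # store the final line count
    # unlock hash
    return;
}
```

Each worker would loop on claim_next until it returns undef, counting lines in between and calling record_count with the result.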
I would just use `ls | xargs wc' personally...
Ted
------------------------------
Date: Tue, 03 Jun 2008 14:42:57 GMT
From: "John W. Krahn" <someone@example.com>
Subject: Re: Counting lines in big number of files - in parallel.
Message-Id: <RLc1k.835$Gn.687@edtnps92>
hadzio@gmail.com wrote:
> Hi,
>
> Thank you for these remarks:
>
>> 1) Useless use of cat
>> "cat $file_to_read | wc -l |" starts two processes (plus shell etc)
>> I'd use "wc -l $file_to_read"
>
> My command returns a value in a format easier to process ;)
Still a UUOC. Instead of "cat $file_to_read | wc -l |" use "wc -l <
$file_to_read |" or just:
chomp( my $file_number_of_lines = `wc -l < $file_to_read` );
John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall
------------------------------
Date: Tue, 03 Jun 2008 09:04:24 -0500
From: Sharif <mislam@nospam.ciuc.edu>
Subject: Re: HTML::Tokerparser / parsing <p>,<br>
Message-Id: <g23j18$pog$1@news.acm.uiuc.edu>
A. Sinan Unur wrote:
> Sharif <mislam@nospam.ciuc.edu> wrote in
> news:g21mk6$9k7$1@news.acm.uiuc.edu:
>
>> I was able to extract the title and description, how do I get the rest
>> of the information?
>>
>> # perl toke.pl
>> 124- Chocolat
>> A young woman returns to Cameroon to trace her past. Soon the sights,
>> sounds, and smells sweep her back to her childhood and memories of the
>> people who populated her youth. (French with English and Spanish
>> subtitles.)
>
> Wait a second.
>
> So I created a movie.html file and used your program to parse it:
>
> C:\Temp> cat movie.html
> <p>124- Chocolat<br>
> Santa Monica, CA, 106 minutes<br>
> MGM Home Entertainment, 2001<br>
> DVD 791.43653 C451</p>
>
> <p>A young woman returns to Cameroon to trace her past.
> Soon the sights, sounds, and smells sweep her back to her childhood and
> memories of the people who populated her youth. (French with English and
> Spanish subtitles.)</p> <p></p>
>
>
> C:\Temp> cat tt.pl
> #!/usr/bin/perl
>
> use strict ;
> use warnings;
>
> use HTML::TokeParser;
>
> my $p = HTML::TokeParser->new("movie.html") || die "Can't open: $!";
>
> while (my $token = $p->get_token) {
> if ( $p->get_tag("p") ) {
> my $title = $p->get_trimmed_text;
> print "$title\n";
> }
> elsif ( $p->get_tag("br") ) {
> my $desc = $p->get_trimmed_text;
> print "$desc \n";
> }
> }
>
> __END__
>
> C:\Temp> tt
> A young woman returns to Cameroon to trace her past. Soon the sights,
> sounds, and smells sweep her back to her childhood and memories of the
> people who populated her youth. (French with English and Spanish
> subtitles.)
>
> Notice how the output I got is different than the output you posted?
>
> Why is that?
>
Sorry, I hurriedly wrote the original post. The movie.html had
<html></html> around the code; I didn't copy that part. Thanks for your
first suggestion, I think that would work. And I will be careful in
copying the code next time.
--s
------------------------------
Date: Tue, 03 Jun 2008 09:08:15 -0500
From: Sharif <mislam@nospam.ciuc.edu>
Subject: Re: HTML::Tokerparser / parsing <p>,<br>
Message-Id: <g23j8e$pog$2@news.acm.uiuc.edu>
Gunnar Hjalmarsson wrote:
> Sharif wrote:
>> I am using the Tokeparser module for a html file that contains movie
>> title, description, library call number, year, duration etc. But the
>> html files are marked so I have to rely on the paragraph and line
>> break.
>
> There needs to be something else that identifies the start of a movie, I
> think, or else I doubt that HTML::TokeParser is a suitable tool for the
> task.
>
> Do you have a URL?
>
http://www.afrst.uiuc.edu/libvideos1.htm
There's a number in front of the title; that's the only identifier.
--s
------------------------------
Date: Tue, 03 Jun 2008 09:39:05 +0200
From: Gunnar Hjalmarsson <noreply@gunnar.cc>
Subject: Re: Perl grep and Perl 4
Message-Id: <6akas6F37ofmpU1@mid.individual.net>
fourfour2@gmail.com wrote:
>> This works in Perl 5 but not Perl 4:
>>
>> ...
>> $string1="thisisstring(one)";
>> @stringlist=("thisisstring(one)", "thisisafunnystring");
>
> I mean
> $potatoe="thisisstring(one)";
> @listofpotatoes=("thisisstring(one)", "thisisafunnystring");
>
> if (( !grep { $potatoe eq $_ ) @listofpotatoes {
> print "Not found in list....\n"
>
> }
>> syntax error,next 2 tokens :grep {"
I don't believe you when you say that the above code works in Perl 5.
Please copy and paste the code you post; don't type it with multiple typos.
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
------------------------------
Date: Tue, 3 Jun 2008 00:52:51 -0700 (PDT)
From: fourfour2@gmail.com
Subject: Re: Perl grep and Perl 4
Message-Id: <70e0db0b-0b79-4718-8485-9e71fcb1deaa@t12g2000prg.googlegroups.com>
On Jun 3, 12:39 am, Gunnar Hjalmarsson <nore...@gunnar.cc> wrote:
> fourfo...@gmail.com wrote:
> >> This works in Perl 5 but not Perl 4:
>
> >> ...
> >> $string1="thisisstring(one)";
> >> @stringlist=("thisisstring(one)", "thisisafunnystring");
>
> > I mean
> > $potatoe="thisisstring(one)";
> > @listofpotatoes=("thisisstring(one)", "thisisafunnystring");
>
> > if (( !grep { $potatoe eq $_ ) @listofpotatoes {
> >     print "Not found in list....\n"
>
> > }
> >> syntax error,next 2 tokens :grep {"
>
> I don't believe you when you say that the above code works in Perl 5.
> Please copy and paste the code you post; don't type it with multiple typos.
>
> --
> Gunnar Hjalmarsson
> Email: http://www.gunnar.cc/cgi-bin/contact.pl
#this works in perl 5, not perl 4
$potatoe="thisisapotatoe(one)";
@listofpotatoes=("thisisapotatoe(one)", "thisisanoldpottit");
if ( !grep { $potatoe eq $_ } @listofpotatoes) {
    print "Not found in list....\n";
}
------------------------------
Date: Tue, 03 Jun 2008 09:59:28 +0200
From: Frank Seitz <devnull4711@web.de>
Subject: Re: Perl grep and Perl 4
Message-Id: <6akbvmF38g0l2U1@mid.individual.net>
fourfour2@gmail.com wrote:
>>>
>>>>syntax error,next 2 tokens :grep {"
[...]
> #this works in perl 5, not perl 4
> $potatoe="thisisapotatoe(one)";
> @listofpotatoes=("thisisapotatoe(one)", "thisisanoldpottit");
>
> if ( !grep { $potatoe eq $_ } @listofpotatoes) {
> print "Not found in list....\n";
> }
The error message says that the block syntax is not
allowed in Perl 4. Use grep(EXPR,LIST) instead.
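For example, a sketch of the posted test rewritten in the grep(EXPR, LIST) form (which also still runs under Perl 5):

```perl
#!/usr/bin/perl
# grep(EXPR, LIST): the expression form available back in Perl 4,
# no block argument needed.
$potatoe = "thisisapotatoe(one)";
@listofpotatoes = ("thisisapotatoe(one)", "thisisanoldpottit");

if ( !grep($potatoe eq $_, @listofpotatoes) ) {
    print "Not found in list....\n";
}
```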
Frank
--
Dipl.-Inform. Frank Seitz; http://www.fseitz.de/
Anwendungen für Ihr Internet und Intranet
Tel: 04103/180301; Fax: -02; Industriestr. 31, 22880 Wedel
------------------------------
Date: Tue, 3 Jun 2008 06:55:27 -0500
From: Tad J McClellan <tadmc@seesig.invalid>
Subject: Re: Perl grep and Perl 4
Message-Id: <slrng4ac9f.vf4.tadmc@tadmc30.sbcglobal.net>
fourfour2@gmail.com <fourfour2@gmail.com> wrote:
> I'm using Perl 4
Why?
--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
------------------------------
Date: Tue, 03 Jun 2008 07:04:00 -0700
From: merlyn@stonehenge.com (Randal L. Schwartz)
To: fourfour2@gmail.com
Subject: Re: Perl grep and Perl 4
Message-Id: <86bq2ihay7.fsf@blue.stonehenge.com>
>>>>> "fourfour2" == fourfour2 <fourfour2@gmail.com> writes:
fourfour2> I'm using Perl 4 and have problems
I think you can stop right there. Anything after that is redundant. :)
Are you still playing DOOM too? And using Netscape 1.0?
Perl5 was released FOURTEEN YEARS AGO. Heck, it's almost old enough
to drive. :)
Your first order of business is to climb into the new millennium, new century,
new decade.
print "Just another Perl hacker,"; # the original!
--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
See http://methodsandmessages.vox.com/ for Smalltalk and Seaside discussion
------------------------------
Date: Tue, 03 Jun 2008 14:51:05 GMT
From: "A. Sinan Unur" <1usa@llenroc.ude.invalid>
Subject: Re: Perl grep and Perl 4
Message-Id: <Xns9AB26E60AF5D8asu1cornelledu@127.0.0.1>
merlyn@stonehenge.com (Randal L. Schwartz) wrote in
news:86bq2ihay7.fsf@blue.stonehenge.com:
>>>>>> "fourfour2" == fourfour2 <fourfour2@gmail.com> writes:
>
> fourfour2> I'm using Perl 4 and have problems
>
> I think you can stop right there. Anything after that is redundant.
> :)
>
> Are you still playing DOOM too?
Actually, I am. It is the only action game where I am assured of some
success (given all the practice that went into it ;-)
Sinan
--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)
comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc. For subscription or unsubscription requests, send
#the single line:
#
# subscribe perl-users
#or:
# unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.
NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V11 Issue 1605
***************************************