[28531] in Perl-Users-Digest
Perl-Users Digest, Issue: 9895 Volume: 10
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Oct 26 18:10:18 2006
Date: Thu, 26 Oct 2006 15:10:11 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Thu, 26 Oct 2006 Volume: 10 Number: 9895
Today's topics:
Naive threading performance questions <worky.workerson@gmail.com>
Re: Naive threading performance questions <jdhedden@1979.usna.com>
Re: Naive threading performance questions xhoster@gmail.com
Re: Naive threading performance questions <worky.workerson@gmail.com>
Re: Naive threading performance questions <worky.workerson@gmail.com>
Re: Naive threading performance questions <glex_no-spam@qwest-spam-no.invalid>
Simple question reg string matching <kjkartik@gmail.com>
Simple question reg string matching <kjkartik@gmail.com>
Re: Simple question reg string matching <john@castleamber.com>
Re: Simple question reg string matching <kjkartik@gmail.com>
Re: Simple question reg string matching <john@castleamber.com>
Re: Simple question reg string matching <john@castleamber.com>
Skipping a file when perl -na is in effect <bew_ba@gmx.net>
Re: Skipping a file when perl -na is in effect <mritty@gmail.com>
Re: Skipping a file when perl -na is in effect xhoster@gmail.com
Re: stop encoding of href in anchor <bart.lateur@pandora.be>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: 26 Oct 2006 12:38:36 -0700
From: "Worky Workerson" <worky.workerson@gmail.com>
Subject: Naive threading performance questions
Message-Id: <1161891516.644033.181730@f16g2000cwb.googlegroups.com>
I'm doing ETL for a database, i.e. line-by-line transformation of
fairly large data sets. From some basic profiling, I've determined
that the transformation process is relatively slow and I am heavily CPU
bound (i.e. the DB can take data 10 times faster than I can
transform it).
Since I am on an 8-way box and each line is independent of the others,
I decided to try my hand using perl threads. I came up with a naive
implementation (below), that spawns a couple of "transformation
threads", where each thread is fed via a Thread::Queue.
Unfortunately, the threaded implementation performs about 3 times
*slower* than the single threaded implementation. Am I doing something
horribly wrong? Is there something I can be doing better? Is there
some hidden synchronization bottleneck that I'm not seeing, or is a
Thread::Queue not very efficient? Are there some common idioms for
threading that I am missing?
Thanks!
# Sorry if this is incorrect ... it's hand-copied from an isolated lab
use threads;
use threads::shared;
use Thread::Queue;
my $num_threads = 5;
my $finished_processing : shared = 0;
my $data_queue = Thread::Queue->new();
threads->create("process_lines") for (1..$num_threads);
while (<>) { chomp; $data_queue->enqueue($_); }
$finished_processing = 1;
$_->join() foreach (threads->list());
# Transform thread
sub process_lines {
    while (1) {
        my $line = $data_queue->dequeue_nb();
        last if $finished_processing && !$line;
        next unless $line;
        # Do a line transformation ....
        print $line;
    }
}
------------------------------
Date: 26 Oct 2006 13:08:33 -0700
From: "jdhedden" <jdhedden@1979.usna.com>
Subject: Re: Naive threading performance questions
Message-Id: <1161893313.858150.189270@i42g2000cwa.googlegroups.com>
Worky Workerson wrote:
> Unfortunately, the threaded implementation performs about 3 times
> *slower* than the single threaded implementation.
It may be that the dequeue_nb is causing fast loops inside your
threads. Try the following instead:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
my $num_threads = 5;
my $data_queue = Thread::Queue->new();
# Start up the threads
threads->create('process_lines') for (1..$num_threads);
# Feed data to threads
while (<>) { chomp; $data_queue->enqueue($_); }
# Signal all threads to terminate
$data_queue->enqueue(undef) for (1..$num_threads);
# Wait for threads to finish
$_->join() foreach (threads->list());
# Transform thread
sub process_lines
{
    while (1) {
        my $line = $data_queue->dequeue();
        last if (! defined($line)); # Done processing
        next unless $line; # Ignore blank lines
        # Do a line transformation ....
        print $line, "\n";
    }
}
------------------------------
Date: 26 Oct 2006 20:36:19 GMT
From: xhoster@gmail.com
Subject: Re: Naive threading performance questions
Message-Id: <20061026163709.123$GU@newsreader.com>
"Worky Workerson" <worky.workerson@gmail.com> wrote:
> I'm doing ETL for a database, i.e. line-by-line transformation of
> fairly large data sets. From some basic profiling, I've determined
> that the transformation process is relatively slow and I am heavily CPU
> bound (i.e. the DB can take data 10 times faster than I can
> transform it).
>
> Since I am on an 8-way box and each line is independent of the others,
> I decided to try my hand using perl threads. I came up with a naive
> implementation (below), that spawns a couple of "transformation
> threads", where each thread is fed via a Thread::Queue.
>
> Unfortunately, the threaded implementation performs about 3 times
> *slower* than the single threaded implementation. Am I doing something
> horribly wrong? Is there something I can be doing better?
My first choice would be to make N different files and process them
independently. If that were inconvenient, I'd start N processes, each
getting a $token from 0 to N-1, and each opening an independent handle
onto the one file and each one only processing the lines where
$. % $N == $token
In order to start these N processes, I'd use forking where possible,
and threads only as a last resort (unless I had *other* compelling reasons
to use threads).
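A minimal sketch of the token scheme described above, using fork. The transform() body, the worker count of 4, and the per-worker output files named "$path.out.$token" are assumptions made for illustration; none of them come from the original post. Each worker reads the whole file but transforms only its own share of the lines, so no queue or lock is needed.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical stand-in for the real per-line transformation.
sub transform { my ($line) = @_; return uc $line }

# Read every line of $path, but transform only the lines whose line
# number satisfies $. % $n == $token, writing the results to $out.
sub process_share {
    my ($path, $token, $n, $out) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$in>) {
        next unless $. % $n == $token;
        print {$out} transform($line);
    }
    close $in;
}

# Fork $n workers, one per token, and wait for all of them.
sub run_workers {
    my ($path, $n) = @_;
    my @pids;
    for my $token (0 .. $n - 1) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: write to its own file (an assumption) so that the
            # output of different workers cannot interleave.
            my $out_path = "$path.out.$token";
            open my $out, '>', $out_path or die "Cannot write $out_path: $!";
            process_share($path, $token, $n, $out);
            close $out or die "Cannot close $out_path: $!";
            POSIX::_exit(0);    # leave without running the parent's END blocks
        }
        push @pids, $pid;
    }
    waitpid($_, 0) for @pids;
}

run_workers($ARGV[0], 4) if @ARGV;
```

Note that the original line order is not preserved across the output files, which should be acceptable here since each line is independent of the others.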
> Is there
> some hidden synchronization bottleneck that I'm not seeing, or is a
> Thread::Queue not very efficient?
Thread::Queue is not very efficient for "flyweight" stuff because it has a
synchronization bottleneck. I wouldn't exactly call this "hidden", more
like "implicit".
> while (<>) { chomp; $data_queue->enqueue($_); }
There is a high probability that this will enqueue lines faster than the
client threads can dequeue them, resulting in a memory explosion and
eventually a crash. (You can probably guess how I learned this)
while (<>) {
    $data_queue->enqueue($_);
    sleep 1 if $data_queue->pending() > 10_000;
};
Xho
--
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service $9.95/Month 30GB
------------------------------
Date: 26 Oct 2006 13:54:21 -0700
From: "Worky Workerson" <worky.workerson@gmail.com>
Subject: Re: Naive threading performance questions
Message-Id: <1161896061.025905.98850@f16g2000cwb.googlegroups.com>
> > Unfortunately, the threaded implementation performs about 3 times
> > *slower* than the single threaded implementation.
> It may be that the dequeue_nb is causing fast loops inside your
> threads. Try the following instead:
...snip...
> # Signal all threads to terminate
> $data_queue->enqueue(undef) for (1..$num_threads);
...snip...
> sub process_lines
> {
> while (1) {
> my $line = $data_queue->dequeue();
> last if (! defined($line)); # Done processing
> next unless $line; # Ignore blank lines
>
> # Do a line transformation ....
>
> print $line, "\n";
> }
>
> }
Thanks! This definitely improved the process, and it seems to max out
at about 30% faster than the original with 4 transformation threads.
Is Thread::Queue the accepted/best way to do this sort of heavy I/O
between threads or is there something with more throughput (on linux)?
------------------------------
Date: 26 Oct 2006 14:08:40 -0700
From: "Worky Workerson" <worky.workerson@gmail.com>
Subject: Re: Naive threading performance questions
Message-Id: <1161896920.677988.51850@k70g2000cwa.googlegroups.com>
On Oct 26, 4:36 pm, xhos...@gmail.com wrote:
> "Worky Workerson" <worky.worker...@gmail.com> wrote:
> > I'm doing ETL for a database, i.e. line-by-line transformation of
> > fairly large data sets. From some basic profiling, I've determined
> > that the transformation process is relatively slow and I am heavily CPU
> > bound (i.e. the DB can take data 10 times faster than I can
> > transform it).
>
> > Since I am on an 8-way box and each line is independent of the others,
> > I decided to try my hand using perl threads. I came up with a naive
> > implementation (below), that spawns a couple of "transformation
> > threads", where each thread is fed via a Thread::Queue.
> My first choice would be to make N different files and process them
> independently. If that were inconvenient, I'd start N processes, each
> getting a $token from 0 to N-1, and each opening an independent handle
> onto the one file and each one only processing the lines where
> $. % $N == $token
The first is, like you mentioned, inconvenient, but the token idea is
excellent. Thanks!
> > Is there
> > some hidden synchronization bottleneck that I'm not seeing, or is a
> > Thread::Queue not very efficient?
> Thread::Queue is not very efficient for "flyweight" stuff because it has a
> synchronization bottleneck. I wouldn't exactly call this "hidden", more
> like "implicit".
What do you mean by "flyweight"? I figured that this would be pretty
IO intensive for Thread::Queue ...
> > while (<>) { chomp; $data_queue->enqueue($_); }
> There is a high probability that this will enqueue lines faster than the
> client threads can dequeue them, resulting in a memory explosion and
> eventually a crash. (You can probably guess how I learned this)
>
> while (<>) {
> $data_queue->enqueue($_);
> sleep 1 if $data_queue->pending() > 10_000;
>
> };
Thanks for that catch :) Didn't notice it since I am running on a 16GB
RAM machine playing with a 500 MB file, but I'm sure that it would come
back to bite me once I start dealing with the real data sets.
------------------------------
Date: Thu, 26 Oct 2006 16:41:52 -0500
From: "J. Gleixner" <glex_no-spam@qwest-spam-no.invalid>
Subject: Re: Naive threading performance questions
Message-Id: <45412b71$0$498$815e3792@news.qwest.net>
Worky Workerson wrote:
> Thanks! This definitely improved the process, and it seems to max out
> at about 30% faster than the original with 4 transformation threads.
> Is Thread::Queue the accepted/best way to do this sort of heavy I/O
> between threads or is there something with more throughput (on linux)?
If you haven't already, start with the code that's doing the
"transformation" and use parallel processing after that's optimized.
You could use Parallel::ForkManager, which manages fork() nicely.
You could post your transformation code, to see if there are
better/faster ways to do it.
You could write your transformation in C.
You could have the DB do some of the work.
You could possibly do more optimized updates to the DB.
You could do many things...
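As a rough illustration of the first suggestion (optimize the transformation before parallelizing it), the core Benchmark module can compare candidate implementations. The two subs below are hypothetical: the thread never shows the real transformation, so a made-up tab-to-pipe conversion stands in for it.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Two hypothetical implementations of the same made-up transformation:
# turn a tab-separated line into a pipe-separated one.
sub via_split { my ($line) = @_; return join '|', split /\t/, $line, -1 }
sub via_tr    { my ($line) = @_; (my $copy = $line) =~ tr/\t/|/; return $copy }

my $sample = join "\t", qw(10.0.0.1 2006-10-26 GET /index.html 200);

# Check that both versions agree before comparing their speed.
die "implementations disagree" unless via_split($sample) eq via_tr($sample);

# Run each version 200_000 times and print a comparison table.
cmpthese(200_000, {
    split_join => sub { via_split($sample) },
    tr_only    => sub { via_tr($sample) },
});
```

The table that cmpthese prints varies from machine to machine, so it is only meaningful as a relative comparison on the box that will run the job.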
------------------------------
Date: 26 Oct 2006 12:31:40 -0700
From: "K3" <kjkartik@gmail.com>
Subject: Simple question reg string matching
Message-Id: <1161891100.738805.60820@i42g2000cwa.googlegroups.com>
I am a newbie to perl, so i wud appreciate if anyone give the code for
finding the number of matches in a given string, allowing atmost m no
of mismatches.
input: short string (len = k)
long string (len =n)
no of mismatches (=m)
cheers
karthik
------------------------------
Date: 26 Oct 2006 12:32:06 -0700
From: "K3" <kjkartik@gmail.com>
Subject: Simple question reg string matching
Message-Id: <1161891126.760608.112970@b28g2000cwb.googlegroups.com>
I am a newbie to perl, so i wud appreciate if anyone give the code for
finding the number of matches in a given string, allowing atmost m no
of mismatches.
input: short string (len = k)
long string (len =n)
no of mismatches (=m)
output: no of matches allowing m mismatches
cheers
karthik
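One possible reading of the problem above, as a straightforward sketch rather than an optimized solution: slide a window of length k over the long string and count the windows that differ from the short string in at most m positions. The sub name and the sample strings are made up for illustration, and the XOR trick assumes byte strings (no characters above 255).

```perl
use strict;
use warnings;

# Count the windows of $long that differ from $short in at most $m positions.
sub count_approx_matches {
    my ($short, $long, $m) = @_;
    my $k = length $short;
    return 0 if $k == 0 || $k > length $long;
    my $count = 0;
    for my $i (0 .. length($long) - $k) {
        # String XOR yields "\0" wherever the two bytes agree.
        my $xor = $short ^ substr($long, $i, $k);
        # Count the bytes that are not "\0", i.e. the mismatches.
        my $mismatches = ($xor =~ tr/\0//c);
        $count++ if $mismatches <= $m;
    }
    return $count;
}

print count_approx_matches('abc', 'abcabdxbc', 1), "\n";    # prints 3
```

This does O(n*k) work in the worst case, which is usually fine for short patterns.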
------------------------------
Date: 26 Oct 2006 20:33:52 GMT
From: John Bokma <john@castleamber.com>
Subject: Re: Simple question reg string matching
Message-Id: <Xns98689E53F3311castleamber@130.133.1.4>
"K3" <kjkartik@gmail.com> wrote:
> I am a newbie to perl, so i wud appreciate if anyone give the code for
> finding the number of matches in a given string, allowing atmost m no
> of mismatches.
>
> input: short string (len = k)
> long string (len =n)
> no of mismatches (=m)
>
> output: no of matches allowing m mismatches
CPAN or pay someone for doing your homework.
--
John Experienced Perl programmer: http://castleamber.com/
Perl help, tutorials, and examples: http://johnbokma.com/perl/
------------------------------
Date: 26 Oct 2006 13:42:51 -0700
From: "K3" <kjkartik@gmail.com>
Subject: Re: Simple question reg string matching
Message-Id: <1161895371.416062.296240@h48g2000cwc.googlegroups.com>
John
I know how to do that thing normally. But if u can give me some
efficient code, I wud appreciate ur help.
cheers
karthik
John Bokma wrote:
> "K3" <kjkartik@gmail.com> wrote:
>
> > I am a newbie to perl, so i wud appreciate if anyone give the code for
> > finding the number of matches in a given string, allowing atmost m no
> > of mismatches.
> >
> > input: short string (len = k)
> > long string (len =n)
> > no of mismatches (=m)
> >
> > output: no of matches allowing m mismatches
>
> CPAN or pay someone for doing your homework.
------------------------------
Date: 26 Oct 2006 21:12:00 GMT
From: John Bokma <john@castleamber.com>
Subject: Re: Simple question reg string matching
Message-Id: <Xns9868A4CB03292castleamber@130.133.1.4>
"K3" <kjkartik@gmail.com> wrote:
> John
>
> I know how to do that thing normally. But if u can give me some
> efficient code, I wud appreciate ur help.
u'v 2 lk @ cpan
------------------------------
Date: 26 Oct 2006 21:12:56 GMT
From: John Bokma <john@castleamber.com>
Subject: Re: Simple question reg string matching
Message-Id: <Xns9868A4F339256castleamber@130.133.1.4>
"K3" <kjkartik@gmail.com> wrote:
> John
>
> I know how to do that thing normally.
Write the code, profile it, and if you don't know how to speed it up, post
it here. There are plenty of people here willing to /improve/ code.
------------------------------
Date: 26 Oct 2006 12:15:42 -0700
From: "bernd" <bew_ba@gmx.net>
Subject: Skipping a file when perl -na is in effect
Message-Id: <1161890141.443205.255790@m7g2000cwm.googlegroups.com>
Hello folks,
does somebody know the proper way of skipping a single input file (or
at least proceed with the next one) when processing a list of files
given on the command line when running perl -na, that means, the lines
of the current file should not be read in (any more) but the program
should start with the next file in the argument list
close ARGV ;
does not work, obviously; it just resets $.
An additional
$ARGV = shift @ARGV ;
does not help either, since $ARGV contains the name of the next file
then, but $_ does not change immediately.
Any idea?
Cheers
Bernd
------------------------------
Date: 26 Oct 2006 12:24:58 -0700
From: "Paul Lalli" <mritty@gmail.com>
Subject: Re: Skipping a file when perl -na is in effect
Message-Id: <1161890697.927790.287630@m7g2000cwm.googlegroups.com>
bernd wrote:
> does somebody know the proper way of skipping a single input file (or
> at least proceed with the next one) when processing a list of files
> given on the command line when running perl -na, that means, the lines
> of the current file should not be read in (any more) but the program
> should start with the next file in the argument list
>
> close ARGV ;
>
> does not work, obviously; it just resets $.
I don't know where you're getting "obviously" from, as close ARGV is
exactly what you need to use.
$ cat file1.txt
more stuff
line 2
skip all of
this text
$ cat file2.txt
this whole file
should be
seen completely.
$ perl -lne'
close ARGV and next if /skip/;
print "($ARGV - $.) $_";
' file*.txt
(file1.txt - 1) more stuff
(file1.txt - 2) line 2
(file2.txt - 1) this whole file
(file2.txt - 2) should be
(file2.txt - 3) seen completely.
Paul Lalli
------------------------------
Date: 26 Oct 2006 20:45:52 GMT
From: xhoster@gmail.com
Subject: Re: Skipping a file when perl -na is in effect
Message-Id: <20061026164642.475$ro@newsreader.com>
"bernd" <bew_ba@gmx.net> wrote:
> Hello folks,
>
> does somebody know the proper way of skipping a single input file (or
> at least proceed with the next one) when processing a list of files
> given on the command line when running perl -na, that means, the lines
> of the current file should not be read in (any more) but the program
> should start with the next file in the argument list
>
> close ARGV ;
>
> does not work, obviously; it just resets $.
Can you demonstrate it obviously not working? In my hands, it does exactly
what you want.
Xho
------------------------------
Date: Thu, 26 Oct 2006 20:41:49 GMT
From: Bart Lateur <bart.lateur@pandora.be>
Subject: Re: stop encoding of href in anchor
Message-Id: <t872k2l4gnd05i3oako02ald014o50kquh@4ax.com>
meyerto@gmail.com wrote:
>I definitely
>don't want &amp; separating my query parameters. It breaks the link.
>For example:
>
>good:
>http://www.autotrader.com/fyc/vdp.jsp?dealer_id=94476&car_id=210332809
>broken:
>http://www.autotrader.com/fyc/vdp.jsp?dealer_id=94476&amp;car_id=210332809
The escaped links don't work as literal text or in the location bar. They DO
work in an HTML attribute.
That is because in HTML, attribute values need to be HTML-escaped. The
unescaped one only works by accident, because "&car_id" is not a known
entity, and browsers tend to fall back to keep the original text.
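To make the point concrete, here is a small sketch of escaping a URL before placing it in an href attribute. The escape_html helper is hand-rolled for illustration and is not from the thread; a maintained CPAN module would normally be preferable.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in HTML
# attribute values. The ampersand must be handled first, otherwise the
# entities produced by the later substitutions would be escaped again.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

my $url = 'http://www.autotrader.com/fyc/vdp.jsp?dealer_id=94476&car_id=210332809';
printf qq{<a href="%s">car details</a>\n}, escape_html($url);
```

The browser decodes the attribute value back to the plain URL before following the link, so the request that reaches the server contains a bare ampersand.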
--
Bart.
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc. For subscription or unsubscription requests, send
#the single line:
#
# subscribe perl-users
#or:
# unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.
NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V10 Issue 9895
***************************************