
Perl-Users Digest, Issue: 1203 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Wed Jan 16 14:14:18 2008

Date: Wed, 16 Jan 2008 11:14:10 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Wed, 16 Jan 2008     Volume: 11 Number: 1203

Today's topics:
        Shrink large file according to REG_EXP <thellper@gmail.com>
    Re: Shrink large file according to REG_EXP xhoster@gmail.com
    Re: Shrink large file according to REG_EXP <tzz@lifelogs.com>
    Re: Shrink large file according to REG_EXP <jimsgibson@gmail.com>
    Re: Shrink large file according to REG_EXP <simon.chao@fmr.com>
    Re: Wait for background processes to complete <simon.chao@fmr.com>
    Re: Wait for background processes to complete <pgodfrin@gmail.com>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Wed, 16 Jan 2008 09:28:26 -0800 (PST)
From: thellper <thellper@gmail.com>
Subject: Shrink large file according to REG_EXP
Message-Id: <ab9782ce-07b5-4841-84e2-88cff0dee2b5@v67g2000hse.googlegroups.com>

Hello,
I have a problem to solve, and I need some help, please.
My input is a large text file (up to 5 GB) which I need to filter
according to some regular expressions, and then I need to write the
filtered (hopefully smaller) output to another file.
The filtering applies row by row: a row is split according to some
rules into various pieces, then some of the pieces are checked against
some regular expressions, and if a match is found, the whole line is
written to the output.

The problem is that this solution is slow.
I'm now reading the whole file line by line and then applying the
regexes... but it is very slow.
I've noticed that the time to read and write the file without doing
anything else is very small, so I'm losing a lot of time in my
regexes.

OK, the whole program is more complicated: the files may have
different syntaxes, and I have syntax files which tell me how to split
each line into its fields. Then I load, separately, files with the
rules (the regexes) used to filter them.
Anyway, my idea was to try the forks.pm module (see CPAN) to split
the file into chunks and let each thread work on a chunk of the
file: can somebody tell me how to do this? Or is there a better way?

Any help is really appreciated.

Best regards,
Davide


------------------------------

Date: 16 Jan 2008 17:54:13 GMT
From: xhoster@gmail.com
Subject: Re: Shrink large file according to REG_EXP
Message-Id: <20080116125414.672$QG@newsreader.com>

thellper <thellper@gmail.com> wrote:
> Hello,
> I've a problem to solve, and I need some help, please.
> I've as input a large text file (up to 5GB) which I need to filter
> according some REG_EXP and then I need to write the filtered
> (hopefully smaller) output to another file.
> The filtering applies row-by-row: a row is splitted according to some
> rules in various pieces, then some of the pieces are checked according
> to some REG_EXP, and if a match is found, the whole line is written to
> the output.
>
> The problem is that this solution is slow.
> I'm now reading line by line the whole file, and then I'm applying the
> reg_exp... but it is very slow.
> I've noticed that the time to read and write the file without doing
> anything is very small, so I'm loosing a lot of time for my
> reg_exps... .

Figure out which regex is slow, why it is slow, and then make it faster.

If you did the first step and posted the culprit with some sample input, we
might be able to help with the latter two.

> Ok, the whole program is more complicated: the files may have
> different syntax, and I have syntax files which tell me how to split
> each line in its fields. Then I load separately files with the rules
> (the reg_exps) used to filter them.... .
> Anyway, my idea was to try to use the FORKS.pm module (s. CPAN) to
> split the file in chunks and let each thread work on a chunk of the
> file: can somebody tell me how to do this ? Or a better way?

I'd try to make the single-threaded version faster first, and turn to
parallelization only as a last resort.  Also, if I were parallelizing
this, I probably wouldn't use forks.pm to do it.  Once started, your
threads (or processes) really don't need to communicate with each
other (as long as you make independent output files to be combined
later), so a simpler approach would do, like Parallel::ForkManager or
just calling fork yourself.  Or just start the jobs as separate
processes in the first place.

If the order of the lines in the output files isn't important, I'd give
each job a different integer token (from 0 to $num_jobs-1) and then have
each job process only those lines where

    $token == $. % $num_jobs
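
That scheme can be sketched as follows; the token, job count, filter
pattern, and sample data are all made up for illustration, and the
"file" is a small in-memory array rather than the real 5 GB input:

```perl
use strict;
use warnings;

# Sharding sketch: job number $token of $num_jobs handles only the lines
# where the line number modulo $num_jobs equals $token, so several such
# processes can split one file between them.  In practice $token and
# $num_jobs would come from @ARGV, the input would be the large file,
# and each job would write its own output file to be combined later.
my ($token, $num_jobs) = (1, 2);
my $filter = qr/ERROR/;    # stand-in for the real rule set

my @input = (
    "ERROR disk failure\n",   # line 1 -> job 1
    "ok    heartbeat\n",      # line 2 -> job 0
    "ERROR timeout\n",        # line 3 -> job 1
    "ok    heartbeat\n",      # line 4 -> job 0
);

my @kept;
my $lineno = 0;
for my $line (@input) {
    $lineno++;                                  # plays the role of $.
    next unless $lineno % $num_jobs == $token;  # not this job's line
    push @kept, $line if $line =~ $filter;
}
print @kept;    # job 1 sees lines 1 and 3; both match the filter
```

Since every job skips lines cheaply before running any regex, the regex
cost is divided across the jobs while each output file stays in input
order.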

Xho

-- 
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.


------------------------------

Date: Wed, 16 Jan 2008 12:00:36 -0600
From: Ted Zlatanov <tzz@lifelogs.com>
Subject: Re: Shrink large file according to REG_EXP
Message-Id: <863asx1whn.fsf@lifelogs.com>

On Wed, 16 Jan 2008 09:28:26 -0800 (PST) thellper <thellper@gmail.com> wrote: 

t> The problem is that this solution is slow.  I'm now reading line by
t> line the whole file, and then I'm applying the reg_exp... but it is
t> very slow.  I've noticed that the time to read and write the file
t> without doing anything is very small, so I'm loosing a lot of time
t> for my reg_exps... .

t> Ok, the whole program is more complicated: the files may have
t> different syntax, and I have syntax files which tell me how to split
t> each line in its fields. Then I load separately files with the rules
t> (the reg_exps) used to filter them.... .  Anyway, my idea was to try
t> to use the FORKS.pm module (s. CPAN) to split the file in chunks and
t> let each thread work on a chunk of the file: can somebody tell me how
t> to do this ? Or a better way?

Please post a practical example of what's slow (with sample input) so we
can see, comment on, and test it.  There's a Benchmark module that will
measure the performance of a function well.
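
For example, a harness along these lines (the sample line and both
candidate patterns are invented) compares two regexes directly:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sketch of timing candidate filters with the core Benchmark module.
# The point is only the harness: run each sub a fixed number of times
# and print a rate table comparing them.
my $line = "2008-01-16 12:00:00 host42 app: connection timeout" x 20;

cmpthese(50_000, {
    anchored => sub { $line =~ /^2008/ },     # can fail or succeed fast
    floating => sub { $line =~ /timeout/ },   # may scan the whole line
});
```

cmpthese() prints its comparison table to stdout; the slowest candidate
can then be attacked in isolation.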

Ted


------------------------------

Date: Wed, 16 Jan 2008 10:02:20 -0800
From: Jim Gibson <jimsgibson@gmail.com>
Subject: Re: Shrink large file according to REG_EXP
Message-Id: <160120081002202581%jimsgibson@gmail.com>

In article
<ab9782ce-07b5-4841-84e2-88cff0dee2b5@v67g2000hse.googlegroups.com>,
thellper <thellper@gmail.com> wrote:

> Hello,
> I've a problem to solve, and I need some help, please.
> I've as input a large text file (up to 5GB) which I need to filter
> according some REG_EXP and then I need to write the filtered
> (hopefully smaller) output to another file.
> The filtering applies row-by-row: a row is splitted according to some
> rules in various pieces, then some of the pieces are checked according
> to some REG_EXP, and if a match is found, the whole line is written to
> the output.
> 
> The problem is that this solution is slow.
> I'm now reading line by line the whole file, and then I'm applying the
> reg_exp... but it is very slow.
> I've noticed that the time to read and write the file without doing
> anything is very small, so I'm loosing a lot of time for my
> reg_exps... .
> 
> Ok, the whole program is more complicated: the files may have
> different syntax, and I have syntax files which tell me how to split
> each line in its fields. Then I load separately files with the rules
> (the reg_exps) used to filter them.... .
> Anyway, my idea was to try to use the FORKS.pm module (s. CPAN) to
> split the file in chunks and let each thread work on a chunk of the
> file: can somebody tell me how to do this ? Or a better way?

If your program is I/O-bound, then it might be faster to work on
different parts simultaneously. However, you are going to suffer some
head thrashing as your multiple processes attempt to read different
parts of the same file at the same time.

If your program is CPU-bound, then splitting up the work won't help
unless you are using a multi-processor system.

If, as you say, reading the file without doing any processing is quick
enough, then it is the processing of the data that is the bottleneck.
You should concentrate on improving that part of your program. People
here can help, if you post short examples of what you are trying to do.
Show us some of your regexes, at least, and samples of these "syntax
files".

-- 
Jim Gibson



------------------------------

Date: Wed, 16 Jan 2008 10:13:04 -0800 (PST)
From: nolo contendere <simon.chao@fmr.com>
Subject: Re: Shrink large file according to REG_EXP
Message-Id: <7eb7591e-501f-4caa-bdfb-f5c73415f54a@v4g2000hsf.googlegroups.com>

On Jan 16, 12:28 pm, thellper <thell...@gmail.com> wrote:
> Hello,
> I've a problem to solve, and I need some help, please.
> I've as input a large text file (up to 5GB) which I need to filter
> according some REG_EXP and then I need to write the filtered
> (hopefully smaller) output to another file.
> The filtering applies row-by-row: a row is splitted according to some
> rules in various pieces, then some of the pieces are checked according
> to some REG_EXP, and if a match is found, the whole line is written to
> the output.
>
> The problem is that this solution is slow.
> I'm now reading line by line the whole file, and then I'm applying the
> reg_exp... but it is very slow.
> I've noticed that the time to read and write the file without doing
> anything is very small, so I'm loosing a lot of time for my
> reg_exps... .
>
> Ok, the whole program is more complicated: the files may have
> different syntax, and I have syntax files which tell me how to split
> each line in its fields. Then I load separately files with the rules
> (the reg_exps) used to filter them.... .
> Anyway, my idea was to try to use the FORKS.pm module (s. CPAN) to
> split the file in chunks and let each thread work on a chunk of the
> file: can somebody tell me how to do this ? Or a better way?
>

check out /REGEX/o

and qr/REGEX/

...also, if you keep a history of which filters get used the most,
stick those at the top. This will speed up the file processing if the
trend does not change; you may want to redo the ordering periodically
in case it does change.
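
A sketch combining those two ideas; the three rule strings and the
sample lines are invented:

```perl
use strict;
use warnings;

# Compile each rule string once with qr//, and count hits per rule so
# the historically busiest rules are tried first.  In the real program
# the rule strings would come from the separately loaded rule files.
my @rules = map { { re => qr/$_/, hits => 0 } } ('ERROR', 'WARN', 'timeout');

my @input = ("WARN disk almost full\n", "all quiet\n", "WARN again\n");
my @kept;

for my $line (@input) {
    # try the busiest rules first
    for my $rule (sort { $b->{hits} <=> $a->{hits} } @rules) {
        if ($line =~ $rule->{re}) {
            $rule->{hits}++;
            push @kept, $line;
            last;              # one match is enough to keep the line
        }
    }
}
print @kept;    # the two WARN lines survive the filter
```

Compiling with qr// avoids recompiling interpolated patterns on every
line, which is usually a bigger win than /o once the rules come from
variables.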


------------------------------

Date: Wed, 16 Jan 2008 07:17:02 -0800 (PST)
From: nolo contendere <simon.chao@fmr.com>
Subject: Re: Wait for background processes to complete
Message-Id: <710a86b9-9453-45fd-9484-39d73372f0c4@v67g2000hse.googlegroups.com>

On Jan 14, 2:05 pm, pgodfrin <pgodf...@gmail.com> wrote:
> On Jan 14, 11:11 am, xhos...@gmail.com wrote:
>
> > pgodfrin <pgodf...@gmail.com> wrote:
> > > On Jan 13, 9:43 pm, xhos...@gmail.com wrote:
> > > > pgodfrin <pgodf...@gmail.com> wrote:
> > > > > Greetings,
> > > > > Well - I've spent a bunch of time trying to figure this out - to no
> > > > > avail.
>
> > > > > Here's what I want to do - run several commands in the background and
> > > > > have the perl program wait for the commands to complete. Fork doesn't
> > > > > do it, nor does wait nor waitpid.
>
> > > > None of them individually do it, no.  You have to use them together.
>
> > > > > Any thoughts?
>
> > > > > Here's a sample program which starts the processes:
>
> > > > >    while (<*.txt>)
> > > > >    {
> > > > >        print "Copying  $_ \n";
> > > > >        system("cp $_ $_.old &") ;
>
> > > > This starts a shell, which then starts cp in the background.  As soon
> > > > as the cp is *started*, the shell exits.  So Perl has nothing to wait
> > > > for, as the shell is already done (and waited for) before system
> > > > returns.  You need to use fork and system or fork and exec.  Or you
> > > > could use Parallel::ForkManager, which will wrap this stuff up nicely
> > > > for you and also prevent you from fork-bombing your computer if there
> > > > are thousands of *.txt
>
> > > > >    }
> > > > >    print "End of exercise\n";
> > > > >    exit;
>
> > > > 1 until -1==wait();  # on my system, yours may differ
>
> > > Well - that's beginning to make a little sense - the shell completes
> > > and perl has nothing to wait for. No wonder I'm pulling out what
> > > little of my hair is left! :) I guess the fork process returns the pid
> > > of the process, but - if it's the pid of the shell process, then we're
> > > back to square one.
>
> > The fork returns (to the parent) the pid of the process forked off.
> > (but you don't actually need to know the pid if you merely want to wait,
> > rather than waitpid.)  If that forked-off process then itself starts the
> > cp in the background, of course you are no better off.  But if the
> > forked-off process either becomes cp (using exec) or it starts up cp in
> > the foreground (using system without a "&"), then you now have something
> > to wait for.  In the first case, you wait for cp itself.  In the second
> > case, you wait for the forked-off perl process which is itself waiting
> > for the cp.
>
> > $ perl -wle 'use strict; fork or exec "sleep " . $_*3 foreach 1..3 ; \
> >             my $x; do {$x=wait; print $x} until $x==-1'
> > 438
> > 439
> > 440
> > -1
>
> > Xho
>
> > --
> > --------------------http://NewsReader.Com/--------------------
> > The costs of publication of this article were defrayed in part by the
> > payment of page charges. This article must therefore be hereby marked
> > advertisement in accordance with 18 U.S.C. Section 1734 solely to
> > indicate this fact.
>
> Hi Xho,
> Well - your code and concepts work fine when you want to wait
> sequentially. My goal here is to fire off x number of processes and then
> wait for ALL of them to complete (this is basically rudimentary job
> control, trying to use the shell concepts and maximize parallelism).

Reading Xho's code, it looks like 3 processes are kicked off: one
sleeps 3 seconds, one sleeps 6 seconds, and one sleeps 9 seconds.
If the processes ran sequentially, you wouldn't see the last pid until
after 18 seconds; instead you see it after 9 seconds.
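
Spelled out as a standalone script (the sleeps here stand in for the cp
commands), the same fork-then-wait-for-everything pattern looks like
this:

```perl
use strict;
use warnings;

# Fork one child per task, then keep calling wait() until it returns
# -1, which means every child has been reaped.  The children just
# sleep; in the original problem each child would exec (or run via
# system, without "&") a cp command.
my @pids;
for my $task (1 .. 3) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {      # child: do the work, then exit
        sleep 1;
        exit 0;
    }
    push @pids, $pid;     # parent: remember the child and keep going
}

1 until wait() == -1;     # blocks until ALL children have finished
print "all ", scalar(@pids), " children finished\n";
```

Because all three children run concurrently, the loop finishes in about
the time of the slowest child, not the sum of them.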

HTH


------------------------------

Date: Wed, 16 Jan 2008 10:19:16 -0800 (PST)
From: pgodfrin <pgodfrin@gmail.com>
Subject: Re: Wait for background processes to complete
Message-Id: <4b1911d1-5c6e-4944-bca3-66a89e76c92a@m34g2000hsb.googlegroups.com>

On Jan 15, 5:49 pm, Ben Morrow <b...@morrow.me.uk> wrote:

>
> <snip>
> > So I wrap the ps command and do some looping:
>
> > for (;;)
> > {
> >    open PGRP, "ps -C cp h |"   ;
>
> Use lexical filehandles and three-or-more-arg open.
> Check the return value.
>
>     open my $PGRP, '-|', 'ps', '-C', 'cp', 'h'
>         or die "couldn't fork ps: $!";
>
> >    @pidlist=<PGRP> ;
> >    if ($#pidlist<0) {die "\nNo more processes\n" ;}
>
> IMHO any use of $#ary is an error; certainly in this case you should be
> using @pidlist instead.
>
>     @pidlist or die "No more processes\n";
>
> This will run around in a tight loop running probably hundreds of ps
> processes per second. This is not an effective use of your system's
> resources, to say the least. If you must poll like this you need a sleep
> in there somewhere to limit the damage.
>
> > }
>
> > It's not pretty but it works...
>
> Yuck. The whole point of the wait syscall is to avoid nastiness like
> that. I suggest you learn how it works, or use a module written by
> someone who does (you have been given at least two suggestions so far),
> or stick to shell.
>
> > But, I believe this is an architectural FLAW with Perl.
>
> No. Firstly, the only flaw here is in your understanding; secondly, if
> there was a flaw it would be in Unix, not Perl, since Perl just exposes
> the underlying system interfaces.
>
> Ben



Whew! I knew I might get some feathers ruffled with the 'flaw'
comment. Sorry my knowledge is not as extensive as yours
(collectively) - but it appears it is true.

I had an exit statement in the if..elsif construct. By removing that
exit and changing the system() to exec() I am at least getting the
fork() process to kick off multiple tasks.

I'm still having problems making it wait though. I think I'm getting
there. I'll report back later - when my knowledge has increased :)
pg
(sorry Xho...)


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc.  For subscription or unsubscription requests, send
#the single line:
#
#	subscribe perl-users
#or:
#	unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.  

NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice. 

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 1203
***************************************

