
Perl-Users Digest, Issue: 2977 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Tue Jun 8 06:14:16 2010

Date: Tue, 8 Jun 2010 03:14:07 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Tue, 8 Jun 2010     Volume: 11 Number: 2977

Today's topics:
        suggestions for printing out a few records of a lengthy <cartercc@gmail.com>
    Re: suggestions for printing out a few records of a len <uri@StemSystems.com>
    Re: suggestions for printing out a few records of a len <cartercc@gmail.com>
    Re: suggestions for printing out a few records of a len <m@rtij.nl.invlalid>
    Re: suggestions for printing out a few records of a len <glex_no-spam@qwest-spam-no.invalid>
    Re: suggestions for printing out a few records of a len <rvtol+usenet@xs4all.nl>
    Re: suggestions for printing out a few records of a len <uri@StemSystems.com>
    Re: suggestions for printing out a few records of a len <uri@StemSystems.com>
    Re: suggestions for printing out a few records of a len <cartercc@gmail.com>
    Re: suggestions for printing out a few records of a len <rvtol+usenet@xs4all.nl>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Mon, 7 Jun 2010 14:21:54 -0700 (PDT)
From: ccc31807 <cartercc@gmail.com>
Subject: suggestions for printing out a few records of a lengthy file
Message-Id: <d66198f4-8ca8-4410-8323-0b070919faf8@y11g2000yqm.googlegroups.com>

The input is a flat file (pipe separated) with thousands of records
and tens of columns, similar to this. The first column is a unique
key.

42546|First|Middle|Last|Street|City|State|Zip|Country|Attr1|Attr2|Attr3 ...

The input is processed and the output consists of multi-page PDF
documents that combine the input file with other files. The other
files reference the unique key. I build a hash with the input files,
like this:

my %records;
while (<IN>)
{
  my ($key, $first, $middle, $last ...) = split /\|/;
  $records{$key} = {
    first  => $first,
    middle => $middle,
    last   => $last,
    ...
  };
}

Running this script results in thousands of PDF files. The client has
a need for individual documents, so I modified the script to accept a
unique key as a command line argument, which still reads the input
document until it matches the key, creates one hash element for the
key, and exits, like this:

# in the while loop
   if ($key == $command_line_argument) {
      # create hash element as above
      last;
   }

The client now has a need to create a small number of documents. I
capture the unique keys in @ARGV, but I don't know the best way to
select just those records. I can pre-create the hash like this:

foreach my $key (@ARGV)
{
    $records{$key} = 1;
}

and in the while loop, doing this:

    if(exists $records{$key})
    {
      #create hash element as above
    }

but this still reads through the entire input file.

Is there a better way?

Thanks, CC.


------------------------------

Date: Mon, 07 Jun 2010 17:32:24 -0400
From: "Uri Guttman" <uri@StemSystems.com>
Subject: Re: suggestions for printing out a few records of a lengthy file
Message-Id: <8739wybkbb.fsf@quad.sysarch.com>

>>>>> "c" == ccc31807  <cartercc@gmail.com> writes:

  c> The client now has a need to create a small number of documents. I
  c> capture the unique keys in @ARGV, but I don't know the best way to
  c> select just those records. I can pre-create the hash like this:

  c> foreach my $key (@ARGV)
  c> {
  c>     $records{$key} = 1;
  c> }

  c> and in the while loop, doing this:

  c>     if(exists $records{$key})
  c>     {
  c>       #create hash element as above
  c>     }

  c> but this still reads through the entire input file.

use a real database or even a DBD for a csv (pipe separated is ok)
file.

or you could save a lot of space by reading in each row and only saving
that text line in a hash with the key (you extract only the key). then
you can locate the rows of interest, parse out the fields and do the
usual stuff.
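
a minimal sketch of that second idea (names like %line_for and
@wanted_keys are just mine; it assumes the key is everything before the
first pipe, as in your sample row):

my %line_for;
while ( my $line = <IN> ) {
    # store the raw line keyed by its id; no full parse yet
    my ($key) = $line =~ /^([^|]+)/ or next;
    $line_for{$key} = $line;
}

# later, split only the few rows you actually need
for my $key (@wanted_keys) {
    next unless exists $line_for{$key};
    my ( $id, $first, $middle, $last ) = split /\|/, $line_for{$key};
    # ... do the usual stuff ...
}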

uri

-- 
Uri Guttman  ------  uri@stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------


------------------------------

Date: Mon, 7 Jun 2010 14:42:53 -0700 (PDT)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: suggestions for printing out a few records of a lengthy file
Message-Id: <f8abae07-9cbe-4262-b260-9b616370e969@k39g2000yqd.googlegroups.com>

On Jun 7, 5:32 pm, "Uri Guttman" <u...@StemSystems.com> wrote:
> use a real database or even a DBD for a csv (pipe separated is ok)
> file.

We get the input file dumped on us every other month or so, as an
ASCII file, and use it just once to create the PDFs. We never do any
update, delete, or insert queries, and only a few select queries, so
putting it into a RDB just to print maybe two dozen documents out of
thousands seems like a lot of effort for very little benefit.

> or you could save a lot of space by reading in each row and only saving
> that text line in a hash with the key (you extract only the key). then
> you can locate the rows of interest, parse out the fields and do the
> usual stuff.

This is what I thought I was doing. However, it occurs to me that I
can use a counter initially set to the size of @ARGV, decrement it for
every match, and exit when the counter reaches zero.
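
Something like this, I suppose (a rough sketch; $remaining is just an
illustrative name):

# pre-create the wanted keys, as in my first message
my %records;
$records{$_} = 1 for @ARGV;

my $remaining = @ARGV;          # how many keys are still unseen
while (<IN>)
{
  my ($key) = /^([^|]+)/ or next;
  next unless exists $records{$key};
  # create the full hash element as above
  last if --$remaining == 0;    # all keys found, stop reading
}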

Thanks for your response, CC.


------------------------------

Date: Mon, 7 Jun 2010 23:56:42 +0200
From: Martijn Lievaart <m@rtij.nl.invlalid>
Subject: Re: suggestions for printing out a few records of a lengthy file
Message-Id: <qh60e7-p39.ln1@news.rtij.nl>

On Mon, 07 Jun 2010 14:21:54 -0700, ccc31807 wrote:

> The input is a flat file (pipe separated) with thousands of records and
> tens of columns, similar to this. The first column is a unique key.

(snip)

> The client now has a need to create a small number of documents. I
> capture the unique keys in @ARGV, but I don't know the best way to
> select just those records. I can pre-create the hash like this:
> 
> foreach my $key (@ARGV)
> {
>     $records{$key} = 1;
> }
> 
> and in the while loop, doing this:
> 
>     if(exists $records{$key})
>     {
>       #create hash element as above
>     }
> 
> but this still reads through the entire input file.
> 
> Is there a better way?

Better to first ask yourself whether there really is a problem.
"Thousands" of records sounds like peanuts to me, and very small
peanuts at that.

[martijn@cow t]$ time perl -ne '($x, $y, $z) = split; $h{$x}{y}=$y; $h{$x}{z}=$z' t.log

real	0m2.804s
user	0m2.750s
sys	0m0.043s
[martijn@cow t]$ wc -l t.log
670365 t.log
[martijn@cow t]$ 

YMMV, and the more you do in the loop the longer it takes. But still, the 
seconds (at most!) you might shave off aren't worth your programmer time.

That said, there isn't a really good way to optimize this anyway. Only
if you do dozens of runs with the same input file might it make sense
to create an index file, put the data in a database, or read it once
and store the hash with the Storable module for fast rereading. (All of
these solutions amount to the same thing: keep an index on disk.)
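
For the Storable route, the round trip is short (a sketch; the file
name is illustrative):

use Storable qw(store retrieve);

# first run: parse the flat file once, then freeze the hash to disk
store \%records, 'records.stor';

# later runs: reload in a fraction of the parsing time
my $records = retrieve 'records.stor';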

M4


------------------------------

Date: Mon, 07 Jun 2010 18:00:32 -0500
From: "J. Gleixner" <glex_no-spam@qwest-spam-no.invalid>
Subject: Re: suggestions for printing out a few records of a lengthy file
Message-Id: <4c0d7a12$0$89388$815e3792@news.qwest.net>

ccc31807 wrote:
> On Jun 7, 5:32 pm, "Uri Guttman" <u...@StemSystems.com> wrote:
>> use a real database or even a DBD for a csv (pipe separated is ok)
>> file.
> 
> We get the input file dumped on us every other month or so, as an
> ASCII file, and use it just once to create the PDFs. We never do any
> update, delete, or insert queries, and only a few select queries, so
> putting it into a RDB just to print maybe two dozen documents out of
> thousands seems like a lot of effort for very little benefit.
> 
>> or you could save a lot of space by reading in each row and only saving
>> that text line in a hash with the key (you extract only the key). then
>> you can locate the rows of interest, parse out the fields and do the
>> usual stuff.
> 
> This is what I thought I was doing. However, it occurs to me that I
> can use a counter initially set to the size of @ARGV, decrement it for
> every match, and exit when the counter reaches zero.

No need for all that. You could create a hash of the keys passed in
via ARGV.

my %ids = map { $_ => 1 } @ARGV;

Then test if the key is one you're interested in:

while(<IN>)
{
   my ($key, $first, $middle, $last ...) = split /\|/;
   next unless $ids{ $key };
   ...


------------------------------

Date: Tue, 08 Jun 2010 01:35:17 +0200
From: "Dr.Ruud" <rvtol+usenet@xs4all.nl>
Subject: Re: suggestions for printing out a few records of a lengthy file
Message-Id: <4c0d8236$0$22916$e4fe514c@news.xs4all.nl>

J. Gleixner wrote:

> You could create a hash of the keys passed in
> via ARGV.
> 
> my %ids = map { $_ => 1 } @ARGV;
> 
> Then test if the key is one you're interested in:
> 
> while(<IN>)
> {
>   my ($key, $first, $middle, $last ...) = split /\|/;
>   next unless $ids{ $key };
>   ...

You could also first test whether the line starts with something
interesting. If the key is, for example, at least 3 characters long,
something like: C<next unless $short{ substr $_, 0, 3 };>.

You can also (pre)process the file with a grep command.
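
A sketch of that prefix test (C<%short> as above; the surrounding frame
is just illustrative):

my %ids   = map { $_ => 1 } @ARGV;
my %short = map { substr( $_, 0, 3 ) => 1 } @ARGV;

while (<IN>)
{
  next unless $short{ substr $_, 0, 3 };   # cheap rejection first
  my ($key) = /^([^|]+)/ or next;
  next unless $ids{$key};
  # process the record
}

The grep route could look like C<grep -E '^(42546|42547)\|' infile>
(keys and file name illustrative).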

-- 
Ruud


------------------------------

Date: Mon, 07 Jun 2010 20:14:50 -0400
From: "Uri Guttman" <uri@StemSystems.com>
Subject: Re: suggestions for printing out a few records of a lengthy file
Message-Id: <871vci9y85.fsf@quad.sysarch.com>

>>>>> "c" == ccc31807  <cartercc@gmail.com> writes:

  c> On Jun 7, 5:32 pm, "Uri Guttman" <u...@StemSystems.com> wrote:
  >> use a real database or even a DBD for a csv (pipe separated is ok)
  >> file.

  c> We get the input file dumped on us every other month or so, as an
  c> ASCII file, and use it just once to create the PDFs. We never do any
  c> update, delete, or insert queries, and only a few select queries, so
  c> putting it into a RDB just to print maybe two dozen documents out of
  c> thousands seems like a lot of effort for very little benefit.

  >> or you could save a lot of space by reading in each row and only saving
  >> that text line in a hash with the key (you extract only the key). then
  >> you can locate the rows of interest, parse out the fields and do the
  >> usual stuff.

  c> This is what I thought I was doing. However, it occurs to me that I
  c> can use a counter initially set to the size of @ARGV, decrement it for
  c> every match, and exit when the counter reaches zero.

that would save some time, but an unknown amount, since you don't know
where the needed keys are in the file and one could be on the last
line. if you want to do it that way, even simpler is to make a hash of
the needed keys from @ARGV. then when you see a line with one of those
keys, process it to a pdf and delete that entry from the hash. when the
hash is empty, exit.

this could also run to the end of the file, but it never stores more
than one line at a time, so it is ram efficient.
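
a sketch of that (the %wanted name is just mine):

my %wanted = map { $_ => 1 } @ARGV;

while ( my $line = <IN> ) {
    my ($key) = $line =~ /^([^|]+)/ or next;
    next unless delete $wanted{$key};
    # split $line and generate the pdf for this record
    last unless %wanted;    # every requested key seen, stop reading
}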

uri

-- 
Uri Guttman  ------  uri@stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------


------------------------------

Date: Mon, 07 Jun 2010 20:16:22 -0400
From: "Uri Guttman" <uri@StemSystems.com>
Subject: Re: suggestions for printing out a few records of a lengthy file
Message-Id: <87wrua8jl5.fsf@quad.sysarch.com>

>>>>> "JG" == J Gleixner <glex_no-spam@qwest-spam-no.invalid> writes:

  JG> ccc31807 wrote:
  >> On Jun 7, 5:32 pm, "Uri Guttman" <u...@StemSystems.com> wrote:
  >>> use a real database or even a DBD for a csv (pipe separated is ok)
  >>> file.
  >> 
  >> We get the input file dumped on us every other month or so, as an
  >> ASCII file, and use it just once to create the PDFs. We never do any
  >> update, delete, or insert queries, and only a few select queries, so
  >> putting it into a RDB just to print maybe two dozen documents out of
  >> thousands seems like a lot of effort for very little benefit.
  >> 
  >>> or you could save a lot of space by reading in each row and only saving
  >>> that text line in a hash with the key (you extract only the key). then
  >>> you can locate the rows of interest, parse out the fields and do the
  >>> usual stuff.
  >> 
  >> This is what I thought I was doing. However, it occurs to me that I
  >> can use a counter initially set to the size of @ARGV, decrement it for
  >> every match, and exit when the counter reaches zero.

  JG> No need for all that. You could create a hash of the keys passed in
  JG> via ARGV.

  JG> my %ids = map { $_ => 1 } @ARGV;

  JG> Then test if the key is one you're interested in:

  JG> while(<IN>)
  JG> {
  JG>   my ($key, $first, $middle, $last ...) = split /\|/;
  JG>   next unless $ids{ $key };
  JG>   ...

same idea i had, but you didn't add deleting found keys so you can exit
early.

also, no need to do a full split on the line unless you know its key is
in the hash. only split after you find a needed line. you can easily
grab the key from the front of each line as it comes in.
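
for example (index/substr instead of a regex; either way works):

while ( my $line = <IN> ) {
    my $pipe = index $line, '|';
    next if $pipe < 0;                   # no pipe, skip the line
    next unless $ids{ substr $line, 0, $pipe };
    my ( undef, $first, $middle, $last ) = split /\|/, $line;
    # ... build the pdf ...
}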

uri

-- 
Uri Guttman  ------  uri@stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------


------------------------------

Date: Mon, 7 Jun 2010 17:55:11 -0700 (PDT)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: suggestions for printing out a few records of a lengthy file
Message-Id: <ae6af676-1458-41b6-866f-1b2888e99442@u26g2000yqu.googlegroups.com>

On Jun 7, 5:56 pm, Martijn Lievaart <m...@rtij.nl.invlalid> wrote:
> Better to first ask yourself whether there really is a problem.
> "Thousands" of records sounds like peanuts to me, and very small
> peanuts at that.

You are right about that. Printing the PDFs takes far more time than
creating the hash in memory, and even if creating the full hash took
as much as a second it would be acceptable. My concern was really more
theoretical: why create a hash of some 50K elements when you only need
three?

> YMMV, and the more you do in the loop the longer it takes. But still, the
> seconds (at most!) you might shave off aren't worth your programmer time.

That's worth a smiley! I could just create the individual documents
one at a time, since I have a modified script that will do just that.
Again, though, it offends my sense of frugality.

> That said, there isn't a really good way to optimize this anyway. Only
> if you do dozens of runs with the same input file might it make sense
> to create an index file, put the data in a database, or read it once
> and store the hash with the Storable module for fast rereading. (All of
> these solutions amount to the same thing: keep an index on disk.)

Agreed. I don't have that much experience in development, and there
isn't a real functional need for optimization.

Again, thanks for your comments, CC.


------------------------------

Date: Tue, 08 Jun 2010 09:57:38 +0200
From: "Dr.Ruud" <rvtol+usenet@xs4all.nl>
Subject: Re: suggestions for printing out a few records of a lengthy file
Message-Id: <4c0df7f2$0$22918$e4fe514c@news.xs4all.nl>

ccc31807 wrote:

> Printing the PDFs takes far more time than
> creating the hash in memory

On the related subject of creating nice PDFs:
we have been using webkit for that for the last few years,
we create many thousands a day,
and we are very happy with the results.

Webkit interprets HTML with decent CSS support,
which makes it really easy to generate the source
from which the PDF will be created.
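
(The post doesn't name a specific tool; purely as a hypothetical
example, wkhtmltopdf is one webkit-based renderer that can be driven
from Perl:)

# hypothetical: render a generated HTML file to PDF with
# wkhtmltopdf, one webkit-based HTML-to-PDF tool
system( 'wkhtmltopdf', 'record.html', 'record.pdf' ) == 0
    or die "wkhtmltopdf failed: $?";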

-- 
Ruud


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending perl questions to the -request address, I don't have time to
answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 2977
***************************************

