[31812] in Perl-Users-Digest


Perl-Users Digest, Issue: 3075 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Fri Aug 13 21:09:26 2010

Date: Fri, 13 Aug 2010 18:09:08 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Fri, 13 Aug 2010     Volume: 11 Number: 3075

Today's topics:
        Appropriate technique for altering a text file? <cartercc@gmail.com>
    Re: Appropriate technique for altering a text file? <uri@StemSystems.com>
    Re: Appropriate technique for altering a text file? <hjp-usenet2@hjp.at>
    Re: Appropriate technique for altering a text file? <hjp-usenet2@hjp.at>
    Re: Appropriate technique for altering a text file? <uri@StemSystems.com>
    Re: Appropriate technique for altering a text file? <tadmc@seesig.invalid>
    Re: Appropriate technique for altering a text file? <cartercc@gmail.com>
    Re: Appropriate technique for altering a text file? sln@netherlands.com
    Re: avoiding min and max <rvtol+usenet@xs4all.nl>
    Re: avoiding min and max <tzz@lifelogs.com>
    Re: FAQ 5.23 AND: Perl the latest vs Perl the gratest <nospam-abuse@ilyaz.org>
    Re: FAQ 5.23 I still don't get locking. I just want to  <tzz@lifelogs.com>
    Re: FAQ 8.44 How do I tell the difference between error <rvtol+usenet@xs4all.nl>
    Re: sysopen failures <hjp-usenet2@hjp.at>
        Why this warning? <here@softcom.net>
    Re: Why this warning? <spamtrap@piven.net>
    Re: Why this warning? <ben@morrow.me.uk>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Fri, 13 Aug 2010 10:08:48 -0700 (PDT)
From: ccc31807 <cartercc@gmail.com>
Subject: Appropriate technique for altering a text file?
Message-Id: <f0988029-196d-4990-ba31-d1b2f4ec830a@t2g2000yqe.googlegroups.com>

During the discussion of the 9-11 mosque in NYC, several commentators
mentioned Milestones by Sayed Qutb. I decided to read it to see what
the fuss was about, and ended up with an ASCII text copy generated from
a PDF original.

I could have printed the text directly, but it was pretty mangled, and
after attempting and failing to reformat the document using vi, I
decided to write a simple Perl script to reformat it. I wanted to do
several things: join paragraphs together (every line in the file was
terminated by a "\n"), separate paragraphs by a blank line (block
style), remove repeated periods (dots), remove form feeds (which
marked pages in the original), etc.
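
As a rough illustration, the cleanup steps listed above might look like
this in Perl. This is only a sketch: the helper name reformat_text is
made up here, and it assumes that paragraphs in the extracted text are
already separated by at least one empty line (or a form feed on its own
line), which the post does not actually say.

```perl
use strict;
use warnings;

# Hypothetical helper: clean up text extracted from a PDF.
# Assumption: paragraphs in the raw text are separated by empty lines.
sub reformat_text {
    my ($text) = @_;
    $text =~ s/\r//g;              # drop carriage returns
    $text =~ s/\f//g;              # drop form feeds (page markers)
    $text =~ s/\.{2,}/./g;         # collapse runs of periods into one
    my @paras = grep { /\S/ } split /\n\s*\n/, $text;
    for my $p (@paras) {
        $p =~ s/\s*\n\s*/ /g;      # join the lines of one paragraph
        $p =~ s/^\s+|\s+$//g;      # trim leading and trailing whitespace
    }
    return join("\n\n", @paras) . "\n";   # block style: blank line between paragraphs
}

print reformat_text("First line\nof the first.....\nparagraph.\n\f\nSecond\nparagraph.\n");
```

Working on one string keeps each transformation to a single
substitution, which is the appeal of the whole-file approach.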

I first attempted to munge the file in place, like this:
#FIRST ATTEMPT
open MS, '<', $file;
open OUT, '>', $out;
while (<MS>)
{
 #do stuff
 print OUT;
}
close MS;
close OUT;

It mostly worked, but I couldn't fine tune it.  I then attempted to
munge two lines together, like this:
#SECOND ATTEMPT
open MS, '<', $file;
open OUT, '>', $out;
$line1 = <MS>;
while (<MS>)
{
 $line2 = $_;
#do stuff
 print OUT;
  $line1 = $line2;  # remember this line for the next iteration
}
close MS;
close OUT;

This worked a little better, but it wasn't perfect. I then tried this
and got perfect formatting:
#THIRD ATTEMPT
{
  local $/ = undef;
  open MS, '<', $file;
  $document = <MS>;
  close MS;
}
#series of transformations like this
$document =~ s/\r//g;  # /g so every carriage return is removed, not just the first
open OUT, '>', $out;
print OUT $document;
close OUT;

All of the work I have done in the past has munged the lines one by
one, as in the first example. Occasionally, I have had to use the
second style (e.g., where the formatting of each line depends on the
content of the preceding line.) I've never used the third style at
all.

I liked the third way a lot. It seemed quick, easy, and worked
perfectly. I was actually able to open the resulting document in Word,
fancify it a little, and print a nice finished copy. However, I can't
think of any actual uses of the third style in my day to day work.

My question is this: Is the third attempt, slurping the entire
document into memory and transforming the text by regexs, very common,
or is it considered a last resort when nothing else would work?

CC.


------------------------------

Date: Fri, 13 Aug 2010 13:29:42 -0400
From: "Uri Guttman" <uri@StemSystems.com>
Subject: Re: Appropriate technique for altering a text file?
Message-Id: <871va2ctfd.fsf@quad.sysarch.com>

>>>>> "c" == ccc31807  <cartercc@gmail.com> writes:

  c> This worked a little better, but it wasn't perfect. I then tried this
  c> and got perfect formatting:
  c> #THIRD ATTEMPT
  c> {
  c>   local $/ = undef;
  c>   open MS, '<', $file;
  c>   $document = <MS>;
  c>   close MS;
  c> }

  c> All of the work I have done in the past has munged the lines one by
  c> one, as in the first example. Occasionally, I have had to use the
  c> second style (e.g., where the formatting of each line depends on the
  c> content of the preceding line.) I've never used the third style at
  c> all.

it isn't as common as it should be IMNSHO. in the old days reading files
line by line was almost required due to small-memory machines. today,
megabyte files can be slurped without fear at all but line by line is
still taught as standard. it takes time to change views.

  c> I liked the third way a lot. It seemed quick, easy, and worked
  c> perfectly. I was actually able to open the resulting document in
  c> Word, fancify it a little, and print a nice finished copy. However,
  c> I can't think of any actual uses of the third style in my day to
  c> day work.

parsing and text munging is much easier when the entire file is in
ram. there is no need to mix i/o with logic, the i/o is much faster, you
can send/receive whole documents to servers (which could format things
or whatever), etc. slurping whole files makes a lot of sense in many
areas.

  c> My question is this: Is the third attempt, slurping the entire
  c> document into memory and transforming the text by regexs, very common,
  c> or is it considered a last resort when nothing else would work?

it is not a last resort by any stretch of the imagination today. and use
File::Slurp instead for both reading and writing the file. it is cleaner
and faster than the methods you used.
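
For readers without File::Slurp installed, here is a minimal core-Perl
sketch of the same read-everything / write-everything idea. The helper
names slurp_file and spew_file are invented for this example and belong
to no module; File::Slurp itself provides read_file and write_file for
this purpose.

```perl
use strict;
use warnings;

# Hypothetical helper: return the whole file as one string.
sub slurp_file {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!";
    local $/;                      # undef record separator: read everything at once
    my $content = <$in>;
    close $in;
    return $content;
}

# Hypothetical helper: write one string out as the whole file.
sub spew_file {
    my ($path, $content) = @_;
    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} $content;
    close $out or die "Cannot close $path: $!";
    return;
}
```

With helpers like these, the i/o sits at the two ends of the script and
everything in between is plain string handling.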

uri

-- 
Uri Guttman  ------  uri@stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------


------------------------------

Date: Fri, 13 Aug 2010 20:14:08 +0200
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Appropriate technique for altering a text file?
Message-Id: <slrni6b2rg.a51.hjp-usenet2@hrunkner.hjp.at>

On 2010-08-13 17:08, ccc31807 <cartercc@gmail.com> wrote:
[ 3 ways of munging a text file: line by line, pairs of lines,
  and whole file at once
]

> I liked the third way a lot. It seemed quick, easy, and worked
> perfectly. I was actually able to open the resulting document in Word,
> fancify it a little, and print a nice finished copy. However, I can't
> think of any actual uses of the third style in my day to day work.
>
> My question is this: Is the third attempt, slurping the entire
> document into memory and transforming the text by regexs, very common,
> or is it considered a last resort when nothing else would work?

Uri would probably tell you that's what you always should do unless the
file is too big to fit into memory (and you should use File::Slurp for
it) :-).

I do whatever allows the most straightforward implementation. Very
often that means reading the whole data into memory, although not
necessarily as a single scalar.
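
One way to hold the whole file in memory without using a single scalar
is to read it as a list of paragraphs. The sketch below is an
illustration only; read_paragraphs is a hypothetical name, and it relies
on Perl's paragraph mode (setting $/ to the empty string).

```perl
use strict;
use warnings;

# Hypothetical helper: read a file into a list of paragraphs.
sub read_paragraphs {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!";
    local $/ = "";                 # paragraph mode: records end at blank lines
    my @paras = <$in>;
    close $in;
    chomp @paras;                  # in paragraph mode this strips all trailing newlines
    return @paras;
}
```

Each element can then be munged on its own, which sits somewhere between
the line-by-line and whole-file styles.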

	hp



------------------------------

Date: Fri, 13 Aug 2010 20:42:12 +0200
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Appropriate technique for altering a text file?
Message-Id: <slrni6b4g4.ldg.hjp-usenet2@hrunkner.hjp.at>

On 2010-08-13 18:14, Peter J. Holzer <hjp-usenet2@hjp.at> wrote:
> Uri would probably tell you [...]

I didn't see Uri's answer before I posted this. I swear!  :-)

	hp


------------------------------

Date: Fri, 13 Aug 2010 14:48:54 -0400
From: "Uri Guttman" <uri@StemSystems.com>
Subject: Re: Appropriate technique for altering a text file?
Message-Id: <87d3tm9wmh.fsf@quad.sysarch.com>

>>>>> "PJH" == Peter J Holzer <hjp-usenet2@hjp.at> writes:

  PJH> On 2010-08-13 18:14, Peter J. Holzer <hjp-usenet2@hjp.at> wrote:
  >> Uri would probably tell you [...]

  PJH> I didn't see Uri's answer before I posted this. I swear!  :-)

great minds. :)

uri

-- 
Uri Guttman  ------  uri@stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------


------------------------------

Date: Fri, 13 Aug 2010 14:18:23 -0500
From: Tad McClellan <tadmc@seesig.invalid>
Subject: Re: Appropriate technique for altering a text file?
Message-Id: <slrni6b6dg.bhe.tadmc@tadbox.sbcglobal.net>

Uri Guttman <uri@StemSystems.com> wrote:
>>>>>> "PJH" == Peter J Holzer <hjp-usenet2@hjp.at> writes:
>
>  PJH> On 2010-08-13 18:14, Peter J. Holzer <hjp-usenet2@hjp.at> wrote:
>  >> Uri would probably tell you [...]
>
>  PJH> I didn't see Uri's answer before I posted this. I swear!  :-)
>
> great minds. :)


yes, but why were you and Peter both thinking the same thing?

:-)  :-)


-- 
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.


------------------------------

Date: Fri, 13 Aug 2010 13:07:36 -0700 (PDT)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: Appropriate technique for altering a text file?
Message-Id: <94dec0dd-55c9-474e-b461-24b1b401bdbd@j8g2000yqd.googlegroups.com>

On Aug 13, 1:29 pm, "Uri Guttman" <u...@StemSystems.com> wrote:
> parsing and text munging is much easier when the entire file is in
> ram. there is no need to mix i/o with logic, the i/o is much faster, you
> can send/receive whole documents to servers (which could format things
> or whatever), etc. slurping whole files makes a lot of sense in many
> areas.

Most of what I do requires me to treat each record as a separate
'document.' In many cases, this even extends to the output, where one
input document results in hundreds of separate output documents, each
of which must be opened, written to, and closed.

I'm not being difficult (or maybe I am) but I'm having a hard time
seeing how this kind of logic which treats each record separately:

while (<IN>)
{
  chomp;
  my ($var1, $var2, ... $varn) = split;
  #do stuff
  print OUT qq("$field1","$field2",..."$fieldn"\n);
}

or this:

foreach my $key (sort keys %{$hashref})
{
  #do stuff using $hashref->{$key}{var1}, $hashref->{$key}{var2}, etc.
  print OUT qq("$field1","$field2",..."$fieldn"\n);
}

could be made easier by dealing with the entire file at once.

Okay, this is the first time I have had to treat a single file as a
unit, and to be honest the experience was positive. Still, my
worldview consists of record oriented datasets, so I put this in my
nice-to-know-but-not-particularly-useful category.

CC.


------------------------------

Date: Fri, 13 Aug 2010 17:56:08 -0700
From: sln@netherlands.com
Subject: Re: Appropriate technique for altering a text file?
Message-Id: <1sob669he9hae4c2mrnlmvp8ssfg4hehcu@4ax.com>

On Fri, 13 Aug 2010 10:08:48 -0700 (PDT), ccc31807 <cartercc@gmail.com> wrote:

>My question is this: Is the third attempt, slurping the entire
>document into memory and transforming the text by regexs, very common,
>or is it considered a last resort when nothing else would work?
>

The answer is no to slurping, and no to using regexes on large
documents that don't need to be all in memory.

There is usually a single drive (say, a RAID array). Only one
I/O operation is performed at a time. If one process hogs the drive,
the others wait until the hog is done and their own I/O requests are
dequeued, completed, and returned.
Modern SATA II, RAID-configured drives work well when reading and
writing incremental data, so large data that can be processed
incrementally should always be handled that way.
The default read buffer between the API and the device is usually
small, so as not to clog up device I/O and spin locks. So it's still
going to be incremental.

A complex regex will do more backtracking on large
data than on smaller data. So it depends on the type and complexity.

The third reason is always memory. Sure, there is a lot of memory,
but hogging it all bogs down background file caching and other processing.




------------------------------

Date: Fri, 13 Aug 2010 18:34:01 +0200
From: "Dr.Ruud" <rvtol+usenet@xs4all.nl>
Subject: Re: avoiding min and max
Message-Id: <4c6573fa$0$22945$e4fe514c@news.xs4all.nl>

Ben Morrow wrote:

> (Now, of course, I'm wondering about a '<=' operator, which isn't the
> usual 'less-than-or-equal' operator but instead a compare-and-assign
> operator like ||=... It's hard to see what it might sensibly be called.)


    $x = min( $x, $y, $z, ... );

    $x min= $y, $z, ...;

    $x->min( $y, $z, ...);


Of course you often have a running minimum:

    $x = min( $x, $min );

    $x min= $min;

    $x->min( $min );
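

Of the three spellings above, only the first is ordinary Perl 5; the
min= and ->min forms are not built into the language. A minimal sketch
of the running minimum using min from the core List::Util module
follows. The function name running_min and the sample values are made
up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: fold a list down to its smallest value,
# one comparison at a time, the way a running minimum is kept.
sub running_min {
    my @values = @_;
    my $lowest = shift @values;    # seed with the first value
    for my $value (@values) {
        $lowest = min($lowest, $value);
    }
    return $lowest;
}

print running_min(7, 3, 9, 1, 4), "\n";   # prints 1
```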


-- 
Ruud


------------------------------

Date: Fri, 13 Aug 2010 14:55:04 -0500
From: Ted Zlatanov <tzz@lifelogs.com>
Subject: Re: avoiding min and max
Message-Id: <8739uib84n.fsf@lifelogs.com>

On Fri, 13 Aug 2010 18:34:01 +0200 "Dr.Ruud" <rvtol+usenet@xs4all.nl> wrote: 

R>    $x = min( $x, $y, $z, ... );

R>    $x min= $y, $z, ...;

R>    $x->min( $y, $z, ...);

The second one is best.

But I think it's much better to treat this as a general stats problem.
_ caches the stat call, so why not cache list stats too if requested?

cache_stats(@list, qw/min avg median stdev/);

min(@list); # fast

push @list, min(@list) -1;

min(@list); # automatically updated

max(@list); # slow, not cached

This is not so good if you have lots of small updates, that's why it
should be optional.  I don't think it can be done in Perl 5.

Ted


------------------------------

Date: Fri, 13 Aug 2010 21:11:33 +0000 (UTC)
From: Ilya Zakharevich <nospam-abuse@ilyaz.org>
Subject: Re: FAQ 5.23 AND: Perl the latest vs Perl the gratest
Message-Id: <slrni6bd85.bf4.nospam-abuse@powdermilk.math.berkeley.edu>

On 2010-08-13, Ted Zlatanov <tzz@lifelogs.com> wrote:
>>> I disagree.  The FAQ should target the latest Perl.
>
> DC> Why?
>
> I think it's pretty self-evident

No.

> and brian explained it better.

No, he did not.

There are two principal modes of writing Perl: (generalized)
one-liners, and reusable code.

"One-liners" address a particular problem in a particular environment.
I have no problems when my one-liners use features specific to the
particular version of Perl.

When I write reusable code, I'm interested in balancing "the lowest common
denominator" vs "annoyance factor with coding to rare flavors of Perl".

I'd be more interested in consulting the FAQ (and other docs) when I
operate in the second ("reusable") mode.  So I'm much more interested
in the FAQ focusing on the "current pool" of deployed Perl than in its
focusing on the latest version.

Yours,
Ilya


------------------------------

Date: Fri, 13 Aug 2010 08:59:09 -0500
From: Ted Zlatanov <tzz@lifelogs.com>
Subject: Re: FAQ 5.23 I still don't get locking. I just want to increment the number in the file. How can I do this?
Message-Id: <87lj8aehqq.fsf@lifelogs.com>

On Thu, 12 Aug 2010 22:29:53 +0000 (UTC) dmcanzi@remulak.uwaterloo.ca (David Canzi) wrote: 

DC> In article <87eie3imbs.fsf@lifelogs.com>, Ted Zlatanov  <tzz@lifelogs.com> wrote:
>> I disagree.  The FAQ should target the latest Perl.

DC> Why?

I think it's pretty self-evident and brian explained it better.  But to
be clear, "target" doesn't mean "ignore anything but."

On Fri, 13 Aug 2010 12:02:43 +0200 brian d foy <brian.d.foy@gmail.com> wrote: 

bdf> The FAQ will always be for the version of Perl it comes with. If you
bdf> have an older Perl, you can use the FAQ that came with it. 

Right.  So it's counter-productive to have FAQ answers that ignore
available features, especially if it makes the answers more concise.
`autodie' is such a feature: it makes the answer better while reducing
the amount of unnecessary "... or die ..." code.
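
To make the comparison concrete, here is a small sketch of the same
operation written both ways. It assumes a Perl recent enough to bundle
the autodie pragma; the two helper names are hypothetical and the code
is not taken from the FAQ.

```perl
use strict;
use warnings;

# Classic style: every call carries its own error check.
sub append_line_checked {
    my ($path, $line) = @_;
    open my $out, '>>', $path or die "Cannot open $path: $!";
    print {$out} "$line\n";
    close $out or die "Cannot close $path: $!";
    return;
}

# autodie style: open and close throw an exception on failure.
sub append_line_autodie {
    my ($path, $line) = @_;
    use autodie qw(open close);    # lexically scoped to this sub
    open my $out, '>>', $path;
    print {$out} "$line\n";
    close $out;
    return;
}
```

Both versions die on a bad path; the second just leaves the checking to
the pragma.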

bdf> However, I'm not extremely motivated to change all the code in the FAQ
bdf> just to use the new features when a backward compatible version of the
bdf> code is good enough. That might not apply to this particular answer,
bdf> though.

Yeah, since someone went and did "... or die ..." everywhere it may as
well be done the simple way :)

On Fri, 13 Aug 2010 02:26:21 +0100 Ben Morrow <ben@morrow.me.uk> wrote: 

BM> I rather suspect autodie isn't widely enough used yet to count as 'best
BM> practice', but I think it ought to become so.

Usage in the FAQ is a good way to popularize it.  It gives people a
clear reason to use newer Perls.

Ted


------------------------------

Date: Fri, 13 Aug 2010 19:55:41 +0200
From: "Dr.Ruud" <rvtol+usenet@xs4all.nl>
Subject: Re: FAQ 8.44 How do I tell the difference between errors from the shell and perl?
Message-Id: <4c65871e$0$22944$e4fe514c@news.xs4all.nl>

brian d foy wrote:

> A better answer probably changes the verbs instead of the nouns.

You flirt!

-- 
Ruud


------------------------------

Date: Fri, 13 Aug 2010 20:32:26 +0200
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: sysopen failures
Message-Id: <slrni6b3tr.a51.hjp-usenet2@hrunkner.hjp.at>

On 2010-08-12 22:31, C.DeRykus <derykus@gmail.com> wrote:
> On Aug 12, 1:03 pm, "Peter J. Holzer" <hjp-usen...@hjp.at> wrote:
>> On 2010-08-12 19:01, C.DeRykus <dery...@gmail.com> wrote:
>> > It's not clear to me what O_EXCL will do though
>> > since the files are remote and the OP's doc say:
>>
>> >    "O_EXCL" may not work on network filesystems ...
>>
>> > This makes me wonder if "not work" might well
>> > manifest  as a flurry of false positives for
>> > an error at times but work normally at other
>> > times.
>>
>> I already covered this in my first posting in this thread. If you can
>> add anything substantial to that, please do.
>>
>
> I must've missed it.

<slrni5qj95.jq5.hjp-usenet2@hrunkner.hjp.at>

	hp



------------------------------

Date: Fri, 13 Aug 2010 17:16:21 -0700 (PDT)
From: Sal <here@softcom.net>
Subject: Why this warning?
Message-Id: <a4b21696-181a-4739-8d86-d3985e5572b0@x24g2000pro.googlegroups.com>

#!/usr/bin/perl

use strict;
use warnings;

my %sum = {};
for (my $i = 1; $i <= 6; $i++) {
  for (my $j = 1; $j <= 6; $j++) {
    for (my $k = 1; $k <= 6; $k++) {
      my $tot = $i+$j+$k;
      my $key = "$i " . "$j " . "$k ";
      $sum{$key} = $tot;
      print "$i " . "$j " . "$k " . "  $tot\n";
    }
  }
}

foreach my $key (sort keys %sum) {
  print "$key => $sum{$key}\n";
}

When the above is executed it first prints the entire hash, then
returns the error:

Use of uninitialized value $sum{"HASH(0x95fe818)"} in concatenation
(.) or string at ./3dice.pl line 19.
HASH(0x95fe818) =>

Why is the last hash value blank?



------------------------------

Date: Fri, 13 Aug 2010 19:37:59 -0500
From: Don Piven <spamtrap@piven.net>
Subject: Re: Why this warning?
Message-Id: <XZmdnZNGdJzBePjRnZ2dnUVZ_uKdnZ2d@speakeasy.net>

On 08/13/2010 07:16 PM, Sal wrote:
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> my %sum = {};
> [code omitted]
>
> foreach my $key (sort keys %sum) {
>    print "$key =>  $sum{$key}\n";
> }
>
> When the above is executed it first prints the entire hash, then
> returns the error:
>
> Use of uninitialized value $sum{"HASH(0x95fe818)"} in concatenation
> (.) or string at ./3dice.pl line 19.
> HASH(0x95fe818) =>
>
> Why is the last hash value blank?

The line "my %sum = {};" doesn't initialize %sum to an empty hash.  {} 
returns a reference to an empty hash, so that reference, stringified, 
becomes a key in %sum and, since no value is provided in %sum's 
initializer list, that key's value becomes undef.

You could just declare %sum without an initializer, or initialize it 
with an empty list: "my %sum = ();".
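
A small sketch that reproduces the effect in isolation may help. The
function name stray_keys is made up for this example, and the 'misc'
warning category is switched off inside it only to keep the
demonstration quiet.

```perl
use strict;
use warnings;

# Hypothetical helper: show what assigning {} to a hash really stores.
sub stray_keys {
    no warnings 'misc';            # hide the odd-sized-list warning for the demo
    my %bad = {};                  # one stringified hashref as key, undef as value
    return keys %bad;
}

my %good = ();                     # empty list: the hash has no keys
my @stray = stray_keys();

print "good has ", scalar(keys %good), " keys\n";
print "bad has ", scalar(@stray), " key: $stray[0]\n";
```

The stray key looks like HASH(0x...), with an address that varies from
run to run, which matches the extra line in the output quoted above.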

Don



------------------------------

Date: Sat, 14 Aug 2010 01:41:25 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Why this warning?
Message-Id: <la5hj7-bbi1.ln1@osiris.mauzo.dyndns.org>


Quoth Sal <here@softcom.net>:
> #!/usr/bin/perl
> 
> use strict;
> use warnings;
> 
> my %sum = {};

This line is wrong. {} creates a new anonymous hashref; this is then
stringified and inserted into the hash as a key with no value.
Effectively this line is equivalent to

    my %sum = ("HASH(0xDEADBEEF)" => undef);

If you want an empty hash, just say

    my %sum;

Ben



------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 3075
***************************************

