[22524] in Perl-Users-Digest


Perl-Users Digest, Issue: 4745 Volume: 10

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sat Mar 22 14:10:37 2003

Date: Sat, 22 Mar 2003 11:10:10 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Sat, 22 Mar 2003     Volume: 10 Number: 4745

Today's topics:
    Re: regexp question (Tad McClellan)
    Re: regexp question (Tad McClellan)
    Re: regexp question <mike@luusac.co.uk>
    Re: regexp question <jurgenex@hotmail.com>
    Re: regexp question <mike@luusac.co.uk>
    Re: regexp question <mbudash@sonic.net>
    Re: regexp question <user@nospam.xxx>
    Re: regexp question <mbudash@sonic.net>
    Re: regexp's to rip html file <dorward@yahoo.com>
    Re: regexp's to rip html file (Tad McClellan)
    Re: regexp's to rip html file <jurgenex@hotmail.com>
    Re: Text::ParseWords or Text::CSV (david)
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Sat, 22 Mar 2003 06:20:30 -0600
From: tadmc@augustmail.com (Tad McClellan)
Subject: Re: regexp question
Message-Id: <slrnb7ol8e.4pv.tadmc@magna.augustmail.com>

Michael Budash <mbudash@sonic.net> wrote:
> In article <oUOea.405$N73.4421@newsfep4-glfd.server.ntli.net>,
>  "Mike" <mike@luusac.co.uk> wrote:
>> 
>> I am trying to read in a file line by line and then print / assign to an
>> array the line following one matched by a regexp, ie

>> So that I know the content I want will always be preceded by a line which
>> ends in a '#'.


> while (<FILE>) {
>    if (/#$/) {
>       <FILE>;
>       print;
>    }
> }


But that will print the line ending with #, not the following line,
since that code reads and discards the line of interest. Assign the
result of the read to $_ instead:

   if ( /#$/ ) {
      $_ = <FILE>;
      print;
   }
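
A self-contained sketch of the same idea, for readers who want to try it
without a data file. The helper name lines_after_hash and the sample data
are made up for illustration; the thread does not show the real input.
The sketch assumes that a line consumed as "the following line" is never
itself treated as a marker.

```perl
use strict;
use warnings;

# Return every line that directly follows a line ending in '#'.
# $fh is any readable filehandle (hypothetical helper, not from the thread).
sub lines_after_hash {
    my ($fh) = @_;
    my @found;
    while (my $line = <$fh>) {
        next unless $line =~ /#$/;      # marker line: ends in '#'
        my $next = <$fh>;               # read the line of interest
        last unless defined $next;      # marker was the last line of input
        chomp $next;
        push @found, $next;
    }
    return @found;
}

# Made-up sample data held in an in-memory file.
my $sample = "header #\nfirst wanted\nnoise\nmore #\nsecond wanted\n";
open my $in, '<', \$sample or die "cannot open in-memory file: $!";
print "$_\n" for lines_after_hash($in);
close $in;
```

Using a named lexical ($line, $next) instead of $_ makes it harder to
read a line and silently throw it away, which was the bug in the quoted
code.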


-- 
    Tad McClellan                          SGML consulting
    tadmc@augustmail.com                   Perl programming
    Fort Worth, Texas


------------------------------

Date: Sat, 22 Mar 2003 06:46:54 -0600
From: tadmc@augustmail.com (Tad McClellan)
Subject: Re: regexp question
Message-Id: <slrnb7ompu.544.tadmc@magna.augustmail.com>

Kostis P <kp@iwerx.com> wrote:

> This one 


Does not work.


> reads from STDIN but you can modify it to open a file and read 
> from there instead.
> 
> while (my $line = <>) {
>      if ($1 eq '#') {


$1 has never been set to any value! It is undef.

If you meant $line instead of $1, then the if() condition will
_never_ be true! You at least need a chomp() before the eq test.

If you meant 

   $line eq "#\n"

then it does not do what the OP said he wants done.


>         print "I got it: $line";
>      }
>      $line =~ /(.)\n$/;
> }
> 
> This script assumes that you are on a unix system 


No it doesn't.


> since the end of each 
> line is the "new line" special character.
> 
> It will not work unless 


It will not work regardless of the line endings used.


> the last character is really a new line 


But if all of the other problems were repaired, it would work
fine on non-unix systems.

backslash-n is a _logical_ line-end, it works the same in all
perls, regardless of platform.
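
A minimal sketch of one way to make the test independent of the line
ending actually stored in the file. When a file with CRLF line endings is
read on a Unix system without translation, each line still ends in "\n",
but a "\r" sits in front of it, so "the character before the newline" is
not '#'. The helper names ends_in_hash and print_following are
hypothetical, and the flag is kept in an ordinary variable so the code
does not depend on how long $1 keeps its value.

```perl
use strict;
use warnings;

# True if the line ends in '#', ignoring an LF or CRLF line ending.
sub ends_in_hash {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;               # drop "\n" or "\r\n" if present
    return $line =~ /#\z/ ? 1 : 0;
}

# Print each line that follows a line ending in '#'.
sub print_following {
    my ($fh) = @_;
    my $prev_was_marker = 0;
    while (my $line = <$fh>) {
        print "I got it: $line" if $prev_was_marker;
        $prev_was_marker = ends_in_hash($line);
    }
}

# Made-up CRLF sample data held in an in-memory file.
my $crlf = "marker #\r\nwanted line\r\nother\r\n";
open my $in, '<', \$crlf or die "cannot open in-memory file: $!";
print_following($in);
close $in;
```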



[snip TOFU, Please do not post TOFU ]

-- 
    Tad McClellan                          SGML consulting
    tadmc@augustmail.com                   Perl programming
    Fort Worth, Texas


------------------------------

Date: Sat, 22 Mar 2003 15:42:58 -0000
From: "Mike" <mike@luusac.co.uk>
Subject: Re: regexp question
Message-Id: <NA%ea.394$842.235@newsfep4-winn.server.ntli.net>

Hi,

the following works fine :

#!/usr/local/bin/perl

open (IN, '< data.txt');
$contents = <IN>;
open (OUT, '> out.txt');
$out = <OUT>;

while (<IN>) {
  if (/ID\: (\d\d\d\d)/) {
     print OUT $1;
  }

 if (/#$/) {
     print OUT scalar <IN>;
  }
}

close (IN);
close (OUT);

But what I want is to assign the contents to a hash (which I understand is a
2 dimensional array in perl ?).

I have tried:

open(my $in,'<','data') or die "....: $!";

until ( eof $in ) {
while ( <$in> ) {

last if /ID\: (\d\d\d\d)/;
}
chomp;
my $key = $_; # save the key match
while ( <$in> ) { # search second pattern
last if /#$/;
}
chomp;
$hash{$key} = $_; # enter result into hash
}
close $in;

while (($k, $v) = each %hash) {print "$k=>$v,\n "};

This will print the entire line containing the matches.  It also behaves in
a different way to the first method in that it will print the line which
ends with a '#' instead of the line after it (it is the line after it that I
want).

Mike




------------------------------

Date: Sat, 22 Mar 2003 15:53:00 GMT
From: "Jürgen Exner" <jurgenex@hotmail.com>
Subject: Re: regexp question
Message-Id: <wF%ea.15046$IM3.7673@nwrddc03.gnilink.net>

Mike wrote:
> But what I want is to assign the contents to a hash (which I
> understand is a 2 dimensional array in perl ?).

No, it's not. An array is a mapping from natural numbers to scalars. A hash
is a mapping from arbitrary strings to scalars.
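
A short illustration of the difference, with made-up names and values.
It also shows that a numeric-looking ID is a perfectly good hash key,
because the key is used as a string.

```perl
use strict;
use warnings;

# An array maps non-negative integer positions to scalars.
my @names = ('alpha', 'beta', 'gamma');
print "$names[1]\n";                    # element at index 1

# A hash maps string keys to scalars. A four-digit ID works as a key
# even though it looks like a number; it is simply used as a string.
my %text_for = ('0042' => 'first record', '1234' => 'second record');
print "$text_for{'0042'}\n";

# The keys '0042' and '42' are different strings, so this finds nothing.
print "no entry for 42\n" unless exists $text_for{42};
```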

jue




------------------------------

Date: Sat, 22 Mar 2003 17:01:52 -0000
From: "Mike" <mike@luusac.co.uk>
Subject: Re: regexp question
Message-Id: <LK0fa.439$842.133@newsfep4-winn.server.ntli.net>

Well the ID that I am trying to match (in the first regexp) is numerical,
but no calculations are done with it - it just acts as an identifier of a
string, so is a hash what I want ?

Mike

"Jürgen Exner" <jurgenex@hotmail.com> wrote in message
news:wF%ea.15046$IM3.7673@nwrddc03.gnilink.net...
> Mike wrote:
> > But what I want is to assign the contents to a hash (which I
> > understand is a 2 dimensional array in perl ?).
>
> No, it's not. An array is a mapping from natural numbers to scalars. A
> hash is a mapping from arbitrary strings to scalars.
>
> jue
>
>




------------------------------

Date: Sat, 22 Mar 2003 17:23:30 GMT
From: Michael Budash <mbudash@sonic.net>
Subject: Re: regexp question
Message-Id: <mbudash-3E18A0.09233222032003@typhoon.sonic.net>

In article <slrnb7ol8e.4pv.tadmc@magna.augustmail.com>,
 tadmc@augustmail.com (Tad McClellan) wrote:

> Michael Budash <mbudash@sonic.net> wrote:
> > In article <oUOea.405$N73.4421@newsfep4-glfd.server.ntli.net>,
> >  "Mike" <mike@luusac.co.uk> wrote:
> >> 
> >> I am trying to read in a file line by line and then print / assign to an
> >> array the line following one matched by a regexp, ie
> 
> >> So that I know the content I want will always be preceded by a line which
> >> ends in a '#'.
> 
> 
> > while (<FILE>) {
> >    if (/#$/) {
> >       <FILE>;
> >       print;
> >    }
> > }
> 
> 
> But that will print the line ending with # not the following line,
> since that code reads and discards the line of interest:
> 
>    if ( /#$/ ) {
>       $_ = <FILE>;
>       print;
>    }

yes, my oversight, as another poster has already noted...


------------------------------

Date: Sat, 22 Mar 2003 18:52:02 +0100
From: Kostis P <user@nospam.xxx>
Subject: Re: regexp question
Message-Id: <3e7ca368$1_1@news.bluewin.ch>

Tad McClellan wrote:
> Kostis P <kp@iwerx.com> wrote:
> 
> 
>>This one 
> 
> 
> 
> Does not work.
Yes it does. Have you tried it/looked at it carefully?

> $1 has never been set to any value! It is undef.
The $1 will be undef for the first loop iteration, but the following line,
i.e. $line =~ /(.)\n$/;, will set it to the character preceding the \n 
character, and on the next iteration it will be '#' if the previous line 
matched the criteria.

> 
> If you meant $line instead of $1, then the if() condition will
> _never_ be true! You at least need a chomp() before the eq test.
> 
> If you meant 
> 
>    $line eq "#\n"
No I did not mean that and that is why I didn't write that.

> 
> then it does not do what the OP said he wants done.
Yours doesn't. Mine does.

> 
> 
> 
>>        print "I got it: $line";
>>     }
>>     $line =~ /(.)\n$/;
>>}
>>
>>This script assumes that you are on a unix system 

> No it doesn't.

Is that so? What if the file was created by a windows editor writing to 
a samba filesystem or a windows text file copied to a unix filesystem?


Regards...



------------------------------

Date: Sat, 22 Mar 2003 18:27:13 GMT
From: Michael Budash <mbudash@sonic.net>
Subject: Re: regexp question
Message-Id: <mbudash-0C7331.10271322032003@typhoon.sonic.net>

In article <LK0fa.439$842.133@newsfep4-winn.server.ntli.net>,
 "Mike" <mike@luusac.co.uk> wrote:

> "Jürgen Exner" <jurgenex@hotmail.com> wrote in message
> news:wF%ea.15046$IM3.7673@nwrddc03.gnilink.net...
> >
> > Mike wrote:
> > >
> > > But what I want is to assign the contents to a hash (which I
> > > understand is a 2 dimensional array in perl ?).
> >
> > No, it's not. An array is a mapping from natural numbers to scalars. A
> > hash is a mapping from arbitrary strings to scalars.
> 
> Well the ID that I am trying to match (in the first regexp) is numerical,
> but no calculations are done with it - it just acts as an identifier of a
> string, so is a hash what I want ?

who knows? you started this thread with one idea, now you've moved on to 
another. please re-state the problem along with a minimally complete 
example of the pertinent data.

without seeing your data, based on your comments, i'm gonna go out on a limb 
and propose that this code is [at least closer to] what you want:

use strict;

my %hash;

open (D,'data') or die "....: $!";
while (<D>) {
    if (/ID\: (\d\d\d\d)/) {
        my $key = $1; # save the key match
        while (<D>) { # search second pattern
            if (/#$/) {
                chomp (my $val = scalar <D>);
                $hash{$key} = $val; # enter result into hash
                last;
            }
        }
    }
}
close D;

while (my ($k, $v) = each %hash) {
    print "$k=>$v\n";
}

# or print the hash sorted:
foreach (sort keys %hash) {
    print "$_=>$hash{$_}\n";
}

hope this helps
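
A variant of the same approach that can be run without a data file. The
sample records are invented, since the thread never shows the real data,
and index_records is a hypothetical helper name. It differs from the code
above in one respect: it uses a single loop, so if a second ID line turns
up before any '#' line, the later ID wins.

```perl
use strict;
use warnings;

# Build a hash of ID => the line following the next '#'-terminated line.
sub index_records {
    my ($fh) = @_;
    my (%hash, $key);
    while (my $line = <$fh>) {
        if ($line =~ /ID: (\d{4})/) {
            $key = $1;                  # remember the most recent ID
        }
        elsif (defined $key && $line =~ /#$/) {
            my $val = <$fh>;            # the line after the marker
            last unless defined $val;   # marker was the last line
            chomp $val;
            $hash{$key} = $val;
            undef $key;                 # wait for the next ID
        }
    }
    return %hash;
}

# Made-up sample data.
my $sample = <<'END';
ID: 1001
some text #
value for 1001
ID: 1002
filler
another marker #
value for 1002
END

open my $in, '<', \$sample or die "cannot open in-memory file: $!";
my %hash = index_records($in);
close $in;
print "$_=>$hash{$_}\n" for sort keys %hash;
```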


------------------------------

Date: Sat, 22 Mar 2003 11:16:54 +0000
From: David Dorward <dorward@yahoo.com>
Subject: Re: regexp's to rip html file
Message-Id: <b5hgnd$848$2$8300dec7@news.demon.co.uk>

joe wrote:
> I was wondering if anyone could help me out here. I'm trying to make my
> own sitesearch-script, but I've encountered some problems ripping an html-
> file to plain text.

> All I want is to get rid of all of the html-tags. That's no problem, but
> what about tags like the 'script' one. I don't only want to eliminate that
> tag, but also the lines between this opening and closing tag.

> Is there some easy way to solve this problem or does anyone know of a
> module that does the trick for me.

http://www.perldoc.com/perl5.8.0/lib/HTML/Parser.html might be of use.

-- 
David Dorward                                   http://david.us-lot.org/
"You cannot rewrite history, not one line."
                                      - The Doctor (Dr. Who: The Aztecs)


------------------------------

Date: Sat, 22 Mar 2003 06:38:57 -0600
From: tadmc@augustmail.com (Tad McClellan)
Subject: Re: regexp's to rip html file
Message-Id: <slrnb7omb1.544.tadmc@magna.augustmail.com>

joe <tunmaster@hotmail.com> wrote:

> All I want is to get rid of all of the html-tags. 


Your Question is Asked Frequently.

You are expected to check the Perl FAQ *before* posting to the
Perl newsgroup, you know.

   perldoc -q HTML

      "How do I remove HTML from a string?"


> That's no problem,


Yes it is, you just have not discovered the problems yet.

Several of them are pointed out in the FAQ answer that you did not read.


> This is what I have so far,

> $joe =~ s/<[^>]*>//ig;
                     ^
                     ^ useless use of the "i" option...


Try your program with this legal HTML data:

   <img src="cool.jpg" alt=">>Cool Pic!<<">
   ^^^^^^^^^^^^^^^^^^^^^^^^^^          ^^^^

   <!-- if income > expenses then STRIP THIS TEXT!  -->
   ^^^^^^^^^^^^^^^^

   <p> if income < expenses then LEAVE THIS TEXT! </p>
   ^^^           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> Is there some easy way to solve this problem 


Not using only pattern matching (unless you carefully control 
the generation of this HTML data yourself).
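
To see the failures concretely, the substitution from the question can be
run against the three sample lines above. This is only a demonstration of
the problem, not a fix; naive_strip is a hypothetical wrapper name.

```perl
use strict;
use warnings;

# The tag stripper from the question, minus the unneeded /i option.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;
    return $html;
}

my @samples = (
    '<img src="cool.jpg" alt=">>Cool Pic!<<">',
    '<!-- if income > expenses then STRIP THIS TEXT!  -->',
    '<p> if income < expenses then LEAVE THIS TEXT! </p>',
);

# Brackets make leading and trailing whitespace visible.
printf "[%s]\n", naive_strip($_) for @samples;
```

The img tag leaves ">Cool Pic!" behind, most of the comment text
survives, and the paragraph loses everything after "if income". These are
the spans marked with carets in the sample data.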


> or does anyone know of a
> module that does the trick for me.


There are many modules on CPAN that can _correctly_ handle HTML data.


-- 
    Tad McClellan                          SGML consulting
    tadmc@augustmail.com                   Perl programming
    Fort Worth, Texas


------------------------------

Date: Sat, 22 Mar 2003 15:19:04 GMT
From: "Jürgen Exner" <jurgenex@hotmail.com>
Subject: Re: regexp's to rip html file
Message-Id: <I9%ea.77938$iq1.40892@nwrddc02.gnilink.net>

joe wrote:
> I was wondering if anyone could help me out here. I'm trying to make
> my own sitesearch-script, but I've encountered some problems ripping
> an html- file to plain text.
>
> All I want is to get rid of all of the html-tags. That's no problem,

Really? I doubt it. If you had written the code to parse HTML
correctly then removing the <script> body would be simpler than trivial.

> but what
> about tags like the 'script' one. I don't only want to eliminate that
> tag, but
> also the lines between this opening and closing tag.

What's the problem with PerlFAQ9: "How do I remove HTML from a string?"

jue




------------------------------

Date: 22 Mar 2003 08:23:59 -0800
From: dwlepage@yahoo.com (david)
Subject: Re: Text::ParseWords or Text::CSV
Message-Id: <b09a22ae.0303220823.6669ea1b@posting.google.com>

Benjamin Goldberg <goldbb2@earthlink.net> wrote in message news:<3E7C210B.EED370DA@earthlink.net>...
> david wrote:
> > Benjamin Goldberg wrote:
>  [snip]
> > >    my ($field0, $field1, $field2, @fields34) = split /,/, $_, -1;
> > >    my $field4 = pop @fields34;
> > >    my $field3 = join(",", @fields34);
>  [snip]
> >      my ($field0, $field1, $field2, @fields34) = split /,/, $_, -1;
> >      my $field4 = @fields34;
> >      my $field3 = join("+", @fields34);
> [snip]
> 
> Notice something missing?


You are right, I am missing the 'pop' command. If I run the script
with the pop command this is what I get:
dlepage,engineer,mn,
aidan,support,va,infrastructure+engineering

So it effectively pops off the last value of the array, so how do I
merge this back in to my original file if I have additional fields at
the end that I want to retain - i.e:
aidan,support,va,infrastructure,engineering,3-1-96
to make it:
aidan,support,va,infrastructure+engineering,3-1-96

I have a feeling I may not be able to do it? What I am struggling with
is that the only way I can determine if someone has a comma in field3
is that they will always have what appears to be an 'extra' field.
Normally the record will have 4 fields, but the ones with commas in
field3 will have 5. These should be the only records modified.
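
One possible approach, sketched under two assumptions that the post does
not confirm: the number of fields in a well-formed record is known in
advance, and only one field (here the one at index 3) can contain commas.
Any surplus pieces are then folded back into that field, and whatever
follows it is kept. The helper name merge_extra_fields is hypothetical.

```perl
use strict;
use warnings;

# Rejoin a comma-containing field that split() broke apart.
# $expected: number of fields a well-formed record has (assumed known).
# $idx: zero-based index of the one field allowed to contain commas.
sub merge_extra_fields {
    my ($line, $expected, $idx) = @_;
    my @f = split /,/, $line, -1;       # -1 keeps trailing empty fields
    my $extra = @f - $expected;
    if ($extra > 0) {
        # Replace the $extra + 1 pieces with a single '+'-joined field.
        my $merged = join '+', @f[$idx .. $idx + $extra];
        splice @f, $idx, $extra + 1, $merged;
    }
    return join ',', @f;
}

# The record from the post, treated as having 5 well-formed fields.
print merge_extra_fields(
    'aidan,support,va,infrastructure,engineering,3-1-96', 5, 3), "\n";
```

Records that already have the expected number of fields pass through
unchanged, so only the ones with a surplus field are modified.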

This is a tough one.

thanks,


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

The Perl-Users Digest is a retransmission of the USENET newsgroup
comp.lang.perl.misc.  For subscription or unsubscription requests, send
the single line:

	subscribe perl-users
or:
	unsubscribe perl-users

to almanac@ruby.oce.orst.edu.  

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

To request back copies (available for a week or so), send your request
to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
where x is the volume number and y is the issue number.

For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending perl questions to the -request address, I don't have time to
answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V10 Issue 4745
***************************************

