[22524] in Perl-Users-Digest
Perl-Users Digest, Issue: 4745 Volume: 10
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sat Mar 22 14:10:37 2003
Date: Sat, 22 Mar 2003 11:10:10 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Sat, 22 Mar 2003 Volume: 10 Number: 4745
Today's topics:
Re: regexp question (Tad McClellan)
Re: regexp question (Tad McClellan)
Re: regexp question <mike@luusac.co.uk>
Re: regexp question <jurgenex@hotmail.com>
Re: regexp question <mike@luusac.co.uk>
Re: regexp question <mbudash@sonic.net>
Re: regexp question <user@nospam.xxx>
Re: regexp question <mbudash@sonic.net>
Re: regexp's to rip html file <dorward@yahoo.com>
Re: regexp's to rip html file (Tad McClellan)
Re: regexp's to rip html file <jurgenex@hotmail.com>
Re: Text::ParseWords or Text::CSV (david)
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Sat, 22 Mar 2003 06:20:30 -0600
From: tadmc@augustmail.com (Tad McClellan)
Subject: Re: regexp question
Message-Id: <slrnb7ol8e.4pv.tadmc@magna.augustmail.com>
Michael Budash <mbudash@sonic.net> wrote:
> In article <oUOea.405$N73.4421@newsfep4-glfd.server.ntli.net>,
> "Mike" <mike@luusac.co.uk> wrote:
>>
>> I am trying to read in a file line by line and then print / assign to an
>> array the line following one matched by a regexp, ie
>> So that I know the content I want will always be preceded by a line which
>> ends in a '#'.
> while (<FILE>) {
>     if (/#$/) {
>         <FILE>;
>         print;
>     }
> }
But that will print the line ending with # not the following line,
since that code reads and discards the line of interest:
if ( /#$/ ) {
    $_ = <FILE>;
    print;
}
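
For anyone who wants to try this without the OP's file, here is a
minimal self-contained sketch of the same idea. The sample data and the
name lines_after_hash are made up for illustration (the real file layout
was never posted), and it reads from an in-memory filehandle instead of
FILE:

```perl
use strict;
use warnings;

# Hypothetical sample data; the real file layout was not posted.
my $data = "header #\nwanted line 1\nnoise\nanother #\nwanted line 2\n";

# Return every line that directly follows a line ending in '#'.
sub lines_after_hash {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @wanted;
    while (my $line = <$fh>) {
        next unless $line =~ /#$/;   # marker line ends in '#'
        my $next = <$fh>;            # read the line of interest
        last unless defined $next;   # marker was the last line of input
        chomp $next;
        push @wanted, $next;
    }
    close $fh;
    return @wanted;
}

print "$_\n" for lines_after_hash($data);
```

Note that a line read as "the following line" is never itself tested
against the pattern, so two marker lines in a row are treated as a
marker followed by its content.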
--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas
------------------------------
Date: Sat, 22 Mar 2003 06:46:54 -0600
From: tadmc@augustmail.com (Tad McClellan)
Subject: Re: regexp question
Message-Id: <slrnb7ompu.544.tadmc@magna.augustmail.com>
Kostis P <kp@iwerx.com> wrote:
> This one
Does not work.
> reads for STDIN but you can modify it to open a file and read
> from there instead.
>
> while (my $line = <>) {
> if ($1 eq '#') {
$1 has never been set to any value! It is undef.
If you meant $line instead of $1, then the if() condition will
_never_ be true! You at least need a chomp() before the eq test.
If you meant
$line eq "#\n"
then it does not do what the OP said he wants done.
> print "I got it: $line";
> }
> $line =~ /(.)\n$/;
> }
>
> This script assumes that you are on a unix system
No it doesn't.
> since the end of each
> line is the "new line" special character.
>
> It will not work unless
It will not work regardless of the line endings used.
> the last character is really a new line
But if all of the other problems were repaired, it would work
fine on non-unix systems.
backslash-n is a _logical_ line-end, it works the same in all
perls, regardless of platform.
[snip TOFU, Please do not post TOFU ]
--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas
------------------------------
Date: Sat, 22 Mar 2003 15:42:58 -0000
From: "Mike" <mike@luusac.co.uk>
Subject: Re: regexp question
Message-Id: <NA%ea.394$842.235@newsfep4-winn.server.ntli.net>
Hi,
the following works fine :
#!/usr/local/bin/perl
open (IN, '< data.txt');
$contents = <IN>;
open (OUT, '> out.txt');
$out = <OUT>;
while (<IN>) {
    if (/ID\: (\d\d\d\d)/) {
        print OUT $1;
    }
    if (/#$/) {
        print OUT scalar <IN>;
    }
}
close (IN);
close (OUT);
But what I want is to assign the contents to a hash (which I understand is a
2 dimensional array in perl ?).
I have tried:
open(my $in,'<','data') or die "....: $!";
until ( eof $in ) {
    while ( <$in> ) {
        last if /ID\: (\d\d\d\d)/;
    }
    chomp;
    my $key = $_; # save the key match
    while ( <$in> ) { # search second pattern
        last if /#$/;
    }
    chomp;
    $hash{$key} = $_; # enter result into hash
}
close $in;
while (($k, $v) = each %hash) {print "$k=>$v,\n "};
This will print the entire line containing the matches. It also behaves in
a different way to the first method in that it will print the line which
ends with a '#' instead of the line after it (it is the line after it that I
want).
Mike
------------------------------
Date: Sat, 22 Mar 2003 15:53:00 GMT
From: "Jürgen Exner" <jurgenex@hotmail.com>
Subject: Re: regexp question
Message-Id: <wF%ea.15046$IM3.7673@nwrddc03.gnilink.net>
Mike wrote:
> But what I want is to assign the contents to a hash (which I
> understand is a 2 dimensional array in perl ?).
No, it's not. An array is a mapping from natural numbers to scalars. A hash
is a mapping from arbitrary strings to scalars.
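
To make the distinction concrete, a small sketch. The ID '0042' and the
strings are invented for illustration; the last line shows the usual way
to get something like a second dimension, a hash whose values are array
references:

```perl
use strict;
use warnings;

my @lines = ('first', 'second');                # array: number => scalar
my %by_id = ('0042' => 'some text');            # hash:  string => scalar
my %multi = ('0042' => ['line a', 'line b']);   # value is an array reference

print "$lines[1]\n";            # second
print "$by_id{'0042'}\n";       # some text
print "$multi{'0042'}[1]\n";    # line b
```

Because hash keys are strings, a numeric-looking ID such as '0042' keeps
its leading zeros as long as it is stored and looked up as a string.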
jue
------------------------------
Date: Sat, 22 Mar 2003 17:01:52 -0000
From: "Mike" <mike@luusac.co.uk>
Subject: Re: regexp question
Message-Id: <LK0fa.439$842.133@newsfep4-winn.server.ntli.net>
Well the ID that I am trying to match (in the first regexp) is numerical,
but no calculations are done with it - it just acts as an identifier of a
string, so is a hash what I want ?
Mike
"Jürgen Exner" <jurgenex@hotmail.com> wrote in message
news:wF%ea.15046$IM3.7673@nwrddc03.gnilink.net...
> Mike wrote:
> > But what I want is to assign the contents to a hash (which I
> > understand is a 2 dimensional array in perl ?).
>
> No, it's not. An array is a mapping from natural numbers to scalars. A
> hash is a mapping from arbitrary strings to scalars.
>
> jue
>
>
------------------------------
Date: Sat, 22 Mar 2003 17:23:30 GMT
From: Michael Budash <mbudash@sonic.net>
Subject: Re: regexp question
Message-Id: <mbudash-3E18A0.09233222032003@typhoon.sonic.net>
In article <slrnb7ol8e.4pv.tadmc@magna.augustmail.com>,
tadmc@augustmail.com (Tad McClellan) wrote:
> Michael Budash <mbudash@sonic.net> wrote:
> > In article <oUOea.405$N73.4421@newsfep4-glfd.server.ntli.net>,
> > "Mike" <mike@luusac.co.uk> wrote:
> >>
> >> I am trying to read in a file line by line and then print / assign to an
> >> array the line following one matched by a regexp, ie
>
> >> So that I know the content I want will always be preceded by a line which
> >> ends in a '#'.
>
>
> > while (<FILE>) {
> >     if (/#$/) {
> >         <FILE>;
> >         print;
> >     }
> > }
>
>
> But that will print the line ending with # not the following line,
> since that code reads and discards the line of interest:
>
> if ( /#$/ ) {
>     $_ = <FILE>;
>     print;
> }
yes, my oversight, as another poster has already noted...
------------------------------
Date: Sat, 22 Mar 2003 18:52:02 +0100
From: Kostis P <user@nospam.xxx>
Subject: Re: regexp question
Message-Id: <3e7ca368$1_1@news.bluewin.ch>
Tad McClellan wrote:
> Kostis P <kp@iwerx.com> wrote:
>
>
>>This one
>
>
>
> Does not work.
Yes it does. Have you tried it/looked at it carefully?
> $1 has never been set to any value! It is undef.
The $1 will be undef for the first loop iteration, but the line following,
i.e. $line =~ /(.)\n$/; will set it to the character preceding the \n
character, and on the next iteration it will be '#' if the previous line
matched the criteria.
>
> If you meant $line instead of $1, then the if() condition will
> _never_ be true! You at least need a chomp() before the eq test.
>
> If you meant
>
> $line eq "#\n"
No I did not mean that and that is why I didn't write that.
>
> then it does not do what the OP said he wants done.
Yours doesn't. Mine does.
>
>
>
>> print "I got it: $line";
>> }
>> $line =~ /(.)\n$/;
>>}
>>
>>This script assumes that you are on a unix system
> No it doesn't.
Is that so? What if the file was created by a windows editor writing to
a samba filesystem or a windows text file copied to a unix filesystem?
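
A minimal sketch of that case, assuming the file is read on a unix perl
with no line-ending translation, so each line arrives ending in "\r\n".
The helper name ends_in_hash is made up for illustration. A plain /#$/
fails on such a line because the "\r" sits between the '#' and the
newline, so the sketch strips either kind of line ending first:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the line ends in '#',
# whether the line ending is LF or CRLF.
sub ends_in_hash {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;            # strip a unix or DOS line ending
    return $line =~ /#\z/ ? 1 : 0;
}

print ends_in_hash("ID #\r\n"), "\n";              # prints 1
print( ("ID #\r\n" =~ /#$/) ? 1 : 0, "\n" );       # prints 0
```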
Regards...
------------------------------
Date: Sat, 22 Mar 2003 18:27:13 GMT
From: Michael Budash <mbudash@sonic.net>
Subject: Re: regexp question
Message-Id: <mbudash-0C7331.10271322032003@typhoon.sonic.net>
In article <LK0fa.439$842.133@newsfep4-winn.server.ntli.net>,
"Mike" <mike@luusac.co.uk> wrote:
> "Jürgen Exner" <jurgenex@hotmail.com> wrote in message
> news:wF%ea.15046$IM3.7673@nwrddc03.gnilink.net...
> >
> > Mike wrote:
> > >
> > > But what I want is to assign the contents to a hash (which I
> > > understand is a 2 dimensional array in perl ?).
> >
> > No, it's not. An array is a mapping from natural numbers to scalars. A
> > hash is a mapping from arbitrary strings to scalars.
>
> Well the ID that I am trying to match (in the first regexp) is numerical,
> but no calculations are done with it - it just acts as an identifier of a
> string, so is a hash what I want ?
who knows? you started this thread with one idea, now you've moved on to
another. please re-state the problem along with a minimally complete
example of the pertinent data.
without seeing your data, based on your comments, i'm gonna go out on a limb
and propose that this code is [at least closer to] what you want:
use strict;
my %hash;
open (D,'data') or die "....: $!";
while (<D>) {
    if (/ID\: (\d\d\d\d)/) {
        my $key = $1;                 # save the key match
        while (<D>) {                 # search second pattern
            if (/#$/) {
                chomp (my $val = scalar <D>);
                $hash{$key} = $val;   # enter result into hash
                last;
            }
        }
    }
}
close D;
while (my ($k, $v) = each %hash) {
    print "$k=>$v\n";
}
# or print the hash sorted:
foreach (sort keys %hash) {
    print "$_=>$hash{$_}\n";
}
hope this helps
------------------------------
Date: Sat, 22 Mar 2003 11:16:54 +0000
From: David Dorward <dorward@yahoo.com>
Subject: Re: regexp's to rip html file
Message-Id: <b5hgnd$848$2$8300dec7@news.demon.co.uk>
joe wrote:
> I was wondering if anyone could help me out here. I'm trying to make my
> own sitesearch-script, but I've encountered some problems ripping an html-
> file to plain text.
> All I want is to get rid of all of the html-tags. That's no problem, but
> what about tags like the 'script' one. I don't only want to eliminate that
> tag, but also the lines between this opening and closing tag.
> Is there some easy way to solve this problem or does anyone know of a
> module that does the trick for me.
http://www.perldoc.com/perl5.8.0/lib/HTML/Parser.html might be of use.
--
David Dorward http://david.us-lot.org/
"You cannot rewrite history, not one line."
- The Doctor (Dr. Who: The Aztecs)
------------------------------
Date: Sat, 22 Mar 2003 06:38:57 -0600
From: tadmc@augustmail.com (Tad McClellan)
Subject: Re: regexp's to rip html file
Message-Id: <slrnb7omb1.544.tadmc@magna.augustmail.com>
joe <tunmaster@hotmail.com> wrote:
> All I want is to get rid of all of the html-tags.
Your Question is Asked Frequently.
You are expected to check the Perl FAQ *before* posting to the
Perl newsgroup, you know.
perldoc -q HTML
"How do I remove HTML from a string?"
> That's no problem,
Yes it is, you just have not discovered the problems yet.
Several of them are pointed out in the FAQ answer that you did not read.
> This is what I have so far,
> $joe =~ s/<[^>]*>//ig;
                     ^
                     ^ useless use of the "i" option...
Try your program with this legal HTML data:
<img src="cool.jpg" alt=">>Cool Pic!<<">
^^^^^^^^^^^^^^^^^^^^^^^^^^          ^^^^
<!-- if income > expenses then STRIP THIS TEXT! -->
^^^^^^^^^^^^^^^^
<p> if income < expenses then LEAVE THIS TEXT! </p>
^^^           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
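
The carets mark what the substitution removes. A short sketch that runs
the posted regex (minus the redundant /i) over those three lines; the
function name naive_strip is made up for illustration:

```perl
use strict;
use warnings;

# The substitution from the post, without the "i" option.
sub naive_strip {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;
    return $text;
}

my @samples = (
    '<img src="cool.jpg" alt=">>Cool Pic!<<">',
    '<!-- if income > expenses then STRIP THIS TEXT! -->',
    '<p> if income < expenses then LEAVE THIS TEXT! </p>',
);

print "[", naive_strip($_), "]\n" for @samples;
# [>Cool Pic!]
# [ expenses then STRIP THIS TEXT! -->]
# [ if income ]
```

In each case a '>' or '<' inside an attribute value, a comment, or plain
text is taken for a tag delimiter, so attribute text leaks out, comment
text survives, and wanted text is deleted.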
> Is there some easy way to solve this problem
Not using only pattern matching (unless you carefully control
the generation of this HTML data yourself).
> or does anyone know of a
> module that does the trick for me.
There are many modules on CPAN that can _correctly_ handle HTML data.
--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas
------------------------------
Date: Sat, 22 Mar 2003 15:19:04 GMT
From: "Jürgen Exner" <jurgenex@hotmail.com>
Subject: Re: regexp's to rip html file
Message-Id: <I9%ea.77938$iq1.40892@nwrddc02.gnilink.net>
joe wrote:
> I was wondering if anyone could help me out here. I'm trying to make
> my own sitesearch-script, but I've encountered some problems ripping
> an html-file to plain text.
>
> All I want is to get rid of all of the html-tags. That's no problem,
Really? I doubt it. If you had written the code to parse HTML
correctly, then removing the <script> body would be simpler than trivial.
> but what about tags like the 'script' one. I don't only want to
> eliminate that tag, but also the lines between this opening and
> closing tag.
What's the problem with PerlFAQ9: "How do I remove HTML from a string?"
jue
------------------------------
Date: 22 Mar 2003 08:23:59 -0800
From: dwlepage@yahoo.com (david)
Subject: Re: Text::ParseWords or Text::CSV
Message-Id: <b09a22ae.0303220823.6669ea1b@posting.google.com>
Benjamin Goldberg <goldbb2@earthlink.net> wrote in message news:<3E7C210B.EED370DA@earthlink.net>...
> david wrote:
> > Benjamin Goldberg wrote:
> [snip]
> > > my ($field0, $field1, $field2, @fields34) = split /,/, $_, -1;
> > > my $field4 = pop @fields34;
> > > my $field3 = join(",", @fields34);
> [snip]
> > my ($field0, $field1, $field2, @fields34) = split /,/, $_, -1;
> > my $field4 = @fields34;
> > my $field3 = join("+", @fields34);
> [snip]
>
> Notice something missing?
You are right, I am missing the 'pop' command. If I run the script
with the pop command this is what I get:
dlepage,engineer,mn,
aidan,support,va,infrastructure+engineering
So it effectively pops off the last value of the array, so how do I
merge this back in to my original file if I have additional fields at
the end that I want to retain - i.e:
aidan,support,va,infrastructure,engineering,3-1-96
to make it:
aidan,support,va,infrastructure+engineering,3-1-96
I have a feeling I may not be able to do it? What I am struggling with
is that the only way I can determine if someone has a comma in field3
is that they will always have what appears to be an 'extra' field.
Normally the record will have 4 fields, but the ones with commas in
field3 will have 5. These should be the only records modified.
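
One possible approach, sketched under the assumption that the number of
fields in a normal record is known in advance (5 in the example with the
trailing date). The name merge_field3 and the $expected parameter are
made up for illustration; this is not code from the earlier posts. Any
fields beyond the expected count are assumed to come from commas inside
field3 and are folded back into it with '+':

```perl
use strict;
use warnings;

# Hypothetical helper: $expected is the field count of a normal record.
# Surplus fields are assumed to belong to field3 (index 3).
sub merge_field3 {
    my ($line, $expected) = @_;
    my @f = split /,/, $line, -1;          # -1 keeps trailing empty fields
    my $extra = @f - $expected;
    return $line if $extra <= 0;           # normal record, leave untouched
    my $merged = join '+', @f[3 .. 3 + $extra];
    splice @f, 3, $extra + 1, $merged;     # replace the pieces with one field
    return join ',', @f;
}

print merge_field3('aidan,support,va,infrastructure,engineering,3-1-96', 5), "\n";
# aidan,support,va,infrastructure+engineering,3-1-96
```

The input line is expected to be chomp()ed already. Records with the
normal number of fields pass through unchanged, so only the ones with
the extra field are modified.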
This is a tough one.
thanks,
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
The Perl-Users Digest is a retransmission of the USENET newsgroup
comp.lang.perl.misc. For subscription or unsubscription requests, send
the single line:
subscribe perl-users
or:
unsubscribe perl-users
to almanac@ruby.oce.orst.edu.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
To request back copies (available for a week or so), send your request
to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
where x is the volume number and y is the issue number.
For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending perl questions to the -request address, I don't have time to
answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V10 Issue 4745
***************************************