[24133] in Perl-Users-Digest
Perl-Users Digest, Issue: 6327 Volume: 10
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Mon Mar 29 09:10:48 2004
Date: Mon, 29 Mar 2004 06:10:12 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Mon, 29 Mar 2004 Volume: 10 Number: 6327
Today's topics:
how to capture multiple lines? <geoffacox@dontspamblueyonder.co.uk>
Re: how to capture multiple lines? <tassilo.parseval@rwth-aachen.de>
Re: how to capture multiple lines? <noreply@gunnar.cc>
Re: how to capture multiple lines? <geoffacox@dontspamblueyonder.co.uk>
Re: how to capture multiple lines? <geoffacox@dontspamblueyonder.co.uk>
Re: how to capture multiple lines? <geoffacox@dontspamblueyonder.co.uk>
Re: how to capture multiple lines? <noreply@gunnar.cc>
Re: how to capture multiple lines? <tassilo.parseval@rwth-aachen.de>
Re: how to capture multiple lines? <geoffacox@dontspamblueyonder.co.uk>
Re: how to capture multiple lines? <geoffacox@dontspamblueyonder.co.uk>
Re: how to capture multiple lines? <geoffacox@dontspamblueyonder.co.uk>
Re: how to capture multiple lines? <noreply@gunnar.cc>
Re: how to capture multiple lines? <geoffacox@dontspamblueyonder.co.uk>
Re: how to capture multiple lines? (Anno Siegel)
Re: how to capture multiple lines? <tassilo.parseval@rwth-aachen.de>
Re: how to capture multiple lines? <noreply@gunnar.cc>
Re: how to capture multiple lines? <geoffacox@dontspamblueyonder.co.uk>
Re: how to capture multiple lines? <geoffacox@dontspamblueyonder.co.uk>
Re: how to capture multiple lines? <geoffacox@dontspamblueyonder.co.uk>
Re: how to capture multiple lines? <tassilo.parseval@rwth-aachen.de>
----------------------------------------------------------------------
Date: Mon, 29 Mar 2004 08:38:28 GMT
From: Geoff Cox <geoffacox@dontspamblueyonder.co.uk>
Subject: how to capture multiple lines?
Message-Id: <nvnf60lg29peruj3tphi85jfhlsuk0d185@4ax.com>
Hello
I thought following code using / /s should get multiple lines such as
<p> hdajhksdh jash djh jaskd a
d ahjkd jakdljkaksdkjlad a
d ajkd jadklj aldkj ald </p>
but it is only capturing where <p> and </p> are on the same line..
Help!
if ($line =~ /<p>(.*)<\/p>/s) {
print ("\$1 = $1 \n");
}
Cheers
Geoff
------------------------------
Date: 29 Mar 2004 09:25:13 GMT
From: "Tassilo v. Parseval" <tassilo.parseval@rwth-aachen.de>
Subject: Re: how to capture multiple lines?
Message-Id: <c48q1p$ek9$1@nets3.rz.RWTH-Aachen.DE>
Also sprach Geoff Cox:
> I thought following code using / /s should get multiple lines such as
>
><p> hdajhksdh jash djh jaskd a
> d ahjkd jakdljkaksdkjlad a
> d ajkd jadklj aldkj ald </p>
>
> but it is only capturing where <p> and </p> are on the same line..
>
> Help!
>
> if ($line =~ /<p>(.*)<\/p>/s) {
> print ("\$1 = $1 \n");
> }
This works for me as expected:
$line = <<EOC;
<p> hdajhksdh jash djh jaskd a
d ahjkd jakdljkaksdkjlad a
d ajkd jadklj aldkj ald </p>
EOC
if ($line =~ /<p>(.*)<\/p>/s) {
print ("\$1 = $1 \n");
}
__END__
$1 = hdajhksdh jash djh jaskd a
d ahjkd jakdljkaksdkjlad a
d ajkd jadklj aldkj ald
Did you check that $line really contains what you think it contains?
Maybe you read into this variable line-wise and so quite naturally you
only get a match when <p>...</p> happen to be on one line.
Btw: I hope the appearance of <p> and </p> is only falsely indicating
that you are working with HTML because you cannot parse HTML properly
with regexes. But if the above really is HTML, you'll be happier with
one of the HTML parsing modules, such as HTML::Parser.
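
A minimal sketch of the line-wise reading problem described above. The sample text and the variable names ($html, $by_line, $line_hits, $para) are made up for illustration, and an in-memory filehandle stands in for the real file:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the HTML file.
my $html = "<h2>Title</h2>\n<p> first line\nsecond line\nthird line </p>\n";

# Reading line by line: no single line holds both <p> and </p>,
# so the pattern never matches, with or without /s.
open my $by_line, '<', \$html or die $!;
my $line_hits = 0;
while (my $line = <$by_line>) {
    $line_hits++ if $line =~ /<p>(.*)<\/p>/s;
}
close $by_line;

# Matching against the whole text: /s lets . cross the newlines.
my ($para) = $html =~ /<p>(.*?)<\/p>/s;

print "line-wise matches: $line_hits\n";
print "capture from whole text: $para\n";
```

The /s modifier only changes what . may match; it cannot make text that was never read into the variable available to the pattern.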
Tassilo
--
$_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
$_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval
------------------------------
Date: Mon, 29 Mar 2004 11:28:48 +0200
From: Gunnar Hjalmarsson <noreply@gunnar.cc>
Subject: Re: how to capture multiple lines?
Message-Id: <c48q9f$2gt654$1@ID-184292.news.uni-berlin.de>
Geoff Cox wrote:
> I thought following code using / /s should get multiple lines such
> as
>
> <p> hdajhksdh jash djh jaskd a
> d ahjkd jakdljkaksdkjlad a
> d ajkd jadklj aldkj ald </p>
>
> but it is only capturing where <p> and </p> are on the same line..
No, it's not.
> if ($line =~ /<p>(.*)<\/p>/s) {
> print ("\$1 = $1 \n");
> }
That works fine for me.
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
------------------------------
Date: Mon, 29 Mar 2004 09:56:23 GMT
From: Geoff Cox <geoffacox@dontspamblueyonder.co.uk>
Subject: Re: how to capture multiple lines?
Message-Id: <fesf60dhhr22rbo5n3op5tc6bbbf8ehovf@4ax.com>
On 29 Mar 2004 09:25:13 GMT, "Tassilo v. Parseval"
<tassilo.parseval@rwth-aachen.de> wrote:
>Also sprach Geoff Cox:
Tassilo,
I should have said that the <p> ... </p> is from an html file ....
I have just tried following which works for above but breaks the rest
of the input
$/ = "\0a\0d";
$line =~ /<p>(.*?)<\/p>/s;
$/ = "\0a";
The 3rd line does not appear to put $/ back to the default value??
Cheers
Geoff
------------------------------
Date: Mon, 29 Mar 2004 09:56:56 GMT
From: Geoff Cox <geoffacox@dontspamblueyonder.co.uk>
Subject: Re: how to capture multiple lines?
Message-Id: <lmsf605nugb8ishniu1rje9navdmv0hn59@4ax.com>
On Mon, 29 Mar 2004 11:28:48 +0200, Gunnar Hjalmarsson
<noreply@gunnar.cc> wrote:
Gunnar
I should have said that the <p> ... </p> is from an html file ....
I have just tried following which works for above but breaks the rest
of the input
$/ = "\0a\0d";
$line =~ /<p>(.*?)<\/p>/s;
$/ = "\0a";
The 3rd line does not appear to put $/ back to the default value??
Cheers
Geoff
------------------------------
Date: Mon, 29 Mar 2004 10:08:21 GMT
From: Geoff Cox <geoffacox@dontspamblueyonder.co.uk>
Subject: Re: how to capture multiple lines?
Message-Id: <99tf60944iktktkjibjnid99n59n5hkodu@4ax.com>
On Mon, 29 Mar 2004 09:56:23 GMT, Geoff Cox
<geoffacox@dontspamblueyonder.co.uk> wrote:
>On 29 Mar 2004 09:25:13 GMT, "Tassilo v. Parseval"
><tassilo.parseval@rwth-aachen.de> wrote:
>
>>Also sprach Geoff Cox:
>
>Tassilo,
>
>I should have said that the <p> ... </p> is from an html file ....
>
>I have just tried following which works for above but breaks the rest
>of the input
>
> $/ = "\0a\0d";
> $line =~ /<p>(.*?)<\/p>/s;
> $/ = "\0a";
I think this should have been (I had the wrong order of 0D0A):
$/ = "\0d\0a";
$line =~ /<p>(.*?)<\/p>/s;
$/ = "\0a";
but still breaks rest of the script in that $/ does not seem to be
back to the default value ...?
Geoff
------------------------------
Date: Mon, 29 Mar 2004 12:08:39 +0200
From: Gunnar Hjalmarsson <noreply@gunnar.cc>
Subject: Re: how to capture multiple lines?
Message-Id: <c48sk7$2f6247$1@ID-184292.news.uni-berlin.de>
Geoff Cox wrote:
> I should have said that the <p> ... </p> is from an html file ....
Then you should consider using a module instead.
> I have just tried following which works for above but breaks the
> rest of the input
>
> $/ = "\0a\0d";
> $line =~ /<p>(.*?)<\/p>/s;
> $/ = "\0a";
>
> The 3rd line does not appear to put $/ back to the default value??
I'm not sure what you are trying to do. If the file isn't really huge,
why don't you just slurp it into a scalar variable instead of reading
it line by line?
------------------------------
Date: 29 Mar 2004 10:22:17 GMT
From: "Tassilo v. Parseval" <tassilo.parseval@rwth-aachen.de>
Subject: Re: how to capture multiple lines?
Message-Id: <c48tcp$hhf$1@nets3.rz.RWTH-Aachen.DE>
Also sprach Geoff Cox:
> On 29 Mar 2004 09:25:13 GMT, "Tassilo v. Parseval"
><tassilo.parseval@rwth-aachen.de> wrote:
>
>>Also sprach Geoff Cox:
>
> Tassilo,
>
> I should have said that the <p> ... </p> is from an html file ....
As if we didn't know. ;-)
Another thing you should have done is choose a more effective
follow-up style. Put your reply below the stuff you are replying to,
cutting out parts you don't refer to.
> I have just tried following which works for above but breaks the rest
> of the input
>
> $/ = "\0a\0d";
> $line =~ /<p>(.*?)<\/p>/s;
> $/ = "\0a";
>
> The 3rd line does not appear to put $/ back to the default value??
Your handling of $/ looks a bit fishy. First of all, I suspect that
"\0a\0d" is supposed to be a Windows line-ending. Well, it's not. In a
double-quoted string "\0a" is a NUL byte followed by the letter "a", and
the order is wrong as well: a Windows line-ending is CR followed by LF,
which is written "\x0d\x0a" (or "\015\012").
Secondly, you shouldn't even need to set $/ explicitly. Perl can read a
file with Windows newlines even on other platforms (each line then
simply keeps a trailing "\x0d"). You can also force newline translation
so that perl automatically replaces "\x0d\x0a" with "\n" on input (and
does the reverse on output):
    open HTML, "<:crlf", "file.html" or die $!;
This has worked ever since 5.8.0, AFAIK.
If you really want to tamper with $/ manually, use local() so that perl
recovers the old value for you:
    { # need a block here
        local $/ = "\x0d\x0a";
        ...
    }
    # here $/ has its previous value again
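
A small sketch of both points, with made-up data and illustrative variable names ($wrong, $crlf, $data, @records, $restored). It assumes nothing else in the program has changed $/ beforehand:

```perl
use strict;
use warnings;

# "\0a" is a NUL byte followed by the letter "a", not a line feed.
my $wrong = "\0a\0d";
my $crlf  = "\x0d\x0a";
printf "wrong: %d bytes, crlf: %d bytes\n", length($wrong), length($crlf);

my $data = "one\x0d\x0atwo\x0d\x0a";   # hypothetical CRLF-terminated input
my @records;
{
    local $/ = $crlf;                   # only in effect inside this block
    open my $fh, '<', \$data or die $!;
    binmode $fh;                        # read the bytes untranslated
    chomp(@records = <$fh>);            # chomp strips the current $/
    close $fh;
}
# Outside the block the previous separator is back.
my $restored = $/ eq "\n" ? 'yes' : 'no';
print "records: @records; default restored: $restored\n";
```

Because "\0a\0d" never occurs in a normal text file, assigning it to $/ makes a single read return the whole file, which may be why the multi-line match appeared to work while later reads misbehaved.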
Tassilo
------------------------------
Date: Mon, 29 Mar 2004 10:41:52 GMT
From: Geoff Cox <geoffacox@dontspamblueyonder.co.uk>
Subject: Re: how to capture multiple lines?
Message-Id: <l4vf60dg9c49oa2l4gri2fq81l1m1osufp@4ax.com>
On Mon, 29 Mar 2004 12:08:39 +0200, Gunnar Hjalmarsson
<noreply@gunnar.cc> wrote:
>Geoff Cox wrote:
>> I should have said that the <p> ... </p> is from an html file ....
>
>Then you should consider to use a module instead.
>
>> I have just tried following which works for above but breaks the
>> rest of the input
>>
>> $/ = "\0a\0d";
>> $line =~ /<p>(.*?)<\/p>/s;
>> $/ = "\0a";
>>
>> The 3rd line does not appear to put $/ back to the default value??
>
>I'm not sure what you are trying to do. If the file isn't really huge,
>why don't you just slurp it into a scalar variable instead of reading
>it line by line?
I think I will have to do the slurp ...
re above - in the html file the end of lines have ODOA so by changing
the value of $/ to ODOA I get all the text between <p> and </p>.
Problem is that I then find that the script is finding text which I do
not want! I wondered whether that was because I have not been able to
change the value of $/ back to the default value. If I print out $/
before changing it to ODOA I get
$/ =
so I assumed that I could get $/ back to default value by
$/ = "";
but not quite working ...
Cheers
Geoff
------------------------------
Date: Mon, 29 Mar 2004 10:46:10 GMT
From: Geoff Cox <geoffacox@dontspamblueyonder.co.uk>
Subject: Re: how to capture multiple lines?
Message-Id: <3hvf60dsoa54ef3rnkv0s3rhauitv1gl13@4ax.com>
On 29 Mar 2004 10:22:17 GMT, "Tassilo v. Parseval"
<tassilo.parseval@rwth-aachen.de> wrote:
Tassilo,
>If you really want to tamper with $/ manually, use local() so that perl
>recovers the old value for you:
>
> { # need a block here
> local $/ = "\0d\0a";
The local did the trick - the rest of the code works fine now!
Thanks a lot...
Cheers
Geoff
------------------------------
Date: Mon, 29 Mar 2004 11:19:04 GMT
From: Geoff Cox <geoffacox@dontspamblueyonder.co.uk>
Subject: Re: how to capture multiple lines?
Message-Id: <bc1g60ls02ad689ot8i9r6rdk11klak4l3@4ax.com>
On 29 Mar 2004 10:22:17 GMT, "Tassilo v. Parseval"
<tassilo.parseval@rwth-aachen.de> wrote:
>If you really want to tamper with $/ manually, use local() so that perl
>recovers the old value for you:
>
> { # need a block here
> local $/ = "\0d\0a";
Oops! I spoke too soon! If I have the
local $/ = "\0D\0A";
in a sub routine - I do not get the <p> .... </p> text. If I have
$/ = "\0D\0A";
I do get the text but I then get some data which I do not want!
Geoff
------------------------------
Date: Mon, 29 Mar 2004 13:53:51 +0200
From: Gunnar Hjalmarsson <noreply@gunnar.cc>
Subject: Re: how to capture multiple lines?
Message-Id: <c492pi$2ehun7$1@ID-184292.news.uni-berlin.de>
Geoff Cox wrote:
> I think I will have to do the slurp ...
That would probably make things much easier. :)
> re above - in the html file the end of lines have ODOA so by
> changing the value of $/ to ODOA I get all the text between <p> and
> </p>.
I don't understand.
> I assumed that I could get $/ back to default value by
>
> $/ = "";
That does not set it to default. This does:
$/ = "\n";
But if you for some reason want to fiddle with $/, you'd better do it
locally within a block, as Tassilo suggested.
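
A minimal sketch of what the different values of $/ do, using made-up input and a hypothetical helper count_records(). The empty string is not "no separator" but paragraph mode, and undef is slurp mode:

```perl
use strict;
use warnings;

my $data = "a\nb\n\nc\n";   # hypothetical input containing one blank line

# Count how many records a read loop would see for a given separator.
sub count_records {
    my ($separator) = @_;
    local $/ = $separator;              # restored when the sub returns
    open my $fh, '<', \$data or die $!;
    my @records = <$fh>;
    close $fh;
    return scalar @records;
}

my $by_line      = count_records("\n");   # default: one record per line
my $by_paragraph = count_records("");     # paragraph mode: split at blank lines
my $slurped      = count_records(undef);  # slurp mode: everything in one read
print "lines=$by_line paragraphs=$by_paragraph slurp=$slurped\n";
```

Printing $/ shows only a line break in the default case, which is easy to mistake for an empty value.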
------------------------------
Date: Mon, 29 Mar 2004 12:20:07 GMT
From: Geoff Cox <geoffacox@dontspamblueyonder.co.uk>
Subject: Re: how to capture multiple lines?
Message-Id: <gn4g60pdvgb722qq9f9ttetr8sj1m1bh45@4ax.com>
On Mon, 29 Mar 2004 13:53:51 +0200, Gunnar Hjalmarsson
<noreply@gunnar.cc> wrote:
>That does not set it to default. This does:
>
> $/ = "\n";
The best I can get is as follows
sub para {
local ($/ = "\0a\0d");
my ($linepara) = @_;
$linepara =~ /<p>(.*?)<\/p>/s;
# print ("\$1 = $1 \n");
print OUT ("<tr><td colspan=2>" . $1 . "<\/td><\/tr> \n");
$/ = "";
}
Now, this does get the
<p> jahjsdkaljk al
asdjk aksdj klad
kajsd akl </p>
text but it also gets some lines which I do not want and do not get if
I do not use $/ - so am a bit lost. Tempted to put the whole code up
but that would be asking too much!
I would like to use the slurp approach but am not sure how to do it. As
I parse through an html file and find the first line of the first <p>
etc block of text - how do I get that text and put it into a file, and
then when I find the second <p> block put it in the right place? I do
not want to put all the <p> etc text together..they appear at different
places in the html file....
So, I find the first line with <p>, slurp in the whole of the file,
but only wish to get the first line of the <p> already found and the
next few lines until the end of the first line with a </p>.
How to do that?!
Cheers
Geoff
------------------------------
Date: 29 Mar 2004 12:50:44 GMT
From: anno4000@lublin.zrz.tu-berlin.de (Anno Siegel)
Subject: Re: how to capture multiple lines?
Message-Id: <c49634$n53$1@mamenchi.zrz.TU-Berlin.DE>
Geoff Cox <geoffacox@dontspamblueyonder.co.uk> wrote in comp.lang.perl.misc:
> On Mon, 29 Mar 2004 13:53:51 +0200, Gunnar Hjalmarsson
> <noreply@gunnar.cc> wrote:
>
>
> >That does not set it to default. This does:
> >
> > $/ = "\n";
>
> The best I can get is as follows
>
> sub para {
>
> local ($/ = "\0a\0d");
The parentheses counteract the intention of "local". Parenthesized like
this, "\0d\0a" is assigned to $/ and that value is localized. You want
to localize $/ first:
local $/ = "\0a\0d";
[...]
> text but it also get some lines which I do not want and do not get if
That's surely because $/ carries its new value out of the sub.
Anno
------------------------------
Date: 29 Mar 2004 13:02:51 GMT
From: "Tassilo v. Parseval" <tassilo.parseval@rwth-aachen.de>
Subject: Re: how to capture multiple lines?
Message-Id: <c496pr$rd1$1@nets3.rz.RWTH-Aachen.DE>
Also sprach Geoff Cox:
> On Mon, 29 Mar 2004 13:53:51 +0200, Gunnar Hjalmarsson
><noreply@gunnar.cc> wrote:
>
>
>>That does not set it to default. This does:
>>
>> $/ = "\n";
>
> The best I can get is as follows
>
> sub para {
>
> local ($/ = "\0a\0d");
>
> my ($linepara) = @_;
> $linepara =~ /<p>(.*?)<\/p>/s;
> # print ("\$1 = $1 \n");
> print OUT ("<tr><td colspan=2>" . $1 . "<\/td><\/tr> \n");
> $/ = "";
> }
>
> Now, this does get the
><p> jahjsdkaljk al
> asdjk aksdj klad
> kajsd akl </p>
>
> text but it also get some lines which I do not want and do not get if
> I do not use $/ - so am a bit lost. Tempted to put the whol code up
> but that would be asking too much!
>
> I would liek to use the slurp approach but not sure how to do it so
> that as I parse through an html file and find the first line of the
> first <p> etc block of text - how do I get that text and put in into a
> file and then when find the second <p> block put it in the right
> place...I do not want toput all the <p> etc text together..they appear
> at different places in the html file....
If I understand you right, you want to grab everything that appears in
<p> tags? Here's an example using HTML::Parser:
#! /usr/bin/perl -w
package MyParser;
use strict;
use base qw/HTML::Parser/;
our $in_para;
sub start {
my (undef, $tagname) = @_;
$in_para = 1 if $tagname eq 'p';
}
sub end {
my (undef, $tagname) = @_;
$in_para = 0 if $tagname eq 'p';
}
sub text {
my (undef, $text) = @_;
print $text if $in_para;
}
package main;
my $p = MyParser->new;
$p->parse_file("file.html");
It's dead simple: You create a subclass of HTML::Parser (MyParser) that
overrides the start(), end() and text() methods. The start() method
simply sets the global variable $in_para to a true value when it
encounters a <p> start tag. It's set to false when </p> is encountered.
The method text() is triggered for ordinary text. It will only print it
when $in_para is true.
This solution is very robust and since the basic skeleton is only a few
lines, it is easily extensible. You most probably want to change the
text() method to let it print into a file or so. If you want to grab
anything between <p> and </p> (including other tags) you must extend
start() and end() a bit to print their last argument (which is the
original text of the tag as it appeared in the HTML-file). Something
like:
sub start {
my (undef, $tagname, undef, undef, $origtext) = @_;
print $origtext if $in_para;
$in_para = 1 if $tagname eq 'p';
}
sub end {
my (undef, $tagname, $origtext) = @_;
$in_para = 0 if $tagname eq 'p';
print $origtext if $in_para;
}
Tassilo
------------------------------
Date: Mon, 29 Mar 2004 15:43:25 +0200
From: Gunnar Hjalmarsson <noreply@gunnar.cc>
Subject: Re: how to capture multiple lines?
Message-Id: <c49972$2eq769$1@ID-184292.news.uni-berlin.de>
Tassilo v. Parseval wrote:
> If I understand you right, you want to grab everything that appears
> in <p> tags? Here's an example using HTML::Parser:
<code example>
> It's dead simple:
Hmm.. Not sure I agree on "dead simple".
If grabbing everything between <p> tags is *all* there is, I don't
understand why something like this wouldn't be sufficient:
open FH, 'file.html' or die $!;
$_ = do { local $/; <FH> };
close FH;
my @paras;
push @paras, $1 while m!<\s*p[^>]*>(.*?)<\s*/\s*p\s*>!igs;
To me, again if that's all there is, this appears to be even simpler
than "dead simple". ;-)
------------------------------
Date: Mon, 29 Mar 2004 13:29:01 GMT
From: Geoff Cox <geoffacox@dontspamblueyonder.co.uk>
Subject: Re: how to capture multiple lines?
Message-Id: <5t8g6018k1rf1mislnu1ttaichi6f583lh@4ax.com>
On 29 Mar 2004 13:02:51 GMT, "Tassilo v. Parseval"
<tassilo.parseval@rwth-aachen.de> wrote:
>If I understand you right, you want to grab everything that appears in
><p> tags? Here's an example using HTML::Parser:
Tassilo
The code below will take a bit of thinking about! However I need to
get the <p> .... </p> text in the order in which it appears in the
html file, not all together.
The html file has say
<p> ajdkjs ak lsdjas
asdja dkasj dl asd
lad akl;sdk a;dkl; </p>
<h2 align= etc </h2>
<option value = "docs/ etc >text</option>
(when the option line is met I take the path part and use it to search
another file in order to get some related text)
<h2 etc
<option etc
<p> etc
So, if I used the slurp idea - not clear how I would get the above in
order??
Cheers
Geoff
------------------------------
Date: Mon, 29 Mar 2004 13:13:23 GMT
From: Geoff Cox <geoffacox@dontspamblueyonder.co.uk>
Subject: Re: how to capture multiple lines?
Message-Id: <h28g60tdu0vohdp490ue9aje0gl2bs3ove@4ax.com>
On 29 Mar 2004 12:50:44 GMT, anno4000@lublin.zrz.tu-berlin.de (Anno
Siegel) wrote:
>The parentheses counteract the intention of "local". Parenthesized like
>this, "\0d\0a" is assigned to $/ and that value is localized. You want
>to localize $/ first:
>
> local $/ = "\0a\0d";
Anno,
I do not understand this! If I use
local $/ = "\0D\0A"; in the sub routine
I do not get the <p> ...... </p> text.
If I use
local ($/ = "\0D\0A");
I do get it !! But then I get some text which I do not wish to have!
Any ideas?
Cheers
Geoff
------------------------------
Date: Mon, 29 Mar 2004 14:03:05 GMT
From: Geoff Cox <geoffacox@dontspamblueyonder.co.uk>
Subject: Re: how to capture multiple lines?
Message-Id: <2mag601citg08roeqfimtpp8c5326h86p8@4ax.com>
On Mon, 29 Mar 2004 15:43:25 +0200, Gunnar Hjalmarsson
<noreply@gunnar.cc> wrote:
>Tassilo v. Parseval wrote:
>> If I understand you right, you want to grab everything that appears
>> in <p> tags? Here's an example using HTML::Parser:
>
><code example>
>
>> It's dead simple:
>
>Hmm.. Not sure I agree on "dead simple".
Gunnar
not dead simple to me!
>
>If grabbing everything between <p> tags is *all* there is, I don't
>understand why something like this wouldn't be sufficient:
It is not all there is - at least I do not think so...the file
contains blocks of text between <p> and </p>s mixed in with other
lines such as <h2> haskhdjk ashj </h2>, <option jjaksdjka </option>
etc and I want to parse through the file in order, taking the <p> </p>
and <h2> </h2> data into another file. Also when finding the <option
line I use the extracted path to search another file to get related
text...
The nearest I get with $/ is
sub para {
local ($/ = "\0D\0A");
my ($linepara) = @_;
$linepara =~ /<p>(.*?)<\/p>/s;
# print ("\$1 = $1 \n");
print OUT ("<tr><td colspan=2>" . $1 . "<\/td><\/tr> \n");
$/ = "";
}
but as I say, this gets the <p> </p>, <h2> ... </h2> info in the right
places. It also gets the related text using the path info from the
<option lines BUT it also puts in the option selection boxes which I
do not want and does not happen if I do not use $/ ...!!!??
If I use slurp idea then I would put the whole html file into $total
say but how do I parse through it in order of appearance of <p>, <h2>
<option etc as above??
Cheers
Geoff
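
One possible way to walk a slurped file in document order is a single pattern with alternation, scanned repeatedly with /g. This is only a sketch: the sample markup, the variable names ($total, @found, $tag, $text, $path) and the output labels are invented, and a regex like this remains fragile against real-world HTML:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the slurped file.
my $total = <<'HTML';
<p> first
paragraph </p>
<h2 align="left">Heading</h2>
<option value="docs/a.htm">text</option>
<p>second</p>
HTML

my @found;
# Each /g iteration resumes where the previous match ended, so the
# elements come out in the order they appear in the file.
while ($total =~ m{<(p|h2)\b[^>]*>(.*?)</\1\s*>|<option\s+value\s*=\s*"([^"]*)"}gis) {
    # Copy the captures first; later regex operations overwrite $1, $2, $3.
    my ($tag, $text, $path) = ($1, $2, $3);
    if (defined $path) {
        push @found, "option: $path";   # look up the related text here
        next;
    }
    $text =~ s/\s+/ /g;                 # collapse line breaks and runs of spaces
    $text =~ s/^ | $//g;                # trim the ends
    push @found, lc($tag) . ": $text";
}
print "$_\n" for @found;
```

In a real script the print at the end would be replaced by whatever writes each item to the output file as it is found.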
------------------------------
Date: 29 Mar 2004 14:03:37 GMT
From: "Tassilo v. Parseval" <tassilo.parseval@rwth-aachen.de>
Subject: Re: how to capture multiple lines?
Message-Id: <c49abp$1vi$1@nets3.rz.RWTH-Aachen.DE>
Also sprach Gunnar Hjalmarsson:
> Tassilo v. Parseval wrote:
>> If I understand you right, you want to grab everything that appears
>> in <p> tags? Here's an example using HTML::Parser:
>
><code example>
>
>> It's dead simple:
>
> Hmm.. Not sure I agree on "dead simple".
>
> If grabbing everything between <p> tags is *all* there is, I don't
> understand why something like this wouldn't be sufficient:
>
> open FH, 'file.html' or die $!;
> $_ = do { local $/; <FH> };
> close FH;
>
> my @paras;
> push @paras, $1 while m!<\s*p[^>]*>(.*?)<\s*/\s*p\s*>!igs;
>
> To me, again if that's all there is, this appears to be even simpler
> than "dead simple". ;-)
There are some contrived edge cases not captured by the above. For
instance, there could be a closing </p> in an HTML comment (yeah, I
know, this happens all the time;-).
Another situation where a regex could fail is with attributes. The
quoted string in such attributes could contain something that looks like
a tag.
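
A small sketch of the comment case, using made-up input and the pattern quoted above. The variable names $html and $captured are illustrative:

```perl
use strict;
use warnings;

# Hypothetical input: a closing tag hidden inside an HTML comment.
my $html = '<p>real text <!-- old </p> markup --> still the same paragraph</p>';

# The non-greedy match stops at the first thing that looks like </p>,
# even though that one sits inside a comment.
my ($captured) = $html =~ m!<\s*p[^>]*>(.*?)<\s*/\s*p\s*>!is;
print "regex captured: '$captured'\n";
```

The capture ends inside the comment, so the rest of the paragraph is lost; a parser that understands comments would not be fooled this way.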
Tassilo
------------------------------
End of Perl-Users Digest V10 Issue 6327
***************************************