[13652] in Perl-Users-Digest


Perl-Users Digest, Issue: 1062 Volume: 9

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Fri Oct 15 14:36:53 1999

Date: Fri, 15 Oct 1999 11:36:34 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Message-Id: <940012592-v9-i1062@ruby.oce.orst.edu>
Content-Type: text

Perl-Users Digest           Fri, 15 Oct 1999     Volume: 9 Number: 1062

Today's topics:
        Help with extracting a portion of a string <fair@bucknell.edu>
    Re: Help with extracting a portion of a string (Craig Berry)
    Re: Help with extracting a portion of a string (Abigail)
    Re: Help with extracting a portion of a string (Abigail)
    Re: Help with extracting a portion of a string <pooka@cygnus.ucdavis.edu>
    Re: Help with extracting a portion of a string <uri@sysarch.com>
    Re: Help with extracting a portion of a string <pooka@cygnus.ucdavis.edu>
    Re: Help with extracting a portion of a string (Abigail)
    Re: Help with extracting a portion of a string (Tad McClellan)
    Re: Help with extracting a portion of a string (Craig Berry)
    Re: Help with extracting a portion of a string <uri@sysarch.com>
    Re: Help with extracting a portion of a string <pooka@cygnus.ucdavis.edu>
    Re: Help with extracting a portion of a string <pooka@cygnus.ucdavis.edu>
    Re: Help with extracting a portion of a string (Sam Holden)
    Re: Help with extracting a portion of a string (Sam Holden)
    Re: Help with extracting a portion of a string <jeff@vpservices.com>
    Re: Help with extracting a portion of a string <pooka@cygnus.ucdavis.edu>
    Re: Help with extracting a portion of a string <pooka@cygnus.ucdavis.edu>
    Re: Help with extracting a portion of a string <pooka@cygnus.ucdavis.edu>
    Re: Help with extracting a portion of a string sumengen@my-deja.com
    Re: Help with extracting a portion of a string (Sam Holden)
    Re: Help with extracting a portion of a string (Abigail)
    Re: Help with extracting a portion of a string <skilchen@swissonline.ch>
    Re: Help with extracting a portion of a string (Abigail)
    Re: Help with extracting a portion of a string <skilchen@swissonline.ch>
    Re: Help with extracting a portion of a string (Greg Bacon)
    Re: Help with extracting a portion of a string (Tad McClellan)
    Re: Help with extracting a portion of a string (Tad McClellan)
    Re: Help with extracting a portion of a string (Tad McClellan)
    Re: Help with extracting a portion of a string <emschwar@rmi.net>
        Digest Administrivia (Last modified: 16 Sep 99) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Thu, 14 Oct 1999 17:59:01 -0400
From: Eric Fair <fair@bucknell.edu>
Subject: Help with extracting a portion of a string
Message-Id: <3806521C.962624F5@bucknell.edu>

This is from within a CGI script.....

I have a string, and I would like to extract a portion from the middle
of it.
$string has a bunch of HTML stored in it, including a comment (<!-- ...
-->)
I want to find the comment in the string and copy it to another string.
so...

$string = <body>yadda, yadda, yadda,<!--comment-->yadda, yadda...</body>

and I want:

$string2 = <!--comment-->

If anyone can tell me a simple way to do this, or point me to a
tutorial, I would appreciate it.  I have looked in the perlfaq, but the
documentation for the range operator{..} is rather vague........

Eric Fair
fair@bucknell.edu






------------------------------

Date: Thu, 14 Oct 1999 23:57:58 GMT
From: cberry@cinenet.net (Craig Berry)
Subject: Re: Help with extracting a portion of a string
Message-Id: <s0crg6kkhu273@corp.supernews.com>

Eric Fair (fair@bucknell.edu) wrote:
: I have a string, and I would like to extract a portion from the
: middle of it.
:
: $string has a bunch of HTML stored in it, including a comment
: (<!-- ... -->)
: I want to find the comment in the string and copy it to another
: string.  so...
: 
: $string = <body>yadda, yadda, yadda,<!--comment-->yadda, yadda...</body>

When presenting stuff that looks this much like actual code, it's best to
go all the way and make it actual code.  Quoting the string on the rhs of
this assignment, for example.

: and I want:
: 
: $string2 = <!--comment-->
: 
: If anyone can tell me a simple way to do this, or point me to a
: tutorial, I would appreciate it.  I have looked in the perlfaq, but the
: documentation for the range operator{..} is rather vague........

Range operator is (probably) the wrong tree; you want a regex.  However,
there's a big caveat to the suggestion I'm about to give you:  It depends
strongly on the *certainty* that the comment you want to pull out is the
very first one in the string (perhaps the only one, but first for sure),
and that no contorted syntax tricks are present that would make the
opening or closing of the comment hard to pull out unambiguously.  If
these conditions aren't met, you need a parser and some fervent prayer. :)

If they *are* met, however:

  ($string2) = $string =~ m/(<!--.*?-->)/;

Add the /s modifier to the match if the comment might have internal
newlines.
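
A minimal sketch of that suggestion, assuming the string holds just the
sample HTML from the original post (the sample data and the printed
fallback message are illustrative assumptions, not part of anyone's script):

```perl
use strict;
use warnings;

# Sample input modelled on the original post (assumed for illustration).
my $string = "<body>yadda, yadda, yadda,<!--comment-->yadda, yadda...</body>";

# A match in list context returns the captured groups, so the first
# (and only) capture lands in $string2.  /s lets . match newlines, so a
# comment that spans lines is still found.
my ($string2) = $string =~ m/(<!--.*?-->)/s;

print defined $string2 ? "$string2\n" : "no comment found\n";
```

The caveats above still apply: this only picks out the first run of text
between '<!--' and the nearest '-->', which is not the same thing as
parsing HTML.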

HTH!

-- 
   |   Craig Berry - cberry@cinenet.net
 --*--  http://www.cinenet.net/users/cberry/home.html
   |   "They do not preach that their God will rouse them
      a little before the nuts work loose." - Kipling


------------------------------

Date: 14 Oct 1999 21:41:07 -0500
From: abigail@delanet.com (Abigail)
Subject: Re: Help with extracting a portion of a string
Message-Id: <slrn80d51h.q8s.abigail@alexandra.delanet.com>

Eric Fair (fair@bucknell.edu) wrote on MMCCXXXV September MCMXCIII in
<URL:news:3806521C.962624F5@bucknell.edu>:
[] This is from within a CGI script.....
[] 
[] I have a string, and I would like to extract a portion from the middle
[] of it.
[] $string has a bunch of HTML stored in it, including a comment (<!-- ...
[] -->)
[] I want to find the comment in the string and copy it to another string.
[] so...
[] 
[] $string = <body>yadda, yadda, yadda,<!--comment-->yadda, yadda...</body>
[] 
[] and I want:
[] 
[] $string2 = <!--comment-->
[] 
[] If anyone can tell me a simple way to do this, or point me to a
[] tutorial, I would appreciate it.  I have looked in the perlfaq, but the
[] documentation for the range operator{..} is rather vague........


Range operator? I don't think you want to use that.

Here's a solution using Parse::RecDescent, using an extremely
simplistic grammar for HTML, only a tidbit better than HTML::Parser.


#!/opt/perl/bin/perl -ws

use strict;
use Parse::RecDescent;

$::RD_HINT   = 1;
$::RD_ERRORS = 1;
$::RD_WARN   = 1;

my $grammar = <<'EOG';

document:  ( stag | etag | markup_declaration | pcdata )(s?)
             {$thisparser -> {local} -> {comments}}

stag:       '<'  <skip: ""> name <skip: $item [2]> attribute(s?) '>'
etag:       '</' <skip: ""> name <skip: $item [2]> '>'

name:       /[a-zA-Z][-.a-zA-Z\d]*/
attribute:  name ('=' attribute_value)(?)

attribute_value: name | literal
literal:    /"[^"]*"/ | /'[^']*'/

markup_declaration: doctype_decl | comment_decl

doctype_decl:   '<!' <skip: ""> "DOCTYPE" <skip: $item [2]> name
                 external_identifier(?) '>'
external_identifier:
            ( "SYSTEM" | ( "PUBLIC" literal ) ) literal(?)

comment_decl:  '<!' <skip: ""> <leftop: comment /\s*/ comment>(?)
                    <skip: $item [2]> '>'

comment:    '--'  <skip: ""> /(?:[^-]+|-[^-])*/ '--'  
            {push @{$thisparser -> {local} -> {comments}} => $item [3]; 1}

pcdata:     m{(?:[^<]+|<(?![!/a-zA-Z])|</(?![a-zA-Z]))*}

EOG

my $parser = Parse::RecDescent -> new ($grammar) or die;

undef $/;
my $text = <DATA>;

my $comments = $parser -> document ($text);

print "Found comments: ", join " " => map {"'$_'"} @$comments;
print "\n";


__DATA__
<!DOCTYPE Pseudo-HTML PUBLIC "foo bar!" "-- NOT A COMMENT --">
This is some text <!-- with a comment -->. And here are <!-- some
more ---- comments -- > We what <is inside = "a tag <!-- is not -->" a
> comment!




Running that gives:
Found comments: ' with a comment ' ' some
more ' ' comments '





Abigail
-- 
tie $" => A; $, = " "; $\ = "\n"; @a = ("") x 2; print map {"@a"} 1 .. 4;
sub A::TIESCALAR {bless \my $A => A} #  Yet Another silly JAPH by Abigail
sub A::FETCH     {@q = qw /Just Another Perl Hacker/ unless @q; shift @q}


  -----------== Posted via Newsfeeds.Com, Uncensored Usenet News ==----------
   http://www.newsfeeds.com       The Largest Usenet Servers in the World!
------== Over 73,000 Newsgroups - Including  Dedicated  Binaries Servers ==-----


------------------------------

Date: 14 Oct 1999 21:42:55 -0500
From: abigail@delanet.com (Abigail)
Subject: Re: Help with extracting a portion of a string
Message-Id: <slrn80d54u.q8s.abigail@alexandra.delanet.com>

Craig Berry (cberry@cinenet.net) wrote on MMCCXXXV September MCMXCIII in
<URL:news:s0crg6kkhu273@corp.supernews.com>:
() Eric Fair (fair@bucknell.edu) wrote:
() :
() : [ Extracting something out of HTML ]
() :
() 
() Range operator is (probably) the wrong tree; you want a regex.

No, you don't.


Abigail
-- 
srand 123456;$-=rand$_--=>@[[$-,$_]=@[[$_,$-]for(reverse+1..(@[=split
//=>"IGrACVGQ\x02GJCWVhP\x02PL\x02jNMP"));print+(map{$_^q^"^}@[),"\n"




------------------------------

Date: Fri, 15 Oct 1999 04:52:07 GMT
From: Brandon <pooka@cygnus.ucdavis.edu>
Subject: Re: Help with extracting a portion of a string
Message-Id: <3806423A.E8FB9EBB@cygnus.ucdavis.edu>

Abigail wrote:
> 
> Eric Fair (fair@bucknell.edu) wrote on MMCCXXXV September MCMXCIII in
> <URL:news:3806521C.962624F5@bucknell.edu>:
> [] This is from within a CGI script.....
> []
> [] I have a string, and I would like to extract a portion from the middle
> [] of it.
> [] $string has a bunch of HTML stored in it, including a comment (<!-- ...
> [] -->)
> [] I want to find the comment in the string and copy it to another string.
> [] so...
> []
> [] $string = <body>yadda, yadda, yadda,<!--comment-->yadda, yadda...</body>

 ...<bunch of crap deleted>

> 
> Abigail
> --

Wtf are you talking about?

while($html_string =~ /(<!--.*?-->)/g)
{
 $string = $1;
 print "$string\n";
}


------------------------------

Date: 15 Oct 1999 01:17:52 -0400
From: Uri Guttman <uri@sysarch.com>
Subject: Re: Help with extracting a portion of a string
Message-Id: <x7n1tli4n3.fsf@home.sysarch.com>

>>>>> "B" == Brandon  <pooka@cygnus.ucdavis.edu> writes:

  B> Abigail wrote:

  B> ...<bunch of crap deleted>

  B> Wtf are you talking about?

oh, i can't wait for this one. but i will stick my nose in here first.

  B> while($html_string =~ /(<!--.*?-->)/g)
  B> {
  B>  $string = $1;
  B>  print "$string\n";

why assign to $string? can't you print $1?


try that on this html:

<!-- comment wrapping over
multiple lines -->

and there are other areas where you can't match arbitrary markup with
regexes. html has to be parsed. 

the moral of this story is don't go after abigail unless you have more
than half a brain.

uri

-- 
Uri Guttman  ---------  uri@sysarch.com  ----------  http://www.sysarch.com
SYStems ARCHitecture, Software Engineering, Perl, Internet, UNIX Consulting
The Perl Books Page  -----------  http://www.sysarch.com/cgi-bin/perl_books
The Best Search Engine on the Net  ----------  http://www.northernlight.com


------------------------------

Date: Fri, 15 Oct 1999 05:48:22 GMT
From: Brandon <pooka@cygnus.ucdavis.edu>
Subject: Re: Help with extracting a portion of a string
Message-Id: <38064F6A.473C839F@cygnus.ucdavis.edu>

Uri Guttman wrote:
> 

> why assign to $string? can't you print $1?

The original poster said he wanted it in $string. 

> 
> try that on this html:
> 
> <!-- comment wrapping over
> multiple lines -->
> 
> and there are other areas where you can't match arbitrary markup with
> regexes. html has to be parsed.
> 
> the moral of this story is don't go after abigail unless you have more
> than half a brain.
> 

You obviously don't seem to know much about regexes. And since I have
two halves of a brain I feel quite justified in going after Abigail.


------------------------------

Date: 15 Oct 1999 01:00:02 -0500
From: abigail@delanet.com (Abigail)
Subject: Re: Help with extracting a portion of a string
Message-Id: <slrn80dgmg.q8s.abigail@alexandra.delanet.com>

Brandon (pooka@cygnus.ucdavis.edu) wrote on MMCCXXXVI September MCMXCIII
in <URL:news:38064F6A.473C839F@cygnus.ucdavis.edu>:
;; Uri Guttman wrote:
;; > 
;; 
;; > why assign to $string? can't you print $1?
;; 
;; The original poster said he wanted it in $string. 
;; 
;; > 
;; > try that on this html:
;; > 
;; > <!-- comment wrapping over
;; > multiple lines -->
;; > 
;; > and there are other areas where you can't match arbitrary markup with
;; > regexes. html has to be parsed.
;; > 
;; > the moral of this story is don't go after abigail unless you have more
;; > than half a brain.
;; > 
;; 
;; You obviously don't seem to know much about regexes. And since I have
;; two halves of a brain I feel quite justified in going after Abigail.


And you don't seem to know much about HTML/SGML.

<!-- This is a comment --
  -- This is another comment --
  --> Look. Still a comment! <!--
>



Abigail
-- 
sub camel (^#87=i@J&&&#]u'^^s]#'#={123{#}7890t[0.9]9@+*`"'***}A&&&}n2o}00}t324i;
h[{e **###{r{+P={**{e^^^#'#i@{r'^=^{l+{#}H***i[0.9]&@a5`"':&^;&^,*&^$43##@@####;
c}^^^&&&k}&&&}#=e*****[]}'r####'`=437*{#};::'1[0.9]2@43`"'*#==[[.{{],,,1278@#@);
print+((($llama=prototype'camel')=~y|+{#}$=^*&[0-9]i@:;`"',.| |d)&&$llama."\n");




------------------------------

Date: Thu, 14 Oct 1999 19:37:15 -0400
From: tadmc@metronet.com (Tad McClellan)
Subject: Re: Help with extracting a portion of a string
Message-Id: <bfp5u7.e54.ln@magna.metronet.com>

Eric Fair (fair@bucknell.edu) wrote:

: This is from within a CGI script.....


   Pattern matching does not depend on the environment that
   the script is run on.

   So that above is not helpful in helping you.


: I want to find the comment in the string and copy it to another string.
: so...

: If anyone can tell me a simple way to do this, 


   my ($comment) = $string =~ /(<!--.*?--\s*>)/;


   ***But*** you shouldn't really trust code that tries to "parse"
   HTML with only a regex.

   But you already knew that, having checked the Perl FAQs that
   mention HTML.
   

   Tremble in fear if you are getting paid for "parsing" HTML
   with regular expressions.


: I have looked in the perlfaq, 


   There are about 50 files in the standard documentation set.

   Maybe the answer is in one of the 40 _other_ (non-FAQ) files?


: but the
: documentation for the range operator{..} is rather vague........


   Huh?

   The range operator is not useful for this problem.

   I dunno what you're talking about there.

   see perlre.pod and perlop.pod to learn how to do pattern
   matching with regular expressions.



--
    Tad McClellan                          SGML Consulting
    tadmc@metronet.com                     Perl programming
    Fort Worth, Texas


------------------------------

Date: Fri, 15 Oct 1999 06:04:29 GMT
From: cberry@cinenet.net (Craig Berry)
Subject: Re: Help with extracting a portion of a string
Message-Id: <s0dgvd3ghu27@corp.supernews.com>

Abigail (abigail@delanet.com) wrote:
: Craig Berry (cberry@cinenet.net) wrote on MMCCXXXV September MCMXCIII in
: <URL:news:s0crg6kkhu273@corp.supernews.com>:
: () Eric Fair (fair@bucknell.edu) wrote:
: () :
: () : [ Extracting something out of HTML ]
: () 
: () Range operator is (probably) the wrong tree; you want a regex.
: 
: No, you don't.

I carefully explained the caveats around that recommendation in the
remainder of my post.  Sometimes, if your input data is constrained and
well understood, you can get by with 'cheating' and save a lot of work.
See the Virtue of Laziness. :)

-- 
   |   Craig Berry - cberry@cinenet.net
 --*--  http://www.cinenet.net/users/cberry/home.html
   |   "They do not preach that their God will rouse them
      a little before the nuts work loose." - Kipling


------------------------------

Date: 15 Oct 1999 02:13:04 -0400
From: Uri Guttman <uri@sysarch.com>
Subject: Re: Help with extracting a portion of a string
Message-Id: <x7k8opi233.fsf@home.sysarch.com>

>>>>> "B" == Brandon  <pooka@cygnus.ucdavis.edu> writes:

  >> try that on this html:
  >> 
  >> <!-- comment wrapping over
  >> multiple lines -->

  B> You obviously don't seem to know much about regexes. And since I have
  B> two halves of a brain I feel quite justified in going after Abigail.

  B> while($html_string =~ /(<!--.*?-->)/g)

more like you don't know much about regexes. yours does NOT match my
counterexample. your . will not match an embedded newline.  so sorry,
you lose this round. try again sometime. read perlre before you do. or
better yet, read mastering regular expressions.
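
A small sketch of the point about . and newlines, using the two-line
counterexample from this thread (the variable names are made up for the
illustration):

```perl
use strict;
use warnings;

# The counterexample from the thread: a comment wrapping over two lines.
my $html = "<!-- comment wrapping over\nmultiple lines -->";

# Without /s, . refuses to match the newline, so the pattern can never
# reach the closing "-->" and the list of matches is empty.
my @plain = $html =~ /(<!--.*?-->)/g;

# With /s, . matches any character, newline included.
my @dotall = $html =~ /(<!--.*?-->)/gs;

printf "without /s: %d match(es)\n", scalar @plain;
printf "with /s:    %d match(es)\n", scalar @dotall;
```

This only shows the newline behaviour; it says nothing about the other
ways a regex can misread markup.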

uri

-- 
Uri Guttman  ---------  uri@sysarch.com  ----------  http://www.sysarch.com
SYStems ARCHitecture, Software Engineering, Perl, Internet, UNIX Consulting
The Perl Books Page  -----------  http://www.sysarch.com/cgi-bin/perl_books
The Best Search Engine on the Net  ----------  http://www.northernlight.com


------------------------------

Date: Fri, 15 Oct 1999 06:19:00 GMT
From: Brandon <pooka@cygnus.ucdavis.edu>
Subject: Re: Help with extracting a portion of a string
Message-Id: <3806568F.B2752FBD@cygnus.ucdavis.edu>

Abigail wrote:
> 
> Brandon (pooka@cygnus.ucdavis.edu) wrote on MMCCXXXVI September MCMXCIII
> ;; >
> ;;
> ;; You obviously don't seem to know much about regexes. And since I have
> ;; two halves of a brain I feel quite justified in going after Abigail.
> 
> And you don't seem to know much about HTML/SGML.
> 
> <!-- This is a comment --
>   -- This is another comment --
>   --> Look. Still a comment! <!--
> >
> 

Well, you're free to use an HTML parser if you want, since you have so
much time on your hands.


------------------------------

Date: Fri, 15 Oct 1999 06:36:09 GMT
From: Brandon <pooka@cygnus.ucdavis.edu>
Subject: Re: Help with extracting a portion of a string
Message-Id: <38065A9C.4085DFB1@cygnus.ucdavis.edu>

Uri Guttman wrote:
> 
> >>>>> "B" == Brandon  <pooka@cygnus.ucdavis.edu> writes:
> 
>   >> try that on this html:
>   >>
>   >> <!-- comment wrapping over
>   >> multiple lines -->
> 
>   B> You obviously don't seem to know much about regexes. And since I have
>   B> two halves of a brain I feel quite justified in going after Abigail.
> 
>   B> while($html_string =~ /(<!--.*?-->)/g)
> 
> more like you don't know much about regexes. yours does NOT match my
> counterexample. your . will not match an embedded newline.  so sorry,
> you lose this round. try again sometime. read perlre before you do. or
> better yet, read mastering regular expressions.
> 

My regex matches exactly the original poster's example, nothing more,
nothing less. Like I said previously, you're free to use whatever means
you want, but HTML does _not_ need to be parsed to get the comments.
See if your mighty intellect of perlre and "mastering regular
expressions" can find the solution.


------------------------------

Date: 15 Oct 1999 06:37:05 GMT
From: sholden@pgrad.cs.usyd.edu.au (Sam Holden)
Subject: Re: Help with extracting a portion of a string
Message-Id: <slrn80dish.f5f.sholden@pgrad.cs.usyd.edu.au>

On Fri, 15 Oct 1999 05:48:22 GMT, Brandon <pooka@cygnus.ucdavis.edu> wrote:
>Uri Guttman wrote:
>> 
>
>> why assign to $string? can't you print $1?
>
>The original poster said he wanted it in $string. 
>
>> 
>> try that on this html:
>> 
>> <!-- comment wrapping over
>> multiple lines -->
>
>You obviously don't seem to know much about regexes.

Now that's funny. Coming from someone who obviously doesn't even know
what . means in a regex.

-- 
Sam

It is inappropriate to require that a time represented as seconds
since the Epoch precisely represent the number of seconds between
the referenced time and the Epoch
                    --IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2


------------------------------

Date: 15 Oct 1999 06:43:53 GMT
From: sholden@pgrad.cs.usyd.edu.au (Sam Holden)
Subject: Re: Help with extracting a portion of a string
Message-Id: <slrn80dj99.f5f.sholden@pgrad.cs.usyd.edu.au>

On Fri, 15 Oct 1999 06:36:09 GMT, Brandon <pooka@cygnus.ucdavis.edu> wrote:
>
>My regex matches exactly the original poster's example, nothing more,
>nothing less. Like I said previously, you're free to use whatever means
>you want, but HTML does _not_ need to be parsed to get the comments.
>See if your mighty intellect of perlre and "mastering regular
>expressions" can find the solution.

Even funnier than the last one. Brandon is arguing with Abigail (and uri) about
parsing HTML. I'm sure he picks fights with Kernighan about using unix as well.

I have a feeling that your time in c.l.p.m will not be exceptionally
pleasurable. 

-- 
Sam

why can't newbies use hash slices in their hello world programs? :-)
	-- Uri Guttman in <x74skxhve5.fsf@home.sysarch.com>


------------------------------

Date: 15 Oct 1999 05:52:05 GMT
From: Jeff Zucker <jeff@vpservices.com>
Subject: Re: Help with extracting a portion of a string
Message-Id: <3806C0BC.25537D2D@vpservices.com>

Brandon wrote:

[Abigail's comment-catching snippet, bad-mouthed by Brandon, snipped]

> Wtf are you talking about?
> 
> while($html_string =~ /(<!--.*?-->)/g)
> {
>  $string = $1;
>  print "$string\n";
> }

WOW!  You must be quite a whiz at perl to whip that up.  Trouble is, it
prints out a bunch of things that aren't comments and it completely
ignores legitimate comments like

    <!-- a comment with whitespace at end -- > 
    <!-- 
         a multi-line comment 
    --> 

I'd point out some other subtleties of comments that Abigail's script
catches and yours doesn't but I'm afraid they would be over your head. 
Why don't you a) read the HTML specs to find out what comments are and
aren't, b) try out programs that are posted here before you declare them
as crap, c) try out your own programs to see if they work before you
post them, d) get a clue about who Abigail is and e) have a nice life,
but not here, please.
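
As a rough check of the two examples above, here is a sketch comparing
the posted pattern with a slightly more tolerant one (/s added, and
optional white space allowed before the closing '>'). The sample text
and variable names are assumptions made for the illustration, and the
tolerant pattern is still a regex shortcut, not a parser:

```perl
use strict;
use warnings;

# The two comments quoted above, as one block of text.
my $html = join "\n",
    '    <!-- a comment with whitespace at end -- >',
    '    <!--',
    '         a multi-line comment',
    '    -->',
    '';

# The pattern as posted: no /s, and "-->" must appear with no space.
my @strict = $html =~ /(<!--.*?-->)/g;

# More tolerant: . may cross newlines, white space allowed before ">".
my @tolerant = $html =~ /(<!--.*?--\s*>)/gs;

printf "posted pattern:   %d comment(s)\n", scalar @strict;
printf "tolerant pattern: %d comment(s)\n", scalar @tolerant;
```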

-- 
Jeff


------------------------------

Date: Fri, 15 Oct 1999 06:52:26 GMT
From: Brandon <pooka@cygnus.ucdavis.edu>
Subject: Re: Help with extracting a portion of a string
Message-Id: <38065E6E.8EF19089@cygnus.ucdavis.edu>

Sam Holden wrote:
> 
> On Fri, 15 Oct 1999 05:48:22 GMT, Brandon <pooka@cygnus.ucdavis.edu> wrote:
> >Uri Guttman wrote:
> >>
> >
> >> why assign to $string? can't you print $1?
> >
> >The original poster said he wanted it in $string.
> >
> >>
> >> try that on this html:
> >>
> >> <!-- comment wrapping over
> >> multiple lines -->
> >
> >You obviously don't seem to know much about regexes.
> 
> Now that's funny. Coming from someone who obviously doesn't even know
> what . means in a regex.
> 

You don't think I can make . match a newline if so inclined?


------------------------------

Date: Fri, 15 Oct 1999 06:58:32 GMT
From: Brandon <pooka@cygnus.ucdavis.edu>
Subject: Re: Help with extracting a portion of a string
Message-Id: <38065FDC.B39154F3@cygnus.ucdavis.edu>

Sam Holden wrote:

> Even funnier than the last one. Brandon is arguing with Abigail (and uri) about
> parsing HTML. I'm sure he picks fights with Kernighan about using unix as well.
> 
> I have a feeling that your time in c.l.p.m will not be exceptionally
> pleasurable.
> 

So far, so good.


------------------------------

Date: Fri, 15 Oct 1999 07:11:39 GMT
From: Brandon <pooka@cygnus.ucdavis.edu>
Subject: Re: Help with extracting a portion of a string
Message-Id: <380662F0.6BA3C555@cygnus.ucdavis.edu>

Jeff Zucker wrote:
> > 
> WOW!  You must be quite a whiz at perl to whip that up.  Trouble is, it
> prints out a bunch of things that aren't comments and it completely
> ignores legitimate comments like
> 
>     <!-- a comment with whitespace at end -- >
>     <!--
>          a multi-line comment
>     -->
> 
> I'd point out some other subtleties of comments that Abigail's script
> catches and yours doesn't but I'm afraid they would be over your head.
> Why don't you a) read the HTML specs to find out what comments are and
> aren't, b) try out programs that are posted here before you declare them
> as crap, c) try out your own programs to see if they work before you
> post them, d) get a clue about who Abigail is and e) have a nice life,
> but not here, please.


My example was minimalistic, and matched exactly the original poster's
example. All the comment anomalies are easily corrected _without_ having
to use parsing modules. And just to be fair to Abigail, I wasn't calling
her code crap; it was more of a generic term for quoted text that wastes
space. So lookit, if you want to use parsers, use parsers. There are
other, easier ways to do it, and that's all I intended to say from the
beginning. Forgive me for daring to challenge the word of your queen.

I think I'll have my nice day right here, thanks.


------------------------------

Date: Fri, 15 Oct 1999 07:15:07 GMT
From: sumengen@my-deja.com
Subject: Re: Help with extracting a portion of a string
Message-Id: <7u6k9k$10d$1@nnrp1.deja.com>

Hi,
Yours doesn't work if there is more than one comment in the html file.
Will this work?
$html =~ m/(<!--[^(-->)]*-->)/g;
$string = $1;

I hope so..
baris.




------------------------------

Date: 15 Oct 1999 07:24:06 GMT
From: sholden@pgrad.cs.usyd.edu.au (Sam Holden)
Subject: Re: Help with extracting a portion of a string
Message-Id: <slrn80dlkm.ir2.sholden@pgrad.cs.usyd.edu.au>

On Fri, 15 Oct 1999 07:15:07 GMT, sumengen@my-deja.com wrote:
>Hi,
>Yours doesn't work if there is more than one comment in the html file.
>Will this work?
>$html =~ m/(<!--[^(-->)]*-->)/g;

That does not do what you think it does. [^(-->)] matches any character except
for (,-,> and ). Comments are allowed to contain those characters...
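
A short sketch of why the bracketed form misbehaves: the brackets make a
negated character class (a set of single characters), not "anything but
the string -->". The three test strings are invented for the
illustration:

```perl
use strict;
use warnings;

my $class_re = qr/(<!--[^(-->)]*-->)/;   # the pattern from the post above
my $lazy_re  = qr/(<!--.*?-->)/s;        # non-greedy alternative

# Bodies containing '>' or '-' are rejected by the character class,
# even though such characters are legal inside a comment.
for my $html ('<!--plain-->', '<!-- a > b -->', '<!-- well-formed -->') {
    my $class = $html =~ $class_re ? 'matches' : 'no match';
    my $lazy  = $html =~ $lazy_re  ? 'matches' : 'no match';
    print "$html  class: $class, non-greedy: $lazy\n";
}
```

The non-greedy form is shown only for contrast; it has the limitations
already discussed in this thread.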


-- 
Sam

Even if you aren't in doubt, consider the mental welfare of the person
who has to maintain the code after you, and who will probably put parens
in the wrong place.	--Larry Wall


------------------------------

Date: 15 Oct 1999 02:31:33 -0500
From: abigail@delanet.com (Abigail)
Subject: Re: Help with extracting a portion of a string
Message-Id: <slrn80dm23.q8s.abigail@alexandra.delanet.com>

Brandon (pooka@cygnus.ucdavis.edu) wrote on MMCCXXXVI September MCMXCIII
in <URL:news:3806568F.B2752FBD@cygnus.ucdavis.edu>:
&& Abigail wrote:
&& > 
&& > Brandon (pooka@cygnus.ucdavis.edu) wrote on MMCCXXXVI September MCMXCIII
&& > ;; >
&& > ;;
&& > ;; You obviously don't seem to know much about regexes. And since I have
&& > ;; two halves of a brain I feel quite justified in going after Abigail.
&& > 
&& > And you don't seem to know much about HTML/SGML.
&& > 
&& > <!-- This is a comment --
&& >   -- This is another comment --
&& >   --> Look. Still a comment! <!--
&& > >
&& > 
&& 
&& Well, you're free to use an HTML parser if you want, since you have so
&& much time on your hands.


Oh, already working on one. A better one than what's now found in HTML::*.

Next week, "The SGML Handbook" should arrive.... ;-)



Abigail
-- 
perl5.004 -wMMath::BigInt -e'$^V=Math::BigInt->new(qq]$^F$^W783$[$%9889$^F47]
 .qq]$|88768$^W596577669$%$^W5$^F3364$[$^W$^F$|838747$[8889739$%$|$^F673$%$^W]
 .qq]98$^F76777$=56]);$^U=substr($]=>$|=>5)*(q.25..($^W=@^V))=>do{print+chr$^V
%$^U;$^V/=$^U}while$^V!=$^W'




------------------------------

Date: Fri, 15 Oct 1999 08:44:33 GMT
From: "Samuel Kilchenmann" <skilchen@swissonline.ch>
Subject: Re: Help with extracting a portion of a string
Message-Id: <RPBN3.24458$m4.89592549@news.magma.ca>

Abigail <abigail@delanet.com> wrote in:
news:slrn80dgmg.q8s.abigail@alexandra.delanet.com...
>
> <!-- This is a comment --
>   -- This is another comment --
>   --> Look. Still a comment! <!--
> >
This is a malformed (invalid) HTML comment. From
http://www.w3.org/TR/1998/REC-html40-19980424

  3.2.4 Comments
  HTML comments have the following syntax:

  <!-- this is a comment -->
  <!-- and so is this one,
      which occupies more than one line -->

  White space is not permitted between the markup declaration open
  delimiter("<!") and the comment open delimiter ("--"), but is
  permitted between the comment close delimiter ("--") and the
  markup declaration close delimiter (">"). A common error is to
  include a string of hyphens ("---") within a comment. Authors
  should avoid putting two or more adjacent hyphens inside comments.

  Information that appears between comments has no special meaning
  (e.g., character references are not interpreted as such).




------------------------------

Date: 15 Oct 1999 04:26:34 -0500
From: abigail@delanet.com (Abigail)
Subject: Re: Help with extracting a portion of a string
Message-Id: <slrn80dspo.q8s.abigail@alexandra.delanet.com>

Samuel Kilchenmann (skilchen@swissonline.ch) wrote on MMCCXXXVI September
MCMXCIII in <URL:news:RPBN3.24458$m4.89592549@news.magma.ca>:
&& Abigail <abigail@delanet.com> wrote in:
&& news:slrn80dgmg.q8s.abigail@alexandra.delanet.com...
&& >
&& > <!-- This is a comment --
&& >   -- This is another comment --
&& >   --> Look. Still a comment! <!--
&& > >
&& This is a malformed (invalid) HTML comment. From
&& http://www.w3.org/TR/1998/REC-html40-19980424
&& 
&&   3.2.4 Comments
&&   HTML comments have the following syntax:
&& 
&&   <!-- this is a comment -->
&&   <!-- and so is this one,
&&       which occupies more than one line -->
&& 
&&   White space is not permitted between the markup declaration open
&&   delimiter("<!") and the comment open delimiter ("--"), but is
&&   permitted between the comment close delimiter ("--") and the
&&   markup declaration close delimiter (">"). A common error is to
&&   include a string of hyphens ("---") within a comment. Authors
&&   should avoid putting two or more adjacent hyphens inside comments.
&& 
&&   Information that appears between comments has no special meaning
&&   (e.g., character references are not interpreted as such).

That description isn't complete. This is from the SGML production rules:


[91] comment declaration (10.3, 391:1) =
        ( mdo ("<!"),
          ?( comment [92],
             *( s [5]
              | comment [92] ) ),
          mdc (">") )

[92] comment (10.3, 391:7) =
        ( com ("--"),
          *SGML character [50],
          com ("--") )

As you can see, one can have multiple comments inside a single
comment declaration.

     <!-- This is a comment --
       -- This is another comment --
       --> Look. Still a comment! <!--
     >

is a declaration with three comments:
        ' This is a comment '
        ' This is another comment '
        '> Look. Still a comment! <!'

There are 3 tokens involved here: MDO ('<!'), MDC ('>') and COM ('--').
'<!--' is not a single token, and neither is '-->'.
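
The three-token reading can be sketched in a few lines of Perl. This is
only an illustration of productions [91] and [92] quoted above, under
loud assumptions: the function name sgml_comments is hypothetical, and
the code looks at comment declarations in isolation. It knows nothing
about tags, attribute values, DOCTYPE declarations or CDATA, so it will
happily "find" comments inside a quoted attribute value, which a real
parser would not.

```perl
use strict;
use warnings;

# Return the bodies of all comments inside comment declarations:
# MDO ("<!"), then zero or more comments (COM ... COM), each optionally
# followed by white space, then MDC (">").
sub sgml_comments {
    my ($text) = @_;
    my @found;
    while ($text =~ /<!(?=--|>)/g) {            # MDO, then COM or MDC
        my $start = pos $text;
        my @bodies;
        # One comment: "--", everything up to the next "--", then
        # optional white space before the next comment or the MDC.
        while ($text =~ /\G--(.*?)--\s*/gcs) {
            push @bodies, $1;
        }
        if ($text =~ /\G>/gc) {                  # MDC closes the declaration
            push @found, @bodies;
        }
        else {
            pos($text) = $start;                 # not a declaration; move on
        }
    }
    return @found;
}

my $decl = join "\n",
    '<!-- This is a comment --',
    '  -- This is another comment --',
    '  --> Look. Still a comment! <!--',
    '>';

print "'$_'\n" for sgml_comments($decl);
```

Run on the declaration above, it reports the same three comment bodies
listed earlier in this post, and an empty declaration such as '<!>'
yields no comments at all.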


And this is from the only HTML standard that made it to an RFC (RFC1866):

3.2.5. Comments
        
   To include comments in an HTML document, use a comment declaration. A
   comment declaration consists of `<!' followed by zero or more
   comments followed by `>'. Each comment starts with `--' and includes
   all text up to and including the next occurrence of `--'. In a
   comment declaration, white space is allowed after each comment, but
   not before the first comment.  The entire comment declaration is
   ignored.

      NOTE - Some historical HTML implementations incorrectly consider
      any `>' character to be the termination of a comment.
        
   For example:

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <HEAD>
    <TITLE>HTML Comment Example</TITLE>
    <!-- Id: html-sgml.sgm,v 1.5 1995/05/26 21:29:50 connolly Exp  -->
    <!-- another -- -- comment -->
    <!> 
    </HEAD>
    <BODY>
    <p> <!- not a comment, just regular old data characters ->



Now, there are of course luser browsers that will display something for:

     <!-- -- -->If you read this, your browser sucks.<!-- -- -->

but that doesn't mean they are correct.


Abigail
-- 
package Just_another_Perl_Hacker; sub print {($_=$_[0])=~ s/_/ /g;
                                      print } sub __PACKAGE__ { &
                                      print (     __PACKAGE__)} &
                                                  __PACKAGE__
                                            (                )




------------------------------

Date: Fri, 15 Oct 1999 12:45:57 GMT
From: "Samuel Kilchenmann" <skilchen@swissonline.ch>
Subject: Re: Help with extracting a portion of a string
Message-Id: <9mFN3.24505$m4.89682581@news.magma.ca>

Abigail <abigail@delanet.com> wrote in:
news:slrn80dspo.q8s.abigail@alexandra.delanet.com...
> Samuel Kilchenmann (skilchen@swissonline.ch) wrote on MMCCXXXVI
> MCMXCIII in <URL:news:RPBN3.24458$m4.89592549@news.magma.ca>:
> && Abigail <abigail@delanet.com> wrote in:
> && news:slrn80dgmg.q8s.abigail@alexandra.delanet.com...
> && >
> && > <!-- This is a comment --
> && >   -- This is another comment --
> && >   --> Look. Still a comment! <!--
> && > >
> && This is a malformed (invalid) HTML comment. From
> && http://www.w3.org/TR/1998/REC-html40-19980424
> &&
> &&   3.2.4 Comments
> &&   HTML comments have the following syntax:
> &&
> &&   <!-- this is a comment -->
> &&   <!-- and so is this one,
> &&       which occupies more than one line -->
> &&
> &&   White space is not permitted between the markup declaration open
> &&   delimiter("<!") and the comment open delimiter ("--"), but is
> &&   permitted between the comment close delimiter ("--") and the
> &&   markup declaration close delimiter (">"). A common error is to
> &&   include a string of hyphens ("---") within a comment. Authors
> &&   should avoid putting two or more adjacent hyphens inside
> &&   comments.
> &&
> &&   Information that appears between comments has no special meaning
> &&   (e.g., character references are not interpreted as such).
>
> That description isn't complete. This is from the SGML production
> rules:
>
Thanks a lot for the very interesting followup. But how do you know that
the cited description is incomplete? Isn't this more restrictive
definition of the allowed HTML comment syntax intended, because "nobody"
cared to implement the full SGML comments definition in the past?

An even more restrictive syntax for comments was defined (proposed?) for
XML, see http://www.w3.org/TR/REC-xml:
  2.5 Comments
  Comments may appear anywhere in a document outside other markup; in
  addition, they may appear within the document type declaration at
  places allowed by the grammar. They are not part of the document's
  character data; an XML processor may, but need not, make it possible
  for an application to retrieve the text of comments. For
  compatibility, the string "--" (double-hyphen) must not occur
  within comments.

  Comments
  [15]  Comment ::=  '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'

What would you say about an XML processor implementing this definition
instead of the full SGML comments definition? Luser?
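
For comparison, production [15] translates almost directly into a Perl
pattern. This is a sketch under one simplifying assumption: any character
is accepted as a Char, whereas the XML production restricts the allowed
character range. The variable name is made up for the example.

```perl
use strict;
use warnings;

# Production [15]: '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
# Assumption: every character counts as a Char.
my $xml_comment = qr/<!--(?:[^-]|-[^-])*-->/;

for my $candidate ('<!-- fine -->', '<!-- a - b -->',
                   '<!-- a -- b -->', '<!-- a --->') {
    my $verdict = $candidate =~ /\A$xml_comment\z/ ? 'ok' : 'not well-formed';
    print "$candidate: $verdict\n";
}
```

A single hyphen inside the comment is accepted; a double hyphen, or a
comment ending in `--->`, is not.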




------------------------------

Date: 15 Oct 1999 14:18:41 GMT
From: gbacon@ruby.itsc.uah.edu (Greg Bacon)
Subject: Re: Help with extracting a portion of a string
Message-Id: <7u7d41$br1$1@info2.uah.edu>

In article <380662F0.6BA3C555@cygnus.ucdavis.edu>,
	Brandon <pooka@cygnus.ucdavis.edu> writes:

: My example was minimalistic, and matched exactly the original poster's
: example. All the comment anomalies are easily corrected _without_ having
: to use parsing modules. And just to be fair to Abigail, I wasn't calling
: her code crap; it was more of a generic term for quoted text that wastes
: space. So lookit, if you want to use parsers, use parsers. There are
: other, easier ways to do it, and that's all I intended to say from the
: beginning. Forgive me for daring to challenge the word of your queen.

The point isn't to solve the single problem proposed.  If that were the
case, then

    $string2 = "<!--comment-->";

would be the ideal solution.  It's likely that Eric would use solutions
on arbitrary HTML.  Your regular expression fails for arbitrary HTML.
Therefore, your solution is bad given Eric's problem definition.
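
To make the failure concrete, here is a sketch using a typical quick
comment pattern. The pattern is an assumption chosen for illustration,
not necessarily the one posted earlier in the thread; the input is the
multi-comment declaration quoted elsewhere in this digest.

```perl
use strict;
use warnings;

my $html = "<!-- This is a comment --\n"
         . "  -- This is another comment --\n"
         . "  --> Look. Still a comment! <!--\n"
         . ">";

# Stand-in for a "quick" comment regex (an assumption, see above).
my @found = $html =~ /<!--(.*?)-->/gs;

# What the same pattern would leave behind if used to strip comments.
(my $leftover = $html) =~ s/<!--.*?-->//gs;

print scalar(@found), " comment(s) found\n";
print "left behind as text: [$leftover]\n";
```

The pattern reports a single comment ending at the first `-->`, and the
third comment is left behind as if it were document text.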

Thanks for playing.  Hack Perl!

Greg
-- 
You've got to understand their market has always been the Windows space,
where you're actually doing people a favor by charging them money for things,
because that's the only way to keep from confusing them.
    -- Larry Wall on ActiveState


------------------------------

Date: Fri, 15 Oct 1999 05:28:11 -0400
From: tadmc@metronet.com (Tad McClellan)
Subject: Re: Help with extracting a portion of a string
Message-Id: <b3s6u7.s05.ln@magna.metronet.com>

Brandon (pooka@cygnus.ucdavis.edu) wrote:
: Abigail wrote:
: > 
: > Brandon (pooka@cygnus.ucdavis.edu) wrote on MMCCXXXVI September MCMXCIII
: > ;; >
: > ;;
: > ;; You obviously don't seem to know much about regexes. And since I have
: > ;; two halves of a brain I feel quite justified in going after Abigail.
: > 
: > And you don't seem to know much about HTML/SGML.
: > 
: > <!-- This is a comment --
: >   -- This is another comment --
: >   --> Look. Still a comment! <!--
: > >
: > 

: Well, you're free to use an HTML parser if you want, since you have so
: much time on your hands.


   Well, you're free to write code that doesn't work since
   you don't have much time on your hands.

   My bosses tell me stuff like that all the time (not!)


--
    Tad McClellan                          SGML Consulting
    tadmc@metronet.com                     Perl programming
    Fort Worth, Texas


------------------------------

Date: Fri, 15 Oct 1999 05:55:20 -0400
From: tadmc@metronet.com (Tad McClellan)
Subject: Re: Help with extracting a portion of a string
Message-Id: <8mt6u7.s05.ln@magna.metronet.com>

Samuel Kilchenmann (skilchen@swissonline.ch) wrote:

: From
: http://www.w3.org/TR/1998/REC-html40-19980424

:   3.2.4 Comments
:   HTML comments have the following syntax:

:   <!-- this is a comment -->
:   <!-- and so is this one,
:       which occupies more than one line -->

:   White space is not permitted between the markup declaration open
:   delimiter("<!") and the comment open delimiter ("--"), but is
:   permitted between the comment close delimiter ("--") and the
:   markup declaration close delimiter (">"). A common error is to
:   include a string of hyphens ("---") within a comment. Authors
:   should avoid putting two or more adjacent hyphens inside comments.
    ^^^^^^
    ^^^^^^

   "should" is not the same as "must" in spec-speak.

   So you _can_ put a string of hyphens in a comment
   (though they may serve to end a comment and start 
   another comment, which is likely not what was intended)



--
    Tad McClellan                          SGML Consulting
    tadmc@metronet.com                     Perl programming
    Fort Worth, Texas


------------------------------

Date: Fri, 15 Oct 1999 05:52:23 -0400
From: tadmc@metronet.com (Tad McClellan)
Subject: Re: Help with extracting a portion of a string
Message-Id: <ngt6u7.s05.ln@magna.metronet.com>

Brandon (pooka@cygnus.ucdavis.edu) wrote:
: Jeff Zucker wrote:
: > > 
: > WOW!  You must be quite a whiz at perl to whip that up.  Trouble is, it
: > prints out a bunch of things that aren't comments and it completely
: > ignores legitimate comments like


: > I'd point out some other subtleties of comments that Abigail's script
: > catches and yours doesn't but I'm afraid they would be over your head.



: All the comment anomalies are easily corrected _without_ having
: to use parsing modules. 


   Put up or shut up.

   If it is easy, then post the corrected regex code.

   Should only be a few minutes, right?

   Certainly less time than you have spent saying you could do
   it without actually doing it.

   Blowing smoke does not support your position. Support your
   position, or stop espousing it.



   Then someone will post legal HTML that breaks it.

   Then you can make the "easy" correction to handle that case,
   and post the corrected code.

   Then someone will post legal HTML that breaks it.

   Then you can make the "easy" correction to handle that case,
   and post the corrected code.

   Then someone will post legal HTML that breaks it.

   Then you can make the "easy" correction ...


: So lookit, if you want to use parsers, use parsers. There are
: other, easier ways to do it, and that's all I intended to say from the
: beginning. 


   That is all we intended to refute from the beginning,
   because it is not true.

   We cannot allow disinformation to go unchallenged.


: Forgive me for daring to challenge the word of your queen.


   It is not a challenge to abigail. 

   It is a challenge to the w3c, who define what HTML is.



   Surely UC Davis has courses on Formal Methods and Set Theory?

   Go ask a prof if you can parse a context free grammar with
   a regular expression.

   Then come back and apologize.


--
    Tad McClellan                          SGML Consulting
    tadmc@metronet.com                     Perl programming
    Fort Worth, Texas


------------------------------

Date: 15 Oct 1999 11:44:35 -0600
From: Eric The Read <emschwar@rmi.net>
Subject: Re: Help with extracting a portion of a string
Message-Id: <xkfemew34e3.fsf@valdemar.col.hp.com>

tadmc@metronet.com (Tad McClellan) writes:
>    Go ask a prof if you can parse a context free grammar with
>    a regular expression.

Well, to be fair, Perl's regexes aren't "regular expressions" in the
formal sense of the word.  ISTR Ilya saying that, with some of the new
features in the regex engine, it *might* be possible to parse HTML with
them.
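
The post does not say which features are meant. As one illustration of
going beyond formal regular expressions, later Perls (5.10 and up, an
assumption about the reader's version) support recursive subpatterns,
which can match arbitrarily nested balanced text. The sketch below
handles only parentheses and is nowhere near an HTML parser.

```perl
use strict;
use warnings;

# Needs Perl 5.10 or later for (?1) and the possessive ++ quantifier.
my $balanced = qr/
    \A
    (                   # group 1: one balanced parenthesised unit
        \(
        (?: [^()]++     # a run of characters that are not parentheses
          | (?1)        # or a nested unit, by recursing into group 1
        )*
        \)
    )
    \z
/x;

for my $s ('(a(b)c)', '((()))', '(a(b)c') {
    printf "%-8s %s\n", $s, $s =~ $balanced ? 'balanced' : 'not balanced';
}
```

No formal regular expression can do this for unbounded nesting depth,
which is the distinction being drawn here.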

Not that Brandon has shown any signs of understanding either HTML or
regexes well enough to pull this off.

-=Eric
-- 
"Cutting the space budget really restores my faith in humanity.  It
eliminates dreams, goals, and ideals and lets us get straight to the
business of hate, debauchery, and self-annihilation."
                -- Johnny Hart


------------------------------

Date: 16 Sep 99 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 16 Sep 99)
Message-Id: <null>


Administrivia:

The Perl-Users Digest is a retransmission of the USENET newsgroup
comp.lang.perl.misc.  For subscription or unsubscription requests, send
the single line:

	subscribe perl-users
or:
	unsubscribe perl-users

to almanac@ruby.oce.orst.edu.  

| NOTE: The mail to news gateway, and thus the ability to submit articles
| through this service to the newsgroup, has been removed. I do not have
| time to individually vet each article to make sure that someone isn't
| abusing the service, and I no longer have any desire to waste my time
| dealing with the campus admins when some fool complains to them about an
| article that has come through the gateway instead of complaining
| to the source.

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

To request back copies (available for a week or so), send your request
to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
where x is the volume number and y is the issue number.

For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending Perl questions to the -request address; I don't have time to
answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V9 Issue 1062
**************************************
