[9949] in Perl-Users-Digest

Perl-Users Digest, Issue: 3542 Volume: 8

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Tue Aug 25 19:02:13 1998

Date: Tue, 25 Aug 98 16:01:28 -0700
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Tue, 25 Aug 1998     Volume: 8 Number: 3542

Today's topics:
    Re: Regex question - removing HTML tags.... (Larry Rosler)
    Re: Regex question - removing HTML tags.... <sneaker@sneex.fccj.org>
    Re: Regex question - removing HTML tags.... <dgris@rand.dimensional.com>
    Re: Scheduling Perl with Win NT AT Svc (Jonathan Stowe)
        What is it? hoangngo@usa.net
    Re: What is it? <erik@zeno.com>
    Re: What is it? (Craig Berry)
    Re: Win32 Q: Reading Outlook 98 files (Jonathan Stowe)
    Re: Y2K Date Support <sneaker@sneex.fccj.org>
    Re: Y2K Date Support (I R A Aggie)
        Special: Digest Administrivia (Last modified: 12 Mar 98) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Tue, 25 Aug 1998 14:13:17 -0700
From: lr@hpl.hp.com (Larry Rosler)
Subject: Re: Regex question - removing HTML tags....
Message-Id: <MPG.104cc8ccfa23ddc1989771@nntp.hpl.hp.com>

In article <No.unsoiliciteds-2608980500210001@cs11i11.ppp.infoweb.or.jp> 
on Wed, 26 Aug 1998 05:00:20 +0900, Norman UNsoliciteds 
<No.unsoiliciteds@dead.end> says...
 ...
> However if the intention is to deHTMLize a hyper text document, the
> "<"">""&" won't be escaped (&lt;) this would mean as a  HTML document it
> would be useless - a browser would show the document as it actually
> was(showing the tags as text) instead of formatting the text according to
> the tags. This approach would mean having to parse the existing document
> so that the offending character entities could be substituted for their
> escaped versions to then re parse them with the regexp to remove them

My intent (and I think the intent of the other submitters) is to extract 
tags from an HTML document, process them in some way, and perhaps 
reinsert them as edited HTML.  The characters in question are within 
attributes or comments, so have no syntactic significance as HTML.  The 
only issue is whether and how they complicate the extraction process.

-- 
(Yet Another Larry) Rosler
Hewlett-Packard Laboratories
http://www.hpl.hp.com/personal/Larry_Rosler/
lr@hpl.hp.com


------------------------------

Date: Tue, 25 Aug 1998 17:26:03 -0400
From: Bill 'Sneex' Jones <sneaker@sneex.fccj.org>
Subject: Re: Regex question - removing HTML tags....
Message-Id: <35E32BEB.6D446201@sneex.fccj.org>

iada@hplb.hpl.hp.com wrote:
> 
> Hi,
>    I'm very new to perl and haven't got my head around the joys of regular
> expressions -> I came across this example for removing the HTML tags from a
> string and can't work out how it works:
> 
> $value=~s/<([^>]|\n)*>//g
> 
> from what I understand it strips the < > pair and anything inbetween, the []
> being a character class, the parenthesis being a group and the * being one or
> more tokens to match (the g being  global replace?)
> 
> However....I'm confused by the caret - not the beginning of a string in this
> context? and the | in the character class...
> 
> I've tried consulting the man pages as well as all the tutorials I could find,
> and am still stuck :-(
> 
> Can anyone tell me how this works and/or point me in the direction of an
> idiots guide to Regular expressions?
> 
> Thanks
> 
>  Ian

:]  Hi,

As a pointer to other 'newbies' who may be reading this thread:

From what I have read in this group in the past, it is hard to
strip HTML tags from HTML correctly.  See:

my $value = "<HTML>Value > was <...</HTML>";

$value=~s/<([^>]|\n)*>//g;

print "Value now: $value\n\n";


Which should print:

Value now: Value > was <...

But really prints:

Value now: Value > was

You lost the "<..." because the RegEx treated it as the start of a tag...

Sorry,  I do not know the answer to your question...
-Sneex- 
__________________________________________________________________
Bill Jones | FCCJ Webmaster | Murphy's Law of Research:
           Enough research will tend to support your theory.


------------------------------

Date: Tue, 25 Aug 1998 22:25:06 GMT
From: Daniel Grisinger <dgris@rand.dimensional.com>
Subject: Re: Regex question - removing HTML tags....
Message-Id: <6rvcpn$9hm$1@rand.dimensional.com>

[posted to comp.lang.perl.misc and mailed to the cited author]

[warning- this article turned out to be very long and contains
 a regular expression that may cause weak-stomached people
 to become physically ill.  You have been warned :-)]

In article <6ruio4$2gi$1@nnrp1.dejanews.com>
iada@hplb.hpl.hp.com wrote:
>Hi,
>   I'm very new to perl and haven't got my head around the joys of regular
>expressions -> I came across this example for removing the HTML tags from a
>string and can't work out how it works:

Ok, I'll explain it, but then I'm going to explain why using
a regular expression to remove html is a bad idea.


>$value=~s/<([^>]|\n)*>//g

$value =~ s{ <         # match a literal `<', start of html tag

               (        # begin group 1

                [^>]     # anything but >
                |        # or
                \n       # a literal newline (this is redundant)

               )*       # end group 1, match 0 or more times.

             >         # match a literal '>'
           }{}gx;      # substitute with nothing; /g replaces every match,
                       # and the /x modifier allows the whitespace and
                       # comments in the pattern (but not in the
                       # replacement, which is why it is left empty).


>from what I understand it strips the < > pair and anything inbetween, the []
>being a character class, the parenthesis being a group and the * being one or
>more tokens to match (the g being  global replace?)

The * means 0 or more, but otherwise you have got this part
right :-).


>However....I'm confused by the caret - not the beginning of a string in this
>context? and the | in the character class...

When ^ is the first character in a character class it has a different
meta-meaning than when it is the first character in a regular expression.
In a character class it negates the rest of the class (this example
says to match anything not a `>').

The | is not in the character class, instead it signals alternation
among the subexpressions that are contained in group 1.
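
A small sketch of both points, using a made-up sample string (the
variable names are illustrative and not from the original question):
^ as an anchor versus ^ as class negation, and | as alternation
inside a group.

```perl
use strict;
use warnings;

my $text = 'see <b>bold</b> text';

# Outside a character class, ^ anchors the match at the start of the string.
my $starts_with_tag = ($text =~ /^<b>/) ? 1 : 0;

# As the first character inside a class, ^ negates it: [^>] matches any
# single character except '>'.
my ($tag_name) = $text =~ /<([^>]*)>/;

# | is alternation: the group matches either 'bold' or 'italic'.
my ($style) = $text =~ /(bold|italic)/;

print "starts with tag: $starts_with_tag\n";   # prints 0
print "first tag name:  $tag_name\n";          # prints b
print "style word:      $style\n";             # prints bold
```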

Now, a couple of nitpicks. Anyone not interested in reading
further can safely stop now that the question has been answered.

OK, still with me?  Let's move on to the nitpicking.

  1.  This regular expression won't work.  It fails to properly
      match possible constructs that are valid HTML.

  2.  It is impossible, in general, to match all HTML constructs
      using a single regular expression.[0]  This is because HTML
      allows for elements to be nested to an arbitrary level.
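
To make point 1 concrete, here is a minimal sketch that runs the
regex from the question over a hypothetical tag whose attribute value
contains a quoted '>'.  The helper name naive_strip and the sample
input are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Apply the regex from the question to a copy of the input.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<([^>]|\n)*>//g;
    return $html;
}

# Hypothetical input: the alt text contains a '>' inside quotes.
my $input  = 'Next <img src="arrow.gif" alt="==>"> page';
my $output = naive_strip($input);

print "$output\n";   # prints: Next "> page
```

The match stops at the first '>' it meets, which is the one inside
the alt attribute, so the tail of the tag is left behind as text.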

The real problem is that each of these (separated by blank lines) is a
single valid HTML tag[1]-

    <img src= "rt-arrow.gif" alt= "==>" >

    <!-- -- -->
    If you see this, you have a buggy browser. Perhaps you should
    <A HREF="http://lynx.browser.org/">upgrade to Lynx</A>?
    <!-- -- -->

    <!-- <<<<<<<<<<<<<<<!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -->

    <!-- This is a valid HTML comment > -- 
         -- still a comment ---->
         STILL part of the comment! --
    >

The expression that you are using will fail on each of the
above examples.  This won't (although it _will_ fail on some
html, see below)-

$comment = q/(?:                # group 1
               <!\s*             # comment start
                                                    
                 (?:                # group 2
                   --                # two dashes
                                                      
                   (?:                 # group 3
                     [^-]               # anything not a dash
                   |                    # or
                     -                  # a dash
                     (?!-)              # not followed by a dash
                   )*                  # group 3, 0 or more times
                                               
                   --                # two dashes
                   \s*               # 0 or more space characters
                 )+                 # group 2, 1 or more times
                                        
               >                 # comment end
             )/;                # end group 1


$html = qq/(?:             # group 1
            <               # html tag start
                                      
              (?:             # group 2
                [^><"']+       # one or more characters that aren't ><'"
                                   
              |                # or
                                                 
                "              # a double quote
                [^"]+          # one or more not double quotes
                "              # a double quote
                                    
              |                # or
                                             
                '              # a single quote
                [^']+          # one or more not single quotes
                '              # a single quote
                                      
              |                # or
                                        
                $comment       # a comment
              )+?             # one or more of group 2, non-greedy
                                    
            >               # html tag end 
          )/;              # end group 1


# this is the actual regular expression
/(?=<!)      # look ahead for a <!
   $comment  
   
   |         # or 
 
 (?=<)       # look ahead for a <
   $html     
/gsx;        # the /x allows whitespace and comments in the pattern
__END__
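
As a rough usage sketch, the same two sub-patterns can be restated
compactly and applied as a substitution.  Using qr// instead of the
q// and qq// strings above, and the helper name strip_markup, are
assumptions made here for brevity; the pattern logic is meant to be
the same as in the post.

```perl
use strict;
use warnings;

# The post's two sub-patterns, restated without the /x layout.
my $comment = qr/<!\s*(?:--(?:[^-]|-(?!-))*--\s*)+>/;
my $html    = qr/<(?:[^><"']+|"[^"]+"|'[^']+'|$comment)+?>/;

# strip_markup is a hypothetical helper name, not from the post.
sub strip_markup {
    my ($text) = @_;
    $text =~ s/(?=<!)$comment|(?=<)$html//g;
    return $text;
}

my $tag_case     = strip_markup('a <img src= "rt-arrow.gif" alt= "==>" > b');
my $comment_case = strip_markup('x<!-- a valid comment > -- -- still one -->y');

print "[$tag_case]\n";       # prints: [a  b]
print "[$comment_case]\n";   # prints: [xy]
```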


All in all, this is big, ugly, and still not guaranteed to work
because
  1-  there are still valid html tags that the above will not
      match[2]-
          <p  <em  <strong  <font <big> color = '#ffffff'>>>>

  2-  this will break on invalid html and unbalanced quote
      characters (although the quote problem is easy to fix)
 
I'd strongly recommend using HTML::Parser instead of regular
expressions to remove html, especially if somebody else
is going to have to take care of the code after you.
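
A minimal sketch of that approach follows.  HTML::Parser is a CPAN
module rather than part of core Perl, so the helper (strip_tags, a
name made up here) returns undef when the module is not installed.
The sketch uses the module's version 3 handler interface, which as
far as I know is newer than this thread, so treat it as one possible
way to do it rather than what the author had in mind.

```perl
use strict;
use warnings;

# Return the text content of $html with tags and comments dropped,
# or undef if HTML::Parser is not available on this system.
sub strip_tags {
    my ($html) = @_;
    return unless eval { require HTML::Parser; 1 };

    my $text   = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes each text chunk with entities such as &gt; decoded.
        text_h      => [ sub { $text .= shift }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;
    return $text;
}

my $result = strip_tags('<p>Value &gt; was <img src="a.gif" alt="==>"> here</p>');
print defined $result ? "$result\n" : "HTML::Parser is not installed\n";
```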

Hope this helps.

dgris

[0]- I personally don't believe this to be true any longer.  5.005
     introduced several new constructs to perl's regular expressions
     that I think can be used to match nested data.
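
     For what it is worth, here is a sketch of matching nested angle
     brackets with a recursive subpattern.  Note the assumption: the
     (?1) recursion and the possessive ++ used below arrived in Perl
     5.10, so this is not one of the 5.005 constructs mentioned
     above, and it ignores quoting entirely.

```perl
use strict;
use warnings;

# Group 1 matches '<', then any mix of non-bracket runs or nested
# bracketed groups (via the recursive call (?1)), then the closing '>'.
my $nested = qr/(<(?:[^<>]++|(?1))*>)/;

my $input = q{text <p  <em  <strong  <font <big> color = '#ffffff'>>>> more};
my ($match) = $input =~ $nested;

print "$match\n";   # prints the whole nested construct, from <p to the last >
```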

[1]- HTML 3.2, validated using the WebTechs validation service
     at http://valsvc.webtechs.com/testbed/

[2]- Although this construct is valid none of the browsers I tested
     (several versions of netscape, several versions of lynx, and
     arena) properly rendered it, so I feel a little better about
     not being able to match it correctly.
-- 
Daniel Grisinger           dgris@perrin.dimensional.com
"No kings, no presidents, just a rough consensus and
running code."
                           Dave Clark


------------------------------

Date: Tue, 25 Aug 1998 21:16:52 GMT
From: Gellyfish@btinternet.com (Jonathan Stowe)
Subject: Re: Scheduling Perl with Win NT AT Svc
Message-Id: <35e32568.12389651@news.btinternet.com>

On Mon, 24 Aug 1998 15:32:39 -0600, Alex Tatistcheff wrote :

>Greetings,
>
>Rather than run my NT Perl script as a service I'd like to run it using
>the NT Scheduler Service (AT service).  I want to run the script every 5
>minutes but I don't want to have 100+ AT jobs.  Since I don't have
>another scheduler service available, has anyone ever used Perl to
>schedule an AT job?  It should be fairly easy but I thought I'd check
>and see if anyone has an example that would work before I dig in.
>
<snip>

In principle there is no problem with running a Perl program using the
schedule service, but as with most things on NT there are some gotchas.
These are to do with NTness rather than Perlness, so I won't go into
any detail except to say:

A) Check the permissions of the account that the schedule service is
running under.
B) You may also have problems with detached processes and certain
networking functions.

-- 
/J\


Jonathan Stowe
Some of your questions answered:
<URL:http://www.btinternet.com/~gellyfish/resources/wwwfaq.htm>



------------------------------

Date: Tue, 25 Aug 1998 22:07:09 GMT
From: hoangngo@usa.net
Subject: What is it?
Message-Id: <6rvcid$867$1@nnrp1.dejanews.com>

Hi,

I am new to Perl. I often see something like this at the end of a file:

sub mysub
{
	code...
	code...
}

1;

What is that "1;"? I can't find an explanation anywhere?

Thanks,
Hoang

-----== Posted via Deja News, The Leader in Internet Discussion ==-----
http://www.dejanews.com/rg_mkgrp.xp   Create Your Own Free Member Forum


------------------------------

Date: Tue, 25 Aug 1998 22:37:01 GMT
From: "Erik Knepfler" <erik@zeno.com>
Subject: Re: What is it?
Message-Id: <h0HE1.397$J3.9289967@nnrp2.ni.net>

Read about the require function (Perl has no include command; use is
built on top of require).  When you require another file (such as the
file you're talking about), Perl expects the last statement in that file
to evaluate to a true value.  The "1;" makes that happen.  Without it,
require fails with a "did not return a true value" error.
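
A small sketch of that behaviour, assuming nothing beyond core Perl:
it writes two throwaway library files to a temporary directory, one
ending in "1;" and one not, then tries to require each.  The file
names and helper subs are made up for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Write a small library file; $body is its entire contents.
sub write_lib {
    my ($name, $body) = @_;
    my $path = "$dir/$name";
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} $body;
    close $fh or die "Cannot close $path: $!";
    return $path;
}

my $good = write_lib('good.pl', "sub hello { return 'hi' }\n\n1;\n");
my $bad  = write_lib('bad.pl',  "sub howdy { return 'hi' }\n");

# Try to require a file; return '' on success or the error text on failure.
sub try_require {
    my ($path) = @_;
    return eval { require $path; 1 } ? '' : $@;
}

my $good_err = try_require($good);
my $bad_err  = try_require($bad);

print "good.pl: ", ($good_err eq '' ? "loaded\n" : $good_err);
print "bad.pl:  ", ($bad_err  eq '' ? "loaded\n" : $bad_err);
```

The second file defines its sub correctly, but because nothing true
is evaluated last, require refuses to accept it.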

Of course, require doesn't work for me with the latest Perl anyway since
it's too dumb to know what the current directory is...

Erik

>I am new to Perl. I often see something like this at the end of a file:
>
>
>1;
>





------------------------------

Date: 25 Aug 1998 22:44:48 GMT
From: cberry@cinenet.net (Craig Berry)
Subject: Re: What is it?
Message-Id: <6rvep0$bdj$7@marina.cinenet.net>

hoangngo@usa.net wrote:
: I am new to Perl. I often see something like this at the end of a file:
: 
: sub mysub
: {
: 	code...
: 	code...
: }
: 
: 1;
: 
: What is that "1;"? I can't find an explanation anywhere?

See 'perldoc -f require', 3rd-to-the-last paragraph.

---------------------------------------------------------------------
   |   Craig Berry - cberry@cinenet.net
 --*--    Home Page: http://www.cinenet.net/users/cberry/home.html
   |      "Ripple in still water, when there is no pebble tossed,
       nor wind to blow..."


------------------------------

Date: Tue, 25 Aug 1998 21:16:54 GMT
From: Gellyfish@btinternet.com (Jonathan Stowe)
Subject: Re: Win32 Q: Reading Outlook 98 files
Message-Id: <35e326e2.12767426@news.btinternet.com>

On Mon, 24 Aug 1998 17:34:05 -0400, Walter Torres wrote :

>I would like to read my Outlook files via Perl.
>
>I have no idea where to start to look for this type of info.
>
>Anyone know where I can start to research this topic?
>
>Has anyone done this?
>
You probably want to look at Win32::OLE or MAPI.  Beyond that (as far as
I'm concerned) you're on your own.

-- 
/J\
Jonathan Stowe
Some of your questions answered:
<URL:http://www.btinternet.com/~gellyfish/resources/wwwfaq.htm>



------------------------------

Date: Tue, 25 Aug 1998 17:29:16 -0400
From: Bill 'Sneex' Jones <sneaker@sneex.fccj.org>
Subject: Re: Y2K Date Support
Message-Id: <35E32CAC.8812556A@sneex.fccj.org>

Daniel Grisinger wrote:
> 
> >+ > $date = "$days[$wday], $months[$mon] $mday, 19$year at
> >
> >In the year 2000, this program would have reported the year as:
> >
> >19100
> 
> In the year 2000 it will report the wrong result, that sure sounds
> like a bug to me.
> 
> dgris


Not a bug in Perl, a bug in the original poster's code.

See the 'hard coded' 19 up there?

They said
	19$year

They should have at least said 
	(1900 + $year)

That way, it would be 2000, as described in the docs :]
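
A short sketch of the difference, with made-up helper names.  The
year field that localtime returns is the number of years since 1900,
so the year 2000 arrives as 100.

```perl
use strict;
use warnings;

# The buggy way glues the string "19" onto localtime's year field;
# the correct way adds 1900 numerically.
sub buggy_year   { my ($year) = @_; return "19$year" }
sub correct_year { my ($year) = @_; return 1900 + $year }

# What localtime hands back for the year 2000.
my $year_field = 100;

print "buggy:   ", buggy_year($year_field),   "\n";   # prints 19100
print "correct: ", correct_year($year_field), "\n";   # prints 2000

# The same fix applied to the current date.
my $current = (localtime)[5];
print "this year: ", correct_year($current), "\n";
```
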
-Sneex- 
__________________________________________________________________
Bill Jones | FCCJ Webmaster | Murphy's Law of Research:
           Enough research will tend to support your theory.


------------------------------

Date: Tue, 25 Aug 1998 18:05:52 -0500
From: fl_aggie@thepentagon.com (I R A Aggie)
Subject: Re: Y2K Date Support
Message-Id: <fl_aggie-2508981805520001@aggie.coaps.fsu.edu>

In article <6ruv3i$1gm@mozo.cc.purdue.edu>, gebis@fee.ecn.purdue.edu
(Michael J Gebis) wrote:

+ fl_aggie@thepentagon.com (I R A Aggie) writes:
+ }and you'll be just fine thru the Y2K problem.
 
+ That is, assuming you survive the C.H.U.D.s.

C.H.U.D.s? or is this in the jargon file?

James


------------------------------

Date: 12 Jul 98 21:33:47 GMT (Last modified)
From: Perl-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Special: Digest Administrivia (Last modified: 12 Mar 98)
Message-Id: <null>


Administrivia:

Special notice: in a few days, the new group comp.lang.perl.moderated
should be formed. I would rather not support two different groups, and I
know of no other plans to create a digested moderated group. This leaves
me with two options: 1) keep on with this group 2) change to the
moderated one.

If you have opinions on this, send them to
perl-users-request@ruby.oce.orst.edu. 


The Perl-Users Digest is a retransmission of the USENET newsgroup
comp.lang.perl.misc.  For subscription or unsubscription requests, send
the single line:

	subscribe perl-users
or:
	unsubscribe perl-users

to almanac@ruby.oce.orst.edu.  

To submit articles to comp.lang.perl.misc (and this Digest), send your
article to perl-users@ruby.oce.orst.edu.

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

To request back copies (available for a week or so), send your request
to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
where x is the volume number and y is the issue number.

The Meta-FAQ, an article containing information about the FAQ, is
available by requesting "send perl-users meta-faq". The real FAQ, as it
appeared last in the newsgroup, can be retrieved with the request "send
perl-users FAQ". Due to their sizes, neither the Meta-FAQ nor the FAQ
are included in the digest.

The "mini-FAQ", which is an updated version of the Meta-FAQ, is
available by requesting "send perl-users mini-faq". It appears twice
weekly in the group, but is not distributed in the digest.

For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending perl questions to the -request address, I don't have time to
answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V8 Issue 3542
**************************************
