[32600] in Perl-Users-Digest


Perl-Users Digest, Issue: 3873 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Jan 31 11:09:24 2013

Date: Thu, 31 Jan 2013 08:09:10 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Thu, 31 Jan 2013     Volume: 11 Number: 3873

Today's topics:
    Re: capturing, computing the ephemeris and passing it t <cal@example.invalid>
    Re: capturing, computing the ephemeris and passing it t <ben@morrow.me.uk>
    Re: capturing, computing the ephemeris and passing it t <cal@example.invalid>
    Re: capturing, computing the ephemeris and passing it t <ben@morrow.me.uk>
    Re: capturing, computing the ephemeris and passing it t <cwilbur@chromatico.net>
        Comparing a reference? (Tim McDaniel)
    Re: Comparing a reference? <ben@morrow.me.uk>
    Re: Comparing a reference? (Tim McDaniel)
    Re: Comparing a reference? (Tim McDaniel)
        The definitive statement on parsing HTML with regular e (Tim McDaniel)
    Re: The definitive statement on parsing HTML with regul <ben@morrow.me.uk>
    Re: The definitive statement on parsing HTML with regul (Tim McDaniel)
    Re: The definitive statement on parsing HTML with regul <ben@morrow.me.uk>
    Re: The definitive statement on parsing HTML with regul <rweikusat@mssgmbh.com>
    Re: The definitive statement on parsing HTML with regul <cwilbur@chromatico.net>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Tue, 29 Jan 2013 21:14:51 -0700
From: Cal Dershowitz <cal@example.invalid>
Subject: Re: capturing, computing the ephemeris and passing it to gfortran
Message-Id: <X7ednXckE7-mA5XMnZ2dnUVZ_qSdnZ2d@supernews.com>

On 01/24/2013 07:53 AM, Charlton Wilbur wrote:
>>>>>> "CD" == Cal Dershowitz <cal@example.invalid> writes:
>
>      CD> How does one use perl to upload images with the meta-data?
>
> One figures out what protocol the server or service one is uploading to
> uses to communicate about images and metadata, and one implements that
> protocol in Perl.  Then one uses that protocol to upload the images and
> that metadata.
>
> Different people can work on different parts of that solution, and there
> is a fair chance that the minimally clueful can find the first part
> mostly done on CPAN if they are using a common service.
>
> Charlton
>
>
>
>
>

What I'm finding, Charlton, is that there's a steep learning curve with 
how Perl interacts with HTML, and to the extent that I'm getting 
anything done at all, it's just muddling through.

Nominally, we're looking at this site:

http://www.fourmilab.ch/yoursky/

They're a collaboration of geeks who are thrilled with people making 
educational uses of their site that don't involve restricting 
distribution, so we're not misusing the resource.

When I asked about doing this a while back, Juergen said, "well, get an 
HTML tree parser, and evaluate it*."  *not his exact words.

I've fished around as I might this week, looking at how I might do this, 
and I think there are at least 4 modules that have a good shot at being 
tools I could well use here, but to my mind it came down to 
HTML::TokeParser and WWW::Mechanize, and then ultimately the latter.

I'm getting partial results.

To my mind, I need to use Perl to re-create the same input sequence I 
would produce if I were getting there through GUI events.

This is where I need to be to get data:

http://merrillpjensen.com/mel_4.html

So the "metadata" is, in my current usage, anything the server wants to 
tell me about the top image.  How one gets there is to
1) click nearby city
2) hit SF, CA
2.5) click Make Sky Map
3) hit the radio button for "universal time"
4) give it the value of 1 a.m. UTC
5) click Make Sky map
5.5)  Capture sky map image
6) capture the ephemeris table

Here's what I have.

First of all, there's all the garbage when I had @list declared instead 
of $list:

$ ./capture3.pl
HTTP::Response=HASH(0x8b9316c)
WWW::Mechanize::Link=ARRAY(0x8bc29ac) 
WWW::Mechanize::Link=ARRAY(0x8c469b8) WWW::Mechanize::Link=ARRAY(0x8c460d0)
 ...
WWW::Mechanize::Link=ARRAY(0x8c41f40) 
WWW::Mechanize::Link=ARRAY(0x8c41d38) WWW::Mechanize::Link=ARRAY(0x8c3ed84)

then there's the garbage when I try it with a dollar sign:

$ ./capture3.pl
HTTP::Response=HASH(0x8b9715c)
ARRAY(0x8a989a8)

Here's what cpan has to say about it:

$mech->links()

When called in a list context, returns a list of the links found in the 
last fetched page. In a scalar context it returns a reference to an 
array with those links. Each link is a WWW::Mechanize::Link object.
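
The context rule quoted above is what makes @list and $list behave
differently. A minimal sketch of the same convention, using a made-up
links() sub rather than WWW::Mechanize itself (the sub and its sample
data are placeholders for illustration only):

```perl
use strict;
use warnings;

# Hypothetical stand-in for a method like $mech->links(): it returns a
# list in list context and an array reference in scalar context.
sub links {
    my @found = ('cities.html', 'help/help.html', 'credits.html');
    return wantarray ? @found : \@found;
}

my @list = links();    # list context: three plain strings
my $list = links();    # scalar context: one reference to an array

print scalar(@list), " links in list context\n";
print ref($list), " reference in scalar context\n";
print "first link via the reference: $list->[0]\n";
```

Printing $list itself would show ARRAY(0x...), which is the same kind
of output as in the second run above.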


$ cat capture3.pl
#!/usr/bin/perl -w
use strict;
use autodie;
use utf8;
use WWW::Mechanize;
use HTML::TokeParser;

#Example
my $url = 'http://www.fourmilab.ch/yoursky/';

my $mech = WWW::Mechanize->new;

my $result = $mech->get( $url );
die "GET failed\n" unless $result->is_success;

print "$result\n";

my $filename = 'content1.txt';
$mech->save_content( $filename );

my $list= $mech->links();

print "$list\n";
$

The save_content method worked great, so I'm not stuck in the mud; 
nevertheless, I would love to see how other people have gotten this 
material to work for them as opposed to against them.  It seems like I'm 
off by one to two levels of misdirection on my print statements.

Fishing for tips,
-- 
Cal



------------------------------

Date: Wed, 30 Jan 2013 13:47:26 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: capturing, computing the ephemeris and passing it to gfortran
Message-Id: <esknt9-ueo.ln1@anubis.morrow.me.uk>


Quoth Cal Dershowitz <cal@example.invalid>:
> 
> $ ./capture3.pl
> HTTP::Response=HASH(0x8b9316c)
> WWW::Mechanize::Link=ARRAY(0x8bc29ac) 
> WWW::Mechanize::Link=ARRAY(0x8c469b8) WWW::Mechanize::Link=ARRAY(0x8c460d0)
> ...
> WWW::Mechanize::Link=ARRAY(0x8c41f40) 
> WWW::Mechanize::Link=ARRAY(0x8c41d38) WWW::Mechanize::Link=ARRAY(0x8c3ed84)
> 
> then there's the garbage when I try it with a dollar sign:
> 
> $ ./capture3.pl
> HTTP::Response=HASH(0x8b9715c)
> ARRAY(0x8a989a8)
> 
> Here's what cpan has to say about it:
> 
> $mech->links()
> 
> When called in a list context, returns a list of the links found in the 
> last fetched page. In a scalar context it returns a reference to an 
> array with those links. Each link is a WWW::Mechanize::Link object.

...so you need to read the documentation for WWW::Mechanize::Link to
find out what you can do with one of those.

> my $list= $mech->links();
> print "$list\n";
> $
> 
> The save_content method worked great, so I'm not stuck in the mud; 
> nevertheless, I would love to see how other people have gotten this 
> material to work for them as opposed to against them.  It seems like I'm 
> off by one to two levels of misdirection on my print statements.

The word you're looking for is 'indirection', and yes, that's exactly
what's wrong. $list above is a reference to an array of objects, so
before you can print anything you need to first pull out the object you
want and then call a method on that object which returns something
printable.
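
A minimal sketch of those two steps, assuming a stand-in class in place
of WWW::Mechanize::Link (Sketch::Link and its sample data are invented
for illustration; the real class is not loaded here):

```perl
use strict;
use warnings;

# Hypothetical stand-in for WWW::Mechanize::Link: a blessed array with
# accessor methods.
package Sketch::Link;
sub new  { my ($class, $url, $text) = @_; return bless [ $url, $text ], $class }
sub url  { return $_[0][0] }
sub text { return $_[0][1] }

package main;

# What a scalar-context call hands back: a reference to an array of objects.
my $list = [
    Sketch::Link->new('cities.html',    'nearby city'),
    Sketch::Link->new('help/help.html', 'help file'),
];

print "$list\n";                 # the array reference itself
print "$list->[0]\n";            # one object, still not readable
print $list->[0]->url, "\n";     # a method call returns a plain string

# Dereference the whole array and call a method on every element.
print $_->text, "\n" for @$list;
```

The first two prints show the same sort of ARRAY(0x...) output as the
runs quoted above; only the method calls give readable text.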

Ben



------------------------------

Date: Wed, 30 Jan 2013 20:53:56 -0700
From: Cal Dershowitz <cal@example.invalid>
Subject: Re: capturing, computing the ephemeris and passing it to gfortran
Message-Id: <avidnRHBqcdLd5TMnZ2dnUVZ_o2dnZ2d@supernews.com>

On 01/30/2013 06:47 AM, Ben Morrow wrote:

>  The word you're looking for is 'indirection', and yes, that's exactly
> what's wrong. $list above is a reference to an array of objects, so
> before you can print anything you need to first pull out the object you
> want and then call a method on that object which returns something
> printable.
>
> Ben
>

Alright, thx ben, that gives me something more.  The documentation there 
though is pretty thin.  I do see the thing I'm looking for in the list 
when I call the text method:

$ ./capture3.pl
tags
link
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
a
URI
yoursky.css
/
help/help.html
cities.html
/yoursky/help/controls.html#Site
/yoursky/cities.html
cities.html
help/horcontrols.html#Site
cities.html
help/horcontrols.html#ViewPoint
help/horcontrols.html#Azimuth
help/telcontrols.html
telcustom.html
catalogues/catalogues.html
help/telcontrols.html#AimPoint
help/telcontrols.html#RightAscension
help/telcontrols.html#Declination
catalogues/catalogues.html
help/help.html
/homeplanet/homeplanet.html
/earthview/vplanet.html
/solar/solar.html
/earthview/moon_ap_per.html
/terranova/terranova.html
/homeplanet/
/skyscrsv/
/sitemap.html#moontoolw
/craters/
/sitemap.html#poss
/sitemap.html#moontool
/sitemap.html#xsunclock
/
credits.html
/
http://validator.w3.org/check?uri=referer
http://jigsaw.w3.org/css-validator/check/referer
text
Use of uninitialized value in concatenation (.) or string at ./capture3.pl line 21.

John Walker
help file
nearby city
Observing Site
Set for nearby city
nearby city
Observing Site
Set for nearby city
Viewpoint
Azimuth
Controls
custom settings
object catalogues
Aim Point
Right Ascension
Declination
Find object in catalogue
Your Sky help
Home Planet
Earth and Moon Viewer
Solar System Live: interactive orrery
Moon at Perigee and Apogee
Terranova: a new terraformed planet every day
Home Planet
Sky screen saver
Moontool
Craters screen saver
catalogue
Moontool
Xsunclock
home page
credits
John Walker
Valid XHTML 1.0
Valid CSS
$ cat capture3.pl
#!/usr/bin/perl -w
use strict;
use autodie;
use utf8;
use WWW::Mechanize;
use HTML::TokeParser;

my $url = 'http://www.fourmilab.ch/yoursky/';
my $mech = WWW::Mechanize->new;
my $result = $mech->get( $url );
die "GET failed\n" unless $result->is_success;
# print "$result\n";
my $filename = 'content1.txt';
$mech->save_content( $filename );
my @links = $mech->links();
print "tags\n";
print $_->tag()."\n" foreach @links;
print "URI\n";
print $_->URI()."\n" foreach @links;
print "text\n";
print $_->text()."\n" foreach @links;


$

All I need to do is click on "Set for nearby city".  Do I need to
submit a form?

What gives with all these warnings?

Use of uninitialized value in concatenation (.) or string at ./capture3.pl line 21.

My guess is that the field doesn't exist.
-- 
Cal


------------------------------

Date: Thu, 31 Jan 2013 14:25:05 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: capturing, computing the ephemeris and passing it to gfortran
Message-Id: <1fbqt9-0331.ln1@anubis.morrow.me.uk>


Quoth Cal Dershowitz <cal@example.invalid>:
> On 01/30/2013 06:47 AM, Ben Morrow wrote:
> 
> >  The word you're looking for is 'indirection', and yes, that's exactly
> > what's wrong. $list above is a reference to an array of objects, so
> > before you can print anything you need to first pull out the object you
> > want and then call a method on that object which returns something
> > printable.
> 
> Alright, thx ben, that gives me something more.  The documentation there 
> though is pretty thin.  I do see the thing I'm looking for in the list 
> when I call the text method:
<snip>
> 
> All I need to do is click on "Set for nearby city".  Do I need to
> submit a form?

I don't know. Is the link you want to click on one which submits a form?

Probably you want Mech's ->follow_link method, or you just want to start
on the right page in the first place.
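
A rough sketch of the follow_link route, assuming the link text is
exactly "Set for nearby city" as it appeared in the output earlier in
the thread. The helper name follow_by_text is made up; it expects a
WWW::Mechanize object that has already fetched the page:

```perl
use strict;
use warnings;

# Hypothetical helper: $mech is expected to be a WWW::Mechanize object.
# follow_link() accepts link criteria such as text => ..., and returns
# undef when no link matches (with autocheck enabled, Mech may die on
# its own before this check is reached).
sub follow_by_text {
    my ($mech, $text) = @_;
    my $response = $mech->follow_link( text => $text );
    die "No link with text '$text' on this page\n" unless $response;
    return $response;
}

# Intended use (needs the WWW::Mechanize module and network access):
#   my $mech = WWW::Mechanize->new;
#   $mech->get('http://www.fourmilab.ch/yoursky/');
#   follow_by_text($mech, 'Set for nearby city');
#   print $mech->uri, "\n";
```

If the target turns out to be a form button rather than a plain link,
this will not find it, and a form submission would be needed instead.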

> What gives with all these warnings?
> 
> Use of uninitialized value in concatenation (.) or string at 
> ./capture3.pl line 21.
> 
> My guess is that the field doesn't exist.

That warning was given when you called ->text on a <link> element.
<link> elements don't have any content.
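
One way to keep the loop quiet is to check for undef before
concatenating. A minimal sketch, with a made-up @texts array standing
in for the values ->text returned (the first entry plays the part of
the <link> element):

```perl
use strict;
use warnings;

# Hypothetical sample data: undef for the element with no text.
my @texts = (undef, 'John Walker', 'help file');

for my $text (@texts) {
    # Substitute a placeholder so the concatenation never sees undef.
    my $shown = defined $text ? $text : '(no text)';
    print $shown . "\n";
}

# Alternatively, keep only the entries that have text.
my @with_text = grep { defined } @texts;
print scalar(@with_text), " entries have text\n";
```

The defined-or operator (//) would shorten the first form, but it
needs Perl 5.10 or later.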

Ben



------------------------------

Date: Thu, 31 Jan 2013 09:56:16 -0500
From: Charlton Wilbur <cwilbur@chromatico.net>
Subject: Re: capturing, computing the ephemeris and passing it to gfortran
Message-Id: <87y5f9wgdr.fsf@new.chromatico.net>

>>>>> "CD" == Cal Dershowitz <cal@example.invalid> writes:

    CD> The documentation
    CD> there though is pretty thin. 

Perhaps you are not looking in the right place:

        perldoc perlref
        perldoc perlreftut

Charlton


-- 
Charlton Wilbur
cwilbur@chromatico.net


------------------------------

Date: Wed, 30 Jan 2013 22:03:37 +0000 (UTC)
From: tmcd@panix.com (Tim McDaniel)
Subject: Comparing a reference?
Message-Id: <kec5bp$mlr$1@reader1.panix.com>

I inherited code that had, in effect,

    my $kind = 'Val';
    ...
    if ($kind eq 'Val')

For various reasons, I want to change it to

    my $kind = \&some_sub;

Is there a reliable, guaranteed way to do that and still have the
conditional?   (This is in 5.8.8 and I have no way to change that.)

man perlref says

     Using a string or number as a reference produces a symbolic
     reference, as explained above.  Using a reference as a number
     produces an integer representing its storage location in memory.
     The only useful thing to be done with this is to compare two
     references numerically to see whether they refer to the same
     location.

        if ($ref1 == $ref2) {  # cheap numeric compare of references
            print "refs 1 and 2 refer to the same thing\n";
        }

I also ran across
http://stackoverflow.com/questions/4064001/how-should-i-compare-perl-references
, where there was one reply that said

    The function you are looking for is refaddr from Scalar::Util
    (after ensuring that the values being compared really are
    references):

    use Scalar::Util 'refaddr';

    if ($obj1 and ref($obj1) and $obj2 and ref($obj2) and
        refaddr($obj1) == refaddr($obj2))
        {
        # objects are the same...
        }

with tchrist replying "The extraordinary measures taken by
cpan/List-Util/lib/Scalar/Util/PP.pm's refaddr() function to divine
the referent's real address are exceeded only by the blessed()
function's measures to find the package name.", and a reply to that
that it's in the Perl core and compiled, so it's cheap.

Is there a reason to use Scalar::Util::refaddr instead of ==?

-- 
Tim McDaniel, tmcd@panix.com


------------------------------

Date: Wed, 30 Jan 2013 23:04:05 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Comparing a reference?
Message-Id: <5glot9-ggr.ln1@anubis.morrow.me.uk>


Quoth tmcd@panix.com:
> I inherited code that had, in effect,
> 
>     my $kind = 'Val';
>     ...
>     if ($kind eq 'Val')
> 
> For various reasons, I want to change it to
> 
>     my $kind = \&some_sub;
> 
> Is there a reliable, guaranteed way to do that and still have the
> conditional?   (This is in 5.8.8 and I have no way to change that.)

As you seem to have already worked out, either

    if ($kind == \&some_sub)

or

    if (refaddr $kind == refaddr \&some_sub)

> man perlref says
> 
>      Using a string or number as a reference produces a symbolic
>      reference, as explained above.  Using a reference as a number
>      produces an integer representing its storage location in memory.
>      The only useful thing to be done with this is to compare two
>      references numerically to see whether they refer to the same
>      location.
> 
>         if ($ref1 == $ref2) {  # cheap numeric compare of references
>             print "refs 1 and 2 refer to the same thing\n";
>         }
> 
> I also ran across
> http://stackoverflow.com/questions/4064001/how-should-i-compare-perl-references
> , where there was one reply that said
> 
>     The function you are looking for is refaddr from Scalar::Util
>     (after ensuring that the values being compared really are
>     references):
> 
>     use Scalar::Util 'refaddr';
> 
>     if ($obj1 and ref($obj1) and $obj2 and ref($obj2) and
>         refaddr($obj1) == refaddr($obj2))

If you are comparing to a literal ref you can skip the safety checks,
since refaddr returns undef if its argument isn't a ref. If $kind not
being a ref is an expected occurrence, you may want to turn off
uninitialized value warnings.

>         {
>         # objects are the same...
>         }
> 
> with tchrist replying "The extraordinary measures taken by
> cpan/List-Util/lib/Scalar/Util/PP.pm's refaddr() function to divine
> the referent's real address are exceeded only by the blessed()
> function's measures to find the package name.", and a reply to that
> that it's in the Perl core and compiled, so it's cheap.

The XS version of Scalar::Util is core under 5.8.8 (corelist
Scalar::Util), so the PP version is irrelevant. (Tom is right that the
mechanism for finding the address in pure Perl is ridiculous, but it
does work reliably.)

> Is there a reason to use Scalar::Util::refaddr instead of ==?

The only reason is if there is a chance either ref might point to an
object which overloads either == or numify, or if there is a chance
$kind might not hold a ref at all. If you know you are dealing with
unblessed refs there is no reason not to use ==.
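
A small sketch of the overloading case, using an invented class
(Always::Equal) whose == always answers true; same_referent is a
hypothetical helper name, not part of Scalar::Util:

```perl
use strict;
use warnings;
use Scalar::Util 'refaddr';

# Hypothetical class whose overloaded == claims everything is equal.
package Always::Equal;
use overload '==' => sub { 1 }, fallback => 1;
sub new { return bless {}, shift }

package main;

# True only when both arguments are references to the same thing.
# refaddr returns undef for non-references, so check before comparing.
sub same_referent {
    my ($x, $y) = @_;
    my ($ax, $ay) = (refaddr($x), refaddr($y));
    return defined $ax && defined $ay && $ax == $ay;
}

my ($p, $q) = (Always::Equal->new, Always::Equal->new);

print "overloaded == says the two objects are equal\n" if $p == $q;
print "refaddr says they are different objects\n" unless same_referent($p, $q);
print "refaddr agrees an object is itself\n" if same_referent($p, $p);
```

For plain code refs such as \&some_sub nothing is overloaded, so ==
and same_referent give the same answer.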

If there is a chance you might be running under ithreads, it's important
to check against the current value of \&some_sub, rather than trying to
cache the numeric value, since the value of the ref will change when a
new thread is forked.

Ben



------------------------------

Date: Wed, 30 Jan 2013 23:26:23 +0000 (UTC)
From: tmcd@panix.com (Tim McDaniel)
Subject: Re: Comparing a reference?
Message-Id: <keca6v$d31$1@reader1.panix.com>

In article <5glot9-ggr.ln1@anubis.morrow.me.uk>,
Ben Morrow  <ben@morrow.me.uk> wrote:
>
>Quoth tmcd@panix.com:
>> Is there a reason to use Scalar::Util::refaddr instead of ==?
>
>The only reason is if there is a chance either ref might point to an
>object which overloads either == or numify, or if there is a chance
>$kind might not hold a ref at all. If you know you are dealing with
>unblessed refs there is no reason not to use ==.
>
>If there is a chance you might be running under ithreads, it's important
>to check against the current value of \&some_sub, rather than trying to
>cache the numeric value, since the value of the ref will change when a
>new thread is forked.

Thank you for the quick response.

$kind is one reference out of four possible refs, along the lines of

    my $lookup = {
        CASE1 => \&Pkg::Sub1,
        CASE2 => \&Pkg::Sub2,
        CASE3 => \&Pkg::Sub3,
        CASE4 => \&Pkg::Sub4,
    };
    $kind = $lookup->{$arg};
    return unless $kind;

There is no threading, no blessing, and no object munging in the 40
lines between setting and this comparison.

Reading in further Google hits, though, I think I may do refaddr just
to be paranoid.

-- 
Tim McDaniel, tmcd@panix.com



------------------------------

Date: Thu, 31 Jan 2013 00:36:50 +0000 (UTC)
From: tmcd@panix.com (Tim McDaniel)
Subject: Re: Comparing a reference?
Message-Id: <keceb2$cp5$1@reader1.panix.com>

In article <keca6v$d31$1@reader1.panix.com>,
Tim McDaniel <tmcd@panix.com> wrote:
>$kind is one reference out of four possible refs, along the lines of
>
>    my $lookup = {
>        CASE1 => \&Pkg::Sub1,
>        CASE2 => \&Pkg::Sub2,
>        CASE3 => \&Pkg::Sub3,
>        CASE4 => \&Pkg::Sub4,
>    };
>    $kind = $lookup->{$arg};
>    return unless $kind;
>
>There is no threading, no blessing, and no object munging in the 40
>lines between setting and this comparison.
>
>Reading in further Google hits, though, I think I may do refaddr just
>to be paranoid.

I just realized that $arg is also in scope at the point of the
comparison.  So instead of doing

    if (Scalar::Util::refaddr($kind) == Scalar::Util::refaddr(\&Pkg::Sub1)) {

I can simply do

    if ($arg eq 'CASE1') {

and finesse away the whole issue.
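
Sketched out with placeholder handlers (sub1 and sub2 stand in for the
real Pkg::Sub1 and friends, and run_case is an invented wrapper), the
idea looks roughly like this:

```perl
use strict;
use warnings;

# Hypothetical handlers standing in for Pkg::Sub1 .. Pkg::Sub4.
sub sub1 { return 'one' }
sub sub2 { return 'two' }

my $lookup = {
    CASE1 => \&sub1,
    CASE2 => \&sub2,
};

sub run_case {
    my ($arg) = @_;
    my $kind = $lookup->{$arg};
    return unless $kind;              # unknown key: nothing to call

    # Compare the key, an ordinary string, instead of the code ref.
    my $note = $arg eq 'CASE1' ? ' (special handling for CASE1)' : '';
    return $kind->() . $note;
}

print run_case('CASE1'), "\n";
print run_case('CASE2'), "\n";
```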

Still, thank you for the answer.

-- 
Tim McDaniel, tmcd@panix.com


------------------------------

Date: Tue, 29 Jan 2013 21:57:20 +0000 (UTC)
From: tmcd@panix.com (Tim McDaniel)
Subject: The definitive statement on parsing HTML with regular expressions
Message-Id: <ke9gk0$9vd$1@reader1.panix.com>

I'd have to say that at
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
the first answer is definitive.  I know that The Pony is real, for I
have fed carrots to His Effulgent Face.  And I don't even know what
"Effulgent" means, except that it means His Face.

Actually, I just saw it on the Cheezburger Network and thought it was
funny.

And yes, if you *know* that your HTML is simple and limited (for
example, generated by a known program), you may be able to parse those
particular files with regexps.
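
A sketch of that narrow case, assuming input whose shape is completely
known in advance. The sample rows are invented, and the pattern would
break on attributes, nesting, entities, or extra whitespace:

```perl
use strict;
use warnings;

# Hypothetical input: rows emitted by a known generator, always exactly
# two <td> cells per row, with no attributes, nesting, or entities.
my $html = <<'END';
<tr><td>alpha</td><td>1</td></tr>
<tr><td>beta</td><td>2</td></tr>
END

my %value_for;
while ($html =~ m{<tr><td>([^<]*)</td><td>([^<]*)</td></tr>}g) {
    $value_for{$1} = $2;
}

print "$_: $value_for{$_}\n" for sort keys %value_for;
```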

-- 
Tim McDaniel, tmcd@panix.com


------------------------------

Date: Tue, 29 Jan 2013 22:46:30 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: The definitive statement on parsing HTML with regular expressions
Message-Id: <630mt9-t8h.ln1@anubis.morrow.me.uk>


Quoth tmcd@panix.com:
> I'd have to say that at
> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

That was posted a *long* time ago...

> the first answer is definitive.  I know that The Pony is real, for I
> have fed carrots to His Effulgent Face.  And I don't even know what
> "Effulgent" means, except that it means His Face.

(A BtVS reference?)

> And yes, if you *know* that your HTML is simple and limited (for
> example, generated by a known program), you may be able to parse those
> particular files with regexps.

It is, in fact, possible to parse HTML correctly with Perl regexen: HTML
is a context-free language, and with (?<>) and (?&) Perl's regexen are
capable of matching context-free languages. Below is a pattern which
matches valid XML, except for DTDs (which are a whole nother
sublanguage). It's translated fairly directly from the BNF in the
standard. However, it's currently rather difficult to modify it to do
anything *useful* with the result, most importantly because of the
limitations on both (?(DEFINE)) and (?{}).

Ben

m(  (?<xml> (?&document) )

    (?(DEFINE)

        # Document

        (?<document>
            (?&prolog) (?&element) (?&Misc)*
        )

        # Prolog

        (?<prolog>      (?&XMLDecl) (?&Misc)* )
        (?<XMLDecl>     
            <\?xml (?&VersionInfo) (?&EncodingDecl)? (?&S)? \?> 
        )
        (?<Misc>        (?&Comment) | (?&S) )
        (?<Eq>          (?&S)? = (?&S)? )

        (?<VersionInfo> (?&S) version (?&Eq) (?: '1\.[10]' | "1\.[10]" ) )

        (?<EncodingDecl>    
            (?&S) encoding (?&Eq) (?: "(?&EncName)" | '(?&EncName)' )
        )
        (?<EncName>     [A-Za-z] (?: [A-Za-z0-9._-] )* )

        # Character sets

        (?<Char> 
            [\x9-\xA\xD\x20-\x7E\x85\xA0-\x{D7FF}]      |
            [\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]
        )
        (?<S> [\x20\x9\xD\xA]+ )

        # Names

        (?<NameStartChar>
            [:A-Z_a-z\xC0-\xD6\xD8-\xF6\xF8-\x{2FF}\x{370}-\x{37D}]     | 
            [\x{37F}-\x{1FFF}\x{200C}-\x{200D}\x{2070}-\x{218F}]        |
            [\x{2C00}-\x{2FEF}\x{3001}-\x{D7FF}\x{F900}-\x{FDCD}]       |
            [\x{FDF0}-\x{FFFD}\x{10000}-\x{EFFFF}]
        )
        (?<NameChar>
            (?&NameStartChar)                               | 
            [-.0-9\xB7\x{0300}-\x{036F}\x{203F}-\x{2040}]
        )
        (?<Name>        (?&NameStartChar) (?&NameChar)* )
        (?<Names>       (?&Name) (?: \x20 (?&Name) )* )

        # Comments

        (?<Comment>
            <!-- (?:
                (?: (?! - ) (?&Char) ) |
                (?: - (?! - ) (?&Char) )
            )* -->
        )

        # CDATA sections

        (?<CDSect>  (?&CDStart) (?&CData) (?&CDEnd) )
        (?<CDStart> <!\[CDATA\[ )
        (?<CData>   (?: (?! \]\]> ) (?&Char) )* )
        (?<CDEnd>   \]\]> )

        # Element

        (?<element> (?&EmptyElemTag) | (?&STag) (?&content) (?&ETag) )

        (?<STag>        < (?&Name) (?: (?&S) (?&Attribute) )* (?&S)? > )
        (?<Attribute>   (?&Name) (?&Eq) (?&AttValue) )
        (?<ETag>        </ (?&Name) (?&S)? > )

        (?<AttValue>
            " (?: [^<&"] | (?&Reference) )* " |
            ' (?: [^<&'] | (?&Reference) )* '
        )

        # Content of elements

        (?<content>
            (?&CharData)? (?:
                (?: (?&element) | (?&Reference) | (?&CDSect) |
                    (?&Comment)
                )
                (?&CharData)?
            )*
        )
        (?<CharData> (?! [^<&]* \]\]> [^<&]* ) [^<&]* )

        # Empty elements

        (?<EmptyElemTag>
            < (?&Name) 
                (?: (?&S) (?&Attribute) )* (?&S)?
            />
        )

        # Character reference

        (?<Reference>   (?&EntityRef) | (?&CharRef) )
        (?<CharRef>     &\# [0-9]+ ; | &\#x [0-9a-fA-F]+ ; )
        (?<EntityRef>   & (?&Name) ; )
    )
)x
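
For readers new to (?<name>), (?&name) and (?(DEFINE)), a much smaller
sketch of the same mechanism: a recursive pattern that recognises
balanced parentheses, something a classical regular expression cannot
do. The names $balanced and is_balanced are made up for this example,
and it needs Perl 5.10 or later:

```perl
use strict;
use warnings;
use 5.010;    # named captures and (?&name) recursion need Perl 5.10+

my $balanced = qr{
    \A (?&group) \z

    (?(DEFINE)
        # one group: an opening paren, then plain characters or nested
        # groups, then a closing paren
        (?<group> \( (?: [^()]++ | (?&group) )* \) )
    )
}x;

sub is_balanced {
    my ($string) = @_;
    return $string =~ $balanced ? 1 : 0;
}

for my $string ('(a(b)c)', '(a(b c)', '()') {
    printf "%-10s %s\n", $string,
        is_balanced($string) ? 'balanced' : 'not balanced';
}
```

Like the XML pattern above, this only recognises the input; pulling
out the nested pieces is where the limitations start to bite.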


------------------------------

Date: Wed, 30 Jan 2013 00:12:52 +0000 (UTC)
From: tmcd@panix.com (Tim McDaniel)
Subject: Re: The definitive statement on parsing HTML with regular expressions
Message-Id: <ke9oi4$or5$1@reader1.panix.com>

In article <630mt9-t8h.ln1@anubis.morrow.me.uk>,
Ben Morrow  <ben@morrow.me.uk> wrote:
>
>Quoth tmcd@panix.com:
>> I'd have to say that at
>> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
>
>That was posted a *long* time ago...

March 2012 now counts as "a *long* time ago" in Interweb Time.
In any event, I wrote,
>> Actually, I just saw it on the Cheezburger Network and thought it was
>> funny.

>> the first answer is definitive.  I know that The Pony is real, for I
>> have fed carrots to His Effulgent Face.  And I don't even know what
>> "Effulgent" means, except that it means His Face.
>
>(A BtVS reference?)

If so, only by accident.  Looking up "effulgent", I should have
written "Darkly Effulgent" for better effect.

>It is, in fact, possible to parse HTML correctly with Perl regexen
 ...
>Below is a pattern which matches valid XML

Um, your goalposts seem to be moving.

>However, it's currently rather difficult to modify it to do
>anything *useful* with the result, most importantly because of the
>limitations on both (?(DEFINE)) and (?{}).

Bit of a drawback, eh wot? as few people want to merely recognize XML.

In any event, I think it's difficult to parse HTML or XML *correctly*
with *any* technology, due to corner cases and features.  In general,
a better answer is usually to use an existing module.

-- 
Tim McDaniel, tmcd@panix.com


------------------------------

Date: Wed, 30 Jan 2013 01:03:28 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: The definitive statement on parsing HTML with regular expressions
Message-Id: <048mt9-7gi.ln1@anubis.morrow.me.uk>


Quoth tmcd@panix.com:
> In article <630mt9-t8h.ln1@anubis.morrow.me.uk>,
> Ben Morrow  <ben@morrow.me.uk> wrote:
> >
> >Quoth tmcd@panix.com:
> >> I'd have to say that at
> >> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
> >
> >That was posted a *long* time ago...
> 
> March 2012 now counts as "a *long* time ago" in Interweb Time.

Yeah, pretty much :).

> >It is, in fact, possible to parse HTML correctly with Perl regexen
> ...
> >Below is a pattern which matches valid XML
> 
> Um, your goalposts seem to be moving.

True. XML is a simpler grammar to implement than HTML (4, which is the
last version with a context-free grammar), and I happened to have that
code lying around. The principle is the same, the HTML grammar is just
longer and more tedious.

(HTML5 is a whole nother kettle of fish, and in practice parsing random
HTML as HTML5 is far more likely to give useful results than parsing it
as HTML4. HTML5's 'grammar' is neither regular nor context-free, so none
of the traditional parsing approaches will help here.)

> >However, it's currently rather difficult to modify it to do
> >anything *useful* with the result, most importantly because of the
> >limitations on both (?(DEFINE)) and (?{}).
> 
> Bit of a drawback, eh wot? as few people want to merely recognize XML.

Yes. However, this facility has the (IMHO) important benefit of knocking
down the academic 'you can't do that in theory' arguments some people
keep putting forward whenever this topic comes up (including the comment
you referenced, amusing though it may be), hopefully leaving room for a
more sensible discussion of what methods are useful in practice, and
under what circumstances.

> In any event, I think it's difficult to parse HTML or XML *correctly*
> with *any* technology, due to corner cases and features.  In general,
> a better answer is usually to use an existing module.

While in the case of an existing language with existing parsers using
them is a good idea, in the more general case of someone needing to
*write* such a parser there is still a question about which technologies
are most suitable. IMHO something like Perl 6's grammars (which are a
generalisation of Perl 5's regexes in the direction of things like
Haskell's Parsec) make the job relatively straightforward. 

The grammar itself is just a straight translation of the ABNF with
actions added, and unlike systems like yacc/Yapp/Parse::RecDescent the
fact that there isn't a compilation step means the grammar can be
extended dynamically to support language extensions. Perl 5's regexes
aren't quite there yet, but when they are it would be a shame for people
to ignore that facility because 'everyone knows you can't parse HTML
with regexes'.

Ben



------------------------------

Date: Wed, 30 Jan 2013 01:18:43 +0000
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: The definitive statement on parsing HTML with regular expressions
Message-Id: <87libbeado.fsf@sapphire.mobileactivedefense.com>

tmcd@panix.com (Tim McDaniel) writes:

[...]

>>However, it's currently rather difficult to modify it to do
>>anything *useful* with the result, most importantly because of the
>>limitations on both (?(DEFINE)) and (?{}).
>
> Bit of a drawback, eh wot? as few people want to merely recognize XML.
>
> In any event, I think it's difficult to parse HTML or XML *correctly*
> with *any* technology, due to corner cases and features.  In general,
> a better answer is usually to use an existing module.

The conclusion "it is difficult" => "everybody else must have solved
it correctly already" seems a little flimsy to me ...


------------------------------

Date: Wed, 30 Jan 2013 10:17:17 -0500
From: Charlton Wilbur <cwilbur@chromatico.net>
Subject: Re: The definitive statement on parsing HTML with regular expressions
Message-Id: <8738xiya2q.fsf@new.chromatico.net>

>>>>> "RW" == Rainer Weikusat <rweikusat@mssgmbh.com> writes:

    RW> tmcd@panix.com (Tim McDaniel) writes: [...]

    >> In any event, I think it's difficult to parse HTML or XML
    >> *correctly* with *any* technology, due to corner cases and
    >> features.  In general, a better answer is usually to use an
    >> existing module.

    RW> The conclusion "it is difficult" => "everybody else must have
    RW> solved it correctly already" seems a little flimsy to me ...

How the hell do you make that leap?

It is difficult, so it is better to use a mature code package that many
people have used (and thus tested) than it is to roll your own.

Charlton



-- 
Charlton Wilbur
cwilbur@chromatico.net


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 3873
***************************************

