[32519] in Perl-Users-Digest

Perl-Users Digest, Issue: 3784 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Tue Sep 25 18:09:18 2012

Date: Tue, 25 Sep 2012 15:09:03 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Tue, 25 Sep 2012     Volume: 11 Number: 3784

Today's topics:
    Re: Can't find a syntax error, hoping a second set of e <ben@morrow.me.uk>
    Re: Can't find a syntax error, hoping a second set of e <ben@morrow.me.uk>
    Re: Can't find a syntax error, hoping a second set of e <ben@morrow.me.uk>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Tue, 25 Sep 2012 09:40:09 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Can't find a syntax error, hoping a second set of eyes will help
Message-Id: <9878j9-tt31.ln1@anubis.morrow.me.uk>


Quoth Jason C <jwcarlton@gmail.com>:
> On Monday, September 24, 2012 11:03:04 AM UTC-4, Ben Morrow wrote:
> 
> >     while (my ($tag, $url) = m#(<a...>(.*?)</a>)#gsi) {
> 
> In this, how does it know that we're testing $text? Or, did you mean to
> type something like:
> 
> while (my ($tag, $url) = $text =~ m#(<a...>(.*?)</a>)#gsi)

Just so :). Sorry...

Ben



------------------------------

Date: Tue, 25 Sep 2012 10:28:35 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Can't find a syntax error, hoping a second set of eyes will help
Message-Id: <33a8j9-rf41.ln1@anubis.morrow.me.uk>


Quoth Jason C <jwcarlton@gmail.com>:
> On Monday, September 24, 2012 11:03:04 AM UTC-4, Ben Morrow wrote:
> > > FWIW, this modification did work:
> > > 
> > > while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {
> > >   $pattern = "$1$2$3";
<snip>
> > >   if ($2 =~ /^http/i) {
> > >     $text =~ s/$pattern/$repl/gsi;
> > 
> > This almost certainly doesn't do what you think. If nothing else, you
> > want to \Q $pattern.
> 
> Excellent point about \Q. What do you mean, though, that it doesn't do
> what I think?

Well, for one thing, this link

    <a href="http://html5.org">HTML5</a>

will be stripped. I don't think that's what you meant.
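
On the \Q point quoted above, here is a minimal sketch of why it matters. A
matched link reused as a pattern contains regex metacharacters ('?', '.', and
so on), so without \Q it may not even match itself. The helper names and the
sample URL are made up for illustration.

```perl
use strict;
use warnings;

# Replace every occurrence of $literal, treated as a plain string.
sub replace_quoted {
    my ($text, $literal, $repl) = @_;
    $text =~ s/\Q$literal\E/$repl/g;   # \Q...\E escapes metacharacters
    return $text;
}

# Same thing without \Q: $literal is compiled as a regex.
sub replace_unquoted {
    my ($text, $literal, $repl) = @_;
    $text =~ s/$literal/$repl/g;
    return $text;
}

my $url  = 'http://example.com/?q=1';   # hypothetical URL
my $text = "go to $url now";

# In the unquoted version "/?" means "an optional slash", so the
# pattern no longer matches the string it was built from.
print replace_unquoted($text, $url, 'X'), "\n";
print replace_quoted($text, $url, 'X'), "\n";
```

The unquoted call leaves the text alone, while the quoted one replaces the
URL as intended.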

> > What are you trying to do here: strip tags?
> 
> Yes and no. I'm using a contenteditable instead of a textarea, and I've
> discovered that when someone copy-and-pastes an URL from Chrome or FF,
> it's automatically making the URL a link. Eg:
> 
> <a href="http://www.google.com">http://www.google.com</a>
> 
> But of course, if you just type the address, then it doesn't. So on my
> end, I was using URI::Find to convert addresses to links, and ending up
> with a mess like:
> 
> <a href="<a href="http://www.google.com">http://www.google.com</a>"><a
> href="http://www.google.com">http://www.google.com</a></a>
> 
> So, my goal here is to remove the <a href> tag, but only if the linked
> text is an URL.

You're doing this backwards. You want to use HTML::Parser (or perhaps
HTML::TokeParser) to separate tags from text, and then just apply
URI::Find to 'text' sections which aren't already inside an <a> element.
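
A rough sketch of that approach, assuming HTML::TokeParser and URI::Find are
installed from CPAN. The linkify() name and the markup produced by the
callback are illustrative choices, not something from the thread.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use URI::Find;

# Turn bare URLs into links, but leave text inside an <a> element alone.
sub linkify {
    my ($html) = @_;

    my $parser = HTML::TokeParser->new(\$html)
        or die "could not set up the parser";
    my $finder = URI::Find->new(sub {
        my ($uri, $orig) = @_;             # URI object, original text
        return qq{<a href="$uri">$orig</a>};
    });

    my $out  = '';
    my $in_a = 0;                          # nesting depth of <a> elements
    while (my $tok = $parser->get_token) {
        my $type = $tok->[0];
        if ($type eq 'T') {                # text token: ["T", $text, $is_data]
            my $text = $tok->[1];
            $finder->find(\$text) unless $in_a;   # rewrites $text in place
            $out .= $text;
        }
        else {
            $in_a++ if $type eq 'S' && $tok->[1] eq 'a';
            $in_a-- if $type eq 'E' && $tok->[1] eq 'a' && $in_a;
            $out .= $tok->[-1];            # last element is the original markup
        }
    }
    return $out;
}

print linkify('<a href="http://www.google.com">http://www.google.com</a>'), "\n";
print linkify('typed by hand: http://www.google.com/'), "\n";
```

A pasted link passes through untouched, so the nested-link mess never arises,
and only the hand-typed address gets wrapped.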

> > Why not
> > just do one s/// (or, you know, use a module)?
> 
> I had originally tried doing it with a simple s///, but couldn't figure
> out how to make it conditional. Like this:
> 
> $text =~ s#<a[^>]*? href=(["'])*([^\1>]*)\1[^>]*?>(.*?)</a>#$2#gsi
>   if ($3 =~ /^http/i);
> 
> This worked correctly if I removed the if() statement. In testing, I
> changed the replacement to:
> 
> 1 - $1, 2 - $2, 3 - $3
> 
> just to make sure that $3 did begin with http, and it did, so I couldn't
> figure out why the if() wasn't catching it unless it was dropping the $3
> value before reaching the if().

 ...No. Maybe it would be clearer if you wrote it like this:

    if ($3 =~ /^http/i) {
        $text =~ s#...#...#gsi;
    }

(which is *exactly* equivalent)? The 'if' condition executes first, so
$3 is something completely random from the previous pattern match; and
in any case, the if covers the *whole* s///, not just one iteration.
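
A small sketch that makes this visible. The $earlier argument is a made-up
stand-in for whatever match happened previously in the program; it is not part
of the original code.

```perl
use strict;
use warnings;

# The statement modifier is evaluated once, before the s/// runs, so the
# capture variable it tests comes from an earlier, unrelated match.
sub strip_with_modifier {
    my ($text, $earlier) = @_;

    $earlier =~ /(.*)/s;    # stand-in for a previous match; sets $1

    # Inside the replacement $1 belongs to the s/// itself, but the
    # condition on the right sees the $1 left over from the line above.
    $text =~ s{<a\b[^>]*>(.*?)</a>}{$1}gsi if $1 =~ /^http/i;
    return $text;
}

my $html = '<a href="/about">About us</a> and '
         . '<a href="http://example.com/">http://example.com/</a>';

print strip_with_modifier($html, 'nothing useful'), "\n";       # untouched
print strip_with_modifier($html, 'http://old.example/'), "\n";  # all links gone
```

Either nothing is stripped or everything is, depending only on the stale
value; the link text is never consulted.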

You need to push the condition inside the s///. The obvious way of doing
that is

    s#<a ...>http:.*?</a>#$2#gsi;

though in more difficult cases you can use s///ge and put a ?: or
equivalent in the RHS.
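
For the s///ge variant, something along these lines should work. It is only a
sketch: the unlink_urls() name and the simplified <a ...> pattern are
assumptions, and it is still regex-on-HTML with all the usual caveats.

```perl
use strict;
use warnings;

# Drop the <a> wrapper only when the link text itself looks like a URL.
sub unlink_urls {
    my ($text) = @_;
    $text =~ s{(<a\b[^>]*>)(.*?)</a>}{
        # Copy the captures first: the match below would clobber $1 and $2.
        my ($open, $body) = ($1, $2);
        $body =~ /^https?:/i ? $body : "$open$body</a>";
    }gsie;
    return $text;
}

print unlink_urls('<a href="http://www.google.com">http://www.google.com</a>'), "\n";
print unlink_urls('<a href="http://html5.org">HTML5</a>'), "\n";
```

The decision is now made once per link, inside the replacement, instead of
once for the whole statement.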

Ben



------------------------------

Date: Tue, 25 Sep 2012 10:53:32 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Can't find a syntax error, hoping a second set of eyes will help
Message-Id: <shb8j9-fk41.ln1@anubis.morrow.me.uk>


Quoth Jim Gibson <jimsgibson@gmail.com>:
> In article <6d53b708-9e94-4bc9-8707-d9a130b2da2c@googlegroups.com>,
> Jason C <jwcarlton@gmail.com> wrote:
> 
> > On Monday, September 24, 2012 3:44:44 PM UTC-4, Uri Guttman wrote:
> > 
> > >   JC> while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {
> > > 
> > > it will fail if the opening quote is " and the string has a ' inside
> > > it. perfectly legal html but you can't parse it that way.
> > 
> > I'll probably discard this idea and pursue a module, like you guys suggested.
> > But for the sake of learning...
> > 
> > I recognized this issue, too, which is why I was originally using [^\1], like
> > so:
> > 
> > (["'])*([^\1>]*)\1
> > 
> > I think it was you that pointed out that I can't negate a backreference like
> > that, though.
> > 
> > What would be the correct way to do this, if I can't negate a
> > backreference as a character class?
> 
> Capture the leading delimiter and use a backreference that is not in a
> character class:
> 
>   while ($text =~ m{(<a[^>]* href=(["']).*?\2.*?>)(.*?)(</a>)}gsi) {

That's not the same in general: .*? doesn't *want* to match a quote, but
it will if necessary to make the whole match succeed. In this particular
case it doesn't change anything because there is nothing between the \2
and the next .*?, but for instance these two

    m{<a href="[^"]*">}
    m{<a href=".*?">}

don't match the same thing. The second will match q{<a href="foo"">},
because the .*? will match a quote if forced, but the first will not.
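
This is easy to check directly; the two helper names below are just for the
demonstration.

```perl
use strict;
use warnings;

# Negated character class: can never step over a double quote.
sub matches_negated { return $_[0] =~ m{<a href="[^"]*">} ? 1 : 0 }

# Lazy dot: prefers not to, but will cross a quote if that is the
# only way for the overall match to succeed.
sub matches_lazy    { return $_[0] =~ m{<a href=".*?">}   ? 1 : 0 }

my $odd = q{<a href="foo"">};

print "negated class: ", matches_negated($odd), "\n";   # 0
print "lazy dot:      ", matches_lazy($odd), "\n";      # 1
```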

The correct way to match 'everything until $rx' is (?:(?!$rx).)*, so in
this case

    m{... href=(["'])(?:(?!\2).)*\2 ...}

(which would certainly benefit from /x).
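
With /x, and purely as a sketch, that might look like the following. Named
captures (perl 5.10 or later) are used here instead of \2 so the group
numbering does not depend on the rest of the pattern; href_value() is a
made-up helper name.

```perl
use strict;
use warnings;

# Return the value of the href attribute in a single tag, or nothing.
sub href_value {
    my ($tag) = @_;
    return unless $tag =~ m{
        \b href \s* = \s*
        (?<quote> ["'] )                    # opening quote, either kind
        (?<value> (?: (?! \k<quote> ) . )* )  # anything that is not that quote
        \k<quote>                           # the same quote again
    }xsi;
    return $+{value};
}

print href_value(q{<a href="it's.html">}), "\n";     # it's.html
print href_value(q{<a href='say "hi"'>}), "\n";      # say "hi"
```

The other kind of quote is allowed inside the value, which is the case the
plain ["'].*?["'] version got wrong.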

Ben



------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 3784
***************************************

