[32519] in Perl-Users-Digest
Perl-Users Digest, Issue: 3784 Volume: 11
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Tue Sep 25 18:09:18 2012
Date: Tue, 25 Sep 2012 15:09:03 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Tue, 25 Sep 2012 Volume: 11 Number: 3784
Today's topics:
Re: Can't find a syntax error, hoping a second set of e <ben@morrow.me.uk>
Re: Can't find a syntax error, hoping a second set of e <ben@morrow.me.uk>
Re: Can't find a syntax error, hoping a second set of e <ben@morrow.me.uk>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Tue, 25 Sep 2012 09:40:09 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Can't find a syntax error, hoping a second set of eyes will help
Message-Id: <9878j9-tt31.ln1@anubis.morrow.me.uk>
Quoth Jason C <jwcarlton@gmail.com>:
> On Monday, September 24, 2012 11:03:04 AM UTC-4, Ben Morrow wrote:
>
> > while (my ($tag, $url) = m#(<a...>(.*?)</a>)#gsi) {
>
> In this, how does it know that we're testing $text? Or, did you mean to
> type something like:
>
> while (my ($tag, $url) = $text =~ m#(<a...>(.*?)</a>)#gsi)
Just so :). Sorry...
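A minimal sketch of the corrected loop, for reference. One caveat, hedged: as far as I know, assigning m//g to a list evaluates the match in list context, which returns all matches at once instead of stepping through them, so this sketch keeps the match in scalar context and copies the captures inside the loop body. The sample $text and the simplified pattern are assumptions for illustration; they are not the elided <a...> pattern quoted above.

```perl
use strict;
use warnings;

# Hypothetical sample input (not from the thread).
my $text = '<a href="http://a.example/">one</a> and '
         . '<a href="http://b.example/">two</a>';

my @found;
# In scalar context, m//g resumes from pos($text) on each iteration.
while ($text =~ m#(<a\b[^>]*>(.*?)</a>)#gsi) {
    my ($tag, $inner) = ($1, $2);   # copy captures before anything else matches
    push @found, $inner;
    print "$tag => $inner\n";
}
```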
Ben
------------------------------
Date: Tue, 25 Sep 2012 10:28:35 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Can't find a syntax error, hoping a second set of eyes will help
Message-Id: <33a8j9-rf41.ln1@anubis.morrow.me.uk>
Quoth Jason C <jwcarlton@gmail.com>:
> On Monday, September 24, 2012 11:03:04 AM UTC-4, Ben Morrow wrote:
> > > FWIW, this modification did work:
> > >
> > > while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {
> > > $pattern = $1$2$3;
<snip>
> > > if ($2 =~ /^http/i) {
> > > $text =~ s/$pattern/$repl/gsi;
> >
> > This almost certainly doesn't do what you think. If nothing else, you
> > want to \Q $pattern.
>
> Excellent point about \Q. What do you mean, though, that it doesn't do
> what I think?
Well, for one thing, this link
<a href="http://html5.org">HTML5</a>
will be stripped. I don't think that's what you meant.
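On the \Q point, a small sketch of why the quoting matters when a captured link is interpolated back into a pattern. The sample link is made up; its '?' and '.' are regex metacharacters, so the unquoted pattern no longer matches the text it was taken from.

```perl
use strict;
use warnings;

# Hypothetical captured link containing regex metacharacters.
my $pattern = '<a href="http://example.com/?q=1">http://example.com/?q=1</a>';
my $text    = "before $pattern after";

# Without \Q, "/?" means "an optional slash", so the literal "?" never matches.
(my $unquoted = $text) =~ s/$pattern/X/g;

# With \Q...\E every character of $pattern is matched literally.
(my $quoted = $text) =~ s/\Q$pattern\E/X/g;

print "unquoted: $unquoted\n";
print "quoted:   $quoted\n";
```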
> > What are you trying to do here: strip tags?
>
> Yes and no. I'm using a contenteditable instead of a textarea, and I've
> discovered that when someone copy-and-pastes an URL from Chrome or FF,
> it's automatically making the URL a link. Eg:
>
> <a href="http://www.google.com">http://www.google.com</a>
>
> But of course, if you just type the address, then it doesn't. So on my
> end, I was using URI::Find to convert addresses to links, and ending up
> with a mess like:
>
> <a href="<a href="http://www.google.com">http://www.google.com</a>"><a
> href="http://www.google.com">http://www.google.com</a></a>
>
> So, my goal here is to remove the <a href> tag, but only if the linked
> text is an URL.
You're doing this backwards. You want to use HTML::Parser (or perhaps
HTML::TokeParser) to separate tags from text, and then just apply
URI::Find to 'text' sections which aren't already inside an <a> element.
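A rough sketch of that approach, assuming HTML::TokeParser and URI::Find are installed from CPAN. The function name linkify and the replacement markup are my own inventions for illustration, and the sketch ignores details such as entity handling inside URLs; treat it as a starting point, not a finished filter.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use URI::Find;

# linkify() is a hypothetical helper name, not something from the thread.
sub linkify {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html) or die "cannot parse input";
    $parser->unbroken_text(1);   # ask for each text run as a single token

    my $finder = URI::Find->new(sub {
        my ($uri, $orig) = @_;
        return qq{<a href="$uri">$orig</a>};
    });

    my ($out, $in_a) = ('', 0);
    while (my $tok = $parser->get_token) {
        my $type = $tok->[0];
        if ($type eq 'T') {
            my $text = $tok->[1];
            # Only linkify text that is not already inside an <a> element.
            $finder->find(\$text) unless $in_a;
            $out .= $text;
            next;
        }
        if    ($type eq 'S' && $tok->[1] eq 'a') { $in_a++ }
        elsif ($type eq 'E' && $tok->[1] eq 'a') { $in_a-- if $in_a }
        # For non-text tokens the last element is the original source text.
        $out .= $tok->[-1];
    }
    return $out;
}

print linkify('see http://www.google.com or '
            . '<a href="http://www.google.com">http://www.google.com</a>'), "\n";
```

Because tags are copied through verbatim and URI::Find only ever sees text tokens, an address pasted as a ready-made link is left alone while a typed address gets wrapped.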
> > Why not
> > just do one s/// (or, you know, use a module)?
>
> I had originally tried doing it with a simple s///, but couldn't figure
> out how to make it conditional. Like this:
>
> $text =~ s#<a[^>]*? href=(["'])*([^\1>]*)\1[^>]*?>(.*?)</a>#$2#gsi
> if ($3 =~ /^http/i);
>
> This worked correctly if I removed the if() statement. In testing, I
> changed the replacement to:
>
> 1 - $1, 2 - $2, 3 - $3
>
> just to make sure that $3 did begin with http, and it did, so I couldn't
> figure out why the if() wasn't catching it unless it was dropping the $3
> value before reaching the if().
...No. Maybe it would be clearer if you wrote it like this:
    if ($3 =~ /^http/i) {
        $text =~ s#...#...#gsi;
    }
(which is *exactly* equivalent)? The 'if' condition executes first, so
$3 is something completely random from the previous pattern match; and
in any case, the if covers the *whole* s///, not just one iteration.
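To make the evaluation order concrete, here is a tiny sketch with made-up data. An earlier, unrelated match leaves 'z' in $3, and the statement-modifier 'if' tests that stale value before the s/// ever runs.

```perl
use strict;
use warnings;

my $text = 'abc';

# An earlier, unrelated successful match: $3 is now 'z'.
'zzz' =~ /(z)(z)(z)/;

# The condition is evaluated first, against the stale $3, so the
# substitution is skipped even though its own third group would be 'c'.
$text =~ s/(a)(b)(c)/$3/ if $3 eq 'c';

print "$text\n";
```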
You need to push the condition inside the s///. The obvious way of doing
that is
s#<a ...>http:.*?</a>#$2#gsi;
though in more difficult cases you can use s///ge and put a ?: or
equivalent in the RHS.
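A sketch of the s///ge variant, under my own assumptions about the pattern and the sample input (neither is from the original post). The captures are copied into lexicals first, because the inner match in the condition would otherwise replace $1, $3 and $4.

```perl
use strict;
use warnings;

# Hypothetical sample: one link whose text is a URL, one whose text is not.
my $text = '<a href="http://www.google.com">http://www.google.com</a> '
         . '<a href="http://html5.org">HTML5</a>';

# Group 1: whole element, 2: quote char, 3: href value, 4: link text.
$text =~ s{(<a\b[^>]*?\shref=(["'])((?:(?!\2).)*)\2[^>]*>(.*?)</a>)}
          {
              my ($all, $href, $label) = ($1, $3, $4);
              # Keep the element unless its text looks like a URL.
              $label =~ /^https?:/i ? $href : $all;
          }gsie;

print "$text\n";
```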
Ben
------------------------------
Date: Tue, 25 Sep 2012 10:53:32 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Can't find a syntax error, hoping a second set of eyes will help
Message-Id: <shb8j9-fk41.ln1@anubis.morrow.me.uk>
Quoth Jim Gibson <jimsgibson@gmail.com>:
> In article <6d53b708-9e94-4bc9-8707-d9a130b2da2c@googlegroups.com>,
> Jason C <jwcarlton@gmail.com> wrote:
>
> > On Monday, September 24, 2012 3:44:44 PM UTC-4, Uri Guttman wrote:
> >
> > > JC> while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {
> > >
> > > it will fail if the opening quote is " and the string has a ' inside
> > > it. perfectly legal html but you can't parse it that way.
> >
> > I'll probably discard this idea and pursue a module, like you guys suggested.
> > But for the sake of learning...
> >
> > I recognized this issue, too, which is why I was originally using [^\1], like
> > so:
> >
> > (["'])*([^\1>]*)\1
> >
> > I think it was you that pointed out that I can't negate a backreference like
> > that, though.
> >
> > What would be the correct way to do this, if I can't negate a
> > backreference as a character class?
>
> Capture the leading delimiter and use a backreference that is not in a
> character class:
>
> while ($text =~ m{(<a[^>]* href=(["']).*?\2.*?>)(.*?)(</a>)}gsi) {
That's not the same in general: .*? doesn't *want* to match a quote, but
it will if necessary to make the whole match succeed. In this particular
case it doesn't change anything because there is nothing between the \2
and the next .*?, but for instance these two
m{<a href="[^"]*">}
m{<a href=".*?">}
don't match the same thing. The second will match q{<a href="foo"">},
because the .*? will match a quote if forced, but the first will not.
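The difference is easy to check directly; this sketch just runs both patterns against the string from the paragraph above.

```perl
use strict;
use warnings;

my $input = q{<a href="foo"">};

# [^"]* can never consume a quote, so the stray second quote blocks the match.
my $negated_class = $input =~ m{<a href="[^"]*">} ? 1 : 0;

# .*? prefers not to consume a quote, but will if the match needs it.
my $lazy_dot = $input =~ m{<a href=".*?">} ? 1 : 0;

print "negated class: $negated_class, lazy dot: $lazy_dot\n";
```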
The correct way to match 'everything until $rx' is (?:(?!$rx).)*, so in
this case
m{... href=(["'])(?:(?!\2).)*\2 ...}
(which would certainly benefit from /x).
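For illustration, one way that might look with /x. The surrounding parts of the pattern are my own guess at a reasonable shape, since the original is elided, and the quote is group 1 here because there is no enclosing group.

```perl
use strict;
use warnings;

my $rx = qr{
    <a \b [^>]*? \s href =
    (["'])                  # group 1: the opening quote
    ( (?: (?!\1) . )* )     # group 2: anything that is not that same quote
    \1                      # the matching closing quote
    [^>]* >
}xsi;

# Hypothetical test inputs: each value contains the *other* kind of quote.
my @hrefs;
for my $tag (q{<a href="it's">}, q{<a href='say "hi"'>}) {
    push @hrefs, $2 if $tag =~ $rx;
}
print "$_\n" for @hrefs;
```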
Ben
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V11 Issue 3784
***************************************