[18138] in Perl-Users-Digest
Perl-Users Digest, Issue: 306 Volume: 10
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sat Feb 17 03:10:31 2001
Date: Sat, 17 Feb 2001 00:10:10 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Message-Id: <982397410-v10-i306@ruby.oce.orst.edu>
Content-Type: text
Perl-Users Digest Sat, 17 Feb 2001 Volume: 10 Number: 306
Today's topics:
Re: striping HTML <joe+usenet@sunstarsys.com>
Re: striping HTML <bart.lateur@skynet.be>
Re: striping HTML <gerard@NOSPAMlanois.com>
Re: striping HTML <no@email.com>
Re: striping HTML <beable@my-deja.com>
Re: striping HTML <no@email.com>
Re: striping HTML <beable@my-deja.com>
Re: striping HTML <dontspamthewebmaster@webdragon.net>
Re: three-part form submission using CGI.pm <dontspamthewebmaster@webdragon.net>
Re: Why can't I grab this URL? <whataman@home.com>
Digest Administrivia (Last modified: 16 Sep 99) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: 16 Feb 2001 21:42:41 -0500
From: Joe Schaefer <joe+usenet@sunstarsys.com>
Subject: Re: striping HTML
Message-Id: <m3ofw2f2se.fsf@mumonkan.sunstarsys.com>
Gerard Lanois <gerard@NOSPAMlanois.com> writes:
> Funny, I was just writing a stripper earlier today.
>
> Here is what I came up with. A great deal of your __DATA__
> gets through, though much of it doesn't end up as valid HTML.
>
> Comments and criticisms are welcome.
It's pretty nice, but it's usually safer (not always what
you want though) to convert the special characters to
HTML-encoded versions (like < -> &lt;).
One problem I found with yours is that it doesn't handle the
"\0" character in a browser-safe way. For example, adding some
CGI to your code to do the conversion:
% try.pl t='<%00html><%00head><%00title>foo<%00/title><%00/head>
<%00body bgcolor="red">hola<%00/body><%00/html>'
outputs:
<html><head><title>foo</title></head><body bgcolor="red">hola</body>
Most browsers will parse this just as it appears here. I'm not sure
if '\0' is the only strange char that defeats your stripper, but
offhand I can't think of any others that will break it without
destroying the rendering in a web browser as well.
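
The encoding approach mentioned above, combined with dropping NUL bytes, can be sketched roughly as follows. This is only a minimal sketch using core Perl: make_safe and %entity are hypothetical names, not part of the stripper under discussion, and in practice HTML::Entities::encode_entities covers far more characters than the four handled here.

```perl
use strict;
use warnings;

# Map each HTML-special character to its entity (assumed minimal set).
my %entity = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
);

# make_safe is a hypothetical helper for illustration only.
sub make_safe {
    my ($text) = @_;
    $text =~ s/\0//g;                     # drop NUL bytes first
    $text =~ s/([&<>"])/$entity{$1}/g;    # then encode the special characters
    return $text;
}

print make_safe("<\0b>hola</\0b>"), "\n";   # prints &lt;b&gt;hola&lt;/b&gt;
```

Because nothing is parsed, a browser shows whatever the user typed as literal text, tags included, instead of rendering it.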
Here's what I used for your code; the fix for this is commented
out.
% cat try.pl
#!/usr/bin/perl -w
use strict;
package Stripper;
use HTML::Parser;
use HTML::Entities;
use vars qw ( @ISA );
@ISA = qw ( HTML::Parser );
sub new
{
    my ($class, %args) = @_;
    my $self = $class->SUPER::new(%args);
    return $self;
}
sub text
{
    my $self = shift;
    my $text = shift;
    my $dtext = HTML::Entities::decode($text);
    # Accumulate pieces of text.
    if (defined($self->{_stringref}) && length($dtext)) {
        ${$self->{_stringref}} .= $dtext;
    }
}
sub set_stringref($)
{
    my $self = shift;
    $self->{_stringref} = shift;
}
1;
package main;
use CGI "-debug";
undef $/;
my $q = new CGI;
my $file = $q->param('t');
# $file =~ s/\0//g; # uncomment this to fix
my $stripper = Stripper->new();
my $result = "";
$stripper->set_stringref(\$result);
$stripper->parse($file);
print $result;
__END__
HTH
Joe Schaefer
--
#include <stdio.h> /* requires gcc and *nix
use strict; system("cc -x c $0") and die $?; open C, "|a.out" or die $! . q*/
main(){char s[32]; remove("a.out"); printf("%s/C hacker\n",fgets(s,32,stdin));
return 0;}/*; print C "Just another Perl"; close C or die $?; #*/
------------------------------
Date: Fri, 16 Feb 2001 23:11:08 GMT
From: Bart Lateur <bart.lateur@skynet.be>
Subject: Re: striping HTML
Message-Id: <sgcr8tomv0gfjhkm5lh7kivkf4244np1vk@4ax.com>
Scott R. Godin wrote:
> | Is there a good perl sub that will strip any HTML tags they put in
> | the message? It would be nice if they could do simple ones like
> | <br>, <b> and other text formatting. I don't want pictures, tables
> | and font changes.
>
>
>You could always try the HTML::Parser module from CPAN to extract just
>the text.
Or you can only leave in certain tags. I did something like that once.
It went something like this...
use HTML::TokeParser;
my $p = HTML::TokeParser->new(shift)
    or die "Cannot read HTML file: $!";
my %allowed = map { $_ => 1 } qw(b i u br p);
while(my $token = $p->get_token) {
    if($token->[0] eq 'T') {
        # text
        print $token->[-1];
    } elsif ($token->[0] =~ /[SE]/) {
        # start or end tag
        print $token->[-1] if $allowed{$token->[1]};
    }
}
Note that it also strips out comments and processing instructions, which
couldn't be retained in the same manner, by taking the last item of
$token (array ref), anyway.
--
Bart.
------------------------------
Date: 16 Feb 2001 19:21:55 -0800
From: Gerard Lanois <gerard@NOSPAMlanois.com>
Subject: Re: striping HTML
Message-Id: <u1ysyvvsc.fsf@NOSPAMlanois.com>
Joe Schaefer <joe+usenet@sunstarsys.com> writes:
> Gerard Lanois <gerard@NOSPAMlanois.com> writes:
> > Funny, I was just writing a stripper earlier today.
>
> It's pretty nice, but it's usually safer (not always what
> you want though) to convert the special characters to
> HTML-encoded versions (like < -> &lt;).
Right.
My original motivation to write this stripper was to
stuff a description which (although I didn't realize it
at the time) possibly contained HTML into the content
attribute of <META NAME="description" content="...">.
Understandably, I started having trouble with a description
which contained double quotes. That's when I realized I
probably didn't want any HTML in there either.
> One problem I found with yours is that it doesn't handle the
> "\0" character in a browser-safe way.
> ...
> Here's what I used for your code; the fix for this is commented
> out.
> ...
> package main;
> use CGI "-debug";
> undef $/;
> my $q = new CGI;
> my $file = $q->param('t');
>
> # $file =~ s/\0//g; # uncomment this to fix
>
> my $stripper = Stripper->new();
> my $result = "";
> $stripper->set_stringref(\$result);
>
> $stripper->parse($file);
>
> print $result;
>
> __END__
>
> HTH
Yes - very educational. Thanks!
-Gerard
http://www.lanois.com/perl/
------------------------------
Date: Sat, 17 Feb 2001 05:40:14 GMT
From: "Frank Miller" <no@email.com>
Subject: Re: striping HTML
Message-Id: <29oj6.435014$U46.12735201@news1.sttls1.wa.home.com>
What FAQ? Please don't treat me like an idiot just because I don't know
about the FAQ. I would LOVE to read the FAQ. Can someone please point me
to it? I am not a long time reader of this group. Geez, lighten up.
FrankM
<nobull@mail.com> wrote in message news:u9u25uedk3.fsf@wcl-l.bham.ac.uk...
> "Frank Miller" <no@email.com> writes:
>
> > Subject: Re: striping HTML
>
> FAQ: "How do I remove HTML from a string?"
>
> You are expected to read the FAQ _before_ posting.
>
> --
> \\ ( )
> . _\\__[oo
> .__/ \\ /\@
> . l___\\
> # ll l\\
> ###LL LL\\
------------------------------
Date: Sat, 17 Feb 2001 05:53:15 GMT
From: Beable van Polasm <beable@my-deja.com>
Subject: Re: striping HTML
Message-Id: <m3k86prhgo.fsf@beable.van.polasm.bigpond.net.au>
"Frank Miller" <no@email.com> writes:
> What FAQ? Please don't treat me like an idiot just because I don't know
> about the FAQ. I would LOVE to read the FAQ. Can someone please point me
> to it? I am not a long time reader of this group. Geez, lighten up.
Try running this command:
perldoc -q "How do I remove HTML from a string?"
If you are using Windows, run it in an MS-DOG Prompt.
Run this command to read the FAQ:
perldoc perlfaq
cheers
Beable van Polasm
--
He could knock the wind out of anything, including a sailboat!
-- WCW Nitro Commentator
IQC 78189333
http://members.nbci.com/_______/index.html
------------------------------
Date: Sat, 17 Feb 2001 06:13:41 GMT
From: "Frank Miller" <no@email.com>
Subject: Re: striping HTML
Message-Id: <pEoj6.435107$U46.12741080@news1.sttls1.wa.home.com>
Thanks. Is this in HTML form on the web anywhere?
FrankM
"Beable van Polasm" <beable@my-deja.com> wrote in message
news:m3k86prhgo.fsf@beable.van.polasm.bigpond.net.au...
> "Frank Miller" <no@email.com> writes:
>
> > What FAQ? Please don't treat me like an idiot just because I don't know
> > about the FAQ. I would LOVE to read the FAQ. Can someone please point me
> > to it? I am not a long time reader of this group. Geez, lighten up.
>
> Try running this command:
> perldoc -q "How do I remove HTML from a string?"
>
> If you are using Windows, run it in an MS-DOG Prompt.
>
> Run this command to read the FAQ:
> perldoc perlfaq
>
> cheers
> Beable van Polasm
> --
> He could knock the wind out of anything, including a sailboat!
> -- WCW Nitro Commentator
> IQC 78189333
> http://members.nbci.com/_______/index.html
------------------------------
Date: Sat, 17 Feb 2001 06:50:53 GMT
From: Beable van Polasm <beable@my-deja.com>
Subject: Re: striping HTML
Message-Id: <m3g0hdret6.fsf@beable.van.polasm.bigpond.net.au>
"Frank Miller" <no@email.com> writes:
>
> Thanks. Is this in HTML form on the web anywhere?
Well you know the web. This sort of stuff is available everywhere.
For example:
http://language.perl.com/faq/
However, if you look in where you have Perl installed, in the html
directory, you should find the html documentation right on your own
computer.
Also, to save trouble in the future, you should read this:
http://www.uwasa.fi/~ts/http/quote.html
Pay special attention to the remarks about "top-posting".
http://www.btinternet.com/~chiba/sbox/topposters.html
cheers
Beable van Polasm
--
STOP CAKEHOLING PLANKTON IQC 78189333
-- Joe Bay, Society for Plankton Abuse PREVENTION
http://members.nbci.com/_______/index.html
------------------------------
Date: 17 Feb 2001 07:55:10 GMT
From: "Scott R. Godin" <dontspamthewebmaster@webdragon.net>
Subject: Re: striping HTML
Message-Id: <96laou$2nv$1@216.155.33.113>
In article <pEoj6.435107$U46.12741080@news1.sttls1.wa.home.com>, "Frank
Miller" <no@email.com> wrote:
Don't do this: what I'm doing right here, and what you did right
below. It's called top-posting, and it's BAD. EVIL. UGLY. It includes TONS
of useless extra fluff, i.e. the ENTIRE BLOODY PREVIOUS ARTICLE gets
included. (Why the hell do you think there are such things as
backreferences in newsgroups? Look at the headers!) See below, where the
reply BELONGS, for an explanation. Thanks! :)
| Thanks. Is this in HTML form on the web anywhere?
|
| FrankM
[snip of extraneous stuff]
| > > What FAQ? Please don't treat me like an idiot just because I don't
| > > know
[snip of extraneous stuff]
Here's another 'faq' about top-posting you should DEFINITELY be aware
of, before people simply killfile you for doing it too much (I'm totally
serious):
[
Please put your comments *following* the quoted text that you
are commenting on.
Please do not quote an entire article. Get right in there and TRIM
that thang!
Please do not quote .sigs. Really.
Please see: http://www.geocities.com/nnqweb/nquote.html
Thank you.
Jeopardectomy performed.
]
It's called Jeopardy-quoting, after the game show "Jeopardy", where the
answer comes before the question. Humans don't read this way. We read
the question first, and THEN the answer.
If you can get over what on the surface SEEMS to be a gruff, upset
reply (which it isn't; it's merely emphatic), you'll see that I
*am* trying to help you educate yourself, even if it's about something as
simple as posting to usenet and comp.lang.perl.misc. :)
--
unmunge e-mail here:
#!perl -w
print map {chr(ord($_)-3)} split //, "zhepdvwhuCzhegudjrq1qhw";
# ( damn spammers. *shakes fist* take a hint. =:P )
------------------------------
Date: 17 Feb 2001 07:42:18 GMT
From: "Scott R. Godin" <dontspamthewebmaster@webdragon.net>
Subject: Re: three-part form submission using CGI.pm
Message-Id: <96la0q$2nv$0@216.155.33.113>
In article <m31ysygpti.fsf@mumonkan.sunstarsys.com>, Joe Schaefer
<joe+usenet@sunstarsys.com> wrote:
| I think that hidden() is escaping the HTML. If so, it's arguably
| a good feature designed to prevent special characters from screwing
| up your form.
Yes, definitely. In fact, I'm again calling escapeHTML() on each of the
variables I'm dumping into the resultant html output file (although I'm
not sure if I need to -- just being cautious). I was curious to notice
that unescapeHTML wasn't exported by default. *scratching head*
| I'm not quite sure what you want to do, but I'd recommend either
| doing a s/ /<br>/ etc on the returned hidden form values (after
| they'd been resubmitted), or else place the returned data within a
| <pre>...</pre> block as is.
I figured out what I had to do.. what I wound up doing (and this is sort
of strange) was a broader fix_paragraphs regex sub.
whatever was happening with the hidden() params, by the time it got
there, it was *already* DOUBLED i.e. an \n in the textarea field turned
somehow into an \n\n (or something like that anyway -- there shouldn't
have been in the phase two hidden fields, but there
*was* and somehow that gets turned into FOUR of the little /\n|\r/
buggers at output where there should only be two), and was being passed
via POST to the third sequence.. once it got there it was decoded again
before it even got to my $UT:: namespace (I checked) ... so what I wound
up doing in phase three (output phase) was this: (after much trial and
error)
sub fix_paragraphs {
    my $fixee = shift;
    $fixee =~ s|\n\n|<br><br>|g;
    $fixee =~ s|\r\r|<br><br>|g;
    $fixee =~ s/\n|\r/<br>/g; # just in case
    $fixee =~ s/<br><br><br><br>/<br><br>/g;
    return $fixee;
}
which works perfectly with the results sent by the
map {
hidden({-name=>$_, -Values=>unescapeHTML(param($_))}), "\n"
} (param)
in phase two (validation phase)
I could probably reduce the regexes to:
$fixee =~ s|\n\n|<br>|g;
$fixee =~ s|\r\r|<br>|g;
$fixee =~ s/\n|\r/<br>/g; # just in case
but it's not like it's going to eat a lot of processing for a script of
this nature, and it WILL still break up a <br> x 4 sequence that might
occur.
At least this is working now and I can move on to the rest of the
script's chunks of code. I just wish I knew why it was doubling the
paragraph returns from the textarea field. I *am* using CGI.pm 2.74, and
not an older model. *scratching head in bewilderment*
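
One possible explanation for the doubling, offered as an assumption rather than anything confirmed in this thread: browsers submit line breaks in a textarea as CR LF pairs, so a blank line arrives as "\r\n\r\n", which the \n|\r substitution in fix_paragraphs above turns into four <br> tags. A minimal sketch that normalizes line endings first could look like this (fix_paragraphs_crlf is a hypothetical name):

```perl
use strict;
use warnings;

# fix_paragraphs_crlf is a hypothetical variant of fix_paragraphs.
sub fix_paragraphs_crlf {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;    # CRLF or a lone CR becomes a single LF
    $text =~ s/\n/<br>/g;     # each logical line break becomes one <br>
    return $text;
}

print fix_paragraphs_crlf("one\r\n\r\ntwo\rthree\nfour"), "\n";
# prints one<br><br>two<br>three<br>four
```

With the line endings normalized up front, the cleanup pass that collapses four <br> tags into two should no longer be needed.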
--
unmunge e-mail here:
#!perl -w
print map {chr(ord($_)-3)} split //, "zhepdvwhuCzhegudjrq1qhw";
# ( damn spammers. *shakes fist* take a hint. =:P )
------------------------------
Date: Sat, 17 Feb 2001 02:19:55 GMT
From: "What A Man !" <whataman@home.com>
Subject: Re: Why can't I grab this URL?
Message-Id: <3A8DE03E.AC788757@home.com>
Bart Lateur wrote:
>
> What A Man ! wrote:
>
> >Why can't I grab this URL?
> >
> >http://www.fortunecity.com/millenium/blyton/243/clemrule.zip
>
> If I try to get it from a browser, I get:
>
> > We're sorry, but we can't supply the file you requested.
> >
> > In order for us to continue to provide our members with the first-class
> > service they expect, we don't allow people to link files from sites hosted
> > with other providers.
>
> Doesn't that answer your question?
Thanks, but No. That only tells you that they don't want
people hotlinking to those URLs. It doesn't mean they
restrict downloading them; and if they allow a browser to
download the URL (which they do), I see no reason why they
would disallow a script to do the same thing.
>
> I think you probably need to tweak the REFERER.
I appreciate that suggestion. Obviously, this is the key.
Someone just e-mailed me a way to do it. I can't wait to
try it to see if it works.
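
Tweaking the Referer from a script can be sketched roughly as follows. This is a minimal sketch, not the method that was e-mailed: it uses the core HTTP::Tiny module, the helper names referer_options and fetch_with_referer are hypothetical, and which referring page the server accepts is an assumption that would have to be found by experiment.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Build the options hash for a GET that claims to come from $referer.
# (hypothetical helper, for illustration only)
sub referer_options {
    my ($referer) = @_;
    return { headers => { Referer => $referer } };
}

# Fetch $url with a Referer header; returns HTTP::Tiny's response hashref.
sub fetch_with_referer {
    my ($url, $referer) = @_;
    return HTTP::Tiny->new->get($url, referer_options($referer));
}

# Usage (left commented out because it needs network access):
# my $res = fetch_with_referer($file_url, $page_on_the_same_site);
# print $res->{success}
#     ? "got " . length($res->{content}) . " bytes\n"
#     : "$res->{status} $res->{reason}\n";
```

The request is otherwise an ordinary GET; only the Referer header differs from what a bare script would send.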
to Godzilla: Eat that donkey dung yourself. I used IE5,
and it worked exactly as I said. I had no reason to lie
about it. You obviously didn't do it correctly! Again,
while in IE5, put
http://www.fortunecity.com/millenium/blyton/243/clemrule.zip
into your GoTo Box, and press return. You'll then see that
it asks whether you want to either _ open from current
location or _save this file to disk. Check "open from
current location," and if you have Winzip installed, it
automatically goes to WinZip and unzips the file. I hope
this helps you.
Well, this subject has been exhausted, but what amazes me
is that Perl book author Randal Schwartz didn't give the
correct answer either. He did point out the incorrect
'Accept'='text/html', though. Thanks for all of the
helpful responses, and especially for the helpful emails
(you know who you are). I really do appreciate it.
Kind regards,
--Dennis
------------------------------
Date: 16 Sep 99 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 16 Sep 99)
Message-Id: <null>
Administrivia:
The Perl-Users Digest is a retransmission of the USENET newsgroup
comp.lang.perl.misc. For subscription or unsubscription requests, send
the single line:
subscribe perl-users
or:
unsubscribe perl-users
to almanac@ruby.oce.orst.edu.
| NOTE: The mail to news gateway, and thus the ability to submit articles
| through this service to the newsgroup, has been removed. I do not have
| time to individually vet each article to make sure that someone isn't
| abusing the service, and I no longer have any desire to waste my time
| dealing with the campus admins when some fool complains to them about an
| article that has come through the gateway instead of complaining
| to the source.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
To request back copies (available for a week or so), send your request
to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
where x is the volume number and y is the issue number.
For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending perl questions to the -request address, I don't have time to
answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V10 Issue 306
**************************************