[18138] in Perl-Users-Digest


Perl-Users Digest, Issue: 306 Volume: 10

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sat Feb 17 03:10:31 2001

Date: Sat, 17 Feb 2001 00:10:10 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Message-Id: <982397410-v10-i306@ruby.oce.orst.edu>
Content-Type: text

Perl-Users Digest           Sat, 17 Feb 2001     Volume: 10 Number: 306

Today's topics:
    Re: striping HTML <joe+usenet@sunstarsys.com>
    Re: striping HTML <bart.lateur@skynet.be>
    Re: striping HTML <gerard@NOSPAMlanois.com>
    Re: striping HTML <no@email.com>
    Re: striping HTML <beable@my-deja.com>
    Re: striping HTML <no@email.com>
    Re: striping HTML <beable@my-deja.com>
    Re: striping HTML <dontspamthewebmaster@webdragon.net>
    Re: three-part form submission using CGI.pm <dontspamthewebmaster@webdragon.net>
    Re: Why can't I grab this URL? <whataman@home.com>
        Digest Administrivia (Last modified: 16 Sep 99) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: 16 Feb 2001 21:42:41 -0500
From: Joe Schaefer <joe+usenet@sunstarsys.com>
Subject: Re: striping HTML
Message-Id: <m3ofw2f2se.fsf@mumonkan.sunstarsys.com>

Gerard Lanois <gerard@NOSPAMlanois.com> writes:

> Funny, I was just writing a stripper earlier today.
> 
> Here is what I came up with.  A great deal of your __DATA__
> gets through, though much of it doesn't end up as valid HTML.
> 
> Comments and criticisms are welcome.

It's pretty nice, but it's usually safer (not always what 
you want though) to convert the special characters to 
HTML-encoded versions (like < -> &lt;).
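
A minimal sketch of that encoding approach, assuming HTML::Entities
is installed (the listing below already loads it). The sample string
and the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Hypothetical user input; not taken from the original thread.
my $input = '<b>bold</b> & "quoted"';

# Restrict encoding to the four characters that matter in HTML text
# and in double-quoted attribute values.
my $safe = encode_entities($input, '<>&"');

print "$safe\n";
```

Encoding keeps everything the user typed visible as literal text,
whereas stripping throws the markup away; which one is wanted depends
on the application.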

One problem I found with yours is that it doesn't handle the
"\0" character in a browser-safe way. For example, adding some 
CGI to your code to do the conversion:

    % try.pl t='<%00html><%00head><%00title>foo<%00/title><%00/head>
<%00body bgcolor="red">hola<%00/body><%00/html>'

outputs:

<html><head><title>foo</title></head><body bgcolor="red">hola</body>

Most browsers will parse this just as it appears here. I'm not sure
if '\0' is the only strange char that defeats your stripper, but 
offhand I can't think of any others that will break it without
destroying the rendering in a web browser as well.
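
The effect can be sketched with core Perl alone. The sample string
stands in for the URL-decoded CGI parameter, and the simple pattern
is only a stand-in for a tag recognizer, not what HTML::Parser does
internally; all names here are illustrative.

```perl
use strict;
use warnings;

# After URL-decoding, each %00 becomes a literal NUL byte.
my $decoded = "<\0b>hola<\0/b>";

# A naive tag pattern does not see "<\0b>" as a tag, so the
# string would be passed through as plain text ...
my $looks_like_tag = $decoded =~ /<[a-zA-Z\/]/ ? 1 : 0;

# ... but once the NULs are dropped (which is what the post reports
# browsers effectively doing), the markup is back.
(my $as_rendered = $decoded) =~ tr/\0//d;

print "tag seen before cleanup: $looks_like_tag\n";
print "after cleanup: $as_rendered\n";
```

Removing the NUL bytes before parsing, as in the commented-out line
of the listing below, closes that gap.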

Here's what I used for your code; the fix for this is commented
out.

% cat try.pl
#!/usr/bin/perl -w

use strict;

package Stripper;

use HTML::Parser;
use HTML::Entities;

use vars qw ( @ISA );

@ISA = qw ( HTML::Parser );

sub new
{
    my ($class, %args) = @_;
    my $self = $class->SUPER::new(%args);
    return $self;
}

sub text
{
    my $self = shift;
    my $text = shift;

    my $dtext = HTML::Entities::decode($text);

    # Accumulate pieces of text.
    if (defined($self->{_stringref}) && length($dtext)) {
        ${$self->{_stringref}} .= $dtext;
    }
}

sub set_stringref    # no prototype: Perl ignores prototypes on method calls
{
    my $self = shift;
    $self->{_stringref} = shift;
}

1;


package main;
use CGI "-debug";
undef $/;
my $q = new CGI;
my $file = $q->param('t');

# $file =~ s/\0//g;  # uncomment this to fix

my $stripper = Stripper->new();
my $result = "";
$stripper->set_stringref(\$result);

$stripper->parse($file);

print $result;

__END__

HTH

Joe Schaefer
-- 
#include <stdio.h> /* requires gcc and *nix
use strict; system("cc -x c $0") and die $?; open C, "|a.out" or die $! . q*/
main(){char s[32]; remove("a.out"); printf("%s/C hacker\n",fgets(s,32,stdin));
return 0;}/*; print C "Just another Perl"; close C or die $?; #*/


------------------------------

Date: Fri, 16 Feb 2001 23:11:08 GMT
From: Bart Lateur <bart.lateur@skynet.be>
Subject: Re: striping HTML
Message-Id: <sgcr8tomv0gfjhkm5lh7kivkf4244np1vk@4ax.com>

Scott R. Godin wrote:

> | Is there a good perl sub that will strip any HTML tags they put in 
> | the message?  It would be nice if they could do simple ones like 
> | <br>, <b> and other text formatting.  I don't want pictures, tables 
> | and font changes.
>
>
>You could always try the HTML::Parser module from CPAN to extract just 
>the text.

Or you can only leave in certain tags. I did something like that once.
It went something like this...

    use HTML::TokeParser;
    my $p = HTML::TokeParser->new(shift)
      or die "Cannot read HTML file: $!";
    my %allowed = map { $_ => 1 } qw(b i u br p);
    while(my $token = $p->get_token) {
        if($token->[0] eq 'T') {
            # text: a 'T' token is ["T", $text, $is_data], so the text
            # is at index 1, not at the end
            print $token->[1];
        } elsif ($token->[0] =~ /^[SE]$/) {
            # start or end tag
            print $token->[-1] if $allowed{$token->[1]};
        }
    }

Note that it also strips out comments and processing instructions, which
couldn't be retained in the same manner, by taking the last item of
$token (array ref), anyway.
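
One thing to be aware of: printing the original tag text for an
allowed start tag also lets its attributes through (onclick and the
like). A possible variation, offered only as a sketch, rebuilds the
allowed tags from their names so attributes are dropped. The sample
input and the $out variable are illustrative, and the parser reads
from a string reference instead of a file.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical input, chosen to show attributes being dropped.
my $html = '<p onclick="evil()">Hi <b>there</b><img src="x.gif"></p>';
my %allowed = map { $_ => 1 } qw(b i u br p);

my $p = HTML::TokeParser->new(\$html) or die "Cannot parse string";
my $out = '';
while (my $token = $p->get_token) {
    my $type = $token->[0];
    if ($type eq 'T') {
        $out .= $token->[1];          # text content
    } elsif ($type eq 'S' && $allowed{ $token->[1] }) {
        $out .= "<$token->[1]>";      # rebuilt start tag, attributes dropped
    } elsif ($type eq 'E' && $allowed{ $token->[1] }) {
        $out .= "</$token->[1]>";     # rebuilt end tag
    }
}
print "$out\n";
```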

-- 
	Bart.


------------------------------

Date: 16 Feb 2001 19:21:55 -0800
From: Gerard Lanois <gerard@NOSPAMlanois.com>
Subject: Re: striping HTML
Message-Id: <u1ysyvvsc.fsf@NOSPAMlanois.com>

Joe Schaefer <joe+usenet@sunstarsys.com> writes:

> Gerard Lanois <gerard@NOSPAMlanois.com> writes:
> > Funny, I was just writing a stripper earlier today.
>
> It's pretty nice, but it's usually safer (not always what 
> you want though) to convert the special characters to 
> HTML-encoded versions (like < -> &lt;).

Right.

My original motivation to write this stripper was to
stuff a description which (although I didn't realize it
at the time) possibly contained HTML into the content
attribute of <META NAME="description" content="...">.

Understandably, I started having trouble with a description
which contained double quotes.  That's when I realized I
probably didn't want any HTML in there either.
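
Even with the tags stripped, quotes and ampersands still need
escaping before the text goes into an attribute value. A minimal
sketch using only core Perl; attr_escape is a hypothetical helper
and the description string is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that would break a
# double-quoted attribute value.
sub attr_escape {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
    );
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

my $description = 'The "best" <b>Perl</b> tips & tricks';
my $meta = sprintf '<META NAME="description" content="%s">',
    attr_escape($description);

print "$meta\n";
```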

> One problem I found with yours is that it doesn't handle the
> "\0" character in a browser-safe way. 
> ...
> Here's what I used for your code; the fix for this is commented
> out.
> ...
> package main;
> use CGI "-debug";
> undef $/;
> my $q = new CGI;
> my $file = $q->param('t');
> 
> # $file =~ s/\0//g;  # uncomment this to fix
> 
> my $stripper = Stripper->new();
> my $result = "";
> $stripper->set_stringref(\$result);
> 
> $stripper->parse($file);
> 
> print $result;
> 
> __END__
> 
> HTH

Yes - very educational.  Thanks!

-Gerard
http://www.lanois.com/perl/



------------------------------

Date: Sat, 17 Feb 2001 05:40:14 GMT
From: "Frank Miller" <no@email.com>
Subject: Re: striping HTML
Message-Id: <29oj6.435014$U46.12735201@news1.sttls1.wa.home.com>

What FAQ?  Please don't treat me like an idiot just because I don't know
about the FAQ.  I would LOVE to read the FAQ.  Can someone please point me
to it?  I am not a long time reader of this group.  Geez, lighten up.

FrankM

<nobull@mail.com> wrote in message news:u9u25uedk3.fsf@wcl-l.bham.ac.uk...
> "Frank Miller" <no@email.com> writes:
>
> > Subject: Re: striping HTML
>
> FAQ: "How do I remove HTML from a string?"
>
> You are expected to read the FAQ _before_ posting.
>
> --
>      \\   ( )
>   .  _\\__[oo
>  .__/  \\ /\@
>  .  l___\\
>   # ll  l\\
>  ###LL  LL\\




------------------------------

Date: Sat, 17 Feb 2001 05:53:15 GMT
From: Beable van Polasm <beable@my-deja.com>
Subject: Re: striping HTML
Message-Id: <m3k86prhgo.fsf@beable.van.polasm.bigpond.net.au>

"Frank Miller" <no@email.com> writes:

> What FAQ?  Please don't treat me like an idiot just because I don't know
> about the FAQ.  I would LOVE to read the FAQ.  Can someone please point me
> to it?  I am not a long time reader of this group.  Geez, lighten up.

Try running this command:
perldoc -q "How do I remove HTML from a string?"

If you are using Windows, run it in an MS-DOG Prompt.

Run this command to read the FAQ:
perldoc perlfaq

cheers
Beable van Polasm
-- 
    He could knock the wind out of anything, including a sailboat! 
               	-- WCW Nitro Commentator
                        IQC 78189333
          http://members.nbci.com/_______/index.html


------------------------------

Date: Sat, 17 Feb 2001 06:13:41 GMT
From: "Frank Miller" <no@email.com>
Subject: Re: striping HTML
Message-Id: <pEoj6.435107$U46.12741080@news1.sttls1.wa.home.com>

Thanks.  Is this in HTML form on the web anywhere?

FrankM

"Beable van Polasm" <beable@my-deja.com> wrote in message
news:m3k86prhgo.fsf@beable.van.polasm.bigpond.net.au...
> "Frank Miller" <no@email.com> writes:
>
> > What FAQ?  Please don't treat me like an idiot just because I don't know
> > about the FAQ.  I would LOVE to read the FAQ.  Can someone please point
me
> > to it?  I am not a long time reader of this group.  Geez, lighten up.
>
> Try running this command:
> perldoc -q "How do I remove HTML from a string?"
>
> If you are using Windows, run it in an MS-DOG Prompt.
>
> Run this command to read the FAQ:
> perldoc perlfaq
>
> cheers
> Beable van Polasm
> --
>     He could knock the wind out of anything, including a sailboat!
>                -- WCW Nitro Commentator
>                         IQC 78189333
>           http://members.nbci.com/_______/index.html




------------------------------

Date: Sat, 17 Feb 2001 06:50:53 GMT
From: Beable van Polasm <beable@my-deja.com>
Subject: Re: striping HTML
Message-Id: <m3g0hdret6.fsf@beable.van.polasm.bigpond.net.au>

"Frank Miller" <no@email.com> writes:
>
> Thanks.  Is this in HTML form on the web anywhere?

Well you know the web. This sort of stuff is available everywhere.
For example:
http://language.perl.com/faq/

However, if you look in where you have Perl installed, in the html
directory, you should find the html documentation right on your own
computer.

Also, to save trouble in the future, you should read this:
http://www.uwasa.fi/~ts/http/quote.html
Pay special attention to the remarks about "top-posting".
http://www.btinternet.com/~chiba/sbox/topposters.html

cheers
Beable van Polasm
-- 
STOP CAKEHOLING PLANKTON               IQC 78189333
-- Joe Bay, Society for Plankton Abuse PREVENTION
http://members.nbci.com/_______/index.html


------------------------------

Date: 17 Feb 2001 07:55:10 GMT
From: "Scott R. Godin" <dontspamthewebmaster@webdragon.net>
Subject: Re: striping HTML
Message-Id: <96laou$2nv$1@216.155.33.113>

In article <pEoj6.435107$U46.12741080@news1.sttls1.wa.home.com>, "Frank 
Miller" <no@email.com> wrote:

Don't do this: what I'm doing right here, and what you did right 
below. It's called top-posting, and it's BAD. EVIL. UGLY. It includes 
TONS of useless extra fluff, i.e. the ENTIRE BLOODY PREVIOUS ARTICLE 
gets included (why the hell do you think there are such things as 
backreferences in newsgroups? Look at the headers!). See below, where 
the reply BELONGS, for an explanation. Thanks! :)

 | Thanks.  Is this in HTML form on the web anywhere?
 | 
 | FrankM
 
[snip of extraneous stuff]

 | > > What FAQ?  Please don't treat me like an idiot just because I don't 
 | > > know

[snip of extraneous stuff]

here's another 'faq' about top-posting you should DEFINITELY be aware 
of, before people simply killfile you for doing it too much (I'm totally 
serious)

[
  Please put your comments *following* the quoted text that you
  are commenting on.

  Please do not quote an entire article. Get right in there and TRIM 
  that thang!

  Please do not quote .sigs. Really.

  Please see:     http://www.geocities.com/nnqweb/nquote.html

  Thank you.

  Jeopardectomy performed.
]

it's called Jeopardy-quoting after the gameshow "Jeopardy" where the 
answer comes before the question. Humans don't read this way. we read 
the question first, and THEN the answer. 

If you can get over what on the surface SEEMS to be a gruff, upset 
and stressing reply (which it isn't, merely emphatic), you'll see that I 
*am* trying to help you educate yourself. Even if it's something as 
simple as posting to usenet and comp.lang.perl.misc. :)

-- 
unmunge e-mail here:
#!perl -w
print map {chr(ord($_)-3)} split //, "zhepdvwhuCzhegudjrq1qhw"; 
# ( damn spammers. *shakes fist* take a hint. =:P )


------------------------------

Date: 17 Feb 2001 07:42:18 GMT
From: "Scott R. Godin" <dontspamthewebmaster@webdragon.net>
Subject: Re: three-part form submission using CGI.pm
Message-Id: <96la0q$2nv$0@216.155.33.113>

In article <m31ysygpti.fsf@mumonkan.sunstarsys.com>, Joe Schaefer 
<joe+usenet@sunstarsys.com> wrote:

 | I think that hidden() is escaping the HTML.  If so, it's arguably
 | a good feature designed to prevent special characters from screwing 
 | up your form.

yes definitely, in fact I'm again calling escapeHTML() on each of the 
variables I'm dumping into the resultant html output file (although I'm 
not sure if I need to -- just being cautious). I was curious to notice 
that unescapeHTML wasn't exported by default. *scratching head*
 
 | I'm not quite sure what you want to do, but I'd recommend either
 | doing a s/&#13;&#10;/<br>/ etc on the returned hidden form values (after
 | they'd been resubmitted), or else place the returned data within a
 | <pre>...</pre> block as is.

I figured out what I had to do.. what I wound up doing (and this is sort 
of strange) was a broader fix_paragraphs regex sub. 

Whatever was happening with the hidden() params, by the time the data 
got there it was *already* DOUBLED, i.e. a \n in the textarea field had 
somehow turned into \n\n (or something like that anyway). There 
shouldn't have been &#13;&#10;&#13;&#10; in the phase two hidden fields, 
but there *was*, and somehow that gets turned into FOUR of the little 
/\n|\r/ buggers at output where there should only be two. That was then 
passed via POST to the third sequence, and once it got there it was 
decoded again before it even got to my $UT:: namespace (I checked). So 
what I wound up doing in phase three (output phase), after much trial 
and error, was this:

sub fix_paragraphs {    # no () prototype: the sub takes one argument
    my $fixee = shift;
    $fixee =~ s|\n\n|<br><br>|g;
    $fixee =~ s|\r\r|<br><br>|g;
    $fixee =~ s/\n|\r/<br>/g; # just in case
    $fixee =~ s/<br><br><br><br>/<br><br>/g;
    return $fixee;
}

which works perfectly with the results sent by the 

   map { 
    hidden({-name=>$_, -Values=>unescapeHTML(param($_))}), "\n" 
   } (param)

in phase two (validation phase) 

I could probably reduce the regexes to: 

    $fixee =~ s|\n\n|<br>|g;
    $fixee =~ s|\r\r|<br>|g;
    $fixee =~ s/\n|\r/<br>/g; # just in case

but it's not like it's going to eat a lot of processing for a script of 
this nature, and it WILL still break up a <br> x 4 sequence that might 
occur. 

At least this is working now and I can move on to the rest of the 
script's chunks of code. I just wish I knew why it was doubling the 
paragraph returns from the textarea field. I *am* using CGI.pm 2.74, and 
not an older model. *scratching head in bewilderment*
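
One possible explanation, offered as an assumption rather than a
confirmed diagnosis: browsers submit textarea line breaks as CR LF
pairs, so a single blank line arrives as "\r\n\r\n", which is exactly
what &#13;&#10;&#13;&#10; encodes. Neither s|\n\n| nor s|\r\r|
matches that sequence, and the single-character substitution then
yields four <br> tags. A sketch that normalizes line endings first;
newlines_to_br is a hypothetical name and the sample text is invented.

```perl
use strict;
use warnings;

# Hypothetical replacement for the multi-step substitution:
# normalize every line ending to "\n" first, then convert.
sub newlines_to_br {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;    # CR LF or bare CR becomes LF
    $text =~ s/\n/<br>/g;     # one <br> per line break
    return $text;
}

my $from_textarea = "first paragraph\r\n\r\nsecond paragraph";
my $html = newlines_to_br($from_textarea);

print "$html\n";
```

With the line endings normalized up front, no cleanup pass for runs
of four <br> tags should be needed.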

-- 
unmunge e-mail here:
#!perl -w
print map {chr(ord($_)-3)} split //, "zhepdvwhuCzhegudjrq1qhw"; 
# ( damn spammers. *shakes fist* take a hint. =:P )


------------------------------

Date: Sat, 17 Feb 2001 02:19:55 GMT
From: "What A Man !" <whataman@home.com>
Subject: Re: Why can't I grab this URL?
Message-Id: <3A8DE03E.AC788757@home.com>

Bart Lateur wrote:
> 
> What A Man ! wrote:
> 
> >Why can't I grab this URL?
> >
> >http://www.fortunecity.com/millenium/blyton/243/clemrule.zip
> 
> If I try to get it from a browser, I get:
> 
> >     We're sorry, but we can't supply the file you requested.
> >
> >     In order for us to continue to provide our members with the first-class
> >     service they expect, we don't allow people to link files from sites hosted
> >     with other providers.
> 
> Doesn't that answer your question?

Thanks, but No. That only tells you that they don't want
people hotlinking to those URLs. It doesn't mean they
restrict downloading them; and if they allow a browser to
download the URL (which they do), I see no reason why they
would disallow a script to do the same thing.
> 
> I think you probably need to tweak the REFERER.

I appreciate that suggestion. Obviously, this is the key.
Someone just e-mailed me a way to do it. I can't wait to
try it to see if it works. 
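
For reference, a sketch of how a Referer header can be attached to a
request with the libwww-perl classes. The referring page used here is
an assumption (the thread never says which value the server accepts),
and the request is only built and printed, not sent.

```perl
use strict;
use warnings;
use HTTP::Request;

my $url = 'http://www.fortunecity.com/millenium/blyton/243/clemrule.zip';

# Assumed referring page: the directory the file lives in.
my $referer = 'http://www.fortunecity.com/millenium/blyton/243/';

my $req = HTTP::Request->new(GET => $url);
$req->header(Referer => $referer);

# Show the request that would go over the wire.
print $req->as_string;

# To actually fetch it, hand $req to LWP::UserAgent->new->request($req).
```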

to Godzilla: Eat that donkey dung yourself. I used IE5,
and it worked exactly as I said. I had no reason to lie
about it. You obviously didn't do it correctly! Again,
while in IE5, put
http://www.fortunecity.com/millenium/blyton/243/clemrule.zip
into your GoTo Box, and press return. You'll then see that
it asks whether you want to either  _ open from current
location   or   _save this file to disk. Check "open from
current location," and if you have Winzip installed, it
automatically goes to WinZip and unzips the file. I hope
this helps you.

Well, this subject has been exhausted, but what amazes me
is that Perl book author Randal Schwartz didn't give the
correct answer either. He did point out the incorrect
'Accept'='text/html', though. Thanks for all of the
helpful responses, and especially for the helpful emails
(you know who you are). I really do appreciate it.

Kind regards,
--Dennis


------------------------------

Date: 16 Sep 99 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 16 Sep 99)
Message-Id: <null>


Administrivia:

The Perl-Users Digest is a retransmission of the USENET newsgroup
comp.lang.perl.misc.  For subscription or unsubscription requests, send
the single line:

	subscribe perl-users
or:
	unsubscribe perl-users

to almanac@ruby.oce.orst.edu.  

| NOTE: The mail to news gateway, and thus the ability to submit articles
| through this service to the newsgroup, has been removed. I do not have
| time to individually vet each article to make sure that someone isn't
| abusing the service, and I no longer have any desire to waste my time
| dealing with the campus admins when some fool complains to them about an
| article that has come through the gateway instead of complaining
| to the source.

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

To request back copies (available for a week or so), send your request
to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
where x is the volume number and y is the issue number.

For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending Perl questions to the -request address; I don't have time to
answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V10 Issue 306
**************************************

