[17437] in Perl-Users-Digest
Perl-Users Digest, Issue: 4857 Volume: 9
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Nov 9 14:10:55 2000
Date: Thu, 9 Nov 2000 11:10:14 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Message-Id: <973797014-v9-i4857@ruby.oce.orst.edu>
Content-Type: text
Perl-Users Digest Thu, 9 Nov 2000 Volume: 9 Number: 4857
Today's topics:
Re: strip html tags from string $text <james@NOSPAM.demon.co.uk>
Re: strip html tags from string $text <joe+usenet@sunstarsys.com>
Re: strip html tags from string $text (Tad McClellan)
Re: strip html tags from string $text <james@NOSPAM.demon.co.uk>
Re: strip html tags from string $text <joe+usenet@sunstarsys.com>
Re: strip html tags from string $text (Tad McClellan)
Re: strip html tags from string $text <jeff@vpservices.com>
Re: strip html tags from string $text <joe+usenet@sunstarsys.com>
Re: strip html tags from string $text <jeff@vpservices.com>
Re: strip html tags from string $text (Tad McClellan)
Re: strip html tags from string $text <james@NOSPAM.demon.co.uk>
Re: strip html tags from string $text <flavell@mail.cern.ch>
Re: suggestions for sorting data (Michel Dalle)
That IxHash ordered Data::Dump again ... <caesura@freenetname.co.uk>
Digest Administrivia (Last modified: 16 Sep 99) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Thu, 9 Nov 2000 16:11:57 +0000
From: James Taylor <james@NOSPAM.demon.co.uk>
Subject: Re: strip html tags from string $text
Message-Id: <ant0916570b0fNdQ@oakseed.demon.co.uk>
In article <m3zoj9du3a.fsf@mumonkan.sunstarsys.com>, Joe Schaefer
<URL:mailto:joe+usenet@sunstarsys.com> wrote:
>
> James Taylor <james@NOSPAM.demon.co.uk> writes:
> >
> > > > $text =~ s/<.*?>//g;
> >
> > > Not much of an effort for not much of a solution. Had you read the FAQ
> > > yourself you'd have realised why your solution is inadequate at best.
> >
> > Do you have a particular case in mind where Damir's regex would fail
> > to strip HTML tags from $text?
> >
>
> <html><head><title>Hi</title></head>
> <body><a href='http://www.perl.com'
> target='ext' onClick='alert( "1 > 2");'>
Yes, I saw that once I'd posted. Typical eh?
How about we add an /s on the end like so:
$text =~ s/<.*?>//gs;
Can you still think of situations where this would
fail to strip HTML tags?
--
James Taylor <james (at) oakseed demon co uk>
PGP key available ID: 3FBE1BF9
Fingerprint: F19D803624ED6FE8 370045159F66FD02
------------------------------
Date: 09 Nov 2000 11:17:30 -0500
From: Joe Schaefer <joe+usenet@sunstarsys.com>
Subject: Re: strip html tags from string $text
Message-Id: <m3vgtxdsrp.fsf@mumonkan.sunstarsys.com>
James Taylor <james@NOSPAM.demon.co.uk> writes:
> In article <m3zoj9du3a.fsf@mumonkan.sunstarsys.com>, Joe Schaefer
> <URL:mailto:joe+usenet@sunstarsys.com> wrote:
> >
> > James Taylor <james@NOSPAM.demon.co.uk> writes:
> > >
> > > > > $text =~ s/<.*?>//g;
> > >
> > > > Not much of an effort for not much of a solution. Had you read the FAQ
> > > > yourself you'd have realised why your solution is inadequate at best.
> > >
> > > Do you have a particular case in mind where Damir's regex would fail
> > > to strip HTML tags from $text?
> > >
> >
> > <html><head><title>Hi</title></head>
> > <body><a href='http://www.perl.com'
> > target='ext' onClick='alert( "1 > 2");'>
>
> Yes, I saw that once I'd posted. Typical eh?
> How about we add an /s on the end like so:
>
> $text =~ s/<.*?>//gs;
>
> Can you still think of situations where this would
> fail to strip HTML tags?
<html><head><title>Hi</title></head>
<body><a href='http://www.perl.com'
target='ext' onClick='alert( "1 > 2");'>
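
To see why the same markup still defeats the /s version, here is a minimal
sketch that runs the substitution on it. The helper name strip_naive is
hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution under discussion.
sub strip_naive {
    my ($text) = @_;
    $text =~ s/<.*?>//gs;
    return $text;
}

my $html = <<'END';
<html><head><title>Hi</title></head>
<body><a href='http://www.perl.com'
target='ext' onClick='alert( "1 > 2");'>
END

# The match for the <a ...> tag stops at the '>' inside the onClick value,
# so the tail of the attribute survives as if it were text.
print strip_naive($html);
# prints "Hi", then a line reading:  2");'>
```

With /s the dot can cross the newline, but the non-greedy match still ends at
the first > it meets, whether or not that > sits inside a quoted value.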
--
Joe Schaefer
------------------------------
Date: Thu, 9 Nov 2000 10:13:24 -0500
From: tadmc@metronet.com (Tad McClellan)
Subject: Re: strip html tags from string $text
Message-Id: <slrn90lfok.bi3.tadmc@magna.metronet.com>
[ Jeopardectomy performed (a clear indication that the poster lacks clue) ]
On Thu, 09 Nov 2000 14:38:53 +0100, Damir Boraska <damir.boraska@hrt.hr> wrote:
>dobrocinitelj@my-deja.com wrote:
>>
>> I have a string $text containing multiple lines of text, how can I
>> strip all the html tags?
>Instead of writing useless sentences such as
>"go and read the FAQ",
It is not useless. It is correct. Yours is NOT correct.
A "solution" that only works "sometimes" should be labeled as such.
You forgot to disclaim yours.
The FAQ (you didn't even go see what it said) shows at least
a half dozen examples of HTML that your "solution" won't
handle correctly.
The reasons for pointing to the FAQ instead of trying to reanswer
the question are many (and your followup provides verification
of several of them):
1) You don't know the quality of the answer you receive. FAQ answers
are written by experts. (the answer below is of profoundly poor
quality for example, and the FAQ even anticipates such foolishness
and warns against it complete with reasons why it is foolish)
2) Even an expert who tries to reanswer the question might make
a mistake. FAQs are reviewed by hundreds of people. Mistakes
don't go unnoticed for long.
3) Time spent answering an already answered question takes time
away from answering UNanswered questions. It hurts your peers.
4) Bad quality answers take time away from everybody else because
question answerers have to spend time fixing the bad advice
lest someone too inexperienced to know what to trust tries
to use it (this followup for example).
>you people could have
>just written this:
>
> $text =~ s/<.*?>//g;
You may proclaim your foolishness in front of thousands of
people if you so choose. But suggesting that others do it is
pretentious.
>and actually HELP someone.
*You* have hurt someone (thousands of someones in fact).
The code you provide is hopelessly inadequate, leading to
yet more questions being posted when they discover that it
doesn't work, and followups to such questions, and followups
to fix the bad advice in the first place.
Huge gobs of time taken away from people who really need help
by people who are too lazy to type 15 characters.
perldoc -q HTML
Your selfishness has hurt thousands of your peers.
And you suggest that that is *helping* people?
>Was that such an effort????
You spent a little of your time and wasted the time of thousands
of others. You are too selfish to be tolerated.
*plonk*
--
Tad McClellan SGML consulting
tadmc@metronet.com Perl programming
Fort Worth, Texas
------------------------------
Date: Thu, 9 Nov 2000 16:30:07 +0000
From: James Taylor <james@NOSPAM.demon.co.uk>
Subject: Re: strip html tags from string $text
Message-Id: <ant091607b49fNdQ@oakseed.demon.co.uk>
In article <m3vgtxdsrp.fsf@mumonkan.sunstarsys.com>, Joe Schaefer
<URL:mailto:joe+usenet@sunstarsys.com> wrote:
>
> James Taylor <james@NOSPAM.demon.co.uk> writes:
> >
> > $text =~ s/<.*?>//gs;
> >
> > Can you still think of situations where this would
> > fail to strip HTML tags?
>
> <html><head><title>Hi</title></head>
> <body><a href='http://www.perl.com'
> target='ext' onClick='alert( "1 > 2");'>
Duh! I'm not looking carefully enough. I'm amazed that it is legal
to have > characters in the value of a tag attribute. Is this
just a JavaScript thing or is it part of the standard HTML spec?
I wonder what pre-JavaScript browsers do with it...
--
James Taylor <james (at) oakseed demon co uk>
PGP key available ID: 3FBE1BF9
Fingerprint: F19D803624ED6FE8 370045159F66FD02
------------------------------
Date: 09 Nov 2000 11:42:33 -0500
From: Joe Schaefer <joe+usenet@sunstarsys.com>
Subject: Re: strip html tags from string $text
Message-Id: <m3r94ldrly.fsf@mumonkan.sunstarsys.com>
James Taylor <james@NOSPAM.demon.co.uk> writes:
> In article <m3vgtxdsrp.fsf@mumonkan.sunstarsys.com>, Joe Schaefer
> <URL:mailto:joe+usenet@sunstarsys.com> wrote:
> >
> > James Taylor <james@NOSPAM.demon.co.uk> writes:
> > >
> > > $text =~ s/<.*?>//gs;
> > >
> > > Can you still think of situations where this would
> > > fail to strip HTML tags?
> >
> > <html><head><title>Hi</title></head>
> > <body><a href='http://www.perl.com'
> > target='ext' onClick='alert( "1 > 2");'>
>
> Duh! I'm not looking carefully enough. I'm amazed that it is legal
> to have > characters in the value of a tag attribute. Is this
> just a JavaScript thing or is it part of the standard HTML spec?
http://www.w3.org/TR/html4/interact/scripts.html#adef-onclick
> I wonder what pre-JavaScript browsers do with it...
Who cares ;)
--
Joe Schaefer
------------------------------
Date: Thu, 9 Nov 2000 10:50:11 -0500
From: tadmc@metronet.com (Tad McClellan)
Subject: Re: strip html tags from string $text
Message-Id: <slrn90lhtj.bkv.tadmc@magna.metronet.com>
On Thu, 9 Nov 2000 16:11:57 +0000, James Taylor
<james@NOSPAM.demon.co.uk> wrote:
>Can you still think of situations where this would
>fail to strip HTML tags?
Yes. I can think of situations where:
it will fail to strip HTML tags
it will strip things that are not HTML tags
it will strip _part of_ a tag, along with some *data*!!
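
Two of these situations can be reproduced with a short sketch. The input
strings are made up for illustration, and strip_naive is a hypothetical
wrapper around the regex being discussed.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regex from the thread.
sub strip_naive {
    my ($text) = @_;
    $text =~ s/<.*?>//gs;
    return $text;
}

# Strips something that is not a tag: bare comparison operators in text.
my $not_a_tag = strip_naive('if x < y and y > z then stop');
print "[$not_a_tag]\n";      # prints "[if x  z then stop]"

# Strips part of a tag along with data: the "b" is lost, and the tail
# of the img tag is left behind.
my $part_of_tag = strip_naive('a < b <img alt="x > y">');
print "[$part_of_tag]\n";    # prints '[a  y">]'
```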
C'mon man! Check the damned FAQ. It has several such examples.
This type of thing is beat to death here twice a month or so.
Let's not drag everybody through it yet again.
--
Tad McClellan SGML consulting
tadmc@metronet.com Perl programming
Fort Worth, Texas
------------------------------
Date: Thu, 09 Nov 2000 08:59:34 -0800
From: Jeff Zucker <jeff@vpservices.com>
Subject: Re: strip html tags from string $text
Message-Id: <3A0AD7F6.B8712B18@vpservices.com>
Damir Boraska wrote:
>
> Instead of writing useless sentences such as
> "go and read the FAQ", you people could have
> just written this:
>
> $text =~ s/<.*?>//g;
Right, instead of reading the correct answer, we could have just given
an incorrect one like you did.
> and actually HELP someone.
Whom does giving the wrong answer help?
> Was that such an effort????
Yes, now, thanks to you it is becoming an effort. It will be lots of
effort for those who try your broken solution and end up with garbage.
--
Jeff
------------------------------
Date: 09 Nov 2000 12:02:48 -0500
From: Joe Schaefer <joe+usenet@sunstarsys.com>
Subject: Re: strip html tags from string $text
Message-Id: <m3n1f9dqo7.fsf@mumonkan.sunstarsys.com>
tadmc@metronet.com (Tad McClellan) writes:
> C'mon man! Check the damned FAQ. It has several such examples.
Careful- some FAQ's still have it wrong:
=============
Q4.6: How can I strip all the html tags from a
document with a Perl substitute?
Here is a simple regular expression that will
strip HTML tags:
$line =~ s/<(([^ >]|\n)*)>//g;
Or you can "escape" certain characters in a
HTML tag so that it can be displayed:
$line =~ s/<(([^>]|\n)*)>/&lt;$1&gt;/g;
For more information, see Tom's striphtml program,
which is also included in his tour of perl5 regexps.
=============
See http://www.perl.com/pub/doc/FAQs/cgi/perl-cgi-faq.html
for the ugly truth.
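
As a rough check of why that answer is wrong, the sketch below applies the
matching half of the second quoted substitution (the one using the class
[^>]) with an empty replacement. The helper name strip_cgi_faq and the
one-line input are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the pattern from the quoted answer, used to strip.
# Note that [^>] already matches a newline, so the "|\n" branch adds nothing.
sub strip_cgi_faq {
    my ($line) = @_;
    $line =~ s/<(([^>]|\n)*)>//g;
    return $line;
}

my $line = q{<a href='http://www.perl.com' target='ext' onClick='alert( "1 > 2");'>Perl</a>};
print '[', strip_cgi_faq($line), "]\n";
# prints:  [ 2");'>Perl]
```

The negated class stops at the first > just as .*? does, so a > inside a
quoted attribute value breaks it in the same way.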
--
Joe Schaefer
------------------------------
Date: Thu, 09 Nov 2000 09:04:11 -0800
From: Jeff Zucker <jeff@vpservices.com>
Subject: Re: strip html tags from string $text
Message-Id: <3A0AD90B.6EE75E19@vpservices.com>
James Taylor wrote:
>
> I'm amazed that it is legal
> to have > characters in the value of a tag attribute. Is this
> just a JavaScript thing or is it part of the standard HTML spec?
Why don't you ask in an HTML group, or better yet, read the spec
yourself?
> I wonder what pre-JavaScript browsers do with it...
If you had bothered to read the FAQ, you would know that JavaScript has
nothing to do with it, that angle brackets can be nested inside plain
comments or even img alt attributes in completely JS-free pages.
<IMG SRC = "foo.gif" ALT = "A > B">
<!-- <A comment> -->
Both those examples are in the FAQ, so I really, really do not
understand why you would bother writing in useless speculation when the
answer is clearly available to anyone who reads English.
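
A quote-aware pattern copes with the first example but not with the second.
The sketch below is only an illustration of that limit, not a recommended
solution, and the helper name strip_quote_aware is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: skips over quoted attribute values while looking
# for the closing '>' of a tag. It has no notion of comments.
sub strip_quote_aware {
    my ($text) = @_;
    $text =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $text;
}

my $img     = strip_quote_aware('<IMG SRC = "foo.gif" ALT = "A > B">');
my $comment = strip_quote_aware('<!-- <A comment> -->');

print "[$img]\n";       # prints "[]"      (whole tag removed)
print "[$comment]\n";   # prints "[ -->]"  (comment cut at the inner '>')
```

Each such refinement fixes one case and leaves another, which is the usual
argument for using a real HTML parser instead of a regex.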
--
Jeff
------------------------------
Date: Thu, 9 Nov 2000 11:23:32 -0500
From: tadmc@metronet.com (Tad McClellan)
Subject: Re: strip html tags from string $text
Message-Id: <slrn90ljs4.bof.tadmc@magna.metronet.com>
On Thu, 9 Nov 2000 16:30:07 +0000, James Taylor
<james@NOSPAM.demon.co.uk> wrote:
>I'm amazed that it is legal
>to have > characters in the value of a tag attribute. Is this
>just a JavaScript thing or is it part of the standard HTML spec?
It complies with the SGML spec (ISO-8879).
HTML is defined (by the w3c, not by browser makers) as an SGML application.
--
Tad McClellan SGML consulting
tadmc@metronet.com Perl programming
Fort Worth, Texas
------------------------------
Date: Thu, 9 Nov 2000 17:49:48 +0000
From: James Taylor <james@NOSPAM.demon.co.uk>
Subject: Re: strip html tags from string $text
Message-Id: <ant091748868fNdQ@oakseed.demon.co.uk>
In article <slrn90lhtj.bkv.tadmc@magna.metronet.com>, Tad McClellan
<URL:mailto:tadmc@metronet.com> wrote:
>
> C'mon man! Check the damned FAQ. It has several such examples.
>
> This type of thing is beat to death here twice a month or so.
>
> Let's not drag everybody through it yet again.
Sorry.
I'll now go and do what I'm told.
--
James Taylor <james (at) oakseed demon co uk>
PGP key available ID: 3FBE1BF9
Fingerprint: F19D803624ED6FE8 370045159F66FD02
------------------------------
Date: Thu, 9 Nov 2000 18:17:39 +0100
From: "Alan J. Flavell" <flavell@mail.cern.ch>
Subject: Re: strip html tags from string $text
Message-Id: <Pine.GHP.4.21.0011091803350.20030-100000@hpplus03.cern.ch>
On Thu, 9 Nov 2000, James Taylor wrote:
> Duh! I'm not looking carefully enough. I'm amazed that it is legal
> to have > characters in the value of a tag attribute.
It always has been. Some very early browsers got confused by it, but
on the other hand, other very early browsers got confused by writing
&gt; instead, so if you needed it, you had to live with one or the
other browser bug. We're talking like 1994 or so, I guess, so that's
like the Age of the Pharaohs in WWW history timescales.
> Is this just a JavaScript thing
Certainly not! It's about markup syntax in general. onclick is an
attribute of many HTML tags, and is applicable not only to one
particular scripting language.
> or is it part of the standard HTML spec?
With respect, you appear to be dimly aware that there is an HTML
specification: so if you want to know what it says, are you unaware
that it is available online for you to consult at no cost to the
worldwide usenet community?
> I wonder what pre-JavaScript browsers do with it...
They do just what it advises in the HTML2.0 spec: parse the attribute
according to normal HTML syntax rules, and otherwise disregard it.
Don't get so hung up on "JavaScript": _this_ works the same for any
kind of extension to HTML attributes.
for sure, this is really off-topic for clpm, but you did ask.
------------------------------
Date: Thu, 09 Nov 2000 17:29:31 GMT
From: michel.dalle@usa.net (Michel Dalle)
Subject: Re: suggestions for sorting data
Message-Id: <8uen1p$j36$1@news.mch.sbs.de>
In article <8uee0j$rc5$1@nnrp1.deja.com>, skoch71@my-deja.com wrote:
>I'm working on a perl script to parse through Squid logs. I know
>there's a bunch of other stuff already out there to do that kind of
>stuff, but I'm doing this for me as a learning experience.
[snip]
> Anyway, if
>anyone would mind giving me a tip on an approach for sorting my data,
>sorting an array, another hash...?, I'd sure appreciate it. Thanks.
Let's start with the sort problem. Have you had a look at Perl FAQ 4
- in particular the questions "How do I sort an array by (anything)?"
and "How do I sort a hash (optionally by value instead of key)?"
There are more advanced methods described in the FMTEYEWTK
document about sort, and in the paper of Uri Guttman and Larry
Rosler : http://www.sysarch.com/perl/sort_paper.html if you want
to go into more details.
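
For orientation, here is a minimal sketch of the two basic cases those FAQ
entries cover: sorting an array numerically, and listing hash keys ordered
by their values. The data is made up for the example.

```perl
use strict;
use warnings;

# Sort an array numerically (the default sort compares as strings).
my @sizes     = (512, 2048, 64, 1024);
my @ascending = sort { $a <=> $b } @sizes;
print "@ascending\n";    # prints "64 512 1024 2048"

# Sort hash keys by value, largest first.
my %hits = ('example.org' => 3, 'example.com' => 7, 'example.net' => 5);
my @by_hits = sort { $hits{$b} <=> $hits{$a} } keys %hits;
print "$_ $hits{$_}\n" for @by_hits;
# prints example.com 7, example.net 5, example.org 3 (one per line)
```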
Once you feel comfortable sorting arrays and hashes, let's move
on to which data structure you should use to store your information.
I'd recommend that you have a look at perldata, perldsc and perllol
for that - they show how you can build different structures for
the same basic input data. The main question you'll have to ask
yourself is what kind of operations you're going to do on that data.
You mention sorting of the entries. If that's the only thing you're
going to do, simply pushing everything in an array might be
enough. But unless your Squid logs are reasonably small, you'll
have big problems storing the whole log into memory and sorting it.
Maybe you're more interested in getting less detailed statistics, like
the number of times a source goes to a destination. For that, you
might use a hash of hashes. And maybe you'd also like to keep
track of the number of requests in a given time period - an array
would do the trick here.
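
A small sketch of those two structures follows. It assumes the log lines
have already been parsed into [epoch_seconds, source, destination] records;
that record layout and the sample values are hypothetical and do not
reflect the real Squid log format.

```perl
use strict;
use warnings;

# Hypothetical, already-parsed records: [epoch_seconds, source, destination].
my @records = (
    [973790000, '10.0.0.1', 'www.perl.com'],
    [973790030, '10.0.0.1', 'www.perl.com'],
    [973793700, '10.0.0.2', 'www.perl.com'],
    [973793800, '10.0.0.1', 'www.cpan.org'],
);

my %count;       # hash of hashes: $count{$source}{$destination}
my @per_hour;    # array indexed by hour of day (UTC): requests in that hour

for my $rec (@records) {
    my ($time, $src, $dst) = @$rec;
    $count{$src}{$dst}++;
    $per_hour[ (gmtime $time)[2] ]++;
}

# Report source -> destination counts in a stable order.
for my $src (sort keys %count) {
    for my $dst (sort keys %{ $count{$src} }) {
        print "$src -> $dst: $count{$src}{$dst}\n";
    }
}

# Report only the hours that saw any requests.
for my $hour (0 .. $#per_hour) {
    print "hour $hour: $per_hour[$hour]\n" if $per_hour[$hour];
}
```

Only the counters are kept in memory, so this scales with the number of
distinct sources and destinations, not with the number of log lines.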
As you can see, thinking about what you're going to do with the
data is the most important step in the process - how you do it
is then a matter of choosing some data structure(s) that allows
you to easily generate the output you want.
And as you continue playing with the logs, you'll probably notice
that you're switching data structures or using a combination of
them. By then, you'll have a good idea of which data structures fit
what kind of operations, and you'll have learned a valuable lesson.
Don't be afraid to experiment :)
HTH,
Michel.
------------------------------
Date: Thu, 9 Nov 2000 18:46:24 -0000
From: "Andrew Cragg" <caesura@freenetname.co.uk>
Subject: That IxHash ordered Data::Dump again ...
Message-Id: <8uercg$dhb$1@gxsn.com>
Hello,
I've posted this before and got no response at all - anyone tell me how to
dump a hash using Data::Dumper in the same order each time (so that diffs
are sensible). Even if you don't know then I'd be interested in your
comments :)
Thanks,
here is wot I sent last time :
Anyone tried to Data::Dump a Tie::IxHash'ed hash reference?
This is what I tried:
#! /usr/bin/perl -w
use strict;
use lib "$ENV{STLCM_LIB}/Tie-IxHash-1.21/lib";
use Tie::IxHash;
use Data::Dumper;
my $A_HashRef = {};
tie %$A_HashRef, "Tie::IxHash";
$A_HashRef->{'Rock'} = "Rock";
$A_HashRef->{'Cary'} = "Cary";
$A_HashRef->{'Howard'} = "Howard";
$A_HashRef->{'Clint'} = "Clint";
$A_HashRef->{'Fred'} = "Fred";
$A_HashRef->{'Ginger'} = "Ginger";
open A_FILE, ">a_file.txt" or die "Cannot open a_file.txt\n";
print A_FILE Data::Dumper->Dump ([$A_HashRef], ["A_HashRef"]);
close A_FILE;
And this is wot I got in a_file.txt :
$A_HashRef = {
               'Rock' => 'Rock',
               'Cary' => 'Cary',
               'Howard' => \$A_HashRef->{'Rock'},
               'Clint' => \$A_HashRef->{'Cary'},
               'Fred' => \$A_HashRef->{'Rock'},
               'Ginger' => \$A_HashRef->{'Cary'}
             };
Anyone?
Thank youse all greatly.
Andy
http://www.caesura.co.uk
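
One possible approach to diff-friendly dumps, sketched under the assumption
that the installed Data::Dumper supports the Sortkeys setting (releases
current at the time of this post may not): dump a plain hash with keys
sorted, so the output no longer depends on hash ordering. This gives
alphabetical order, not insertion order, and the helper name stable_dump is
hypothetical.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: dump a reference with hash keys in sorted order.
sub stable_dump {
    my ($ref, $name) = @_;
    local $Data::Dumper::Sortkeys = 1;    # emit hash keys sorted
    local $Data::Dumper::Indent   = 1;    # compact, fixed indentation
    return Data::Dumper->Dump([$ref], [$name]);
}

# Same contents, built in a different order.
my %first  = (Rock => 'Rock', Cary => 'Cary', Fred => 'Fred');
my %second = (Fred => 'Fred', Rock => 'Rock', Cary => 'Cary');

print stable_dump(\%first, 'A_HashRef');
print "identical dumps\n"
    if stable_dump(\%first, 'h') eq stable_dump(\%second, 'h');
```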
------------------------------
Date: 16 Sep 99 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 16 Sep 99)
Message-Id: <null>
Administrivia:
The Perl-Users Digest is a retransmission of the USENET newsgroup
comp.lang.perl.misc. For subscription or unsubscription requests, send
the single line:
subscribe perl-users
or:
unsubscribe perl-users
to almanac@ruby.oce.orst.edu.
| NOTE: The mail to news gateway, and thus the ability to submit articles
| through this service to the newsgroup, has been removed. I do not have
| time to individually vet each article to make sure that someone isn't
| abusing the service, and I no longer have any desire to waste my time
| dealing with the campus admins when some fool complains to them about an
| article that has come through the gateway instead of complaining
| to the source.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
To request back copies (available for a week or so), send your request
to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
where x is the volume number and y is the issue number.
For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending perl questions to the -request address, I don't have time to
answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V9 Issue 4857
**************************************