[28088] in Perl-Users-Digest


Perl-Users Digest, Issue: 9452 Volume: 10

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Wed Jul 12 00:10:14 2006

Date: Tue, 11 Jul 2006 21:10:06 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Tue, 11 Jul 2006     Volume: 10 Number: 9452

Today's topics:
    Re: RSS feeds and HTML special characters <john@castleamber.com>
    Re: RSS feeds and HTML special characters <mumia.w.18.spam+nospam.usenet@earthlink.net>
    Re: RSS feeds and HTML special characters <benmorrow@tiscali.co.uk>
    Re: RSS feeds and HTML special characters eurosnob@gmail.com
    Re: RSS feeds and HTML special characters <1usa@llenroc.ude.invalid>
    Re: RSS feeds and HTML special characters <john@castleamber.com>
    Re: RSS feeds and HTML special characters <benmorrow@tiscali.co.uk>
    Re: What is a type error? <jo@durchholz.org>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: 11 Jul 2006 22:50:04 GMT
From: John Bokma <john@castleamber.com>
Subject: Re: RSS feeds and HTML special characters
Message-Id: <Xns97FDB56A44404castleamber@130.133.1.4>

eurosnob@gmail.com wrote:

> If this isn't the right place to post this, please point me in the
> right direction?
> 
> I'm a relatively casual Perl programmer trying to implement an RSS feed
> into my personal site.  I've got it working using a slightly modified
> example from the O'Reilly book (content syndication with RSS), but for
> one annoying caveat...
> 
> If I load the feed in, say, Firefox, a title might look like this:
> 
> <title>This Artist is Good - Frank D'Armata</title>
> 
> When I use "View Source," the title string is actually:
> 
> <title>This Artist is Good - Frank D&#8217;Armata</title>
> 
> However, when I go to use the string from within Perl, I get a Warning,
> "Wide character in print", and gibberish printed where the special
> character sits:
> 
> This Artist is Good - Frank D’Armata
> 
> (That's a lowercase 'a' with an accent, the Euro symbol, and the
> trademark symbol, between D and Armata.)
> 
> I'm sure there's a relatively simple fix, but I'm kind of lost at this
> point...  Help?!

You might want to study:
http://ahinea.com/en/tech/perl-unicode-struggle.html

-- 
John Bokma          Freelance software developer
                                &
                    Experienced Perl programmer: http://castleamber.com/


------------------------------

Date: Tue, 11 Jul 2006 23:56:11 GMT
From: "Mumia W." <mumia.w.18.spam+nospam.usenet@earthlink.net>
Subject: Re: RSS feeds and HTML special characters
Message-Id: <vUWsg.3343$vO.493@newsread4.news.pas.earthlink.net>

eurosnob@gmail.com wrote:
> [...]
> <title>This Artist is Good - Frank D&#8217;Armata</title>
> 
> However, when I go to use the string from within Perl, I get a Warning,
> "Wide character in print", and gibberish printed where the special
> character sits:
> 
> This Artist is Good - Frank D’Armata
> [...]

You probably need "use encoding 'utf8';" at the top of
your script. You need to output utf8 data (&#8217; is
decimal, i.e. the Unicode character U+2019), but STDOUT
wasn't warned to look out for unicode (multi-byte, wide)
characters.
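
A minimal sketch of the output side, assuming the title has already been parsed into a Perl character string (the sample string below is made up from the example in the question). Putting an explicit encoding layer on STDOUT is one way to stop the "Wide character in print" warning; `:encoding(UTF-8)` is an assumption about what the page should be served as, not the only possible choice.

```perl
use strict;
use warnings;

# U+2019 (RIGHT SINGLE QUOTATION MARK) is the character &#8217; denotes.
my $title = "This Artist is Good - Frank D\x{2019}Armata";

# Tell Perl how to turn characters into bytes on this filehandle.
binmode STDOUT, ':encoding(UTF-8)';

print $title, "\n";
```

Without the binmode line, the same print would warn, because Perl has no instruction for writing a character above 255 as bytes.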



------------------------------

Date: Tue, 11 Jul 2006 23:40:52 +0100
From: Ben Morrow <benmorrow@tiscali.co.uk>
Subject: Re: RSS feeds and HTML special characters
Message-Id: <k0mdo3-6i1.ln1@osiris.mauzo.dyndns.org>


Quoth eurosnob@gmail.com:
> I'm a relatively casual Perl programmer trying to implement an RSS feed
> into my personal site.  I've got it working using a slightly modified
> example from the O'Reilly book (content syndication with RSS), but for
> one annoying caveat...
> 
> If I load the feed in, say, Firefox, a title might look like this:
> 
> <title>This Artist is Good - Frank D'Armata</title>
> 
> When I use "View Source," the title string is actually:
> 
> <title>This Artist is Good - Frank D&#8217;Armata</title>
> 
> However, when I go to use the string from within Perl, I get a Warning,
> "Wide character in print", and gibberish printed where the special
> character sits:
> 
> This Artist is Good - Frank D’Armata
> 
> (That's a lowercase 'a' with an accent, the Euro symbol, and the
> trademark symbol, between D and Armata.)

Firstly, you need to be using perl 5.8.

Next, we need to know how you are getting hold of these strings. Please
post a minimal complete program that shows what you are doing.

Basically, your data is coming in (from wherever you're getting it from)
in the UTF8 encoding, and you haven't told Perl that. Perl assumes data
is in ISO8859-1 unless you tell it otherwise (for hysterical raisins),
so you're getting gibberish. If you show us how you're reading your data
we can tell you how to tell Perl it's in UTF8.
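
A rough illustration of where those three characters can come from, using only the core Encode module. It assumes the display treated the output bytes as Windows-1252, which is what the Euro and trademark signs in the description suggest; that is a guess about the poster's setup, not something stated in the thread.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The character the feed's &#8217; stands for.
my $char = "\x{2019}";

# As UTF-8 it is three bytes: E2 80 99.
my $bytes = encode('UTF-8', $char);
printf "UTF-8 bytes: %s\n",
    join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

# Misread those bytes one at a time as Windows-1252 and you get
# U+00E2, U+20AC, U+2122: a-circumflex, Euro sign, trade mark sign.
my $misread = decode('cp1252', $bytes);
printf "misread as:  %s\n",
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $misread;

# Decoding with the right encoding recovers the single character.
my $ok = decode('UTF-8', $bytes);
printf "decoded as:  U+%04X\n", ord($ok);
```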

Ben

-- 
For far more marvellous is the truth than any artists of the past imagined!
Why do the poets of the present not speak of it? What men are poets who can
speak of Jupiter if he were like a man, but if he is an immense spinning
sphere of methane and ammonia must be silent?~Feynman~benmorrow@tiscali.co.uk


------------------------------

Date: 11 Jul 2006 17:46:08 -0700
From: eurosnob@gmail.com
Subject: Re: RSS feeds and HTML special characters
Message-Id: <1152665168.202118.154470@i42g2000cwa.googlegroups.com>

Ben Morrow wrote:
> Firstly, you need to be using perl 5.8.

"This is perl, v5.8.3 built for i386-linux-thread-multi"

> Next, we need to know how you are getting hold of these strings. Please
> post a minimal complete program that shows what you are doing.

use LWP::Simple;
use XML::Simple;

my $feed = get ("http://the/url/of/the/feed");

# At this point, $feed contains:
# ... <title>This Artist is Good - Frank D&#8217;Armata</title> ...

my $parser = XML::Simple->new(  );
my $rss = $parser->XMLin("$feed");

# At this point,  $rss->{'channel'}->{'item'}->[x]->{'title'} contains:
# This Artist is Good - Frank D<gibberish>Armata

So it looks like the XML::Simple routine(s) are unencoding the encoded
HTML entity from the feed.  Since the output is an HTML page, I'd like
to leave the text HTML-encoded (in this example, &#8217;), as that's
what it should properly be for output.



------------------------------

Date: Wed, 12 Jul 2006 02:13:48 GMT
From: "A. Sinan Unur" <1usa@llenroc.ude.invalid>
Subject: Re: RSS feeds and HTML special characters
Message-Id: <Xns97FDE23B1C447asu1cornelledu@127.0.0.1>

eurosnob@gmail.com wrote in
news:1152665168.202118.154470@i42g2000cwa.googlegroups.com: 

> Ben Morrow wrote:
>> Firstly, you need to be using perl 5.8.
> 
> "This is perl, v5.8.3 built for i386-linux-thread-multi"
> 
>> Next, we need to know how you are getting hold of these strings.
>> Please post a minimal complete program that shows what you are doing.
> 
> use LWP::Simple;
> use XML::Simple;
> 
> my $feed = get ("http://the/url/of/the/feed");
> 
> # At this point, $feed contains:
> # ... <title>This Artist is Good - Frank D&#8217;Armata</title> ...
> 
> my $parser = XML::Simple->new(  );
> my $rss = $parser->XMLin("$feed");
> 
> # At this point,  $rss->{'channel'}->{'item'}->[x]->{'title'} contains:
> # This Artist is Good - Frank D<gibberish>Armata
> 
> So it looks like the XML::Simple routine(s) are unencoding the encoded
> HTML entity from the feed.  Since the output is an HTML page, I'd like
> to leave the text HTML-encoded (in this example, &#8217;), as that's
> what it should properly be for output.

I know next to nothing about this. The following two points seem 
pertinent to me:

From <URL: http://www.eopta.com/spec/rss-tutorial/#Tips>

# Encoding HTML: Although it's tempting, refrain from including HTML 
markup (like <a href="...">, <b> or <p>) in your RSS feed; because you 
don't know how it will be presented, doing so can prevent your feed from 
being displayed correctly. If you need to include a tag in the text of 
the feed (e.g., the title of an entry is "Ode to <title>"), make sure 
you escape ampersands and angle brackets (so that it would be "Ode to 
&lt;title&gt;").
# XML Entities: Remember that XML doesn't predefine entities like HTML 
does; therefore, you won't have &nbsp; &copy; and other common entities 
available. You can define them in the XML, or alternatively just use a 
character encoding that makes what you need available.

Aaaaanyway, in my complete ignorance, I would use 
encode_entities_numeric from HTML::Entities:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Simple;
use HTML::Entities qw(encode_entities_numeric);

my $feed = q{<title>This Artist is Good - Frank D&#8217;Armata</title>};

my $rss = XMLin($feed);

use Data::Dumper;
print Dumper $rss;
print Dumper encode_entities_numeric($rss);

__END__

-- 
A. Sinan Unur <1usa@llenroc.ude.invalid>
(remove .invalid and
reverse each component for email address)

comp.lang.perl.misc
guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html


------------------------------

Date: 12 Jul 2006 02:48:02 GMT
From: John Bokma <john@castleamber.com>
Subject: Re: RSS feeds and HTML special characters
Message-Id: <Xns97FDDDC30E7BFcastleamber@130.133.1.4>

eurosnob@gmail.com wrote:

> Ben Morrow wrote:
>> Firstly, you need to be using perl 5.8.
> 
> "This is perl, v5.8.3 built for i386-linux-thread-multi"
> 
>> Next, we need to know how you are getting hold of these strings. Please
>> post a minimal complete program that shows what you are doing.
> 
> use LWP::Simple;
> use XML::Simple;
> 
> my $feed = get ("http://the/url/of/the/feed");
> 
> # At this point, $feed contains:
> # ... <title>This Artist is Good - Frank D&#8217;Armata</title> ...
> 
> my $parser = XML::Simple->new(  );
> my $rss = $parser->XMLin("$feed");
> 
> # At this point,  $rss->{'channel'}->{'item'}->[x]->{'title'} contains:
> # This Artist is Good - Frank D<gibberish>Armata

How did you find this out? By printing? Notice that the printing step 
might do the <gibberish> thing (which probably is the case).
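
One way to check what the string really holds without trusting the terminal is to print code points instead of raw characters. This is only a sketch: `show_codepoints` is a made-up helper name, and the sample string stands in for the parsed title.

```perl
use strict;
use warnings;

# Hypothetical helper: show a string with every non-ASCII character
# written as <U+XXXX>, so the terminal's encoding cannot distort it.
sub show_codepoints {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf '<U+%04X>', ord($1)/ge;
    return $text;
}

my $title = "Frank D\x{2019}Armata";    # stand-in for the parsed title
print show_codepoints($title), "\n";
printf "%d characters\n", length $title;
```

If the dump shows a single <U+2019>, the parser did its job and the damage happens at output time.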

> So it looks like the XML::Simple routine(s) are unencoding the encoded
> HTML entity from the feed.  Since the output is an HTML page, I'd like
> to leave the text HTML-encoded (in this example, &#8217;), as that's
> what it should properly be for output.

Easiest solution is to output your HTML as utf8.

If you use A. Sinan Unur's solution, as far as I know you still have to 
specify somewhere that the HTML document should be rendered as having 
utf8, since &#8217; is just a fancy way of writing a UTF-8 character.

-- 
John Bokma          Freelance software developer
                                &
                    Experienced Perl programmer: http://castleamber.com/


------------------------------

Date: Wed, 12 Jul 2006 04:40:59 +0100
From: Ben Morrow <benmorrow@tiscali.co.uk>
Subject: Re: RSS feeds and HTML special characters
Message-Id: <bj7eo3-ne3.ln1@osiris.mauzo.dyndns.org>


Quoth John Bokma <john@castleamber.com>:
> eurosnob@gmail.com wrote:
> 
> > Ben Morrow wrote:
> >> Firstly, you need to be using perl 5.8.
> > 
> > "This is perl, v5.8.3 built for i386-linux-thread-multi"
> > 
> >> Next, we need to know how you are getting hold of these strings. Please
> >> post a minimal complete program that shows what you are doing.
> > 
> > use LWP::Simple;
> > use XML::Simple;
> > 
> > my $feed = get ("http://the/url/of/the/feed");
> > 
> > # At this point, $feed contains:
> > # ... <title>This Artist is Good - Frank D&#8217;Armata</title> ...
> > 
> > my $parser = XML::Simple->new(  );
> > my $rss = $parser->XMLin("$feed");
> > 
> > # At this point,  $rss->{'channel'}->{'item'}->[x]->{'title'} contains:
> > # This Artist is Good - Frank D<gibberish>Armata
> 
> How did you find this out? By printing? Notice that the printing step 
> might do the <gibberish> thing (which probably is the case).

It will, unless you tell Perl what encoding you want for output. This is
what the 'wide character in print' warning means: you need to specify an
encoding on the output filehandle. (This warning could be a little
clearer, IMHO; though it's difficult to reconcile all of Perl's forward-
and backward-compatibility goals :(.)

> > So it looks like the XML::Simple routine(s) are unencoding the encoded
> > HTML entity from the feed.  Since the output is an HTML page, I'd like
> > to leave the text HTML-encoded (in this example, &#8217;), as that's
> > what it should properly be for output.
> 
> Easiest solution is to output your HTML as utf8.

You can just print things and hope, but it's much better (safer, more
flexible and you don't get warnings) to do it right.

1. Decide what encoding you want to use. I generally use us-ascii, 'cos
I *know* it's safe; you may want to stick to ISO8859-1 as that's the
default so it's probably what you've been using.

2. Tell the browser what you've chosen. The right answer is to set the
charset in the HTTP Content-type header; there are other ways if that's
difficult (there have been threads on this recently here).

3. Tell Perl what you want, and tell it to use HTML entities for
characters that don't exist in your chosen encoding:

    use Encode qw/:fallbacks/;

    $PerlIO::encoding::fallback = FB_HTMLCREF;
    binmode STDOUT, ':encoding(iso8859-1)';

Substitute the appropriate filehandle and encoding in the binmode call.
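
The same fallback can also be tried on a single string with Encode's encode(), which makes the effect easy to see. This is a sketch under assumptions: `html_bytes` is a hypothetical helper, the sample title is invented, and the Content-Type line presumes a CGI-style script.

```perl
use strict;
use warnings;
use Encode qw(encode :fallbacks);

# Hypothetical helper: encode $text for output in $charset, writing any
# character the charset lacks as a decimal HTML character reference.
sub html_bytes {
    my ($text, $charset) = @_;
    # encode() may modify its second argument when a CHECK value is
    # given, so it is handed the lexical copy made above.
    return encode($charset, $text, FB_HTMLCREF);
}

my $title = "Caf\x{E9} - Frank D\x{2019}Armata";

# Step 2: announce the charset; step 3: emit bytes in that charset.
print "Content-Type: text/html; charset=ISO-8859-1\r\n\r\n";
print html_bytes($title, 'iso-8859-1'), "\n";
```

With iso-8859-1 the e-acute goes out as the single byte E9 and only the quotation mark becomes &#8217;; with 'ascii' both would be written as references.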

> If you use A. Sinan Unur's solution, as far as I know you still have to 
> specify somewhere that the HTML document should be rendered as having 
> utf8, since &#8217; is just a fancy way of writing a UTF-8 character.

The encoding of the HTML document doesn't affect how numeric entities
are interpreted. They always refer to Unicode characters (note: not UTF8
bytes. &#195;&#169; does not mean &eacute;, even though those bytes
represent e-acute in UTF8).
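
To make the difference concrete, here is a small sketch. `expand_decimal_refs` is a made-up helper that handles decimal references only; it is not a substitute for a real entity decoder.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: expand decimal references (&#NNN;) only.
# Each number is taken as a Unicode code point.
sub expand_decimal_refs {
    my ($html) = @_;
    $html =~ s/&#(\d+);/chr($1)/ge;
    return $html;
}

# &#233; is e-acute (U+00E9): one character.
my $right = expand_decimal_refs('&#233;');

# &#195;&#169; is two characters, U+00C3 and U+00A9 ...
my $wrong = expand_decimal_refs('&#195;&#169;');

# ... even though the bytes C3 A9, decoded as UTF-8, are e-acute.
my $from_bytes = decode('UTF-8', "\xC3\xA9");

printf "%d %d %d\n", length $right, length $wrong, length $from_bytes;
```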

Ben

-- 
  The cosmos, at best, is like a rubbish heap scattered at random.
                                                           Heraclitus
  benmorrow@tiscali.co.uk


------------------------------

Date: Wed, 12 Jul 2006 00:29:38 +0200
From: Joachim Durchholz <jo@durchholz.org>
Subject: Re: What is a type error?
Message-Id: <e9191f$pkb$1@online.de>

Marshall wrote:
> Now, I'm not fully up to speed on DBC. The contract specifications,
> these are specified statically, but checked dynamically, is that
> right?

That's how it's done in Eiffel, yes.

> In other words, we can consider contracts in light of
> inheritance, but the actual verification and checking happens
> at runtime, yes?

Sure. Though, while DbC gives rules for inheritance (actually subtypes), 
these are irrelevant to the current discussion; DbC-minus-subtyping can 
still be usefully applied.
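
For readers who have not seen the idea, here is a loose sketch of a contract that is written next to the code but only checked when the routine runs. It is plain Perl rather than Eiffel, there is no subtyping involved, and the routine and its contract are invented for illustration.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical contract for an integer square root:
#   require: $n is a non-negative integer
#   ensure:  $r * $r <= $n < ($r + 1) * ($r + 1)
sub isqrt {
    my ($n) = @_;
    croak "precondition failed: n must be a non-negative integer"
        unless defined $n && $n =~ /\A\d+\z/;

    my $r = int sqrt $n;
    # Correct for floating-point rounding at the boundaries.
    $r-- while $r * $r > $n;
    $r++ while ($r + 1) * ($r + 1) <= $n;

    croak "postcondition failed for n=$n"
        unless $r * $r <= $n && $n < ($r + 1) * ($r + 1);
    return $r;
}

print isqrt(17), "\n";
print eval { isqrt(-1); 1 } ? "accepted\n" : "rejected: $@";
```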

> Wouldn't it be possible to do them at compile time? (Although
> this raises decidability issues.)

Exactly, and that's why you'd either use a restricted assertion 
language (and essentially get something that's somewhere between a type 
system and traditional assertion); or you'd use some inference system 
and try to help it along (not a simple thing either - the components of 
such a system exist, but I'm not aware of any system that was designed 
for the average programmer).

> Mightn't it also be possible to
> leave it up to the programmer whether a given contract
> was compile-time or runtime?

I'd agree with that, but I'm not sure how well that would hold up in 
practice.

Regards,
Jo


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc.  For subscription or unsubscription requests, send
#the single line:
#
#	subscribe perl-users
#or:
#	unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.  

NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice. 

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V10 Issue 9452
***************************************

