[28095] in Perl-Users-Digest
Perl-Users Digest, Issue: 9459 Volume: 10
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Wed Jul 12 18:10:15 2006
Date: Wed, 12 Jul 2006 15:10:08 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Wed, 12 Jul 2006 Volume: 10 Number: 9459
Today's topics:
Re: RSS feeds and HTML special characters <john@castleamber.com>
Re: RSS feeds and HTML special characters <john@castleamber.com>
Re: RSS feeds and HTML special characters <flavell@physics.gla.ac.uk>
Re: RSS feeds and HTML special characters <flavell@physics.gla.ac.uk>
Re: RSS feeds and HTML special characters <ermeyers@adelphia.net>
Re: RSS feeds and HTML special characters <john@castleamber.com>
Re: RSS feeds and HTML special characters <benmorrow@tiscali.co.uk>
Re: RSS feeds and HTML special characters <flavell@physics.gla.ac.uk>
Re: testing or detecting valid "date" data type <benmorrow@tiscali.co.uk>
Re: What is a type error? <marshall.spight@gmail.com>
Re: What is a type error? <jo@durchholz.org>
Re: What is a type error? <jo@durchholz.org>
Re: What is a type error? <dnew@san.rr.com>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: 12 Jul 2006 19:45:17 GMT
From: John Bokma <john@castleamber.com>
Subject: Re: RSS feeds and HTML special characters
Message-Id: <Xns97FE9616F7F2castleamber@130.133.1.4>
Ben Morrow <benmorrow@tiscali.co.uk> wrote:
> Quoth John Bokma <john@castleamber.com>:
[..]
>> Easiest solution is to output your HTML as utf8.
>
> You can just print things and hope, but it's much better (safer, more
> flexible and you don't get warnings) to do it right.
>
> 1. Decide what encoding you want to use. I generally use us-ascii,
> 'cos I *know* it's safe; you may want to stick to ISO8859-1 as that's
> the default so it's probably what you've been using.
The problem is: what is &#8217; in us-ascii or ISO8859-1? What happens
when you tell a browser: this is ISO8859-1 and next you ask it to decode
&#8217;? Does the browser always render in UTF mode?
--
John Bokma Freelance software developer
&
Experienced Perl programmer: http://castleamber.com/
------------------------------
Date: 12 Jul 2006 19:48:06 GMT
From: John Bokma <john@castleamber.com>
Subject: Re: RSS feeds and HTML special characters
Message-Id: <Xns97FE9690EA524castleamber@130.133.1.4>
"Alan J. Flavell" <flavell@physics.gla.ac.uk> wrote:
> Hope this clears things up a bit.
Thanks it does (i hope). So it's safe to assume that browsers handle HTML
internally as utf8 no matter how it's offered by the webserver? And hence
using &#number; with number outside the actual encoding of the HTML file
itself is perfectly legal?
--
John Bokma Freelance software developer
&
Experienced Perl programmer: http://castleamber.com/
------------------------------
Date: Wed, 12 Jul 2006 20:57:20 +0100
From: "Alan J. Flavell" <flavell@physics.gla.ac.uk>
Subject: Re: RSS feeds and HTML special characters
Message-Id: <Pine.LNX.4.64.0607122047460.9658@ppepc20.ph.gla.ac.uk>
On Wed, 12 Jul 2006, John Bokma wrote:
> The problem is: what is &#8217; in us-ascii or ISO8859-1?
No, that isn't the problem. The problem here, I'm afraid, is that you
/still/ show evidence that you don't understand this aspect of
character representation in HTML.
The characters ampersand, hash, 8 2 1 7 semi-colon *are*, after all,
all us-ascii characters, each and every one of them. Why do you
think that's a problem?
> What happens when you tell a browser: this is ISO8859-1 and next you
> ask it to decode &#8217;?
RFC2070 was published in what - 1997? It tells you what to do.
> Does the browser always render in UTF mode?
No. The browser is required to *behave as if* it understands Unicode.
How it does that internally is entirely its own affair (black box
model). Internally, it might work in EBCDIC DBCS, for all that the
web specifications care.
Or to take another in-principle shot, it *could* perfectly well look
up &#8217; in its tables and find that (amongst other things) in
Windows-1252 encoding it's 0x92.
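Both observations can be checked from Perl. What follows is only a minimal sketch, assuming nothing beyond the core Encode module; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The reference itself is seven plain us-ascii characters.
my $ref      = '&#8217;';
my $is_ascii = $ref !~ /[^\x00-\x7F]/ ? 1 : 0;

# 8217 decimal is U+2019 (RIGHT SINGLE QUOTATION MARK).
my $char = chr 8217;

# In Windows-1252 that same character is the single byte 0x92.
my $cp1252_byte = encode('cp1252', $char);

printf "ascii-only reference: %s\n", $is_ascii ? 'yes' : 'no';
printf "code point:           U+%04X\n", ord $char;
printf "cp1252 byte:          0x%02X\n", ord $cp1252_byte;
```

The reference names a Unicode character number; which byte (if any) stands for that character depends only on the encoding chosen afterwards.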
h t h
------------------------------
Date: Wed, 12 Jul 2006 21:19:21 +0100
From: "Alan J. Flavell" <flavell@physics.gla.ac.uk>
Subject: Re: RSS feeds and HTML special characters
Message-Id: <Pine.LNX.4.64.0607122104040.9658@ppepc20.ph.gla.ac.uk>
On Wed, 12 Jul 2006, John Bokma wrote:
> "Alan J. Flavell" <flavell@physics.gla.ac.uk> wrote:
>
> > Hope this clears things up a bit.
>
> Thanks it does (i hope).
Sorry, we seem to have overlapped in posting.
> So it's safe to assume that browsers handle HTML
> internally as utf8 no matter how it's offered by the webserver?
It's safe (now that NN4 is practically out of the way) to assume that
they will *understand* all three[1] representations of characters, no
matter what "character encoding scheme" (that's the current technical
term for it) is used for transmitting the data.
As I said in the other f'up - it's entirely up to the browser
developer just how they implement that, inside of their black box, as
long as it works as intended when viewed from the outside. In
practice, most browsers will work internally in some representation of
Unicode, but it's not a requirement.
[1] Those three representations being:
1. an encoded character itself, if the character in question can be
represented in the encoding scheme that's in use;
2. a numerical character reference (&#number; or &#xhexnumber;)
referring to the *Unicode* character number (irrespective of what
character encoding scheme is in use);
3. a character entity (&name;) if one is defined in HTML/4
> And hence using &#number; with number outside the actual encoding of
> the HTML file itself is perfectly legal?
Yes, absolutely, and it has been since at least RFC2070.
Unfortunately, the authors of NN4 don't seem to have understood
RFC2070. Good riddance to NN4.
What would be the point of defining &#number; notation if it did
nothing more than to duplicate the characters which could be encoded
in the encoding scheme used? There'd be little sense in that!
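The three representations can be produced from Perl roughly as follows. This is a sketch, not a recommendation: it assumes the core Encode module and its FB_HTMLCREF fallback, and the variable names are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "Frank D'Armata" with U+2019 as the apostrophe.
my $title = "Frank D\x{2019}Armata";

# 1. The encoded character itself: possible in UTF-8 (bytes E2 80 99).
my $as_utf8 = encode('UTF-8', $title);

# 2. A numeric character reference: the output is pure us-ascii,
#    whatever "charset" the document is later served with.
my $as_ncr = encode('ascii', $title,
                    Encode::FB_HTMLCREF | Encode::LEAVE_SRC);

# 3. A character entity: HTML/4 names U+2019 "rsquo".
(my $as_entity = $title) =~ s/\x{2019}/&rsquo;/g;

print "$as_ncr\n";       # Frank D&#8217;Armata
print "$as_entity\n";    # Frank D&rsquo;Armata
```

Forms 2 and 3 survive any encoding scheme that can carry us-ascii, which is why they are useful when the transmission encoding cannot represent the character directly.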
p.s As you may have noticed, this is one of my "special subjects". I
hope I haven't said anything to offend. But if it helps to shock some
readers out of a confidently- but mistakenly-held belief, it may have
done some good.
all the best.
------------------------------
Date: Wed, 12 Jul 2006 16:28:16 -0400
From: "Eric R. Meyers" <ermeyers@adelphia.net>
Subject: Re: RSS feeds and HTML special characters
Message-Id: <-vydnYgUl9L9xijZnZ2dnUVZ_o2dnZ2d@adelphia.com>
Hi Euro,
Re: HTML special characters in RSS feed. I answered you over in
perl.beginners, but I'm bringing this over here to where your action is.
eurosnob@gmail.com wrote:
> If this isn't the right place to post this, please point me in the
> right direction?
>
> I'm a relatively casual Perl programmer trying to implement an RSS feed
> into my personal site. I've got it working using a slightly modified
> example from the O'Reilly book (content syndication with RSS), but for
> one annoying caveat...
>
> If I load the feed in, say, Firefox, a title might look like this:
>
> <title>This Artist is Good - Frank D'Armata</title>
>
> When I use "View Source," the title string is actually:
>
> <title>This Artist is Good - Frank D&#8217;Armata</title>
>
> However, when I go to use the string from within Perl, I get a Warning,
> "Wide character in print", and gibberish printed where the special
> character sits:
>
> This Artist is Good - Frank Dâ€™Armata
>
> (That's a lowercase 'a' with an accent, the Euro symbol, and the
> trademark symbol, between D and Armata.)
>
> I'm sure there's a relatively simple fix, but I'm kind of lost at this
> point... Help?!
>
> Thanks!
You need to find the XML::Simple OPTIONS section called "NumericEscape"
which discusses an XMLout ability to output the high characters as "numeric
entities." I think that you need to use this "NumericEscape" option before
printing.
XML::Simple also mentions a "STRICT MODE" to automatically catch common
errors.
You might also want to read 'perldoc perluniintro' to learn how perl handles
characters internally.
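The symptoms described above can be reproduced, and the warning avoided, with core Perl alone. This is a hedged sketch: it assumes the feed title has already been decoded into a Perl character string, and the variable names are illustrative rather than taken from the original script.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $title = "This Artist is Good - Frank D\x{2019}Armata";

# What a reader sees when the UTF-8 bytes are taken for Windows-1252:
# E2 80 99 becomes a-circumflex, euro sign, trademark sign.
my $misread = decode('cp1252', encode('UTF-8', $title));

# Declaring the output encoding lets perl do the conversion itself and
# removes the "Wide character in print" warning.
binmode STDOUT, ':encoding(UTF-8)';
print "$title\n";
print "$misread\n";
```

Whether UTF-8 is the right output encoding depends on what the page or terminal is declared to use; the point is only that the encoding is stated rather than left to chance.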
I hope this helps.
Eric
------------------------------
Date: 12 Jul 2006 21:02:49 GMT
From: John Bokma <john@castleamber.com>
Subject: Re: RSS feeds and HTML special characters
Message-Id: <Xns97FEA33C7C95Acastleamber@130.133.1.4>
"Alan J. Flavell" <flavell@physics.gla.ac.uk> wrote:
> p.s As you may have noticed, this is one of my "special subjects". I
> hope I haven't said anything to offend. But if it helps to shock some
> readers out of a confidently- but mistakenly-held belief, it may have
> done some good.
It worked for me :-) Thanks.
--
John Bokma Freelance software developer
&
Experienced Perl programmer: http://castleamber.com/
------------------------------
Date: Wed, 12 Jul 2006 21:20:25 +0100
From: Ben Morrow <benmorrow@tiscali.co.uk>
Subject: Re: RSS feeds and HTML special characters
Message-Id: <952go3-a79.ln1@osiris.mauzo.dyndns.org>
Quoth John Bokma <john@castleamber.com>:
> "Alan J. Flavell" <flavell@physics.gla.ac.uk> wrote:
>
> > Hope this clears things up a bit.
>
> Thanks it does (i hope). So it's safe to assume that browsers handle HTML
> internally as utf8 no matter how it's offered by the webserver?
                ^^^^
*Unicode*, not UTF8. They are different: and when I say 'Unicode', I
don't mean UCS2 or UTF16 or whatever it is Java and Microsoft mean when
they say it.
Unicode is a big old list of characters, with a number for each one.
UTF8 is a means of representing a sequence of Unicode characters as a
sequence of 8-bit bytes, with certain desirable properties for those
characters with Unicode indices less than 128.
The difference is crucial: for instance, {SG,HT,X}ML use Unicode
character numbers directly in their escape mechanism, whereas for URIs
you have to first encode your Unicode characters into UTF8 bytes and
then escape those.
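The contrast can be made concrete for a single character. A minimal sketch using the core Encode module (the variable names are only for illustration):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{2019}";    # RIGHT SINGLE QUOTATION MARK

# Markup escape: the Unicode character number goes in directly.
my $markup_escape = sprintf '&#%d;', ord $char;

# URI escape: encode to UTF-8 bytes first, then percent-escape each byte.
my $uri_escape = join '',
    map { sprintf('%%%02X', ord($_)) } split //, encode('UTF-8', $char);

print "$markup_escape\n";    # &#8217;
print "$uri_escape\n";       # %E2%80%99
```

One character yields one markup reference but three percent-escapes, because the URI form escapes bytes, not characters.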
> And hence
> using &#number; with number outside the actual encoding of the HTML file
> itself is perfectly legal?
Yes. That, after all, is the whole point: to represent characters you
can't put directly in the document :).
Ben
--
Razors pain you / Rivers are damp
Acids stain you / And drugs cause cramp. [Dorothy Parker]
Guns aren't lawful / Nooses give
Gas smells awful / You might as well live. benmorrow@tiscali.co.uk
------------------------------
Date: Wed, 12 Jul 2006 22:55:32 +0100
From: "Alan J. Flavell" <flavell@physics.gla.ac.uk>
Subject: Re: RSS feeds and HTML special characters
Message-Id: <Pine.LNX.4.64.0607122248530.11891@ppepc20.ph.gla.ac.uk>
On Wed, 12 Jul 2006, Ben Morrow wrote:
> Quoth John Bokma <john@castleamber.com>:
> >
> > So it's safe to assume that browsers handle HTML
> > internally as utf8 no matter how it's offered by the webserver?
>                 ^^^^
> *Unicode*, not UTF8.
Good call; but also, I stand by what I said before, that the internal
workings are at the discretion of the implementer, as long as the
behaviour as seen from outside is correct.
> They are different:
Indeed they are conceptually of different categories.
> and when I say 'Unicode', I don't mean UCS2 or UTF16 or whatever it
> is Java and Microsoft mean when they say it.
It's very annoying that MS can present, on one and the same menu, one
entry that says "utf-8" and another that says "Unicode". It's as
logical as a menu that asks you to choose between oranges and fruit.
To understand this aspect of Unicode better, it's useful to read
chapter 2 of the Unicode specification,
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf
In particular, 2.5 Code Points, 2.6 Character Encoding Forms, and
2.7 Character Encoding Schemes.
> Unicode is a big old list of characters, with a number for each one.
Yes - what in the jargon is called a "coded character set". See the
Unicode glossary http://www.unicode.org/versions/Unicode4.0.0/b1.pdf
for details.
> UTF8 is a means of representing a sequence of Unicode characters as
> a sequence of 8-bit bytes,
Indeed, and it is one of the "character encoding schemes" of Unicode,
of which there are currently officially seven (some of which have
obsoleted the encodings previously called ucs2 and ucs4).
> The difference is crucial: for instance, {SG,HT,X}ML use Unicode
> character numbers directly in their escape mechanism, whereas for
> URIs you have to first encode your Unicode characters into UTF8
> bytes and then escape those.
Good point.
And just to hammer this in once again: that unfortunately-named
MIME parameter called "charset" specifies what we are now meant to
call a "character encoding scheme".
> > And hence using &#number; with number outside the actual encoding
> > of the HTML file itself is perfectly legal?
>
> Yes. That, after all, is the whole point: to represent characters
> you can't put directly in the document :).
Good stuff.
I was meaning to include a URL for further reading on this topic as it
relates to HTML (and, for the most part, also for XML-based markups).
In fact, the HTML/4.01 specification has quite a useful piece on this.
So I give you: http://www.w3.org/TR/REC-html40/charset.html
Hoping the audience haven't all gone to sleep yet,
best regards
------------------------------
Date: Wed, 12 Jul 2006 16:55:39 +0100
From: Ben Morrow <benmorrow@tiscali.co.uk>
Subject: Re: testing or detecting valid "date" data type
Message-Id: <rkifo3-f47.ln1@osiris.mauzo.dyndns.org>
Quoth "Jack" <jack_posemsky@yahoo.com>:
> Hi folks,
>
> I am looking to find some code that tests a string as being a valid
> "date" data type.. I tried this but it doesn't seem to work
>
> $ENV{TZ} = 'PST';
> use Date::Manip;
> $datetest = '1111';
> $date = ParseDate($datetest);
> print $date;
>
> This program returns: 1111010100:00:00
> which is not a valid date for example, obviously.
What makes you think that? It's midnight on the first of January
AD 1111. A perfectly sensible date, though maybe not what you meant.
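If what is wanted is to accept only one strict format rather than anything ParseDate can interpret, a hand-written check may be closer to the intent. The sketch below is an assumption about that intent: the YYYY-MM-DD format and the helper name is_valid_ymd are invented for the example and are not part of Date::Manip.

```perl
use strict;
use warnings;

# Hypothetical helper: true only for a real calendar date written as
# YYYY-MM-DD (Gregorian leap-year rules; the format is an assumption).
sub is_valid_ymd {
    my ($str) = @_;
    my ($y, $m, $d) = $str =~ /\A([0-9]{4})-([0-9]{2})-([0-9]{2})\z/
        or return 0;
    return 0 if $m < 1 || $m > 12 || $d < 1;

    my @days = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
    my $leap = ($y % 4 == 0 && $y % 100 != 0) || $y % 400 == 0;
    $days[1] = 29 if $leap;

    return $d <= $days[$m - 1] ? 1 : 0;
}

for my $candidate ('1111', '2006-07-12', '2006-02-30', '2004-02-29') {
    printf "%-12s %s\n", $candidate,
        is_valid_ymd($candidate) ? 'valid' : 'invalid';
}
```

Here '1111' is rejected because it does not match the chosen format, not because the year 1111 is impossible.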
Ben
--
It will be seen that the Erewhonians are a meek and long-suffering people,
easily led by the nose, and quick to offer up common sense at the shrine of
logic, when a philosopher convinces them that their institutions are not based
on the strictest morality. [Samuel Butler, paraphrased] benmorrow@tiscali.co.uk
------------------------------
Date: 12 Jul 2006 11:15:50 -0700
From: "Marshall" <marshall.spight@gmail.com>
Subject: Re: What is a type error?
Message-Id: <1152728150.911515.263640@s13g2000cwa.googlegroups.com>
Joachim Durchholz wrote:
> Marshall schrieb:
> > I can see the lack of a formal model being an issue, but is the
> > imperative bit really all that much of an obstacle? How hard
> > is it really to deal with assignment? Or does the issue have
> > more to do with pointers, aliasing, etc.?
>
> Actually aliasing is *the* hard issue.
Okay, sure. Nice explanation.
But one minor point: you describe this as an issue with "imperative"
languages. But aliasing is a problem associated with pointers,
not with assignment. One can have assignment, or other forms
of destructive update, without pointers; they are not part of the
definition of "imperative." (Likewise, one can have pointers without
assignment, although I'm less clear if the aliasing issue is as
severe.)
Marshall
------------------------------
Date: Wed, 12 Jul 2006 21:37:39 +0200
From: Joachim Durchholz <jo@durchholz.org>
Subject: Re: What is a type error?
Message-Id: <e93jb6$ll7$1@online.de>
Marshall schrieb:
> Joachim Durchholz wrote:
>> Marshall schrieb:
>>> I can see the lack of a formal model being an issue, but is the
>>> imperative bit really all that much of an obstacle? How hard
>>> is it really to deal with assignment? Or does the issue have
>>> more to do with pointers, aliasing, etc.?
>> Actually aliasing is *the* hard issue.
>
> Okay, sure. Nice explanation.
>
> But one minor point: you describe this as an issue with "imperative"
> languages. But aliasing is a problem associated with pointers,
> not with assignment.
Aliasing is not a problem if the aliased data is immutable.
> One can have assignment, or other forms
> of destructive update, without pointers; they are not part of the
> definition of "imperative."
Sure.
You can have either destructive updates or pointers on their own without
incurring aliasing problems. As soon as they are combined, there's trouble.
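In Perl terms, where references play the role of pointers, the combination looks like this. It is a small illustration only, with made-up variable names:

```perl
use strict;
use warnings;

# Two names for one mutable array: an alias created through a reference.
my @board = (1, 2, 3);
my $alias = \@board;

# A destructive update through the alias changes what @board holds.
$alias->[0] = 99;

# A copy shares nothing, so later updates to it stay local.
my @copy = @board;
$copy[1] = -1;

print "board: @board\n";    # board: 99 2 3
print "copy:  @copy\n";     # copy:  99 -1 3
```

Without the assignment through $alias the shared reference would be harmless, and without the reference the assignment could only affect the one name it was written against.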
Functional programming languages often drop assignment entirely. (This
is less inefficient than one would think. If everything is immutable,
you can freely share data structures and avoid some copying, and you can
share across abstraction barriers. In programs with mutable values,
programmers are forced to choose the lesser evil of either copying
entire data structures or doing a cross-abstraction analysis of who
updates what elements of what data structure. A concrete example: the
first thing that Windows does when accepting userland data structures
is... to copy them; this would be unnecessary if the structures were immutable.)
Some functional languages restrict assignment so that there can exist at
most a single reference to any mutable data structure. That way, there are
still no aliasing problems, but you can still update in place where it's
really, really necessary.
I know of no professional language that doesn't have references of some
kind.
Regards,
Jo
------------------------------
Date: Wed, 12 Jul 2006 21:44:53 +0200
From: Joachim Durchholz <jo@durchholz.org>
Subject: Re: What is a type error?
Message-Id: <e93joo$meo$1@online.de>
Darren New schrieb:
> There are also problems with the complexity of things. Imagine a
> chess-playing game trying to describe the "generate moves" routine.
> Precondition: An input board with a valid configuration of chess pieces.
> Postcondition: An array of boards with possible next moves for the
> selected team. Heck, if you could write those as assertions, you
> wouldn't need the code.
Actually, in a functional programming language (FPL), you write just the
postconditions and let the compiler generate the code for you.
At least that's what happens for those FPL functions that you write down
without much thinking. You can still tweak the function to make it more
efficient. Or you can define an interface using preconditions and
postconditions, and write a function that fulfills these assertions
(i.e. requires no more preconditions than the interface specifies, and
fulfills at least the postcondition that the interface specifies); here
we'd have a postcondition that's separate from the code, too.
I.e. in such cases, the postconditions separate the accidental and
essential properties of a function, so they still have a role to play.
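As a rough illustration of a postcondition kept separate from the code that fulfills it, here is a sketch in Perl rather than in an FPL. The sorting example and both sub names are hypothetical, chosen only because the postcondition is easy to state.

```perl
use strict;
use warnings;

# Hypothetical interface for "sort numbers ascending".
# Postcondition: the output is ordered and is a permutation of the input.
sub sort_postcondition {
    my ($in, $out) = @_;
    return 0 unless @$in == @$out;
    for my $i (1 .. $#$out) {
        return 0 if $out->[$i - 1] > $out->[$i];
    }
    my %count;
    $count{$_}++ for @$in;
    $count{$_}-- for @$out;
    my $mismatches = grep { $_ != 0 } values %count;
    return $mismatches ? 0 : 1;
}

# One implementation that is meant to fulfill the postcondition.
sub sort_numbers {
    my @in  = @_;
    my @out = sort { $a <=> $b } @in;
    die "postcondition violated" unless sort_postcondition(\@in, \@out);
    return @out;
}

print join(' ', sort_numbers(3, 1, 2)), "\n";    # 1 2 3
```

The postcondition says nothing about which algorithm is used; any implementation that passes it can be swapped in.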
Regards,
Jo
------------------------------
Date: Wed, 12 Jul 2006 20:16:55 GMT
From: Darren New <dnew@san.rr.com>
Subject: Re: What is a type error?
Message-Id: <XMctg.27137$uy3.7183@tornado.socal.rr.com>
Joachim Durchholz wrote:
> Actually, in a functional programming language (FPL), you write just the
> postconditions and let the compiler generate the code for you.
Certainly. And my point is that the postcondition describing "all valid
chess boards reachable from this one" is pretty much going to be as big
as an implementation for generating it, yes? The postcondition will
still have to contain all the rules of chess in it, for example. At best
you've replaced loops with some sort of universal quantifier with a
"such that" phrase.
Anyway, I expect you could prove you can't do this in the general case.
Otherwise, you could just write a postcondition that asserts the output
of your function is machine code that when run generates the same
outputs as the input string would. I.e., you'd have a compiler that can
write other compilers, generated automatically from a description of the
semantics of the input stream and the semantics of the machine the code
is to run on. I'm pretty sure we're not there yet, and I'm pretty sure
you start running into the limits of computability if you do that.
--
Darren New / San Diego, CA, USA (PST)
This octopus isn't tasty. Too many
tentacles, not enough chops.
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc. For subscription or unsubscription requests, send
#the single line:
#
# subscribe perl-users
#or:
# unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.
NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V10 Issue 9459
***************************************