[32547] in Perl-Users-Digest


Perl-Users Digest, Issue: 3812 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Mon Nov 5 18:14:22 2012

Date: Mon, 5 Nov 2012 15:14:10 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Mon, 5 Nov 2012     Volume: 11 Number: 3812

Today's topics:
    Re: Why "Wide character in print"? <ben@morrow.me.uk>
    Re: Why "Wide character in print"? <ben@morrow.me.uk>
    Re: Why "Wide character in print"? <rweikusat@mssgmbh.com>
    Re: Why "Wide character in print"? <rweikusat@mssgmbh.com>
    Re: Why "Wide character in print"? <hjp-usenet2@hjp.at>
    Re: Why "Wide character in print"? <ben@morrow.me.uk>
    Re: Why was suid support dropped in perl? <ben@morrow.me.uk>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Mon, 5 Nov 2012 19:10:46 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Why "Wide character in print"?
Message-Id: <miflm9-gh41.ln1@anubis.morrow.me.uk>


Quoth Helmut Richter <hhr-m@web.de>:
> On Mon, 29 Oct 2012, Peter J. Holzer wrote:
> 
> > However, for most programs you don't have to know that Perl character
> > strings are Unicode strings.
> 
> Are they? They are strings of characters that are contained in Unicode. They
> are not necessarily internally encoded as Unicode.  People run into problems
> when they make assumptions about the way they are implemented. I would have
> worded:
> 
>   For all programs you must not pretend to know that Perl character strings
>   are Unicode strings.

Perl strings are (nearly) always *Unicode* strings; this is not the same
as saying they are internally represented in any particular encoding of
Unicode. uc(chr(97)) eq chr(65), for instance, and uc(chr(0x450)) eq
chr(0x400). What you must (pretend to) not know is that they are
sometimes internally represented in UTF-8 and sometimes in ISO8859-1; in
principle the internal representation could be changed to UCS-4 or
something else sane without breaking anything. 
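
A minimal sketch of that claim, assuming any reasonably recent perl. The
variable names are only illustrative; utf8::upgrade is the core function
that switches a string to the UTF-8 internal form without changing which
characters it contains.

```perl
use strict;
use warnings;

# Case mapping follows Unicode rules, not any particular encoding.
my $ascii_ok    = uc(chr(97))    eq chr(65);     # 'a' -> 'A'
my $cyrillic_ok = uc(chr(0x450)) eq chr(0x400);  # Cyrillic ie with grave

# The same characters held in two different internal representations.
my $latin1 = "caf\xE9";
my $utf8   = $latin1;
utf8::upgrade($utf8);    # changes the storage, not the string

my $same_string = $latin1 eq $utf8;
my $same_length = length($latin1) == length($utf8);

print "ascii case mapping ok:    ", ($ascii_ok    ? "yes" : "no"), "\n";
print "cyrillic case mapping ok: ", ($cyrillic_ok ? "yes" : "no"), "\n";
print "upgrade keeps the string: ", ($same_string ? "yes" : "no"), "\n";
print "upgrade keeps the length: ", ($same_length ? "yes" : "no"), "\n";
```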

(In practice it would break XS, so it probably won't happen, which is a
shame. UTF-8 was a very bad choice of internal representation, in
retrospect, though it seemed to make sense at the time. It makes a great
many internal operations much more complicated than they need to be,
because you can no longer index into an array to find a particular
character in the string.)

Where it gets a bit sketchy is when you are dealing with characters >127
in strings which happen to be internally encoded as ISO8859-1 rather
than UTF-8 (which you shouldn't need to know about). For versions of
perl <5.12, uc(chr(255)) returns chr(255), which is incorrect, because
the correct answer is chr(376) which would require changing the internal
representation. This applies to all characters >127, even those where
the case-change exists in ISO8859-1, but only if the string happened to
be internally represented in ISO8859-1. 

This bug was fixed in 5.12, but for the sake of compatibility the fix is
(so far) only activated in the scope of 'use feature "unicode_strings"',
which is switched on by 'use 5.012'. As a result you may see code
mucking about with utf8::is_utf8 and so on, in an attempt to work around
this bug; a better fix is to upgrade to 5.12 and put 'use 5.012;' at the
top of each file.
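
The difference can be seen side by side in a small sketch like the one
below, assuming perl 5.12 or later. The explicit 'no feature' block is
only there so the comparison does not depend on whatever is enabled in
the surrounding file.

```perl
use strict;
use warnings;

my $y = chr(255);       # LATIN SMALL LETTER Y WITH DIAERESIS
utf8::downgrade($y);    # make sure it is stored as ISO8859-1 internally

my ($legacy, $fixed);
{
    no feature 'unicode_strings';
    $legacy = uc($y);   # old behaviour: character is left unchanged
}
{
    use feature 'unicode_strings';
    $fixed = uc($y);    # Unicode behaviour: maps to U+0178
}

printf "without unicode_strings: %d\n", ord($legacy);   # 255
printf "with unicode_strings:    %d\n", ord($fixed);    # 376
```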

If you need to manipulate strings of bytes, for instance for IO, you
simply represent a byte $b by the character chr($b), where 0 <= $b <=
255. If you attempt to do IO to a raw filehandle with a character with
an ordinal >255, you get a warning and perl does something stupid; I
agree with Peter it would be better for it to die in this case.
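
One way to get the dying behaviour today is to make the 'utf8' warning
category fatal. The following is a sketch under assumptions: a temporary
file stands in for the raw filehandle, and the variable names are made
up for illustration.

```perl
use strict;
use warnings;
use warnings FATAL => 'utf8';   # "Wide character" now dies instead of warning
use Encode qw(encode);
use File::Temp qw(tempfile);

my ($fh, $path) = tempfile(UNLINK => 1);
binmode($fh, ':raw');

# Bytes are just characters with ordinals 0..255; these print fine.
my $ok_bytes = join '', map { chr } 0, 127, 128, 255;
print {$fh} $ok_bytes;

# A character >255 cannot be a byte, so the print is refused.
my $text  = "smiley: " . chr(0x263A);
my $died  = !eval { print {$fh} $text; 1 };
my $error = $died ? $@ : '';

# Encoding explicitly turns the characters into bytes that can be written.
my $encoded = encode('UTF-8', $text);
print {$fh} $encoded;
close $fh or die "close failed: $!";

print "raw print died: ", ($died ? "yes" : "no"), "\n";
print "encoded length: ", length($encoded), " bytes\n";
```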

Ben



------------------------------

Date: Mon, 5 Nov 2012 19:40:55 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Why "Wide character in print"?
Message-Id: <7bhlm9-gh41.ln1@anubis.morrow.me.uk>


Quoth "Peter J. Holzer" <hjp-usenet2@hjp.at>:
>
> whether the string is a character string (with the UTF8 bit set) or a
> byte string, 

Careful. You're conflating the existing-only-in-the-programmer's-head
concept of 'do I consider this string to contain bytes for IO or
characters for manipulation' with the perl-internal SvUTF8 flag, which
is exactly the mistake we have been trying to stop people making since
5.8.0 was released and we realised the 3rd-Camel model where Perl keeps
track of the characters/bytes distinction isn't workable. It's entirely
possible and sensible for a 'byte string', that is, a string containing
only characters <256 intended for raw IO, to happen to have SvUTF8 set
internally, with byte values >127 represented as 2 bytes.
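
A short sketch of such a string, assuming a temporary file as the raw
output: the upgraded copy has the flag set, yet it compares equal to the
original and writes the same two bytes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $bytes = "\xE9\xFF";          # two bytes intended for raw IO
utf8::downgrade($bytes);         # stored one byte per character

my $upgraded = $bytes;
utf8::upgrade($upgraded);        # now stored as UTF-8 internally

my $flag_differs = utf8::is_utf8($upgraded) && !utf8::is_utf8($bytes);
my $still_equal  = $upgraded eq $bytes;

# Writing the upgraded copy to a raw handle still emits exactly two bytes.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode($fh, ':raw');
print {$fh} $upgraded;
close $fh or die "close failed: $!";
my $size = -s $path;

print "flags differ: ", ($flag_differs ? "yes" : "no"), "\n";
print "strings equal: ", ($still_equal ? "yes" : "no"), "\n";
print "bytes written: $size\n";   # 2
```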

Ben



------------------------------

Date: Mon, 05 Nov 2012 22:15:07 +0000
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: Why "Wide character in print"?
Message-Id: <87ehk7r91w.fsf@sapphire.mobileactivedefense.com>

Ben Morrow <ben@morrow.me.uk> writes:

[...]

> (In practice it would break XS, so it probably won't happen, which is a
> shame. UTF-8 was a very bad choice of internal representation, in
> retrospect, though it seemed to make sense at the time. It makes a great
> many internal operations much more complicated than they need to be,
> because you can no longer index into an array to find a particular
> character in the string.)

The only way to provide that is to store all characters as integer
values large enough to encompass all conceivably existing Unicode
codepoints. Otherwise, you're going to have multibyte characters and
consequently, 'indexing into the array to find a particular character
in the string' won't work anymore.

Independently of this, the UTF-8 encoding was designed to provide a
representation of the Unicode character set which was backwards
compatible with 'ASCII-based systems' and it is not only a widely
supported internet standard (http://tools.ietf.org/html/rfc3629) and
the method of choice for dealing with 'Unicode' for UNIX(*) and
similar systems but formed the 'basic character encoding' of complete
operating systems as early as 1992
(http://plan9.bell-labs.com/plan9/about.html). As such, supporting it
natively in a programming language closely associated with UNIX(*), at
least at that time, should have been pretty much a no brainer. "But
Microsoft did it differently !!1" is the ultimate argument for some
people but - thankfully - these didn't get to piss into Perl until
very much later and thus, the damage they can still do is mostly
limited to 'propaganda'.



------------------------------

Date: Mon, 05 Nov 2012 22:22:07 +0000
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: Why "Wide character in print"?
Message-Id: <87a9uvr8q8.fsf@sapphire.mobileactivedefense.com>

Rainer Weikusat <rweikusat@mssgmbh.com> writes:
> Ben Morrow <ben@morrow.me.uk> writes:
>
> [...]
>
>> (In practice it would break XS, so it probably won't happen, which is a
>> shame. UTF-8 was a very bad choice of internal representation, in
>> retrospect, though it seemed to make sense at the time. It makes a great
>> many internal operations much more complicated than they need to be,
>> because you can no longer index into an array to find a particular
>> character in the string.)
>
> The only way to provide that is to store all characters as integer
> values large enough to encompass all conceivably existing Unicode
> codepoints. Otherwise, you're going to have multibyte characters and
> consequently, 'indexing into the array to find a particular character
> in the string' won't work anymore.

I would also like to point out that this is an inherent deficiency of
the idea to represent all glyphs of all conceivable scripts with a
single encoding scheme and that the practical consequences of that are
mostly 'anything which restricts itself to the US typewriter character
set is fine' (and everyone else is going to have no end of problems
because of that).

I actually stopped using German characters like a-umlaut years ago
exactly because of this.


------------------------------

Date: Mon, 5 Nov 2012 23:42:11 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Why "Wide character in print"?
Message-Id: <slrnk9gg63.84i.hjp-usenet2@hrunkner.hjp.at>

On 2012-11-05 19:40, Ben Morrow <ben@morrow.me.uk> wrote:
> Quoth "Peter J. Holzer" <hjp-usenet2@hjp.at>:
>> whether the string is a character string (with the UTF8 bit set) or a
>> byte string, 
>
> Careful. You're conflating the existing-only-in-the-programmer's-head
> concept of 'do I consider this string to contain bytes for IO or
> characters for manipulation' with the perl-internal SvUTF8 flag, which
> is exactly the mistake we have been trying to stop people making since
> 5.8.0 was released

Who is "we"? Before 5.12, you had to make the distinction.
Strings without the SvUTF8 flag simply didn't have Unicode semantics.
Now there is the unicode_strings feature, but

 1) it still isn't default
 2) it will be years before I can rely on perl 5.12+ being installed on 
    a sufficient number of machines to use it. I'm not even sure if most
    of our machines have 5.10 yet (the Debian machines have, but most of
    the RHEL machines have 5.8.x)

So, that distinction has existed for at least 8 years (2002-07-18 to
2010-04-12) and for many of us it will exist for another few years.

So enforcing in the Perl code the concept I have in my head is simply
defensive programming.

> and we realised the 3rd-Camel model where Perl keeps track of the
> characters/bytes distinction isn't workable.

It worked for me ;-).

> It's entirely possible and sensible for a 'byte string', that is, a
> string containing only characters <256 intended for raw IO, to happen
> to have SvUTF8 set internally, with byte values >127 represented as 2
> bytes.

Theoretically yes. In practice it almost always means that the
programmer forgot to call encode() somewhere. 

And the other way around didn't work at all: You couldn't keep
characters > 127 but < 256 in a string without the SvUTF8 flag set
and expect it to work.
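
The discipline this implies (decode on input, work on characters, encode
on output) might look like the following sketch. The sample data and
variable names are assumptions chosen for illustration; decode and
encode are the Encode module's functions.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file or socket: UTF-8 for "grüße".
my $input_bytes = "gr\xC3\xBC\xC3\x9Fe";

# Decode once at the boundary; from here on we have characters.
my $text  = decode('UTF-8', $input_bytes);
my $upper = uc($text);           # sharp s uppercases to "SS"

# Encode once at the other boundary; forgetting this step is the usual
# way a flagged string ends up on a raw filehandle.
my $output_bytes = encode('UTF-8', $upper);

printf "input:  %d bytes, %d characters\n", length($input_bytes), length($text);
printf "output: %d bytes, %d characters\n", length($output_bytes), length($upper);
```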

	hp


-- 
   _  | Peter J. Holzer    | Curse of electronic word processing:
|_|_) | Sysadmin WSR       | You keep polishing around at your text until
| |   | hjp@hjp.at         | the sentence parts of the sentence no longer
__/   | http://www.hjp.at/ | fits together. -- Ralph Babel


------------------------------

Date: Mon, 5 Nov 2012 22:48:12 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Why "Wide character in print"?
Message-Id: <caslm9-fl61.ln1@anubis.morrow.me.uk>


Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
> Ben Morrow <ben@morrow.me.uk> writes:
> 
> > (In practice it would break XS, so it probably won't happen, which is a
> > shame. UTF-8 was a very bad choice of internal representation, in
> > retrospect, though it seemed to make sense at the time. It makes a great
> > many internal operations much more complicated than they need to be,
> > because you can no longer index into an array to find a particular
> > character in the string.)
> 
> The only way to provide that is to store all characters as integer
> values large enough to encompass all conceivably existing Unicode
> codepoints. Otherwise, you're going to have multibyte characters and
> consequently, 'indexing into the array to find a particular character
> in the string' won't work anymore.

Yes. That's called 'a 32-bit int', and is the standard wchar_t C
representation of Unicode. A sensible alternative would be a 1/2/4-byte
upgrade scheme somewhat similar to the current Perl scheme, but with all
the alternatives being constant width; a smarter alternative would be to
represent a string as a series of pieces, each of which could make a
different choice (and, potentially, some of which could be shared or CoW
with other strings).
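
To make the indexing argument concrete, here is a sketch at the byte
level using encoded buffers. It does not show perl's actual internals;
the helper names char_at_utf32 and offset_utf8 are hypothetical. With a
constant-width form the position of character N is a multiplication,
while with UTF-8 it has to be found by scanning from the start.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Characters that need 1, 2, 3, 4 and 1 bytes in UTF-8.
my $text  = "a\x{E9}\x{263A}\x{1F600}z";
my $utf32 = encode('UTF-32LE', $text);
my $utf8  = encode('UTF-8',    $text);

# Constant width: character $i lives at byte offset 4 * $i.
sub char_at_utf32 {
    my ($buf, $i) = @_;
    return unpack 'V', substr($buf, 4 * $i, 4);
}

# Variable width: walk forward, skipping continuation bytes (10xxxxxx).
sub offset_utf8 {
    my ($buf, $i) = @_;
    my $pos = 0;
    while ($i-- > 0) {
        $pos++;
        $pos++ while $pos < length($buf)
                  && (ord(substr($buf, $pos, 1)) & 0xC0) == 0x80;
    }
    return $pos;
}

printf "UTF-32: char 3 is U+%04X, found at offset %d\n",
    char_at_utf32($utf32, 3), 4 * 3;
printf "UTF-8:  char 3 starts at offset %d after a scan\n",
    offset_utf8($utf8, 3);
```

The price of the constant-width form is space: the UTF-32 buffer here is
20 bytes against 11 for UTF-8.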

> Independently of this, the UTF-8 encoding was designed to provide a
> representation of the Unicode character set which was backwards
> compatible with 'ASCII-based systems' and it is not only a widely
> supported internet standard (http://tools.ietf.org/html/rfc3629) and
> the method of choice for dealing with 'Unicode' for UNIX(*) and
> similar systems but formed the 'basic character encoding' of complete
> operating systems as early as 1992
> (http://plan9.bell-labs.com/plan9/about.html).

There is a very big difference between a sensible *internal*
representation and a sensible *external* representation. UTF-8 was
designed as an external representation; it's extremely good (in my
narrow, Western, English-speaking opinion) for that purpose. It was
never intended to be used internally, except by applications which
didn't attempt to decode it to characters.

But then, you've never really understood the concept of abstraction,
have you?

> As such, supporting it
> natively in a programming language closely associated with UNIX(*), at
> least at that time, should have been pretty much a no brainer. "But
> Microsoft did it differently !!1" is the ultimate argument for some
> people but - thankfully - these didn't get to piss into Perl until
> very much later and thus, the damage they can still do is mostly
> limited to 'propaganda'.

I don't know what Win32's internal representation is (I suspect 32bit
int, the same as Unix), but its default external representation is
UTF-16, which is about the most braindead concoction anyone has ever
come up with. The only possible justification for its existence is
backwards-compatibility with systems which started implementing Unicode
before it was finished, and even then I'm *certain* they could have made
it less grotesquely ugly if they'd tried (a UTF-8-like scheme, for
instance).
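
For readers who haven't looked at it closely: UTF-16 is variable width
as well, since anything above U+FFFF takes two 16-bit units (a surrogate
pair). A small sketch using the Encode module, with illustrative
variable names:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Split the encoded form into 16-bit big-endian code units.
my @units_bmp    = unpack 'n*', encode('UTF-16BE', chr(0x263A));
my @units_astral = unpack 'n*', encode('UTF-16BE', chr(0x1F600));

printf "U+263A  -> %s\n", join ' ', map { sprintf '%04X', $_ } @units_bmp;
printf "U+1F600 -> %s\n", join ' ', map { sprintf '%04X', $_ } @units_astral;
# U+263A  -> 263A
# U+1F600 -> D83D DE00
```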

So no, my comments about the unsuitability of UTF-8 as an internal
encoding have nothing whatever to do with Win32, and everything to do
with actually understanding how string operations work at the machine
level.

Ben



------------------------------

Date: Mon, 5 Nov 2012 20:29:14 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Why was suid support dropped in perl?
Message-Id: <q5klm9-5q41.ln1@anubis.morrow.me.uk>


Quoth "shrike@cyberspace.org" <shrike@cyberspace.org>:
> 
> Because then I would have to support public key rhost based
> authentication for sshd, which is an even worse proposition than
> supporting sudo. If I touch _anything_ else, I own it. All I can
> reasonably expect to secure or support is _my_ code. This is the basic
> reality of software support. 
> 
> My concern is not whether _I_ can use it. My concern is whether somebody
> else can use it by following a short set of instructions. "chmod +s"
> works. sudo or rhost+sshd is a 3 hour support call. And I'm not going to
> tell somebody to turn on remote access for the root account for sshd,
> when I have no reasonable expectation that they understand the
> consequences of doing so. 

If you are dealing with people that incompetent, you cannot reasonably
expect them to understand the security consequences of having suidperl
installed, either.

Ben



------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 3812
***************************************

