[32548] in Perl-Users-Digest

Perl-Users Digest, Issue: 3813 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Tue Nov 6 21:09:25 2012

Date: Tue, 6 Nov 2012 18:09:11 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Tue, 6 Nov 2012     Volume: 11 Number: 3813

Today's topics:
    Re: array <kst-u@mib.org>
    Re: array <ben@morrow.me.uk>
    Re: Clear the "Wide character in print" warning and lea jidanni@jidanni.org
    Re: lerning perl <graham.stow@stowassocs.co.uk>
        Problems with tabulations <luca.francesca01@gmail.com>
    Re: Problems with tabulations <glex_no-spam@qwest-spam-no.invalid>
    Re: Problems with tabulations <ben@morrow.me.uk>
        STDOUT and STDERR from command lines <juliani.moon@gmail.com>
    Re: STDOUT and STDERR from command lines <jimsgibson@gmail.com>
    Re: STDOUT and STDERR from command lines <ben@morrow.me.uk>
    Re: STDOUT and STDERR from command lines <Wasell@example.invalid>
    Re: Why "Wide character in print"? <rweikusat@mssgmbh.com>
    Re: Why "Wide character in print"? <ben@morrow.me.uk>
    Re: Why "Wide character in print"? <rweikusat@mssgmbh.com>
    Re: Why "Wide character in print"? <rweikusat@mssgmbh.com>
    Re: Why "Wide character in print"? <hjp-usenet2@hjp.at>
    Re: Why "Wide character in print"? <hjp-usenet2@hjp.at>
    Re: Why "Wide character in print"? <ben@morrow.me.uk>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Tue, 06 Nov 2012 11:12:46 -0800
From: Keith Thompson <kst-u@mib.org>
Subject: Re: array
Message-Id: <ln1ug6mtox.fsf@nuthaus.mib.org>

"Peter J. Holzer" <hjp-usenet2@hjp.at> writes:
> On 2012-11-04 16:12, Dr.Ruud <rvtol+usenet@xs4all.nl> wrote:
>> On 2012-11-04 06:18, Shmuel (Seymour J.) Metz wrote:
>>> In <5094feb7$0$6916$e4fe514c@news2.news.xs4all.nl>, on 11/03/2012
>>>     at 12:23 PM, "Dr.Ruud" <rvtol+usenet@xs4all.nl> said:
>>
>>>> In Perl, 'for' and 'foreach' are the same.
>>>
>>> "for loops" in perlsyn documents the C-style for loop as distinct from
>>> foreach.
>>
>>  From perlsyn:
>>
>> The "foreach" keyword is actually a synonym for the "for" keyword, so
>> you can use "foreach" for readability or "for" for brevity.
>
> I very much suspect that Shmuel knows this.
>
> But the documentation still calls this style of loops "Foreach Loops"
> and the C-style loops "For loops". 
[...]

I've always found that to be a (rather minor) annoyance in Perl.

If I had designed the language, "for" would be used only for C-style
loops, and "foreach" only for the list form.  It's great that there's
more than one way to do it, but I don't see the benefit of having
more than one way to spell it.

Of course changing it now would break existing code.

(Then again, if I had designed the language you probably wouldn't
be using it.)

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
    Will write code for food.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"


------------------------------

Date: Tue, 6 Nov 2012 20:53:14 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: array
Message-Id: <qu9om9-igg.ln1@anubis.morrow.me.uk>


Quoth Keith Thompson <kst-u@mib.org>:
> 
> I've always found that to be a (rather minor) annoyance in Perl.
> 
> If I had designed the language, "for" would be used only for C-style
> loops, and "foreach" only for the list form.  It's great that there's
> more than one way to do it, but I don't see the benefit of having
> more than one way to spell it.

That's bad Huffman coding: for (LIST) is one of the more common control
structures, for (EXPR;EXPR;EXPR) is only there to keep C programmers
happy. If *I* were designing the language, C-style 'for' wouldn't be
present at all; the differences from 'while' are not, IMHO, significant
enough to warrant a separate control structure. I don't believe I have
ever used C-style 'for' in Perl.

(FWIW, Perl 1 had only C-style for. Perl 2 introduced Perl-style
foreach, with 'for' as an alias, but except for mentioning the
equivalence the documentation used 'foreach' throughout.)

Ben
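
As a small illustration of the spelling question in this thread (a sketch, not taken from any of the posts; the @items list and the result arrays are made up for the example): both keywords accept both loop forms, and a C-style loop can be rewritten with 'while'.

```perl
use strict;
use warnings;

my @items = (1, 2, 3);    # hypothetical data

# List form, spelled both ways.
my (@via_for, @via_foreach);
for     my $x (@items) { push @via_for,     $x * 2 }
foreach my $x (@items) { push @via_foreach, $x * 2 }

# C-style form; 'foreach' is accepted here as well.
my @c_style;
foreach (my $i = 0; $i < @items; $i++) { push @c_style, $items[$i] * 2 }

# The same C-style loop rewritten with 'while'.
my @via_while;
my $i = 0;
while ($i < @items) {
    push @via_while, $items[$i] * 2;
    $i++;
}

print "@via_for | @via_foreach | @c_style | @via_while\n";
```

One difference the 'while' rewrite hides: in the C-style form, 'next' still runs the increment expression, whereas in the plain 'while' version it would skip the $i++ at the bottom of the body.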



------------------------------

Date: Tue, 06 Nov 2012 11:50:33 +0800
From: jidanni@jidanni.org
Subject: Re: Clear the "Wide character in print" warning and leave the output unmangled
Message-Id: <871ug7jsom.fsf@jidanni.org>

Thanks everybody. I have elected the simple solution, s/\N{U+200E}//g;
And I now removed any binmode, etc. so as to just deal with good old
fashioned bytes. Much simpler than attempting any 'correct' solution.

I suppose I should have realized that since there was only one wide
character warning, just one input line was causing the warning after
all... Thanks to Peter J. Holzer for waking me up to the fact!
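
A hedged sketch of what stripping U+200E can look like, assuming the input arrives as raw UTF-8 bytes (the sample string is made up). The \N{U+200E} pattern matches a character, so it only applies once the data has been decoded; on undecoded bytes the equivalent is to remove the three-byte UTF-8 sequence E2 80 8E.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: UTF-8 bytes for "abc", U+200E (LEFT-TO-RIGHT MARK), "def".
my $bytes = "abc\xE2\x80\x8Edef";

# Variant 1: stay with bytes and delete the UTF-8 byte sequence directly.
(my $clean_via_bytes = $bytes) =~ s/\xE2\x80\x8E//g;

# Variant 2: decode to characters, strip the mark, encode back to bytes.
my $chars = decode('UTF-8', $bytes);
$chars =~ s/\N{U+200E}//g;
my $clean_via_chars = encode('UTF-8', $chars);

print "$clean_via_bytes\n$clean_via_chars\n";
```

Both variants give the same bytes here; the second one generalises more easily if other characters need the same treatment later.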


------------------------------

Date: Tue, 6 Nov 2012 11:30:25 -0000
From: "Graham" <graham.stow@stowassocs.co.uk>
Subject: Re: lerning perl
Message-Id: <vbqdnT-weMZFaQXNnZ2dnUVZ7rudnZ2d@bt.com>


"Ben Morrow" <ben@morrow.me.uk> wrote in message 
news:7gklm9-5q41.ln1@anubis.morrow.me.uk...
>
> Quoth "Peter J. Holzer" <hjp-usenet2@hjp.at>:
>>
>> I don't think you can learn programming without a language (that would
>> be like learning to write novels without a language).
>
> Turing, Church and so on did; but most of us aren't in that category.
>
>> (I'm not even sure if everybody can learn to program: Some people just
>> don't seem to have the knack)
>
> No. It's just like being a musician, or an artist: some people can do
> it, some can't, and if you can't you might be able (with a lot of work)
> to learn enough of the basics to bang out something approximately
> reasonable, but you'll never do it well.
>
> Ben
>
Interesting. By profession and training, I'm a construction industry 
surveyor/estimator and I was trained to think logically (and I probably do 
naturally anyway). It's only a short jump from that to programming. I've 
played with Perl for the best part of 15 years now and, with a bit of help
from people here, have actually got half decent - see recent results at
http://guitartabs2notes.cleverpages.co.uk (and particular thanks to Ben for 
his help with that).

Graham 




------------------------------

Date: Tue, 6 Nov 2012 13:30:46 -0800 (PST)
From: Luca Francesca <luca.francesca01@gmail.com>
Subject: Problems with tabulations
Message-Id: <51412fff-3bc0-4886-945b-c2b7bc2401f1@googlegroups.com>

Hello.
I've a program to extract some data from a file. ( http://pastebin.com/f3M6LuQw ).
The output of the program is not as nice as I want (I used \t, but it isn't working well).

Any idea about a fix??

Thanks and good day all.

Luca


------------------------------

Date: Tue, 06 Nov 2012 15:47:35 -0600
From: "J. Gleixner" <glex_no-spam@qwest-spam-no.invalid>
Subject: Re: Problems with tabulations
Message-Id: <50998577$0$52253$815e3792@news.qwest.net>

On 11/06/12 15:30, Luca Francesca wrote:
> Hello.
> I've a program to extract some data from a file. ( http://pastebin.com/f3M6LuQw ).
> The output of the program is not as nice as I want (I used \t, but it isn't working well).
>
> Any idea about a fix??


Pretty poor code, but to answer your question: modify the output
produced by 'print' as needed, e.g. use printf instead of relying on
tabs.

e.g. this line:

print "Service \t $service \n";

might produce nicer output as

printf "%-15s %s\n", 'Service', $service;

For other options and to explain what the '-' does, see:
perldoc -f sprintf
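
A slightly fuller sketch of the same idea, assuming a hypothetical %record hash in place of whatever the pastebin script extracts: the '-' left-justifies the label and the 15 pads it to a fixed width, so the values line up regardless of label length.

```perl
use strict;
use warnings;

# Hypothetical data standing in for the values the real script extracts.
my %record = (Service => 'httpd', Port => 80, Status => 'running');

my @lines;
for my $label (qw(Service Port Status)) {
    # %-15s: left-justified, padded with spaces to 15 columns.
    push @lines, sprintf "%-15s %s\n", $label, $record{$label};
}
print @lines;
```

Tabs only align when every label falls within the same tab stop; a fixed-width field does not depend on the terminal's tab settings.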


------------------------------

Date: Tue, 6 Nov 2012 23:32:01 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Problems with tabulations
Message-Id: <h8jom9-e0i.ln1@anubis.morrow.me.uk>


Quoth "J. Gleixner" <glex_no-spam@qwest-spam-no.invalid>:
> On 11/06/12 15:30, Luca Francesca wrote:
> > Hello.
> > I've a program to extract some data from a file.
> > ( http://pastebin.com/f3M6LuQw ).
> > The output of the program is not as nice as I want (I used \t, but it
> > isn't working well).
> >
> > Any idea about a fix??
> 
> 
> Pretty poor code, but to answer your question: modify the output
> produced by 'print' as needed, e.g. use printf instead of relying on
> tabs.
> 
> e.g. this line:
> 
> print "Service \t $service \n";
> 
> might produce nicer output as
> 
> printf "%-15s %s\n", 'Service', $service;

If you're producing formatted ASCII output, you might consider using
formats.

<duck>

(Or Perl6::Form, which is nicer and saner, and not really Perl-6-related
at all.)

Ben
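
For readers who have not met formats, a rough sketch of the picture-line idea, using formline and the $^A accumulator (the machinery behind write) so the result can be captured in a string. The swrite helper follows the one shown in perlform; the field widths and sample values are made up.

```perl
use strict;
use warnings;

# swrite() is to write() what sprintf() is to printf() (see perlform).
sub swrite {
    my $picture = shift;
    $^A = '';                  # clear the format accumulator
    formline($picture, @_);    # fill the picture line with the values
    my $text = $^A;
    $^A = '';
    return $text;
}

# '@' starts a field, each '<' widens it by one column and left-justifies.
my $picture = '@<<<<<<<<<<<<<< @<<<<<<<<<<<<<<<<<<<' . "\n";

my $report = '';
$report .= swrite($picture, 'Service', 'httpd');
$report .= swrite($picture, 'Port',    80);
print $report;
```

A real format declaration ("format STDOUT = ... .") uses the same picture syntax but writes straight to a filehandle via write.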



------------------------------

Date: Mon, 5 Nov 2012 15:44:10 -0800 (PST)
From: Joe <juliani.moon@gmail.com>
Subject: STDOUT and STDERR from command lines
Message-Id: <ed1929a0-b112-4453-a747-ebc3651ef561@googlegroups.com>

I wish to have an easy way to capture STDERR and STDOUT from a command line
(system("$cmd") or `$cmd`).  The methods given in the Perl FAQ are complicated
(http://perldoc.perl.org/perlfaq8.html#How-can-I-capture-STDERR-from-an-external-command).
However, the "Net::SSH::Perl" module has a very nice way to capture them, as in:

  my($stdout, $stderr, $exit) = $ssh->cmd($cmd);

I wonder whether there is a function/module that allows something like:

  my($stdout, $stderr, $exit) = `$cmd`;

I have searched but failed to find one so far.  Any advice would be appreciated.

joe


------------------------------

Date: Mon, 05 Nov 2012 16:30:47 -0800
From: Jim Gibson <jimsgibson@gmail.com>
Subject: Re: STDOUT and STDERR from command lines
Message-Id: <051120121630476402%jimsgibson@gmail.com>

In article <ed1929a0-b112-4453-a747-ebc3651ef561@googlegroups.com>, Joe
<juliani.moon@gmail.com> wrote:

> I wish to have an easy way to capture STDERR and STDOUT from a command line
> (system("$cmd") or `$cmd`).  The methods given in the Perl FAQ are complicated
> (http://perldoc.perl.org/perlfaq8.html#How-can-I-capture-STDERR-from-an-external-command).
> However, the "Net::SSH::Perl" module has a very nice way to capture them,
> as in:
> 
>   my($stdout, $stderr, $exit) = $ssh->cmd($cmd);
> 
> I wonder is there a function/module to allow something like:
> 
>   my($stdout, $stderr, $exit) = `$cmd`;
> 
> I did search but fail to find so far.  Any advice would be appreciated.

The methods given in the FAQ are complicated, but perhaps that
complexity is required. I would try one of the following and see how
far I got:

IPC::Open3
IPC::Run
IPC::Run3
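
A minimal sketch of the first option, assuming a Unix-like system (select on pipes is not portable to Windows). The run_capture helper is a hypothetical name, not part of IPC::Open3; it returns the three values asked for in the question.

```perl
use strict;
use warnings;
use IPC::Open3 qw(open3);
use IO::Select;
use Symbol qw(gensym);

# Run a command given as a list (no shell involved) and return
# its STDOUT, its STDERR and its exit status.
sub run_capture {
    my @cmd = @_;
    my $err = gensym;                  # open3 needs a real glob for STDERR
    my $pid = open3(my $in, my $out, $err, @cmd);
    close $in;                         # the child sees EOF on its STDIN

    my %buf  = (out => '', err => '');
    my %name = (fileno($out) => 'out', fileno($err) => 'err');
    my $sel  = IO::Select->new($out, $err);

    # Read both pipes as data arrives, so neither can fill up and block.
    while ($sel->count) {
        for my $fh ($sel->can_read) {
            my $n = sysread $fh, my $chunk, 8192;
            if ($n) { $buf{ $name{ fileno($fh) } } .= $chunk }
            else    { $sel->remove($fh); close $fh }
        }
    }
    waitpid $pid, 0;
    return ($buf{out}, $buf{err}, $? >> 8);
}

# $^X is the running perl, used here as a stand-in external command.
my ($stdout, $stderr, $exit) = run_capture(
    $^X, '-e', 'print "to stdout\n"; print STDERR "to stderr\n"; exit 3'
);
print "exit=$exit stdout=$stdout stderr=$stderr";
```

Reading both handles through IO::Select is the part that makes this longer than backticks: reading one pipe to EOF before touching the other can deadlock if the child fills the second pipe first.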

-- 
Jim Gibson


------------------------------

Date: Tue, 6 Nov 2012 02:49:24 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: STDOUT and STDERR from command lines
Message-Id: <keamm9-as81.ln1@anubis.morrow.me.uk>


Quoth Joe <juliani.moon@gmail.com>:
> I wish to have an easy way to capture STDERR and STDOUT from a command
> line (system("$cmd") or `$cmd`).  The methods given in the Perl FAQ are
> complicated
> (http://perldoc.perl.org/perlfaq8.html#How-can-I-capture-STDERR-from-an-external-command).
> However, the "Net::SSH::Perl" module has a very nice way to capture them,
> as in:
> 
>   my($stdout, $stderr, $exit) = $ssh->cmd($cmd);
> 
> I wonder is there a function/module to allow something like:
> 
>   my($stdout, $stderr, $exit) = `$cmd`;

Capture::Tiny.

Ben



------------------------------

Date: Tue, 6 Nov 2012 10:30:31 +0100
From: Wasell <Wasell@example.invalid>
Subject: Re: STDOUT and STDERR from command lines
Message-Id: <MPG.2b02f700258714c2989694@news.eternal-september.org>

On Tue, 6 Nov 2012 02:49:24 +0000, in article <keamm9-as81.ln1
@anubis.morrow.me.uk>, Ben Morrow wrote:
> Quoth Joe <juliani.moon@gmail.com>:
[...]
> > I wonder is there a function/module to allow something like:
> > 
> >   my($stdout, $stderr, $exit) = `$cmd`;
> 
> Capture::Tiny.

Or Backticks: 
<http://search.cpan.org/~kilna/Backticks-v1.0.9/lib/Backticks.pm>

/Wasell


------------------------------

Date: Mon, 05 Nov 2012 23:30:13 +0000
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: Why "Wide character in print"?
Message-Id: <87y5ifpr0a.fsf@sapphire.mobileactivedefense.com>

Ben Morrow <ben@morrow.me.uk> writes:
> Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
>> Ben Morrow <ben@morrow.me.uk> writes:

[...]

> But then, you've never really understood the concept of abstraction,
> have you?

This mostly means that I cannot possibly be a self-conscious human
being capable of interacting with the world in some kind of
'intelligent' (meaning, influencing it such that it changes according
to some desired outcome) way but must be some kind of lifeform below
the level of a dog or a bird. Yet, I'm capable of using written
language to communicate with you (with some difficulties), using a
computer connected to 'the internet' in order to run a program on a
completely different computer 9 miles away from my present location,
utilizing a server I have to pay for once a year from my bank account
which resides (AFAIK) in Berlin.

How can this possibly be?


------------------------------

Date: Mon, 5 Nov 2012 23:30:11 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Why "Wide character in print"?
Message-Id: <3pulm9-3871.ln1@anubis.morrow.me.uk>


Quoth "Peter J. Holzer" <hjp-usenet2@hjp.at>:
> On 2012-11-05 19:40, Ben Morrow <ben@morrow.me.uk> wrote:
> > Quoth "Peter J. Holzer" <hjp-usenet2@hjp.at>:
> >> whether the string is character string (with the UTF8 bit set) or a byte
> >> string, 
> >
> > Careful. You're conflating the existing-only-in-the-programmer's-head
> > concept of 'do I consider this string to contain bytes for IO or
> > characters for manipulation' with the perl-internal SvUTF8 flag, which
> > is exactly the mistake we have been trying to stop people making since
> > 5.8.0 was released
> 
> Who is "we"? 

TINW

> Before 5.12, you had to make the distinction.
> Strings without the SvUTF8 flag simply didn't have Unicode semantics.
> Now there is the unicode_strings feature, but
> 
>  1) it still isn't default
>  2) it will be years before I can rely on perl 5.12+ being installed on 
>     a sufficient number of machines to use it. I'm not even sure if most
>     of our machines have 5.10 yet (the Debian machines have, but most of
>     the RHEL machines have 5.8.x)
> 
> So, that distinction has at least existed for 8 years (2002-07-18 to
> 2010-04-12) and for many of us it will exist for another few years.
> 
> So enforcing the concept I have in my head in the Perl code is simply
> defensive programming.

That is all true, and was and is a major problem to those who cared.
However, I was referring to the other half of the problem: Perl no
longer attempts to make any guarantees about the state of the SvUTF8
flag. Any operation might in principle up- or downgrade a string, even
if it wasn't obvious it would need to. This means it isn't safe to store
user data like 'this string is supposed to represent bytes' in that flag
and expect that it will be preserved, and it isn't safe to assume
strings returned from arbitrary functions will have the flag set the way
you expect.

If you want reliable Unicode semantics for Latin-1 characters before
5.12 you have to explicitly utf8::upgrade before each potentially
Unicode-aware operation.
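
A small sketch of the behaviour described above, with the legacy semantics requested explicitly (the "no feature" line needs perl 5.12 or later to parse; on older perls it can simply be dropped). The variable names are made up. The two strings compare equal, yet character-class and case operations treat them differently depending on the internal flag.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # make the pre-5.12 semantics explicit

my $native   = "\xE9";          # e-acute, stored as one Latin-1 byte
my $upgraded = $native;
utf8::upgrade($upgraded);       # same character, UTF-8 internally, flag set

my $same_string   = $native eq $upgraded     ? 1 : 0;
my $flag_native   = utf8::is_utf8($native)   ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

# Without Unicode semantics, the non-upgraded copy is not a word character
# and has no uppercase form; the upgraded copy behaves as U+00E9.
my $word_native   = $native   =~ /\w/ ? 1 : 0;
my $word_upgraded = $upgraded =~ /\w/ ? 1 : 0;
my $uc_native     = uc $native;
my $uc_upgraded   = uc $upgraded;

printf "eq=%d flags=%d/%d \\w=%d/%d uc=%02X/%02X\n",
    $same_string, $flag_native, $flag_upgraded,
    $word_native, $word_upgraded, ord $uc_native, ord $uc_upgraded;
```

With "use feature 'unicode_strings'" in scope, both copies would be expected to behave like the upgraded one, which is the point of that feature.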

> > and we realised the 3rd-Camel model where Perl keeps track of the
> > characters/bytes distinction isn't workable.
> 
> It worked for me ;-).

There are just too many cases where a string gets upgraded by mistake,
and too many weird corner cases, particularly with pattern-matching.

Ben



------------------------------

Date: Tue, 06 Nov 2012 07:39:17 +0000
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: Why "Wide character in print"?
Message-Id: <87txt3npsq.fsf@sapphire.mobileactivedefense.com>

Ben Morrow <ben@morrow.me.uk> writes:
> Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:

[...]

>> The only way to provide that is to store all characters as integer
>> values large enough to encompass all conceivably existing Unicode
>> codepoints. Otherwise, you're going to have multibyte characters and
>> consequently, 'indexing into the array to find a particular character
>> in the string' won't work anymore.
>
> Yes. That's called 'a 32-bit int', and is the standard wchar_t C
> representation of Unicode. A sensible alternative would be a 1/2/4-byte
> upgrade scheme somewhat similar to the current Perl scheme, but with all
> the alternatives being constant width; a smarter alternative would be to
> represent a string as a series of pieces, each of which could make a
> different choice (and, potentially, some of which could be shared or CoW
> with other strings).

With the most naive implementation, this would mean that moving 100G
of text data through Perl (and that's a small number for some jobs I'm
thinking of) requires copying 400G of data into Perl and 400G out of
it. What you consider 'smart' would only penalize people who actually
used non-ASCII-scripts to some (possibly serious) degree. 

>> Independently of this, the UTF-8 encoding was designed to be a
>> representation of the Unicode character set which was backwards
>> compatible with 'ASCII-based systems' and it is not only a widely
>> supported internet standard (http://tools.ietf.org/html/rfc3629) and
>> the method of choice for dealing with 'Unicode' for UNIX(*) and
>> similar systems but formed the 'basic character encoding' of complete
>> operating systems as early as 1992
>> (http://plan9.bell-labs.com/plan9/about.html).
>
> There is a very big difference between a sensible *internal*
> representation and a sensible *external* representation.

This notion of 'internal' and 'external' representation is nonsense:
In order to cooperate sensibly, a number of different processes need
to use the same 'representation' for text data to avoid repeated
decoding and encoding whenever data needs to cross a process
boundary. And for 'external representation', using a proper
compression algorithm for data which doesn't need to be usable in its
stored form will yield better results than any 'encoding scheme'
biased towards making the important things (dealing with US-English texts)
simple and resting comfortably on the notion that everything else is
someone else's problem.


------------------------------

Date: Tue, 06 Nov 2012 20:21:14 +0000
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: Why "Wide character in print"?
Message-Id: <874nl2iith.fsf@sapphire.mobileactivedefense.com>

Rainer Weikusat <rweikusat@mssgmbh.com> writes:
> Ben Morrow <ben@morrow.me.uk> writes:
>> Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
>
> [...]
>
>>> The only way to provide that is to store all characters as integer
>>> values large enough to encompass all conceivably existing Unicode
>>> codepoints. Otherwise, you're going to have multibyte characters and
>>> consequently, 'indexing into the array to find a particular character
>>> in the string' won't work anymore.
>>
>> Yes. That's called 'a 32-bit int', and is the standard wchar_t C
>> representation of Unicode.

[...]

> With the most naive implementation, this would mean that moving 100G
> of text data through Perl (and that's a small number for some jobs I'm
> thinking of) requires copying 400G of data into Perl and 400G out of
> it.

And - of course - this still wouldn't help since a 'character'
as it appears in some script doesn't necessarily map 1:1 to a Unicode
codepoint. E.g., the German a-umlaut can be represented either as a single
codepoint (U+00E4, the same value as its ISO-8859-1 code, IIRC) or as 'a'
followed by a 'combining diaeresis' (and the policy of the Unicode
consortium is actually to avoid
adding more 'precombined characters' in favor of 'grapheme
construction sequences', at least, that's what it was in 2005, when I
last had a closer look at this).
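
The a-umlaut case can be sketched with the core Unicode::Normalize module (variable names are made up for the example): the two spellings differ in length and do not compare equal until both are normalized, while \X counts the decomposed form as a single user-visible character.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{E4}";     # LATIN SMALL LETTER A WITH DIAERESIS
my $decomposed  = "a\x{308}";   # 'a' followed by COMBINING DIAERESIS

my $len_pre   = length $precomposed;
my $len_dec   = length $decomposed;
my $equal_raw = $precomposed eq $decomposed ? 1 : 0;
my $equal_nfc = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;
my $equal_nfd = NFD($precomposed) eq NFD($decomposed) ? 1 : 0;

# \X matches one extended grapheme cluster (base character plus marks).
my $graphemes = () = $decomposed =~ /\X/g;

print "lengths $len_pre/$len_dec, raw eq $equal_raw, ",
      "NFC eq $equal_nfc, NFD eq $equal_nfd, graphemes $graphemes\n";
```

So even a fixed-width codepoint array does not make "the n-th character" a simple index once combining sequences are involved.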


------------------------------

Date: Tue, 6 Nov 2012 21:51:25 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Why "Wide character in print"?
Message-Id: <slrnk9iu2d.b67.hjp-usenet2@hrunkner.hjp.at>

On 2012-11-05 22:15, Rainer Weikusat <rweikusat@mssgmbh.com> wrote:
> Ben Morrow <ben@morrow.me.uk> writes:
>
> [...]
>
>> (In practice it would break XS, so it probably won't happen, which is a
>> shame. UTF-8 was a very bad choice of internal representation, in
>> retrospect, though it seemed to make sense at the time. It makes a great
>> many internal operations much more complicated than they need to be,
>> because you can no longer index into an array to find a particular
>> character in the string.)
>
> The only way to provide that is to store all characters as integer
> values large enough to encompass all conceivably existing Unicode
> codepoints.

Not necessarily. As Ben already pointed out, not all strings have to
have the same representation. There is at least one programming language
(Pike) which uses 1, 2, or 4 bytes per character depending on the
"widest" character in the string. IIRC, Pike had Unicode support before
Perl, so Perl could have "stolen" that idea.


> Otherwise, you're going to have multibyte characters and
> consequently, 'indexing into the array to find a particular character
> in the string' won't work anymore.

There are other tradeoffs, too: UTF-8 is quite compact for latin text,
but it takes about 2 bytes per character for most other alphabetic
scripts (e.g. Cyrillic, Greek, Devanagari) and 3 for CJK and some other
alphabetic scripts (e.g. Hiragana and Katakana). So the size problem you
mentioned may be reversed if you are mainly processing Asian text.
Plus scanning a text may be quite a bit faster if you can do it in 16
bit quantities instead of 8 bit quantities.
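
The size tradeoff is easy to check with the core Encode module. A sketch with three made-up sample strings (written as escapes so the source file's own encoding does not matter):

```perl
use strict;
use warnings;
use Encode qw(encode);

my %sample = (
    latin    => "hello",
    cyrillic => "\x{43F}\x{440}\x{438}\x{432}\x{435}\x{442}",   # "privet"
    cjk      => "\x{65E5}\x{672C}\x{8A9E}",                     # "nihongo"
);

my %size;
for my $name (sort keys %sample) {
    # The LE variants are used so that no byte-order mark is counted.
    for my $enc (qw(UTF-8 UTF-16LE UTF-32LE)) {
        $size{$name}{$enc} = length(encode($enc, $sample{$name}));
    }
    printf "%-9s %d chars: UTF-8 %2d, UTF-16 %2d, UTF-32 %2d bytes\n",
        $name, length($sample{$name}),
        @{ $size{$name} }{qw(UTF-8 UTF-16LE UTF-32LE)};
}
```

For these samples UTF-8 wins on Latin text, ties with UTF-16 on Cyrillic, and loses to UTF-16 on CJK.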


> Independently of this, the UTF-8 encoding was designed to be a
> representation of the Unicode character set which was backwards
> compatible with 'ASCII-based systems' and it is not only a widely
> supported internet standard (http://tools.ietf.org/html/rfc3629) and
> the method of choice for dealing with 'Unicode' for UNIX(*) and
> similar systems but formed the 'basic character encoding' of complete
> operating systems as early as 1992
> (http://plan9.bell-labs.com/plan9/about.html).

However, the Plan 9 C API has exactly the distinction you are
criticizing: Internally, strings are arrays of 16-bit quantities;
externally, they are read and written as UTF-8.

From the well-known "Hello world" paper:

| All programs in Plan 9 now read and write text as UTF, not ASCII.
| This change breaks two deep-rooted symmetries implicit in most C
| programs:
| 
| 1. A character is no longer a char.
| 
| 2. The internal representation (Rune) of a character now differs from
| its external representation (UTF). 

(The paper was written before Unicode 2.0, so all characters were 16
bit. I don't know the current state of Plan 9)
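
The closest Perl analogue to that internal/external split is an encoding layer on the filehandle. A hedged sketch using in-memory filehandles and a made-up sample line; nothing here is specific to Plan 9:

```perl
use strict;
use warnings;

# External form: UTF-8 bytes, as they would sit in a file ("naive" with
# i-diaeresis, which takes two bytes in UTF-8).
my $external = "na\xC3\xAFve\n";

# Decode on the way in: the program sees characters, not bytes.
open my $in, '<:encoding(UTF-8)', \$external or die "open for read: $!";
my $line = <$in>;
close $in;
chomp $line;

my $char_count = length $line;              # counted in characters
my $byte_count = length($external) - 1;     # external bytes minus newline

# Encode on the way out.
my $out_buf = '';
open my $out, '>:encoding(UTF-8)', \$out_buf or die "open for write: $!";
print {$out} uc($line), "\n";
close $out;

print "$char_count characters, $byte_count bytes\n";
```

Printing characters above U+00FF to a handle without such a layer is what produces the "Wide character in print" warning in this thread's subject line.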

	hp


-- 
   _  | Peter J. Holzer    | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR       | Man feilt solange an seinen Text um, bis
| |   | hjp@hjp.at         | die Satzbestandteile des Satzes nicht mehr
__/   | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel


------------------------------

Date: Tue, 6 Nov 2012 22:27:21 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Why "Wide character in print"?
Message-Id: <slrnk9j05p.b67.hjp-usenet2@hrunkner.hjp.at>

On 2012-11-05 22:48, Ben Morrow <ben@morrow.me.uk> wrote:
> Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
>> As such, supporting it natively in a programming language closely
>> associated with UNIX(*), at least at that time, should have been
>> pretty much a no brainer. "But Microsoft did it difffentely !!1" is
>> the ultimate argument for some people but - thankfully - these didn't
>> get to piss into Perl until very much later and thus, the damage they
>> can still do is mostly limited to 'propaganda'.
>
> I don't know what Win32's internal representation is (I suspect 32bit
> int, the same as Unix), but its default external representation is
> UTF-16, which is about the most braindead concoction anyone has ever
> come up with.

I guess you haven't seen Punycode ;-) [There seems to be no "barf"
emoticon in Unicode - I'm disappointed]

> The only possible justification for its existence is
> backwards-compatibility with systems which started implementing
> Unicode before it was finished,

What do you mean by "finished"? There is a new version of the Unicode
standard about once per year, so it probably won't be "finished" as long
as the Unicode Consortium exists.

Unicode was originally intended to be a 16 bit code, and Unicode 1.0
reflected this: It was 16 bit only and there was no intention to expand
it. That was only added in 2.0, about 4 years later (and at that time it
was theoretical: The first characters outside of the BMP were defined in
Unicode 3.1 in 2001, 9 years after the first release).

So of course anybody who implemented Unicode between 1992 and 1996
implemented it as a 16 bit code, because that was what the standard
said. Those early adopters include Plan 9, Windows NT, and Java.


> and even then I'm *certain* they could have made it less grotesquely
> ugly if they'd tried (a UTF-8-like scheme, for instance).

UTF-16 has a few things in common with UTF-8:

 * both are backward compatible with an existing shorter encoding 
   (UTF-8: US-ASCII, UTF-16: UCS-2)
 * both are variable width
 * both are self-terminating
 * Both use some high bits to distinguish between a single unit (8 resp.
   16 bits), the first unit and subsequent unit(s)

The main differences are 

 * UTF-16 is based on 16-bit units instead of bytes (well, duh!)
 * There was no convenient free block at the top of the value range,
   so the surrogate areas are somewhere in the middle.
 * and therefore ordering isn't preserved (but that wouldn't be
   meaningful anyway)

The main problem I have with UTF-16 is of a psychological nature: It is
extremely tempting to assume that it's a constant-width encoding because
"nobody uses those funky characters above U+FFFF anyway". Basically the
"all the world uses US-ASCII" trap reloaded.

	hp
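
The constant-width trap can be made concrete with the standard surrogate arithmetic. A sketch with hypothetical helper names; it does no validation of lone surrogates or out-of-range input:

```perl
use strict;
use warnings;

# Code point -> list of one or two 16-bit UTF-16 code units.
sub to_utf16_units {
    my ($cp) = @_;
    return ($cp) if $cp < 0x10000;            # BMP: a single unit
    my $v = $cp - 0x10000;                    # 20 bits remain
    return (0xD800 + ($v >> 10),              # high surrogate: top 10 bits
            0xDC00 + ($v & 0x3FF));           # low surrogate: bottom 10 bits
}

# List of UTF-16 code units -> list of code points.
sub from_utf16_units {
    my @units = @_;
    my @cps;
    while (@units) {
        my $w = shift @units;
        if ($w >= 0xD800 && $w <= 0xDBFF
            && @units && $units[0] >= 0xDC00 && $units[0] <= 0xDFFF) {
            my $lo = shift @units;
            push @cps, 0x10000 + (($w - 0xD800) << 10) + ($lo - 0xDC00);
        }
        else {
            push @cps, $w;
        }
    }
    return @cps;
}

my @chars = (0x61, 0x1F600);                  # 'a' and a character above U+FFFF
my @units = map { to_utf16_units($_) } @chars;
printf "%d characters, %d UTF-16 units: %s\n",
    scalar @chars, scalar @units, join ' ', map { sprintf '%04X', $_ } @units;
```

Code that indexes by unit and assumes one unit per character works until the first such character shows up.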


-- 
   _  | Peter J. Holzer    | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR       | Man feilt solange an seinen Text um, bis
| |   | hjp@hjp.at         | die Satzbestandteile des Satzes nicht mehr
__/   | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel


------------------------------

Date: Tue, 6 Nov 2012 23:26:12 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Why "Wide character in print"?
Message-Id: <ktiom9-e0i.ln1@anubis.morrow.me.uk>


Quoth "Peter J. Holzer" <hjp-usenet2@hjp.at>:
> On 2012-11-05 22:48, Ben Morrow <ben@morrow.me.uk> wrote:
> >
> > I don't know what Win32's internal representation is (I suspect 32bit
> > int, the same as Unix), but its default external representation is
> > UTF-16, which is about the most braindead concoction anyone has ever
> > come up with.
> 
> I guess you haven't seen Punycode ;-) [There seems to be no "barf"
> emoticon in Unicode - I'm disappointed]

Oh, God, I'd forgotten about that. Thank you so very much for reminding
me. (And Google Translate says U+6D92 is Chinese for 'vomit'; will that
do?)

> > The only possible justification for its existence is
> > backwards-compatibility with systems which started implementing
> > Unicode before it was finished,
> 
> What do you mean by "finished"? There is a new version of the Unicode
> standard about once per year, so it probably won't be "finished" as long
> as the Unicode Consortium exists.
> 
> Unicode was originally intended to be a 16 bit code, and Unicode 1.0
> reflected this: It was 16 bit only and there was no intention to expand
> it. That was only added in 2.0, about 4 years later (and at that time it
> was theoretical: The first characters outside of the BMP were defined in
> Unicode 3.1 in 2001, 9 years after the first release).
> 
> So of course anybody who implemented Unicode between 1992 and 1996
> implemented it as a 16 bit code, because that was what the standard
> said. Those early adopters include Plan 9, Windows NT, and Java.

Yeah, fair enough, I suppose. It seems obvious in hindsight that 16 bits
weren't going to be enough, but maybe that isn't fair.

> > and even then I'm *certain* they could have made it less grotesquely
> > ugly if they'd tried (a UTF-8-like scheme, for instance).
> 
> UTF-16 has a few things in common with UTF-8:
> 
>  * both are backward compatible with an existing shorter encoding 
>    (UTF-8: US-ASCII, UTF-16: UCS-2)
>  * both are variable width
>  * both are self-terminating
>  * Both use some high bits to distinguish between a single unit (8 resp.
>    16 bits), the first unit and subsequent unit(s)
> 
> The main differences are 
> 
>  * UTF-16 is based on 16-bit units instead of bytes (well, duh!)

Which is one of its major problems: it has all the disadvantages of both
multibyte and wide encodings.

>  * There was no convenient free block at the top of the value range,
>    so the surrogate areas are somewhere in the middle.
>  * and therefore ordering isn't preserved (but that wouldn't be
>    meaningful anyway)
> 
> The main problem I have with UTF-16 is of a psychological nature: It is
> extremely tempting to assume that it's a constant-width encoding because
> "nobody uses those funky characters above U+FFFF anyway". Basically the
> "all the world uses US-ASCII" trap reloaded.

The main problem *I* have is the fact the surrogates are allocated out
of the Unicode character space, so everyone doing anything with Unicode
has to take account of them, even if they won't ever be touching UTF-16
data. UTF-8 doesn't do that: it has magic bits indicating the
variable-length sections, but they are kept away from the data bits
representing the actual characters encoded.

The same could have been done with UTF-16. If I'm reading the charts
right, Unicode 1.1.5 (the last version before the change) allocated
characters from 0000-9FA5 and from F900-FFFF, which leaves Axxx-Exxx
free to represent multi-word characters. So, for instance, they could
have used the following scheme: A word matching one of

    0xxxxxxxxxxxxxxx
    1001xxxxxxxxxxxx
    1111xxxxxxxxxxxx

is a single-word character. Other characters are represented as two
words, encoded as

    101ppppphhhhhhhh 110pppppllllllll

which represents the 26-bit character

    pppppppppphhhhhhhhllllllll

I know that at that point they were intending to extend the character
set to 31 bits, but IMHO reducing that to 26 would have been a lesser
evil than stuffing a whole lot of encoding rubbish into the application-
visible character set. Especially given (hindsight, again) that they
were going to eventually reduce the character range to 21 bits anyway.
(The scheme above could be made more implementation-efficient by
reducing the plane by two more bits, leaving byte-shifts but no
bit-shifts.)

Meh.

Ben
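
The alternative layout proposed above can be sketched in a few lines. This is only an illustration of a scheme that was never standardized; the function names are made up, and two choices are assumptions of the sketch rather than part of the post: any word not starting with the bits 101 is treated as a single-word character, and values below 0x10000 that would collide with the 101/110 patterns are sent through the two-word form.

```perl
use strict;
use warnings;

# Encode a code point (up to 26 bits) as one or two 16-bit words:
#   101ppppphhhhhhhh 110pppppllllllll  <=>  pppppppppphhhhhhhhllllllll
sub alt16_encode {
    my ($cp) = @_;
    die "code point out of 26-bit range\n" if $cp < 0 || $cp > 0x3FFFFFF;

    my $top3 = $cp >> 13;
    return ($cp) if $cp < 0x10000 && $top3 != 0b101 && $top3 != 0b110;

    my $p = $cp >> 16;            # 10-bit "plane" number
    my $h = ($cp >> 8) & 0xFF;    # high byte within the plane
    my $l = $cp & 0xFF;           # low byte within the plane
    return (0xA000 | (($p >> 5)   << 8) | $h,
            0xC000 | (($p & 0x1F) << 8) | $l);
}

# Decode one character from its first word and, if needed, second word.
sub alt16_decode {
    my ($w1, $w2) = @_;
    return $w1 unless ($w1 >> 13) == 0b101;
    die "missing or malformed second word\n"
        unless defined $w2 && ($w2 >> 13) == 0b110;

    my $p = ((($w1 >> 8) & 0x1F) << 5) | (($w2 >> 8) & 0x1F);
    return ($p << 16) | (($w1 & 0xFF) << 8) | ($w2 & 0xFF);
}

my @words = alt16_encode(0x1F600);
printf "U+1F600 -> %s -> U+%X\n",
    join(' ', map { sprintf '%04X', $_ } @words), alt16_decode(@words);
```

Unlike UTF-16 surrogates, the lead and trail patterns here live only in the encoded words, so the decoded code point range contains no reserved gap.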



------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 3813
***************************************

