[32673] in Perl-Users-Digest
Perl-Users Digest, Issue: 3949 Volume: 11
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu May 16 06:09:22 2013
Date: Thu, 16 May 2013 03:09:08 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Thu, 16 May 2013 Volume: 11 Number: 3949
Today's topics:
is agile scrum stuff just to make people mad? <visphatesjava@gmail.com>
Re: utf8 <manfred.lotz@arcor.de>
Re: utf8 <ben@morrow.me.uk>
Re: utf8 <ben@morrow.me.uk>
Re: utf8 <manfred.lotz@arcor.de>
Re: utf8 <rweikusat@mssgmbh.com>
Re: utf8 <hhr-m@web.de>
Re: utf8 <rweikusat@mssgmbh.com>
Re: utf8 <hhr-m@web.de>
Re: utf8 <rvtol+usenet@xs4all.nl>
Re: Why do Perl programmers make more money than Python <xhoster@gmail.com>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Wed, 15 May 2013 10:44:43 -0700 (PDT)
From: johannes falcone <visphatesjava@gmail.com>
Subject: is agile scrum stuff just to make people mad?
Message-Id: <73308200-2cfc-4900-98c1-0dce7d7bf9d1@googlegroups.com>
It seems to do nothing but create a TPS-report-generating class of idiots called scrum leads.
Total productivity killer.
------------------------------
Date: Wed, 15 May 2013 15:24:33 +0200
From: Manfred Lotz <manfred.lotz@arcor.de>
Subject: Re: utf8
Message-Id: <20130515152433.4b5a490a@arcor.com>
On Wed, 15 May 2013 13:27:05 +0100
Ben Morrow <ben@morrow.me.uk> wrote:
>
> Quoth Manfred Lotz <manfred.lotz@arcor.de>:
> > On Tue, 14 May 2013 21:27:49 +0100
> > Ben Morrow <ben@morrow.me.uk> wrote:
> > >
> > > That is exactly what Peter was trying to explain. Because of the
> > > 'use utf8', perl has already decoded the UTF-8 in the source code
> > > file into Unicode characters, so $string does *not* contain
> > > "\x48\xc3\xa4": instead it contains "\x48\xe4". The e4 is because
> > > 'ä', as a Unicode character, has ordinal 0xe4. This string, which
> > > happens to contain only bytes though it could easily not have
> > > done, is not valid UTF-8, so decode croaks.
> > >
> >
> > Ok, I agree that perl decodes 'ä' (which is utf8 x'c3a4' in the
> > file) to unicode \x{e4}.
> >
> > Nevertheless the ä is a valid utf8 char.
>
> No, you're confused about the difference between 'UTF-8' and
> 'Unicode'.
>
> Unicode is a big list of characters, with names and associated
> semantics (like 'the lowercase of character 'A' is character 'a'').
> Each of these characters has been given a number; some of these
> numbers are >255, so it isn't possible to represent a string of
> Unicode characters directly with a string of bytes, the way you can
> with ASCII or Latin-1.
>
> This is a problem, given that files (on most systems) and TCP
> connections and so on are defined as strings of bytes. To solve it,
> various 'Unicode Transformation Formats' have been invented. The one
> usually used on Unix systems and in Internet protocols is called
> 'UTF-8'; if you feed a string of Unicode characters into a UTF-8
> encoder you get a string of bytes out, and if you feed a string of
> bytes into a UTF-8 decoder you either get a string of Unicode
> characters or you get an error, if the string of bytes wasn't valid
> UTF-8.
>
> Perl strings are always strings of Unicode characters[0]. If you want
> to represent a string of bytes in Perl, you do so by using a string of
> characters all of which happen to have an ordinal value less than 256.
> Perl does not make any attempt to keep track of whether a given string
> was supposed to be 'a string of bytes' or not: you have to do this
> yourself[1].
>
> If you read a string from a file (without doing anything special to
> the filehandle first), you will always get a string of bytes, because
> the Unix file-reading APIs only support files that consist of strings
> of bytes. If that string of bytes was supposed to be UTF-8, and you
> want to manipulate it as a string of Unicode characters, you have to
> pass it through Encode::decode. Since not all strings of bytes are
> valid UTF-8, this function can fail; this is what Peter posted.
>
> If you write a string to a file (without...), the characters in the
> string are written out directly as bytes. If they all have ordinals
> below 256 this will effectively leave the file encoded in ISO8859-1,
> since the first 256 Unicode characters have the same numbers as the
> 256 ISO8859-1 characters. If you try to write a character with
> ordinal 256 or greater, you will get a warning and stupid behaviour,
> because there simply isn't any way to write a byte to a file with a
> value greater than 255[2]. If you want to write UTF-8 to a file, you
> have to encode your string of characters (which may have ordinals
> >255) using Encode::encode, which will return a string with all
> ordinals <256, which you can write to the file.
>
> So "\x48\xc3\xa4" is valid UTF-8. If you decode it into Unicode
> characters, you get the string "\x48\xe4", which is *not* valid UTF-8.
>
I did not decode it.
> What are you actually trying to do here? That is, why do you think you
> need to check if a string is valid UTF-8?
>
I'm not trying anything. However, the OP asked if there is any easy way
to decide if a string is valid UTF-8. I answered him pointing to
Encode::is_utf8(), which as Peter rightly told me is the wrong way.
Peter said that $decoded = eval { decode('UTF-8', $string, FB_CROAK) };
is correct, which I don't believe.
Let me repeat from my last example. 'ä' is unicode point 0xe4 and
utf-8 0xc3a4. In the script file (which itself is an utf8 encoded file)
ä is 0xc3a4. Why should perl kill this when I have specified 'use
utf8;'? My only statement is that $ae in the script below is a valid
utf8 string.
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Test::utf8;
use Devel::Peek;

binmode STDOUT, ":utf8";

my $ae = 'ä';
show_char($ae);

sub show_char {
    my $ch = shift;
    print '-' x 80;
    print "\n";
    Dump $ch;
    print "Char: $ch\n";
    is_valid_string($ch);    # check the string is valid
    is_sane_utf8($ch);       # check not double encoded
    # check the string has certain attributes
    is_flagged_utf8($ch);    # has utf8 flag set
    is_within_ascii($ch);    # only has ascii chars in it
    is_within_latin_1($ch);  # only has latin-1 chars in it
}
then I get:
--------------------------------------------------------------------------------
SV = PV(0x1b86dd0) at 0x1bd7470
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1bb0ec0 "\303\244"\0 [UTF8 "\x{e4}"]
  CUR = 2
  LEN = 16
Char: ä
ok 1 - valid string test
ok 2 - sane utf8
ok 3 - flagged as utf8
not ok 4 - within ascii
# Failed test 'within ascii'
# at ./unicode05.pl line 29.
# Char 1 not ASCII (it's 228 dec / e4 hex)
ok 5 - within latin-1
# Tests were run but no plan was declared and done_testing() was not seen.
This IMHO shows that $ae in the above script is a valid utf8 string.
This is the only thing I state.
What is your argument for saying $ae is not utf8? Then you should tell
me where the above script is wrong, or tell me how to interpret the
output of the script differently than I did.
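[For reference, what the UTF8 flag does and does not tell you can be probed directly. This is only a sketch, using utf8::upgrade/utf8::downgrade purely for demonstration: it stores the same one-character string in both of perl's internal representations.]

```perl
#!/usr/bin/perl
# Sketch: the same one-character string stored in both of perl's
# internal representations. utf8::upgrade/downgrade change the
# representation only, not the string itself.
use strict;
use warnings;

my $flagged   = "\x{e4}";
utf8::upgrade($flagged);      # force the UTF8-flagged representation
my $unflagged = "\xe4";
utf8::downgrade($unflagged);  # force the single-byte representation

print "flags differ\n"  if utf8::is_utf8($flagged) != utf8::is_utf8($unflagged);
print "strings equal\n" if $flagged eq $unflagged;
```

[Both lines print: the flag differs, yet perl considers the strings equal, so the flag alone cannot be what "valid utf8" means.]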
--
Manfred
------------------------------
Date: Wed, 15 May 2013 14:28:15 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: utf8
Message-Id: <f4fc6a-6dk2.ln1@anubis.morrow.me.uk>
Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
> Manfred Lotz <manfred.lotz@arcor.de> writes:
> > On Tue, 14 May 2013 21:27:49 +0100
>
> [...]
>
> > My mistake was that I believed that perl's internal representation is
> > utf8 instead of unicode code point.
>
> perl's internal representation is utf8 which is supposed to be decoded
> on demand as necessary. That's not an uncommon implementation choice
> for software supposed to interact with 'the real world' (here supposed
> to mean 'everything out there on the internet', have a look at the
> Mozilla Rust FAQ for a cogent and succinct explanation why this makes
> sense) but that's an implementation choice the people who presently
> work on this code strongly disagree with: They would prefer a model
> where, prior to each internal processing step, a pass over the
> complete input data has to be made in order to transform it into "the
> super-secret internal perl encoding" and after any internal processing
> has been completed, a second pass over all of the data has to be made
> in order to decode the 'super-secret internal perl encoding' into
> something which is useful for anything except being 'super secret' and
> 'internal to Perl'.
You are confusing semantics with internal representation. Encode is
privy to perl's internal representation; it knows that if you are
encoding into (loose) "utf8" and the string is internally represented as
SvUTF8 then all it has to do is flip the flag, and similarly that if you
are encoding into "ISO8859-1" and the string is not internally SvUTF8
that it doesn't need to do anything. Decoding is not quite so simple,
since it isn't safe to assume input which was supposed to be in UTF-8 is
actually valid, but decoding a non-SvUTF8 string from "utf8" still
doesn't do any actual decoding, it just validates the string and copies
it out.
If you are concerned about the copying overhead implied by the 'encode'
and 'decode' API, utf8::encode and utf8::decode will encode or decode in
place, without doing any copying unless they have to. Unlike ::upgrade
and ::downgrade, these are perfectly sensible functions to use if you
only need to encode or decode "utf8".
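[A minimal sketch of that in-place pair; both functions are always available without loading a module:]

```perl
#!/usr/bin/perl
# In-place conversion between characters and UTF-8 bytes with the
# core utf8::encode/utf8::decode functions, no copy through Encode.
use strict;
use warnings;

my $str = "H\x{e4}";                          # two characters
utf8::encode($str);                           # now three bytes, \x48\xc3\xa4
printf "bytes: %d\n", length $str;            # 3

utf8::decode($str) or die "not valid UTF-8";  # returns false on bad input
printf "chars: %d\n", length $str;            # 2
```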
> This sort-of makes sense when assuming that perl is an island located
> in strange waters and that it will usually keep mostly to itself
> (figuratively spoken) and it makes absolutely no sense when 'some perl
> code' performs one step of a multi-stage processing pipeline which may
> possibly even include other perl code (since not even 'output of perl'
> is supposed to be suitable to become 'input of perl').
Unix IPC is defined in terms of bytes. There is no way to represent an
arbitrary Unicode character as a sequence of bytes without some sort of
encoding step. This is no different from the fact that you can't pass a
hash from one perl process to another without encoding it in some way
(for instance, with Storable).
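[To illustrate that last point, a hash crosses a byte-oriented channel only after an explicit serialisation step. A sketch with the core Storable module; the %config contents are made up:]

```perl
#!/usr/bin/perl
# Sketch: serialising a hash to a plain byte string (as for IPC) and
# reconstructing it, using the core Storable module.
use strict;
use warnings;
use Storable qw(freeze thaw);

my %config = (host => 'example.org', port => 443);  # illustrative data

my $bytes = freeze(\%config);  # plain bytes, safe to write to a pipe
my $copy  = thaw($bytes);      # what the receiving process would do

print "port is $copy->{port}\n";
```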
Ben
------------------------------
Date: Wed, 15 May 2013 15:37:14 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: utf8
Message-Id: <q5jc6a-44l2.ln1@anubis.morrow.me.uk>
Quoth Manfred Lotz <manfred.lotz@arcor.de>:
> On Wed, 15 May 2013 13:27:05 +0100
> Ben Morrow <ben@morrow.me.uk> wrote:
>
> > So "\x48\xc3\xa4" is valid UTF-8. If you decode it into Unicode
> > characters, you get the string "\x48\xe4", which is *not* valid UTF-8.
>
> I did not decode it.
Yes you did. You passed Perl a file containing the bytes 0x22 0x48 0xc3
0xa4 0x22 (that is, "Hä", encoded in UTF-8), and you also said 'use
utf8;' which asks Perl to decode the rest of the file from UTF-8. Perl
did so, and so you ended up with the string "\x48\xe4" which, though it
happens to still be a string of bytes, is not valid UTF-8.
Until you understand this a bit better you should probably stay away
from the 'utf8' pragma. Write your source files in ASCII-only (that is,
don't use 8-bit ISO8859-1 characters either), and if you need strings
with Unicode in them, stick to "\x{...}" or "\N{...}".
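[A sketch of that advice: the file below is pure ASCII, needs no 'use utf8', and still builds Unicode strings.]

```perl
#!/usr/bin/perl
# ASCII-only source file: Unicode characters are spelled as escapes,
# so no 'use utf8' is needed and no editor can mis-encode the file.
use strict;
use warnings;
use charnames ':full';   # \N{CHARNAME} support on older perls

my $ae1 = "\x{e4}";                                   # by code point
my $ae2 = "\N{LATIN SMALL LETTER A WITH DIAERESIS}";  # by name

binmode STDOUT, ':encoding(UTF-8)';
print "same character\n" if $ae1 eq $ae2;
```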
> > What are you actually trying to do here? That is, why do you think you
> > need to check if a string is valid UTF-8?
>
> I'm not trying anything. However, the OP asked if there is any easy way
> to decide if a string is valid UTF-8. I answered him pointing to
> Encode ::is_utf8() which as Peter rightly told me is the wrong way.
I thought you were the OP... oh God, this is a George Mpouras thread.
He's in my killfile for a reason...
> Peter said that $decoded = eval { decode('UTF-8', $string, FB_CROAK) };
> is correct which I don't believe.
>
> Let me repeat from my last example. 'ä' is unicode point 0xe4 and
> utf-8 0xc3a4. In the script file (which itself is an utf8 encoded file)
> ä is 0xc3a4. Why should perl kill this when I have specified 'use
> utf8;'? My only statement is that $ae in the script below is a valid
> utf8 string.
Take out the 'use utf8;' and run the program again. Does that give you
the result you expected?
Now write the source file out in ISO8859-1 and run it again. Barring
bugs in perl, a source file written in ISO8859-1 *without* 'use utf8'
and the equivalent source file written in UTF-8 *with* 'use utf8' will
have exactly the same effect.
(In principle you can rewrite the file in any encoding you like, add an
equivalent 'use encoding' directive, and get the same effect. In
practice the implementation of 'encoding' is rather buggy, so that
doesn't entirely work.)
Perl does not remember that the string happened to come from a file
which happened to have been in UTF-8. All it knows is that the string
has two characters, "\x48\xe4", and that that string is *not* valid
UTF-8.
> SV = PV(0x1b86dd0) at 0x1bd7470
> REFCNT = 1
> FLAGS = (PADMY,POK,pPOK,UTF8)
> PV = 0x1bb0ec0 "\303\244"\0 [UTF8 "\x{e4}"]
> CUR = 2
> LEN = 16
[...]
>
> This IMHO shows that $ae in above script is a valid utf8 string.
> This is the only thing I state.
Which of these questions are you trying to answer?
If I write this string to a file, will that file be valid UTF-8?
Is the perl-internal SvUTF8 flag set?
Peter's answer is the correct answer to the first question, which is a
useful question to be able to answer. The correct answer to the second
is 'utf8::is_utf8', but the Right answer is 'except under exceptional
circumstances you don't need to know that, and in any case the answer is
not something you can rely on'.
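[The two checks can be put side by side. A sketch; the helper name is made up:]

```perl
#!/usr/bin/perl
# Sketch: "would this byte string be valid UTF-8?" (Encode::decode)
# versus "is the internal SvUTF8 flag set?" (utf8::is_utf8).
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

sub is_valid_utf8 {
    my ($octets) = @_;   # FB_CROAK may modify its argument; this is a copy
    return eval { decode('UTF-8', $octets, FB_CROAK); 1 } ? 1 : 0;
}

my $bytes = "\x48\xc3\xa4";   # UTF-8 encoding of "H\x{e4}"
my $chars = "\x48\xe4";       # the decoded two-character string

print "bytes valid: ",   is_valid_utf8($bytes), "\n";          # 1
print "chars valid: ",   is_valid_utf8($chars), "\n";          # 0
print "chars flagged: ", utf8::is_utf8($chars) ? 1 : 0, "\n";  # 0
```

[The character string fails the validity check even though it came from valid UTF-8 source text, which is the whole point of the thread.]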
Ben
------------------------------
Date: Wed, 15 May 2013 17:48:52 +0200
From: Manfred Lotz <manfred.lotz@arcor.de>
Subject: Re: utf8
Message-Id: <20130515174852.213842e5@arcor.com>
On Wed, 15 May 2013 15:37:14 +0100
Ben Morrow <ben@morrow.me.uk> wrote:
>
> Quoth Manfred Lotz <manfred.lotz@arcor.de>:
> > On Wed, 15 May 2013 13:27:05 +0100
> > Ben Morrow <ben@morrow.me.uk> wrote:
> >
> > > So "\x48\xc3\xa4" is valid UTF-8. If you decode it into Unicode
> > > characters, you get the string "\x48\xe4", which is *not* valid
> > > UTF-8.
> >
> > I did not decode it.
>
> Yes you did. You passed Perl a file containing the bytes 0x22 0x48
> 0xc3 0xa4 0x22 (that is, "Hä", encoded in UTF-8), and you also said
> 'use utf8;' which asks Perl to decode the rest of the file from
> UTF-8. Perl did so, and so you ended up with the string "\x48\xe4"
> which, though it happens to still be a string of bytes, is not valid
> UTF-8.
>
> Until you understand this a bit better you should probably stay away
> from the 'utf8' pragma. Write your source files in ASCII-only (that
> is, don't use 8-bit ISO8859-1 characters either), and if you need
> strings with Unicode in stick to "\x{...}" or "\N{...}".
>
> > > What are you actually trying to do here? That is, why do you
> > > think you need to check if a string is valid UTF-8?
> >
> > I'm not trying anything. However, the OP asked if there is any easy
> > way to decide if a string is valid UTF-8. I answered him pointing to
> > Encode ::is_utf8() which as Peter rightly told me is the wrong way.
>
> I thought you were the OP... oh God, this is a George Mpouras thread.
> He's in my killfile for a reason...
>
> > Peter said that $decoded = eval { decode('UTF-8', $string,
> > FB_CROAK) }; is correct which I don't believe.
> >
> > Let me repeat from my last example. 'ä' is unicode point 0xe4 and
> > utf-8 0xc3a4. In the script file (which itself is an utf8 encoded
> > file) ä is 0xc3a4. Why should perl kill this when I have specified
> > 'use utf8;'? My only statement is that $ae in the script below is a
> > valid utf8 string.
>
> Take out the 'use utf8;' and run the program again. Does that give you
> the result you expected?
>
In my opinion it makes no sense to leave out 'use utf8;' if I have utf8
stuff in my script which is outside of ASCII. The only requirement I
have is that 'ä' won't change whatever perl does with it internally.
This works fine, so I have no complaints.
> Now write the source file out in ISO8859-1 and run it again. Barring
> bugs in perl, a source file written in ISO8859-1 *without* 'use utf8'
> and the equivalent source file written in UTF-8 *with* 'use utf8' will
> have exactly the same effect.
>
> (In principle you can rewrite the file in any encoding you like, add
> an equivalent 'use encoding' directive, and get the same effect. In
> practice the implementation of 'encoding' is rather buggy, so that
> doesn't entirely work.)
>
> Perl does not remember that the string happened to come from a file
> which happened to have been in UTF-8. All it knows is that the string
> has two characters, "\x48\xe4", and that that string is *not* valid
> UTF-8.
>
> > SV = PV(0x1b86dd0) at 0x1bd7470
> >   REFCNT = 1
> >   FLAGS = (PADMY,POK,pPOK,UTF8)
> >   PV = 0x1bb0ec0 "\303\244"\0 [UTF8 "\x{e4}"]
> >   CUR = 2
> >   LEN = 16
> [...]
> >
> > This IMHO shows that $ae in above script is a valid utf8 string.
> > This is the only thing I state.
>
> Which of these questions are you trying to answer?
>
> If I write this string to a file, will that file be valid UTF-8?
This was not asked by the OP. But if I write $ae to stdout using
binmode STDOUT, ":utf8" then I'm fine.
> Is the perl-internal SvUTF8 flag set?
>
I only tried to answer the question whether a string is valid utf8. After
the discussions we had, the new question seems to be whether the former is
a meaningful question at all. Because if the string contained stuff
which is invalid utf8 (which can happen when there is some hex
garbage), then Emacs would have complained at the latest when I tried to
save the buffer.

--
Manfred
------------------------------
Date: Wed, 15 May 2013 19:01:53 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: utf8
Message-Id: <87bo8c15ri.fsf@sapphire.mobileactivedefense.com>
Ben Morrow <ben@morrow.me.uk> writes:
> Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
>> Manfred Lotz <manfred.lotz@arcor.de> writes:
>> > On Tue, 14 May 2013 21:27:49 +0100
>>
>> [...]
>>
>> > My mistake was that I believed that perl's internal representation is
>> > utf8 instead of unicode code point.
>>
>> perl's internal representation is utf8 which is supposed to be decoded
>> on demand as necessary. That's not an uncommon implementation choice
>> for software supposed to interact with 'the real world' (here supposed
>> to mean 'everything out there on the internet', have a look at the
>> Mozilla Rust FAQ for a cogent and succinct explanation why this makes
>> sense) but that's an implementation choice the people who presently
>> work on this code strongly disagree with: They would prefer a model
>> where, prior to each internal processing step, a pass over the
>> complete input data has to be made in order to transform it into "the
>> super-secret internal perl encoding" and after any internal processing
>> has been completed, a second pass over all of the data has to be made
>> in order to decode the 'super secrete internal perl encoding' into
>> something which is useful for anyhing except being 'super secret' and
>> 'internal to Perl'.
>
> You are confusing semantics with internal representation.
I'm not 'confusing' anything. I described this (AFAICT) correctly from
the abstract viewpoint a 'language user' is supposed to assume.
BTW: This 'stock reply' to any kind of justified criticism, attacking the
person who wrote it as 'clueless' by substituting an alternate, more-or-less
related topic, is really getting long in the tooth.
> Encode is privy to perl's internal representation; it knows that if
> you are encoding into (loose) "utf8" and the string is internally
> represented as SvUTF8 then all it has to do is flip the flag, and
> similarly that if you are encoding into "ISO8859-1" and the string
> is not internally SvUTF8 that it doesn't need to do
> anything. Decoding is not quite so simple, since it isn't safe to
> assume input which was supposed to be in UTF-8 is actually valid,
> but decoding a non-SvUTF8 string from "utf8" still doesn't do any
> actual decoding, it just validates the string and copies it out.
The idea that the programmer should be forced to do useless stuff but
that otherwise useless code can be used to detect that the computer
can skip this useless request doesn't exactly make sense: Despite
being useless, the useless request code (uselessly) needs to be
written, debugged and maintained and human time is much more expensive
than computer time.
[...]
>> This sort-of makes sense when assuming that perl is an island located
>> in strange waters and that it will usually keep mostly to itself
>> (figuratively spoken) and it makes absolutely no sense when 'some perl
>> code' performs one step of a multi-stage processing pipeline which may
>> possibly even include other perl code (since not even 'output of perl'
>> is supposed to be suitable to become 'input of perl').
>
> Unix IPC is defined in terms of bytes. There is no way to represent an
> arbitrary Unicode character as a sequence of bytes without some sort of
> encoding step.
Quoting the document I already mentioned in the original posting:
Why are strings UTF-8 by default? Why not UCS2 or UCS4?
The str type is UTF-8 because we observe more text in the wild
in this encoding -- particularly in network transmissions,
which are endian-agnostic -- and we think it's best that the
default treatment of I/O not involve having to recode
codepoints in each direction.
https://github.com/mozilla/rust/wiki/Doc-language-FAQ#why-are-strings-utf-8-by-default-why-not-ucs2-or-ucs4
NB: That's the exact argument I made and I guess the correct 'open
source response' should be that 'the Perl5 tribe' goes on the warpath
in order to exterminate 'the Mozilla Rust tribe' and thus, rid the
world of these "fundamentally mistaken" dissenting opinions ...
------------------------------
Date: Wed, 15 May 2013 21:52:52 +0200
From: Helmut Richter <hhr-m@web.de>
Subject: Re: utf8
Message-Id: <alpine.LNX.2.00.1305152142050.14656@badwlrz-clhri01.ws.lrz.de>
On Wed, 15 May 2013, Rainer Weikusat wrote:
> The idea that the programmer should be forced to do useless stuff but
> that otherwise useless code can be used to detect that the computer
> can skip this useless request doesn't exactly make sense: Despite
> being useless, the useless request code (uselessly) needs to be
> written, debugged and maintained and human time is much more expensive
> than computer time.
The idea is to separate things that belong to the interface from those
that do not. The latter things may change at any time or from one
implementation to another without doing any harm to people who have only
used the documented interface and not arbitrary implementation decisions
of one particular implementation. This is a wise way to proceed.
The internal representation of character strings in perl does *not* belong
to the interface. If you happen to know how it is done (in particular that
the same character string may have different representations in the same
implementation), don't use it because it may change at any time without
warning. This is so in all programming languages. If you try to exploit
your knowledge of the bitwise representation of a Fortran real number your
code may break when you go from one implementation to another.
By the way, this kind of defined interface made it possible to expand perl
strings beyond ISO-8859-1 without breaking existing applications.
--
Helmut Richter
------------------------------
Date: Wed, 15 May 2013 21:39:29 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: utf8
Message-Id: <874ne40ygu.fsf@sapphire.mobileactivedefense.com>
Helmut Richter <hhr-m@web.de> writes:
> On Wed, 15 May 2013, Rainer Weikusat wrote:
>> The idea that the programmer should be forced to do useless stuff but
>> that otherwise useless code can be used to detect that the computer
>> can skip this useless request doesn't exactly make sense: Despite
>> being useless, the useless request code (uselessly) needs to be
>> written, debugged and maintained and human time is much more expensive
>> than computer time.
>
> The idea is to separate things that belong to the interface from those
> that do not. The latter things may change at any time or from one
> implementation to another without doing any harm to people who have only
> used the documented interface and not arbitrary implementation decisions
> of one particular implementation. This is a wise way to proceed.
That's a completely general statement about "good programming
practices". The sole purpose it is supposed to fulfil here is to
suggest that an opinion about something which happens to conflict with
some other opinion would somehow conflict with the mentioned 'good
programming practice' without detailing how exactly.
> The internal representation of character strings in perl does *not* belong
> to the interface.
The people who are presently concerned with this think that perl
should have a 'super-secret internal character representation' which
isn't useful for anything except 'perl-internal processing' (and not
compatible with anything, including different instances of perl
itself). As far as I know, the reason why they think this is that
'implementation convenience' trumps 'real-world usability'. Other
people working on similar stuff in other programming languages
(including older versions of Perl) think that the character string
representation used by $language should be documented and follow a
'sensibly chosen existing convention' even if this might cause
'implementation inconveniences'.
[...]
> If you try to exploit your knowledge of the bitwise representation
> of a Fortran real number your code may break when you go from one
> implementation to another.
I have no knowledge about the 'bitwise representation of
Fortran-anything', and 'Fortran floating-point data types' and
'representation of unicode strings' are two very different things
(in particular, I doubt that many web pages or other existing 'text
files' contain 'Fortran floating point numbers' represented in
binary). Apart from that, there are standards for representing
'floating point values'.
------------------------------
Date: Thu, 16 May 2013 11:26:34 +0200
From: Helmut Richter <hhr-m@web.de>
Subject: Re: utf8
Message-Id: <alpine.LNX.2.00.1305161037130.5274@badwlrz-clhri01.ws.lrz.de>
On Wed, 15 May 2013, Rainer Weikusat wrote:
> Helmut Richter <hhr-m@web.de> writes:
> > The idea is to separate things that belong to the interface from those
> > that do not. The latter things may change at any time or from one
> > implementation to another without doing any harm to people who have only
> > used the documented interface and not arbitrary implementation decisions
> > of one particular implementation. This is a wise way to proceed.
> That's a completely general statement about "good programming
> practices".
Indeed. And it is meant as such.
Implementing something in a way that the arbitrary choice of implementation
details becomes part of the interface, and thus can never again be changed,
would be a major blunder, and I am glad the perl implementers have not done
so.
> As far as I know, the reason why they think this is that
> 'implementation convenience' trumps 'real-world usability'. Other
> people working on similar stuff in other programming languages
> (including older versions of Perl) think that the character string
> representation used by $language should be documented and follow a
> 'sensibly chosen existing convention' even if this might cause
> 'implementation inconveniences'.
Would you have found it better programming practice if decades ago perl had
decided to publish as an interface that iso-8859-1 (the most advanced
character standard then), one byte per character, is the internal
representation for all time? Or should they have taken such a decision at
the time when character code points were restricted to 16 bits? Why should
they do it just now?
It is by no means mandatory to do it the way the perl people did. They could
have chosen a *more* strict separation between character strings and byte
strings so that all input/output is to and from byte strings, only byte
strings can be decoded and only character strings can be encoded. This would
have disallowed some programming mistakes people are now doing. I, too, have
doubts that they chose the best solution. But allowing the programmer access
to the internal representation would have been a major design blunder.
And what do you positively get from direct access to the internal
representation? You talked about efficiency. Is it really a major efficiency
issue to let perl decide, by inspecting one bit, whether the internal
representation of a particular string happens to be already utf-8, so that the
encoding/decoding is practically a null operation?
--
Helmut Richter
------------------------------
Date: Thu, 16 May 2013 11:34:15 +0200
From: "Dr.Ruud" <rvtol+usenet@xs4all.nl>
Subject: Re: utf8
Message-Id: <5194a817$0$15907$e4fe514c@news2.news.xs4all.nl>
On 15/05/2013 17:48, Manfred Lotz wrote:
> In my opinion it makes no sense to leave out 'use utf8;' if I have utf8
> stuff in my script which is outside of ASCII.
Sure, if your source file is "in 'utf8' format" (and of course a fully
ASCII file is 'utf8' (and 'UTF-8') as well), then it shouldn't harm.
But still be aware of the consequences. If you save the file as latin1
at some point, you break it, exactly because of the "use utf8;".
I prefer my source files to be ASCII, so I use code like "\x{1234}".
Now read what the module's documentation states:
utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source
code [...]
The "use utf8" pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope [...]
Do not use this pragma for anything else than telling Perl that your
script is written in UTF-8.
--
Ruud
------------------------------
Date: Wed, 15 May 2013 17:53:35 -0700
From: Xho Jingleheimerschmidt <xhoster@gmail.com>
Subject: Re: Why do Perl programmers make more money than Python programmers
Message-Id: <kn1ar9$n6s$1@dont-email.me>
On 05/10/2013 01:01 PM, johannes falcone wrote:
> On Monday, May 6, 2013 5:05:40 PM UTC-7, Ben Morrow wrote:
>>
>> Thank God for small mercies...
>>
>
> polytheism is best
>
In the absence of sigils, how could one know what was meant?
thank($_) foreach keys %God;
Or is that
thank($_) foreach values %God;
Xho
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests.
For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending perl questions to the -request address, I don't have time to
answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V11 Issue 3949
***************************************