
Perl-Users Digest, Issue: 3948 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Wed May 15 09:09:35 2013

Date: Wed, 15 May 2013 06:09:09 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Wed, 15 May 2013     Volume: 11 Number: 3948

Today's topics:
    Re: Iterating hashes <rvtol+usenet@xs4all.nl>
    Re: Iterating hashes <rweikusat@mssgmbh.com>
    Re: Iterating hashes <rweikusat@mssgmbh.com>
    Re: Iterating hashes <ben@morrow.me.uk>
    Re: Iterating hashes <rweikusat@mssgmbh.com>
    Re: Iterating hashes <ben@morrow.me.uk>
    Re: Iterating hashes <nospam.gravitalsun.noadsplease@hotmail.noads.com>
    Re: Iterating hashes <rweikusat@mssgmbh.com>
    Re: Iterating hashes <dave@invalid.invalid>
    Re: Iterating hashes <ben@morrow.me.uk>
        so how much money do perl programmers make? <visphatesjava@gmail.com>
    Re: utf8 <manfred.lotz@arcor.de>
    Re: utf8 <ben@morrow.me.uk>
    Re: utf8 <manfred.lotz@arcor.de>
    Re: utf8 <manfred.lotz@arcor.de>
    Re: utf8 <ben@morrow.me.uk>
    Re: utf8 <rweikusat@mssgmbh.com>
    Re: utf8 <ben@morrow.me.uk>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Tue, 14 May 2013 20:35:03 +0200
From: "Dr.Ruud" <rvtol+usenet@xs4all.nl>
Subject: Re: Iterating hashes
Message-Id: <519283d7$0$15869$e4fe514c@news2.news.xs4all.nl>

On 14/05/2013 18:40, Ben Morrow wrote:

> I'm not sure why
> 'values' is cheaper than 'keys', but I suspect it has something to do
> with the fact that hash keys are shared.

perldoc -f keys:

The returned values are copies of the original keys in the hash, so 
modifying them will not affect the original hash.


perldoc -f values:

Note that the values are not copied, which means modifying them will 
modify the contents of the hash
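As a quick illustration of those two perldoc passages (a sketch; the hash and its contents are invented):

```perl
use strict;
use warnings;

my %h = (a => 1, b => 2);

# values() aliases the hash's values, so modifying $_ writes back:
$_ *= 10 for values %h;

# keys() returns copies, so modifying $_ leaves the hash alone:
$_ = uc for keys %h;

print "a=$h{a} b=$h{b}\n";    # a=10 b=20
```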

-- 
Ruud



------------------------------

Date: Tue, 14 May 2013 19:41:51 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: Iterating hashes
Message-Id: <871u998kuo.fsf@sapphire.mobileactivedefense.com>

Willem <willem@turtle.stack.nl> writes:
> Rainer Weikusat wrote:
> ) Why would this be 'a fair test' when the keys are copied for no
> ) particular reason while no attempt is made to determine the values
> ) except for the 'values' and 'each in list context' cases? Iterating
> ) over the keys of a hash while not looking at the values associated
> ) with those keys at all seems to be a rather bizarre idea of a use
> ) case.
>
> Perl doesn't have a 'set' type, and typically a hash is used for that, and
> that is a perfectly legitimate use case for using only the keys of a
> hash.

While you're of course right on that, I regard this as a 'rather
bizarre' use case for a hash.


------------------------------

Date: Tue, 14 May 2013 20:26:01 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: Iterating hashes
Message-Id: <87wqr1748m.fsf@sapphire.mobileactivedefense.com>

"Dave Saville" <dave@invalid.invalid> writes:
> On Tue, 14 May 2013 14:49:46 UTC, Rainer Weikusat 
> <rweikusat@mssgmbh.com> wrote:

[...]

>> use Benchmark qw(cmpthese);

[...]

>>     cmpthese(-4,

[...]

> ===
> 100 keys
> ===
>                Rate sorted_keys scalar_each list_each  keys values
> sorted_keys  5715/s          --        -28%      -36%  -46%   -85%
> scalar_each  7956/s         39%          --      -10%  -25%   -80%
> list_each    8863/s         55%         11%        --  -16%   -77%
> keys        10602/s         85%         33%       20%    --   -73%
> values      39161/s        585%        392%      342%  269%     --
>
> Care to explain the numbers please?

The first column is the 'speed' in terms of 'executions per second',
the other columns show the 'speed differences' of the code of the
current row relative to the others, expressed as percentage of the
speed of the 'column algorithm'. E.g., for 'keys', the sorted_keys
column says '85%'. This is calculated as

int((10602 - 5715) / 5715 * 100)
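Spelled out as runnable code (the rates are the ones from the table above):

```perl
use strict;
use warnings;

# Per-second execution counts taken from the cmpthese table above:
my %rate = (sorted_keys => 5715, keys => 10602);

# The figure printed in the 'sorted_keys' column of the 'keys' row:
my $pct = int(($rate{keys} - $rate{sorted_keys}) / $rate{sorted_keys} * 100);
print "$pct%\n";    # 85%
```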



------------------------------

Date: Tue, 14 May 2013 21:04:47 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Iterating hashes
Message-Id: <vvha6a-o4q1.ln1@anubis.morrow.me.uk>


Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
> Ben Morrow <ben@morrow.me.uk> writes:
> > Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
> >> 
> >> 	       sorted_keys => sub {
> >> 		   my $v;
> >> 		
> >> 		   $v = $h{$_} for sort(keys(%h));
> >> 	       },
> >> 
> >> 	       values => sub {
> >> 		   1 for values(%h);
> >> 	       },
> >
> > That's hardly a fair comparison. In fact, 'values' coming out faster is
> > a red herring as well: it's only happening because the values are all 1
> > which is much faster to copy than a string.
> >
> > With a fairer test like
> >
> >     use Benchmark qw/cmpthese/;
> >
> >     my %h = map +("$_", "$_"), 1..60_000;
> >     my $x;
> >
> >     cmpthese -5, {
> >         keys    => sub { $x = $_ for keys %h },
> >         sort    => sub { $x = $_ for sort keys %h },
> >         values  => sub { $x = $_ for values %h },
> >         keach   => sub { 1 while $x = each %h },
> >         veach   => sub { 1 while $x = (each %h)[1] },
> >     };
> 
> Why would this be 'a fair test' when the keys are copied for no
> particular reason while no attempt is made to determine the values
> except for the 'values' and 'each in list context' cases? Iterating
> over the keys of a hash while not looking at the values associated
> with those keys at all seems to be a rather bizarre idea of a use
> case.

Reading just the keys of a hash is somewhat common, for instance if the
hash is being used to implement uniq or if the values are not relevant
at the moment. Dave's use case (doing a join over two hashes) starts by
determining a list of the keys present in both hashes, which means first
getting a list of the keys in one hash.

Iterating over just the values, on the other hand, is much less common;
the only use I can remember having made of it is when dealing with a
data structure like

    {
        foo     => { name => "foo", ... },
        bar     => { name => "bar", ... },
    }

where the key is already duplicated inside the value.
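A sketch of that kind of structure (field names invented), where iterating over values() alone loses nothing:

```perl
use strict;
use warnings;

# Each value carries its own key under 'name', so the keys are redundant:
my %obj = (
    foo => { name => "foo", size => 1 },
    bar => { name => "bar", size => 2 },
);

for my $o (values %obj) {
    print "$o->{name}: $o->{size}\n";
}
```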

In any case, I thought the purpose here was to benchmark 'for (keys)'
and 'while (each)' as methods of iterating over a hash. The difference
between them is so small that a single extra assignment, or an
assignment that copies a string rather than a number, will completely
swamp the difference you are trying to measure, at least until you get
to the point where the extra memory allocated by keys causes the process
to start swapping. 

Ben



------------------------------

Date: Tue, 14 May 2013 22:29:07 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: Iterating hashes
Message-Id: <87li7h6yjg.fsf@sapphire.mobileactivedefense.com>

Ben Morrow <ben@morrow.me.uk> writes:

[...]

> In any case, I thought the purpose here was to benchmark 'for (keys)'
> and 'while (each)' as methods of iterating over a hash. The difference
> between them is so small that a single extra assignment, or an
> assignment that copies a string rather than a number, will completely
> swamp the difference you are trying to measure, at least until you get
> to the point where the extra memory allocated by keys causes the process
> to start swapping.

Testing just this, the result is (as I already knew from a past
thread) that keys is (on the computer where I tested this)
significantly faster than each for hashes with up to 10,000 keys.

----------------
use Benchmark qw(cmpthese);

sub traversal_bench
{
    my %h = map { $_, 1; } 0 .. $_[0];
    

    print("\n===\n$_[0] keys\n===\n");

    cmpthese(-5,
	      {
	       keys => sub {
		   1 for keys(%h);
	       },

	       each => sub {
		   my $k;

		   1 while $k = each(%h);
	       }});
}

traversal_bench($_) for 10, 100, 1000, 10000, 100000, 1000000;
-----------------

I'm somewhat uncertain what "it is possible to keep adding unrelated
code to each example until that dominates the execution time so
overwhelmingly that this difference can't be measured anymore" is
supposed to communicate in this context ...


------------------------------

Date: Wed, 15 May 2013 11:32:37 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Iterating hashes
Message-Id: <5r4c6a-9ci2.ln1@anubis.morrow.me.uk>


Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
> Ben Morrow <ben@morrow.me.uk> writes:
> 
> [...]
> 
> > In any case, I thought the purpose here was to benchmark 'for (keys)'
> > and 'while (each)' as methods of iterating over a hash. The difference
> > between them is so small that a single extra assignment, or an
> > assignment that copies a string rather than a number, will completely
> > swamp the difference you are trying to measure, at least until you get
> > to the point where the extra memory allocated by keys causes the process
> > to start swapping.
> 
> Testing just this, the result is (as I already knew from a past
> thread) that keys is (on the computer where I tested this)
> significantly faster than each for hashes with up to 10,000 keys.
> 
> ----------------
> use Benchmark qw(cmpthese);
> 
> sub traversal_bench
> {
>     my %h = map { $_, 1; } 0 .. $_[0];
>     
> 
>     print("\n===\n$_[0] keys\n===\n");
> 
>     cmpthese(-5,
> 	      {
> 	       keys => sub {
> 		   1 for keys(%h);
> 	       },
> 
> 	       each => sub {
> 		   my $k;
> 
> 		   1 while $k = each(%h);

Did you actually read what I wrote? That 'each' benchmark does an extra
assignment, so of *course* it's slower.

Ben



------------------------------

Date: Wed, 15 May 2013 14:03:54 +0300
From: George Mpouras <nospam.gravitalsun.noadsplease@hotmail.noads.com>
Subject: Re: Iterating hashes
Message-Id: <kmvq1d$1mbo$1@news.ntua.gr>

On 14/5/2013 16:37, Dave Saville wrote:
> I am trying to do inner and outer joins amongst other things. The

If you explain what you mean by "inner and outer joins" I could
provide a small piece of code.



------------------------------

Date: Wed, 15 May 2013 12:51:38 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: Iterating hashes
Message-Id: <87txm4bgvp.fsf@sapphire.mobileactivedefense.com>

Ben Morrow <ben@morrow.me.uk> writes:
> Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
>> Ben Morrow <ben@morrow.me.uk> writes:

[...]

>> Testing just this, the result is (as I already knew from a past
>> thread) that keys is (on the computer where I tested this)
>> significantly faster than each for hashes with up to 10,000 keys.
>> 
>> ----------------
>> use Benchmark qw(cmpthese);
>> 
>> sub traversal_bench
>> {
>>     my %h = map { $_, 1; } 0 .. $_[0];
>>     
>> 
>>     print("\n===\n$_[0] keys\n===\n");
>> 
>>     cmpthese(-5,
>> 	      {
>> 	       keys => sub {
>> 		   1 for keys(%h);
>> 	       },
>> 
>> 	       each => sub {
>> 		   my $k;
>> 
>> 		   1 while $k = each(%h);
>
> Did you actually read what I wrote? That 'each' benchmark does an extra
> assignment, so of *course* it's slower.

The 'each' benchmark does not do 'an extra assignment': in order to
use the key (similar to the earlier benchmark, which purposely
included an operation extracting the value in every subroutine
iterating over the keys) in a loop, it has to be stored somewhere
(i.e., 'assigned to something'), and that is part of the cost of using
each, whereas keys can be used with for like any other term returning
a list. Because of this

1. If iterating over the keys and (usually) examining the values is
desired, keys is the sensible choice in almost all cases, and each
becomes the sensible choice once the hashes become 'large'.

2. If iterating over the values is sufficient, values is the way to
go, except possibly for extremely large hashes (> 1,000,000 keys for
this example).

3. If a predictable 'iteration order' is desired/required, there's not
much which can be done except sorting the key list beforehand. (But
when comparing hashes, this can be avoided by iterating over the keys
of one hash, looking up each key in the other hash and deleting it if
it was found. If the loop didn't terminate early because of a
mismatch, whatever is left in the second hash was not contained in the
first.)
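Point 3's comparison trick might be sketched like this (illustrative code, not from the thread; it deletes from a copy so the second hash survives):

```perl
use strict;
use warnings;

# Compare two hashes without sorting: walk the keys of one, delete
# matches from a copy of the other; a mismatch or leftovers mean the
# hashes differ.
sub hashes_equal {
    my ($x, $y) = @_;
    my %rest = %$y;                 # copy, so deleting is harmless

    for my $k (keys %$x) {
        return 0 unless exists $rest{$k} && $rest{$k} eq $x->{$k};
        delete $rest{$k};
    }
    return !%rest;                  # anything left was only in %$y
}

print hashes_equal({ a => 1, b => 2 }, { b => 2, a => 1 })
    ? "same\n" : "differ\n";
```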


------------------------------

Date: Wed, 15 May 2013 12:13:59 +0000 (UTC)
From: "Dave Saville" <dave@invalid.invalid>
Subject: Re: Iterating hashes
Message-Id: <fV45K0OBJxbE-pn2-6RPduI2Bxdz1@paddington.bear.den>

On Wed, 15 May 2013 11:03:54 UTC, George Mpouras 
<nospam.gravitalsun.noadsplease@hotmail.noads.com> wrote:

> On 14/5/2013 16:37, Dave Saville wrote:
> > I am trying to do inner and outer joins amongst other things. The
> 
> if you explain what to you mean with "inner and outer joins" I could 
> provide a small piece of code.
> 

Well, my SQL is a bit rusty :-)

But the code is easy enough.

I have three hashes S, C & H holding 60K+ keys of 70 characters each, 
differing only slightly in that most of the keys are the same. Think 
of a list of fully qualified filenames on your hard drive.

I need (at least) 

In S in H not in C
in C in H not in S
in S not in H
in C not in H
in S in C

With that number of keys I was trying to avoid any unnecessary 
extraction of keys to arrays, such as you can't avoid when you use 
"sort keys".

Ben: I think a DB would be overkill - and don't forget I would not be 
able to install any modules anyway. As you can see above, my 
requirements are not really joins in the traditional sense. I thought 
saying joins would be close enough when the real problem is 
efficiently iterating over multiple hashes.

Thanks all for your thoughts.
-- 
Regards
Dave Saville


------------------------------

Date: Wed, 15 May 2013 13:44:39 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Iterating hashes
Message-Id: <nicc6a-9uj2.ln1@anubis.morrow.me.uk>


Quoth "Dave Saville" <dave@invalid.invalid>:
> 
> I have three hashes S,  C & H holding 60K+ keys of 70 characters and 
> differing only slightly in that much of the keys are the same. Think 
> of a list of fully qualified filenames on your hard drive.
> 
> I need (at least) 
> 
> In S in H not in C
> in C in H not in S
> in S not in H
> in C not in H
> in S in C

You don't need to sort to get these, at which point whether you use
'keys' or 'each' is not terribly important.

    for (keys %S) {
        if (exists $C{$_}) {
            # in S in C
        }
        else {
            if (exists $H{$_}) {
                # in S in H not in C
            }
        }
        # and so on...
    }

By the looks of it you will have to iterate S once and C once; this
should not be a performance problem. If you're using an itty-bitty
machine (I don't know what OS/2 runs on these days) you may find

    while (my $k = each %S) {

is faster; the break-even point is well above 60_000 on my machine, but
might not be on yours.
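For what it's worth, a sketch that collects all five of Dave's lists in one pass over each of %S and %C (the names and tiny sample data are invented):

```perl
use strict;
use warnings;

# Tiny sample data standing in for the real 60K-key hashes:
my %S = (a => 1, b => 1, c => 1);
my %C = (b => 1, d => 1);
my %H = (a => 1, c => 1, d => 1);

my (@s_h_not_c, @c_h_not_s, @s_not_h, @c_not_h, @s_and_c);

for (keys %S) {
    push @s_and_c, $_ if exists $C{$_};
    if (exists $H{$_}) {
        push @s_h_not_c, $_ unless exists $C{$_};
    }
    else {
        push @s_not_h, $_;
    }
}
for (keys %C) {
    push @c_h_not_s, $_ if exists $H{$_} && !exists $S{$_};
    push @c_not_h,   $_ unless exists $H{$_};
}

print "in S in H not in C: @{[sort @s_h_not_c]}\n";
```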

> Ben: I think a DB would be overkill

SQLite isn't really a 'DB' in the traditional sense, it's more of a
lightweight SQL library.

> - and don't forget I would not be able to install any modules anyway.

Grrrmph. There's *almost* no point using Perl at all if you can't use
CPAN...

> As you can see above my 
> requirements are not really joins in the traditional sense. I thought 
> saying joins would be close enough when the real problem is 
> efficiently iterating over multiple hashes.

I would describe them as exactly joins, and the advantage of using an
SQL system is that it has data structures designed to make joins easy. 

Ben



------------------------------

Date: Tue, 14 May 2013 16:42:23 -0700 (PDT)
From: johannes falcone <visphatesjava@gmail.com>
Subject: so how much money do perl programmers make?
Message-Id: <5de4c4de-c965-444d-b762-cc2ad769b04f@googlegroups.com>

curious?

rate get to 75/h w2 easily?

or no?


------------------------------

Date: Tue, 14 May 2013 21:31:26 +0200
From: Manfred Lotz <manfred.lotz@arcor.de>
Subject: Re: utf8
Message-Id: <20130514213126.06dc926f@arcor.com>

On Tue, 14 May 2013 01:10:59 +0200
"Peter J. Holzer" <hjp-usenet3@hjp.at> wrote:

> On 2013-05-13 12:51, Manfred Lotz <manfred.lotz@arcor.de> wrote:
> > On Mon, 13 May 2013 14:05:00 +0300
> > George Mpouras <nospam.gravitalsun.noadsplease@hotmail.noads.com>
> > wrote:
> >> Is there any easy way to decice if a string is valid UTF-8 ?
> >
> > Minimal example:
> >
> > #! /usr/bin/perl
> >
> > use strict;
> > use warnings;
> >
> > use utf8;
> > use Encode;
> >
> > my $string = 'Hä';
> 
> This string is not UTF-8 in any useful sense. It consists of two
> characters, U+0048 LATIN CAPITAL LETTER H and U+00e4 LATIN SMALL
> LETTER A WITH DIAERESIS. The same string encoded in UTF-8 would
> consist of three bytes, "\x{48}\x{C3}\x{A4}". Note that the former
> string has length 2, the latter has length 3.
>

This is only the email. In my test script it is this:

00000050  20 27 48 c3 a4 27 3b 0a  0a 45 6e 63 6f 64 65 3a  | 'H..';..Encode:|




> > Encode::is_utf8($string) or die "bad string";
> 
> This tests whether the internal representation of the string is
> utf-8-like, which you almost never want to know in a Perl program. It
> also tells you whether the string has character semantics (unless you
> use a rather new version of perl with the unicode_strings feature),
> which is sometimes useful.
> 
> If you want to know whether a string is a correctly encoded UTF-8
> sequence, try to decode it:
> 
>     $decoded = eval { decode('UTF-8', $string, FB_CROAK) };
> 
> (decode(..., FB_CROAK) will die if $string is not UTF-8, so you need
> to catch that. All other check parameters are even less convenient).
> 

Aaah, thanks. Didn't know that.

#! /usr/bin/perl
use strict;
use warnings;

use utf8;
use 5.010;

use Encode  qw( decode FB_CROAK );

my $string = 'Hä'; # = 0x48c3a4


my $decoded = decode('utf8', $string, FB_CROAK);


Nevertheless, I'm confused. The above script, where 'Hä' is definitely
0x48c3a4 (verified by hexdump), croaks. Why?

At any rate I have to read perlunitut, perluniintro etc. to understand
what's going on.


-- 
Manfred



------------------------------

Date: Tue, 14 May 2013 21:27:49 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: utf8
Message-Id: <4bja6a-aiq1.ln1@anubis.morrow.me.uk>


Quoth Manfred Lotz <manfred.lotz@arcor.de>:
> On Tue, 14 May 2013 01:10:59 +0200
> "Peter J. Holzer" <hjp-usenet3@hjp.at> wrote:
> > On 2013-05-13 12:51, Manfred Lotz <manfred.lotz@arcor.de> wrote:
> > >
> > > use utf8;
> > > use Encode;
> > >
> > > my $string = 'Hä';
> > 
> > This string is not UTF-8 in any useful sense. It consists of two
> > characters, U+0048 LATIN CAPITAL LETTER H and U+00e4 LATIN SMALL
> > LETTER A WITH DIAERESIS. The same string encoded in UTF-8 would
> > consist of three bytes, "\x{48}\x{C3}\x{A4}". Note that the former
> > string has length 2, the latter has length 3.
[...]
> 
> use utf8;
> use 5.010;
> 
> use Encode  qw( decode FB_CROAK );
> 
> my $string = 'Hä'; # = 0x48c3a4
> 
> 
> my $decoded = decode('utf8', $string, FB_CROAK);
> 
> 
> Nevertheless, I'm confused. Above script where 'Hä' is definitely
> 0x48c3a4 (verified by hexdump) croaks. Why?

That is exactly what Peter was trying to explain. Because of the 'use
utf8', perl has already decoded the UTF-8 in the source code file into
Unicode characters, so $string does *not* contain "\x48\xc3\xa4":
instead it contains "\x48\xe4". The e4 is because 'ä', as a Unicode
character, has ordinal 0xe4. This string, which happens to contain only
bytes though it could easily not have done, is not valid UTF-8, so
decode croaks.

If you had read the same string from a file, this would not have
happened (unless you asked for it with an :encoding layer), nor would it
have happened if you hadn't had 'use utf8'.

Try running these both with and without 'use utf8':

    "\x48\xc3\xa4" eq "Hä"      or warn "unequal";
    "\x48\xe4" eq "Hä"          and warn "equal";
    warn length "Hä";

    "\x48\xc4\x81" eq "Hā"      or warn "unequal";
    "\x48\x{101}" eq "Hā"       or warn "equal";
    warn length "Hā";

(that character is a-macron).

Ben



------------------------------

Date: Wed, 15 May 2013 06:18:52 +0200
From: Manfred Lotz <manfred.lotz@arcor.de>
Subject: Re: utf8
Message-Id: <20130515061852.40bc50d0@arcor.com>

On Tue, 14 May 2013 21:27:49 +0100
Ben Morrow <ben@morrow.me.uk> wrote:

> 
> Quoth Manfred Lotz <manfred.lotz@arcor.de>:
> > On Tue, 14 May 2013 01:10:59 +0200
> > "Peter J. Holzer" <hjp-usenet3@hjp.at> wrote:
> > > On 2013-05-13 12:51, Manfred Lotz <manfred.lotz@arcor.de> wrote:
> > > >
> > > > use utf8;
> > > > use Encode;
> > > >
> > > > my $string = 'Hä';
> > > 
> > > This string is not UTF-8 in any useful sense. It consists of two
> > > characters, U+0048 LATIN CAPITAL LETTER H and U+00e4 LATIN SMALL
> > > LETTER A WITH DIAERESIS. The same string encoded in UTF-8 would
> > > consist of three bytes, "\x{48}\x{C3}\x{A4}". Note that the former
> > > string has length 2, the latter has length 3.
> [...]
> > 
> > use utf8;
> > use 5.010;
> > 
> > use Encode  qw( decode FB_CROAK );
> > 
> > my $string = 'Hä'; # = 0x48c3a4
> > 
> > 
> > my $decoded = decode('utf8', $string, FB_CROAK);
> > 
> > 
> > Nevertheless, I'm confused. Above script where 'Hä' is definitely
> > 0x48c3a4 (verified by hexdump) croaks. Why?
> 
> That is exactly what Peter was trying to explain. Because of the 'use
> utf8', perl has already decoded the UTF-8 in the source code file into
> Unicode characters, so $string does *not* contain "\x48\xc3\xa4":

My mistake was that I believed that perl's internal representation is
utf8 rather than Unicode code points. I thought I had read this in some
perl man page.


-- 
Manfred



------------------------------

Date: Wed, 15 May 2013 10:29:57 +0200
From: Manfred Lotz <manfred.lotz@arcor.de>
Subject: Re: utf8
Message-Id: <20130515102957.7f806c03@arcor.com>

On Tue, 14 May 2013 21:27:49 +0100
Ben Morrow <ben@morrow.me.uk> wrote:

> 
> Quoth Manfred Lotz <manfred.lotz@arcor.de>:
> > On Tue, 14 May 2013 01:10:59 +0200
> > "Peter J. Holzer" <hjp-usenet3@hjp.at> wrote:
> > > On 2013-05-13 12:51, Manfred Lotz <manfred.lotz@arcor.de> wrote:
> > > >
> > > > use utf8;
> > > > use Encode;
> > > >
> > > > my $string = 'Hä';
> > > 
> > > This string is not UTF-8 in any useful sense. It consists of two
> > > characters, U+0048 LATIN CAPITAL LETTER H and U+00e4 LATIN SMALL
> > > LETTER A WITH DIAERESIS. The same string encoded in UTF-8 would
> > > consist of three bytes, "\x{48}\x{C3}\x{A4}". Note that the former
> > > string has length 2, the latter has length 3.
> [...]
> > 
> > use utf8;
> > use 5.010;
> > 
> > use Encode  qw( decode FB_CROAK );
> > 
> > my $string = 'Hä'; # = 0x48c3a4
> > 
> > 
> > my $decoded = decode('utf8', $string, FB_CROAK);
> > 
> > 
> > Nevertheless, I'm confused. Above script where 'Hä' is definitely
> > 0x48c3a4 (verified by hexdump) croaks. Why?
> 
> That is exactly what Peter was trying to explain. Because of the 'use
> utf8', perl has already decoded the UTF-8 in the source code file into
> Unicode characters, so $string does *not* contain "\x48\xc3\xa4":
> instead it contains "\x48\xe4". The e4 is because 'ä', as a Unicode
> character, has ordinal 0xe4. This string, which happens to contain
> only bytes though it could easily not have done, is not valid UTF-8,
> so decode croaks.
> 

Ok, I agree that perl decodes 'ä' (which is utf8 x'c3a4' in the file)
to unicode \x{e4}.

Nevertheless the ä is a valid utf8 char.

This means that the test to check for valid utf8 which Peter proposed
is wrong, as it croaks.

The following snippet:

#!/usr/bin/perl

use strict;
use warnings;

use utf8;

use Test::utf8;

binmode STDOUT, ":utf8";

my $ae = 'ä';

show_char($ae);

sub show_char {
	my $ch = shift;

	print  '-' x 80;
	print "\n";
	print "Char: $ch\n";
	is_valid_string($ch);   # check the string is valid
	is_sane_utf8($ch);      # check not double encoded

	# check the string has certain attributes
	is_flagged_utf8($ch);   # has utf8 flag set
	is_within_ascii($ch);   # only has ascii chars in it
	is_within_latin_1($ch); # only has latin-1 chars in it
}

yields:
--------------------------------------------------------------------------------
Char: ä
ok 1 - valid string test
ok 2 - sane utf8
ok 3 - flagged as utf8
not ok 4 - within ascii
#   Failed test 'within ascii'
#   at ./unicode04.pl line 27.
# Char 1 not ASCII (it's 228 dec / e4 hex)
ok 5 - within latin-1
# Tests were run but no plan was declared and done_testing() was not seen.

which is what I would have assumed.


-- 
Manfred



------------------------------

Date: Wed, 15 May 2013 12:10:22 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: utf8
Message-Id: <u17c6a-joi2.ln1@anubis.morrow.me.uk>


Quoth Manfred Lotz <manfred.lotz@arcor.de>:
> On Tue, 14 May 2013 21:27:49 +0100
> Ben Morrow <ben@morrow.me.uk> wrote:
> > 
> > That is exactly what Peter was trying to explain. Because of the 'use
> > utf8', perl has already decoded the UTF-8 in the source code file into
> > Unicode characters, so $string does *not* contain "\x48\xc3\xa4":
> 
> My mistake was that I believed that perl's internal representation is
> utf8 instead of unicode code point. I thought I had read this in some
> perl man page. 

If you're writing XS (that is, C) then perl's internal representation is
(sometimes) UTF-8. However, if you're writing Perl, you can't see that
(that's what 'internal' means), since Perl presents all strings,
regardless of their internal representation, as sequences of Unicode
characters. Perl's Unicode support wouldn't be much use if it didn't.

Ben



------------------------------

Date: Wed, 15 May 2013 13:03:35 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: utf8
Message-Id: <87ppwsbgbs.fsf@sapphire.mobileactivedefense.com>

Manfred Lotz <manfred.lotz@arcor.de> writes:
> On Tue, 14 May 2013 21:27:49 +0100

[...]

> My mistake was that I believed that perl's internal representation is
> utf8 instead of unicode code point.

perl's internal representation is utf8 which is supposed to be decoded
on demand as necessary. That's not an uncommon implementation choice
for software supposed to interact with 'the real world' (here supposed
to mean 'everything out there on the internet', have a look at the
Mozilla Rust FAQ for a cogent and succinct explanation why this makes
sense) but that's an implementation choice the people who presently
work on this code strongly disagree with: They would prefer a model
where, prior to each internal processing step, a pass over the
complete input data has to be made in order to transform it into "the
super-secret internal perl encoding" and after any internal processing
has been completed, a second pass over all of the data has to be made
in order to decode the 'super secret internal perl encoding' into
something which is useful for anything except being 'super secret' and
'internal to Perl'.

This sort-of makes sense when assuming that perl is an island located
in strange waters and that it will usually keep mostly to itself
(figuratively speaking), and it makes absolutely no sense when 'some perl
code' performs one step of a multi-stage processing pipeline which may
possibly even include other perl code (since not even 'output of perl'
is supposed to be suitable to become 'input of perl').


------------------------------

Date: Wed, 15 May 2013 13:27:05 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: utf8
Message-Id: <phbc6a-0oj2.ln1@anubis.morrow.me.uk>


Quoth Manfred Lotz <manfred.lotz@arcor.de>:
> On Tue, 14 May 2013 21:27:49 +0100
> Ben Morrow <ben@morrow.me.uk> wrote:
> > 
> > That is exactly what Peter was trying to explain. Because of the 'use
> > utf8', perl has already decoded the UTF-8 in the source code file into
> > Unicode characters, so $string does *not* contain "\x48\xc3\xa4":
> > instead it contains "\x48\xe4". The e4 is because 'ä', as a Unicode
> > character, has ordinal 0x34. This string, which happens to contain
> > only bytes though it could easily not have done, is not valid UTF-8,
> > so decode croaks.
> > 
> 
> Ok, I agree that perl decodes 'ä' (which is utf8 x'c3a4' in the file) to
> unicode \x{e4}.
> 
> Nevertheless the ä is a valid utf8 char. 

No, you're confused about the difference between 'UTF-8' and 'Unicode'.

Unicode is a big list of characters, with names and associated semantics
(like 'the lowercase of character 'A' is character 'a''). Each of these
characters has been given a number; some of these numbers are >255, so
it isn't possible to represent a string of Unicode characters directly
with a string of bytes, the way you can with ASCII or Latin-1.

This is a problem, given that files (on most systems) and TCP
connections and so on are defined as strings of bytes. To solve it,
various 'Unicode Transformation Formats' have been invented. The one
usually used on Unix systems and in Internet protocols is called
'UTF-8'; if you feed a string of Unicode characters into a UTF-8 encoder
you get a string of bytes out, and if you feed a string of bytes into a
UTF-8 decoder you either get a string of Unicode characters or you get
an error, if the string of bytes wasn't valid UTF-8.

Perl strings are always strings of Unicode characters[0]. If you want to
represent a string of bytes in Perl, you do so by using a string of
characters all of which happen to have an ordinal value less than 256.
Perl does not make any attempt to keep track of whether a given string
was supposed to be 'a string of bytes' or not: you have to do this
yourself[1]. 

If you read a string from a file (without doing anything special to the
filehandle first), you will always get a string of bytes, because the
Unix file-reading APIs only support files that consist of strings of
bytes. If that string of bytes was supposed to be UTF-8, and you want to
manipulate it as a string of Unicode characters, you have to pass it
through Encode::decode. Since not all strings of bytes are valid UTF-8,
this function can fail; this is what Peter posted.

If you write a string to a file (without...), the characters in the
string are written out directly as bytes. If they all have ordinals
below 256 this will effectively leave the file encoded in ISO8859-1,
since the first 256 Unicode characters have the same numbers as the 256
ISO8859-1 characters. If you try to write a character with ordinal 256
or greater, you will get a warning and stupid behaviour, because there
simply isn't any way to write a byte to a file with a value greater than
255[2]. If you want to write UTF-8 to a file, you have to encode your
string of characters (which may have ordinals >255) using
Encode::encode, which will return a string with all ordinals <256 which
you can write to the file.


So "\x48\xc3\xa4" is valid UTF-8. If you decode it into Unicode
characters, you get the string "\x48\xe4", which is *not* valid UTF-8.
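Peter's decode-based validity check from earlier in the thread could be wrapped up like this (a sketch; the helper name is invented):

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# True if the byte string decodes cleanly as UTF-8. We work on the
# subroutine's own copy of the argument, since decode() with FB_CROAK
# may modify the string it is given.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return defined eval { decode('UTF-8', $bytes, FB_CROAK) };
}

print is_valid_utf8("\x48\xc3\xa4") ? "valid\n" : "invalid\n";    # valid
print is_valid_utf8("\x48\xe4")     ? "valid\n" : "invalid\n";    # invalid
```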

What are you actually trying to do here? That is, why do you think you
need to check if a string is valid UTF-8?

Ben


[0, 1] Historical footnotes: Perl's Unicode support was started in
perl 5.6, and first became usable in 5.8. In the beginning the intention
was that Perl should keep track of whether a given string was a string
of bytes or a string of Unicode characters, and treat the string
differently (for some operations) in each case. This turned out to be a
nightmare, because Perl's dynamic typing system meant that strings kept
being unexpectedly converted from one type to the other, making it very
difficult to predict which behaviour a given operator would actually
use.

After a great deal of argument, the design was eventually changed to the
one I described above, and any remnants of the old design were
designated 'The Unicode Bug'. I believe the first version of perl which
properly fixed the Unicode Bug is 5.14, though there are still functions
in the API which shouldn't really be there. As a rule of thumb, any
function which mentions 'the UTF8 flag' is not a function you should be
using, unless you're trying to work around bugs in an XS module.

[2] The behaviour is stupider than it ought to be: what in fact happens
is that Perl encodes the character as UTF-8 and writes that out. This
will almost certainly make the file unreadable, since some parts will be
in UTF-8 and some parts will not. Properly perl ought to either give a
fatal error or write nothing at all.



------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending perl questions to the -request address, I don't have time to
answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 3948
***************************************

