[31663] in Perl-Users-Digest


Perl-Users Digest, Issue: 2926 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Apr 29 14:09:24 2010

Date: Thu, 29 Apr 2010 11:09:07 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Thu, 29 Apr 2010     Volume: 11 Number: 2926

Today's topics:
        length in (utf8) characters ? <peter@www.pjb.com.au>
    Re: length in (utf8) characters ? <hhr-m@web.de>
    Re: length in (utf8) characters ? <hhr-m@web.de>
    Re: length in (utf8) characters ? <peter@www.pjb.com.au>
    Re: length in (utf8) characters ? <hhr-m@web.de>
    Re: length in (utf8) characters ? <peter@www.pjb.com.au>
    Re: length in (utf8) characters ? <smallpond@juno.com>
    Re: length in (utf8) characters ? sln@netherlands.com
        thanks <robin1@cnsp.com>
    Re: thanks <uri@StemSystems.com>
    Re: use constant behavior <ben@morrow.me.uk>
    Re: use constant behavior <cartercc@gmail.com>
    Re: use constant behavior <tadmc@seesig.invalid>
    Re: XML Replace <klaus03@gmail.com>
    Re: XML Replace <jurgenex@hotmail.com>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: 29 Apr 2010 11:36:03 GMT
From: Peter Billam <peter@www.pjb.com.au>
Subject: length in (utf8) characters ?
Message-Id: <slrnhtirp4.3o6.peter@box8.pjb.com.au>

I'm confused... in "perldoc length" it says

   if the EXPR is in Unicode, you will get the
   number of characters, not the number of bytes.

which is what I would want.  But (in a one-line demo
of a problem I have in a much larger module):

$> perl -e '$l=length "ö"; print "length=$l\n";'
length=2

But I want to see length=1 here...  (in case your news-client
doesn't do utf8, that string was a o-umlaut) I'm using v5.10.1
on debian squeeze and everything else works fine in utf8.

Regards,  Peter

-- 
Peter Billam       www.pjb.com.au    www.pjb.com.au/comp/contact.html


------------------------------

Date: Thu, 29 Apr 2010 14:59:21 +0200
From: Helmut Richter <hhr-m@web.de>
Subject: Re: length in (utf8) characters ?
Message-Id: <Pine.LNX.4.64.1004291453300.4411@lxhri01.lrz.lrz-muenchen.de>

On Thu, 29 Apr 2010, Peter Billam wrote:

> $> perl -e '$l=length "ö"; print "length=$l\n";'
> length=2
> 
> But I want to see length=1 here...  (in case your news-client
> doesn't do utf8, that string was a o-umlaut) I'm using v5.10.1
> on debian squeeze and everything else works fine in utf8.

What happens:

What you input is two bytes long, and perl does not know that the two 
bytes are meant as one character. perl sees the two characters "Ã¶". If 
you output them as Unicode, you will even see them:

  perl -e '$l=length "ö"; binmode (STDOUT, "utf8"); print "length=$l === ö\n";'

yields

  length=2 === Ã¶

That is, the binary output of the binary string "ö" was two errors that 
compensated each other.

What you mean:

The input file is already to be interpreted as UTF-8. You should tell perl so:

  perl -e 'use utf8; $l=length "ö"; print "length=$l\n";'
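
The byte/character distinction described above can be sketched without relying on how the source file or terminal is encoded. This is only an illustration: the variable names are made up for the example, and the o-umlaut is written as byte escapes on purpose.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of U+00F6 (o-umlaut), written as byte escapes so the
# example does not depend on how this file itself is saved.
my $bytes = "\xC3\xB6";

# Decoding turns the two bytes into one character.
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
# prints "bytes: 2, characters: 1"
```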

-- 
Helmut Richter


------------------------------

Date: Thu, 29 Apr 2010 15:00:46 +0200
From: Helmut Richter <hhr-m@web.de>
Subject: Re: length in (utf8) characters ?
Message-Id: <Pine.LNX.4.64.1004291500210.4411@lxhri01.lrz.lrz-muenchen.de>

On Thu, 29 Apr 2010, Helmut Richter wrote:

> The input file is already to be interpreted as UTF-8. You should tell perl so:

Better: The source file ...

-- 
Helmut Richter


------------------------------

Date: 29 Apr 2010 14:54:29 GMT
From: Peter Billam <peter@www.pjb.com.au>
Subject: Re: length in (utf8) characters ?
Message-Id: <slrnhtj7d5.41r.peter@box8.pjb.com.au>

On 2010-04-29, Helmut Richter <hhr-m@web.de> wrote:
> On Thu, 29 Apr 2010, Helmut Richter wrote:
>> The input file is already to be interpreted as UTF-8.
>> You should tell perl so:
>
> Better: The source file ...

But if I tell perl that the source file is in utf8, then though
it gets the length right :-) it can't print the string out :-(

  $> perl -e 'use utf8; $s="ö"; $l=length $s; print "length $s =$l\n";'
  length  =1

( likewise if I use the code-point: '$s="\x{00f6}"; )

OTOH if I don't use "use utf8" then perl prints correctly :-)
but gets the length wrong :-(

  $> perl -e '$s="ö"; $l=length $s; print "length $s =$l\n";'
  length ö =2

I can't really afford to set the binmode explicitly; the "length"
code and some "print"s are actually in a module, and the strings
are passed to it from some calling program.  So when I code the
module I don't know in advance from what program is going to
be calling it, and whether it's printing into a utf environment.
Does the module really have to test every string and inspect
$ENV{LANG} and $ENV{LC_CTYPE} and change binmode accordingly ?
I had been reading perldoc perluniintro:

   Starting from Perl 5.8.0, the use of "use utf8" is needed only in
   much more restricted circumstances. In earlier releases the "utf8"
   pragma was used to declare that operations in the current block or
   file would be Unicode-aware.  This model was found to be wrong,
   or at least clumsy: the "Unicodeness" is now carried with the data,
   instead of being attached to the operations.

so why is the "print" wrong, if the "Unicodeness" is carried with
the data ?  Perl should know if it's in a utf environment and
printing to a utf8 device; python does, and so does vi, less,
slrn, alpine, firefox and everything else I use (except fmt).

Sorry for being so confused, I realise this must be old stuff :-(
Peter

-- 
Peter Billam       www.pjb.com.au    www.pjb.com.au/comp/contact.html


------------------------------

Date: Thu, 29 Apr 2010 18:02:06 +0200
From: Helmut Richter <hhr-m@web.de>
Subject: Re: length in (utf8) characters ?
Message-Id: <Pine.LNX.4.64.1004291720070.4411@lxhri01.lrz.lrz-muenchen.de>

On Thu, 29 Apr 2010, Peter Billam wrote:

> I can't really afford to set the binmode explicitly; the "length"
> code and some "print"s are actually in a module, and the strings
> are passed to it from some calling program.  So when I code the
> module I don't know in advance from what program is going to
> be calling it, and whether it's printing into a utf environment.
> Does the module really have to test every string and inspect
> $ENV{LANG} and $ENV{LC_CTYPE} and change binmode accordingly ?
> I had been reading perldoc perluniintro:
> 
>    Starting from Perl 5.8.0, the use of "use utf8" is needed only in
>    much more restricted circumstances. In earlier releases the "utf8"
>    pragma was used to declare that operations in the current block or
>    file would be Unicode-aware.  This model was found to be wrong,
>    or at least clumsy: the "Unicodeness" is now carried with the data,
>    instead of being attached to the operations.
> 
> so why is the "print" wrong, if the "Unicodeness" is carried with
> the data ? 

I find the term "Unicodeness" confusing, much more than the distinction of
"character strings" vs. "byte strings" (as in
http://perldoc.perl.org/perlunitut.html). It is *you*, the programmer, who has
to know whether strings are meant as strings of characters or as strings of
bytes. Obviously, your strings are strings of characters. Whether perl stores
them as Unicode or as anything else is not your problem, you cannot know and
you need not know.

Now, when you read from a file or write to a file, it is suddenly important
that you know what encoding is to be used in that file, because the character
strings whose internal encoding you do not know must be constructed from the
bytes in the file (or, on writing, they must be stored as bytes in the file).
As the code used in the file cannot be determined reliably from the name or
the contents of the file, it is you who has to tell perl, either by explicitly
decoding/encoding the strings from/to the code, or by specifying the code as a
layer on open/binmode.
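
A minimal sketch of the layer approach, assuming the output is meant to be UTF-8: the character string is encoded only at the moment it is written. The in-memory filehandle and the names $text and $buffer are choices made for this illustration so that the result can be inspected; a real program would more likely put the layer on a file or on STDOUT.

```perl
use strict;
use warnings;

# One character, given by code point so the source encoding does not matter.
my $text = "\x{00f6}";

# Write through an encoding layer into an in-memory buffer.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer
    or die "cannot open in-memory handle: $!";
print {$out} $text;
close $out or die "cannot close in-memory handle: $!";

printf "characters: %d, bytes written: %d\n", length($text), length($buffer);
# prints "characters: 1, bytes written: 2"
```

With this arrangement a module can work with character strings throughout and leave the choice of layer to whoever opens the handle.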

This is *also* true for STDIN/STDOUT/STDERR. The open pragma
<http://perldoc.perl.org/open.html> might assist you in selecting the right
layers depending on the locale -- if the locale correctly specifies the code
which is by no means guaranteed (e.g. the code may change from one window to
another without being reflected in the locale environment variables). I have
no experience with the open pragma, though, so you have to find your way
through it.

The utf8 pragma has no effect whatsoever on what the program does. It affects
only the interpretation of the bytes in the source code. If your source code
is in UTF-8 and contains "ö", you should use the utf8 pragma if this "ö" means
one character, and you should not use it if it means two bytes (which in turn
will be interpreted as two characters when you (ab)use this byte string in a
context where a character string is needed).
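
Putting the two pieces together might look like the following sketch. It rests on two assumptions that have to hold for the result to be right: the script file itself is saved as UTF-8, and whatever reads STDOUT expects UTF-8.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

my $chars = "ö";    # read by perl as the single character U+00F6

# Assumption: the terminal or file receiving STDOUT expects UTF-8.
binmode STDOUT, ':encoding(UTF-8)';

print "length of $chars = ", length($chars), "\n";
```

The pragma takes care of the literal in the source, and the layer takes care of the output; neither one does the other's job.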

> Perl should know if it's in a utf environment and
> printing to a utf8 device; python does, and so does vi, less,
> slrn, alpine, firefox and everything else I use (except fmt).

Whether perl's choice not to guess the code unless it is told is a good one
is a matter of opinion. It can be tedious in environments
where the same code is used everywhere, including all files and all databases,
but can save your application if this requirement is not met.

I hope that was of some help.

-- 
Helmut Richter


------------------------------

Date: 29 Apr 2010 16:54:30 GMT
From: Peter Billam <peter@www.pjb.com.au>
Subject: Re: length in (utf8) characters ?
Message-Id: <slrnhtjee7.47f.peter@box8.pjb.com.au>

On 2010-04-29, Helmut Richter <hhr-m@web.de> wrote:
> Now, when you read from a file or write to a file, it is suddenly
> important that you know what encoding is to be used in that file, ...
> ... it is you who has to tell perl, ...
> This is *also* true for STDIN/STDOUT/STDERR. The open pragma
> <http://perldoc.perl.org/open.html> might assist you in selecting
> the right layers depending on the locale -- if the locale correctly
> specifies the code which is by no means guaranteed ...
>
> I hope that was of some help.

Thank you Helmut, for explaining so clearly. It also confirms what
I was beginning to work out for myself. So now back to the code...

Thanks for your help, 
Peter

-- 
Peter Billam       www.pjb.com.au    www.pjb.com.au/comp/contact.html


------------------------------

Date: Thu, 29 Apr 2010 13:10:32 -0400
From: Steve C <smallpond@juno.com>
Subject: Re: length in (utf8) characters ?
Message-Id: <hrceim$c28$1@news.eternal-september.org>

Peter Billam wrote:
> I'm confused... in "perldoc length" it says
> 
>    if the EXPR is in Unicode, you will get the
>    number of characters, not the number of bytes.
> 
> which is what I would want.  But (in a one-line demo
> of a problem I have in a much larger module):
> 
> $> perl -e '$l=length "ö"; print "length=$l\n";'
> length=2
> 
> But I want to see length=1 here...  (in case your news-client
> doesn't do utf8, that string was a o-umlaut) I'm using v5.10.1
> on debian squeeze and everything else works fine in utf8.
> 


When I paste the character in my newsreader, I am using ISO-8859-1, not UTF-8.
This works fine:

5.8.8 gives:
perl -e '$s="ö"; $l=length $s; print "length $s =$l\n";'
length ö =1

5.10.0 gives:
perl -e '$s="ö"; $l=length $s; print "length $s =$l\n";'
length ö =1
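
A likely reason for the different result, sketched here rather than verified against the poster's setup: in ISO-8859-1 the o-umlaut is a single byte, while in UTF-8 it takes two, so an undecoded literal has length 1 in one environment and 2 in the other.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The character U+00F6, given by code point.
my $char = "\x{00f6}";

# Encode the same character in both encodings and count the bytes.
my $latin1_len = length(encode('ISO-8859-1', $char));
my $utf8_len   = length(encode('UTF-8', $char));

printf "ISO-8859-1: %d byte(s), UTF-8: %d byte(s)\n", $latin1_len, $utf8_len;
# prints "ISO-8859-1: 1 byte(s), UTF-8: 2 byte(s)"
```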


------------------------------

Date: Thu, 29 Apr 2010 10:13:12 -0700
From: sln@netherlands.com
Subject: Re: length in (utf8) characters ?
Message-Id: <u6fjt55knibpp6p8fi4d4guna876spk88i@4ax.com>

On 29 Apr 2010 11:36:03 GMT, Peter Billam <peter@www.pjb.com.au> wrote:

>I'm confused... in "perldoc length" it says
>
>   if the EXPR is in Unicode, you will get the
>   number of characters, not the number of bytes.
>
>which is what I would want.  But (in a one-line demo
>of a problem I have in a much larger module):
>
>$> perl -e '$l=length "ö"; print "length=$l\n";'
>length=2
>
>But I want to see length=1 here...  (in case your news-client
>doesn't do utf8, that string was a o-umlaut) I'm using v5.10.1
>on debian squeeze and everything else works fine in utf8.
>

I don't see that result, but it could be I'm on Windows.
Check your default PerlIO layers.


c:\temp>perl -e "print qq(length ö = ),length('ö'),qq(\n);"
length ÷ = 1

c:\temp>perl -e "print qq(length \x{00f6} = ),length(qq(\x{00f6})),qq(\n);"
length ÷ = 1

c:\temp>

-sln


------------------------------

Date: Wed, 28 Apr 2010 22:21:43 -0700 (PDT)
From: Robin <robin1@cnsp.com>
Subject: thanks
Message-Id: <15ee85d6-6882-4d13-a9eb-55c2df730cc7@g21g2000yqk.googlegroups.com>

email me with perl questions or if you want me to post your perl
program or source code on a decent web site.....
send mail to robin1@cnsp.com

or robin1@cnsp.com
thanks, ea....
-robin


------------------------------

Date: Thu, 29 Apr 2010 01:55:15 -0400
From: "Uri Guttman" <uri@StemSystems.com>
Subject: Re: thanks
Message-Id: <87r5lypzuk.fsf@quad.sysarch.com>


no thanks.

troll!

-- 
Uri Guttman  ------  uri@stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------


------------------------------

Date: Wed, 28 Apr 2010 19:49:49 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: use constant behavior
Message-Id: <djcma7-bf72.ln1@osiris.mauzo.dyndns.org>


Quoth "Uri Guttman" <uri@StemSystems.com>:
> >>>>> "c" == ccc31807  <cartercc@gmail.com> writes:
> 
>   c> On Apr 27, 5:35 pm, "Uri Guttman" <u...@StemSystems.com> wrote:
>   >> perl constants are really subs with a prototype of no arguments. they
>   >> are converted at compile to their value (and constant folded if
>   >> possible).
> 
>   c> Uri, thanks for your explanation.
> 
>   c> I would then assume that
>   c>  - use constant DIR => 'log_files';
>   c> would be more or less equivalent to
>   c>  - sub DIR { return 'log_files'; }
> 
> sub DIR() { 'log_files' }
> 
> the important difference is the prototype of () so it can be compile
> time converted to a constant. the return isn't needed (i dunno if it
> affects it becoming a proper constant but i doubt it).

    ~/src/perl/perl% perl -MO=Concise -e'sub FOO(){"bar"} FOO'
    3  <@> leave[1 ref] vKP/REFC ->(end)
    1     <0> enter ->2
    2     <;> nextstate(main 2 -e:1) v:{ ->3
    -     <0> ex-const v ->3
    -e syntax OK
    ~/src/perl/perl% perl -MO=Concise -e'sub FOO(){return "bar"} FOO'
    6  <@> leave[1 ref] vKP/REFC ->(end)
    1     <0> enter ->2
    2     <;> nextstate(main 2 -e:1) v:{ ->3
    5     <1> entersub[t1] vKS/TARG,1 ->6
    -        <1> ex-list K ->5
    3           <0> pushmark s ->4
    -           <1> ex-rv2cv sK/128 ->-
    4              <$> gv(*FOO) s ->5
    -e syntax OK

So, yes, it does affect it, at least in 5.10.
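
The same difference can be observed without B::Concise. The sketch below uses made-up sub names and relies on the behaviour described in perlsub: a call that was inlined at compile time keeps its old value after the sub is replaced at run time, while an ordinary call picks up the replacement. Details may vary between perl versions.

```perl
use strict;
use warnings;

sub INLINED     () { 'log_files' }           # eligible for inlining
sub NOT_INLINED () { return 'log_files' }    # explicit return: ordinary call

# Both calls in here are compiled before the redefinition below runs.
sub show { return (INLINED, NOT_INLINED) }

{
    no warnings 'redefine';
    *INLINED     = sub () { 'changed' };
    *NOT_INLINED = sub () { 'changed' };
}

my @seen = show();
print "INLINED: $seen[0], NOT_INLINED: $seen[1]\n";
# prints "INLINED: log_files, NOT_INLINED: changed"
```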

Ben




------------------------------

Date: Thu, 29 Apr 2010 05:54:56 -0700 (PDT)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: use constant behavior
Message-Id: <f86349fb-37de-4d12-a03a-f7b5da4905df@i9g2000yqi.googlegroups.com>

On Apr 28, 2:49 pm, Ben Morrow <b...@morrow.me.uk> wrote:
>     ~/src/perl/perl% perl -MO=Concise -e'sub FOO(){"bar"} FOO'

Ben,

I assume that -MO means to use the O module that provides a generic
interface to Perl Compiler backends, and that =Concise avoids verbose
output. Where would I find the documentation on the options for O and
a description of the output?

Thanks, CC.


------------------------------

Date: Thu, 29 Apr 2010 08:13:43 -0500
From: Tad McClellan <tadmc@seesig.invalid>
Subject: Re: use constant behavior
Message-Id: <slrnhtj17u.5k1.tadmc@tadbox.sbcglobal.net>

ccc31807 <cartercc@gmail.com> wrote:
> On Apr 28, 2:49 pm, Ben Morrow <b...@morrow.me.uk> wrote:
>>     ~/src/perl/perl% perl -MO=Concise -e'sub FOO(){"bar"} FOO'
>
> Ben,
>
> I assume that -MO means to use the O module that provides a generic
> interface to Perl Compiler backends, and that =Concise avoids verbose
> output. Where would I find the documentation on the options for O and
> a description of the output?


You find the docs for the O module the same way you find
the docs for the constant pragma.

In short, you find the docs for any properly installed module with "perldoc".


    perldoc constant

    perldoc O

The docs for O tell how to get the docs for backends too:

    perldoc B::Concise


-- 
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.


------------------------------

Date: Thu, 29 Apr 2010 06:18:37 -0700 (PDT)
From: Klaus <klaus03@gmail.com>
Subject: Re: XML Replace
Message-Id: <88f727c0-4b98-4d7d-ad82-e97021d9b850@t21g2000yqg.googlegroups.com>

On 28 avr, 19:01, Trev <trevor.do...@gmail.com> wrote:
> I'm trying to use Perl to replace a line in a few XML files I have.
>
> Example XML below, I'm wanting to change the Id= part from
> Id="/Local/App/App1" to Id=/App1". I know there's an easy way to do
> this with
> perl alone however

I don't think that processing XML with Perl alone (i.e. without any
module) is easy.

> I'm trying to use XML::Simple
> or any XML plugin for perl.

Have a look first at the excellent web site
Ways to Rome: Processing XML with Perl
http://xmltwig.com/article/ways_to_rome/ways_to_rome.html
(original version by Ingo Macherius, maintained by Michel Rodriguez)

If you don't find a solution there,
then you can always employ a combination of the CPAN modules
XML::Reader and XML::Writer
http://search.cpan.org/~keichner/XML-Reader-0.34/lib/XML/Reader.pm
http://search.cpan.org/~josephw/XML-Writer-0.611/Writer.pm

A sample program would look as follows:

use strict;
use warnings;

use XML::Reader;
use XML::Writer;

my $rdr = XML::Reader->newhd(\*DATA, {filter => 3});
my $wrt = XML::Writer->new(OUTPUT => \*STDOUT,
  NEWLINES => 0, DATA_MODE => 1, DATA_INDENT => 2);

# If, with XML::Writer, you write mixed content XML (that
# is tags and characters in the same level, such as, for ex.:
# <data>abc<sub>def</sub>ghi</data>
# then XML::Writer will abort with message "Mixed content
# not allowed". To allow XML::Writer in this case, you
# will have to alter the parameters to
# XML::Writer->new(NEWLINES=>0, DATA_MODE=>0, DATA_INDENT=>0);
# or even to
# XML::Writer->new(NEWLINES=>1, DATA_MODE=>0, DATA_INDENT=>0);

$wrt->xmlDecl('UTF-8', 'no');

while ($rdr->iterate) {
    my $tag = $rdr->tag;
    my $val = $rdr->value;
    my %att = %{$rdr->att_hash};

    if ($rdr->path eq '/Profile/Application'
    and defined $att{Id}) {
        # change '/../../zzz' into 'zzz'
        $att{Id} =~ s{\A .* /}''xms;
    }

    if ($rdr->is_start) { $wrt->startTag($tag, %att); }
    if ($val ne '')     { $wrt->characters($val);     }
    if ($rdr->is_end)   { $wrt->endTag($rdr->tag);    }
}

$wrt->end();

__DATA__
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<Profile
  xmlns="xxxxxxxxx"
  name=""
  version="1.1"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

  <Application
    Name="App1" Id="/Local/App/App1" Services="1"
    policy=""   StartApp=""  Bal="5" sessInt="500"
    WaterMark="1.0"/>

  <Application
    Name="App99" Id="/Dummy/Test/iii" Services="3"
    policy="99"  StartApp="2" Bal="7" sessInt="27"
    WaterMark="4.3"/>

  <Application
    Name="Yyee"  Id="/Dat/Inp/Out"    Services="5"
    policy="88"  StartApp=""  Bal="1" sessInt="8"
    WaterMark="2.1"/>

  <AppProfileGuid>586e3456dt</AppProfileGuid>
  <AppProfileGuid>a46y2hktt7</AppProfileGuid>
  <AppProfileGuid>mi6j77mae6</AppProfileGuid>
</Profile>


------------------------------

Date: Thu, 29 Apr 2010 07:42:00 -0700
From: Jürgen Exner <jurgenex@hotmail.com>
Subject: Re: XML Replace
Message-Id: <mf6jt5hbqtr8ugc18v2di4stdmkv320dd7@4ax.com>

Klaus <klaus03@gmail.com> wrote:
>I don't think that processing XML with Perl alone (i.e. without any
>module) is easy.

Well, XML is a rather straightforward, well structured language. If you
are familiar with compiler construction then it should be no big deal. At
least much easier to parse than let's say C or Perl itself or even HTML
(there are too many special cases in HTML).

jue


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 2926
***************************************
