[32539] in Perl-Users-Digest
Perl-Users Digest, Issue: 3804 Volume: 11
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sun Oct 28 21:09:23 2012
Date: Sun, 28 Oct 2012 18:09:08 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Sun, 28 Oct 2012 Volume: 11 Number: 3804
Today's topics:
perl and indent <billcun@suddenlink.net>
Re: perl and indent <hjp-usenet2@hjp.at>
Re: perl and indent <ignoramus21219@NOSPAM.21219.invalid>
Re: perl and indent <news@lawshouse.org>
Re: perl and indent <jurgenex@hotmail.com>
Simple (Rookie) Question <cashdirect7@gmail.com>
Re: Simple (Rookie) Question <jwcarlton@gmail.com>
Re: Why "Wide character in print"? <ben@morrow.me.uk>
Re: Why "Wide character in print"? <hjp-usenet2@hjp.at>
Re: Why "Wide character in print"? <whynot@pozharski.name>
Re: Why "Wide character in print"? <hjp-usenet2@hjp.at>
Re: Why "Wide character in print"? <hhr-m@web.de>
Re: Why "Wide character in print"? <rweikusat@mssgmbh.com>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Sat, 27 Oct 2012 15:39:28 -0400
From: "Bill Cunningham" <billcun@suddenlink.net>
Subject: perl and indent
Message-Id: <k6hd9e$3b8$1@dont-email.me>
I use indent when I'm writing C code but I just started looking at perl.
I am in the perlintro document. Is there a way to use indent with perl?
Bill
------------------------------
Date: Sat, 27 Oct 2012 23:21:23 +0200
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: perl and indent
Message-Id: <slrnk8ok2j.6nn.hjp-usenet2@hrunkner.hjp.at>
On 2012-10-27 19:39, Bill Cunningham <billcun@suddenlink.net> wrote:
> I use indent when I'm writing C code but I just started looking at perl.
> I am in the perlintro document. Is there a way to use indent with perl?
Use perltidy.
hp
--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
| | | hjp@hjp.at | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
------------------------------
Date: Sat, 27 Oct 2012 16:44:56 -0500
From: Ignoramus21219 <ignoramus21219@NOSPAM.21219.invalid>
Subject: Re: perl and indent
Message-Id: <q66dnUt4w9dFyBHNnZ2dnUVZ_sqdnZ2d@giganews.com>
On 2012-10-27, Peter J. Holzer <hjp-usenet2@hjp.at> wrote:
> On 2012-10-27 19:39, Bill Cunningham <billcun@suddenlink.net> wrote:
>> I use indent when I'm writing C code but I just started looking at perl.
>> I am in the perlintro document. Is there a way to use indent with perl?
>
> Use perltidy.
>
perltidy is a very awesome utility, always doing a first rate job.
i
------------------------------
Date: Sun, 28 Oct 2012 11:12:03 +0000
From: Henry Law <news@lawshouse.org>
Subject: Re: perl and indent
Message-Id: <TuKdnecjEbaYjhDNnZ2dnUVZ8vadnZ2d@giganews.com>
On 27/10/12 20:39, Bill Cunningham wrote:
> I use indent when I'm writing C code but I just started looking at perl.
> I am in the perlintro document. Is there a way to use indent with perl?
I suspect there may be a precise technical meaning of "I use indent"; if
so then I'm about to sound like an idiot ...
My experience of using indentation for Perl programming prompts these
musings, which may be of help to you:
* Your editor, if it's any good, probably has a mode for Perl. I use
Emacs, which definitely does, and also Eclipse via the specifically-Perl
"Epic" plug-in. The editor should handle indentation within braces, and
aligning closing braces properly.
* I find two characters of indentation plenty; three at most. You may
need to customise your editor to get that.
* I advocate using "soft tabs" (spaces) rather than hard tabs.
Occasionally I'll "cat" (Linux) or "type" (Windows) a Perl file just to
have a look at it and the built-in tab is usually far too big (like 8).
* I indent continuations, and also complex "if" logic
    die "some long error message"
      unless $somecondition;
    if ( ($ff =~ /blah/) ||
         ( $condx || $condy ) ||
         $some_other_condition
       ) {
      etc();
    }
* There are different styles of indenting, especially in respect of
braces in if statements. For example there's this:
    if ( $foo )
    { do_stuff();
      more_stuff();
    }
and also this (which is more common)
    if ( $foo ) {
      do_stuff();
      more_stuff();
    }
For my part I like "else" on its own line
    if ( $foo ) {
      do_stuff();
      more_stuff();
    }
    else {
      other_stuff();
    }
... but there are those who hate it!
Other posters have suggested perltidy. I hear good things of it but
have never bothered to look. I suppose if you're hacking someone else's
code ...
--
Henry Law Manchester, England
------------------------------
Date: Sun, 28 Oct 2012 07:56:06 -0700
From: Jürgen Exner <jurgenex@hotmail.com>
Subject: Re: perl and indent
Message-Id: <dohq885cv901ncj8b3afhpe7vjubsrhk70@4ax.com>
"Bill Cunningham" <billcun@suddenlink.net> wrote:
> I use indent when I'm writing C code but I just started looking at perl.
>I am in the perlintro document. Is there a way to use indent with perl?
Any editor used for programming should automatically indent program code
as it fits for the particular programming language. If it doesn't then
probably it misses many other useful features, too, and you may want to
consider using a different editor.
jue
------------------------------
Date: Sun, 28 Oct 2012 21:33:16 +0000 (UTC)
From: Annie <cashdirect7@gmail.com>
Subject: Simple (Rookie) Question
Message-Id: <XnsA0FA9E2CF6981HatinSpam@88.198.244.100>
A kind volunteer wrote an effective perl script for my website ten or
more years back.
The script was called from header.html, and worked flawlessly for
years...until someone decided to mess with the headers and removed the
call to the cgi script.
This is the script - it delivered custom headers based on the visitor's
domain, and worked perfectly:
#!/usr/bin/env perl
use strict;
sub do_default {
    open (HTMLIN, "header-default.html") or die "Error opening default header";
    while (<HTMLIN>) {
        print;
    }
    close HTMLIN;
}
sub do_header {
    my $domain;
    $domain = shift(@_);
    if (open (HTMLIN, "header-$domain.html")) {
        while (<HTMLIN>) {
            print;
        }
        close HTMLIN;
    } else {
        do_default;
    }
}
MAIN: {
    my($domain, $html, $toplevel);
    $domain=$ENV{"REMOTE_HOST"};
    $domain =~ /.+\.(.{1,3})$/;
    $toplevel = $1;
    print "Content-Type: text/html\n\n";
    do_header($toplevel);
}
Can some kind soul provide me with a line of code that will call this
script?
------------------------------
Date: Sun, 28 Oct 2012 17:48:55 -0700 (PDT)
From: Jason C <jwcarlton@gmail.com>
Subject: Re: Simple (Rookie) Question
Message-Id: <108cd766-19fb-49bf-b29b-f1a2e1e65f33@googlegroups.com>
On Sunday, October 28, 2012 5:33:17 PM UTC-4, Annie wrote:
> A kind volunteer wrote an effective perl script for my website ten or
> more years back.
>
> The script was called from header.html, and worked flawlessly for
> years...until someone decided to mess with the headers and removed the
> call to the cgi script.
>
> This is the script - it delivered custom headers based on the visitor's
> domain, and worked perfectly:
>
> #!/usr/bin/env perl
> use strict;
>
> sub do_default {
>     open (HTMLIN, "header-default.html") or die "Error opening default header";
>     while (<HTMLIN>) {
>         print;
>     }
>     close HTMLIN;
> }
>
> sub do_header {
>     my $domain;
>     $domain = shift(@_);
>     if (open (HTMLIN, "header-$domain.html")) {
>         while (<HTMLIN>) {
>             print;
>         }
>         close HTMLIN;
>     } else {
>         do_default;
>     }
> }
>
> MAIN: {
>     my($domain, $html, $toplevel);
>     $domain=$ENV{"REMOTE_HOST"};
>     $domain =~ /.+\.(.{1,3})$/;
>     $toplevel = $1;
>     print "Content-Type: text/html\n\n";
>     do_header($toplevel);
> }
>
> Can some kind soul provide me with a line of code that will call this
> script?
From your description, I think you are using SSI. It's been a LONG time since I've worked with that, but I think you're looking for:
<!--#exec cgi="cgi-bin/script_name.cgi"-->
The path "cgi-bin/script_name.cgi" would vary based on the name and location of the script.
HTH.
------------------------------
Date: Sun, 28 Oct 2012 00:37:07 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Why "Wide character in print"?
Message-Id: <3q7ul9-l7s.ln1@anubis.morrow.me.uk>
Quoth Eric Pozharski <whynot@pozharski.name>:
> with <vi7pl9-ui71.ln1@anubis.morrow.me.uk> Ben Morrow wrote:
>
> > (In theory you can 'use encoding' to specify a different source
> > character encoding, but in practice that pragma has always been buggy
> > and is better avoided.)
>
> Stop spreading FUD.
That was certainly not my intention. My understanding is that 'use
encoding' is liable to cause incorrect behaviour and segfaults; see for
instance
https://rt.perl.org/rt3/Public/Bug/Display.html?id=31923
https://rt.perl.org/rt3/Public/Bug/Display.html?id=36248
https://rt.perl.org/rt3/Public/Bug/Display.html?id=37526
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2009-09/msg00669.html
Incidentally, while looking for those I also found
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2011-03/msg00255.html
which suggests that 'use utf8' is also broken; I didn't know that until
just now, and I'm not sure I entirely believe it.
If you have newer information than me, I'd be happy to change my opinion.
> They need
>
> use encoding ENCNAME Filter => 1;
That installs a source filter; I'm not sure what the effects of that
are, but I wouldn't be surprised if you get the union of any bugs in
'use encoding' and any bugs in 'use utf8'.
> (what I<ENCNAME> could possibly be?) but
>
> * "use utf8" is implicitly declared so you no longer have to "use
> utf8" to "${"\x{4eba}"}++".
I don't believe this is safe either. The pad code (which handles 'my'
variables) isn't utf8-safe, so you can't create 'my' variables with
Unicode names. (The above is a symref to a global; I don't know if the
code handling the names of globals is utf8-safe, but even if it is that
isn't terribly useful.)
Looking at the code in git, it's possible this has been fixed in 5.16; I
haven't been keeping up with core changes recently. However, this isn't
mentioned in perl5160delta, so I suspect that whatever core changes have
been made aren't considered sufficient for full utf8 identifier support.
> > The lexer converts the "å" into a 1-character string which eventually
> > gets passed to 'say', which appends a newline (that is, a character
> > with ordinal 0a) and passes it to the STDOUT filehandle for writing.
>
> That's not a whole story.
>
> {2754:13} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "а" ; Dump $aa'
> SV = PV(0x927a750) at 0x9295fac
> REFCNT = 1
> FLAGS = (POK,pPOK,UTF8)
> PV = 0x9291a08 "\320\260"\0 [UTF8 "\x{430}"]
> CUR = 2
> LEN = 12
(Note: that is not a latin lowercase 'a' in the source, but U+0430
CYRILLIC SMALL LETTER A. On my terminal they look identical, which
confused me for a moment.)
In any case, the result is exactly what I said: the string contains one
(logical) character. If you apply length() to that string it will return
1. (This character happens to be represented internally as two bytes;
that is none of your business.) What do you think I omitted from the
story?
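The distinction between one logical character and its two-byte internal form can be checked directly. The following is only a minimal sketch, using the core Encode module and an escape instead of a literal Cyrillic letter; the variable names are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+0430 CYRILLIC SMALL LETTER A, written as an escape so the source stays ASCII.
my $char_string = "\x{430}";

# length() on a character string counts characters.
my $chars = length $char_string;

# Encoding to UTF-8 yields a byte string; length() now counts bytes.
my $byte_string = encode('UTF-8', $char_string);
my $bytes       = length $byte_string;

printf "characters: %d, UTF-8 bytes: %d\n", $chars, $bytes;
# prints "characters: 1, UTF-8 bytes: 2"
```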
Ben
------------------------------
Date: Sun, 28 Oct 2012 13:32:46 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Why "Wide character in print"?
Message-Id: <slrnk8q9fe.2cj.hjp-usenet2@hrunkner.hjp.at>
On 2012-10-27 23:37, Ben Morrow <ben@morrow.me.uk> wrote:
> Quoth Eric Pozharski <whynot@pozharski.name>:
>> with <vi7pl9-ui71.ln1@anubis.morrow.me.uk> Ben Morrow wrote:
>>
>> > (In theory you can 'use encoding' to specify a different source
>> > character encoding, but in practice that pragma has always been buggy
>> > and is better avoided.)
>>
>> Stop spreading FUD.
>
> That was certainly not my intention. My understanding is that 'use
> encoding' is liable to cause incorrect behaviour and segfaults; see for
> instance
>
> https://rt.perl.org/rt3/Public/Bug/Display.html?id=31923
> https://rt.perl.org/rt3/Public/Bug/Display.html?id=36248
> https://rt.perl.org/rt3/Public/Bug/Display.html?id=37526
> http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2009-09/msg00669.html
>
> Incidentally, while looking for those I also found
>
> http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2011-03/msg00255.html
>
> which suggests that 'use utf8' is also broken; I didn't know that until
> just now, and I'm not sure I entirely believe it.
That doesn't look like a bug in "use utf8" to me, but like a bug in the
code which generates the warnings.
It doesn't help that Tom just dumped a load of gibberish into his mail
without specifying which encoding he was using. I had to guess that he
was using CP1252.
Anyway, with use utf8, the qw[] section of his program is parsed correctly as
("élite", "Ævar", "μῦθος", "mío")
In the error message each character (even those in the printable ASCII
range U+0020 ... U+007E) is "helpfully" given in hex which I agree is
... suboptimal.
> If you have newer information than me, I'd be happy to change my opinion.
Me too, although frankly I see no reason to use encoding even if it
works. It mixes up encoding of the source code and the I/O, which is not
a good idea, IMSHO, and my editor handles UTF-8 just fine, so I don't
see why I should write my perl scripts in a different encoding than
UTF-8. I/O can be handled explicitly by I/O layers or implicitly by
"use open".
>> (what I<ENCNAME> could possibly be?) but
>>
>> * "use utf8" is implicitly declared so you no longer have to "use
>> utf8" to "${"\x{4eba}"}++".
>
> I don't believe this is safe either. The pad code (which handles 'my'
> variables) isn't utf8-safe, so you can't create 'my' variables with
> Unicode names. (The above is a symref to a global; I don't know if the
> code handling the names of globals is utf8-safe, but even if it is that
> isn't terribly useful.)
I'm puzzled about this part of the documentation, too. Why would anybody
want to use a variable ${"\x{4eba}"} ? I am guessing that the variable
is really supposed to be $人, i.e., there is a Han character in the
source code, not a symref.
Is this unsafe? I have occasionally used non-ascii characters in
variable names (mostly Greek characters in physical formulas) together
with use utf8 since 5.8.x and I never noticed a problem. (The only
"problem" I noticed is that the euro sign isn't a word character, so you
can't have a variable $amount_in_€. But then you can't have a variable
$amount_in_$ either, so I guess this is fair ;-))
hp
------------------------------
Date: Sun, 28 Oct 2012 13:45:49 +0200
From: Eric Pozharski <whynot@pozharski.name>
Subject: Re: Why "Wide character in print"?
Message-Id: <slrnk8q6nd.khg.whynot@orphan.zombinet>
with <3q7ul9-l7s.ln1@anubis.morrow.me.uk> Ben Morrow wrote:
> Quoth Eric Pozharski <whynot@pozharski.name>:
>> with <vi7pl9-ui71.ln1@anubis.morrow.me.uk> Ben Morrow wrote:
>>
>>> (In theory you can 'use encoding' to specify a different source
>>> character encoding, but in practice that pragma has always been
>>> buggy and is better avoided.)
>>
>> Stop spreading FUD.
>
> That was certainly not my intention. My understanding is that 'use
> encoding' is liable to cause incorrect behaviour and segfaults; see
> for instance
>
> https://rt.perl.org/rt3/Public/Bug/Display.html?id=31923
C<use threads;> and C<use encoding 'utf8';>. Unexpected(?) edge case?
> https://rt.perl.org/rt3/Public/Bug/Display.html?id=36248
C<use utf8;>, C<use encoding 'utf8';>, and C<use Encode;>. Panic mode?
> https://rt.perl.org/rt3/Public/Bug/Display.html?id=37526
Double encoding.
> http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2009-09/msg00669.html
Monkey wrench.
> http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2011-03/msg00255.html
Works just as expected, see below.
> which suggests that 'use utf8' is also broken; I didn't know that
> until just now, and I'm not sure I entirely believe it. If you have
> newer information than me, I'd be happy to change my opinion.
It's probably not safe to say things like the following in public,
but:
not perl->isa( 'fool-proof' ) or die
(I'm trying to speak Perl here). IOW, Perl has an entry level. And
it's quite high. And one of the steps to get past it is the ability to
read. I don't mean the ability to read code, I mean the ability to RTFM.
The first three examples are clearly (for me) of that type. I have a
couple of scripts that have C<use encoding 'utf8';> (I<STDIN>, I<STDOUT>,
and quote-like operators) and C<use open ':locale';> (other filehandles,
quite risky, but those scripts are not for distribution thus I'm safe
here). Those scripts were started 4.5 years ago (according to logs; I
can't believe it was sarge (thus 5.8.8?)). Anyway, they have worked on
5.10.0, 5.10.1, and 5.14.2, because I got them right. Because I've
carefully read all the unicode documentation that comes with perl
(namely perluniintro.pod, perlunicode.pod, utf8.pod, encoding.pm,
Encode.pm (perlunifaq.pod, perlunitut, and perluniprops.pod weren't
distributed five years ago, should read them too)). I've found that I
don't need utf8.pm (those scripts and modules should be us-ascii anyway).
I feel utf8-safe because, first of all, I can read. If I can, they can
too, can't they? Apparently, they don't, maybe because they can't.
>> They need
>>
>> use encoding ENCNAME Filter => 1;
>
> That installs a source filter; I'm not sure what the effects of that
> are, but I wouldn't be surprised if you get the union of any bugs in
> 'use encoding' and any bugs in 'use utf8'.
>
>> (what I<ENCNAME> could possibly be?) but
>>
>> * "use utf8" is implicitly declared so you no longer have to
>> "use utf8" to "${"\x{4eba}"}++".
BTW, I've checked. There's no C<use utf8>. It's B<require utf8> and no
import. A whole different story.
> I don't believe this is safe either. The pad code (which handles 'my'
> variables) isn't utf8-safe, so you can't create 'my' variables with
> Unicode names. (The above is a symref to a global; I don't know if the
> code handling the names of globals is utf8-safe, but even if it is
> that isn't terribly useful.)
Let me rephrase one famous proverb:
If the answer you've got is 'filter', you're probably asking the wrong
question.
*SKIP*
> In any case, the result is exactly what I said: the string contains
> one (logical) character. If you apply length() to that string it will
> return 1. (This character happens to be represented internally as two
> bytes; that is none of your business.) What do you think I omitted
> from the story?
Right. And that's closely related to your last example (the one about
utf8.pm being unsafe). I've tried to make a point that *characters*
from different *ranges* happen to be of different length in bytes.
{9829:45} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "aàа" ; Dump $aa'
SV = PV(0xa06f750) at 0xa08afac
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0xa086a08 "a\303\240\320\260"\0 [UTF8 "a\x{e0}\x{430}"]
  CUR = 5
  LEN = 12
*Characters* of latin1 aren't wide (even if they are characters, they
are still one byte long)
{10406:65} [0:0]% perl -Mutf8 -wle 'print "[à]"'
[à]
{10415:66} [0:0]% perl -Mutf8 -wle 'print "[а]"'
Wide character in print at -e line 1.
[а]
I must have added those braces, because:
{10421:67} [0:0]% perl -wle 'print "à"' # no problems, just a byte
à
{10477:68} [0:0]% perl -Mutf8 -wle 'print "à"' # oops
{10520:69} [0:0]% perl -Mutf8 -wle 'print "à "' # stupid
à
{10522:70} [0:0]% perl -Mutf8 -wle 'print "\x{E0}"' # oops
{10532:71} [0:0]% perl -Mutf8 -wle 'print "\x{E0} "' # stupid
à
{10602:79} [0:0]% perl -Mutf8 -wle 'print "\N{U+00E0}"' # oops
{10608:80} [0:0]% perl -Mutf8 -wle 'print "\N{U+00E0} "' # stupid
à
But watch this:
{10613:81} [0:0]% perl -Mencoding=utf8 -wle 'print "à"' # hooray!
à
{10645:82} [0:0]% perl -Mencoding=utf8 -wle 'print "\x{E0}"' # oops
�
{10654:83} [0:0]% perl -Mencoding=utf8 -wle 'print "\N{U+00E0}"' # hooray!
à
Except the middle one (what I should think about), I think encoding.pm
wins again.
--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom
------------------------------
Date: Sun, 28 Oct 2012 21:06:52 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Why "Wide character in print"?
Message-Id: <slrnk8r42s.2s7.hjp-usenet2@hrunkner.hjp.at>
On 2012-10-28 11:45, Eric Pozharski <whynot@pozharski.name> wrote:
> with <3q7ul9-l7s.ln1@anubis.morrow.me.uk> Ben Morrow wrote:
>> In any case, the result is exactly what I said: the string contains
>> one (logical) character. If you apply length() to that string it will
>> return 1. (This character happens to be represented internally as two
>> bytes; that is none of your business.) What do you think I omitted
>> from the story?
>
> Right. And that's closely related to your last example (the one about
> utf8.pm being unsafe). I've tried to make a point that *characters*
> from different *ranges* happen to be of different length in bytes.
Then maybe you shouldn't have chosen two examples which both are same
length in bytes.
>
> {9829:45} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "aàа" ; Dump $aa'
> SV = PV(0xa06f750) at 0xa08afac
> REFCNT = 1
> FLAGS = (POK,pPOK,UTF8)
> PV = 0xa086a08 "a\303\240\320\260"\0 [UTF8 "a\x{e0}\x{430}"]
> CUR = 5
> LEN = 12
>
> *Characters* of latin1 aren't wide (even if they are characters, they
> are still one byte long)
In UTF-8, latin-1 characters >= 0x80 are 2 bytes, the same as cyrillic
characters. Your example shows this: "à" (LATIN SMALL LETTER A WITH
GRAVE) is "\303\240" and "а" (CYRILLIC SMALL LETTER A) is "\320\260".
But this isn't what "wide character" in the warning means. In the
warning, it means a string element with a code > 255. For string
elements <= 255, perl can assume that they are supposed to be bytes, not
characters, when you try to write them to a byte stream. It could be
argued that this assumption is a mistake, but for better or worse we are
stuck with that decision. But for string elements > 255, that just isn't
possible. It can't be a byte, it must be a character, and to convert a
character into bytes, the encoding needs to be known.
> {10406:65} [0:0]% perl -Mutf8 -wle 'print "[à]"'
> [à]
> {10415:66} [0:0]% perl -Mutf8 -wle 'print "[а]"'
> Wide character in print at -e line 1.
> [а]
... as these examples demonstrate.
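The "code > 255" rule can be stated as a one-line test. This is only a sketch of the rule as described above, not of how perl implements the warning internally; has_wide_char is a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: true if any element of the string has a code above 255,
# i.e. the string cannot go to a byte stream unless an encoding is specified.
sub has_wide_char {
    my ($string) = @_;
    return $string =~ /[^\x00-\xFF]/ ? 1 : 0;
}

print has_wide_char("\x{E0}")  ? "wide\n" : "not wide\n";   # U+00E0, code 224
print has_wide_char("\x{430}") ? "wide\n" : "not wide\n";   # U+0430, code 1072
# prints "not wide" and then "wide"
```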
> I must have added those braces, because:
>
> {10421:67} [0:0]% perl -wle 'print "à"' # no problems, just a byte
> à
Assuming you use a UTF-8 terminal here: No, this isn't one byte. These are
two bytes, \303\240.
> {10477:68} [0:0]% perl -Mutf8 -wle 'print "à"' # oops
>
Now you have one character (because of -Mutf8, the two bytes \303\240
are decoded to the character U+00e0), but you are trying to write it to a byte
stream without specifying the encoding. Perl writes the single byte
0xE0, which your UTF-8 terminal cannot interpret. (Mine displays a
question mark in a dark circle)
> {10520:69} [0:0]% perl -Mutf8 -wle 'print "à "' # stupid
> à
Huh? What version of Perl on what platform is this? The string is
"\x{E0}\x{20}". All elements of the string are <= 255, so the string is
output as a byte string. This isn't valid UTF-8, and your terminal
shouldn't be able to interpret it as "à" anymore than it was able to
interpret "\x{E0}\x{0A}" above.
[more equivalent examples snipped]
If your program does character I/O, you *need* to specify the encoding
of the I/O channels. For one-liners, the -C option is sufficient:
hrunkner:~/tmp 20:40 :-) 195% perl -CS -Mutf8 -wle 'print "à"'
à
For scripts you would use binmode or 'use open'.
(Didn't you praise yourself on your ability to read? This is documented
and it has been repeated by several people in this newsgroup for years)
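For a script, specifying the encoding could look roughly like this. It is a sketch rather than a recommendation for any particular program; the in-memory file and the names $buffer and $out are assumptions, used only so that the resulting bytes can be inspected:

```perl
use strict;
use warnings;

# Option 1: set the layer on an already-open handle.
binmode(STDOUT, ':encoding(UTF-8)');

# Option 2: name the layer when opening a handle. An in-memory file is used
# here only so the bytes that were written can be examined afterwards.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open failed: $!";
print {$out} "\x{E0}\x{430}";       # U+00E0 followed by U+0430
close($out) or die "close failed: $!";

# $buffer now holds bytes, not characters.
printf "%d bytes: %s\n", length($buffer),
    join(' ', map { sprintf '%02X', ord } split //, $buffer);
# prints "4 bytes: C3 A0 D0 B0"
```

A declaration such as use open ':std', ':encoding(UTF-8)'; near the top of the script sets the same layer as the default for the standard handles and for handles opened later in that scope.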
> But watch this:
>
> {10613:81} [0:0]% perl -Mencoding=utf8 -wle 'print "à"' # hooray!
> à
> {10645:82} [0:0]% perl -Mencoding=utf8 -wle 'print "\x{E0}"' # oops
> �
> {10654:83} [0:0]% perl -Mencoding=utf8 -wle 'print "\N{U+00E0}"' # hooray!
> à
>
> Except the middle one (what I should think about), I think encoding.pm
> wins again.
Excellent example, it shows exactly one of the pitfalls of using "use
encoding". One would expect "\x{E0}" to result in a string with a single
element with code 0xE0. At least you seem to have expected it, and for a
moment I was confused, too. But 'use encoding' doesn't work that way. It
was designed to convert string constants from the specified encoding to
Unicode, so it tries to interpret "\x{E0}" as UTF-8, but of course this
isn't valid UTF-8. So you get "\x{FFFD}" instead (U+FFFD is the
REPLACEMENT CHARACTER used to mark invalid characters).
If you use a correct UTF-8 encoded string, it works as expected (well,
expected by somebody who's read the documentation and remembers that
little pitfall):
hrunkner:~/tmp 20:47 :-) 197% perl -Mencoding=utf8 -wle 'print "\303\240"'
à
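The same two cases can be reproduced without the pragma by calling Encode::decode explicitly. This is a sketch of the conversion described above, under the assumption that an explicit decode is a fair stand-in for what the pragma does to string constants; the variable names are invented:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The lone byte 0xE0 is not valid UTF-8; with the default CHECK mode
# decode() substitutes U+FFFD REPLACEMENT CHARACTER.
my $invalid = decode('UTF-8', "\xE0");

# The byte pair \303\240 is the UTF-8 encoding of U+00E0.
my $valid = decode('UTF-8', "\303\240");

printf "invalid input -> U+%04X\n", ord $invalid;
printf "valid input   -> U+%04X\n", ord $valid;
# prints "invalid input -> U+FFFD" and "valid input   -> U+00E0"
```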
For one-liners like this, using the same encoding for the script and the
I/O is useful ("-CS -Mutf8" is even shorter than "-Mencoding=utf8", but
maybe you don't have a UTF-8 capable terminal). However, for real
programs, I think tying the encoding of the source code to the encoding
of I/O-streams the script is supposed to handle is foolish. My scripts
are always encoded in UTF-8, but they frequently have to handle files in
CP-1252.
hp
------------------------------
Date: Sun, 28 Oct 2012 21:57:20 +0100
From: Helmut Richter <hhr-m@web.de>
Subject: Re: Why "Wide character in print"?
Message-Id: <alpine.LNX.2.00.1210282128030.15993@badwlrz-clhri01.ws.lrz.de>
On Sun, 28 Oct 2012, Peter J. Holzer wrote:
> But this isn't what "wide character" in the warning means. In the
> warning, it means a string element with a code > 255. For string
> elements <= 255, perl can assume that they are supposed to be bytes, not
> characters, when you try to write them to a byte stream.
You have to distinguish what may work sometimes or always, and what is
part of the interface which *should* work. If it does not work in the
latter case, it is an error; if it does not work in the former case you
have made a bad guess about how it is implemented. So do not rely on your
guesses but use the documented interface.
There are two ways to use the interface:
- You regard all strings, both during the run of the script and on
input/output, as bytes (=groups of 8 bits) without any meaning as
characters (=member of an alphabet for writing text). This will work if
all devices, and the script itself, use the same character code, which
must not have bytes with value >255. This *can* be a viable option if
you can either guarantee this restriction, or if your bytes do not
have a character meaning.
In this case, strings in the program text with characters that are not
contained in the common character code are meaningless, and will yield
errors.
- You regard the data during the run of the script as sequences of
characters, and the data on input and output as sequences of bytes. Then
you have to convert bytes into textstrings on input and textstrings into
bytes on output -- in both cases you can specify the conversion once and
for all for each file. This is the only working way when the restrictions
of the last item are not fulfilled.
In this case, strings in the program text may contain any characters
whether or not they are representable in the codes used in input/output.
The "use utf8" pragma tells perl to interpret the program text itself as a
sequence of UTF-8 characters which will make a difference only for literal
strings in the program.
A third way does *not* work:
- You do input and output on strings of bytes and assume that perl will guess
correctly what characters these bytes represent in your opinion.
Unfortunately that will *often* work (because perl assumes ISO-8859-1 on
many systems which may be what you are actually using), but it will also
often break (if you use other codes, or if you mix strings which happen to
contain only ISO-8859-1 characters with strings containing also other
characters). But if it breaks, it is your fault: it is nowhere guaranteed
how text strings map to byte strings and vice versa, the sole exception
being the documented encode and decode functions.
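The second way (bytes at the edges, characters in between) might be sketched as follows with the documented decode and encode functions. The helper name transcode and the CP-1252 sample input are assumptions chosen for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: bytes in one encoding in, bytes in another encoding out.
# Any real processing would happen on $text, which is a character string.
sub transcode {
    my ($input_bytes, $from, $to) = @_;
    my $text = decode($from, $input_bytes);   # bytes -> characters
    return encode($to, $text);                # characters -> bytes
}

# "\xE9" is LATIN SMALL LETTER E WITH ACUTE in CP-1252.
my $out = transcode("caf\xE9", 'cp1252', 'UTF-8');
print join(' ', map { sprintf '%02X', ord } split //, $out), "\n";
# prints "63 61 66 C3 A9"
```

For whole files the same conversion can be declared once per handle with an :encoding(...) layer instead of calling decode and encode by hand.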
This is fairly well explained in
http://search.cpan.org/~dom/perl-5.14.3/pod/perlunitut.pod
--
Helmut Richter
------------------------------
Date: Sun, 28 Oct 2012 21:39:53 +0000
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: Why "Wide character in print"?
Message-Id: <87y5iq8edy.fsf@sapphire.mobileactivedefense.com>
Helmut Richter <hhr-m@web.de> writes:
[...]
> - You regard the data during the run of the script as sequences of
> characters, and the data on input and output as sequences of bytes. Then
> you have to convert bytes into textstrings on input and textstrings into
> bytes on output -- in both cases you can specify the conversion once and
> for all for each file. This is the only working way when the restrictions
> of the last item are not fulfilled.
This is the only 'working way' when the assumption that perl uses a
'secret mystery encoding' different from any other encoding known to
man is taken for granted. But this assumption is wrong and the concept
makes precious little sense since it requires an additional copy of
all input data and all output data (possibly, times the number of perl
processes in a 'long' pipeline since not even perl is supposed to be
able to talk to perl natively). Considering the way perl is
implemented, this is a real problem for users of Windows (and Mac OS
X, AFAIK) because in both cases, perl uses something other than the
native encoding. That some people would like to inflict the same
damage onto users of platforms where the problem doesn't exist is
certainly very laudable but IMNSHO, best ignored.
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V11 Issue 3804
***************************************