[32540] in Perl-Users-Digest
Perl-Users Digest, Issue: 3805 Volume: 11
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Mon Oct 29 16:09:23 2012
Date: Mon, 29 Oct 2012 13:09:07 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Mon, 29 Oct 2012 Volume: 11 Number: 3805
Today's topics:
array <nospam@nspam.invalid>
basic perl question <dviswa@gmail.com>
Converting www to http://www (regex negative look-behin <jwcarlton@gmail.com>
Re: Converting www to http://www (regex negative look-b <derykus@gmail.com>
Re: Converting www to http://www (regex negative look-b <jwcarlton@gmail.com>
Re: Converting www to http://www (regex negative look-b <dave@invalid.invalid>
Re: Converting www to http://www (regex negative look-b <rweikusat@mssgmbh.com>
Get the decimal separator from Windows mathieu.hedard@gmail.com
Re: Get the decimal separator from Windows <jimsgibson@gmail.com>
Parallel execution framework? <ignoramus13803@NOSPAM.13803.invalid>
Re: perl and indent <billcun@suddenlink.net>
Re: perl and indent <news@lawshouse.org>
Re: Simple (Rookie) Question <intergroup@grosvenor.net>
Re: Simple (Rookie) Question <jwcarlton@gmail.com>
Re: Simple (Rookie) Question <contratrick@126.com>
Re: Why "Wide character in print"? <hjp-usenet2@hjp.at>
Re: Why "Wide character in print"? <hjp-usenet2@hjp.at>
Re: Why "Wide character in print"? <hhr-m@web.de>
Re: Why "Wide character in print"? <rweikusat@mssgmbh.com>
Re: Why "Wide character in print"? <whynot@pozharski.name>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Mon, 29 Oct 2012 15:56:57 -0500
From: "Bill Cunningham" <nospam@nspam.invalid>
Subject: array
Message-Id: <k6mn24$t73$1@dont-email.me>
What's wrong in this code?
use strict;
use warnings;
my @cats=["striper","snowball"];
print $cats[0];
print $cats[1];
Bill
------------------------------
Date: Mon, 29 Oct 2012 15:01:20 -0500
From: vis29 <dviswa@gmail.com>
Subject: basic perl question
Message-Id: <RJedneSgtfYNfRPNnZ2dnUVZ_jidnZ2d@giganews.com>
I am going through a Perl program and have the following lines, which I don't understand. Please explain what these lines do and why.
if (/^$hash{server}/) {
$name="not accessible";
}
------------------------------
Date: Sun, 28 Oct 2012 20:01:30 -0700 (PDT)
From: Jason C <jwcarlton@gmail.com>
Subject: Converting www to http://www (regex negative look-behind)
Message-Id: <50b0bcb8-20f4-4734-8e3b-9c390d9dadce@googlegroups.com>
I'm using URI::Find to create links when a user submits a website address. Meaning, this:
http://www.google.com
becomes:
<a href="http://www.google.com">http://www.google.com</a>
Every once in awhile, someone will submit a link without the http://, though, so I'm trying to add it using regex:
$text =~ s#(?<!http://)www\.#http://www\.#gi;
This works, but adds another new problem; when someone submits a secure link, the https isn't matched, so this:
https://www.google.com
becomes:
https://http//www.google.com
(yes, there's not a colon after the http; I'm not sure why)
I tried making the s optional, but then have a syntax error regarding the look-behind:
$text =~ s#(?<!http(s)*://)www\.#http$2://www\.#gi;
Any suggestions on how I might modify the look-behind to recognize both https and http?
------------------------------
Date: Sun, 28 Oct 2012 22:09:43 -0700 (PDT)
From: "C.DeRykus" <derykus@gmail.com>
Subject: Re: Converting www to http://www (regex negative look-behind)
Message-Id: <959fb049-b316-49a8-9304-6893adc67c8a@googlegroups.com>
On Sunday, October 28, 2012 8:01:30 PM UTC-7, Jason C wrote:
> I'm using URI::Find to create links when a user submits a website address. Meaning, this:
>
> http://www.google.com
>
> becomes:
>
> <a href="http://www.google.com">http://www.google.com</a>
>
> Every once in awhile, someone will submit a link without the http://, though, so I'm trying to add it using regex:
>
> $text =~ s#(?<!http://)www\.#http://www\.#gi;
>
> This works, but adds another new problem; when someone submits a secure link, the https isn't matched, so this:
>
> https://www.google.com
>
> becomes:
>
> https://http//www.google.com
>
> (yes, there's not a colon after the http; I'm not sure why)
>
> I tried making the s optional, but then have a syntax error regarding the look-behind:
>
> $text =~ s#(?<!http(s)*://)www\.#http$2://www\.#gi;
>
> Any suggestions on how I might modify the look-behind to recognize both https and http?
Have you considered just filtering https out:
if ( $text !~ m{^ https:// }x ) {
... your regex...
}
But, I'd suggest using URI or maybe methods within
URI::Find to identify the input's scheme and other
uri components and then reconstruct what you want
from those pieces.
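The https filter can also be folded into the substitution itself. What follows is only a minimal sketch, assuming that http:// and https:// are the only prefixes that need to be recognised. Look-behinds traditionally have to be fixed-width, so two of them are stacked instead of making the "s" optional. The helper name add_scheme and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: prefix a bare "www." with "http://" unless it is
# already preceded by "http://" or "https://".
sub add_scheme {
    my ($text) = @_;
    # Two fixed-width negative look-behinds, one per scheme.
    $text =~ s{(?<!http://)(?<!https://)www\.}{http://www.}gi;
    return $text;
}

print add_scheme($_), "\n" for (
    'www.google.com',
    'http://www.google.com',
    'https://www.google.com',
);
```

Only the first sample string should come back changed; the other two already carry a scheme and are left alone.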
--
Charles DeRykus
------------------------------
Date: Mon, 29 Oct 2012 02:15:58 -0700 (PDT)
From: Jason C <jwcarlton@gmail.com>
Subject: Re: Converting www to http://www (regex negative look-behind)
Message-Id: <707c5cf9-c53c-4d25-8501-8c903297ddcd@googlegroups.com>
On Monday, October 29, 2012 1:09:43 AM UTC-4, C.DeRykus wrote:
> On Sunday, October 28, 2012 8:01:30 PM UTC-7, Jason C wrote:
> > Any suggestions on how I might modify the look-behind to recognize both https and http?
>
> Have you considered just filtering https out:
>
> if ( $text !~ m{^ https:// }x ) {
> ... your regex...
> }
No, my alternative was a second "fix" regex afterward, to correct the https://http//:
$text =~ s#(?<!http://)www\.#http://www\.#gi;
$text =~ s#https://http(:)*//#https://#gi;
Which seems very messy to me, but it's a temporary fix in the hopes of making the original regex better. If nothing else comes up, your filter looks smarter than my fix, though, so thanks for the idea.
> But, I'd suggest using URI or maybe methods within
> URI::Find to identify the input's scheme and other
> uri components and then reconstruct what you want
> from those pieces.
URI::Find does have a ::Schemeless method that is less strict:
http://search.cpan.org/~mschwern/URI-Find-20111103/lib/URI/Find/Schemeless.pm
It's a little too loose, though, since I don't really want to capture all potential links (like references to command.com or ftp://whatever). Further, I made a note a few years ago that it didn't catch weird TLD's like youtu.be, so it's still not a perfect fix.
For now, the filter you suggested seems like my best bet, unless someone can suggest a modification to my regex that might process faster. Thanks, Charles :-)
------------------------------
Date: Mon, 29 Oct 2012 10:53:08 +0000 (UTC)
From: "Dave Saville" <dave@invalid.invalid>
Subject: Re: Converting www to http://www (regex negative look-behind)
Message-Id: <fV45K0OBJxbE-pn2-QlJfnyFQzvs4@localhost>
On Mon, 29 Oct 2012 03:01:30 UTC, Jason C <jwcarlton@gmail.com> wrote:
> I'm using URI::Find to create links when a user submits a website address. Meaning, this:
>
> http://www.google.com
>
> becomes:
>
> <a href="http://www.google.com">http://www.google.com</a>
>
> Every once in awhile, someone will submit a link without the http://, though, so I'm trying to add it using regex:
>
> $text =~ s#(?<!http://)www\.#http://www\.#gi;
$text = 'http://'.$text unless $text =~ m/^https*:/;
--
Regards
Dave Saville
------------------------------
Date: Mon, 29 Oct 2012 14:53:07 +0000
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: Converting www to http://www (regex negative look-behind)
Message-Id: <87boflxrcc.fsf@sapphire.mobileactivedefense.com>
"Dave Saville" <dave@invalid.invalid> writes:
> On Mon, 29 Oct 2012 03:01:30 UTC, Jason C <jwcarlton@gmail.com> wrote:
>
>> I'm using URI::Find to create links when a user submits a website address. Meaning, this:
>>
>> http://www.google.com
>>
>> becomes:
>>
>> <a href="http://www.google.com">http://www.google.com</a>
>>
>> Every once in awhile, someone will submit a link without the http://, though, so I'm trying to add it using regex:
>>
>> $text =~ s#(?<!http://)www\.#http://www\.#gi;
>
> $text = 'http://'.$text unless $text =~ m/^https*:/;
Slightly more general:
$text =~ /^[A-Za-z][-+.A-Za-z0-9]*:\/\// or $text = 'http://'.$text;
This will prepend http:// if the string does not start with 'scheme'
as defined by RFC2396, followed by the two characters (//) signalling
the start of a 'net path'.
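Wrapped in a subroutine, the line above can be tried on a few inputs. This is a sketch only: the name ensure_scheme and the sample strings are invented for illustration, and m{} is used instead of // purely to avoid escaping the slashes.

```perl
use strict;
use warnings;

# Hypothetical helper around the scheme test shown above.
sub ensure_scheme {
    my ($text) = @_;
    # A letter, then letters, digits, "+", "-" or ".", then "://".
    $text =~ m{^[A-Za-z][-+.A-Za-z0-9]*://} or $text = 'http://' . $text;
    return $text;
}

for my $in ('www.google.com', 'https://www.google.com', 'ftp://example.org/x') {
    print ensure_scheme($in), "\n";
}
```

Unlike the look-behind approach, this treats the whole string as one address, so it only suits input that has already been cut down to a single link.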
------------------------------
Date: Mon, 29 Oct 2012 06:45:17 -0700 (PDT)
From: mathieu.hedard@gmail.com
Subject: Get the decimal separator from Windows
Message-Id: <6491f4ad-e658-4ebd-9508-9cea9b14df03@googlegroups.com>
Hi,
I am writing a script to process files on different computers with various languages.
And I wonder if there is a way for my script to find out which decimal separator the current user has.
Thanks for your answers.
------------------------------
Date: Mon, 29 Oct 2012 11:38:00 -0700
From: Jim Gibson <jimsgibson@gmail.com>
Subject: Re: Get the decimal separator from Windows
Message-Id: <291020121138006287%jimsgibson@gmail.com>
In article <6491f4ad-e658-4ebd-9508-9cea9b14df03@googlegroups.com>,
<mathieu.hedard@gmail.com> wrote:
> Hi,
>
> I am writing a script to process files on different computers with various languages.
> And I wonder if there is a way for my script to find out which decimal separator the
> current user has.
I don't really know, but a search for 'perl decimal separator' led to:
perldoc -f sprintf
and this:
"If "use locale" is in effect and POSIX::setlocale() has been
called, the character used for the decimal separator in
formatted floating-point numbers is affected by the LC_NUMERIC
locale. See perllocale and POSIX."
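Going by that passage, one way to read the separator directly might be POSIX::localeconv(). This is a sketch under assumptions, not something tested on Windows: it assumes setlocale() with an empty string picks up the user's settings from the environment, and the fallback to "." is my own choice.

```perl
use strict;
use warnings;
use POSIX qw(setlocale localeconv LC_NUMERIC);

# Adopt the numeric locale from the environment
# (an empty string means "use whatever the environment says").
setlocale(LC_NUMERIC, '');

# localeconv() returns a hash reference describing number formatting.
my $conv = localeconv();
my $sep  = $conv->{decimal_point};

# Assumption: fall back to "." if the locale reports nothing usable.
$sep = '.' unless defined $sep && length $sep;

print "decimal separator: $sep\n";
```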
--
Jim Gibson
------------------------------
Date: Sun, 28 Oct 2012 22:57:17 -0500
From: Ignoramus13803 <ignoramus13803@NOSPAM.13803.invalid>
Subject: Parallel execution framework?
Message-Id: <xK2dnaVWb-QAYxDNnZ2dnUVZ_sSdnZ2d@giganews.com>
I have some tasks that are best done in parallel (because they involve
a lot of waiting for the remote servers and networks).
For the last 7 years or so, I have been using
Parallel::ForkManager. It is nice because it is so simple, however, it
has its limitations.
I would like to see if there are any suggestions for a parallel
framework that stays within one perl machine, but uses multithreading.
By framework, I mean something that takes care of parallel stuff in
some straightforward way, kind of like Parallel::ForkManager.
All suggestions will be appreciated. Say, has anyone used Parallel::Loops?
Thanks
i
------------------------------
Date: Mon, 29 Oct 2012 12:35:58 -0500
From: "Bill Cunningham" <billcun@suddenlink.net>
Subject: Re: perl and indent
Message-Id: <k6mepn$21d$1@dont-email.me>
Henry Law wrote:
> I suspect there may be a precise technical meaning of "I use indent";
> if so then I'm about to sound like an idiot ...
>
> My experience of using indentation for Perl programming prompts these
> musings, which may be of help to you:
>
> * Your editor, if it's any good, probably has a mode for Perl. I use
> Emacs, which definitely does, and also Eclipse via the
> specifically-Perl "Epic" plug-in. The editor should handle
> indentation within braces, and aligning closing braces properly.
>
> * I find two characters of indentation plenty; three at most. You may
> need to customise your editor to get that.
>
> * I advocate using "soft tabs" (spaces) rather than hard tabs.
> Occasionally I'll "cat" (Linux) or "type" (Windows) a Perl file just
> to have a look at it and the built-in tab is usually far too big
> (like 8).
[snip]
I use nano. I'll check more into it. I'm new to Perl. I'm just now looking
at it as a second language along with C. So far it looks C like.
Bill
------------------------------
Date: Mon, 29 Oct 2012 18:51:24 +0000
From: Henry Law <news@lawshouse.org>
Subject: Re: perl and indent
Message-Id: <88-dnQGTVuasTRPNnZ2dnUVZ7tednZ2d@giganews.com>
On 29/10/12 17:35, Bill Cunningham wrote:
> I use nano. I'll check more into it. I'm new to Perl. I'm just now looking
> at it as a second language along with C. So far it looks C like.
Bill, nice to have you aboard. Yes, it's C-like, and many of the
differences are (IMO) improvements. But there are people who say "Don't
write C in Perl; learn to write Perl" and if you read people's code
you'll see -- at least partly -- what they mean.
For example (and I'm only an amateur at this), the C structure for an
"if" statement translates to Perl as this:
if ($some_condition) {
do_stuff();
}
That works, of course, but you'll find that
do_stuff() if $some_condition;
is far easier to code and (IMO again) easier to understand. Better
still, C has no direct analogue to:
do_stuff() unless $some_other_condition;
... which is clearer if the normal value for $some_other_condition is
FALSE.
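Here is a small runnable sketch putting the three forms side by side. do_stuff() and both condition variables are placeholders, as in the snippets above, and the log array exists only so the result can be printed.

```perl
use strict;
use warnings;

my @log;
sub do_stuff { push @log, shift }   # placeholder: just records the call

my $some_condition       = 1;
my $some_other_condition = 0;

# C-style block form
if ($some_condition) {
    do_stuff('block if');
}

# Statement-modifier forms
do_stuff('modifier if')     if     $some_condition;
do_stuff('modifier unless') unless $some_other_condition;

print "$_\n" for @log;
```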
--
Henry Law Manchester, England
------------------------------
Date: Sun, 28 Oct 2012 20:52:10 -0600
From: William Humpboys <intergroup@grosvenor.net>
Subject: Re: Simple (Rookie) Question
Message-Id: <5nrr88h7c1mof2eq410mvc386dkq08c46k@4ax.com>
On Sun, 28 Oct 2012 17:48:55 -0700 (PDT), Jason C
<jwcarlton@gmail.com> wrote:
>From your description, I think you are using SSI. It's been a LONG time since I've worked with that, but I think you're looking for:
>
><!--#exec cgi="cgi-bin/script_name.cgi"-->
>
>The path "cgi-bin/script_name.cgi" would vary based on the name and location of the script.
In my case, that becomes:
<!--#exec cgi="/find_header"-->
and generates an apache error
"[an error occurred while processing this directive]"
http://nizkor.org/test.html
------------------------------
Date: Sun, 28 Oct 2012 21:16:28 -0700 (PDT)
From: Jason C <jwcarlton@gmail.com>
Subject: Re: Simple (Rookie) Question
Message-Id: <2606eba8-776e-472a-a2f1-5fba43b00954@googlegroups.com>
On Sunday, October 28, 2012 10:52:35 PM UTC-4, William Humpboys wrote:
> On Sun, 28 Oct 2012 17:48:55 -0700 (PDT), Jason C
>
> >From your description, I think you are using SSI. It's been a LONG time since I've worked with that, but I think you're looking for:
> >
> ><!--#exec cgi="cgi-bin/script_name.cgi"-->
> >
> >The path "cgi-bin/script_name.cgi" would vary based on the name and location of the script.
>
> In my case, that becomes:
>
> <!--#exec cgi="/find_header"-->
>
> and generates an apache error
>
> "[an error occurred while processing this directive]"
>
> http://nizkor.org/test.html
Are you sure that the name of the script is "find_header", and not "find_header.cgi" or "find_header.pl"? You might have your program set to not show extensions.
Do you have access to your site's error log? If so, that will probably give you more insight on why you're getting this error.
------------------------------
Date: Mon, 29 Oct 2012 04:57:08 +0000 (UTC)
From: Annie <contratrick@126.com>
Subject: Re: Simple (Rookie) Question
Message-Id: <XnsA0FAE96C6FE98proggiessuck@88.198.244.100>
Jason C <jwcarlton@gmail.com> wrote in
news:2606eba8-776e-472a-a2f1-5fba43b00954@googlegroups.com:
>> http://nizkor.org/test.html
>
> Are you sure that the name of the script is "find_header", and not
> "find_header.cgi" or "find_header.pl"? You might have your program set
> to not show extensions.
>
> Do you have access to your site's error log? If so, that will probably
> give you more insight on why you're getting this error.
Stupid me - thanks for pointing that out. I've changed it to correct the
omission, and now get a blank page. Thanks for the help - I'll keep
playing with it and see if I can figure out the rest.
Cheers
--
Obama Voters Are Ignorami:
http://spectator.org/blog/2012/09/25/obama-voters-ignoramuses
In His Own Words: Barack Obama Reviewed
http://www.youtube.com/watch?v=o8R5GvwUFU8
------------------------------
Date: Mon, 29 Oct 2012 10:16:37 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Why "Wide character in print"?
Message-Id: <slrnk8sibl.p8i.hjp-usenet2@hrunkner.hjp.at>
On 2012-10-28 20:57, Helmut Richter <hhr-m@web.de> wrote:
> On Sun, 28 Oct 2012, Peter J. Holzer wrote:
>
>> But this isn't what "wide character" in the warning means. In the
>> warning, it means a string element with a code > 255. For string
>> elements <= 255, perl can assume that they are supposed to be bytes, not
>> characters, when you try to write them to a byte stream.
>
> You have to distinguish what may work sometimes or always, and what is
> part of the interface which *should* work. If it does not work in the
> latter case, it is an error; if it does not work in the former case you
> have made a bad guess about how it is implemented. So do not rely on your
> guesses but use the documented interface.
I was careful to use the term "string element" and avoid the terms
"byte" and "character" when talking about the things a string is
composed of.
Perl has two types of strings: Character strings (often called utf8
strings in the documentation) and byte strings. Character strings are
composed of 32-bit entities, each denoting a unicode code point. So
"\x{1f42a}" is a string with the single character DROMEDARY CAMEL.
Byte strings are just that: Strings of uninterpreted bytes. Any
semantics assigned to them is semantics of the program, not of the Perl
language (this isn't quite correct: character oriented functions like lc
or character classes in regexps do work on them, but only for ASCII).
These differences are documented, and I consider them part of the
interface, although some members of p5p consider the distinction a bug
and try to remove it.
However, for the warning "Wide character in print" this is irrelevant.
Perl doesn't distinguish between character and byte strings when writing
them to a file handle. For both the strings "\x{E0}" (a byte string) and
"\N{U+00E0}" (a character string), if you write them to a raw file
handle, the single byte 0xE0 will be written. Both will be converted to
two bytes 0xC3 0xA0 if you write them to a file handle with the
":encoding(UTF-8)" layer. And so on. But for strings with elements >
255, it simply isn't possible to write a single byte with this value to
a byte stream, because a byte has only 8 bits (on the platforms we care
about). So Perl prints a warning and encodes the string in UTF-8 (or
just copies its internal representation, which happens to be the same
thing). I would argue that perl should die() instead, but this has been
the observed and documented behaviour since 5.8.0, so I doubt it will
change.
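The behaviour described above can be watched with in-memory file handles. This is a minimal sketch: the helper bytes_written is made up for the demonstration, and utf8::upgrade stands in for "\N{U+00E0}" as the way to get a character string holding the same element.

```perl
use strict;
use warnings;

# Hypothetical helper: print a string to an in-memory file handle opened
# with the given layer; return the bytes written (as hex) and the number
# of warnings raised while doing so.
sub bytes_written {
    my ($layer, $string) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $buf = '';
    open my $fh, ">$layer", \$buf or die "open: $!";
    print {$fh} $string;
    close $fh;
    return (unpack('H*', $buf), scalar @warnings);
}

my $byte_str = "\xE0";                            # byte string
my $char_str = "\xE0"; utf8::upgrade($char_str);  # same element, character string
my $wide     = "\x{430}";                         # element > 255

for my $case (
    [ ':raw',             $byte_str ],
    [ ':raw',             $char_str ],
    [ ':encoding(UTF-8)', $byte_str ],
    [ ':raw',             $wide     ],
) {
    my ($hex, $nwarn) = bytes_written(@$case);
    printf "%-18s -> %s (warnings: %d)\n", $case->[0], $hex, $nwarn;
}
```

Only the last case should produce a warning, since it is the only one where an element does not fit into a byte.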
[Rest snipped. All true, but IMHO not very relevant to this thread].
hp
--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
| | | hjp@hjp.at | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
------------------------------
Date: Mon, 29 Oct 2012 10:43:44 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Why "Wide character in print"?
Message-Id: <slrnk8sjug.p8i.hjp-usenet2@hrunkner.hjp.at>
On 2012-10-28 21:39, Rainer Weikusat <rweikusat@mssgmbh.com> wrote:
> Helmut Richter <hhr-m@web.de> writes:
>> - You regard the data during the run of the script as sequences of
>> characters, and the data on input and output as sequences of bytes. Then
>> you have to convert bytes into textstrings on input and textstrings into
>> bytes on output -- in both cases you can specify the conversion once and
>> for all for each file. This is the only working way when the restrictions
>> of the last item are not fulfilled.
>
> This is the only 'working way' when the assumption that perl uses a
> 'secret mystery encoding' different from any other encoding known to
> man is taken for granted.
The encoding isn't a 'secret mystery'. It is well documented that it
is Unicode.
perl -CS -MEncode -E 'say ord(Encode::decode("utf-8", "\xE2\x82\xAC"))'
is defined to print "8364".
It is a 'secret mystery' (wink, wink, nudge, nudge) how this is
represented internally, just like the representation of numbers is a
'secret mystery'.
However, for most programs you don't have to know that Perl character
strings are Unicode strings. It is sufficient to know that Perl has the
concept of a "character" which is different from the concept of a
"byte", that a character has certain properties (e.g. it can be a letter
or an ideograph, it may have an associated uppercase or lowercase
letter, ...) and to convert a sequence of characters into a sequence of
bytes you have to encode them. Whether the Euro sign has the numeric
code 8364 or 4711 is rarely significant.
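As a sketch of that character/byte distinction, using the Euro sign from the one-liner above: the variable names are illustrative, and ISO-8859-15 is picked only as an example of a second encoding that contains the character.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xE2\x82\xAC";            # three bytes: UTF-8 for the Euro sign
my $chars = decode('UTF-8', $bytes);   # one character

printf "bytes: %d, characters: %d, code point: %d\n",
    length($bytes), length($chars), ord($chars);

# Encoding for output turns the character back into bytes; which bytes
# depends on the target encoding, not on the character's numeric code.
printf "UTF-8:       %s\n", unpack('H*', encode('UTF-8', $chars));
printf "ISO-8859-15: %s\n", unpack('H*', encode('ISO-8859-15', $chars));
```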
> But this assumption is wrong and the concept
> makes precious little sense since it requires an additional copy of
> all input data and all output data
This is an unsubstantiated claim. It is possible that the current
implementation of I/O layers does indeed perform an additional copy (I
haven't checked the code), but this is certainly not required.
And even if it is true, it is almost certainly lost in the noise as soon
as your script does something more complex than "cat" with your input -
almost any string operation in perl performs a copy.
> (possibly, times the number of perl processes in a 'long' pipeline
> since not even perl is supposed to be able to talk to perl natively).
> Considering the way perl is implemented, this is a real problem for
> users of Windows (and Mac OS X, AFAIK) because in both cases, perl
> uses something other than the native encoding.
Why is this a real problem?
> That some people would like to inflict the same damage onto users of
> platforms where the problem doesn't exist is certainly very laudable
> but IMNSHO, best ignored.
Whatever "the problem" may be. The problem that characters and bytes
aren't the same and that most programmers prefer to think of text as a
sequence of characters, not a sequence of bytes exists on every
platform.
hp
--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
| | | hjp@hjp.at | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
------------------------------
Date: Mon, 29 Oct 2012 11:47:32 +0100
From: Helmut Richter <hhr-m@web.de>
Subject: Re: Why "Wide character in print"?
Message-Id: <alpine.LNX.2.00.1210291140310.5285@badwlrz-clhri01.ws.lrz.de>
On Mon, 29 Oct 2012, Peter J. Holzer wrote:
> However, for most programs you don't have to know that Perl character
> strings are Unicode strings.
Are they? They are strings of characters that are contained in Unicode. They
are not necessarily internally encoded as Unicode. People run into problems
when they make assumptions about the way they are implemented. I would have
worded:
For all programs you must not pretend to know that Perl character strings
are Unicode strings.
It may be true, it may be false -- either way, it is not part of the
documented interface. Hence, it must not be used even if it be true.
--
Helmut Richter
------------------------------
Date: Mon, 29 Oct 2012 13:40:07 +0000
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: Why "Wide character in print"?
Message-Id: <87fw4xxuq0.fsf@sapphire.mobileactivedefense.com>
Helmut Richter <hhr-m@web.de> writes:
> On Mon, 29 Oct 2012, Peter J. Holzer wrote:
>> However, for most programs you don't have to know that Perl character
>> strings are Unicode strings.
[...]
> For all programs you must not pretend to know that Perl character strings
> are Unicode strings.
>
> It may be true, it may be false -- either way, it is not part of the
> documented interface. Hence, it must not be used even if it be true.
At best, that's a part of the interface which was meanwhile
'undocumented' because the implementation choices which were made
weren't the implementation choices that should have been made,
according to the opinions of some people who didn't make the
decision. But independently of that, inventing the 'Perl is an
island!' character encoding - no matter how hypothetical - remains a
stupid idea. Perl is not an island and it has to interact with code
written in other programming languages, although maybe not in the
fantasy universe of people who implement 'wepp fremmwuergs' and
'ohpscheckt suesstemms' who are generally not troubled by the minor
consideration of making their stuff do something actually useful in
the real world. Consequently, Perl should be compatible with some
existing convention, ideally, with all existing 'local'
conventions. If this isn't possible, the next best choice is not 'make
everyone bleed'.
------------------------------
Date: Mon, 29 Oct 2012 14:52:06 +0200
From: Eric Pozharski <whynot@pozharski.name>
Subject: Re: Why "Wide character in print"?
Message-Id: <slrnk8suvm.788.whynot@orphan.zombinet>
with <slrnk8r42s.2s7.hjp-usenet2@hrunkner.hjp.at> Peter J. Holzer wrote:
> On 2012-10-28 11:45, Eric Pozharski <whynot@pozharski.name> wrote:
>> with <3q7ul9-l7s.ln1@anubis.morrow.me.uk> Ben Morrow wrote:
>>> In any case, the result is exactly what I said: the string contains
>>> one (logical) character. If you apply length() to that string it
>>> will return 1. (This character happens to be represented internally
>>> as two bytes; that is none of your business.) What do you think I
>>> omitted from the story?
>> Right. And that's closely related to your last example (the one
>> about utf8.pm being unsafe). I've tried to make a point that
>> *characters* from different *ranges* happen to be of different length
>> in bytes.
> Then maybe you shouldn't have chosen two examples which both are same
> length in bytes.
(Last night I reread loads of perlunicode and friends; I feel much
better now.) No, they are the same length *if* the encoding of the stream is set:
{7453:22} [0:0]% perl -CS -Mutf8 -wle 'print "à"' | xxd
0000000: c3a0 0a ...
{7459:23} [0:0]% perl -CS -Mutf8 -wle 'print "а"' | xxd
0000000: d0b0 0a ...
{7466:24} [0:0]%
But latin1 is special (I've reread perlunicode and friends), *if*
there's no reason (printing isn't reason) to upgrade to utf8 then
*characters* of latin1 script (and latin1 only) stay *bytes*:
{7466:24} [0:0]% perl -Mutf8 -wle 'print "à"' | xxd
0000000: e00a ..
{7795:25} [0:0]% perl -Mutf8 -wle 'print "а"' | xxd
Wide character in print at -e line 1.
0000000: d0b0 0a ...
But even if encoding of stream isn't set concatenation with non-latin1
script upgrades latin1 too:
{7800:26} [0:0]% perl -Mutf8 -wle 'print "[à][а]"' | xxd
Wide character in print at -e line 1.
0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].
Please rewind the thread. That's exactly what happened couple of posts
ago (specifically: <eli$1210251546@qz.little-neck.ny.us> and
<vi7pl9-ui71.ln1@anubis.morrow.me.uk>).
>>
>> {9829:45} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "aàа" ; Dump $aa'
>> SV = PV(0xa06f750) at 0xa08afac
>> REFCNT = 1
>> FLAGS = (POK,pPOK,UTF8)
>> PV = 0xa086a08 "a\303\240\320\260"\0 [UTF8 "a\x{e0}\x{430}"]
>> CUR = 5
>> LEN = 12
>>
>> *Characters* of latin1 aren't wide (even if they are characters, they
>> are still one byte long)
> In UTF-8, latin-1 characters >= 0x80 are 2 bytes, the same as cyrillic
> characters. Your example shows this: "à" (LATIN SMALL LETTER A WITH
> GRAVE) is "\303\240" and "а" (CYRILLIC SMALL LETTER A) is "\320\260".
No. Because it's not UTF-8, it's utf8. As long as utf8 semantics isn't
set, anything scalar stays plain bytes:
{2786:10} [0:0]% perl -MDevel::Peek -wle 'Dump "à"'
SV = PV(0x9d0e878) at 0x9d29f28
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK)
PV = 0x9d2ddc8 "\303\240"\0
CUR = 2
LEN = 12
However, when utf8 semantics is set, then those codepoints that fit
latin1 script become special Perl-latin1:
{5930:11} [0:0]% perl -MDevel::Peek -Mutf8 -wle 'Dump "à"'
SV = PV(0x9b92880) at 0x9badf10
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK,UTF8)
PV = 0x9bb1eb0 "\303\240"\0 [UTF8 "\x{e0}"]
CUR = 2
LEN = 12
Upgrade to UTF-8 encoding or staying with latin1 encoding depends on
concatenation with codepoints already upgraded to UTF-8 and/or the encoding of the
output stream.
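The concatenation case can be checked without Devel::Peek. A minimal sketch, with illustrative variable names: utf8::is_utf8() reports only the internal flag, while length() counts string elements and does not depend on how they are stored.

```perl
use strict;
use warnings;

my $latin = "\xE0";          # element 0xE0, held as a single byte
my $cyr   = "\x{430}";       # element > 255, held internally as UTF-8
my $both  = $latin . $cyr;   # concatenation upgrades the latin1 part too

for my $s ($latin, $cyr, $both) {
    printf "flag=%d length=%d\n", (utf8::is_utf8($s) ? 1 : 0), length $s;
}
```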
*SKIP*
>> {10477:68} [0:0]% perl -Mutf8 -wle 'print "à"' # oops
> Now you have one character (because of -Mutf8, the two bytes \303\240
> are decoded to the character U+00e0), but you are trying to write it
> to a byte stream without specifying the encoding. Perl writes the
> single byte 0xE0, which your UTF-8 terminal cannot interpret. (Mine
> displays a question mark in a dark circle)
{42:1} [0:0]% perl -Mutf8 -wle 'print "à"'
à
{1903:2} [0:0]% perl -Mutf8 -wle 'print "à"'
{1933:3} [0:0]% perl -Mutf8 -wle 'print "à"' | xxd
0000000: e00a
Instead it does. Once. It wasn't typing, it was a search through
history. Now I'm bothered. Does anyone here know how to list
extensions enabled in running instance of urxvt?
*SKIP*
> For one-liners like this, using the same encoding for the script and
> the I/O is useful ("-CS -Mutf8" is even shorter than
> "-Mencoding=utf8", but maybe you don't have a UTF-8 capable terminal).
{14999:29} [0:0]% perl -mencoding -wle 'print "[à][а]"' | xxd
0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].
{15017:30} [0:0]% perl -CS -Mutf8 -wle 'print "[à][а]"' | xxd
0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].
Golf?
> However, for real programs, I think tying the encoding of the source
> code to the encoding of I/O-streams the script is supposed to handle
> is foolish. My scripts are always encoded in UTF-8, but they
> frequently have to handle files in CP-1252.
Mine are us-ascii, I have open.pm for rest.
--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V11 Issue 3805
***************************************