[23924] in Perl-Users-Digest
Perl-Users Digest, Issue: 6125 Volume: 10
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Feb 12 18:10:33 2004
Date: Thu, 12 Feb 2004 15:10:09 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Thu, 12 Feb 2004 Volume: 10 Number: 6125
Today's topics:
Re: Replacing unicode characters <eriks@operamail.com>
Re: Replacing unicode characters <usenet@morrow.me.uk>
Re: Replacing unicode characters <twhu@lucent.com>
Re: Replacing unicode characters <eriks@operamail.com>
Re: Replacing unicode characters <flavell@ph.gla.ac.uk>
Re: Replacing unicode characters <usenet@morrow.me.uk>
Re: Test message, please ignore. (Sorry post) <zak@SDF.LONESTAR.ORG>
Re: Test message, please ignore. (Sorry post) <dwall@fastmail.fm>
Re: Test message, please ignore. <dwall@fastmail.fm>
Re: tying hashes (Anno Siegel)
Re: tying hashes <yshtil@cisco.com>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Thu, 12 Feb 2004 21:10:04 +0100
From: Erik Sandblom <eriks@operamail.com>
Subject: Re: Replacing unicode characters
Message-Id: <BC519C2C.14A77%eriks@operamail.com>
In article c0gghj$f34$5@wisteria.csv.warwick.ac.uk, Ben Morrow
(usenet@morrow.me.uk) wrote on 04-02-12 19.26:
> Erik Sandblom <eriks@operamail.com> wrote:
>> I'm trying to replace double quotation marks in a UTF-8 document:
>>
>> $string =~ s#\x{201D}#”#g;
>
> How have you read in $string? If the file is UTF8, you need to tell
> Perl so, or it will assume Latin1:
Really? The system I'm on is RedHat 8, and I understand they have some
default variable somewhere saying that "everything is UTF-8 unless otherwise
specified". And Perl then follows that. But I'm not sure.
> open my $FH, '<:encoding(utf8)', $filename or die...;
Thank you, this solved my problem. I also had to unset "use bytes;" by
putting in "no bytes;". Apparently "use bytes" makes Perl treat characters
as being two-digit rather than four-digit. Wrong terminology I'm sure, but
it may help someone else in my position. I had previously set "use bytes" to
be able to use accented characters in good-old latin-1.
Thanks again.
--
Erik Sandblom
my site is EriksRailNews.com
for those who don't believe, no explanation is possible
for those who do, no explanation is necessary
------------------------------
Date: Thu, 12 Feb 2004 20:45:02 +0000 (UTC)
From: Ben Morrow <usenet@morrow.me.uk>
Subject: Re: Replacing unicode characters
Message-Id: <c0goke$l8b$2@wisteria.csv.warwick.ac.uk>
Erik Sandblom <eriks@operamail.com> wrote:
> In article c0gghj$f34$5@wisteria.csv.warwick.ac.uk, Ben Morrow
> (usenet@morrow.me.uk) wrote on 04-02-12 19.26:
> > Erik Sandblom <eriks@operamail.com> wrote:
> >> I'm trying to replace double quotation marks in a UTF-8 document:
> >>
> >> $string =~ s#\x{201D}#”#g;
> >
> > How have you read in $string? If the file is UTF8, you need to tell
> > Perl so, or it will assume Latin1:
>
> Really? The system I'm on is RedHat 8, and I understand they have some
> default variable somewhere saying that "everything is UTF-8 unless otherwise
> specified". And Perl then follows that. But I'm not sure.
If you are using 5.8.0 and have your LC_ALL environment var set to
something with UTF8 in, perl will push a :utf8 onto all filehandles by
default. This behaviour was disabled in 5.8.1, as it caused *lots* of
compatibility problems.
> > open my $FH, '<:encoding(utf8)', $filename or die...;
>
> Thank you, this solved my problem. I also had to unset "use bytes;"
> by putting in "no bytes;". Apparently "use bytes" makes Perl treat
> characters as being two-digit rather than four-digit. Wrong
> terminology I'm sure,
Yes... are the digits you are referring to hex digits, so you actually
mean 8-bit (e.g. \x12) rather than 16-bit (e.g. \x{1234})? In this case,
you are under a misapprehension: the more recent versions of Unicode
are in fact 21-bit character encodings, not 16-bit: that is, \x{12345}
is a valid Unicode character number (currently not assigned a
character).
> but it may help someone else in my position. I had previously set
> "use bytes" to be able to use accented characters in good-old
> latin-1.
You shouldn't need to do this: if you're mixing character sets, I'd
strongly recommend you convert everything to Perl's internal Unicode
using Encode. If you want latin1 literals in your Perl source, put
use encoding 'latin1';
at the top; and don't try to mix encodings (ie. have both latin1 and
utf8 literals) in one source file.
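
The advice above (convert everything to Perl's internal Unicode using
Encode) might look roughly like the following. This is only a minimal
sketch: the sample bytes and the choice of a plain ASCII quote as the
replacement are assumptions made for illustration, not something taken
from the original script.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: UTF-8 bytes containing two U+201D characters.
my $bytes = "He said \xE2\x80\x9Dhi\xE2\x80\x9D\n";

# Decode the raw bytes into a Perl character string.
my $string = decode('UTF-8', $bytes);

# After decoding, \x{201D} matches a single character.
# The ASCII double quote as replacement is an illustrative choice.
(my $fixed = $string) =~ s/\x{201D}/"/g;

# Encode explicitly on the way out, here to Latin-1.
my $latin1 = encode('iso-8859-1', $fixed);
print $latin1;
```

Without the decode step the string would hold three separate bytes per
quotation mark, and the \x{201D} pattern would have nothing to match.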
Ben
--
don't get my sympathy hanging out the 15th floor. you've changed the locks 3
times, he still comes reeling though the door, and soon he'll get to you, teach
you how to get to purest hell. you do it to yourself and that's what really
hurts is you do it to yourself just you, you and noone else * ben@morrow.me.uk
------------------------------
Date: Thu, 12 Feb 2004 16:14:02 -0500
From: "Tulan W. Hu" <twhu@lucent.com>
Subject: Re: Replacing unicode characters
Message-Id: <c0gqar$h6c@netnews.proxy.lucent.com>
"Ben Morrow" <usenet@morrow.me.uk> wrote in
[snip]
> If you are using 5.8.0 and have your LC_ALL environment var set to
> something with UTF8 in, perl will push a :utf8 onto all filehandles by
> default. This behaviour was disabled in 5.8.1, as it caused *lots* of
> compatibility problems.
>
> > > open my $FH, '<:encoding(utf8)', $filename or die...;
Ben,
How about perl 5.8.2?
I have a UTF-8 file, but I just use a regular open to read it,
and when I print the strings out after reading them they seem OK.
I use Unicode::String to convert the lines to latin1.
The following code seems to work:
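    use File::Slurp;
    use Unicode::String qw(utf8 latin1);

    # In list context read_file returns the file's lines.
    my @l2 = read_file('filename');
    foreach my $nline (@l2) {
        my $l = utf8($nline);   # wrap the UTF-8 bytes in a Unicode::String
        print "$l";             # print the object as a string
        print $l->latin1;       # print the same line converted to Latin-1
    }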
Do you see any problem with this?
------------------------------
Date: Thu, 12 Feb 2004 22:33:48 +0100
From: Erik Sandblom <eriks@operamail.com>
Subject: Re: Replacing unicode characters
Message-Id: <BC51AFCC.14A86%eriks@operamail.com>
In article c0goke$l8b$2@wisteria.csv.warwick.ac.uk, Ben Morrow
(usenet@morrow.me.uk) wrote on 04-02-12 21.45:
> Erik Sandblom <eriks@operamail.com> wrote:
>> In article c0gghj$f34$5@wisteria.csv.warwick.ac.uk, Ben Morrow
>> (usenet@morrow.me.uk) wrote on 04-02-12 19.26:
>>> open my $FH, '<:encoding(utf8)', $filename or die...;
>>
>> Thank you, this solved my problem. I also had to unset "use bytes;"
>> by putting in "no bytes;". Apparently "use bytes" makes Perl treat
>> characters as being two-digit rather than four-digit. Wrong
>> terminology I'm sure,
>
> Yes... are the digits you are referring to hex digits, so you actually
> mean 8-bit (e.g. \x12) rather than 16-bit (e.g. \x{1234})?
Yes, that's right. I'm finally getting the hang of hexadecimal and I've
deduced that 16-bit comes from 2 to the power of four being 16. But what
does that really mean, considering each "digit", as I still call them, can
have 16 different values, and not 2? That would be 16 to the power of four,
which is a large number, about 66 000 unless I'm mistaken.
> In this case,
> you are under a misapprehension: the more recent versions of Unicode
> are in fact 21-bit character encodings, not 16-bit: that is, \x{12345}
> is a valid Unicode character number (currently not assigned a
> character).
Oh my goodness, that's a lot of characters. Why doesn't everyone just learn
English? ;-)
>> but it may help someone else in my position. I had previously set
>> "use bytes" to be able to use accented characters in good-old
>> latin-1.
>
> You shouldn't need to do this: if you're mixing character sets, I'd
> strongly recommend you convert everything to Perl's internal Unicode
> using Encode. If you want latin1 literals in your Perl source, put
> use encoding 'latin1';
> at the top; and don't try to mix encodings (ie. have both latin1 and
> utf8 literals) in one source file.
Well, what I've done is used latin-1 literals and saved the file in latin-1
encoding. Then I have used utf8 codes like \x{201D} to represent utf8
characters. I've written "use bytes" at the top of my perl script. Forgive
my ignorance but how would it behave differently with "use encoding latin1"
at the top?
Thanks for all your help,
Erik Sandblom
--
my site is EriksRailNews.com
for those who don't believe, no explanation is possible
for those who do, no explanation is necessary
------------------------------
Date: Thu, 12 Feb 2004 21:31:47 +0000
From: "Alan J. Flavell" <flavell@ph.gla.ac.uk>
Subject: Re: Replacing unicode characters
Message-Id: <Pine.LNX.4.53.0402122111070.12426@ppepc56.ph.gla.ac.uk>
On Thu, 12 Feb 2004, Erik Sandblom wrote:
> Really? The system I'm on is RedHat 8, and I understand they have some
> default variable somewhere saying that "everything is UTF-8 unless otherwise
> specified".
Sort-of. The default locale has utf-8 in it, is the key.
> And Perl then follows that.
Specifically, 5.8.0 follows that. But it confused too many people, as
Ben Morrow has already mentioned.
> Thank you, this solved my problem. I also had to unset "use bytes;" by
> putting in "no bytes;".
Who put "use bytes" in there in the first place? IMHO it's offered as
a quick fix for those who had made a tacit assumption in their coding
(roughly speaking, that character data could be handled identically to
binary data, without giving any thought to the difference. The old
unix-hardened Perl hackers used to be very bad about that, but, with
Perl's increasing claim to be platform-portable, that stance no longer
held water, if I could mix a metaphor).
> Apparently "use bytes" makes Perl treat characters
> as being two-digit rather than four-digit.
There's something in what you say, though I don't think I'd quite have
put it like that...
> Wrong terminology I'm sure,
I can only confirm your assumption! (SCNR ;)
> I had previously set "use bytes" to
> be able to use accented characters in good-old latin-1.
That's the kind of situation where I'd gripe about it being the wrong
solution, even if - in the limited circumstances you needed it - it
gave the impression of doing the right thing.
If you're going to be processing text (as opposed to binary data),
then I think in the long term it will pay off to be honest with Perl
(>= 5.8) about that, and tell it frankly what coding is involved.
By the way, don't confuse the processing of character data on
input/output streams with how Perl deals with characters that are
specified within the source code. They're two different topics, and
need to be grasped accordingly.
The unicode introduction and spec in the Perl documentation is pretty
good, although it's rather silent about a few areas where the
implementation falls short of what the documentation might lead one to
expect (previous discussions here will show some detail about that).
But for the most part I've found it does what it says it does: the key
part is to approach the documentation with a fairly open mind, rather
than assuming that it's sure to be more or less what one had expected.
OK, fair enough, I don't know what it was that *you* expected, but
I've met several people who thought it was obvious and didn't bother
to RTFM, and then were astonished that they could make no sense of
what seemed to be happening.
good luck
--
This is done without sending any information to Microsoft.
------------------------------
Date: Thu, 12 Feb 2004 23:02:57 +0000 (UTC)
From: Ben Morrow <usenet@morrow.me.uk>
Subject: Re: Replacing unicode characters
Message-Id: <c0h0n1$rda$1@wisteria.csv.warwick.ac.uk>
Erik Sandblom <eriks@operamail.com> wrote:
> In article c0goke$l8b$2@wisteria.csv.warwick.ac.uk, Ben Morrow
> (usenet@morrow.me.uk) wrote on 04-02-12 21.45:
> > Erik Sandblom <eriks@operamail.com> wrote:
> >>
> >> Thank you, this solved my problem. I also had to unset "use bytes;"
> >> by putting in "no bytes;". Apparently "use bytes" makes Perl treat
> >> characters as being two-digit rather than four-digit. Wrong
> >> terminology I'm sure,
> >
> > Yes... are the digits you are referring to hex digits, so you actually
> > mean 8-bit (e.g. \x12) rather than 16-bit (e.g. \x{1234})?
>
> Yes, that's right. I'm finally getting the hang of hexadecimal and I've
> deduced that 16-bit comes from 2 to the power of four being 16. But what
> does that really mean, considering each "digit", as I still call them, can
> have 16 different values, and not 2? That would be 16 to the power of four,
> which is a large number, about 66 000 unless I'm mistaken.
:) Not quite. 'Bits' refer to the binary representation (base 2, as
hex is base 16) of a number. A 2-digit hex number, say 0x82, can also
be written as an 8-digit binary number (an 8-bit number: 'bit' is
short for 'binary digit'): 0b1000_0010. The 0x here indicates hex, and
the 0b binary; the _s are just put in to make the number easier to
read.
Hexadecimal has 16 different digits, binary but 2; and as you say, 2^4
= 16, so each hex digit represents 4 binary digits. Thus a four-digit
hex number is a 4*4 = 16-bit binary number: as you say, there are
65536 of them.
You can get Perl to print out the decimal, hex and binary
representations of a number using sprintf with the %d, %x and %b
formats.
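
As a small illustration of the sprintf formats just mentioned, the
following sketch prints the 0x82 example from above in all three bases.
The variable names are only illustrative.

```perl
use strict;
use warnings;

my $n = 0x82;    # the two-digit hex number used in the text

# %d = decimal, %x = hexadecimal, %b = binary; %08b pads to 8 bits.
my $line = sprintf "%d 0x%x 0b%08b", $n, $n, $n;
print "$line\n";                 # 130 0x82 0b10000010

# Four hex digits cover 16 bits, so 16**4 and 2**16 are the same count.
my $count = 16**4;
print "$count\n";                # 65536
```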
> > In this case,
> > you are under a misapprehension: the more recent versions of Unicode
> > are in fact 21-bit character encodings, not 16-bit: that is, \x{12345}
> > is a valid Unicode character number (currently not assigned a
> > character).
>
> Oh my goodness, that's a lot of characters. Why doesn't everyone just learn
> English? ;-)
It is indeed a lot... most of them are unused at present, but they had
just too many with all the Chinese-Japanese-Korean ideograms and all
the Arabic ligatures to fit into 16 bits.
> >> but it may help someone else in my position. I had previously set
> >> "use bytes" to be able to use accented characters in good-old
> >> latin-1.
> >
> > You shouldn't need to do this: if you're mixing character sets, I'd
> > strongly recommend you convert everything to Perl's internal Unicode
> > using Encode. If you want latin1 literals in your Perl source, put
> > use encoding 'latin1';
> > at the top; and don't try to mix encodings (ie. have both latin1 and
> > utf8 literals) in one source file.
>
> Well, what I've done is used latin-1 literals and saved the file in latin-1
> encoding. Then I have used utf8 codes like \x{201D} to represent utf8
> characters. I've written "use bytes" at the top of my perl script. Forgive
> my ignorance but how would it behave differently with "use encoding latin1"
> at the top?
'use bytes' disables Perl's Unicode support, and makes it treat all
strings as sequences of 8-bit bytes. When 'use bytes' is not in
effect, strings can be thought of as sequences of 21-bit numbers (in
fact, the representation is more compact than that, which occasionally
'leaks through' when things go wrong).
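
One way to watch the difference described above is to ask for the
length of the same string with and without 'use bytes' in effect. This
is a minimal sketch with illustrative variable names; the byte count it
reports reflects Perl's internal UTF-8 storage of the string.

```perl
use strict;
use warnings;

# A character string holding one code point above 0xFF.
my $quote = "\x{201D}";

my $chars = length $quote;       # counted in characters

my $octets;
{
    use bytes;                   # lexically scoped to this block
    $octets = length $quote;     # counted in bytes of the internal form
}

printf "%d character, %d bytes\n", $chars, $octets;   # 1 character, 3 bytes
```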
Under 'use bytes', you are declaring that your data is 'binary' as
opposed to 'textual'. The fact that if you treat it as textual Perl
will pretend it's Latin1 is for backwards compatibility only: I would
say that 'strictly' speaking Perl ought to give an error if you try
and use characters outside of ASCII (but then, Perl didn't get where
it is today by being strict about things :). In fact, under 'use
bytes', even if you state some data is textual by pushing an :encoding
layer onto the filehandle, Perl will still treat the data as 8-bit
bytes; which is one of the ways the underlying representation can
'leak through' as I mentioned above.
'use encoding 'latin1'' *just* declares that your source file is in
Latin1. It doesn't affect how Perl views your data at all: data that
comes from a filehandle marked with :raw will be considered to be
'binary', ie. a sequence of 8-bit bytes; and data which comes from a
filehandle marked with :encoding will be considered to be 'textual',
i.e. a sequence of 21-bit Unicode codepoints.
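
The :raw versus :encoding distinction can be sketched by reading the
same bytes through both kinds of filehandle. The example below uses
in-memory filehandles and made-up data so that it is self-contained;
with a real file, the filename would simply replace the scalar
reference.

```perl
use strict;
use warnings;

# Hypothetical data: the three UTF-8 bytes of U+201D plus a newline.
my $data = "\xE2\x80\x9D\n";

# Read the bytes as binary data.
open my $raw, '<:raw', \$data or die "cannot open: $!";
my $as_bytes = <$raw>;
close $raw;

# Read the same bytes as text, decoding them on the way in.
open my $txt, '<:encoding(UTF-8)', \$data or die "cannot open: $!";
my $as_text = <$txt>;
close $txt;

printf "raw: %d, decoded: %d\n", length($as_bytes), length($as_text);
```

The raw read yields four 8-bit values, while the decoded read yields two
characters: U+201D and the newline.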
This is all a little confusing: you may need to think about it a bit
before it sinks in. I know I did... :)
References: perldoc perluniintro, perldoc perlunicode, unicode.org,
perldoc PerlIO, perldoc PerlIO::encoding.
Ben
--
'Deserve [death]? I daresay he did. Many live that deserve death. And some die
that deserve life. Can you give it to them? Then do not be too eager to deal
out death in judgement. For even the very wise cannot see all ends.'
:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-:-: ben@morrow.me.uk
------------------------------
Date: Thu, 12 Feb 2004 20:37:30 +0000
From: Zachary Zebrowski <zak@SDF.LONESTAR.ORG>
Subject: Re: Test message, please ignore. (Sorry post)
Message-Id: <Pine.NEB.4.58.0402122024240.22486@sdf.lonestar.org>
All,
I'm genuinely sorry for the noise.
I did not know of the test.* news groups because I seldom use newsgroups
and I was doing an experiment with a + address.
Again, I'm genuinely sorry.
Zak
------------------------------
Date: Thu, 12 Feb 2004 22:44:04 -0000
From: "David K. Wall" <dwall@fastmail.fm>
Subject: Re: Test message, please ignore. (Sorry post)
Message-Id: <Xns948DB46794FF9dkwwashere@216.168.3.30>
Zachary Zebrowski <zak@SDF.LONESTAR.ORG> wrote:
> I'm genuinely sorry for the noise.
>
> I did not know of the test.* news groups because I seldom use newsgroups
> and I was doing an experiment with a + address.
>
> Again, I'm genuinely sorry.
It's not really a big deal for isolated occurrences. One reason I said
something is because *other* people will see this exchange and learn to use
the test groups for testing.
I hope I didn't offend you. I'm irritated about several things at the
moment, here at work and in one thread on c.l.p.m (I think the offense on
clpm was accidental, but I'm still irritated) and I may have taken it out on
you. Sorry about that.
--
David
------------------------------
Date: Thu, 12 Feb 2004 19:52:28 -0000
From: "David K. Wall" <dwall@fastmail.fm>
Subject: Re: Test message, please ignore.
Message-Id: <Xns948D974F54E9Edkwwashere@216.168.3.30>
[posted and mailed]
zak zebrowski <zak+test@freeshell.org> wrote:
> This is a simple test message, please ignore.
This isn't a test newsgroup. Please use misc.test, alt.test, or one of the
numerous *.test newsgroups for testing. (But check first to make sure it
really IS a test newsgroup)
--
David Wall
------------------------------
Date: 12 Feb 2004 22:36:42 GMT
From: anno4000@lublin.zrz.tu-berlin.de (Anno Siegel)
Subject: Re: tying hashes
Message-Id: <c0gv5q$jh8$1@mamenchi.zrz.TU-Berlin.DE>
Yuri Shtil <yshtil@cisco.com> wrote in comp.lang.perl.misc:
> Hi all
>
> I am considering using tie to control access to the properties of our
> objects. The idea is to intercept store/fetch calls and check the
> validity of the keys/values.
>
> My questions are:
>
> - has anyone used tie for this purpose
You mean, checking the values that go into an object? Yes, that's one of
the standard applications of tie(), sometimes in conjunction with lvalue
accessors or an overloaded "%{}".
> - are there any performance or other drawbacks
Tied variables are slow. Also, to tie a hash you need to define quite
a number of methods, though modules like Tie::Hash help with that.
> - are there better alternatives to tie
Plain standard accessors do the same job more directly.
I see two primary reasons for tie: You may want the fancy interface, or you
may want to send other parts of the program a hash that is more than meets
the eye.
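
For readers who have not used tie for this, a validating hash might be
sketched as follows. The class name CheckedHash and the list of allowed
keys are hypothetical, and the sketch inherits from Tie::StdHash (which
ships with Tie::Hash) so that only STORE needs to be written.

```perl
use strict;
use warnings;

package CheckedHash;             # hypothetical class name
use Tie::Hash;                   # provides Tie::StdHash
our @ISA = ('Tie::StdHash');

my %allowed = map { $_ => 1 } qw(name age);   # hypothetical key list

# Intercept every store and reject unknown keys or undefined values.
sub STORE {
    my ($self, $key, $value) = @_;
    die "invalid key '$key'\n"         unless $allowed{$key};
    die "undefined value for '$key'\n" unless defined $value;
    $self->SUPER::STORE($key, $value);
}

package main;

tie my %props, 'CheckedHash';
$props{name} = 'Yuri';           # passes the check
my $ok  = $props{name};
my $err = eval { $props{colour} = 'red'; 1 } ? '' : $@;
print "stored: $ok\nrejected: $err";
```

Every read and write of %props now goes through a method call, which is
where the slowness mentioned above comes from.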
Anno
------------------------------
Date: Thu, 12 Feb 2004 14:43:51 -0800
From: Yuri Shtil <yshtil@cisco.com>
Subject: Re: tying hashes
Message-Id: <402C01A7.10107@cisco.com>
Anno Siegel wrote:
> Yuri Shtil <yshtil@cisco.com> wrote in comp.lang.perl.misc:
>
>>Hi all
>>
>>I am considering using tie to control access to the properties of our
>>objects. The idea is to intercept store/fetch calls and check the
>>validity of the keys/values.
>>
>>My questions are:
>>
>> - has anyone used tie for this purpose
>
>
> You mean, checking the values that go into an object? Yes, that's one of
> the standard applications of tie(), sometimes in conjunction with lvalue
> accessors or an overloaded "%{}".
>
>
>> - are there any performance or other drawbacks
>
>
> Tied variables are slow. Also, to tie a hash you need to define quite
> a number of methods, though modules like Tie::Hash help with that.
>
>
>> - are there better alternatives to tie
>
>
> Plain standard accessors do the same job more directly.
>
> I see two primary reasons for tie: You may want the fancy interface, or you
> may want to send other parts of the program a hash that is more than meets
> the eye.
>
> Anno
Could you elaborate (or point to an elaboration) on "lvalue accessors",
"Plain standard accessors" and overloaded "%{}"?
Yuri.
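
As one possible reading of "plain standard accessors", the sketch below
shows an ordinary method that acts as both getter and setter and
validates the value before storing it. The Person class, the age
property and the validation rule are all hypothetical.

```perl
use strict;
use warnings;

package Person;                  # hypothetical class

sub new { return bless { age => 0 }, shift }

# Combined getter/setter: validates before storing.
sub age {
    my $self = shift;
    if (@_) {
        my $value = shift;
        die "age must be a non-negative integer\n"
            unless defined $value && $value =~ /\A\d+\z/;
        $self->{age} = $value;
    }
    return $self->{age};
}

package main;

my $p = Person->new;
$p->age(42);                     # accepted
my $age = $p->age;
my $err = eval { $p->age('old'); 1 } ? '' : $@;
print "age: $age\nerror: $err";
```

The check lives in a normal method, so callers write $p->age(42) rather
than assigning to a hash element.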
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc. For subscription or unsubscription requests, send
#the single line:
#
# subscribe perl-users
#or:
# unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.
NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V10 Issue 6125
***************************************