[32795] in Perl-Users-Digest
Perl-Users Digest, Issue: 4059 Volume: 11
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sun Oct 20 14:09:42 2013
Date: Sun, 20 Oct 2013 11:09:03 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Sun, 20 Oct 2013 Volume: 11 Number: 4059
Today's topics:
Unicode help please <dave@invalid.invalid>
Re: Unicode help please <bjoern@hoehrmann.de>
Re: Unicode help please <ben@morrow.me.uk>
Re: Unicode help please <dave@invalid.invalid>
Re: Unicode help please <bjoern@hoehrmann.de>
Re: Unicode help please <dave@invalid.invalid>
Re: Unicode help please <ben.usenet@bsb.me.uk>
What's the effect of a null input record separator? here@softcom.net
Re: What's the effect of a null input record separator? <bjoern@hoehrmann.de>
Re: What's the effect of a null input record separator? sybilfriedman@gmail.com
Re: What's the effect of a null input record separator? here@softcom.net
Re: What's the effect of a null input record separator? <jwkrahn@example.com>
Re: What's the effect of a null input record separator? <ben@morrow.me.uk>
Re: What's the effect of a null input record separator? (Tim McDaniel)
Re: What's the effect of a null input record separator? <ben@morrow.me.uk>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Sat, 19 Oct 2013 11:32:00 +0000 (UTC)
From: "Dave Saville" <dave@invalid.invalid>
Subject: Unicode help please
Message-Id: <fV45K0OBJxbE-pn2-BdMN6dcSnc38@paddington.bear.den>
I have a perl script that does sanity checking on a mail store. The
"house keeping" files of the mail store used to be lines with the
fields separated by some hex character. This proved difficult if not
impossible to expand so the lead developer decided to change to XML
files in UTF8. Now one of the checks is between the old and new
versions of a file, because if the XML files are not found then they
are generated from the old ones - and we have had some "interesting"
problems :-)
One of the files holds file system folder names and one of the checks
is that the name is the same in both files - we had a bug where they
weren't. It has been working fine until a German user came along with
an Umlaut in the folder name. :-)
The base code page of the system is 850. So the true file system name
has the cp850 umlaut as does the old housekeeping file but of course
the XML version has a double byte version of the character.
My problem is getting them to compare equal. Just playing with a test
script and I just can't figure it out.
use strict;
use warnings;
use Unicode::Normalize;
open my $INI, '<', 'folder.ini' or die $!;
my ( $ini_folder_name, $id, $ini_is_archived ) =
    ( split /\xDE/, <$INI> )[ 0, 1, 10 ];
print $ini_folder_name, "\n";
open my $XML, '<:raw:utf8', 'folderpr.xml' or die $!;
#open my $XML, '<', 'folderpr.xml' or die $!;
local $/ = undef;
my $xml = <$XML>;
my $XML_folder_name;
if ( $xml =~ m{>([^<]+)</profile>}s )
{
    $XML_folder_name = $1;
}
print $XML_folder_name, "\n"
    if NFD($ini_folder_name) eq NFD($XML_folder_name);
If I don't open the XML file with :utf8 they obviously don't test
equal, but neither do they when it is opened with :utf8.
In the latter case a hex dump of the output shows that all :utf8
seems to have done is drop the first of the two bytes making up
the Unicode character. The XML file has the correct UTF-8 code for
the cp850 umlaut.
TIA
--
Regards
Dave Saville
------------------------------
Date: Sat, 19 Oct 2013 14:03:01 +0200
From: Bjoern Hoehrmann <bjoern@hoehrmann.de>
Subject: Re: Unicode help please
Message-Id: <5is469hpcct5t2avapnjjdv2kj9dkr2s0c@hive.bjoern.hoehrmann.de>
* Dave Saville wrote in comp.lang.perl.misc:
>The base code page of the system is 850. So the true file system name
>has the cp850 umlaut as does the old housekeeping file but of course
>the XML version has a double byte version of the character.
>
>My problem is getting them to compare equal. Just playing with a test
>script and I just can't figure it out.
Based on the description above, you have to decode both using the Encode
module, Encode::decode('cp850', ...) and Encode::decode('utf-8', ...),
and then simply use `eq` on the results. Do keep in mind that you might
well be dealing with Windows-1252 instead: on a semi-modern Windows
system, only the console might be using CP850. Using `Unicode::Normalize`
is incorrect for this purpose; you would use that when, for example, the
OS or file system modifies file names, but that's Apple's ballpark.
Unicode normalisation helps if you want to compare U+00F6 ("ö") with the
two-character sequence U+006F ("o") followed by U+0308 (combining
diaeresis), where NFC(...) generates the short form and NFD(...) the
long form.
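As a rough sketch of the two separate issues described above (decoding versus normalisation), assuming hypothetical sample bytes for a name containing "ö" rather than the poster's real data:

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFC);

# Hypothetical sample bytes (not taken from the real files).
my $cp850_bytes = "K\x94ln";        # 0x94 is o-umlaut in cp850
my $utf8_bytes  = "K\xC3\xB6ln";    # 0xC3 0xB6 is o-umlaut in UTF-8

# Decode each byte string with its own encoding, then compare characters.
my $from_ini = decode('cp850', $cp850_bytes);
my $from_xml = decode('UTF-8', $utf8_bytes);
my $same_after_decode = $from_ini eq $from_xml ? 1 : 0;

# Normalisation is a different problem: precomposed vs. decomposed forms.
my $precomposed = "\x{00F6}";       # U+00F6
my $decomposed  = "o\x{0308}";      # U+006F followed by U+0308
my $same_raw = $precomposed eq $decomposed ? 1 : 0;
my $same_nfc = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;

print "decoded strings equal: $same_after_decode\n";
print "raw forms equal: $same_raw, NFC forms equal: $same_nfc\n";
```

Only the second half needs Unicode::Normalize; the first half is plain decoding followed by `eq`.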
>In the latter case a hex dump of the output shows that all the utf8
>seems to have done is drop the first of the two characters making up
>the unicode. The XML file has the correct UTF8 code for the cp850
>umlaut.
If the above does not help, you should tell us the hex codes and actual
characters involved.
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
------------------------------
Date: Sat, 19 Oct 2013 14:13:37 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Unicode help please
Message-Id: <15daja-1bk1.ln1@anubis.morrow.me.uk>
Quoth "Dave Saville" <dave@invalid.invalid>:
> I have a perl script that does sanity checking on a mail store. The
> "house keeping" files of the mail store used to be lines with the
> fields separated by some hex character. This proved difficult if not
> impossible to expand so the lead developer decided to change to XML
> files in UTF8. Now one of the checks is between the old and new
> versions of a file as if XML files are not found then they are
> generated from the old ones - and we have had some "interesting"
> problems :-)
>
> One of the files holds file system folder names and one of the checks
> is that the name is the same in both files - we had a bug where they
> weren't. It has been working fine until a German user came along with
> an Umlaut in the folder name. :-)
>
> The base code page of the system is 850. So the true file system name
> has the cp850 umlaut as does the old housekeeping file but of course
> the XML version has a double byte version of the character.
>
> My problem is getting them to compare equal. Just playing with a test
> script and I just can't figure it out.
>
> use strict;
> use warnings;
> use Unicode::Normalize;
> open my $INI, '<', 'folder.ini' or die $!;
If this file really is in cp850 you need to tell perl that:
open my $INI, "<:encoding(cp850)", "folder.ini" or die $!;
However, if, as Bjoern suggests, it's actually in cp1252, then this
isn't the cause of the problem, since all the umlauted characters are in
the same places in cp1252 and ISO8859-1, and perl assumes ISO8859-1 if
you don't tell it otherwise.
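A small sketch of the difference the layer makes, assuming a temporary file as a hypothetical stand-in for folder.ini and a made-up field value:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for folder.ini: one line containing byte 0x81,
# which is u-umlaut in cp850.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "M\x81ll\n";
close $tmp_fh or die $!;

# With an explicit encoding layer the byte is decoded to U+00FC.
open my $decoded_fh, '<:encoding(cp850)', $tmp_name or die $!;
chomp(my $decoded = <$decoded_fh>);
close $decoded_fh;

# With no layer the byte is simply taken as code point 0x81.
open my $plain_fh, '<', $tmp_name or die $!;
chomp(my $plain = <$plain_fh>);
close $plain_fh;

my $decoded_cp = sprintf 'U+%04X', ord substr($decoded, 1, 1);
my $plain_cp   = sprintf 'U+%04X', ord substr($plain, 1, 1);
print "with layer: $decoded_cp, without: $plain_cp\n";
```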
What character are you dealing with, and what byte is actually used to
represent it in the file?
> my ( $ini_folder_name, $id, $ini_is_archived ) = ( split /\xDE/,
> <$INI> )[ 0, 1, 10 ];
> print $ini_folder_name, "\n";
> open my $XML, '<:raw:utf8', 'folderpr.xml' or die $!;
You should never use :utf8 for input. It does no validity checking, and
the rest of perl tends to assume Unicode strings will be valid, which
can lead to segfaults if they're not. Always use :encoding(utf8)
instead. (:utf8 is generally safe for output.)
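Outside of a PerlIO layer, a comparable validity check can be sketched with Encode directly. The helper name strict_decode and the sample bytes below are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $valid_bytes   = "M\xC3\xBCll";   # well-formed UTF-8
my $invalid_bytes = "M\xFCll";       # a lone 0xFC is not valid UTF-8

# Hypothetical helper: returns the decoded string, or undef if the
# bytes are not valid UTF-8. LEAVE_SRC keeps decode() from modifying
# the caller's byte string while it checks.
sub strict_decode {
    my ($bytes) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    return scalar eval { decode('UTF-8', $bytes, $check) };
}

my $ok  = strict_decode($valid_bytes);
my $bad = strict_decode($invalid_bytes);

print defined $ok  ? "valid input decoded\n"   : "valid input rejected\n";
print defined $bad ? "invalid input decoded\n" : "invalid input rejected\n";
```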
> In the latter case a hex dump of the output shows that all the utf8
> seems to have done is drop the first of the two characters making up
> the unicode. The XML file has the correct UTF8 code for the cp850
> umlaut.
Again, which character are you talking about, and what byte sequence is
used in the file to represent it? If the old file is actually in cp850
then e.g. a-umlaut will be 0x84, whereas in UTF-8 it would be 0xc3 0xa4.
If you've got an a-umlaut represented as 0xc3 0x84 (or the equivalent)
then the new file has been converted to UTF-8 incorrectly.
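The expected byte sequences can be checked with a short script along these lines (hex_octets is a hypothetical helper used only for display):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $a_umlaut = "\x{00E4}";   # LATIN SMALL LETTER A WITH DIAERESIS

# Hypothetical helper: format a byte string as space-separated hex octets.
sub hex_octets {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

my $as_cp850 = hex_octets(encode('cp850', $a_umlaut));
my $as_utf8  = hex_octets(encode('UTF-8', $a_umlaut));

print "cp850: $as_cp850\n";
print "UTF-8: $as_utf8\n";
```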
Ben
------------------------------
Date: Sat, 19 Oct 2013 16:08:07 +0000 (UTC)
From: "Dave Saville" <dave@invalid.invalid>
Subject: Re: Unicode help please
Message-Id: <fV45K0OBJxbE-pn2-BS5Md9WdwZdE@paddington.bear.den>
On Sat, 19 Oct 2013 13:13:37 UTC, Ben Morrow <ben@morrow.me.uk> wrote:
Hi Ben
>
> Quoth "Dave Saville" <dave@invalid.invalid>:
> > I have a perl script that does sanity checking on a mail store. The
> > "house keeping" files of the mail store used to be lines with the
> > fields separated by some hex character. This proved difficult if not
> > impossible to expand so the lead developer decided to change to XML
> > files in UTF8. Now one of the checks is between the old and new
> > versions of a file as if XML files are not found then they are
> > generated from the old ones - and we have had some "interesting"
> > problems :-)
> >
> > One of the files holds file system folder names and one of the checks
> > is that the name is the same in both files - we had a bug where they
> > weren't. It has been working fine until a German user came along with
> > an Umlaut in the folder name. :-)
> >
> > The base code page of the system is 850. So the true file system name
> > has the cp850 umlaut as does the old housekeeping file but of course
> > the XML version has a double byte version of the character.
> >
> > My problem is getting them to compare equal. Just playing with a test
> > script and I just can't figure it out.
> >
> > use strict;
> > use warnings;
> > use Unicode::Normalize;
> > open my $INI, '<', 'folder.ini' or die $!;
>
> If this file really is in cp850 you need to tell perl that:
>
> open my $INI, "<:encoding(cp850)", "folder.ini" or die $!;
>
I had assumed it was cp850 because that is what Western OS/2 systems
default to. But looking at the data I see hex4DFC6C6C which is
ISO8859-1 lower case u umlaut. The xml file has hex4DC2B36C 6C. Which
looks OK to me.
> However, if, as Bjoern suggests, it's actually in cp1252, then this
> isn't the cause of the problem, since all the umlauted characters are in
> the same places in cp1252 and ISO8859-1, and perl assumes ISO8859-1 if
> you don't tell it otherwise.
>
> What character are you dealing with, and what byte is actually used to
> represent it in the file?
>
> > my ( $ini_folder_name, $id, $ini_is_archived ) = ( split /\xDE/,
> > <$INI> )[ 0, 1, 10 ];
> > print $ini_folder_name, "\n";
> > open my $XML, '<:raw:utf8', 'folderpr.xml' or die $!;
>
> You should never use :utf8 for input. It does no validity checking, and
> the rest of perl tends to assume Unicode strings will be valid, which
> can lead to segfaults if they're not. Always use :encoding(utf8)
> instead. (:utf8 is generally safe for output.)
>
Ah, thanks. Copied from the perl cookbook :-)
I really really don't get this stuff. :-(
Internally perl uses utf8 - yes? So if no code is specified it assumes
ISO8859-1 for input and output and converts to utf8 to store. I
presume it does not do any conversion if opened binary.
So the ini file is read assuming ISO8859-1 and converted to utf8.
The xml file is already utf8 so by telling perl that then no
conversion is done.
So if everything is in utf8 why can't I compare them?
--
Regards
Dave Saville
------------------------------
Date: Sat, 19 Oct 2013 18:44:16 +0200
From: Bjoern Hoehrmann <bjoern@hoehrmann.de>
Subject: Re: Unicode help please
Message-Id: <5lc5699aq98kqplit1evti8iprnnh6plcs@hive.bjoern.hoehrmann.de>
* Dave Saville wrote in comp.lang.perl.misc:
>I had assumed it was cp850 because that is what Western OS/2 systems
>default to. But looking at the data I see hex4DFC6C6C which is
>ISO8859-1 lower case u umlaut. The xml file has hex4DC2B36C 6C. Which
>looks OK to me.
If that is 4D C2 B3 6C 6C then
  % perl -MEncode -Mcharnames=:full -e
      "print charnames::viacode(ord decode('utf-8', qq(\xc2\xb3)))"
  SUPERSCRIPT THREE
You probably need something like this:
  % perl -MEncode -Mcharnames=:full -e
      "print charnames::viacode(ord
        decode('Windows-1252',
          encode('cp850',
            decode('utf-8', qq(\xc2\xb3)))))"
  LATIN SMALL LETTER U WITH DIAERESIS
This seems to have multiple character encodings applied to it
incorrectly, and the sequence above might undo that, but more
test samples would be needed to determine that for sure.
>I really really don't get this stuff. :-(
>
>Internally perl uses utf8 - yes? So if no code is specified it assumes
>ISO8859-1 for input and output and converts to utf8 to store. I
>presume it does not do any conversion if opened binary.
>
>So the ini file is read assuming ISO8859-1 and converted to utf8.
>The xml file is already utf8 so by telling perl that then no
>conversion is done.
>
>So if everything is in utf8 why can't I compare them?
Perl internals are more complicated than the above. The Encode module,
and the features built upon it, like the `:encoding(...)` layer, know
how to turn bytes into something more character-ish and this is needed
pretty much always, and your original code did it only on one sequence
of bytes but not the other.
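To illustrate the bytes-versus-characters point in the simplest terms, here is a minimal sketch with a made-up two-byte sample:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two octets that UTF-8 uses for u-umlaut (sample data).
my $raw_bytes = "\xC3\xBC";

# Undecoded, Perl sees two separate characters (one per byte).
my $raw_length = length $raw_bytes;

# Decoded, it sees the single character U+00FC.
my $chars       = decode('UTF-8', $raw_bytes);
my $char_length = length $chars;

# A string read without decoding therefore cannot be eq to a decoded one.
my $equal = $raw_bytes eq $chars ? 1 : 0;

print "raw length: $raw_length, decoded length: $char_length, equal: $equal\n";
```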
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
------------------------------
Date: Sat, 19 Oct 2013 17:58:58 +0000 (UTC)
From: "Dave Saville" <dave@invalid.invalid>
Subject: Re: Unicode help please
Message-Id: <fV45K0OBJxbE-pn2-DyBfmTCtlB3V@paddington.bear.den>
On Sat, 19 Oct 2013 16:44:16 UTC, Bjoern Hoehrmann
<bjoern@hoehrmann.de> wrote:
> You probably need something like this:
>
> % perl -MEncode -Mcharnames=:full -e
> "print charnames::viacode(ord
> decode('Windows-1252',
> encode('cp850',
> decode('utf-8', qq(\xc2\xb3)))))"
> LATIN SMALL LETTER U WITH DIAERESIS
>
Hmm, I get that encode(<somecode>, $foo) will take $foo and return it
encoded in somecode. But what does decode('utf-8', $foo) decode
*into*?
--
Regards
Dave Saville
------------------------------
Date: Sat, 19 Oct 2013 19:46:50 +0100
From: Ben Bacarisse <ben.usenet@bsb.me.uk>
Subject: Re: Unicode help please
Message-Id: <0.9301a06e427af2406593.20131019194650BST.8761styu79.fsf@bsb.me.uk>
"Dave Saville" <dave@invalid.invalid> writes:
> On Sat, 19 Oct 2013 16:44:16 UTC, Bjoern Hoehrmann
> <bjoern@hoehrmann.de> wrote:
>
>> You probably need something like this:
>>
>> % perl -MEncode -Mcharnames=:full -e
>> "print charnames::viacode(ord
>> decode('Windows-1252',
>> encode('cp850',
>> decode('utf-8', qq(\xc2\xb3)))))"
>> LATIN SMALL LETTER U WITH DIAERESIS
>>
>
> Hmm I get encode(<somecode>, $foo) will take $foo and return it
> encoded in somecode. But what does decode('utf-8', $foo) decode
> *into*?
It decodes a sequence of octets (\xc2\xb3) into Perl's internal form.
The 'utf-8' tells decode how to interpret the octets. The result is the
Unicode character U+00B3 -- superscript 3.
encode('cp850', ...) takes a string in Perl's internal character format
and produces a sequence of octets. In this case, just one: \xfc -- the
code for superscript 3 in CP-850.
Finally, decode('Windows-1252', ...) takes this octet stream (just the
one: \xfc) and turns it into a Perl string, using the Windows-1252 code
table to decide which character each octet refers to. In this case,
\xfc is a lower-case u with dieresis.
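The three steps can be run one at a time to see each intermediate value. This is only a sketch; the variable names are made up:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets_in_xml = "\xC2\xB3";                      # as found in the XML file

my $step1 = decode('UTF-8', $octets_in_xml);         # octets -> character
my $step2 = encode('cp850', $step1);                 # character -> one octet
my $step3 = decode('Windows-1252', $step2);          # octet -> character

my $step1_cp    = sprintf 'U+%04X', ord $step1;
my $step2_octet = sprintf '%02X',   ord $step2;
my $step3_cp    = sprintf 'U+%04X', ord $step3;

print "after decode('UTF-8'):        $step1_cp\n";
print "after encode('cp850'):        $step2_octet\n";
print "after decode('Windows-1252'): $step3_cp\n";
```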
What may be more interesting is the reverse of this. Something happened
as the XML was generated that caused the wrong code to be used. Exactly
what is hard to say, but it stems from the fact that CP-850 and
Windows-1252 differ about what \xfc means. I think the simplest
explanation is that the data was always originally in Windows-1252
encoding (u-dieresis being \xfc) but when the data was converted to
UTF-8, it was incorrectly assumed to be CP-850. Thus the \xfc was taken
to be a superscript three, which was, in some sense, correctly rendered
in UTF-8 in the XML file.
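If that explanation is right (it is a guess about what the converter did, not something confirmed in this thread), the faulty conversion can be reproduced like this:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $original_octet = "\xFC";   # u-dieresis in Windows-1252

# Hypothetical helper for display only.
sub hex_octets {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# Assumed faulty conversion: the octet is read as cp850, then written as UTF-8.
my $faulty_hex = hex_octets(
    encode('UTF-8', decode('cp850', $original_octet)));

# What a conversion from Windows-1252 would have written instead.
my $correct_hex = hex_octets(
    encode('UTF-8', decode('Windows-1252', $original_octet)));

print "assumed faulty conversion: $faulty_hex\n";
print "correct conversion:        $correct_hex\n";
```

The faulty result matches the C2 B3 pair reported from the XML file.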
--
Ben.
------------------------------
Date: Sat, 19 Oct 2013 08:38:35 -0700 (PDT)
From: here@softcom.net
Subject: What's the effect of a null input record separator?
Message-Id: <caf62283-49f5-46dc-9d36-bdb89cca6864@googlegroups.com>
Given the one-liner:
perl -0ne 'print "$ARGV\n" if [some condition];' *
What effect does the -0 have on each file? I know the values for paragraph mode 00 and file slurp mode 0777, but I can't find a definitive answer on when $/ is set to the null character.
------------------------------
Date: Sat, 19 Oct 2013 17:52:37 +0200
From: Bjoern Hoehrmann <bjoern@hoehrmann.de>
Subject: Re: What's the effect of a null input record separator?
Message-Id: <6ha569tn6v5ihkrl19horhdcgsiujmqncq@hive.bjoern.hoehrmann.de>
* here@softcom.net wrote in comp.lang.perl.misc:
>Given the one-liner:
>
>perl -0ne 'print "$ARGV\n" if [some condition];' *
>
>What effect does the -0 have on each file? I know the values for
>paragraph mode 00 and file slurp mode 0777, but I can't find a
>definitive answer on when $/ is set to the null character.
http://search.cpan.org/perldoc?perlvar "Trying to set the record size
to zero or less will cause reading in the (rest of the) whole file."
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
------------------------------
Date: Sat, 19 Oct 2013 12:21:38 -0700 (PDT)
From: sybilfriedman@gmail.com
Subject: Re: What's the effect of a null input record separator?
Message-Id: <bd3b26a2-bdce-4d14-8532-4abd0cd6c888@googlegroups.com>
On Saturday, October 19, 2013 8:38:35 AM UTC-7, he...@softcom.net wrote:
> Given the one-liner:
>
> perl -0ne 'print "$ARGV\n" if [some condition];' *
>
> What effect does the -0 have on each file? I know the values for paragraph mode 00 and file slurp mode 0777, but I can't find a definitive answer on when $/ is set to the null character.
------------------------------
Date: Sat, 19 Oct 2013 13:11:17 -0700 (PDT)
From: here@softcom.net
Subject: Re: What's the effect of a null input record separator?
Message-Id: <e39248df-9eb6-4b7c-8508-182a1acddde9@googlegroups.com>
On Saturday, October 19, 2013 8:38:35 AM UTC-7, he...@softcom.net wrote:
> Given the one-liner:
>
> perl -0ne 'print "$ARGV\n" if [some condition];' *
>
> What effect does the -0 have on each file? I know the values for paragraph mode 00 and file slurp mode 0777, but I can't find a definitive answer on when $/ is set to the null character.
########################
So does that mean that -0 on the command line is equivalent to:
undef $/;
while (<>) {
    ...
}
------------------------------
Date: Sat, 19 Oct 2013 13:17:53 -0700
From: "John W. Krahn" <jwkrahn@example.com>
Subject: Re: What's the effect of a null input record separator?
Message-Id: <RNB8u.173440$Oj5.72689@fx02.iad>
here@softcom.net wrote:
> Given the one-liner:
>
> perl -0ne 'print "$ARGV\n" if [some condition];' *
>
> What effect does the -0 have on each file? I know the values for
> paragraph mode 00 and file slurp mode 0777, but I can't find a
> definitive answer on when $/ is set to the null character.
$ perl -MO=Deparse -0ne 'print "$ARGV\n" if [some condition];' *
BEGIN { $/ = "\000"; $\ = undef; }
LINE: while (defined($_ = <ARGV>)) {
    print "$ARGV\n" if ['condition'->some];
}
-e syntax OK
The value "\0" will be used as the record separator, so if you have a
"text" file (one containing no "\0" characters) it is the same as slurp
mode, but if you have a "binary" file you will get one record for each
"\0" character in the file, plus a final record for any data after the
last one.
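A minimal sketch of that behaviour, using an in-memory filehandle and made-up data in place of a real file:

```perl
use strict;
use warnings;

# Sample data: three NUL-separated records, one containing a newline.
my $data = "alpha\0beta\ngamma\0delta";
open my $in, '<', \$data or die $!;

my @records;
{
    local $/ = "\0";                 # what -0 sets on the command line
    while (my $record = <$in>) {
        chomp $record;               # chomp removes the trailing "\0" here
        push @records, $record;
    }
}
close $in;

printf "%d records read\n", scalar @records;
```

The embedded newline stays inside the second record, since only "\0" ends a record.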
John
--
Any intelligent fool can make things bigger and
more complex... It takes a touch of genius -
and a lot of courage to move in the opposite
direction. -- Albert Einstein
------------------------------
Date: Sun, 20 Oct 2013 00:22:52 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: What's the effect of a null input record separator?
Message-Id: <crgbja-oeo1.ln1@anubis.morrow.me.uk>
Quoth derhoermi@gmx.net:
> * here@softcom.net wrote in comp.lang.perl.misc:
> >Given the one-liner:
> >
> >perl -0ne 'print "$ARGV\n" if [some condition];' *
> >
> >What effect does the -0 have on each file? I know the values for
> >paragraph mode 00 and file slurp mode 0777, but I can't find a
> >definitive answer on when $/ is set to the null character.
>
> http://search.cpan.org/perldoc?perlvar "Trying to set the record size
> to zero or less will cause reading in the (rest of the) whole file."
Not relevant. That's talking about $/ = \0, whereas -0 on the
command-line is equivalent to $/ = "\0".
The record separator ('newline' character) is set to ASCII NUL, the
character with value 0. This is the same behaviour as 'xargs -0', and
compatible with input from 'find -print0' and other utilities that
produce null-separated output. It's useful in cases where newlines may
be a valid part of the record.
Ben
------------------------------
Date: Sat, 19 Oct 2013 23:50:34 +0000 (UTC)
From: tmcd@panix.com (Tim McDaniel)
Subject: Re: What's the effect of a null input record separator?
Message-Id: <l3v5s9$c4$1@reader1.panix.com>
In article <crgbja-oeo1.ln1@anubis.morrow.me.uk>,
Ben Morrow <ben@morrow.me.uk> wrote:
>
>Quoth derhoermi@gmx.net:
>> * here@softcom.net wrote in comp.lang.perl.misc:
>> >Given the one-liner:
>> >
>> >perl -0ne 'print "$ARGV\n" if [some condition];' *
>> >
>> >What effect does the -0 have on each file? I know the values for
>> >paragraph mode 00 and file slurp mode 0777, but I can't find a
>> >definitive answer on when $/ is set to the null character.
>>
>> http://search.cpan.org/perldoc?perlvar "Trying to set the record
>> size to zero or less will cause reading in the (rest of the) whole
>> file."
>
>Not relevant. That's talking about $/ = \0,
Do you mean
$/ = 0
or is there something special about a scalar reference to 0?
--
Tim McDaniel, tmcd@panix.com
------------------------------
Date: Sun, 20 Oct 2013 01:17:32 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: What's the effect of a null input record separator?
Message-Id: <s1kbja-p2p1.ln1@anubis.morrow.me.uk>
Quoth tmcd@panix.com:
> In article <crgbja-oeo1.ln1@anubis.morrow.me.uk>,
> Ben Morrow <ben@morrow.me.uk> wrote:
> >
> >Not relevant. That's talking about $/ = \0,
>
> Do you mean
> $/ = 0
> or is there something special about a scalar reference to 0?
There is something special about setting $/ to a scalar reference to an
integer. See perlvar.
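For comparison, a short sketch of the reference-to-integer form, again with an in-memory filehandle and made-up data:

```perl
use strict;
use warnings;

my $data = 'abcdefghij';
open my $in, '<', \$data or die $!;

my @chunks;
{
    local $/ = \4;                   # reference to an integer: record size 4
    while (my $chunk = <$in>) {
        push @chunks, $chunk;
    }
}
close $in;

print join('|', @chunks), "\n";
```

Each read returns at most four units, with a shorter final chunk, which is quite different from splitting on a "\0" separator.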
Ben
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V11 Issue 4059
***************************************