[33159] in Perl-Users-Digest


Perl-Users Digest, Issue: 4438 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Tue May 26 03:09:18 2015

Date: Tue, 26 May 2015 00:09:03 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Tue, 26 May 2015     Volume: 11 Number: 4438

Today's topics:
        word boundrary and umlaut in Perl regex <fmassion@web.de>
    Re: word boundrary and umlaut in Perl regex <bauhaus@futureapps.invalid>
    Re: word boundrary and umlaut in Perl regex <rweikusat@mobileactivedefense.com>
    Re: word boundrary and umlaut in Perl regex <fmassion@web.de>
    Re: word boundrary and umlaut in Perl regex <fmassion@web.de>
    Re: word boundrary and umlaut in Perl regex <rweikusat@mobileactivedefense.com>
    Re: word boundrary and umlaut in Perl regex <rweikusat@mobileactivedefense.com>
    Re: word boundrary and umlaut in Perl regex <rweikusat@mobileactivedefense.com>
    Re: word boundrary and umlaut in Perl regex <fmassion@web.de>
    Re: word boundrary and umlaut in Perl regex <rweikusat@mobileactivedefense.com>
    Re: word boundrary and umlaut in Perl regex <rweikusat@mobileactivedefense.com>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Sun, 24 May 2015 03:12:55 -0700 (PDT)
From: F Massion <fmassion@web.de>
Subject: word boundrary and umlaut in Perl regex
Message-Id: <d98e1d7e-559a-49b1-ac9f-d0e9c6aa8098@googlegroups.com>

I have 2 hashes.

Hash1 has sentences-IDs + sentences
Example:
214	Das Modul Automotive Material Flow besteht aus fünf Gruppen
222	Es erfolgt eine Nachentwässerung des Behälters.

Hash2 has terms-IDs + terms
Example:
795	Nachentwässerung
796	NF

I want to get a list of all sentences which contain a term.
This works fine even with most words with umlaut but it looks as if the
word boundary delimiter "\b" would consider a string immediately after an
umlaut as the start of a word. I get therefore a match for the term "NF" in
the word "fünf" of sentence #214. This is obviously wrong.

Here is the code:

foreach $key1 (sort keys %hash1)
{
	while (($key2, $value2) = each %hash2)
	{
		if ($hash1{$key1} =~ m/\b$value2\b/i)
		{
		print  "$hash1{$key1};$value2\n";
		}
	}
}

The files are encoded UTF8.

I have tried all possible combinations of the following instructions, but
to no avail.

use utf8;
use locale;
binmode STDIN,  ":utf8";
binmode STDOUT, ":utf8";

Any suggestion?



------------------------------

Date: Sun, 24 May 2015 12:50:33 +0200
From: Georg Bauhaus <bauhaus@futureapps.invalid>
Subject: Re: word boundrary and umlaut in Perl regex
Message-Id: <mjsabi$qn2$1@dont-email.me>

On 24.05.15 12:12, F Massion wrote:
> This works fine even with most words with umlaut but it looks as if the word boundary delimiter "\b" would consider a string immediately after an umlaut as the start of a word. I get therefore a match for the term "NF" in the word "fünf" of sentence #214. This is obviously wrong.


Maybe some I/O coding issue at some point? Looking at words with the help
of Devel::Peek, i.e. something like Dump $s, I get Perl values with or without
UTF8 for the following, depending on the presence of "use utf8" alone.
(Also, the UTF-8 PVs get more MAGIC (and possibly some hints) when I add length($s)
anywhere.)

$ perl -w -e 'no utf8; $s = "Fünf"; print STDOUT int($s =~ m/\bnf/i), "\n";'
1
$ perl -w -e 'use utf8; $s = "Fünf"; print STDOUT int($s =~ m/\bnf/i), "\n";'
0

where Dump $s yields
FLAGS = (POK,pPOK,UTF8)
PV = 0x7fa46ac05f40 "F\303\274nf"\0 [UTF8 "F\x{fc}nf"]

But

$ perl -w -e 'use utf8; $s = $ARGV[0]; print STDOUT int($s =~ m/\bnf/i), "\n";' "Fünf"
1

where Dump $s yields
FLAGS = (POK,pPOK)
PV = 0x7ff941405f40 "f\303\274nf"\0

Locale has UTF-8,
$ perl -v
This is perl 5, version 16, subversion 2 (v5.16.2) built for darwin-thread-multi-2level



------------------------------

Date: Sun, 24 May 2015 12:52:03 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: word boundrary and umlaut in Perl regex
Message-Id: <87wpzy42ho.fsf@doppelsaurus.mobileactivedefense.com>

F Massion <fmassion@web.de> writes:
> I have 2 hashes. 
>
> Hash1 has sentences-IDs + sentences 
> Example:
> 214	Das Modul Automotive Material Flow besteht aus fünf Gruppen
> 222	Es erfolgt eine Nachentwässerung des Behälters.
>
> Hash2 has terms-IDs + terms 
> Example: 
> 795	Nachentwässerung
> 796	NF
>
> I want to get a list of all sentences which contain a term.
> This works fine even with most words with umlaut but it looks as if the word boundary delimiter "\b" would consider a string immediately after an umlaut as the start of a word. I get therefore a match for the term "NF" in the word "fünf" of sentence #214. This is obviously wrong.
>
> Here is the code:
>
> foreach $key1 (sort keys %hash1)
> {
> 	while (($key2, $value2) = each %hash2)
> 	{
> 		if ($hash1{$key1} =~ m/\b$value2\b/i)
> 		{
> 		print  "$hash1{$key1};$value2\n";
> 		}
> 	}
> }
>
> The files are encoded UTF8.
>
> I have tried all possible combinations of the following instructions, but to no avail.
>
> use utf8; 
> use locale;
> binmode STDIN,  ":utf8";
> binmode STDOUT, ":utf8";
>
> Any suggestion?

Since you didn't post enough code to demonstrate the problem, it's
somewhat difficult to tell what you got wrong, however, assuming input
files /tmp/d0 (manually re-encoded iso8859-1 -> UTF-8)

----
214	Das Modul Automotive Material Flow besteht aus fünf Gruppen
222	Es erfolgt eine Nachentwässerung des Behälters.
----

and /tmp/d1

----
795	Nachentwässerung
796	NF
456	erfolgt
----

the following code works for me:

----
binmode(STDOUT, ':utf8');

sub f2hash
{
    my $fh;

    open($fh, '<:utf8', $_[0]);
    return { map { chomp; split(' ', $_, 2) } <$fh>  }
}

my $h0 = f2hash('/tmp/d0');
my $h1 = f2hash('/tmp/d1');

for my $v0 (values(%$h0)) {
    $v0 =~ /\b$_\b/i and print("$v0; $_\n") for values(%$h1);
}
----

BTW, I don't know what you may be doing with the keys elsewhere
but a wise man once said "Doing linear traversals over a hash is
like clubbing someone to death with a loaded Uzi" ...
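
For comparison, here is a minimal, self-contained sketch of why decoding matters for \b. It is not the original poster's program: the matches() helper and the inline sample data are made up for illustration, and the sample strings are written as UTF-8 octets to stand in for lines read without a :utf8 layer. The \Q...\E around the term is an extra precaution, assuming terms are meant as literal text rather than as patterns.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Sample data as UTF-8 octets, standing in for lines read without a decoding layer.
my %sentences = (
    214 => "Das Modul Automotive Material Flow besteht aus f\xc3\xbcnf Gruppen",
    222 => "Es erfolgt eine Nachentw\xc3\xa4sserung des Beh\xc3\xa4lters.",
);
my %terms = (
    795 => "Nachentw\xc3\xa4sserung",
    796 => "NF",
);

# Hypothetical helper: return "sentence-id:term-id" pairs where the term
# occurs as a whole word. With $decode true, both sides are decoded first.
sub matches {
    my ($sent, $term, $decode) = @_;
    my @hits;
    for my $sid (sort keys %$sent) {
        my $s = $decode ? decode('UTF-8', $sent->{$sid}) : $sent->{$sid};
        for my $tid (sort keys %$term) {
            my $t = $decode ? decode('UTF-8', $term->{$tid}) : $term->{$tid};
            # \Q...\E keeps regex metacharacters in a term from being interpreted.
            push @hits, "$sid:$tid" if $s =~ /\b\Q$t\E\b/i;
        }
    }
    return @hits;
}

print "raw octets: @{[ matches(\%sentences, \%terms, 0) ]}\n";
print "decoded:    @{[ matches(\%sentences, \%terms, 1) ]}\n";
```

With raw octets the second byte of the encoded "ü" is not a word character, so "NF" is found inside "fünf"; once both strings are decoded, that spurious hit should disappear while the "Nachentwässerung" match remains.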


------------------------------

Date: Sun, 24 May 2015 07:04:46 -0700 (PDT)
From: F Massion <fmassion@web.de>
Subject: Re: word boundrary and umlaut in Perl regex
Message-Id: <a9d3c2bb-4f69-43ed-bd52-43655db7d684@googlegroups.com>

On Sunday, 24 May 2015 13:52:07 UTC+2, Rainer Weikusat wrote:
> [...]
>
> BTW, I don't know what you may be doing with the keys elsewhere
> but a wise man once said "Doing linear traversals over a hash is
> like clubbing someone to death with a loaded Uzi" ...

Thank you. I will study your code because I do not understand yet what you
do but it works.
You've brought up a nice quote. I need the keys though to do other things
with them.


------------------------------

Date: Sun, 24 May 2015 07:59:26 -0700 (PDT)
From: F Massion <fmassion@web.de>
Subject: Re: word boundrary and umlaut in Perl regex
Message-Id: <61dd8aa3-00d5-4f52-a19d-732b6e8bb1d5@googlegroups.com>

On Sunday, 24 May 2015 16:04:51 UTC+2, F Massion wrote:
> [...]
>
> Thank you. I will study your code because I do not understand yet what you
> do but it works.
> You've brought up a nice quote. I need the keys though to do other things
> with them.

A last remark, as Rainer's method did work on my display but not in the
output file. I had the problem that the characters didn't show properly in my
output file, which is HTML with encoding UTF-8.

I have now found a workaround in the match expression which does the trick:

if ($hash1{$key1} =~ m/\b$value2\b/i) has been replaced with:
if ($hash1{$key1} =~ m/(?<=\p{L})\b$value2\b/i)

Apparently, in the word "fünf" the "ü" has been converted to non-letter
characters. Therefore the lookbehind condition prevents a match here.


------------------------------

Date: Sun, 24 May 2015 18:21:07 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: word boundrary and umlaut in Perl regex
Message-Id: <87iobhuc1o.fsf@doppelsaurus.mobileactivedefense.com>

F Massion <fmassion@web.de> writes:
> On Sunday, 24 May 2015 16:04:51 UTC+2, F Massion wrote:
>> On Sunday, 24 May 2015 13:52:07 UTC+2, Rainer Weikusat wrote:
>> > F Massion <fmassion@web.de> writes:
>> > > I have 2 hashes. 
>> > >
>> > > Hash1 has sentences-IDs + sentences 
>> > > Example:
>> > > 214	Das Modul Automotive Material Flow besteht aus fünf Gruppen
>> > > 222	Es erfolgt eine Nachentwässerung des Behälters.
>> > >
>> > > Hash2 has terms-IDs + terms 
>> > > Example: 
>> > > 795	Nachentwässerung
>> > > 796	NF
>> > >
>> > > I want to get a list of all sentences which contain a term.
>> > > This works fine even with most words with umlaut but it looks as
>> > > if the word boundary delimiter "\b" would consider a string
>> > > immediately after an umlaut as the start of a word. I get
>> > > therefore a match for the term "NF" in the word "fünf" of
>> > > sentence #214. This is obviously wrong.

[...]

>> > input
>> > files /tmp/d0 (manually re-encoded iso8859-1 -> UTF-8)
>> > 
>> > ----
>> > 214	Das Modul Automotive Material Flow besteht aus fünf Gruppen
>> > 222	Es erfolgt eine Nachentwässerung des Behälters.
>> > ----
>> > 
>> > and /tmp/d1
>> > 
>> > ----
>> > 795	Nachentwässerung
>> > 796	NF
>> > 456	erfolgt
>> > ----

[...]

>> Thank you. I will study your code because I do not understand yet what you do but it works.
>> You've brought up a nice quote. I need the keys though to do other things with them.
>
> A last remark, as Rainer's method did work on my display but not in
> the output file. I had the Problem that the characters didn't show
> properly in my output file which is HTML with Encoding utf8.

IOW, the code I posted worked, some other code you didn't post still
doesn't. Using the same inputs as above,

--------
binmode(STDOUT, ':utf8');

sub f2hash
{
    my $fh;

    open($fh, '<:utf8', $_[0]);
    return { map { chomp; split(' ', $_, 2) } <$fh>  }
}

my $h0 = f2hash('/tmp/d0');
my $h1 = f2hash('/tmp/d1');

print <<TT;
<html>
<head>
<title>Es merkelt die Muehle mit garstigem Krach</title>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> 
</head>
<body>
TT
    
for my $v0 (values(%$h0)) {
    $v0 =~ /\b$_\b/i and print("$v0; $_\n") for values(%$h1);
}

print <<TT;
</body>
</html>
TT
-------

generates a UTF-8 encoded HTML file.

> I have now found a workaround  in the match Expression which does the trick:
>
> if ($hash1{$key1} =~ m/\b$value2\b/i) has been replaced with:
> if ($hash1{$key1} =~ m/(?<=\p{L})\b$value2\b/i)
>
> Apparently, in the word "fünf" the "ü" has been converted to
> non-letter characters.

If this is happening to you, you're very likely using Perl in 'default
mode', ie, without locale or other information telling it otherwise,
together with input data containing iso-8859-1 characters beyond the
ASCII range. It then won't consider these to be word characters/
letters. Another option would be that you're using UTF-8 inputs but
without telling Perl so. In this case, the output is especially amusing
as the input octets making up the UTF-8 encoded extended characters will
themselves end up being UTF-8 encoded (if :utf8 is used on the output
stream), eg, the a-diaeresis in the input file, 0xc3 0xa4, ends up as
0xc3 0x83 0xc2 0xa4, which is 0xc3 0xa4 UTF-8-encoded.
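
The double encoding described here can be reproduced in a few lines. This is only a sketch: hex_octets() is a made-up helper for display, and Encode::encode stands in for what a :utf8 output layer does to a string that was never decoded.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Raw UTF-8 octets for the a-diaeresis, as read from a file without a decoding layer.
my $octets = "\xc3\xa4";

# Encoding the undecoded octets treats each octet as a separate character
# (U+00C3 and U+00A4) and encodes both of them.
my $double = encode('UTF-8', $octets);

# Hypothetical helper: render a byte string as space-separated hex octets.
sub hex_octets { join ' ', map { sprintf '%02x', ord } split //, $_[0] }

print hex_octets($octets), "\n";
print hex_octets($double), "\n";
```

The second line of output should show the four octets quoted above.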


------------------------------

Date: Sun, 24 May 2015 21:25:09 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: word boundrary and umlaut in Perl regex
Message-Id: <87mw0tafkq.fsf@doppelsaurus.mobileactivedefense.com>

Georg Bauhaus <bauhaus@futureapps.invalid> writes:
> On 24.05.15 12:12, F Massion wrote:
>> This works fine even with most words with umlaut but it looks as if
>> the word boundary delimiter "\b" would consider a string immediately
>> after an umlaut as the start of a word. I get therefore a match for
>> the term "NF" in the word "fünf" of sentence #214. This is obviously
>> wrong.

> Maybe some I/O coding issue at some point?

Perl support for unicode is somewhat bizarre, presumably because someone
with a UNIX(*) naturally came to the decision of using UTF-8 in strings
which was supposed to be decoded on demand while a bunch of latter day
someones more used to the 'native' 16-bit encodings of WinDOS X later
came to the 'conclusion' that this was surely an error. Because of
this, perl is conceptually not compatible with anything at this level,
not even with itself(!): It has to be told explicitly of the encoding of
all (string) data flowing into it and the encoding of all data going out
of it also has to be selected explicitly.

> Looking at words with the help
> of Devel::Peek, i.e. something like Dump $s, I get Perl values with or without
> UTF8 for the following, depending on the presence of "use utf8" alone.
> (Also, the UTF-8 PVs get more MAGIC (and possibly some hints) when I add length($s)
> anywhere.)
>
> $ perl -w -e 'no utf8; $s = "Fünf"; print STDOUT int($s =~ m/\bnf/i), "\n";'
> 1
> $ perl -w -e 'use utf8; $s = "Fünf"; print STDOUT int($s =~ m/\bnf/i), "\n";'
> 0

[...]

> where Dump $s yields
> FLAGS = (POK,pPOK,UTF8)
> PV = 0x7fa46ac05f40 "F\303\274nf"\0 [UTF8 "F\x{fc}nf"]
>
> But
>
> $ perl -w -e 'use utf8; $s = $ARGV[0]; print STDOUT int($s =~ m/\bnf/i), "\n";' "Fünf"
> 1
>
> where Dump $s yields
> FLAGS = (POK,pPOK)
> PV = 0x7ff941405f40 "f\303\274nf"\0

The purpose of 'use utf8' is to tell the parser that its input file (or
some section of it) uses utf8 encoding. This affects literal strings
used in the source text but has no effect on commandline arguments.

perl -MEncode -e '$s = decode("UTF-8", $ARGV[0]); print STDOUT int($s =~ m/\bnf/i), "\n";' "Fünf"

works as intended (AFAICT, there is no way to tell perl that "This is a
UTF-8 environment and you're using UTF-8, dammit, so just accept that
data" because the people working on this don't want this to be possible).



------------------------------

Date: Sun, 24 May 2015 21:38:06 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: word boundrary and umlaut in Perl regex
Message-Id: <87iobhaez5.fsf@doppelsaurus.mobileactivedefense.com>

Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:

[...]

> perl -MEncode -e '$s = decode("UTF-8", $ARGV[0]); print STDOUT int($s =~ m/\bnf/i), "\n";' "Fünf"
>
> works as intended (AFAICT, there is no way to tell perl that "This is a
> UTF-8 environment and you're using UTF-8, dammit, so just accept that
> data" because the people working on this don't want this to be possible).

It's actually sort-of possible,

perl -CSAD -e '$s = $ARGV[0];  print STDOUT int($s =~ m/\bnf/i), "\n";' "Fünf"

with

	S	-	STDIN/ -OUT/ -ERR are UTF-8
	A	-	@ARGV UTF-8
	D	-	all other streams default to :utf8

BUT

	Since perl 5.10.1, if the -C option is used on the "#!" line, it
	must be specified on the command line as well, since the
	standard streams are already set up at this point in the
	execution of the perl interpreter.
	[perldoc perlrun]
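
Given that restriction, a rough in-script counterpart may be easier to live with. The sketch below assumes that the 'open' pragma with :std covers about what -CSD does and that decoding @ARGV by hand replaces -CA; decode_args() is a made-up helper name, and the two approaches are not claimed to be equivalent in every detail.

```perl
use strict;
use warnings;
use Encode qw(decode);
use open qw(:std :utf8);   # standard streams and default open() layers use :utf8

# Hypothetical helper: decode UTF-8 octet strings into character strings.
sub decode_args { map { decode('UTF-8', $_) } @_ }

my @args = decode_args(@ARGV);
for my $s (@args) {
    # Same check as in the one-liners above: does "nf" start at a word boundary?
    print "$s: ", ($s =~ /\bnf/i ? 1 : 0), "\n";
}
```

Saved as a script and called with "Fünf" as its argument in a UTF-8 terminal, this should print 0 for that word, like the Encode one-liner.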


------------------------------

Date: Mon, 25 May 2015 02:55:23 -0700 (PDT)
From: F Massion <fmassion@web.de>
Subject: Re: word boundrary and umlaut in Perl regex
Message-Id: <70e9e7e5-1930-4cd2-8af2-afcf36a58d7a@googlegroups.com>

> IOW, the code I posted worked, some other code you didn't post still
> doesn't. Using the same inputs as above,
>

Sorry for this, but indeed the complete code is much longer and I think not
relevant for the issue here.
The file encoding options which have been tested are listed in the initial
posting.

With regard to the output I have these lines (which I also tested without
the charset encoding instructions):

open(AUSGABE,">ergebnis.htm");

print AUSGABE "<!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 4.01//EN' 'http://www.w3.org/TR/html4/strict.dtd'> <meta http-equiv='Content-Type' content='text/html; charset=UTF-8'>\n" ;
print AUSGABE "<body>";
print AUSGABE "<html>";
print AUSGABE "<div>";
print AUSGABE "\<table border=\"2\"\>";


> print <<TT;
> <html>
> <head>
> <title>Es merkelt die Muehle mit garstigem Krach</title>
> <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
> </head>
> <body>
> TT

> generates a UTF-8 encoded HTML file.
>

I would be interested to know what happens to your "Muehle" when you write
it "Mühle".

Thanks for your interest and your explanations. Programming is actually not
my job and these explanations are very helpful.



------------------------------

Date: Mon, 25 May 2015 15:26:11 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: word boundrary and umlaut in Perl regex
Message-Id: <87mw0sd98c.fsf@doppelsaurus.mobileactivedefense.com>

F Massion <fmassion@web.de> writes:
>> IOW, the code I posted worked, some other code you didn't post still
>> doesn't. Using the same inputs as above,
>> 
>
> Sorry for this, but indeed the complete code is much longer and I
> think not relevant for the issue here.  The file encoding options
> which have been tested are listed in the initial posting.

According to your initial posting, you didn't test any file encoding
options except for STDIN and STDOUT which are likely unused, at least
for the data in question.

> With regard to the output I have these lines (which I also tested
> without the charset encoding instructions):
>
> open(AUSGABE,">ergebnis.htm");

And that's likely where the problem starts: Assuming you've correctly
informed perl that its inputs were UTF-8 encoded, writing the resulting
data to this file will convert it to 'the native, extended character
set' if possible, ie, if there are no codepoints > 255. For me, this means
the text ends up as iso8859-1. In order to avoid this, the outputs also
need to be marked as utf8, eg, by using

open(AUSGABE, '>:utf8', 'ergebnis.htm')

[...]

>> print <<TT;
>> <html>
>> <head>
>> <title>Es merkelt die Muehle mit garstigem Krach</title>
>> <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> 
>> </head>
>> <body>
>> TT
>
>> generates a UTF-8 encoded HTML file.
>> 
>
> I would be interested to know what happens to your "Muehle" when you
> write it "Mühle".

The iso8859-1 character from your posting? Or the UTF-8 equivalent? You
could just try that yourself but since I was curious, here are a few
results based on the UTF-8 u-umlaut (I stopped using anything but ASCII
characters on the grounds that the unicode-consortium had successfully
deprecated the notion that anything but that counts as 'letter' years
ago --- in newer standardese, these are all 'extended grapheme
clusters', ie, weird shit the natives insist on drawing we have to
reproduce if we want to do business with them ...), with the results
being exactly as expected in every case:

	- writing the UTF-8 to a stream marked as UTF-8 without telling
	  the parser that the source contains UTF-8 characters (ie, no
	  'use utf8') results in the UTF-8 representation itself being
	  UTF-8 encoded

	- "doing it right" results in a UTF-8 u diaeresis in the output

	- omitting both 'use utf8' and :utf8 for the output streams
	  causes the UTF-8 ue to fall through while the umlauts from the
	  inputs marked as UTF-8 are converted to iso8859-1

	- use utf8 but no :utf8 => everything iso
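
The four cases can be checked without touching any files. This is a sketch under assumptions: written_octets() is a made-up helper that prints to an in-memory file, "\xc3\xbc" stands in for a source literal parsed without 'use utf8', and the upgraded "\x{fc}" stands in for the same literal under 'use utf8' (or for input read through a :utf8 layer).

```perl
use strict;
use warnings;

# Hypothetical helper: print $string to an in-memory file opened with the
# given layer ('' or ':utf8') and return the octets written, as hex.
sub written_octets {
    my ($string, $layer) = @_;
    my $buf = '';
    open(my $fh, ">$layer", \$buf) or die "open: $!";
    print {$fh} $string;
    close($fh) or die "close: $!";
    return join ' ', map { sprintf '%02x', ord } split //, $buf;
}

my $octets = "\xc3\xbc";   # UTF-8 u-umlaut as the parser sees it without 'use utf8'
my $chars  = "\x{fc}";     # the single character u-umlaut, as with 'use utf8'
utf8::upgrade($chars);     # internal UTF-8 form, as a decoded string would have

printf "%-30s %s\n", 'no use utf8, :utf8 output:', written_octets($octets, ':utf8');
printf "%-30s %s\n", 'use utf8, :utf8 output:',    written_octets($chars,  ':utf8');
printf "%-30s %s\n", 'no use utf8, no layer:',     written_octets($octets, '');
printf "%-30s %s\n", 'use utf8, no layer:',        written_octets($chars,  '');
```

The first line should show the doubly encoded form, the second and third the plain UTF-8 octets, and the last a single iso8859-1 octet.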
        



------------------------------

Date: Mon, 25 May 2015 15:37:27 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: word boundrary and umlaut in Perl regex
Message-Id: <87iobgd8pk.fsf@doppelsaurus.mobileactivedefense.com>

Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:

[...]

> 	- writing the UTF-8 to a stream marked as UTF-8 without telling
> 	  the parser that the source contains UTF-8 characters (ie, no
> 	  'use utf8') results in the UTF-8 representation itself being
> 	  UTF-8 encoded
>
> 	- "doing it right" results in a UTF-8 u diaeresis in the output
>
> 	- omitting both 'use utf8' and :utf8 for the output streams
> 	  causes the UTF-8 ue to fall through while the umlauts from the
> 	  inputs marked as UTF-8 are converted to iso8859-1
>
> 	- use utf8 but no :utf8 => everything iso
>         

Remark I can't help making here: So much for another attempt to solve
the problem of "so many incompatible variants already" by "doing it
right this time" and introducing another variant.


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 4438
***************************************

