[31500] in Perl-Users-Digest
Perl-Users Digest, Issue: 2759 Volume: 11
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Jan 7 21:14:20 2010
Date: Thu, 7 Jan 2010 18:14:11 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Thu, 7 Jan 2010 Volume: 11 Number: 2759
Today's topics:
Re: unicode newbie, can you help? sln@netherlands.com
Re: unicode newbie, can you help? sln@netherlands.com
Re: unicode newbie, can you help? sln@netherlands.com
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Thu, 07 Jan 2010 12:01:52 -0800
From: sln@netherlands.com
Subject: Re: unicode newbie, can you help?
Message-Id: <64dck5t4iubnfl8ntpigkti7jfnurl3taa@4ax.com>
On Thu, 7 Jan 2010 00:56:37 -0800 (PST), "alexxx.magni@gmail.com" <alexxx.magni@gmail.com> wrote:
>On 7 Gen, 09:51, "alexxx.ma...@gmail.com" <alexxx.ma...@gmail.com>
>wrote:
>> On 5 Gen, 19:45, s...@netherlands.com wrote:
>>
>>
>>
>> > On Tue, 5 Jan 2010 06:40:24 -0800 (PST), "alexxx.ma...@gmail.com" <alexxx.ma...@gmail.com> wrote:
>> > >sorry if it's a naive question, but I never needed up to now to deal
>> > >with anything beyond a-zA-Z0-9....
>>
>> > >now I have a long text with standard ascii and, sometimes, cyrillic
>> > >text - and I want to filter it out (the cyrillic, of course).
>> > >I tried with
>>
>> > >perl -wne 'foreach($_){if (/(\p{InCyrillic})/){print"record $_ matches
>> > >>>>$1<<<\n"}else{print"ok\n"}}' test.htm
>>
>> > >but it fails to recognize it - what am I doing wrong?
>>
>> > >thanks for any help...
>>
>> > >alessandro
>>
>
>
>thanks everybody for your help,
>the following did the trick - ok it was written in the html header in
>clear letters that it was utf-8 - but until yesterday I never even
>knew what encoding could mean...
>
>dalessandro
>
> perl -MEncode -wne '$_ = decode_utf8($_); foreach($_){if (/(\p
>{InCyrillic})/){print"record $_ matches ===>>>$1<<<===\n\n\n"}else
>{print"ok\n"}}' test.htm
Many times "Unicode" is written in the header. Whats written in the
header is not necessarily what the encoding actually is.
When you posted the sample analysis of bomtest it detected 'ascii'
because there was *no* utf-8 BOM marker and the utf-8 encoding didn't
come until later in the sample.
Traditionally, decoding is done in the perlio layers if you are
reading in data from file based handles. Unless you are doing something
special, decode strings can be tricky.
Since I use the CheckUTF() and Get_BOM() functions in other code,
to transparently add Unicode layers (utf8, utf-8/16/32LE/BE only) if
needed, I changed the functions slightly to actually add utf8 and
add a sample size option of -1 that will sample the entire file to
guess function.
So, if you take your test.htm and run it through a cyrillic test,
this would be a typical usage of CheckUTF() and Get_BOM().
You have to replace these stubbed out functions from the new
new bomtest1.pl code below.
-sln
----------------
# Your Cyrillic test on test.htm
----------------
use strict;
use warnings;
binmode (STDOUT, ':utf8');
# Open the file in byte mode, let the encoding be guessed
my $fname = 'test.htm';
open my $fh, '<', $fname or die "can't open $fname for read $!";
my $offset = CheckUTF($fh, 'size'=> -1, 'verbose'=> 1, 'getbom_v'=> 1); # verbose
# my $offset = CheckUTF($fh, 'size'=> -1, 'verbose'=> 0, 'getbom'=> 1); # transparent
seek($fh, $offset, 0); # not necessary if just reading the file once
while (<$fh>) {
print '-'x10,"\nRAW record:\n$_\n";
my $clean = '';
while ( / (\p{InCyrillic}+) | ((?:(?!\p{InCyrillic}).)+) /xg ) {
if (defined $1) {
print "Cyrillic: $1\n";
} else {
print "OK: $2\n";
$clean .= $2;
}
}
print "\nCleaned:\n$clean\n";
}
close $fh;
exit(0);
### Subs ..
# Add these subs from the bomtest1.pl code below this ..
sub ordsplit {}
sub Get_BOM {}
sub CheckUTF {}
__END__
# Your test.htm and Cyrillic test output:
UTF Check, guess(332): utf8. Adding layer .. unix crlf utf8
Get_BOM() - Layers = unix crlf utf8
* Reading up to 4 characters
- char 3c , bytes = 1
- char 68 , bytes = 1
- char 74 , bytes = 1
- char 6d , bytes = 1
- Read 4 characters. Position = 4
- Sample (4) = 3c 68 74 6d
* Analysing bytes
BOMS avail = ( fffe0000 ) ( 0000feff ) ( efbbbf ) ( fffe ) ( feff )
- Bom bytes from sample (4) = 3c 68 74 6d
* BOM Not found
Get_BOM returned , , , 0
----------
RAW record:
<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /
><title>g - [34m~A[34m~B[34m~@-¦-+-+[34m~F-¦ 0</title></head><body><a hreex.html
" style="color:#000000">-¦-¦-¦[34m~@[34m~E</a><br><br><b>-í[34m~B[34m~@-¦-+-+[34
m~F-¦ 0</b><table><tbody><tr bgcolor="#000000" style="color:#ffffff"><td wid><b>
ID</b></td>
OK: <html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-
8" /><title>g - [34m~A[34m~B[34m~@
Cyrillic: -¦-+-+
OK: [34m~F
Cyrillic: -¦
OK: 0</title></head><body><a hreex.html" style="color:#000000">
Cyrillic: -¦-¦-¦
OK: [34m~@[34m~E</a><br><br><b>
Cyrillic: -í
OK: [34m~B[34m~@
Cyrillic: -¦-+-+
OK: [34m~F
Cyrillic: -¦
OK: 0</b><table><tbody><tr bgcolor="#000000" style="color:#ffffff"><td wid><b>I
D</b></td>
Cleaned:
<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /
><title>g - [34m~A[34m~B[34m~@[34m~F 0</title></head><body><a hreex.html" style=
"color:#000000">[34m~@[34m~E</a><br><br><b>[34m~B[34m~@[34m~F 0</b><table><tbody
><tr bgcolor="#000000" style="color:#ffffff"><td wid><b>ID</b></td>
-------
bomtest1.pl
-------
# File: bomtest1.pl
#
# CheckUTF() -
# UTF-8/16/32 check, Will add a guessed utf8 or utf-8/16/32 layer
# if there is no current encoding() layer.
# Opened filehandle should be in byte context.
# This is more for automatically adding a utf layer on
# unknown files opened in byte mode.
# Guesses utf-8/16/32 encodings, BOMb'd or not.
# Uses Encode::Guess module. Will use standard guess settings (see docs).
# Used standalone or optionally with Get_BOM().
#
# Get_BOM() -
# SEEK_SET workaround for BOMb'd Encoding(utf-) layer Bug
# and a general way to get the BOM and offset for subsequent
# seeks.
# The file can be opened using any encoding. The handle passed
# is duped, that handle set to ':raw'. No character semantics
# are used when analysing the BOM.
# This will also detect a mismatch in width between the
# detected BOM and the current opened encoding layer.
# Used standalone or optionally with CheckUTF().
# CheckUTF() has an option to get the bom on the callers
# behalf, combining steps.
#
# test_surrogate_pair() -
# Just a utility function to print hi/lo surrogate pairs
# given a unicode test character (> 0x10000).
#
#
# Modify the top variables to test various configurations.
# Includes:
# $fname File name for output and input
# $write_file Flag 0/1 to create a test file
# (0 to not write, just read file $fname)
# $check_size Size of sample to read for encoding guess in Check_UTF()
# (less than zero will read entire file, default -1)
# $data Data string for writing test file, can be unicode
# $enc Encoding to read/write file $fname
# $passes Number of passes to test BOM seek bug (default 1)
# $test_surr Stand-alone surrogate pair test
#
# -sln, 1/7/10
# --------------------------------------------------------
use strict;
use warnings;
binmode (STDOUT, ':utf8');
{
## PARAMETERS ------
my $fname = 'sample.htm.txt';
# my $fname = 'dummy.txt';
# my $fname = 'c:\temp\XML\CollectedData_1804.xml';
my $Unichar = "\x{21000}";
my $data = "\x{1}\x{2}$Unichar";
my $enc = ''; # blank or UTF-xxLE/BE or :bytes, :utf8, :raw, etc ...
my $enc_string = ($enc =~ /:/) ? "$enc" : (length($enc) ? "encoding($enc)" : '');
my $check_size = -1;
my $write_file = 0;
my $test_surr = 0;
my $passes = 1;
## END PARAMETERS ------
my ($fh, $line);
my ($width,$encfound,$bom,$offset);
# Test surrogate pairs ..
##
test_surrogate_pair($Unichar) if $test_surr;
# Create a test file with encoding ..
##
if ($write_file) {
print "\nWriting sample data as '$enc' to $fname\n",'-'x20,"\n";
open $fh, ">$enc_string", $fname or die "can't open $fname for write $!";
print $fh $data;
close $fh;
} else {
print "\nNOT WRITING to $fname !!\n",'-'x20,"\n";
}
# Read test file with encoding ..
##
print "\nReading $fname as '$enc'\n",'-'x20,"\n";
open $fh, "<$enc_string", $fname or die "can't open $fname for read $!";
$offset = CheckUTF($fh, 'size'=> $check_size, 'verbose'=> 1, 'getbom_v'=> 1);
print "\n";
for (1 .. $passes) {
print "Pass $_: ";
seek($fh, $offset, 0); # SEEK_SET
my $fpos = tell($fh);
my $ok = ($fpos == $offset) ? 'ok' : 'bad';
$line = <$fh>;
print "seek $offset/$fpos $ok, data (".length($line).") = ".ordsplit($line)."\n";
print "$line\n";
}
close $fh;
# Read test file with encoding into scalar, open handle to scalar as :utf8 ..
##
print "\nBuffering $fname as '$enc', open/read buffer as ':utf8'\n",'-'x20,"\n";
open $fh, "<$enc_string", $fname or die "can't open $fname for read $!";
$offset = CheckUTF($fh, 'size'=> $check_size, 'verbose'=> 1, 'getbom_v'=> 1);
seek($fh, $offset, 0); # SEEK_SET
my $membuf = <$fh>;
close $fh;
open $fh, "<:utf8", \$membuf or die "can't open \$membuf for read $!";
$offset = 0;
print "\n";
for (1 .. $passes) {
print "Pass $_: ";
seek($fh, $offset, 0); # SEEK_SET
my $fpos = tell($fh);
my $ok = ($fpos == $offset) ? 'ok' : 'bad';
$line = <$fh>;
print "seek $offset/$fpos $ok, data (".length($line).") = ".ordsplit($line)."\n";
}
close $fh;
# Read test file in byte mode ..
##
print "\nReading $fname as bytes\n",'-'x20,"\n";
open $fh, '<', $fname or die "can't open $fname for read $!";
$offset = 0;
for (1 .. $passes) {
print "Pass $_: ";
seek($fh, $offset, 0); # SEEK_SET
my $fpos = tell($fh);
my $ok = ($fpos == $offset) ? 'ok' : 'bad';
$line = <$fh>;
print "seek $offset/$fpos $ok, data (".length($line).") = ";
for (map {ord $_} split //, $line) {
printf ("%x ",$_);
}
print "\n";
}
close $fh;
}
exit 0;
##
## End of program ...
##
sub ordsplit
{
my $string = shift;
my $buf = '';
for (map {ord $_} split //, $string) {
$buf.= sprintf ("%x ",$_);
}
return $buf;
}
sub Get_BOM
{
my ($fh_orig, $verbose) = @_;
$verbose = 0 unless (defined $verbose and lc($verbose) eq 'v');
use Encode;
my %bom2enc = (
map { encode($_, "\x{feff}") => $_ } qw(
UTF-8
UTF-16BE
UTF-16LE
UTF-32BE
UTF-32LE
)
);
my %enc2bom = (
reverse(%bom2enc),
map { $_ => encode($_, "\x{feff}") } qw(
UCS-2
iso-10646-1
utf8
)
);
my @bombs = sort { length $b <=> length $a } keys %bom2enc;
my $MAX_BOM_LENGTH = length $bombs[0];
my $bomstr = join '|', @bombs;
my $bom_re = qr/^($bomstr)/o;
my @arlays = PerlIO::get_layers($fh_orig);
my $layers = "@arlays";
print "Get_BOM() - Layers = $layers\n" if $verbose;
if (defined($fh_orig) && (ref($fh_orig) eq 'GLOB' || ref(\$fh_orig) eq 'GLOB'))
{
# Dup the passed in handle, binmode it to raw
# We will ONLY do byte semantics !!
# --------------------------------------------------
open my $fh_dup, "<&", $fh_orig or die "can't dup filehandle: $!";
binmode ($fh_dup, ":raw");
if (seek($fh_dup, 0, 0)) # SEEK_SET
{
my $sample = '';
## Read in $MAX_BOM_LENGTH characters.
## As a check, store character/bytes read based
## on the differential file position per read.
## Since our dup handle is in byte mode, the bytes
## per character will always be 1.
## This is a hold over when there could be encoded
## characters and is not necessary.
## -----------------------------------------------
my $fpos = tell($fh_dup);
my @samparray = ();
print " * Reading up to $MAX_BOM_LENGTH characters\n" if $verbose;
for (1 .. $MAX_BOM_LENGTH)
{
my $count = read ($fh_dup, my $buff, 1);
last if (!defined $count or $count != 1);
push @samparray, {'char'=> $buff, 'width'=> (tell($fh_dup)-$fpos)};
$fpos = tell($fh_dup);
}
$sample = '';
for my $href (@samparray) {
$sample .= $href->{'char'};
printf (" - char %x , bytes = %d\n", ord($href->{'char'}), $href->{'width'}) if $verbose;
}
if ($verbose) {
print " - Read ".length($sample)." characters. Position = $fpos\n";
print " - Sample (".length($sample).") = ";
for (map {ord $_} split //, $sample) {
printf ("%02x ", $_);
}
print "\n * Analysing bytes\n";
}
## Convert the 'characters' to bytes.
## Only process $MAX_BOM_LENGTH bytes.
## We only care about the BOM, which will match
## the encoding bytes we already have in %bom2enc.
## -----------------------------------------------
my ($octets, $numb_octets, $vecoffset) = ('',0,0);
for my $href (@samparray)
{
my @ar = ();
my ($charwidth, $ordchar) = (
$href->{'width'},
ord( $href->{'char'})
);
last if (($numb_octets + $charwidth) > $MAX_BOM_LENGTH);
$numb_octets += $charwidth;
while ($ordchar > 0) {
push @ar, $ordchar & 0xff;
$ordchar >>= 8;
--$charwidth;
}
push (@ar,0) while ($charwidth-- > 0);
for (reverse @ar) {
vec ($octets, $vecoffset++, 8) = $_;
}
}
if ($verbose) {
print " BOMS avail = ";
for my $bomavail (@bombs) {
print '( ';
for (map {ord $_} split //, $bomavail) {
printf ("%02x",$_);
}
print ' ) ';
}
print "\n - Bom bytes from sample (".length($octets).") = ";
for (map {ord $_} split //, $octets) {
printf ("%02x ", $_);
}
print "\n";
}
# end convert
my ($bomenc, $bom, $bomwidth, $bomoffset) = ('','','',0);
if (my ($found) = $octets =~ /$bom_re/) {
($bomenc, $bomoffset) = (
$bom2enc{$found},
length($found)
);
for (map {ord $_} split //, $found) {
$bom .= sprintf ("%02x", $_);
}
print " * Found $bomenc, BOM = $bom\n" if ($verbose);
my $noendian = $bomenc;
$noendian =~ s/(?:LE|BE)$//i;
$bomwidth = ($layers =~ /$noendian/i) ? 'ok' : 'mismatch';
print " - Does not match width of current encoding layer\n" if ($verbose and $bomwidth ne 'ok');
} else {
print " * BOM Not found\n" if $verbose;
}
close ($fh_dup);
# Original handle -> Must SEEK_SET/read twice to
# mitigate UTF-16/32 BOM seek bug
# ----------------------------------------------
seek($fh_orig, 0, 0); # SEEK_SET
read ($fh_orig, $sample, 1); # read
seek($fh_orig, 0, 0); # SEEK_SET last. Caller to seek past BOM.
return ($bomenc, $bom, $bomwidth, $bomoffset);
}
close ($fh_dup);
# seek failed, fall through
}
# There was an error ..
if ($verbose) {
print " * Invalid filehandle or file is unseekable\n - Utf BOMs available (max length = $MAX_BOM_LENGTH):\n";
while (my ($key,$val) = each %enc2bom)
{
my $valstring = '';
for (split //, $val) {
$valstring .= sprintf ("%x ",ord $_);
}
print " $key = $bom2enc{$val} = $valstring\n";
#print " $key = $bom2enc{$val} = '$val' = '$enc2bom{$key}' = $valstring\n";
}
}
return ('','','',-1);
}
sub CheckUTF
{
# UTF-8/16/32 check
# Will add a layer if there is no current
# encoding() layer. So open filehandle should
# be in byte context.
# ----------------------------------------
# Result value:
# -1 - not filehandle (or undef'ed)
# 0 - read error or empty file
# 1 - already have encoding layer (could force change later)
# 2 - guess_encoder() error message
# 3 - do nothing unless utf-8/16/32(be/le)
# 4 - added a layer
my ($fh_check,@args) = @_;
my %parm = ('verbose'=>0, 'size'=>60, 'getbom' =>0, 'bomopts'=>'');
while (my ($name, $val) = splice (@args, 0, 2))
{
$name =~ s/^\s+|\s+$//;
$name = lc $name;
next if not defined $val;
if ( $name =~ /^getbom(?:_(\w*))?/ and $val) {
$parm{'gbom'} = 1;
$parm{'bomopts'} = $1 if defined($1);
}
elsif ($name eq 'verbose' || $name eq 'size' && ($val<0 || $val>4)) {
$parm{$name} = $val;
}
}
# 0 1 2 3 4 5 6
my ($Res, $Utfmsg, $layers, $bomenc, $bom, $bomwidth, $bomoffset) = (
0, 'UTF Check', '', '', '', '', 0);
$layers = ':'.join (':', PerlIO::get_layers($fh_check)).':';
if (!defined($fh_check) || (ref($fh_check) ne 'GLOB' && ref(\$fh_check) ne 'GLOB')) {
$Utfmsg .= ", not a filehandle";
print "$Utfmsg\n" if $parm{'verbose'};
$Res = -1;
}
elsif ($layers =~ /:encoding/) {
$Utfmsg .= ", already have encoding layer";
$Res = 1;
}
else {
my ($count, $sample, $buf);
my $utf8layer = $layers =~ /:utf8/;
binmode ($fh_check,":bytes");
seek ($fh_check, 0, 0); # SEEK_SET
# Set 'size' to file size if 'size' passed was less than zero
if ($parm{'size'} < 0) {
seek ($fh_check, 0, 2); # SEEK_END
$parm{'size'} = tell ($fh_check);
seek ($fh_check, 0, 0); # SEEK_SET
}
# Try to read a large sample
if (!defined($count = read ($fh_check,$sample,$parm{'size'},0))) {
$Utfmsg .= ". $!"; # read error
} elsif ($count <= 0) {
$Utfmsg .= ". File is empty";
} else {
seek ($fh_check, 0, 0); # SEEK_SET
use Encode::Guess;
my $decoder = guess_encoding ($sample); # ascii/utf8/BOMed UTF
$Res = 2;
if (ref($decoder)) {
my $name = $decoder->name;
$decoder = 'Do nothing';
$Res = 3;
$Utfmsg .= ", guess($count): $name";
if ($name =~ /UTF.*?(?:8|16|32)/i) {
# $name =~ s/(?:LE|BE)$//i;
$decoder = "Adding layer";
$Res = 4;
if ($name =~ /utf8/i && !$utf8layer) {
$utf8layer = 1;
} else {
binmode ($fh_check, ":encoding($name)");
}
}
}
$Utfmsg .= ". $decoder" if (defined $decoder); # guess error
}
binmode ($fh_check,":utf8") if $utf8layer;
}
$Utfmsg .= ' ..';
$layers = "@{[PerlIO::get_layers($fh_check)]}";
print "@{[$Utfmsg,$layers]}\n" if $parm{'verbose'};
$bomoffset = $Res if $Res == -1;
if ($Res != -1 && $parm{'gbom'}) {
($bomenc, $bom, $bomwidth, $bomoffset) = Get_BOM ($fh_check, $parm{'bomopts'});
$Utfmsg .= " bom($bomoffset) = $bom ..";
print "Get_BOM returned $bomenc, $bom, $bomwidth, $bomoffset\n" if $parm{'verbose'};
}
return ($Res, $Utfmsg, $layers, $bomenc, $bom, $bomwidth, $bomoffset);
}
sub test_surrogate_pair
{
my $test = shift;
print "\nTesting surrogate pairs\n",'-'x20,"\n";
if (ord($test) < 0x10000) {
print "nothing to test < 0x10000\n";
} else {
my $hi = (ord($test) - 0x10000) / 0x400 + 0xD800;
my $lo = (ord($test) - 0x10000) % 0x400 + 0xDC00;
my $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
printf "test uni = %x\n", ord($test);
printf "hi surrogate = %x\n", $hi;
printf "lo surrogate = %x\n", $lo;
printf "uni = %x\n", $uni;
}
}
__END__
------------------------------
Date: Thu, 07 Jan 2010 12:59:27 -0800
From: sln@netherlands.com
Subject: Re: unicode newbie, can you help?
Message-Id: <nfick55s122bhctujcd5h0vgaqpfvffebe@4ax.com>
On Thu, 07 Jan 2010 12:01:52 -0800, sln@netherlands.com wrote:
>On Thu, 7 Jan 2010 00:56:37 -0800 (PST), "alexxx.magni@gmail.com" <alexxx.magni@gmail.com> wrote:
>
>>On 7 Gen, 09:51, "alexxx.ma...@gmail.com" <alexxx.ma...@gmail.com>
>>wrote:
>>> On 5 Gen, 19:45, s...@netherlands.com wrote:
[snip]
>
>sub CheckUTF
>{
> # UTF-8/16/32 check
[snip]
>
> if (ref($decoder)) {
> my $name = $decoder->name;
> $decoder = 'Do nothing';
> $Res = 3;
> $Utfmsg .= ", guess($count): $name";
> if ($name =~ /UTF.*?(?:8|16|32)/i) {
> # $name =~ s/(?:LE|BE)$//i;
> $decoder = "Adding layer";
> $Res = 4;
#start change
if ($name =~ /utf8/i) {
if ($utf8layer) {
$decoder = 'Already have utf8 layer';
$Res = 1;
}
$utf8layer = 1;
} else {
binmode ($fh_check, ":encoding($name)");
}
#end change
> }
> }
> $Utfmsg .= ". $decoder" if (defined $decoder); # guess error
[snip]
>}
------------------------------
Date: Thu, 07 Jan 2010 14:58:58 -0800
From: sln@netherlands.com
Subject: Re: unicode newbie, can you help?
Message-Id: <mtock55ugu4ef5slplun9ohj2n3gntvfrk@4ax.com>
On Thu, 07 Jan 2010 12:59:27 -0800, sln@netherlands.com wrote:
>On Thu, 07 Jan 2010 12:01:52 -0800, sln@netherlands.com wrote:
>
>>On Thu, 7 Jan 2010 00:56:37 -0800 (PST), "alexxx.magni@gmail.com" <alexxx.magni@gmail.com> wrote:
>>
>>>On 7 Gen, 09:51, "alexxx.ma...@gmail.com" <alexxx.ma...@gmail.com>
>>>wrote:
>>>> On 5 Gen, 19:45, s...@netherlands.com wrote:
>
But, decoded utf8 can increase regular expression processing
time by up to several magnitudes depending on the regex.
If the overwhelming majority of the sample is ascii, and you
don't care about the rest, or just want to filter ascii, it might
be better to process as bytes.
For this reason, you can have this function default to 'Do nothing' (default)
when it finds utf8, or force it to add the utf8 layer
if it finds it. Like this:
$offset = CheckUTF($fh, 'size'=> $check_size, 'verbose'=> 1, 'getbom_v'=> 1, 'utf8ok' => 1);
>>sub CheckUTF
>>{
>> # UTF-8/16/32 check
>[snip]
my $utf_types = '(?:16|32)';
while (my ($name, $val) = splice (@args, 0, 2))
{
[snip]
elsif ($name eq 'utf8ok' and $val) {
$utf_types = '(?:8|16|32)';
}
}
[snip]
>>
>> if (ref($decoder)) {
>> my $name = $decoder->name;
>> $decoder = 'Do nothing';
>> $Res = 3;
>> $Utfmsg .= ", guess($count): $name";
if ($name =~ /UTF.*?$utf_types/i) {
>> # $name =~ s/(?:LE|BE)$//i;
>> $decoder = "Adding layer";
>> $Res = 4;
>#start change
> if ($name =~ /utf8/i) {
> if ($utf8layer) {
> $decoder = 'Already have utf8 layer';
> $Res = 1;
> }
> $utf8layer = 1;
> } else {
> binmode ($fh_check, ":encoding($name)");
> }
>#end change
>> }
>> }
>> $Utfmsg .= ". $decoder" if (defined $decoder); # guess error
>[snip]
>>}
>
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V11 Issue 2759
***************************************