[31500] in Perl-Users-Digest


home	help	back	first	fref	pref	prev	next	nref	lref	last	post
Perl-Users Digest, Issue: 2759 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Jan 7 21:14:20 2010

Date: Thu, 7 Jan 2010 18:14:11 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Thu, 7 Jan 2010     Volume: 11 Number: 2759

Today's topics:
    Re: unicode newbie, can you help? sln@netherlands.com
    Re: unicode newbie, can you help? sln@netherlands.com
    Re: unicode newbie, can you help? sln@netherlands.com
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Thu, 07 Jan 2010 12:01:52 -0800
From: sln@netherlands.com
Subject: Re: unicode newbie, can you help?
Message-Id: <64dck5t4iubnfl8ntpigkti7jfnurl3taa@4ax.com>

On Thu, 7 Jan 2010 00:56:37 -0800 (PST), "alexxx.magni@gmail.com" <alexxx.magni@gmail.com> wrote:

>On 7 Gen, 09:51, "alexxx.ma...@gmail.com" <alexxx.ma...@gmail.com>
>wrote:
>> On 5 Gen, 19:45, s...@netherlands.com wrote:
>>
>>
>>
>> > On Tue, 5 Jan 2010 06:40:24 -0800 (PST), "alexxx.ma...@gmail.com" <alexxx.ma...@gmail.com> wrote:
>> > >sorry if it's a naive question, but I never needed up to now to deal
>> > >with anything beyond a-zA-Z0-9....
>>
>> > >now I have a long text with standard ascii and, sometimes, cyrillic
>> > >text - and I want to filter it out (the cyrillic, of course).
>> > >I tried with
>>
>> > >perl -wne 'foreach($_){if (/(\p{InCyrillic})/){print"record $_ matches
>> > >>>>$1<<<\n"}else{print"ok\n"}}' test.htm
>>
>> > >but it fails to recognize it - what am I doing wrong?
>>
>> > >thanks for any help...
>>
>> > >alessandro
>>
>
>
>thanks everybody for your help,
>the following did the trick - ok it was written in the html header in
>clear letters that it was utf-8 - but until yesterday I never even
>knew what encoding could mean...
>
>dalessandro
>
> perl -MEncode -wne '$_ = decode_utf8($_); foreach($_){if (/(\p
>{InCyrillic})/){print"record $_ matches ===>>>$1<<<===\n\n\n"}else
>{print"ok\n"}}' test.htm


Many times "Unicode" is written in the header. Whats written in the
header is not necessarily what the encoding actually is.

When you posted the sample analysis of bomtest it detected 'ascii'
because there was *no* utf-8 BOM marker and the utf-8 encoding didn't
come until later in the sample.

Traditionally, decoding is done in the perlio layers if you are
reading in data from file based handles. Unless you are doing something
special, decode strings can be tricky.

Since I use the CheckUTF() and Get_BOM() functions in other code,
to transparently add Unicode layers (utf8, utf-8/16/32LE/BE only) if
needed, I changed the functions slightly to actually add utf8 and
add a sample size option of -1 that will sample the entire file to
guess function.

So, if you take your test.htm and run it through a cyrillic test,
this would be a typical usage of CheckUTF() and Get_BOM().
You have to replace these stubbed out functions from the new
new bomtest1.pl code below.

-sln

----------------
# Your Cyrillic test on test.htm
----------------
use strict;
use warnings;

binmode (STDOUT, ':utf8');

# Open the file in byte mode, let the encoding be guessed
my $fname = 'test.htm';
open my $fh, '<', $fname or die "can't open $fname for read $!";

my $offset = CheckUTF($fh, 'size'=> -1, 'verbose'=> 1, 'getbom_v'=> 1);   # verbose
# my $offset = CheckUTF($fh, 'size'=> -1, 'verbose'=> 0, 'getbom'=> 1);   # transparent

seek($fh, $offset, 0);  # not necessary if just reading the file once

while (<$fh>) {
	print '-'x10,"\nRAW record:\n$_\n";
	my $clean = '';
	while ( / (\p{InCyrillic}+) | ((?:(?!\p{InCyrillic}).)+) /xg ) {
		if (defined $1) {
			print "Cyrillic: $1\n";
		} else {
			print "OK: $2\n";
			$clean .= $2;
		}
	}
	print "\nCleaned:\n$clean\n";
}
close $fh;
exit(0);

### Subs ..

# Add these subs from the bomtest1.pl code below this ..
sub ordsplit {}
sub Get_BOM {}
sub CheckUTF {}

__END__
  
# Your test.htm and Cyrillic test output:
UTF Check, guess(332): utf8. Adding layer .. unix crlf utf8
Get_BOM() - Layers = unix crlf utf8
 * Reading up to 4 characters
     - char 3c , bytes = 1
     - char 68 , bytes = 1
     - char 74 , bytes = 1
     - char 6d , bytes = 1
     - Read 4 characters. Position = 4
     - Sample (4) = 3c  68  74  6d
 * Analysing bytes
     BOMS avail = ( fffe0000 ) ( 0000feff ) ( efbbbf ) ( fffe ) ( feff )
     - Bom bytes from sample (4) = 3c  68  74  6d
 * BOM Not found
Get_BOM returned , , , 0
----------
RAW record:
<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /
><title>g - [34m~A[34m~B[34m~@-Ś-+-+[34m~F-Ś 0</title></head><body><a hreex.html
" style="color:#000000">-Ś-Ś-Ś[34m~@[34m~E</a><br><br><b>-í[34m~B[34m~@-Ś-+-+[34
m~F-Ś 0</b><table><tbody><tr bgcolor="#000000" style="color:#ffffff"><td wid><b>
ID</b></td>

OK: <html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-
8" /><title>g - [34m~A[34m~B[34m~@
Cyrillic: -Ś-+-+
OK: [34m~F
Cyrillic: -Ś
OK:  0</title></head><body><a hreex.html" style="color:#000000">
Cyrillic: -Ś-Ś-Ś
OK: [34m~@[34m~E</a><br><br><b>
Cyrillic: -í
OK: [34m~B[34m~@
Cyrillic: -Ś-+-+
OK: [34m~F
Cyrillic: -Ś
OK:  0</b><table><tbody><tr bgcolor="#000000" style="color:#ffffff"><td wid><b>I
D</b></td>

Cleaned:
<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /
><title>g - [34m~A[34m~B[34m~@[34m~F 0</title></head><body><a hreex.html" style=
"color:#000000">[34m~@[34m~E</a><br><br><b>[34m~B[34m~@[34m~F 0</b><table><tbody
><tr bgcolor="#000000" style="color:#ffffff"><td wid><b>ID</b></td>

-------
bomtest1.pl
-------
# File: bomtest1.pl
#
# CheckUTF() - 
#   UTF-8/16/32 check, Will add a guessed utf8 or utf-8/16/32 layer
#   if there is no current encoding() layer.
#   Opened filehandle should be in byte context.
#   This is more for automatically adding a utf layer on
#   unknown files opened in byte mode.
#   Guesses utf-8/16/32 encodings, BOMb'd or not.
#   Uses Encode::Guess module. Will use standard guess settings (see docs).
#   Used standalone or optionally with Get_BOM().
#
# Get_BOM() -
#    SEEK_SET workaround for BOMb'd Encoding(utf-) layer Bug
#    and a general way to get the BOM and offset for subsequent
#    seeks.
#    The file can be opened using any encoding. The handle passed
#    is duped, that handle set to ':raw'. No character semantics
#    are used when analysing the BOM.
#    This will also detect a mismatch in width between the
#    detected BOM and the current opened encoding layer.
#    Used standalone or optionally with CheckUTF().
#    CheckUTF() has an option to get the bom on the callers
#    behalf, combining steps.
#
# test_surrogate_pair() -
#    Just a utility function to print hi/lo surrogate pairs
#    given a unicode test character (> 0x10000).
#
#
# Modify the top variables to test various configurations.
# Includes:
#    $fname            File name for output and input
#    $write_file       Flag 0/1 to create a test file
#                      (0 to not write, just read file $fname)
#    $check_size       Size of sample to read for encoding guess in Check_UTF()
#                      (less than zero will read entire file, default -1)
#    $data             Data string for writing test file, can be unicode
#    $enc              Encoding to read/write file $fname
#    $passes           Number of passes to test BOM seek bug (default 1)
#    $test_surr        Stand-alone surrogate pair test
#
#  -sln, 1/7/10
# --------------------------------------------------------

use strict;
use warnings;

binmode (STDOUT, ':utf8');

{
## PARAMETERS ------
   my $fname = 'sample.htm.txt';
#  my $fname = 'dummy.txt';
#  my $fname = 'c:\temp\XML\CollectedData_1804.xml';

  my $Unichar = "\x{21000}";

  my $data       = "\x{1}\x{2}$Unichar";
  my $enc        = ''; # blank or UTF-xxLE/BE or :bytes, :utf8, :raw, etc ...
  my $enc_string = ($enc =~ /:/) ? "$enc" : (length($enc) ? "encoding($enc)" : '');
  my $check_size = -1;
  my $write_file = 0;
  my $test_surr  = 0;
  my $passes     = 1;
## END PARAMETERS ------

  my ($fh, $line);
  my ($width,$encfound,$bom,$offset);
 
# Test surrogate pairs ..
##
  test_surrogate_pair($Unichar) if $test_surr;

# Create a test file with encoding ..
##
  if ($write_file) {
	print "\nWriting sample data as '$enc' to $fname\n",'-'x20,"\n";
	open $fh, ">$enc_string", $fname or die "can't open $fname for write $!";

	print $fh $data;

	close $fh;
  } else {
	print "\nNOT WRITING to $fname !!\n",'-'x20,"\n";
  }

# Read test file with encoding ..
##
  print "\nReading $fname as '$enc'\n",'-'x20,"\n";
  open $fh, "<$enc_string", $fname or die "can't open $fname for read $!";

  $offset = CheckUTF($fh, 'size'=> $check_size, 'verbose'=> 1, 'getbom_v'=> 1);
  print "\n";

  for (1 .. $passes) {
	print "Pass $_:  ";
	seek($fh, $offset, 0);   # SEEK_SET
	my $fpos = tell($fh);
	my $ok = ($fpos == $offset) ? 'ok' : 'bad';
	$line = <$fh>;
	print "seek $offset/$fpos $ok, data (".length($line).") = ".ordsplit($line)."\n";
	print "$line\n";
  }
  close $fh;

# Read test file with encoding into scalar, open handle to scalar as :utf8 ..
##
  print "\nBuffering $fname as '$enc', open/read buffer as ':utf8'\n",'-'x20,"\n";
  open $fh, "<$enc_string", $fname or die "can't open $fname for read $!";

  $offset = CheckUTF($fh, 'size'=> $check_size, 'verbose'=> 1, 'getbom_v'=> 1);
  seek($fh, $offset, 0);   # SEEK_SET

  my $membuf = <$fh>;
  close $fh;

  open $fh, "<:utf8", \$membuf or die "can't open \$membuf for read $!";

  $offset = 0;
  print "\n";

  for (1 .. $passes) {
	print "Pass $_:  ";
	seek($fh, $offset, 0);   # SEEK_SET
	my $fpos = tell($fh);
	my $ok = ($fpos == $offset) ? 'ok' : 'bad';
	$line = <$fh>;
	print "seek $offset/$fpos $ok, data (".length($line).") = ".ordsplit($line)."\n";
  }
  close $fh;

# Read test file in byte mode ..
##
  print "\nReading $fname as bytes\n",'-'x20,"\n";
  open $fh, '<', $fname or die "can't open $fname for read $!";

  $offset = 0;

  for (1 .. $passes) {
	print "Pass $_:  ";
	seek($fh, $offset, 0);   # SEEK_SET
	my $fpos = tell($fh);
	my $ok = ($fpos == $offset) ? 'ok' : 'bad';
	$line = <$fh>;
	print "seek $offset/$fpos $ok, data (".length($line).") =   ";
	for (map {ord $_} split //, $line) {
		printf ("%x  ",$_);
	}
	print "\n";
  }
  close $fh;

}
exit 0;

##
## End of program  ...
##

sub ordsplit
{ 
	my $string = shift;
	my $buf = '';
	for (map {ord $_} split //, $string) {
		$buf.= sprintf ("%x  ",$_);
	}
	return $buf;
}

sub Get_BOM
{
	my ($fh_orig, $verbose) = @_;
	$verbose = 0 unless (defined $verbose and lc($verbose) eq 'v');
	use Encode;

	my %bom2enc = (
	   map { encode($_, "\x{feff}") => $_ } qw(
	    UTF-8
	    UTF-16BE
	    UTF-16LE
	    UTF-32BE
	    UTF-32LE
	   )
	);
	my %enc2bom = (
	   reverse(%bom2enc),
	   map { $_ => encode($_, "\x{feff}") } qw(
	    UCS-2
	    iso-10646-1
	    utf8
	   )
	);
	my @bombs = sort { length $b <=> length $a } keys %bom2enc;
	my $MAX_BOM_LENGTH = length $bombs[0];
	my $bomstr = join '|', @bombs;
	my $bom_re = qr/^($bomstr)/o;
	my @arlays = PerlIO::get_layers($fh_orig);
	my $layers = "@arlays";

	print "Get_BOM() - Layers = $layers\n" if $verbose;

	if (defined($fh_orig) && (ref($fh_orig) eq 'GLOB' || ref(\$fh_orig) eq 'GLOB'))
	{
		# Dup the passed in handle, binmode it to raw
		# We will ONLY do byte semantics !!
		# --------------------------------------------------
		open my $fh_dup, "<&", $fh_orig or die "can't dup filehandle: $!";
		binmode ($fh_dup, ":raw");

		if (seek($fh_dup, 0, 0))     # SEEK_SET
		{
			my $sample = '';

			## Read in $MAX_BOM_LENGTH characters.
			## As a check, store character/bytes read based
			## on the differential file position per read.
			## Since our dup handle is in byte mode, the bytes
			## per character will always be 1.
			## This is a hold over when there could be encoded
			## characters and is not necessary.
			## -----------------------------------------------
			my $fpos = tell($fh_dup);
			my @samparray = ();
			print " * Reading up to $MAX_BOM_LENGTH characters\n" if $verbose;

			for (1 .. $MAX_BOM_LENGTH)
			{
				my $count = read ($fh_dup, my $buff, 1);
				last if (!defined $count or $count != 1);
				push @samparray, {'char'=> $buff, 'width'=> (tell($fh_dup)-$fpos)};
				$fpos = tell($fh_dup);
			}
			$sample = '';
			for my $href (@samparray) {
				$sample .= $href->{'char'};
				printf ("     - char %x , bytes = %d\n", ord($href->{'char'}), $href->{'width'}) if $verbose;
			}
			if ($verbose) {
				print "     - Read ".length($sample)." characters. Position = $fpos\n";
				print "     - Sample (".length($sample).") = ";
				for (map {ord $_} split //, $sample) {
					printf ("%02x  ", $_);
				}
				print "\n * Analysing bytes\n";
			}
			## Convert the 'characters' to bytes.
			## Only process $MAX_BOM_LENGTH bytes.
			## We only care about the BOM, which will match
			## the encoding bytes we already have in %bom2enc.
			## -----------------------------------------------
			my ($octets, $numb_octets, $vecoffset) = ('',0,0);

			for my $href (@samparray)
			{
				my @ar = ();
				my ($charwidth, $ordchar) = (
					$href->{'width'},
					ord( $href->{'char'})
				);
				last if (($numb_octets + $charwidth) > $MAX_BOM_LENGTH);
				$numb_octets += $charwidth;
				
				while ($ordchar > 0) {
					push @ar, $ordchar & 0xff;
					$ordchar >>= 8;
					--$charwidth;
				}
				push (@ar,0) while ($charwidth-- > 0);
				for (reverse @ar) {
					vec ($octets, $vecoffset++, 8) = $_;
				}
			}
			if ($verbose) {
				print "     BOMS avail = ";
				for my $bomavail (@bombs) {
					print '( ';
					for (map {ord $_} split //, $bomavail) {
						printf ("%02x",$_);
					}
					print ' ) ';
				}
				print "\n     - Bom bytes from sample (".length($octets).") = ";
				for (map {ord $_} split //, $octets) {
					printf ("%02x  ", $_);
				}
				print "\n";
			}
			# end convert

			my ($bomenc, $bom, $bomwidth, $bomoffset) = ('','','',0);
			if (my ($found) = $octets =~ /$bom_re/) {

				($bomenc, $bomoffset) = (
				    $bom2enc{$found},
				    length($found)
				);
				for (map {ord $_} split //, $found) {
					$bom .= sprintf ("%02x", $_);
				}
				print " * Found $bomenc, BOM = $bom\n" if ($verbose);

				my $noendian = $bomenc;
				$noendian =~ s/(?:LE|BE)$//i;
				$bomwidth = ($layers =~ /$noendian/i) ? 'ok' : 'mismatch';
				print "     - Does not match width of current encoding layer\n" if ($verbose and $bomwidth ne 'ok');
			} else {
				print " * BOM Not found\n" if $verbose;
			}

			close ($fh_dup);

			# Original handle -> Must SEEK_SET/read twice to
			# mitigate UTF-16/32 BOM seek bug 
			# ----------------------------------------------
			seek($fh_orig, 0, 0);        # SEEK_SET
			read ($fh_orig, $sample, 1); # read
			seek($fh_orig, 0, 0);        # SEEK_SET last. Caller to seek past BOM.

			return ($bomenc, $bom, $bomwidth, $bomoffset);
		}
		close ($fh_dup);

		# seek failed, fall through
	}
	# There was an error ..
	if ($verbose) {
		print " * Invalid filehandle or file is unseekable\n    - Utf BOMs available (max length = $MAX_BOM_LENGTH):\n";
		while (my ($key,$val) = each %enc2bom)
		{
			my $valstring = '';
			for (split //, $val) {
				$valstring .= sprintf ("%x ",ord $_);
			}
			print  "       $key = $bom2enc{$val} = $valstring\n";
			#print "       $key = $bom2enc{$val} = '$val' = '$enc2bom{$key}' = $valstring\n";
		}
	}
	return ('','','',-1);
}

sub CheckUTF
{
	# UTF-8/16/32 check
	# Will add a layer if there is no current
	# encoding() layer. So open filehandle should
	# be in byte context.
	# ----------------------------------------
	# Result value:
	# -1 - not filehandle (or undef'ed)
	# 0  - read error or empty file
	# 1  - already have encoding layer (could force change later)
	# 2  - guess_encoder() error message
	# 3  - do nothing unless utf-8/16/32(be/le)
	# 4  - added a layer

	my ($fh_check,@args) = @_;
	my %parm = ('verbose'=>0, 'size'=>60, 'getbom' =>0, 'bomopts'=>'');
	while (my ($name, $val) = splice (@args, 0, 2))
	{
		$name =~ s/^\s+|\s+$//;
		$name = lc $name;
		next if not defined $val;

		if ( $name =~ /^getbom(?:_(\w*))?/ and $val) {
			$parm{'gbom'} = 1;
			$parm{'bomopts'} = $1 if defined($1);
		}
		elsif ($name eq 'verbose' || $name eq 'size' && ($val<0 || $val>4)) {
			$parm{$name} = $val;
		}
	}
	#    0       1         2       3       4       5          6
	my ($Res, $Utfmsg, $layers, $bomenc, $bom, $bomwidth, $bomoffset) = (
	       0, 'UTF Check',  '',      '',   '',        '',       0);
	$layers = ':'.join (':', PerlIO::get_layers($fh_check)).':';

	if (!defined($fh_check) || (ref($fh_check) ne 'GLOB' && ref(\$fh_check) ne 'GLOB')) {
		$Utfmsg .= ", not a filehandle";
		print "$Utfmsg\n" if $parm{'verbose'};
		$Res = -1;
	}
	elsif ($layers =~ /:encoding/) {
		$Utfmsg .= ", already have encoding layer";
		$Res = 1;
	}
	else {
		my ($count, $sample, $buf);
		my $utf8layer = $layers =~ /:utf8/;

		binmode ($fh_check,":bytes");
		seek ($fh_check, 0, 0);         # SEEK_SET

		# Set 'size' to file size if 'size' passed was less than zero

		if ($parm{'size'} < 0) {
			seek ($fh_check, 0, 2);         # SEEK_END
			$parm{'size'} = tell ($fh_check);
			seek ($fh_check, 0, 0);         # SEEK_SET
		}

		# Try to read a large sample

		if (!defined($count = read ($fh_check,$sample,$parm{'size'},0))) {
			$Utfmsg .= ". $!"; # read error
		} elsif ($count <= 0) {
			$Utfmsg .= ". File is empty";
		} else {
			seek ($fh_check, 0, 0);   # SEEK_SET
			use Encode::Guess;

			my $decoder = guess_encoding ($sample); # ascii/utf8/BOMed UTF
			$Res = 2;

			if (ref($decoder)) {
				my $name = $decoder->name;
				$decoder = 'Do nothing';
				$Res = 3;
				$Utfmsg .= ", guess($count): $name";
				if ($name =~ /UTF.*?(?:8|16|32)/i) {
					# $name =~ s/(?:LE|BE)$//i;
					$decoder = "Adding layer";
					$Res = 4;
					if ($name =~ /utf8/i && !$utf8layer) {
						$utf8layer = 1;
					} else {
						binmode ($fh_check, ":encoding($name)");
					}
				}
			}
			$Utfmsg .= ". $decoder" if (defined $decoder); # guess error
		}
		binmode ($fh_check,":utf8") if $utf8layer;
	}
	$Utfmsg .= ' ..';
	$layers = "@{[PerlIO::get_layers($fh_check)]}";

	print "@{[$Utfmsg,$layers]}\n" if $parm{'verbose'};

	$bomoffset = $Res if $Res == -1;
	if ($Res != -1 && $parm{'gbom'}) {
		($bomenc, $bom, $bomwidth, $bomoffset) = Get_BOM ($fh_check, $parm{'bomopts'});
		$Utfmsg .= " bom($bomoffset) = $bom ..";
		print "Get_BOM returned $bomenc, $bom, $bomwidth, $bomoffset\n" if $parm{'verbose'};
	}
	return ($Res, $Utfmsg, $layers, $bomenc, $bom, $bomwidth, $bomoffset);
}

sub test_surrogate_pair
{
	my $test = shift;
	print "\nTesting surrogate pairs\n",'-'x20,"\n";
	if (ord($test) < 0x10000) {
		print "nothing to test < 0x10000\n";
	} else {
		my $hi = (ord($test) - 0x10000) / 0x400 + 0xD800;
		my $lo = (ord($test) - 0x10000) % 0x400 + 0xDC00;
		my $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
		printf "test uni      = %x\n", ord($test);
		printf "hi surrogate  = %x\n", $hi;
		printf "lo surrogate  = %x\n", $lo;
		printf "uni           = %x\n", $uni;
	}
}
__END__




------------------------------

Date: Thu, 07 Jan 2010 12:59:27 -0800
From: sln@netherlands.com
Subject: Re: unicode newbie, can you help?
Message-Id: <nfick55s122bhctujcd5h0vgaqpfvffebe@4ax.com>

On Thu, 07 Jan 2010 12:01:52 -0800, sln@netherlands.com wrote:

>On Thu, 7 Jan 2010 00:56:37 -0800 (PST), "alexxx.magni@gmail.com" <alexxx.magni@gmail.com> wrote:
>
>>On 7 Gen, 09:51, "alexxx.ma...@gmail.com" <alexxx.ma...@gmail.com>
>>wrote:
>>> On 5 Gen, 19:45, s...@netherlands.com wrote:

[snip]
>
>sub CheckUTF
>{
>	# UTF-8/16/32 check
[snip]
>
>			if (ref($decoder)) {
>				my $name = $decoder->name;
>				$decoder = 'Do nothing';
>				$Res = 3;
>				$Utfmsg .= ", guess($count): $name";
>				if ($name =~ /UTF.*?(?:8|16|32)/i) {
>					# $name =~ s/(?:LE|BE)$//i;
>					$decoder = "Adding layer";
>					$Res = 4;
#start change
					if ($name =~ /utf8/i) {
						if ($utf8layer) {
							$decoder = 'Already have utf8 layer';
							$Res = 1;
						}
						$utf8layer = 1;
					} else {
						binmode ($fh_check, ":encoding($name)");
					}
#end change
>				}
>			}
>			$Utfmsg .= ". $decoder" if (defined $decoder); # guess error
[snip]
>}




------------------------------

Date: Thu, 07 Jan 2010 14:58:58 -0800
From: sln@netherlands.com
Subject: Re: unicode newbie, can you help?
Message-Id: <mtock55ugu4ef5slplun9ohj2n3gntvfrk@4ax.com>

On Thu, 07 Jan 2010 12:59:27 -0800, sln@netherlands.com wrote:

>On Thu, 07 Jan 2010 12:01:52 -0800, sln@netherlands.com wrote:
>
>>On Thu, 7 Jan 2010 00:56:37 -0800 (PST), "alexxx.magni@gmail.com" <alexxx.magni@gmail.com> wrote:
>>
>>>On 7 Gen, 09:51, "alexxx.ma...@gmail.com" <alexxx.ma...@gmail.com>
>>>wrote:
>>>> On 5 Gen, 19:45, s...@netherlands.com wrote:
>

But, decoded utf8 can increase regular expression processing
time by up to several magnitudes depending on the regex.
If the overwhelming majority of the sample is ascii, and you
don't care about the rest, or just want to filter ascii, it might
be better to process as bytes.

For this reason, you can have this function default to 'Do nothing' (default)
when it finds utf8, or force it to add the utf8 layer
if it finds it. Like this:

$offset = CheckUTF($fh, 'size'=> $check_size, 'verbose'=> 1, 'getbom_v'=> 1, 'utf8ok' => 1);

>>sub CheckUTF
>>{
>>	# UTF-8/16/32 check
>[snip]
	my $utf_types = '(?:16|32)';

	while (my ($name, $val) = splice (@args, 0, 2))
	{
[snip]
		elsif ($name eq 'utf8ok' and $val) {
			$utf_types = '(?:8|16|32)';
		}
	}
[snip]
>>
>>			if (ref($decoder)) {
>>				my $name = $decoder->name;
>>				$decoder = 'Do nothing';
>>				$Res = 3;
>>				$Utfmsg .= ", guess($count): $name";
				if ($name =~ /UTF.*?$utf_types/i) {
>>					# $name =~ s/(?:LE|BE)$//i;
>>					$decoder = "Adding layer";
>>					$Res = 4;
>#start change
>					if ($name =~ /utf8/i) {
>						if ($utf8layer) {
>							$decoder = 'Already have utf8 layer';
>							$Res = 1;
>						}
>						$utf8layer = 1;
>					} else {
>						binmode ($fh_check, ":encoding($name)");
>					}
>#end change
>>				}
>>			}
>>			$Utfmsg .= ". $decoder" if (defined $decoder); # guess error
>[snip]
>>}
>



------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 2759
***************************************

home	help	back	first	fref	pref	prev	next	nref	lref	last	post
[31500] in Perl-Users-Digest

Perl-Users Digest, Issue: 2759 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)Thu Jan 7 21:14:20 2010

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Jan 7 21:14:20 2010