[32543] in Perl-Users-Digest


Perl-Users Digest, Issue: 3808 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sat Nov 3 09:09:26 2012

Date: Sat, 3 Nov 2012 06:09:10 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Sat, 3 Nov 2012     Volume: 11 Number: 3808

Today's topics:
    Re: array <nospam@nspam.invalid>
    Re: array (Seymour J.)
    Re: array <rvtol+usenet@xs4all.nl>
    Re: array <hjp-usenet2@hjp.at>
        Clear the "Wide character in print" warning and leave t jidanni@jidanni.org
    Re: Clear the "Wide character in print" warning and lea <hjp-usenet2@hjp.at>
        Copy array into input read process? <tuxedo@mailinator.com>
    Re: Copy array into input read process? <glex_no-spam@qwest-spam-no.invalid>
    Re: Copy array into input read process? <tuxedo@mailinator.com>
    Re: lerning perl <nospam@nspam.invalid>
    Re: Mime::Lite module generating an error <rweikusat@mssgmbh.com>
        Regex question, limit repeats UNLESS within specified t <jwcarlton@gmail.com>
    Re: Regex question, limit repeats UNLESS within specifi <justin.1211@purestblue.com>
    Re: Regex question, limit repeats UNLESS within specifi <jwcarlton@gmail.com>
    Re: Regex question, limit repeats UNLESS within specifi <*@eli.users.panix.com>
    Re: Regex question, limit repeats UNLESS within specifi <hjp-usenet2@hjp.at>
        Trampoline sub (Tim McDaniel)
    Re: Trampoline sub <rweikusat@mssgmbh.com>
    Re: Trampoline sub <derykus@gmail.com>
    Re: Why "Wide character in print"? <whynot@pozharski.name>
    Re: Why "Wide character in print"? <hjp-usenet2@hjp.at>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Thu, 1 Nov 2012 16:04:42 -0500
From: "Bill Cunningham" <nospam@nspam.invalid>
Subject: Re: array
Message-Id: <k6uo59$76f$1@dont-email.me>

Keith Thompson wrote:
> Perl does use [] for array accesses.
>
> Don't base your attempt to learn Perl on your knowledge of C.
> It's a very different language (with some similar syntax) with a
> very different memory model.

    Higher level too. Much easier to use. Here I don't think arrays have to 
be iterated over like with C. I can't even remember right off how to do 
that. Checking out perl is going to be an experience.

Bill




------------------------------

Date: Fri, 02 Nov 2012 08:27:57 -0400
From: Shmuel (Seymour J.) Metz <spamtrap@library.lspace.org.invalid>
Subject: Re: array
Message-Id: <5093bc4d$8$fuzhry+tra$mr2ice@news.patriot.net>

In <k6uo59$76f$1@dont-email.me>, on 11/01/2012
   at 04:04 PM, "Bill Cunningham" <nospam@nspam.invalid> said:

>Higher level too. Much easier to use. Here I don't think arrays 
>have to be iterated over like with C. I can't even remember right 
>off how to do that.

There's a for statement with C-like syntax, but I don't believe that
I've ever used it; I find foreach more convenient. If you're iterating
over steps rather than over elements of an array, you can still use
foreach, in conjunction with a range operator:

 foreach (1..10)

-- 
Shmuel (Seymour J.) Metz, SysProg and JOAT  <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action.  I reserve the
right to publicly post or ridicule any abusive E-mail.  Reply to
domain Patriot dot net user shmuel+news to contact me.  Do not
reply to spamtrap@library.lspace.org



------------------------------

Date: Sat, 03 Nov 2012 12:23:35 +0100
From: "Dr.Ruud" <rvtol+usenet@xs4all.nl>
Subject: Re: array
Message-Id: <5094feb7$0$6916$e4fe514c@news2.news.xs4all.nl>

On 2012-11-02 13:27, Shmuel (Seymour J.) Metz wrote:
> Bill:

>> Higher level too. Much easier to use. Here I don't think arrays
>> have to be iterated over like with C. I can't even remember right
>> off how to do that.
>
> There's a for statement with C-like syntax, but I don't believe that
> I've ever used it; I find foreach more convenient. If you're iterating
> over steps rather than over elements of an array, you can still use
> foreach, in conjunction with a range operator:
>
>   foreach (1..10)

In Perl, 'for' and 'foreach' are the same.
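
A minimal sketch (not from the original post) of what that means in practice: either keyword accepts both the list form and the C-style three-part form. The variable names are only for illustration.

```perl
use strict;
use warnings;

my ($sum_list, $sum_cstyle) = (0, 0);

# "for" with a list (here a range): what is usually called a foreach loop
for my $n (1 .. 10) {
    $sum_list += $n;
}

# "foreach" with the C-style three-part header
foreach (my $n = 1; $n <= 10; $n++) {
    $sum_cstyle += $n;
}

print "$sum_list $sum_cstyle\n";   # prints "55 55"
```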


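Be careful with long ranges
like for ( 1 .. 1_000_000_000 ).
A foreach loop does not build the whole list in memory (see perlop),
but in this test it was still slower than a C-style loop: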

time perl -wle '
   my $i;
   for ( 1 .. 100_000_000 ) {
     ++$i;
   }
   print $i;
'
100000000

real	0m10.005s
user	0m9.966s
sys	0m0.016s


time perl -wle '
   my $i;
   foreach ( $i= 0; $i < 100_000_000; $i++ ) {
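     # Note: $i is also incremented in the loop header, so this body
     # runs only 50_000_000 times and the timings are not comparable.
     ++$i;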
   }
   print $i;
'
100000000

real	0m6.321s
user	0m6.291s
sys	0m0.012s

-- 
Ruud



------------------------------

Date: Sat, 3 Nov 2012 12:31:36 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: array
Message-Id: <slrnk9a04o.r8t.hjp-usenet2@hrunkner.hjp.at>

On 2012-11-01 21:04, Bill Cunningham <nospam@nspam.invalid> wrote:
> Keith Thompson wrote:
>> Perl does use [] for array accesses.
>>
>> Don't base your attempt to learn Perl on your knowledge of C.
>> It's a very different language (with some similar syntax) with a
>> very different memory model.
>
>     Higher level too. Much easier to use. Here I don't think arrays have to 
> be iterated over like with C.

If you want to do something with every element of an array you have to
iterate over it. This is completely independent of the language. Perl
just gives you more ways to do it than C, some of which hide the
mechanics of iterating:


In C, given an array a with n members:

    for (int i = 0; i < n; i++) {
	do something with a[i]
    }

In Perl, C-like loop:

    for (my $i = 0; $i <= $#a; $i++) {
	do something with $a[$i];
    }

(note that you don't have to know n here - you can get the last valid
index from the array itself)

In Perl, "foreach" loop (the name is a bit misleading):

    for my $e (@a) {
	do something with $e
    }

(note that this doesn't use an index at all, $e is aliased to all
elements in order).

And since perl 5.12, you can get the index and the element in parallel:

    while (my ($i, $e) = each @a) {
	do something with $i and/or $e
    }

And finally, often you want to transform each element of an array and
assign the resulting list to a new array:

    my @b = map { do something with $_ } @a;


Plus there are a lot more functions (both builtin and in modules) which
operate on whole arrays, so you will be writing fewer explicit loops in
Perl than in C.
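
A short sketch of a few such functions, added for illustration and not part of the original post. It assumes only builtins (grep, sort, join) and the core List::Util module; the data in @nums is made up.

```perl
use strict;
use warnings;
use List::Util qw(sum max first);

my @nums = (3, 1, 4, 1, 5, 9, 2, 6);

my @even    = grep { $_ % 2 == 0 } @nums;   # select the even elements
my @sorted  = sort { $a <=> $b } @nums;     # numeric sort
my $total   = sum(@nums);                   # add everything up
my $largest = max(@nums);                   # largest element
my $big     = first { $_ > 4 } @nums;       # first element greater than 4

print "even: @even\n";
print "sorted: ", join(',', @sorted), "\n";
print "total: $total, largest: $largest, first above 4: $big\n";
```

None of these needs an index variable or an explicit loop body.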

And another one (I am starting to feel like a standup comedian): In
Perl, strings are first-class citizens, not arrays of bytes, so that
eliminates another reason for explicit loops.

And finally, there are hashes.
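
As a hedged illustration of the hash point (the word list is invented for the example): counting and de-duplicating, which in C would typically mean searching an array by hand, becomes a lookup by key.

```perl
use strict;
use warnings;

my @words = qw(apple pear apple fig pear apple);

# Count occurrences: the hash key is the word, the value is its count.
my %count;
$count{$_}++ for @words;

# The keys are the distinct words; no nested search loop is needed.
my @unique = sort keys %count;

print "$_: $count{$_}\n" for @unique;
```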

And now I'm finishing this article, before I think of something else
which might be implemented by explicitly iterating over an array in C
but which is done differently in Perl.

	hp

-- 
   _  | Peter J. Holzer    | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR       | Man feilt solange an seinen Text um, bis
| |   | hjp@hjp.at         | die Satzbestandteile des Satzes nicht mehr
__/   | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel


------------------------------

Date: Fri, 2 Nov 2012 05:23:24 +0000 (UTC)
From: jidanni@jidanni.org
Subject: Clear the "Wide character in print" warning and leave the output unmangled
Message-Id: <k6vlcb$r6i$1@news.datemas.de>

None of the advice on perlunifaq or elsewhere can both
* Clear the "Wide character in print" warning, and
* Leave the output non doubly encoded.

#!/usr/bin/perl

# How to test this program:
# $ export restriction=TW; PERLLIB=$HOME/perl5/lib/perl5 ./ytpl jidanni2  > /tmp/o
# $ cat /tmp/o
# That will show you any problems it has.

# Print out YouTube playlists. Usage:
# Example: $0 YouTubeUserID
# Example: restriction=TW $0 jidanni2
# Copyright       : http://www.fsf.org/copyleft/gpl.html
# Author          : Dan Jacobson -- http://jidanni.org/
# Created On      : Wed Mar  2 08:35:33 2011
# Last Modified On: Fri Nov  2 13:23:16 2012
# Update Count    : 830
use strict;

#use Encode;
#use warnings FATAL => 'all';
#binmode STDIN, ":utf8";
#binmode STDOUT, ':encoding(UTF-8)';
#binmode STDIN, ':encoding(UTF-8)';
#binmode(STDOUT);
#binmode STDERR, ":utf8";binmode STDOUT, ":utf8";binmode STDIN,  ":utf8";
#use utf8;
#use open qw/:std :encoding(utf8)/;
##use diagnostics;
#use Data::Dumper;
use WebService::GData::Constants qw(:all);
use WebService::GData::YouTube;
die 'Specify a user please.' unless my $user = shift;
my ( %checklist, %vids, $playlists, );
my $yt = new WebService::GData::YouTube();
$yt->connection->env_proxy;
##$yt->connection->enable_compression(TRUE); #disaster
$yt->query->max_results(50);

#if the number of your playlist is superior to 50, you will need to
#loop via the result like you used to do before with the video results
#(start_index+items_per_page). there is no other easy way to do this
#yet.

eval { $playlists = $yt->get_user_playlists($user) } or die $@->content;
@$playlists = sort { $a->title cmp $b->title } @$playlists;
for ( $ENV{restriction} ) { $yt->query()->restriction($_) if $_ }

for my $playlist (@$playlists) {
    my @missing = (undef) x $playlist->count_hint;
    my $entries;
    while (
        eval {
## can't use compression starting here:
            $entries = $yt->get_user_playlist_by_id( $playlist->playlist_id );
        }
      )
    {
        die $@->content if $@;
        for my $entry (@$entries) {
            my $IDP = ( split( /:/, $entry->id ) )[-1] or die;
            if (   $entry->appcontrol_state
                && $entry->appcontrol_state eq "requesterRegion" )
            {
                #		print "yy$IDP ", $entry->id, "\n";
                next;
            }

## http://code.google.com/intl/en/apis/youtube/2.0/reference.html#youtube_data_api_tag_yt:state
## Also one day could use
## my $string = $entry->denied_countries;
## my @matches = $string=~m/(TW|US)/g;

            delete $missing[ $entry->position - 1 ];
            my $v = sprintf "%03d|%s|%s|%s", $entry->position, $entry->video_id,
              $IDP,
              $entry->title;

            if ( $entry->media_player ) {

                #		use Data::Dumper;
                push @{ $vids{1}{ $playlist->playlist_id } }, $v;

        #		print STDERR Dumper("ç´…",$playlist->title), "ç´…", $playlist->title;
        #		die;
                unless ( $playlist->title eq '英文歌詞 English lyrics' ) {
                    push @{ $checklist{ $entry->video_id } }, join "|",
                      $playlist->title,
                      $v;
                }
            }
            else {
                push @{ $vids{0}{ $playlist->playlist_id } },
                  "# $v|" . $entry->appcontrol_state;

                #		print STDERR "xx$IDP\n";

            }
        }
    }
    for ( 0 .. $playlist->count_hint - 1 ) {
        if ( exists $missing[$_] ) {
            push @{ $vids{0}{ $playlist->playlist_id } },
              sprintf "# %03d|Problem!", $_ + 1;
## try watching it in a browser when logged out to find out what was wrong
        }
    }
}
{
    my ( $total, $list ) = ( 0, 'Duplicates' );
    printf "\n%d playlists, %d videos.\n:::: $list:\n", scalar @$playlists,
      scalar keys %checklist;
    for ( keys %checklist ) {
        if ( $#{ $checklist{$_} } ) {
            for ( @{ $checklist{$_} } ) { print "$_\n"; $total++ }
        }
    }
    print "Total $list: $total\n";
}

for my $playlist (@$playlists) {
    push @{ $vids{0}{ $playlist->playlist_id } }, "Empty playlist!"
      unless $vids{1}{ $playlist->playlist_id };
}

{
    my @list = qw/Unavailable Available/;
    for ( 0, 1 ) {
        print "\n:::: $list[$_]:\n";
        my $total = 0;
        for my $playlist (@$playlists) {
            next unless $vids{$_}{ $playlist->playlist_id };
##          print '==== http://www.youtube.com/my_playlists?p=',
##	    print '==== http://www.youtube.com/playlist?action_edit=1&list=PL',
##	    print '==== http://www.youtube.com/playlist?list=PL',
            print '==== http://www.youtube.com/playlist?list=',
              $playlist->playlist_id, ' |', $playlist->title, "\n";
            for ( sort @{ $vids{$_}{ $playlist->playlist_id } } ) {

                #                print decode_utf8( $_ ), "\n";
                print $_, "\n";
                $total++;
            }
        }
        print "Total $list[$_]: $total\n";
    }
}


------------------------------

Date: Sat, 3 Nov 2012 13:08:22 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Clear the "Wide character in print" warning and leave the output unmangled
Message-Id: <slrnk9a29o.r8t.hjp-usenet2@hrunkner.hjp.at>

On 2012-11-02 05:23, jidanni@jidanni.org <jidanni@jidanni.org> wrote:
> None of the advice on perlunifaq or elsewhere can both
> * Clear the "Wide character in print" warning, and
> * Leave the output non doubly encoded.
>
> #!/usr/bin/perl
>
> # How to test this program:
> # $ export restriction=TW; PERLLIB=$HOME/perl5/lib/perl5 ./ytpl jidanni2  > /tmp/o
> # $ cat /tmp/o
> # That will show you any problems it has.

Thanks for providing a complete script which demonstrates the problem.
This makes finding the problem simpler. However:

[...]
> use WebService::GData::Constants qw(:all);
> use WebService::GData::YouTube;
> die 'Specify a user please.' unless my $user = shift;

I'm not going to create a youtube account just to test this script. 
So I cannot test it.

Unfortunately, you didn't report where the "Wide character in print"
warning occurs, either, and it is not obvious to me from the source
code. I am guessing that it happens in the last loop, because you tried
to use decode_utf8 there.

So I'm just giving generic advice here:

1) Always use “binmode(..., ":encoding(...)");” explicitly on STDIN,
   STDOUT and STDERR. The encoding must be the one your terminal uses,
   so if your terminal supports UTF-8, use that. (for production, you
   might want to use “use open ":locale"”, but for debugging it's best
   to eliminate any source of variable behaviour and hardcode the
   encoding).

2) Try to shorten your program further, to make it easier to see where
   the problem is without actually running the program.

3) When processing character data, convert from (external) byte
   encodings to (internal) character strings as early as possible.

   My guess is that you get some byte encoded data from the
   WebService::GData module. You should decode() this, and you should do
   this as early as possible so that the rest of your code doesn't have
   to care about the encoding. This is especially necessary if you
   combine strings from several sources which might use different
   encodings.

4) When searching for encoding problems, I like to use this simple
   function to dump strings to stdout:

    sub dumpstr {
	my ($s) = @_;

	print utf8::is_utf8($s) ? "char" : "byte";
	print ":";
	for (split //, $s) {
	    printf " %#02x", ord($_);
	}
	print "\n";
    }

   use it to dump the string that is giving you the warning or that is
   double-encoded. That will usually tell you *what* is wrong with the
   string, but not *why* it is wrong. Then go backwards through the code
   to see where you get the string from. If the string is computed from
   some other string(s) (e.g. concatenation, substring, etc), dump the
   inputs in the same way. Eventually you will have identified the
   source of the "wrong" string, and then you can probably fix it with
   a simple call to decode() right at the source. (If you get the string
   from a module, you might also want to file a bug report).
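
To make points 1, 3 and 4 concrete, here is a small sketch under stated assumptions: dumpstr_s is a hypothetical variant of the function above that returns the dump instead of printing it, and the three bytes are simply the UTF-8 encoding of U+7D05, chosen as sample data. It shows a byte string, the same data after decode(), and what double encoding looks like in the dump.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

binmode STDOUT, ':encoding(UTF-8)';   # point 1: set the output encoding explicitly

# Hypothetical variant of dumpstr that returns the dump as a string.
sub dumpstr_s {
    my ($s) = @_;
    return (utf8::is_utf8($s) ? 'char' : 'byte') . ':'
         . join('', map { sprintf(' %#02x', ord($_)) } split(//, $s));
}

my $bytes  = "\xe7\xb4\x85";            # UTF-8 bytes as they might arrive from outside
my $chars  = decode('UTF-8', $bytes);   # point 3: decode as early as possible
my $double = encode('UTF-8', $bytes);   # the mistake: encoding data that is already encoded

print dumpstr_s($bytes),  "\n";   # byte: 0xe7 0xb4 0x85
print dumpstr_s($chars),  "\n";   # char: 0x7d05
print dumpstr_s($double), "\n";   # byte: 0xc3 0xa7 0xc2 0xb4 0xc2 0x85
```

The third line is the typical signature of double encoding: every byte above 0x7f has turned into two bytes.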

	hp

-- 
   _  | Peter J. Holzer    | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR       | Man feilt solange an seinen Text um, bis
| |   | hjp@hjp.at         | die Satzbestandteile des Satzes nicht mehr
__/   | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel


------------------------------

Date: Thu, 1 Nov 2012 21:43:53 +0100
From: Tuxedo <tuxedo@mailinator.com>
Subject: Copy array into input read process?
Message-Id: <k6umu9$hd2$1@news.albasani.net>

The below procedure, using lynx, simply dumps the source of a URL when 
calling it with a single argument and prints each line:

open(my $fh, '-|', 
'/usr/bin/lynx','-source','-nonumbers','-cache=0','-nolist', $ARGV[0]) or 
die $!;
while (my $line = <$fh>) {
print $line;
}

Placed in a lynx.pl file, it can be run as
 ./lynx.pl example.com

Following the '-|' bit some fixed options are passed:
'/usr/bin/lynx','-source','-nonumbers','-cache=0','-nolist'
 ... and thereafter the URL.

How can this be better done in allowing arguments to be passed like:
 ./lynx.pl example.com -source -nonumbers -cache=0 -nolist

 ... then capturing the arguments somehow like follows:
open(my $fh, '-|', '/usr/bin/lynx', @flags, $ARGV[0]) or die $!;

In the end it's not much different from running lynx directly but I would 
like some fixed and some optional arguments passed, and I'm not sure how to 
construct and push additional arguments into a list, then copy the final 
array into the file handle reading process between the -| bit and the 
$ARGV[0] to allow for a flexible number of command line options.

Many thanks for any ideas.

Tuxedo


------------------------------

Date: Thu, 01 Nov 2012 16:41:00 -0500
From: "J. Gleixner" <glex_no-spam@qwest-spam-no.invalid>
Subject: Re: Copy array into input read process?
Message-Id: <5092ec6c$0$63198$815e3792@news.qwest.net>

On 11/01/12 15:43, Tuxedo wrote:
> The below procedure, using lynx, simply dumps the source of a URL when
> calling it with a single argument and prints each line:
>
> open(my $fh, '-|',
> '/usr/bin/lynx','-source','-nonumbers','-cache=0','-nolist', $ARGV[0]) or
> die $!;
> while (my $line =<$fh>) {
> print $line;
> }
>
> Placed in a lynx.pl file, it can be run as
> ./lynx.pl example.com
>
> Following the '-|' bit some fixed options are passed:
> '/usr/bin/lynx','-source','-nonumbers','-cache=0','-nolist'
> ... and thereafter the URL.
>
> How can this be better done in allowing arguments to be passed like:
> ./lynx.pl example.com -source -nonumbers -cache=0 -nolist
>
> ... then capturing the arguments somehow like follows:
> open(my $fh, '-|', '/usr/bin/lynx', @flags $ARGV[0]) or die $!;
>
> In the end it's not much different from running lynx directly but I would
> like some fixed and some optional arguments passed, and I'm not sure how to
> construct and push additional arguments into a list, then copy the final
> array into the file handle reading process between the -| bit and the
> $ARGV[0] to allow for a flexible number of command line options.
>
> Many thanks for any ideas.
>
> Tuxedo


my $cmd = q{ /usr/bin/lynx -source -nonumbers -cache=0 -nolist };

print `$cmd @ARGV`;

To handle command line options, give the Getopt::Long module a try, or 
one of the other Getopt modules.  You could set defaults and over-ride 
them or set new ones via the command line.
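
For the list form of open asked about in the original post, a possible sketch follows. It is an assumption-laden illustration, not a tested lynx wrapper: build_cmd is a hypothetical helper, and treating every argument that starts with "-" as a flag is a simplification that Getopt::Long would handle more carefully. Unlike the backtick version, the list form passes the arguments to the program without going through a shell.

```perl
use strict;
use warnings;

# Hypothetical helper: program, then fixed flags, then flags given on the
# command line, then everything else (the URL).
sub build_cmd {
    my ($program, $fixed, @args) = @_;
    my @flags = grep {  /^-/ } @args;
    my @rest  = grep { !/^-/ } @args;
    return ($program, @$fixed, @flags, @rest);
}

if (@ARGV) {
    my @cmd = build_cmd('/usr/bin/lynx', ['-source', '-nonumbers'], @ARGV);

    # List form of a piped open: each array element is one argument.
    open(my $fh, '-|', @cmd) or die "Cannot run $cmd[0]: $!";
    print while <$fh>;
    close $fh or warn "$cmd[0] exited with status $?\n";
}
```

Run as ./lynx.pl example.com -cache=0 -nolist, this would execute lynx with the two fixed flags, then the two extra flags, then the URL.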


------------------------------

Date: Fri, 2 Nov 2012 07:58:08 +0100
From: Tuxedo <tuxedo@mailinator.com>
Subject: Re: Copy array into input read process?
Message-Id: <k6vqu1$ujk$1@news.albasani.net>

J. Gleixner wrote:

> On 11/01/12 15:43, Tuxedo wrote:
[...]

> 
> my $cmd = q{ /usr/bin/lynx -source -nonumbers -cache=0 -nolist };
> 
> print `$cmd @ARGV`;
> 
> To handle command line options, give the Getopt::Long module a try, or
> one of the other Getopt modules.  You could set defaults and over-ride
> them or set new ones via the command line.

Thanks for these tips. I will try Getopt::Long.

Tuxedo 


------------------------------

Date: Thu, 1 Nov 2012 15:35:29 -0500
From: "Bill Cunningham" <nospam@nspam.invalid>
Subject: Re: lerning perl
Message-Id: <k6umeg$rb8$1@dont-email.me>

Jürgen Exner wrote:

> In other words: you are and have been putting the cart before the
> horse. Maybe you should look into learning programming first. Once you
> understand that then switching to a different programming language is
> usually the easy part (yes, there are exceptions).

    I can see right now that perl is going to be easier and higher level 
than C. Now if I can do it. In C to check arrays you have to iterate through
the array elements by writing a program. It seems to be a lot easier in
perl. And I've got about as far as perlintro and I have to re-read it to see
what => is and "<" and ">" in the open function.

Bill




------------------------------

Date: Thu, 01 Nov 2012 16:38:14 +0000
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: Mime::Lite module generating an error
Message-Id: <87mwz1wa6h.fsf@sapphire.mobileactivedefense.com>

"dn.perl@gmail.com" <dn.perl@gmail.com> writes:
> I can send email from my linux server with 'mailx' command. I could
> also send an email from it using Mime::Lite module until recently.
> Today the same old working module has started failing, and it gives an
> error: Illegal Seek.
> What could be happening?

[...]

> my $rv = $msg->send() ;
> print "rv = $rv, $! \n\nAn email, with subject: ($subject), has been
> sent to $my_email\n\n" ;

Most likely, some code in the C library called lseek on the underlying
file descriptor and this call failed because the file descriptor
referred to the write end of a pipe connecting your code to a sendmail
process and pipes are not seekable. $! is the Perl name for errno and
the current value of errno isn't generally meaningful except when some
system or C library call returned an error indication to the
caller. In this given case, unless ->send() returned 'false', no error
has occurred.
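
The general pattern, sketched with a plain open() rather than the module from the question (try_open and the path are made up for the example): look at $! only after the call has reported failure.

```perl
use strict;
use warnings;

# Hypothetical helper: report on an attempt to open a file for reading.
sub try_open {
    my ($path) = @_;
    if (open(my $fh, '<', $path)) {
        close $fh;
        return "ok";               # success: whatever $! holds is not meaningful
    }
    return "failed: $!";           # failure: $! describes this error
}

# Path assumed not to exist, so that the failure branch is taken.
print try_open('/nonexistent/dir/file.txt'), "\n";
```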




------------------------------

Date: Thu, 1 Nov 2012 17:31:57 -0700 (PDT)
From: Jason C <jwcarlton@gmail.com>
Subject: Regex question, limit repeats UNLESS within specified tags
Message-Id: <b4b83607-061d-41c6-a323-73eaa3f8155c@googlegroups.com>

I'm currently limiting repeated characters like so:

$text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;

I'm wanting to modify it to only limit repeated characters if they're not within <img...> or <a href=...></a> tags.

I'm guessing that this would be done with negative lookbehind, like this:

# Note, these aren't tested, just here for the explanation
$text =~ s#(?<!<img)(.)\1{6,}#$1$1$1$1$1$1#gsi;
$text =~ s#(?<!<a href)(.)\1{6,}#$1$1$1$1$1$1#gsi;

Neither of these are going to be perfect, though, because:

1. in the first one, I need to test for both an opening <img and an ending >; otherwise, I think it would not catch something like "<img src='aaa.jpg'> bbbbbbbbbb" (since the repeated "b" comes after "<img").

2. in the second one, I also need to test for the ending >, but also for the closing </a>. Even if I fixed the ending >, I could still end up with a confusing "<a href='http://www.aaaaaaaaaa.com'>http://www.aaaaaa.com</a>"


Any suggestions on how to do either of these better? TIA,

Jason


------------------------------

Date: Fri, 2 Nov 2012 09:40:32 +0000
From: Justin C <justin.1211@purestblue.com>
Subject: Re: Regex question, limit repeats UNLESS within specified tags
Message-Id: <g1hcm9-35d.ln1@zem.masonsmusic.co.uk>

On 2012-11-02, Jason C <jwcarlton@gmail.com> wrote:
> I'm currently limiting repeated characters like so:
>
> $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
>
> I'm wanting to modify it to only limit repeated characters if they're not within <img...> or <a href=...></a> tags.
>
> I'm guessing that this would be done with negative lookahead, like this:
>
> # Note, these aren't tested, just here for the explanation
> $text =~ s#(?<!<img)(.)\1{6,}#$1$1$1$1$1$1#gsi;
> $text =~ s#(?<!<a href)(.)\1{6,}#$1$1$1$1$1$1#gsi;


Found in /usr/share/perl/5.10/pod/perlfaq6.pod
   How do I match XML, HTML, or other nasty, ugly things with a regex?
       (contributed by brian d foy)

       If you just want to get work done, use a module and forget about the
       regular expressions. The "XML::Parser" and "HTML::Parser" modules are
       good starts, although each namespace has other parsing modules
       specialized for certain tasks and different ways of doing it. Start at
       CPAN Search ( http://search.cpan.org ) and wonder at all the work
       people have done for you already! :)

Use the modules and use your regex on what's left; don't try to
write REs for HTML, life is too short.


   Justin.

-- 
Justin C, by the sea.


------------------------------

Date: Fri, 2 Nov 2012 13:37:10 -0700 (PDT)
From: Jason C <jwcarlton@gmail.com>
Subject: Re: Regex question, limit repeats UNLESS within specified tags
Message-Id: <1babfea9-c4bd-46e4-9564-77a08024d184@googlegroups.com>

On Friday, November 2, 2012 6:08:03 AM UTC-4, Justin C wrote:
> On 2012-11-02, Jason C <jwcarlton@gmail.com> wrote:
> > I'm currently limiting repeated characters like so:
> >
> > $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
> >
> > I'm wanting to modify it to only limit repeated characters if they're not within <img...> or <a href=...></a> tags.
> >
> > I'm guessing that this would be done with negative lookahead, like this:
> >
> > # Note, these aren't tested, just here for the explanation
> > $text =~ s#(?<!<img)(.)\1{6,}#$1$1$1$1$1$1#gsi;
> > $text =~ s#(?<!<a href)(.)\1{6,}#$1$1$1$1$1$1#gsi;
>
> Found in /usr/share/perl/5.10/pod/perlfaq6.pod
>    How do I match XML, HTML, or other nasty, ugly things with a regex?
>        (contributed by brian d foy)
>
>        If you just want to get work done, use a module and forget about the
>        regular expressions. The "XML::Parser" and "HTML::Parser" modules are
>        good starts, although each namespace has other parsing modules
>        specialized for certain tasks and different ways of doing it. Start at
>        CPAN Search ( http://search.cpan.org ) and wonder at all the work
>        people have done for you already! :)
>
> Use the modules and use your regex on what's left, don't don't try to
> write REs for HTML, life is too short.
>
>    Justin.
>
> -- 
> Justin C, by the sea.

I've used HTML::Parser at length, but I don't think that it offers anything like what I'm needing. I looked through CPAN, and didn't find anything like this.

I might have made the OP seem too complicated. What I really need to figure out is how to run a regex where both the look-behind AND look-ahead match.

Something like this, I guess:

# Not tested
while (($text !~ /<img[^>]*?>/gi) &&
       ($text !~ /<a href[^>]*?>/gi)) {
  $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi; 
}

Or maybe two separate loops, like this:

while ($text !~ /<img[^>]*?>/gi) {
  $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
}

while ($text !~ /<a href([^>]*?)>(.*?)<\/a>/gi) {
  $pattern = $repl = $1;

  $pattern = quotemeta($pattern);
  $repl =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;

  $text =~ s#$pattern#$repl#gsi;
}

Thoughts?


------------------------------

Date: Fri, 2 Nov 2012 21:11:06 +0000 (UTC)
From: Eli the Bearded <*@eli.users.panix.com>
Subject: Re: Regex question, limit repeats UNLESS within specified tags
Message-Id: <eli$1211021700@qz.little-neck.ny.us>

In comp.lang.perl.misc, Jason C  <jwcarlton@gmail.com> wrote:
> On Friday, November 2, 2012 6:08:03 AM UTC-4, Justin C wrote:
>> Found in /usr/share/perl/5.10/pod/perlfaq6.pod
>>    How do I match XML, HTML, or other nasty, ugly things with a regex?
>>        (contributed by brian d foy)
>>        If you just want to get work done, use a module and forget about the
>>        regular expressions. The "XML::Parser" and "HTML::Parser" modules
> I've used HTML::Parser at length, but I don't think that it offers anything
> like what I'm needing. I looked through CPAN, and didn't find anything like
> this.

Your use case is exotic. You will not find exactly what you need off the
shelf. You will find ways to break a document up into <IMG>, <A>, and
neither of those when you use a parsing module. Thus broken up, you can
then do your substring regexp.

> I might have made the OP seem too complicated. What I really need to figure
> out is how to run a regex where both the look-behind AND look-ahead match.

No, I don't think you made it seem "too complicated", it *is* too
complicated. Anytime you want look-behind or look-ahead you are risking
"too complicated". Do not make the mistake of thinking that a single
operation is the best way to solve any pattern finding task.

Elijah
------
<img src = "sixarrows.png" alt = "-> -> -> -> -> ->" title = ">>>>>>" >

> Something like this, I guess:
> 
> # Not tested
> while (($text !~ /<img[^>]*?>/gi) &&
>        ($text !~ /<a href[^>]*?>/gi)) {
>   $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi; 
> }
> 
> Or maybe two separate loops, like this:
> 
> while ($text !~ /<img[^>]*?>/gi) {
>   $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
> }
> 
> while ($text !~ /<a href([^>]*?)>(.*?)<\/a>/gi) {
>   $pattern = $repl = $1;
> 
>   $pattern = quotemeta($pattern);
>   $repl =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
> 
>   $text =~ s#$pattern#$repl#gsi;
> }
> 
> Thoughts?




------------------------------

Date: Sat, 3 Nov 2012 13:31:47 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Regex question, limit repeats UNLESS within specified tags
Message-Id: <slrnk9a3lj.r8t.hjp-usenet2@hrunkner.hjp.at>

On 2012-11-02 21:11, Eli the Bearded <*@eli.users.panix.com> wrote:
> In comp.lang.perl.misc, Jason C  <jwcarlton@gmail.com> wrote:
>> On Friday, November 2, 2012 6:08:03 AM UTC-4, Justin C wrote:
>>> Found in /usr/share/perl/5.10/pod/perlfaq6.pod
>>>    How do I match XML, HTML, or other nasty, ugly things with a regex?
>>>        (contributed by brian d foy)
>>>        If you just want to get work done, use a module and forget about the
>>>        regular expressions. The "XML::Parser" and "HTML::Parser" modules
>> I've used HTML::Parser at length, but I don't think that it offers anything
>> like what I'm needing. I looked through CPAN, and didn't find anything like
>> this.
>
> Your use case is exotic. You will not find exactly what you need off the
> shelf. You will find ways to break a document up into <IMG>, <A>, and
> neither of thsoe when you use a parsing module. Thus broken up, you can
> then do your substring regexp.

Agreed.

>
>> I might have made the OP seem too complicated. What I really need to figure
>> out is how to run a regex where both the look-behind AND look-ahead match.
>
> No, I don't think you made it seem "too complicated", it *is* too
> complicated.

I don't know whether it is complicated but I do know that I don't
understand it. My best guess is that he wants to limit duplicate
characters in the text of document, but wants to avoid mangling URLs.

So if someone writes:

    <p>John is stupid!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!</p>

he wants to change this to 

    <p>John is stupid!!!!!!</p>

But something like

    <img src="/images/img0000000123.jpg" title="Little Johnny and his dog">

should not be changed to 

    <img src="/images/img000000123.jpg" title="Little Johnny and his dog">

because that would invalidate the link.

But this is just a guess. 

Assuming I am right, I would use HTML::Parser to parse the file and then
do those substitutions only in text nodes. This is probably most easily
done with a handler.
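
A rough sketch of that idea, assuming the guess above is right and that HTML::Parser (a CPAN module, version 3 API) is installed. limit_repeats_in_text is a name made up for the example. Tags, comments and declarations are copied through unchanged; only text events get the substitution from the original post.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: collapse runs of 7 or more identical characters
# to 6, but only in text between tags.
sub limit_repeats_in_text {
    my ($html) = @_;
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Text events: apply the substitution, then append.
        text_h => [ sub {
            my ($text) = @_;
            $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
            $out .= $text;
        }, 'text' ],
        # Everything else (tags, comments, ...): append the source text as is.
        default_h => [ sub { $out .= $_[0] }, 'text' ],
    );
    $parser->unbroken_text(1);   # deliver each run of text in one piece
    $parser->parse($html);
    $parser->eof;

    return $out;
}

print limit_repeats_in_text(
    '<p>John is stupid!!!!!!!!!!</p><img src="/images/img0000000123.jpg">'
), "\n";
```

With this input the exclamation marks are cut down to six while the image URL keeps all of its zeros. Note that the parser also reports the contents of script and style elements as text, and that leaving link text inside <a> alone would need additional start and end handlers; both are left out of this sketch.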

	hp



-- 
   _  | Peter J. Holzer    | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR       | Man feilt solange an seinen Text um, bis
| |   | hjp@hjp.at         | die Satzbestandteile des Satzes nicht mehr
__/   | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel


------------------------------

Date: Fri, 2 Nov 2012 21:51:01 +0000 (UTC)
From: tmcd@panix.com (Tim McDaniel)
Subject: Trampoline sub
Message-Id: <k71f85$7tl$1@reader1.panix.com>

I'm moving some subs from one module to another.  I'm thinking of
leaving old names in to give more time to change all the callers.
I think the best way to have "trampoline" code is
    sub OldName { goto &NewModule::NewName; }
and it's reasonably clear.  Just out of curiosity, are there other
ways?

I tried
    *old{CODE} = \&real;
but it causes
    Can't modify glob elem in scalar assignment at local/test/077.pl
    line 6, near "&real;"
    Execution of local/test/077.pl aborted due to compilation errors.

I tried
    *old = *real;
and it works.  Does it have any bad effects, like creating $MAIN::old
or something?

-- 
Tim McDaniel, tmcd@panix.com


------------------------------

Date: Fri, 02 Nov 2012 22:25:35 +0000
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: Trampoline sub
Message-Id: <87mwyz63s0.fsf@sapphire.mobileactivedefense.com>

tmcd@panix.com (Tim McDaniel) writes:
> I'm moving some subs from one module to another.  I'm thinking of
> leaving old names in to give more time to change all the callers.
> I think the best way to have "trampoline" code is
>     sub OldName { goto &NewModule::NewName; }
> and it's reasonably clear.  Just out of curiosity, are there other
> ways?

I would write this as

	sub OldName { &NewModule::NewName; }

goto &subref is magical in the sense that it is supposed to hide the
fact that some subroutine was created via AUTOLOAD by manipulating the
call stack accordingly. This is probably not necessary in your case.
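
A small sketch of the difference, using the names from the question plus
two made-up wrapper names (OldGoto, OldPlain). NewName reports how many
call frames it can see: with goto the trampoline frame is replaced, with
a plain &-call it stays on the stack. Both forms pass the caller's @_
through.

```perl
use strict;
use warnings;

package NewModule;

# Returns the number of visible call frames followed by its arguments.
sub NewName {
    my $depth = 0;
    while (my @frame = caller($depth)) { $depth++ }
    return ($depth, @_);
}

package main;

sub OldGoto  { goto &NewModule::NewName; }   # replaces the current frame
sub OldPlain { &NewModule::NewName; }        # calls with the current @_

my ($direct)             = NewModule::NewName('a', 'b');
my ($via_goto,  @args_g) = OldGoto('a', 'b');
my ($via_plain, @args_p) = OldPlain('a', 'b');

print "direct=$direct goto=$via_goto plain=$via_plain\n";
print "args via goto: @args_g; args via plain call: @args_p\n";
```

The goto variant should report the same depth as a direct call, the
plain variant one frame more. That only matters if the new sub (or
something it calls, like Carp) looks at caller().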

>
> I tried
>     *old{CODE} = \&real;
> but it causes
>     Can't modify glob elem in scalar assignment at local/test/077.pl
>     line 6, near "&real;"

The purpose of the *foo{THING} syntax is to access the slots of a
glob, eg

,----
| [rw@sapphire]~ $perl -de 0
| 
| Loading DB routines from perl5db.pl version 1.32
| Editor support available.
| 
| Enter h or `h h' for help, or `man perldebug' for more help.
| 
| main::(-e:1):   0
|   DB<1> sub toast { return 'Toast'; }
| 
|   DB<2> $toast = *toast{CODE}
| 
|   DB<3>  p $toast->()
| Toast
`----

This is not necessary when assigning because a reference of a certain
type is automatically assigned to the correct glob slot,

,----
| [rw@sapphire]~ $perl -de 0
| 
| Loading DB routines from perl5db.pl version 1.32
| Editor support available.
| 
| Enter h or `h h' for help, or `man perldebug' for more help.
| 
| main::(-e:1):   0
|   DB<1> sub toast { return 'Toast'; }
| 
|   DB<2> *food = \&toast
| 
|   DB<3> p food()
| Toast
`----

>     Execution of local/test/077.pl aborted due to compilation errors.
>
> I tried
>     *old = *real;
> and it works.  Does it have any bad effects, like creating $MAIN::old
> or something?

It does what you were asking for: Put the glob referred to by *real in
the symbol table slot old:

,----
| [rw@sapphire]~ $perl -de 0
| 
| Loading DB routines from perl5db.pl version 1.32
| Editor support available.
| 
| Enter h or `h h' for help, or `man perldebug' for more help.
| 
| main::(-e:1):   0
|   DB<1> @real = qw(The rain in Spain stays mainly in the plains)
| 
|   DB<2> *old = *real
| 
|   DB<3> p join(' ', @old)
| The rain in Spain stays mainly in the plains
`----


------------------------------

Date: Sat, 3 Nov 2012 00:15:32 -0700 (PDT)
From: "C.DeRykus" <derykus@gmail.com>
Subject: Re: Trampoline sub
Message-Id: <a1ef5992-b92a-4e5e-a2ac-2c4fb647acfa@googlegroups.com>

On Friday, November 2, 2012 2:51:01 PM UTC-7, Tim McDaniel wrote:
> I'm moving some subs from one module to another.  I'm thinking of
> leaving old names in to give more time to change all the callers.
> I think the best way to have "trampoline" code is
>     sub OldName { goto &NewModule::NewName; }
> and it's reasonably clear.  Just out of curiosity, are there other
> ways?
>
> I tried
>     *old{CODE} = \&real;
> but it causes
>     Can't modify glob elem in scalar assignment at local/test/077.pl
>     line 6, near "&real;"
>     Execution of local/test/077.pl aborted due to compilation errors.
>
> I tried
>     *old = *real;
> and it works.  Does it have any bad effects, like creating $MAIN::old
> or something?

One downside would be that all the glob slots of 
*old would be aliased to those of *real. Selective
aliasing over a local scope seems preferable:

   no warnings 'redefine';
   local *old  = \&real;
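
To make the difference concrete, here is a short sketch comparing the
two kinds of assignment. The names old_all and old_code are made up for
the comparison, and it leaves out "local" so that both aliases stay
visible for inspection.

```perl
use strict;
use warnings;

our $real = 'scalar from the real glob';
our ($old_all, $old_code);    # declared so that strict lets us look at them

sub real { return 'real sub' }

*old_all  = *real;     # glob-to-glob: every slot of *old_all aliases *real
*old_code = \&real;    # code reference: only the CODE slot is filled

print old_all(),  "\n";    # both names call the same sub
print old_code(), "\n";

# Only the whole-glob alias drags the scalar along.
print '$old_all  is ', (defined $old_all  ? "'$old_all'" : 'undef'), "\n";
print '$old_code is ', (defined $old_code ? "'$old_code'" : 'undef'), "\n";
```

So "*old = *real" does make $old (and @old, %old, the file handle, ...)
another name for the corresponding "real" variables, while assigning a
code reference leaves the other slots alone.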



-- 
Charles DeRykus



------------------------------

Date: Fri, 02 Nov 2012 17:49:31 +0200
From: Eric Pozharski <whynot@pozharski.name>
Subject: Re: Why "Wide character in print"?
Message-Id: <slrnk97qsb.27g.whynot@orphan.zombinet>

with <slrnk94mfm.5vl.hjp-usenet2@hrunkner.hjp.at> Peter J. Holzer wrote:

*SKIP*
> Then I don't know what you meant by "utf8". Care to explain?

Do you know difference between utf-8 and utf8 for Perl?  (For long time,
up to yesterday, I believed that utf-8 is all-caps;  I was wrong,
it's caseless.)

*SKIP*
>  * The encoding of the source code of the script

Wrong.

[quote perldoc encoding on]

    *   Internally converts all literals ("q//,qq//,qr//,qw///, qx//") from
        the encoding specified to utf8. In Perl 5.8.1 and later, literals in
        "tr///" and "DATA" pseudo-filehandle are also converted.

[quote off]

In pre-all-utf8 times qr// was working on bytes without being told to
behave otherwise.  That's different now.

>  * The default encoding of some I/O streams

We here, in our barbaric world, had (and still have) to process any
binary encoding except latin1 (guess what, CP866 is still alive).
However:

[quote perldoc encoding on]

    *   Changing PerlIO layers of "STDIN" and "STDOUT" to the encoding
        specified.

[quote off]

That's not saying anything about 'default'.  It's about 'encoding
specified'.

> and it does so even in an inconsistent manner (e.g. the encoding is
> applied to STDOUT, but not to STDERR)

No problems with that here.  STDERR is us-ascii, point.

> and finally, because it is too
> complex and that will lead to surprising results.

In your elitist latin1 world -- may be so.  But we, down here, are
barbarians, you know.

-- 
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom


------------------------------

Date: Sat, 3 Nov 2012 12:03:43 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Why "Wide character in print"?
Message-Id: <slrnk99ugj.r8t.hjp-usenet2@hrunkner.hjp.at>

On 2012-11-02 15:49, Eric Pozharski <whynot@pozharski.name> wrote:
> with <slrnk94mfm.5vl.hjp-usenet2@hrunkner.hjp.at> Peter J. Holzer wrote:
>
> *SKIP*
>> Then I don't know what you meant by "utf8". Care to explain?
>
> Do you know difference between utf-8 and utf8 for Perl?

UTF-8 is the "UCS Transformation Format, 8-bit form" as defined by the
Unicode consortium. It defines a mapping from unicode characters to
bytes and back. When you use it as an encoding in Perl, there will be
some checks that the input is actually a valid unicode character. For
example, you can't encode a surrogate character:

    $s2 = encode("utf-8", "\x{D812}");

results in the string "\xef\xbf\xbd", which is UTF-8 for U+FFFD (the
replacement character used to signal invalid characters).


utf8 may mean (at least) three different things in a Perl context:

 * It is a perl-proprietary encoding (actually two encodings, but EBCDIC
   support in perl has been dead for several years and I doubt it will
   ever come back, so I'll ignore that) for storing strings. The
   encoding is based on UTF-8, but it can represent code points with up
   to 64 bits[1], while UTF-8 is limited to 31 bits by design and to
   values <= 0x10FFFF by fiat. It also doesn't check for surrogates, so

	$s2 = encode("utf8", "\x{D812}");

    results in the string "\xed\xa0\x92", as one would naively expect.

    You should never use this encoding when reading or writing files.
    It's only for perl internal use and AFAIK it isn't documented
    anywhere except possibly in the source code.

 * Since the perl interpreter uses the format to store strings with
   Unicode character semantics (marked with the UTF8 flag), such strings
   are often called "utf8 strings" in the documentation.  This is
   somewhat unfortunate, because "utf8" looks very similar to "utf-8",
   which can cause confusion and because it exposes an implementation
   detail (There are several other possible storage formats a perl
   interpreter could reasonably use) to the user.

   I avoid this usage. I usually talk about "byte strings" or "character
   strings", or use even more verbose language to make clear what I am
   talking about. For example, in this thread the distinction between
   byte strings and character strings is almost irrelevant, it is only important
   whether a string contains an element > 0xFF or not.

 * There is also an I/O layer “:utf8”, which is subtly different from
   both “:encoding(utf8)” and “:encoding(utf-8)”.
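
The two encode() calls above can be put side by side in a short sketch.
The helper hex_bytes is made up for display purposes; the expected byte
sequences in the comments are the ones quoted above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

my $surrogate = "\x{D812}";

# Strict UTF-8: the surrogate is replaced by U+FFFD.
my $strict = encode('utf-8', $surrogate);

# Perl's lax "utf8": the surrogate is encoded as if it were a character.
my $lax = encode('utf8', $surrogate);

print 'utf-8: ', hex_bytes($strict), "\n";    # ef bf bd
print 'utf8:  ', hex_bytes($lax),    "\n";    # ed a0 92
```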

> (For long time, up to yesterday, I believed that utf-8 is
> all-caps;  I was wrong, it's caseless.)

Yes, the encoding names (as used in Encode::encode, Encode::decode and
the :encoding() I/O-Layers) are case-insensitive.


>>  * The encoding of the source code of the script
>
> Wrong.
>
> [quote perldoc encoding on]
>
>     *   Internally converts all literals ("q//,qq//,qr//,qw///, qx//") from
>         the encoding specified to utf8. In Perl 5.8.1 and later, literals in
>         "tr///" and "DATA" pseudo-filehandle are also converted.
>
> [quote off]

How is this proving me wrong? It confirms what I wrote. 

If you use “use encoding 'KOI8-U';”, you can use KOI8 sequences (either
literally or via escape sequences) in your source code. For example, if
you store this program in KOI8-U encoding:


#!/usr/bin/perl
use warnings;
use strict;
use 5.010;
use encoding 'KOI8-U';

my $s1 = "Б";
say ord($s1);
my $s2 = "\x{E2}";
say ord($s2);
__END__

(i.e. the string literal on line 7 is stored as the byte sequence 0x22
0xE2 0x22), the program will print 1041 twice, because:

 * The perl compiler knows that the source code is in KOI8-U, so a single
   byte 0xE2 in the source code represents the character “U+0411
   CYRILLIC CAPITAL LETTER BE”. Similarly, escape sequences of the form
   \ooo and \xXX are taken to denote bytes in the source character set
   and translated to unicode. So both the literal Б on line 7 and the
   \x{E2} on line 9 are translated to U+0411.

 * At run time, the bytecode interpreter sees a string with the single 
   unicode character U+0411. How this character was represented in the 
   source code is irrelevant (and indeed, unknowable) to the byte code 
   interpreter at this stage. It just prints the decimal representation
   of 0x0411, which happens to be 1041.
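
The byte-to-character mapping itself can be checked without the pragma
(which was deprecated in later perls). This sketch uses Encode::decode
as a stand-in for the translation “use encoding” applies to literals; it
is not how the pragma is implemented.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The single byte that the KOI8-U source file contains between the quotes.
my $koi8_byte = "\xE2";

# Decoding it yields CYRILLIC CAPITAL LETTER BE.
my $char = decode('KOI8-U', $koi8_byte);

printf "U+%04X = %d\n", ord($char), ord($char);    # U+0411 = 1041
```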


> In pre-all-utf8 times qr// was working on bytes without being told to
> behave otherwise.  That's different now.

Yes, I think I wrote that before. I don't know what this has to do with
the behaviour of “use encoding”, except that historically, “use
encoding” was intended to convert old byte-oriented scripts to the brave new
unicode-centered world with minimal effort. (I don't think it met that
goal: Over the years I have encountered a lot of people who had problems
with “use encoding”, but I don't remember ever reading from someone who
successfully converted their scripts by slapping  “use encoding '...'”
at the beginning.)

>>  * The default encoding of some I/O streams
>
> We here, in our barbaric world, had (and still have) to process any
> binary encoding except latin1 (guess what, CP866 is still alive).
> However:
>
> [quote perldoc encoding on]
>
>     *   Changing PerlIO layers of "STDIN" and "STDOUT" to the encoding
>         specified.
>
> [quote off]
>
> That's not saying anything about 'default'.  It's about 'encoding
> specified'.

You misunderstood what I meant by "default". When the perl interpreter
creates the STDIN and STDOUT file handles, these have some I/O layers
applied to them, without the user having to call binmode() explicitly.
These are applied by default, and hence I call them the default layers.
The list of default layers varies between systems (Windows adds the
:crlf layer, Linux doesn't), depends on command line settings (-CS adds
the :utf8 layer, IIRC), and of course it can also be manipulated by
modules like “encoding”. “use encoding 'CP866';” pushes the layer
“:encoding(CP866)” onto the STDIN and STDOUT handles. You can still
override them with binmode(), but they are there by default, you don't
have to call “binmode STDIN, ":encoding(CP866)"” explicitly (but you do
have to call it explicitly for STDERR, which IMNSHO is inconsistent).
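
For comparison, a sketch of doing the same thing by hand for all three
standard handles, STDERR included. The sub name add_encoding_layer is
made up; binmode() and PerlIO::get_layers() are the standard calls.

```perl
use strict;
use warnings;

# Hypothetical helper: push an :encoding() layer onto each given handle.
sub add_encoding_layer {
    my ($encoding, @handles) = @_;
    for my $fh (@handles) {
        binmode($fh, ":encoding($encoding)")
            or die "binmode failed: $!";
    }
    return;
}

add_encoding_layer('CP866', \*STDIN, \*STDOUT, \*STDERR);

# Show which layers are now active on STDERR.
print join(' ', PerlIO::get_layers(\*STDERR)), "\n";
```

The exact list printed depends on the platform and on what was already
applied, which is the point made above about default layers.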


>> and it does so even in an inconsistent manner (e.g. the encoding is
>> applied to STDOUT, but not to STDERR)
>
> No problems with that here.  STDERR is us-ascii, point.

If my scripts handle non-ascii characters, I want those characters also
in my error messages. If a script is intended for normal users (not
sysadmins), I might even want the error messages to be in their native
language instead of English. German can be expressed in pure US-ASCII,
although it's awkward. Russian or Chinese is harder.

>> and finally, because it is too complex and that will lead to
>> surprising results.
>
> In your elitist latin1 world -- may be so.  But we, down here, are
> barbarians, you know.

May I remind you that it was you who was surprised by the behaviour of
“use encoding” in this thread, not me?

In Message <slrnk8q6nd.khg.whynot@orphan.zombinet> you wrote:

|         {10613:81} [0:0]% perl -Mencoding=utf8 -wle 'print "Ă "' # hooray!
|         Ă    
|         {10645:82} [0:0]% perl -Mencoding=utf8 -wle 'print "\x{E0}"' # oops
|         ďż˝   
|         {10654:83} [0:0]% perl -Mencoding=utf8 -wle 'print "\N{U+00E0}"' # hoora
|         Ă    
| 
| Except the middle one (what I should think about), I think encoding.pm
| wins again.

You didn't understand why the middle one produced this particular
result. So you were surprised by the way “use encoding” translates
string literals. I wasn't surprised. I knew how it works and explained
it to you in my followup. 

Still, although I think I understand “use encoding” fairly well (because
I spent a lot of time reading the docs and playing with it when I still
thought it would be a useful tool, and later because I spent a lot of
time arguing on usenet that it isn't useful) I think it is too complex.
I would be afraid of making stupid mistakes like writing "\x{E0}" when I
meant chr(0xE0), and even if I don't make them, the next guy who has to
maintain the scripts probably understands much less about “use encoding”
than I do and is likely to misunderstand my code and introduce errors.

	hp


[1] I admit that I was surprised by this. It is documented that strings
    consist of 64-bit elements on 64-bit machines, but I thought this
    was an obvious documentation error until I actually tried it.

-- 
   _  | Peter J. Holzer    | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR       | Man feilt solange an seinen Text um, bis
| |   | hjp@hjp.at         | die Satzbestandteile des Satzes nicht mehr
__/   | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 3808
***************************************

