[31273] in Perl-Users-Digest
Perl-Users Digest, Issue: 2518 Volume: 11
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Wed Jul 15 14:09:46 2009
Date: Wed, 15 Jul 2009 11:09:09 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Wed, 15 Jul 2009 Volume: 11 Number: 2518
Today's topics:
Re: FAQ 6.7 How can I make "\w" match national characte <brian.d.foy@gmail.com>
Re: FAQ 6.7 How can I make "\w" match national characte <whynot@pozharski.name>
Re: FAQ 6.7 How can I make "\w" match national characte <hjp-usenet2@hjp.at>
Re: FAQ 6.7 How can I make "\w" match national characte <hjp-usenet2@hjp.at>
Re: FAQ 6.7 How can I make "\w" match national characte <ben@morrow.me.uk>
Re: FAQ 6.7 How can I make "\w" match national characte <ben@morrow.me.uk>
Re: FAQ 6.7 How can I make "\w" match national characte <ben@morrow.me.uk>
Re: FAQ 6.7 How can I make "\w" match national characte <hjp-usenet2@hjp.at>
getting return value of external application on win32 <alfonso.baldaserra@gmail.com>
Re: getting return value of external application on win <ben@morrow.me.uk>
normalize a <table> with multiple, variable, data in ea <oldyork90@yahoo.com>
Re: normalize a <table> with multiple, variable, data i <jimsgibson@gmail.com>
Re: removing paragraphs from text files <whynot@pozharski.name>
Re: removing paragraphs from text files <alfonso.baldaserra@gmail.com>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Tue, 14 Jul 2009 17:45:02 -0500
From: brian d foy <brian.d.foy@gmail.com>
Subject: Re: FAQ 6.7 How can I make "\w" match national character sets?
Message-Id: <140720091745024534%brian.d.foy@gmail.com>
In article <kplsi6-ehk2.ln1@osiris.mauzo.dyndns.org>, Ben Morrow
<ben@morrow.me.uk> wrote:
> Quoth PerlFAQ Server <brian@stonehenge.com>:
> >
> > 6.7: How can I make "\w" match national character sets?
> >
> > Put "use locale;" in your script. The \w character class is taken from
> > the current locale.
> >
> > See perllocale for details.
>
> I wonder if this should mention
>
> Note that if you are matching against a UTF8 string, the Unicode
> definition of \w (and other character classes) will always be used.
> This may be fixed in some future version of perl.
That sounds like a good thing to note. Thanks,
------------------------------
Date: Wed, 15 Jul 2009 09:55:26 +0300
From: Eric Pozharski <whynot@pozharski.name>
Subject: Re: FAQ 6.7 How can I make "\w" match national character sets?
Message-Id: <slrnh5qvdi.bbb.whynot@orphan.zombinet>
On 2009-07-14, brian d foy <brian.d.foy@gmail.com> wrote:
> In article <kplsi6-ehk2.ln1@osiris.mauzo.dyndns.org>, Ben Morrow
><ben@morrow.me.uk> wrote:
>
>> Quoth PerlFAQ Server <brian@stonehenge.com>:
>> >
>> > 6.7: How can I make "\w" match national character sets?
>> >
>> > Put "use locale;" in your script. The \w character class is taken from
>> > the current locale.
>> >
>> > See perllocale for details.
>>
>> I wonder if this should mention
>>
>> Note that if you are matching against a UTF8 string, the Unicode
>> definition of \w (and other character classes) will always be used.
>> This may be fixed in some future version of perl.
>
> That sounds like a good thing to note. Thanks,
I believe that entry should suffer a major rewrite:
{2982:24} [0:1]$ perl -wle 'print "ф" =~ /\w/'
{2996:25} [0:0]$ perl -Mlocale -wle 'print "ф" =~ /\w/'
{3000:26} [0:0]$ perl -Mutf8 -wle 'print "ф" =~ /\w/'
1
{3004:27} [0:0]$ perl -Mencoding=utf8 -wle 'print "ф" =~ /\w/'
1
{3011:28} [0:0]$ LC_ALL=ru_UA.UTF-8 perl -Mlocale -wle 'print "ф" =~ /\w/'
{3070:29} [0:0]$
--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom
------------------------------
Date: Wed, 15 Jul 2009 12:08:17 +0200
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: FAQ 6.7 How can I make "\w" match national character sets?
Message-Id: <slrnh5rakj.cik.hjp-usenet2@hrunkner.hjp.at>
On 2009-07-15 06:55, Eric Pozharski <whynot@pozharski.name> wrote:
> On 2009-07-14, brian d foy <brian.d.foy@gmail.com> wrote:
>> In article <kplsi6-ehk2.ln1@osiris.mauzo.dyndns.org>, Ben Morrow
>><ben@morrow.me.uk> wrote:
>>> Quoth PerlFAQ Server <brian@stonehenge.com>:
>>> > 6.7: How can I make "\w" match national character sets?
>>> >
>>> > Put "use locale;" in your script. The \w character class is taken from
>>> > the current locale.
>>> >
>>> > See perllocale for details.
>>>
>>> I wonder if this should mention
>>>
>>> Note that if you are matching against a UTF8 string, the Unicode
>>> definition of \w (and other character classes) will always be used.
>>> This may be fixed in some future version of perl.
>>
>> That sounds like a good thing to note. Thanks,
>
> I believe, that entry should suffer a major rewrite
This is probably true for all entries recommending "use locale".
> {2982:24} [0:1]$ perl -wle 'print "ф" =~ /\w/'
This isn't a single cyrillic letter. It is a byte string which happens
to contain the UTF-8 encoding of a cyrillic letter. But perl doesn't
know that it's supposed to decode that byte string unless you tell it.
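
A minimal sketch of that distinction, assuming the terminal handed perl the UTF-8 encoding of ф (the bytes 0xD1 0x84, written here as escapes so the example does not depend on the source encoding). It only compares lengths before and after an explicit decode:

```perl
use strict;
use warnings;
use Encode qw(decode);

# What perl sees for a literal ф in UTF-8 source without "use utf8":
# two separate bytes, not one character.
my $bytes = "\xD1\x84";

# Decoding tells perl these bytes are UTF-8 and yields a character string.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d\n", length $bytes;
printf "chars: length %d, U+%04X\n", length $chars, ord $chars;
```
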
hp
------------------------------
Date: Wed, 15 Jul 2009 12:37:31 +0200
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: FAQ 6.7 How can I make "\w" match national character sets?
Message-Id: <slrnh5rcbb.cik.hjp-usenet2@hrunkner.hjp.at>
On 2009-07-13 19:03, PerlFAQ Server <brian@stonehenge.com> wrote:
> 6.7: How can I make "\w" match national character sets?
>
> Put "use locale;" in your script. The \w character class is taken from
> the current locale.
>
> See perllocale for details.
>
Here is a more radical rewrite:
6.7: How can I make "\w" match national character sets?
Since perl 5.8.x, \w matches all Unicode word characters if matched
against a character string (sometimes also called a "utf8 string").
The best way to deal with national character sets is to always
decode them to character strings on input (either with
Encode::decode or with an ":encoding(...)" I/O layer) and then
continue to work with the character string.
See perlunitut for details.
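
A rough sketch of the decode-on-input approach described above, assuming UTF-8 input. The in-memory filehandle and the sample bytes (ф, é, a space and "x") are only stand-ins for a real file:

```perl
use strict;
use warnings;

# Stand-in for a file on disk: raw UTF-8 bytes for "фé x\n".
my $bytes = "\xD1\x84\xC3\xA9 x\n";

# The :encoding(UTF-8) layer decodes while reading, so $line is a
# character string by the time the regex sees it.
open my $fh, '<:encoding(UTF-8)', \$bytes or die "open: $!";
my $line = <$fh>;
close $fh;

# \w now matches Unicode word characters, Cyrillic and accented Latin alike.
my @words = $line =~ /(\w+)/g;
printf "%d words, first has %d characters\n", scalar @words, length $words[0];
```

Encode::decode on an already-read string would give the same character string; the layer just moves the decoding to the point of input.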
------------------------------
Date: Wed, 15 Jul 2009 17:00:48 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: FAQ 6.7 How can I make "\w" match national character sets?
Message-Id: <g2b1j6-ccr2.ln1@osiris.mauzo.dyndns.org>
Quoth Eric Pozharski <whynot@pozharski.name>:
> On 2009-07-14, brian d foy <brian.d.foy@gmail.com> wrote:
> > In article <kplsi6-ehk2.ln1@osiris.mauzo.dyndns.org>, Ben Morrow
> ><ben@morrow.me.uk> wrote:
> >
> >> Quoth PerlFAQ Server <brian@stonehenge.com>:
> >> >
> >> > 6.7: How can I make "\w" match national character sets?
> >> >
> >> > Put "use locale;" in your script. The \w character class is taken from
> >> > the current locale.
> >> >
> >> > See perllocale for details.
> >>
> >> I wonder if this should mention
> >>
> >> Note that if you are matching against a UTF8 string, the Unicode
> >> definition of \w (and other character classes) will always be used.
> >> This may be fixed in some future version of perl.
> >
> > That sounds like a good thing to note. Thanks,
>
> I believe, that entry should suffer a major rewrite
>
> {2982:24} [0:1]$ perl -wle 'print "ф" =~ /\w/'
>
> {2996:25} [0:0]$ perl -Mlocale -wle 'print "ф" =~ /\w/'
I don't know what your locale is set to, so I wouldn't expect this to
work.
> {3000:26} [0:0]$ perl -Mutf8 -wle 'print "ф" =~ /\w/'
> 1
> {3004:27} [0:0]$ perl -Mencoding=utf8 -wle 'print "ф" =~ /\w/'
> 1
These two are exactly what I was talking about. With a UTF8-marked
string, \w uses the Unicode definition of \w, regardless of locale.
> {3011:28} [0:0]$ LC_ALL=ru_UA.UTF-8 perl -Mlocale -wle 'print "ф" =~ /\w/'
This I would expect to work, and in simple cases it does:
~% LC_ALL=en_GB.ISO8859-1 perl -E'say chr(0xff) =~ /\w/'
~% LC_ALL=en_GB.ISO8859-1 perl -Mlocale -E'say chr(0xff) =~ /\w/'
1
~%
I suspect that either the locale system as a whole or perl's
implementation of it doesn't understand character sets that aren't just
an 8-bit extension of ASCII.
Ben
------------------------------
Date: Wed, 15 Jul 2009 17:03:47 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: FAQ 6.7 How can I make "\w" match national character sets?
Message-Id: <38b1j6-ccr2.ln1@osiris.mauzo.dyndns.org>
Quoth "Peter J. Holzer" <hjp-usenet2@hjp.at>:
> On 2009-07-15 06:55, Eric Pozharski <whynot@pozharski.name> wrote:
>
> > {2982:24} [0:1]$ perl -wle 'print "ф" =~ /\w/'
>
> This isn't a single cyrillic letter. It is a byte string which happens
> to contain the UTF-8 encoding of a cyrillic letter. But perl doesn't
> know that it's supposed to decode that byte string unless you tell it.
No, that won't help. Or rather, it *will* cause it to match, but it will
match according to the Unicode rules and not to the ru_UA rules (which
are likely different in some cases). The locale system is supposed to
handle multi-byte character sets encoded as byte strings, but it doesn't
appear to within perl.
Ben
------------------------------
Date: Wed, 15 Jul 2009 17:09:57 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: FAQ 6.7 How can I make "\w" match national character sets?
Message-Id: <ljb1j6-ccr2.ln1@osiris.mauzo.dyndns.org>
Quoth "Peter J. Holzer" <hjp-usenet2@hjp.at>:
> On 2009-07-13 19:03, PerlFAQ Server <brian@stonehenge.com> wrote:
> > 6.7: How can I make "\w" match national character sets?
> >
> > Put "use locale;" in your script. The \w character class is taken from
> > the current locale.
> >
> > See perllocale for details.
> >
>
> Here is a more radical rewrite:
>
> 6.7: How can I make "\w" match national character sets?
>
> Since perl 5.8.x, \w matches all Unicode word characters if matched
> against a character string (sometimes also called a "utf8 string").
The visibility of the SvUTF8 flag from Perl is considered to be a bug,
and ought to be fixed at some point. The bad interactions between
locale.pm and Unicode are also considered to be a bug, but AFAIK no one
is terribly interested in fixing it.
> The best way to deal with national character sets is to always
> decode them to character strings on input (either with
> Encode::decode or with an ":encoding(...)" I/O layer) and then
> continue to work with the character string.
This isn't correct, though. If I am in a Russian locale, I only want to
match Russian \w characters, not Unicode \w characters. Currently 'use
locale' is the way to achieve this (probably with a 'use bytes' to make
sure no UTF8-marked strings sneak in by mistake), but (apparently) it
only works for single-byte character sets.
Ben
------------------------------
Date: Wed, 15 Jul 2009 19:39:07 +0200
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: FAQ 6.7 How can I make "\w" match national character sets?
Message-Id: <slrnh5s51r.iht.hjp-usenet2@hrunkner.hjp.at>
On 2009-07-15 16:09, Ben Morrow <ben@morrow.me.uk> wrote:
> Quoth "Peter J. Holzer" <hjp-usenet2@hjp.at>:
>> On 2009-07-13 19:03, PerlFAQ Server <brian@stonehenge.com> wrote:
>> > 6.7: How can I make "\w" match national character sets?
>> >
>> > Put "use locale;" in your script. The \w character class is
>> > taken from the current locale.
>> >
>> > See perllocale for details.
>> >
>>
>> Here is a more radical rewrite:
>>
>> 6.7: How can I make "\w" match national character sets?
>>
>> Since perl 5.8.x, \w matches all Unicode word characters if matched
>> against a character string (sometimes also called a "utf8 string").
>
> The visibility of the SvUTF8 flag from Perl is considered to be a bug,
> and ought to be fixed at some point.
Even if it is a bug (there are valid arguments why it is a feature) what
I wrote is still true: \w matches Unicode word characters in a character
string. If the "bug" is fixed, \w will also match Unicode word
characters in byte strings, but I didn't write anything about that. (I
do hope nobody proposes to "fix" the bug by reducing all character
classes to the US-ASCII repertoire - that would be extremely stupid.)
> The bad interactions between locale.pm and Unicode are also considered
> to be a bug, but AFAIK no one is terribly interested in fixing it.
There is probably nobody terribly interested in locale at all since Perl
got Unicode support. The locale is still useful to get information about
the environment (from the default encoding through search order to
message catalogs) but I see the way the locale changes semantics of
builtin functions as a legacy from C which is unnecessary and actually
harmful in Perl.
>> The best way to deal with national character sets is to always
>> decode them to character strings on input (either with
>> Encode::decode or with an ":encoding(...)" I/O layer) and then
>> continue to work with the character string.
>
> This isn't correct, though. If I am in a Russian locale, I only want to
> match Russian \w characters, not Unicode \w characters.
Maybe you do, but I don't. And I would be surprised if it works that way
in any locale: For example, the German locale on my Debian system
happily accepts α and Ж as '[:alpha:]' characters. Even the Turkish
locale accepts the Greek alpha :-).
Similarly I am quite sure that for example a Russian locale based on
KOI-8 will accept not only the Cyrillic letters but also the Latin
letters in that character set as alphabetical. It won't accept Greek
letters only because they cannot be encoded in KOI-8.
hp
------------------------------
Date: Wed, 15 Jul 2009 07:37:31 -0700 (PDT)
From: alfonsobaldaserra <alfonso.baldaserra@gmail.com>
Subject: getting return value of external application on win32
Message-Id: <70fd68ec-5098-4acb-a5e6-2360a21d8c81@a26g2000yqn.googlegroups.com>
hello,
i am calling an external command in perl on win32 as follows
my $app = 'c:\program files\foo\flarp.exe status quux';
my $spam = qx/ $app 2>&1 /;
now i need to get the status of executed command
if ( $? == 0 ) { print "yay"; }
the problem is it always returns 0. since quux is not running, when i
run the same command on cmd.exe i get return value as 3
> echo %ERRORLEVEL%
3
i have checked the archives with similar question but no help. i have
also checked system() and qx// documentation but they don't have
anything like this.
is there any other way to do this?
thanks.
------------------------------
Date: Wed, 15 Jul 2009 17:16:54 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: getting return value of external application on win32
Message-Id: <m0c1j6-ccr2.ln1@osiris.mauzo.dyndns.org>
Quoth alfonsobaldaserra <alfonso.baldaserra@gmail.com>:
>
> i am calling an external command in perl on win32 as follows
>
> my $app = 'c:\program files\foo\flarp.exe status quux';
> my $spam = qx/ $app 2>&1 /;
                      ^^^^
This means that the command will be run through cmd.exe, which exits
successfully (despite the command having failed).
> now i need to get the status of executed command
>
> if ( $? == 0 ) { print "yay"; }
$? contains the exit value of the cmd.exe process (since that's the only
thing cmd.exe knows about). Apparently this is always 0.
I would recommend using Win32::Process instead. It's a little awkward,
but should let you do the redirection without involving cmd. You will
have to create the pipe and read from it by hand, of course.
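
For reference, a small sketch of how a child's exit code is packed into $? when no shell sits in between. It assumes nothing about flarp.exe: the child here is just the running perl told to exit with 3, started with the list form of system(), and it does not capture any output:

```perl
use strict;
use warnings;

# Stand-in for the real program: run this same perl and have it exit with 3.
# The list form of system() starts the program directly, without a shell.
system($^X, '-e', 'exit 3');

# $? is a status word; the child's exit code lives in its high byte,
# so comparing $? itself against 3 would not work.
my $exit_code = $? >> 8;
print "exit code: $exit_code\n";
```
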
Ben
------------------------------
Date: Tue, 14 Jul 2009 16:33:13 -0700 (PDT)
From: okey <oldyork90@yahoo.com>
Subject: normalize a <table> with multiple, variable, data in each <td>
Message-Id: <fc1532dd-a45c-482d-b88a-89a38932fa35@o6g2000yqj.googlegroups.com>
I have to believe this has been done many times before.
We have a well-formed HTML <table> which contains X rows with Y
columns in each. Nice and regular...
Each <td>, however, contains multiple chunks of data. Data in these
cells is delimited by a <br />
but could be anything I guess.
We need to take this table and regenerate it so that each data chunk
has its own row.
For example
<table>
<tr>
<td>item01</td>
<td>item<br />item< br/>item</td>
</tr>
<tr>
<td>item03<br/></td>
<td>item<br /></td>
</tr>
would result in
<table>
<tr>
<td>item01</td>
<td>item</td>
</tr>
<tr>
<td>item01</td>
<td>item</td>
</tr>
<tr>
<td>item01</td>
<td>item</td>
</tr>
<tr>
<td>item03 .....
... and so on.
We could code this, but there has to be some kind of module for it (with
the normal good module stuff). Is there something out there?
------------------------------
Date: Tue, 14 Jul 2009 18:17:51 -0700
From: Jim Gibson <jimsgibson@gmail.com>
Subject: Re: normalize a <table> with multiple, variable, data in each <td>
Message-Id: <140720091817515701%jimsgibson@gmail.com>
In article
<fc1532dd-a45c-482d-b88a-89a38932fa35@o6g2000yqj.googlegroups.com>,
okey <oldyork90@yahoo.com> wrote:
> I have to believe this has been done many times before.
>
> We have a well formed html <table> which contains X rows with Y
> columns in each. Nice and regular...
>
> Each <td> however contains multiple chunks of data. Data in these
> cells is delimited by a <br />
> but could be anything I guess.
>
> We need to take this table and regenerate it so that each data chunk
> has its own row.
>
> For example
>
> <table>
.....
>
> ... and so on.
>
> We could code this. but this has to be some kind of module (with
> normal good module stuff). Is there something out there
There is the HTML::TableExtract module, which will help you with
extracting the table info. Reformatting is then up to you. Breaking out
rows in an HTML table seems a little too specialized to have a module
already written, but who knows?
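
The reformatting step might look something like the sketch below, in core Perl only. It assumes the cell contents have already been pulled out of the HTML as strings, and it guesses that a cell with fewer chunks than its neighbours should repeat its last chunk, as the example in the question suggests. split_cell and expand_row are made-up names:

```perl
use strict;
use warnings;

# Split one cell on <br>, <br/>, <br /> or < br/> and drop blank chunks.
sub split_cell {
    my ($cell) = @_;
    my @chunks = grep { /\S/ } split /<\s*br\s*\/?\s*>/i, $cell;
    return @chunks ? @chunks : ('');
}

# Expand one row into as many rows as its longest cell has chunks.
# Assumption: shorter cells repeat their last chunk.
sub expand_row {
    my (@cells) = @_;
    my @split = map { [ split_cell($_) ] } @cells;

    my $max = 0;
    for my $chunks (@split) {
        $max = @$chunks if @$chunks > $max;
    }

    my @rows;
    for my $i (0 .. $max - 1) {
        my @row;
        for my $chunks (@split) {
            push @row, ($i < @$chunks) ? $chunks->[$i] : $chunks->[-1];
        }
        push @rows, \@row;
    }
    return @rows;
}

my @rows = expand_row('item01', 'item<br />item< br/>item');
print join(' | ', @$_), "\n" for @rows;
```

Wrapping each resulting row back into <tr>/<td> markup is then a simple join.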
--
Jim Gibson
------------------------------
Date: Tue, 14 Jul 2009 16:04:44 +0300
From: Eric Pozharski <whynot@pozharski.name>
Subject: Re: removing paragraphs from text files
Message-Id: <slrnh5p0lu.88b.whynot@orphan.zombinet>
On 2009-07-13, Peter J. Holzer <hjp-usenet2@hjp.at> wrote:
*SKIP*
*skipping alfonsobaldaserra since he skipped Tad anyway*
> s/define service\{.*PING.*\}\s+//sg
>
> OTOH would match anything from the first "define service{" to the last
> "}" in the file (provided there's a PING somewhere between them) so it
> would probably remove a lot more than you want. The /[^}]*/ in Tad's
> regex is there to keep the match within a single brace-delimited block
> (and it's a bit simple-minded: It won't work if you have a } inside a
> comment, for example, but you probably don't, so that doesn't matter).
Then stricter
qr/\}\n+/
and stricter
qr/\}(?:\h*\n)+/ # needs 5.10
and stricter
qr/\}\h*\n(?:\h*\n)*/
Which leads us to
perldoc -q nesting
and applying regexes to HTML.
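
To see the greedy-versus-bounded difference from the quoted explanation, here is a small sketch. The sample config is made up, and the second substitution is only a guess at the shape of Tad's regex (which is not quoted here), using /[^}]*/ on both sides of PING:

```perl
use strict;
use warnings;

# Made-up sample: two service blocks and one host block.
my $config = <<'END';
define service{
    service_description PING
}

define service{
    service_description HTTP
}

define host{
    host_name web1
}
END

# Greedy .* with /s runs from the first "define service{" to the LAST "}".
(my $greedy = $config) =~ s/define service\{.*PING.*\}\s+//sg;

# [^}]* cannot cross a closing brace, so the match stays inside one block.
(my $bounded = $config) =~ s/define service\{[^}]*PING[^}]*\}\s+//g;

print "greedy leaves ", length($greedy), " characters\n";
print "bounded leaves:\n", $bounded;
```

With this input the greedy version swallows the whole file, while the bounded one removes only the PING block.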
--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom
------------------------------
Date: Wed, 15 Jul 2009 05:40:16 -0700 (PDT)
From: alfonsobaldaserra <alfonso.baldaserra@gmail.com>
Subject: Re: removing paragraphs from text files
Message-Id: <74bd6af4-5713-4770-94cf-6424e5da0ff8@s15g2000yqs.googlegroups.com>
> Not alot to go on, but don't expect this to be a real parser unless you understand
> the RULES.
that was an excellent explanation. thank you very much guys, i have
understood it now.
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc. For subscription or unsubscription requests, send
#the single line:
#
# subscribe perl-users
#or:
# unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.
NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V11 Issue 2518
***************************************