[33156] in Perl-Users-Digest


Perl-Users Digest, Issue: 4435 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Wed May 20 09:09:21 2015

Date: Wed, 20 May 2015 06:09:05 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Wed, 20 May 2015     Volume: 11 Number: 4435

Today's topics:
        awk + grep in perl shulamitmi@bezeq.com
    Re: awk + grep in perl <rweikusat@mobileactivedefense.com>
    Re: awk + grep in perl <gravitalsun@hotmail.foo>
    Re: awk + grep in perl <derykus@gmail.com>
    Re: awk + grep in perl <derykus@gmail.com>
    Re: awk + grep in perl sharma__r@hotmail.com
    Re: awk + grep in perl <rweikusat@mobileactivedefense.com>
    Re: awk + grep in perl <rweikusat@mobileactivedefense.com>
    Re: awk + grep in perl <rweikusat@mobileactivedefense.com>
        Decoding "BER" Integers <rweikusat@mobileactivedefense.com>
    Re: Decoding "BER" Integers <blgl@stacken.kth.se>
    Re: Decoding "BER" Integers <rweikusat@mobileactivedefense.com>
    Re: Extract all "words" <derykus@gmail.com>
    Re: Extract all "words" <rweikusat@mobileactivedefense.com>
    Re: Extract all "words" <derykus@gmail.com>
    Re: Extract all "words" <rweikusat@mobileactivedefense.com>
    Re: Extract all "words" <derykus@gmail.com>
    Re: Extract all "words" <rweikusat@mobileactivedefense.com>
    Re: Extract all "words" <derykus@gmail.com>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Tue, 19 May 2015 06:48:22 -0700 (PDT)
From: shulamitmi@bezeq.com
Subject: awk + grep in perl
Message-Id: <cc2ad5fb-fec4-4e4d-94aa-13d858c34810@googlegroups.com>

Hello,

I have a file named "listfile" with the following content:

area1 file1
area2 file1
area1 file1
area2 file2
area2 file2

I need to get all names of "area" for "file1" and count them.
in unix shell: grep file1 listfile | awk '{print $1}' | uniq | wc -l 

what is the best way to do it in perl ?

thanks!


------------------------------

Date: Tue, 19 May 2015 15:13:10 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: awk + grep in perl
Message-Id: <874mn8odu1.fsf@doppelsaurus.mobileactivedefense.com>

shulamitmi@bezeq.com writes:
> I have a file named "listfile" with the following content:
>
> area1 file1
> area2 file1
> area1 file1
> area2 file2
> area2 file2
>
> I need to get all names of "area" for "file1" and count them.
> in unix shell: grep file1 listfile | awk '{print $1}' | uniq | wc -l 
>
> what is the best way to do it in perl ?

What's your definition of 'best'?

perl -lane '/file1$/ and $a{$F[0]}++ || ++$c; END { print $c }'

works. As does

awk '/file1$/{ a[$1]++ || ++c } END { print c }'


------------------------------

Date: Tue, 19 May 2015 17:22:40 +0300
From: George Mpouras <gravitalsun@hotmail.foo>
Subject: Re: awk + grep in perl
Message-Id: <mjfgui$clr$1@news.grnet.gr>

On 19/5/2015 16:48, shulamitmi@bezeq.com wrote:
> Hello,
>
> I have a file named "listfile" with the following content:
>
> area1 file1
> area2 file1
> area1 file1
> area2 file2
> area2 file2
>
> I need to get all names of "area" for "file1" and count them.
> in unix shell: grep file1 listfile | awk '{print $1}' | uniq | wc -l
>
> what is the best way to do it in perl ?
>
> thanks!
>


#!/usr/bin/perl
use strict; use warnings; my %data;

while(<DATA>){
/^(.+?)\s+(.*)\s*$/ or next; $data{$2}->{$1}=1
}

print "$_ : ".scalar(keys %{$data{$_}})."\n" foreach sort keys %data




__DATA__
area1 file1
area1 file1
area1 file1
area2 file1
area1 file1
area2 file2
area2 file2
area2 file2


------------------------------

Date: Tue, 19 May 2015 12:03:19 -0700 (PDT)
From: "C.DeRykus" <derykus@gmail.com>
Subject: Re: awk + grep in perl
Message-Id: <90ad39e5-003b-40dd-a75f-2daf28dfc650@googlegroups.com>

On Tuesday, May 19, 2015 at 6:48:27 AM UTC-7, shula...@bezeq.com wrote:
> Hello,
> 
> I have a file named "listfile" with the following content:
> 
> area1 file1
> area2 file1
> area1 file1
> area2 file2
> area2 file2
> 
> I need to get all names of "area" for "file1" and count them.
> in unix shell: grep file1 listfile | awk '{print $1}' | uniq | wc -l 
> 
> what is the best way to do it in perl ?
> 
> thanks!

Here's another way:

perl -nE '$h{$1}="" if /(area\d+)\s+file1$/;END{say $c=keys %h}' file


------------------------------

Date: Tue, 19 May 2015 12:17:12 -0700 (PDT)
From: "C.DeRykus" <derykus@gmail.com>
Subject: Re: awk + grep in perl
Message-Id: <883915eb-a639-4f42-8e1c-5f9a07268b1b@googlegroups.com>

On Tuesday, May 19, 2015 at 12:03:25 PM UTC-7, C.DeRykus wrote:
> On Tuesday, May 19, 2015 at 6:48:27 AM UTC-7, shula...@bezeq.com wrote:
> > Hello,
> > 
> > I have a file named "listfile" with the following content:
> > 
> > area1 file1
> > area2 file1
> > area1 file1
> > area2 file2
> > area2 file2
> > 
> > I need to get all names of "area" for "file1" and count them.
> > in unix shell: grep file1 listfile | awk '{print $1}' | uniq | wc -l 
> > 
> > what is the best way to do it in perl ?
> > 
> > thanks!
> 
> Here's another way:
> 
> perl -nE '$h{$1}="" if /(area\d+)\s+file1$/;END{say $c=keys %h}' file


Sigh, I just noticed I've inadvertently cloned a solution already shown.


------------------------------

Date: Tue, 19 May 2015 12:39:09 -0700 (PDT)
From: sharma__r@hotmail.com
Subject: Re: awk + grep in perl
Message-Id: <0f124d7c-54eb-4b20-b082-cce835032791@googlegroups.com>

On Tuesday, 19 May 2015 19:18:27 UTC+5:30, shula...@bezeq.com  wrote:
> Hello,
> 
> I have a file named "listfile" with the following content:
> 
> area1 file1
> area2 file1
> area1 file1
> area2 file2
> area2 file2
> 
> I need to get all names of "area" for "file1" and count them.
> in unix shell: grep file1 listfile | awk '{print $1}' | uniq | wc -l 
> 
> what is the best way to do it in perl ?
> 

-i)
You can do it this way, which keeps in mind the perl
best practices (Conway). Put the below shown code
into a file (say, runme.plx), then

# Usage: perl -- runme.plx listfile

use strict;
use warnings;

local $\ = "\n";

my $strip_ws = sub {
   for ((@_) ? @_ : $_) {
      s/^\s*//;s/\s*$//
   }
};

die "[ERROR] Please remember to provide the list file next time when you runme. Quitting..."
   unless @ARGV;

my ($list_file) = @ARGV;

open my $fh, "<", $list_file
   or die "[ERROR] Could not open the file '$list_file' for reading: $!";

my %h;

LINE:
while(local $_ = <$fh>) {
   chomp;

   $strip_ws->();

   next LINE unless /\sfile1$/;

   my ($area) = split;

   $h{$area}++;
}

print scalar keys %h;

close $fh
   or die "[ERROR] Could not close the file '$list_file' after reading: $!";

__END__



-ii) Or as a one-liner on a linux bash shell command line

  < listfile perl -0777pe '$_ = keys %{ +{ /^\s*(\S+)\s+file1\s*($)/mg } }';echo


------------------------------

Date: Tue, 19 May 2015 21:49:14 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: awk + grep in perl
Message-Id: <87egmcmgxh.fsf@doppelsaurus.mobileactivedefense.com>

sharma__r@hotmail.com writes:
> On Tuesday, 19 May 2015 19:18:27 UTC+5:30, shula...@bezeq.com  wrote:

[...]

> -ii) Or as a one-liner on a linux bash shell command line
>
>   < listfile perl -0777pe '$_ = keys %{ +{ /^\s*(\S+)\s+file1\s*($)/mg } }';echo

Removing everything which serves no purpose or is needlessly byzantine
(in order to confuse a reader) and making better use of
features of perl yields

perl -l -0777pe '$_ = keys %{{ /(\S+)\s+(file1)/g }}'

which is essentially the same as

perl -l -0777pe '%a = /(\S+)\s+(file1)/g; $_ = keys %a'

ie, for each line which contains file1, create an entry in a hash using
the first word as key (the value doesn't matter) and print the result of
evaluating keys(%hash) in scalar context which is the number of keys.
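
The same idea spelled out as a small script may be easier to follow. This is
only a sketch: the sub name count_areas is made up for illustration, the
sample data is inlined instead of being read from "listfile", and the regex
is anchored per line, which is an assumption about what was intended.

```perl
use strict;
use warnings;

# Sample data from the original question (inlined for the example).
my $text = <<'EOT';
area1 file1
area2 file1
area1 file1
area2 file2
area2 file2
EOT

# Count distinct first fields on lines whose second field is $file.
# (count_areas is a hypothetical helper name, not from the thread.)
sub count_areas {
    my ($text, $file) = @_;

    # In list context, a /g match with two capture groups returns a flat
    # (key, value, key, value, ...) list, which initializes the hash.
    my %seen = $text =~ /^(\S+)\s+(\Q$file\E)$/mg;

    # keys() in scalar context yields the number of keys.
    return scalar keys %seen;
}

print count_areas($text, 'file1'), "\n";
```

Duplicate first fields collapse because a hash holds each key only once, so
no separate sort or uniq step is needed.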






------------------------------

Date: Tue, 19 May 2015 23:09:51 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: awk + grep in perl
Message-Id: <87a8x0md74.fsf@doppelsaurus.mobileactivedefense.com>

Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:
> sharma__r@hotmail.com writes:
>> On Tuesday, 19 May 2015 19:18:27 UTC+5:30, shula...@bezeq.com  wrote:
>
> [...]
>
>> -ii) Or as a one-liner on a linux bash shell command line
>>
>>   < listfile perl -0777pe '$_ = keys %{ +{ /^\s*(\S+)\s+file1\s*($)/mg } }';echo
>
> Removing everything which serves no purpose or is needlessly byzantine
> (in order to confuse a reader) and making better use of
> features of perl yields
>
> perl -l -0777pe '$_ = keys %{{ /(\S+)\s+(file1)/g }}'

Since it's apparently silly season again, here's another which is a
little less obvious:

perl -lpe '($_=%{{map{/(\S+)(.)*file1/}<>}})=~s/\/.*//'


------------------------------

Date: Tue, 19 May 2015 23:12:18 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: awk + grep in perl
Message-Id: <874mn8md31.fsf@doppelsaurus.mobileactivedefense.com>

Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:
> Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:

[...]

> perl -lpe '($_=%{{map{/(\S+)(.)*file1/}<>}})=~s/\/.*//'

[Hopefully] working:

perl -lpe '($_=%{{map{/(\S+)(.)*file1/}$_,<>}})=~s/\/.*//'



------------------------------

Date: Tue, 19 May 2015 21:20:22 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Decoding "BER" Integers
Message-Id: <87iobomi9l.fsf@doppelsaurus.mobileactivedefense.com>

Perl pack/ unpack support a "BER compressed integer" format (template w) for
non-negative integers of (theoretically) arbitrary width. This works by
dividing the actual value into 7-bit groups (from least significant to
most significant) and putting each 7-bit group in the lower 7 bits of an
octet/ byte, starting with the most significant group (eg, 267, binary
100001011 would be grouped into 0000010 0001011 and stored
left-to-right). The 8th bit of each octet except the last making up the
number is set to one (so 267 would be encoded as bytes 10000010
00001011). Unfortunately, while unpack can be used to decode this
format, there doesn't seem to be a way to determine the encoded length
of a decoded integer (other than determining the position of the highest
set bit in the decoded value and calculate the number of needed octets
from that).
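
To make the description above concrete, here is a minimal sketch that
encodes a value by hand and derives the encoded length from the value
alone. The sub names ber_encode and ber_length are invented for this
example; only pack's w template comes from Perl itself.

```perl
use strict;
use warnings;

# Encode a non-negative integer by hand: 7 bits per octet, most
# significant group first, high bit set on every octet except the last.
# (ber_encode and ber_length are hypothetical helper names.)
sub ber_encode {
    my ($n) = @_;
    my @octets = ($n & 0x7f);                 # last octet: high bit clear
    while ($n >>= 7) {
        unshift @octets, 0x80 | ($n & 0x7f);  # earlier octets: high bit set
    }
    return pack('C*', @octets);
}

# Number of octets the encoding of $n occupies, computed from the value.
sub ber_length {
    my ($n) = @_;
    my $len = 1;
    $len++ while $n >>= 7;
    return $len;
}

my $enc = ber_encode(267);
print join(' ', map { sprintf '%08b', $_ } unpack('C*', $enc)), "\n";
print "matches pack('w')\n" if $enc eq pack('w', 267);
print "length: ", ber_length(267), "\n";
```

Computing the length this way costs one shift per octet, so it is not
obviously cheaper than counting octets while decoding.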

I'm planning to use/ using integers encoded in this way to represent
string lengths in some binary message format for representing a sequence
of structures composed of a sequence of members of certain types. In
order to decode this, it's obviously necessary to skip over the already
decoded parts of the message string, hence, I need to know the length of
a string length. My idea for dealing with this problem was to decode the
numbers octet-by-octet and count them by doing so. So far, I came up
with three algorithms for doing this and the one doing the unpack('C5',
 ...) seems best to me.

Any other ideas?

Two additional remarks: The length of a complete message is restricted
to a 32-bit integer. Perl supports O(1) deletion from the beginning of a
string.

-----------
use Benchmark;

my @lens = map { pack('w', rand(1 << 9)) } 0 .. 100;

sub uni
{
    my $v = $_[0];
    my ($cur, $n, $x);

    do {
	++$n;
	
	$cur = unpack('C', $v);
	$x <<= 7;
	$x |= $cur & 0x7f;

	substr($v, 0, 1, '');
    } while $cur > 127;

    return ($x, $n);
}

sub s1
{
    my $v = $_[0];
    my ($cur, $n, $x);

    while (($cur = unpack('C', $v)) > 127) {
	$x |= $cur & 0x7f;
	$x <<= 7;

	++$n;
	substr($v, 0, 1, '');
    }
    
    substr($v, 0, 1, '');
    return ($x | $cur, $n + 1);
}

sub un_all
{
    my $v = $_[0];
    my ($x, $n);

    for (unpack('C5', $v)) {
	++$n;

	if ($_ < 128) {
	    substr($v, 0, $n, '');
	    return ($x | $_, $n);
	}
	
	$x |= $_ & 0x7f;
	$x <<= 7;
    }
}

timethese(-3,
	  {
	   uni => sub {
	       uni($_) for @lens;
	   },

	   s1 => sub {
	       s1($_) for @lens;
	   },
	   
	   un_all => sub {
	       un_all($_) for @lens;
	   }});


------------------------------

Date: Wed, 20 May 2015 06:11:57 +0200
From: Bo Lindbergh <blgl@stacken.kth.se>
Subject: Re: Decoding "BER" Integers
Message-Id: <mjh1g8$l33$1@dont-email.me>

In article <87iobomi9l.fsf@doppelsaurus.mobileactivedefense.com>,
 Rainer Weikusat <rweikusat@mobileactivedefense.com> wrote:
> hence, I need to know the length of a string length.

If it's speed you're after, a) avoid copying the string and b) use a regexp.

sub w_len
{
    if ($_[0] =~ /\A[\x80-\xFF]*[\x00-\x7F]/) {
        ( unpack("w",$_[0]), $+[0] );
    } else {
        # input contains no octets below 0x80
        ();
    }
}


/Bo Lindbergh


------------------------------

Date: Wed, 20 May 2015 12:32:12 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: Decoding "BER" Integers
Message-Id: <87wq035vsz.fsf@doppelsaurus.mobileactivedefense.com>

Bo Lindbergh <blgl@stacken.kth.se> writes:
> In article <87iobomi9l.fsf@doppelsaurus.mobileactivedefense.com>,
>  Rainer Weikusat <rweikusat@mobileactivedefense.com> wrote:
>> hence, I need to know the length of a string length.
>
> If it's speed you're after, a) avoid copying the string and b) use a regexp.
>
> sub w_len
> {
>     if ($_[0] =~ /\A[\x80-\xFF]*[\x00-\x7F]/) {
>         ( unpack("w",$_[0]), $+[0] );
>     } else {
>         # input contains no octets below 0x80
>         ();
>     }
> }

While this sort-of answers the question you pulled out of the text I
wrote, it's not usable in the context of the benchmark I posted because
the string has to be copied to avoid changing the original as the
encoded length also has to be removed from it (because of the
requirements for actual use I posted --- in particular, I was also
interested in observable effects - if any - of calling substr per octet
or calling it once after the encoded length was determined). Also, at
least in my testing, using $+[0] caused the performance to become
seriously grotty. But /g and pos can be used instead. The regex is also
more complicated than necessary as it really only needs to find the
position of the first octet/ char with a numerical value < 128.

Implementation taking the above into account:

sub re
{
    my $v = $_[0];
    my ($x, $n);

    $x = unpack('w', $v);
    
    $v =~ /[\x00-\x7f]/g;
    $n = pos($v);
    substr($v, 0, $n, '');

    return ($x, $n);
}

Starts to pull ahead clearly (for me) once lengths reach 2**15 (1 << 15,
32768); below that, it's slower than the otherwise fastest and becomes
worse as maximum lengths decrease. Since 'equally distributed set of
numbers from 0 .. 32767' is not realistic in practice, I won't use it,
but the general idea is not bad (aka 'I should have thought of that
myself, thanks').

Random thought: Considering that regex matches are string operations
(and substr could be considered one, too), can I count on perl leaving
my binary data alone instead of starting to group it into characters
according to some supposedly secret encoding convention?

Might someone be hiding behind the coalshed here?
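
On the binary-data question, one way to look at it is sketched below. It
assumes the message comes from pack or a binary-mode filehandle, so that
every character is below 256. The helper names is_octet_string and
encoded_len are made up for illustration. Explicit ranges such as
[\x00-\x7f] compare character ordinals, so they should behave the same
whether or not the string's internal representation has been upgraded.
Trouble would only be expected once a character above 0xFF gets in.

```perl
use strict;
use warnings;

# True if every character of $s fits in one octet, ie, the string can
# still be treated as raw binary data. (Hypothetical helper.)
sub is_octet_string {
    my ($s) = @_;                       # work on a copy
    return utf8::downgrade($s, 1) ? 1 : 0;
}

# Position just past the first character below 128, found with /g and pos.
sub encoded_len {
    my ($s) = @_;
    return $s =~ /[\x00-\x7f]/g ? pos($s) : undef;
}

my $raw  = pack('w', 300) . "payload";  # 300 encodes as two octets
my $wide = $raw . "\x{263A}";           # a character above 0xFF sneaks in

my $upgraded = $raw;
utf8::upgrade($upgraded);               # same string, different internals

print "raw:      ", encoded_len($raw), "\n";
print "upgraded: ", encoded_len($upgraded), "\n";
print "raw is octets\n"      if is_octet_string($raw);
print "wide is not octets\n" unless is_octet_string($wide);
```

A check like is_octet_string at the point where data enters the decoder
would catch strings that were accidentally decoded as text.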




------------------------------

Date: Mon, 18 May 2015 11:31:44 -0700 (PDT)
From: "C.DeRykus" <derykus@gmail.com>
Subject: Re: Extract all "words"
Message-Id: <e0e7525f-cddb-42e4-b857-41024bbdfefe@googlegroups.com>

On Friday, May 15, 2015 at 4:44:17 AM UTC-7, Robert Crandal wrote:
> I would like to extract all "words" from a document, and output
> in the order that they occur to a file named "out.txt".
> 
> For example, given this input text:
> 
> "His light's shone on the J2 building, making the
> window-panes glow like so many fires."
> 
> Then, the outfile should be:
> 
> His
> light's
> shone
> on
> the
> J2
> building
> making
> the
> window-panes
> glow
> like
> so
> many
> fires
> 
> I prefer to keep hyphens (-) and apostrophes (') that occur
> within words.
> 
> All other characters may be removed, such as commas, periods,
> question marks, exclamation points, parentheses, whitespaces,
> etc. etc. etc...
> 
> Also, I prefer to ignore words that contain ALL numbers, or
> are a mix of numbers and non-alpha characters.  For example,
> ignore:  12345, 05/05/15, (123)456-1564, 234-34-122, etc....
> 
> Is this best solved with a regular expression?

Lots of twists and hairpin curves on this cliff... here's another
approach though:

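use feature 'say';

while (<STDIN>) {
    s/[,.?!()]//g;         # remove these characters

    #-- drop tokens that are all numbers or a mix of numbers and non-alphas;
    #-- substitute a space so the neighbouring words stay separated
    s/(?:^|\s) (?: \d+ | [^[:alpha:]] )+ (?:\s|$)/ /xg ;

    say join("\n",split(" ",$_)) if /\S/;
}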




------------------------------

Date: Mon, 18 May 2015 19:59:07 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: Extract all "words"
Message-Id: <87y4klaf0k.fsf@doppelsaurus.mobileactivedefense.com>

"C.DeRykus" <derykus@gmail.com> writes:
> On Friday, May 15, 2015 at 4:44:17 AM UTC-7, Robert Crandal wrote:
>> I would like to extract all "words" from a document, and output
>> in the order that they occur to a file named "out.txt".
>> 
>> For example, given this input text:
>> 
>> "His light's shone on the J2 building, making the
>> window-panes glow like so many fires."

[...]

>> I prefer to keep hyphens (-) and apostrophes (') that occur
>> within words.
>> 
>> All other characters may be removed, such as commas, periods,
>> question marks, exclamation points, parentheses, whitespaces,
>> etc. etc. etc...
>> 
>> Also, I prefer to ignore words that contain ALL numbers, or
>> are a mix of numbers and non-alpha characters.  For example,
>> ignore:  12345, 05/05/15, (123)456-1564, 234-34-122, etc....
>> 
>> Is this best solved with a regular expression?
>
> Lots of twists and hairpin curves on this cliff... here's another
> approach though:
>
> while (<STDIN>) {
>     s/[,.?!()]//g;         # remove these char's
>
>     #-- ignore if all num's or a mix of num's,non-alpha
>     s/(?:^|\s) (?: \d+ | [^[:alpha:]] )+ (?:\s|$)//xg ;
>
>     say join("\n",split(" ",$_)) if /\S/;
> }

The first step is avoidable as you can as well just split on
/[\s,.?!()]+/ (regex untested). Of course, the split expression needs to
be more complicated as this isn't an exhaustive set of punctuation
characters, some others would be :, ; and ". Even assuming the split set
was complete, the algorithm would still keep lone single quotes and
hyphens, something it wasn't supposed to do ...



------------------------------

Date: Mon, 18 May 2015 12:34:33 -0700 (PDT)
From: "C.DeRykus" <derykus@gmail.com>
Subject: Re: Extract all "words"
Message-Id: <5c6392ce-c160-400d-8a65-2f60db810a43@googlegroups.com>

On Monday, May 18, 2015 at 11:59:12 AM UTC-7, Rainer Weikusat wrote:
> "C.DeRykus" <derykus@gmail.com> writes:
> > On Friday, May 15, 2015 at 4:44:17 AM UTC-7, Robert Crandal wrote:
> >> I would like to extract all "words" from a document, and output
> >> in the order that they occur to a file named "out.txt".
> >> 
> >> For example, given this input text:
> >> 
> >> "His light's shone on the J2 building, making the
> >> window-panes glow like so many fires."
> 
> [...]
> 
> >> I prefer to keep hyphens (-) and apostrophes (') that occur
> >> within words.
> >> 
> >> All other characters may be removed, such as commas, periods,
> >> question marks, exclamation points, parentheses, whitespaces,
> >> etc. etc. etc...
> >> 
> >> Also, I prefer to ignore words that contain ALL numbers, or
> >> are a mix of numbers and non-alpha characters.  For example,
> >> ignore:  12345, 05/05/15, (123)456-1564, 234-34-122, etc....
> >> 
> >> Is this best solved with a regular expression?
> >
> > Lots of twists and hairpin curves on this cliff... here's another
> > approach though:
> >
> > while (<STDIN>) {
> >     s/[,.?!()]//g;         # remove these char's
> >
> >     #-- ignore if all num's or a mix of num's,non-alpha
> >     s/(?:^|\s) (?: \d+ | [^[:alpha:]] )+ (?:\s|$)//xg ;
> >
> >     say join("\n",split(" ",$_)) if /\S/;
> > }
> 
> The first step is avoidable as you can as well just split on
> /[\s,.?!()]+/ (regex untested). Of course, the split expression needs to
> be more complicated as this isn't an exhaustive set of punctuation
> characters, some others would be :, ; and ". Even assuming the split set
> was complete, the algorithm would still keep lone single quotes and
> hyphens, something it wasn't supposed to do ...

In spite of the OP's "keep hyphens and apostrophes within words", it
could get even messier since a lone apostrophe on the end of a word is,
strictly speaking, grammatically correct for a plural possessive.


------------------------------

Date: Mon, 18 May 2015 21:38:28 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: Extract all "words"
Message-Id: <87twv9aaez.fsf@doppelsaurus.mobileactivedefense.com>

"C.DeRykus" <derykus@gmail.com> writes:
> On Monday, May 18, 2015 at 11:59:12 AM UTC-7, Rainer Weikusat wrote:
>> "C.DeRykus" <derykus@gmail.com> writes:
>> > On Friday, May 15, 2015 at 4:44:17 AM UTC-7, Robert Crandal wrote:
>> >> I would like to extract all "words" from a document, and output
>> >> in the order that they occur to a file named "out.txt".
>> >> 
>> >> For example, given this input text:
>> >> 
>> >> "His light's shone on the J2 building, making the
>> >> window-panes glow like so many fires."
>> 
>> [...]
>> 
>> >> I prefer to keep hyphens (-) and apostrophes (') that occur
>> >> within words.
>> >> 
>> >> All other characters may be removed, such as commas, periods,
>> >> question marks, exclamation points, parentheses, whitespaces,
>> >> etc. etc. etc...
>> >> 
>> >> Also, I prefer to ignore words that contain ALL numbers, or
>> >> are a mix of numbers and non-alpha characters.  For example,
>> >> ignore:  12345, 05/05/15, (123)456-1564, 234-34-122, etc....
>> >> 
>> >> Is this best solved with a regular expression?

[...]

>> > while (<STDIN>) {
>> >     s/[,.?!()]//g;         # remove these char's
>> >
>> >     #-- ignore if all num's or a mix of num's,non-alpha
>> >     s/(?:^|\s) (?: \d+ | [^[:alpha:]] )+ (?:\s|$)//xg ;
>> >
>> >     say join("\n",split(" ",$_)) if /\S/;
>> > }
>> 
>> The first step is avoidable as you can as well just split on
>> /[\s,.?!()]+/ (regex untested). Of course, the split expression needs to
>> be more complicated as this isn't an exhaustive set of punctuation
>> characters, some others would be :, ; and ". Even assuming the split set
>> was complete, the algorithm would still keep lone single quotes and
>> hyphens, something it wasn't supposed to do ...
>
> In spite of the OP's "keep hyphens and apostrophes within words", it
> could get even messier since a lone apostrophe on the end of a word
> is, strictly speaking, grammatically correct for a plural possessive.

Yes, that's why I used \w[-\w']* as regex for matching words.
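
A short sketch of how that regex behaves on the sample sentence from the
original question. The sub name words and the extra grep that drops tokens
without any letter are assumptions added for illustration; they are not
taken from the thread.

```perl
use strict;
use warnings;

# Return the "words" of $text: runs matching \w[-\w']*, minus tokens
# that contain no letter at all (pure numbers, number fragments).
# (words is a hypothetical helper name.)
sub words {
    my ($text) = @_;
    return grep { /[[:alpha:]]/ } $text =~ /\w[-\w']*/g;
}

my $sample = "His light's shone on the J2 building, making the\n"
           . "window-panes glow like so many fires.";

print "$_\n" for words($sample);
```

Because the pattern must start with \w, a lone hyphen or apostrophe never
starts a match, while one inside or at the end of a word is kept. Note that
\w also matches digits and the underscore.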



------------------------------

Date: Mon, 18 May 2015 14:48:39 -0700 (PDT)
From: "C.DeRykus" <derykus@gmail.com>
Subject: Re: Extract all "words"
Message-Id: <29a1a880-fc3c-45e8-bcef-c6f5a16a30e4@googlegroups.com>

On Monday, May 18, 2015 at 1:38:32 PM UTC-7, Rainer Weikusat wrote:
> "C.DeRykus" <derykus@gmail.com> writes:
> > On Monday, May 18, 2015 at 11:59:12 AM UTC-7, Rainer Weikusat wrote:
> >> "C.DeRykus" <derykus@gmail.com> writes:
> >> >> ...
> >> >> All other characters may be removed, such as commas, periods,
> >> >> question marks, exclamation points, parentheses, whitespaces,
> >> >> etc. etc. etc...
> 
> [...]
> 
> >> > while (<STDIN>) {
> >> >     s/[,.?!()]//g;         # remove these char's
> >> >
> >> >     #-- ignore if all num's or a mix of num's,non-alpha
> >> >     s/(?:^|\s) (?: \d+ | [^[:alpha:]] )+ (?:\s|$)//xg ;
> >> >
> >> >     say join("\n",split(" ",$_)) if /\S/;
> >> > }
> >> 
> >> The first step is avoidable as you can as well just split on
> >> /[\s,.?!()]+/ (regex untested). Of course, the split expression needs to
> >> be more complicated as this isn't an exhaustive set of punctuation
> >> characters, some others would be :, ; and ". Even assuming the split set
> >> was complete, the algorithm would still keep lone single quotes and
> >> hyphens, something it wasn't supposed to do ...
> >
> > In spite of the OP's "keep hyphens and apostrophes within words", it
> > could get even messier since a lone apostrophe on the end of a word
> > is, strictly speaking, grammatically correct for a plural possessive.
> 
> Yes, that's why I used \w[-\w']* as regex for matching words.

Ah, it was hard to see... Hm, but that'll pick up a trailing hyphen too. 

Also IMO stylistically it's clearer to remove the unwanted's separately
rather than embedding them in the regex (even if you were to stick 'em
in a variable).

Regexes get overloaded quickly and lots of white space via /x always helps.




------------------------------

Date: Mon, 18 May 2015 23:16:38 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: Extract all "words"
Message-Id: <87pp5xa5vd.fsf@doppelsaurus.mobileactivedefense.com>

"C.DeRykus" <derykus@gmail.com> writes:
> On Monday, May 18, 2015 at 1:38:32 PM UTC-7, Rainer Weikusat wrote:
>> "C.DeRykus" <derykus@gmail.com> writes:
>> > On Monday, May 18, 2015 at 11:59:12 AM UTC-7, Rainer Weikusat wrote:
>> >> "C.DeRykus" <derykus@gmail.com> writes:
>> >> >> ...
>> >> >> All other characters may be removed, such as commas, periods,
>> >> >> question marks, exclamation points, parentheses, whitespaces,
>> >> >> etc. etc. etc...
>> 
>> [...]
>> 
>> >> > while (<STDIN>) {
>> >> >     s/[,.?!()]//g;         # remove these char's
>> >> >
>> >> >     #-- ignore if all num's or a mix of num's,non-alpha
>> >> >     s/(?:^|\s) (?: \d+ | [^[:alpha:]] )+ (?:\s|$)//xg ;
>> >> >
>> >> >     say join("\n",split(" ",$_)) if /\S/;
>> >> > }
>> >> 
>> >> The first step is avoidable as you can as well just split on
>> >> /[\s,.?!()]+/ (regex untested). Of course, the split expression needs to
>> >> be more complicated as this isn't an exhaustive set of punctuation
>> >> characters, some others would be :, ; and ". Even assuming the split set
>> >> was complete, the algorithm would still keep lone single quotes and
>> >> hyphens, something it wasn't supposed to do ...
>> >
>> > In spite of the OP's "keep hyphens and apostrophes within words", it
>> > could get even messier since a lone apostrophe on the end of a word
>> > is, strictly speaking, grammatically correct for a plural possessive.
>> 
>> Yes, that's why I used \w[-\w']* as regex for matching words.
>
> Ah, it was hard to see... Hm, but that'll pick up a trailing hyphen
> too.

This will at least prevent it from suing for grapheme discrimination ...

> Also IMO stylistically it's clearer to remove the unwanted's
> separately rather than embedding them in the regex (even if you were to
> stick 'em in a variable).

Assuming a certain mental model for viewing the problem (which could be
described as "a text is a sequence of white-space separated words which
may have been contaminated with punctuation") it probably is. But you
can equally well just regard the input as an alternating sequence of
"sequence of characters we want" and "sequence of characters we don't
want". Hence, splitting the text on 'sequence of unwanted characters'
will result in a list of wanted sequences, as will 'matching sequences
of wanted characters'.
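
The two views can be put side by side in a few lines. This is a sketch
only: the "wanted" set [-\w'] and the sub names by_match and by_split are
chosen for illustration and are not a full answer to the original word
extraction problem.

```perl
use strict;
use warnings;

# View 1: match runs of wanted characters.
sub by_match {
    my ($text) = @_;
    return $text =~ /[-\w']+/g;
}

# View 2: split on runs of unwanted characters. A leading separator
# produces one empty leading field, which the grep removes.
sub by_split {
    my ($text) = @_;
    return grep { length } split /[^-\w']+/, $text;
}

my $text = '"glow, like (so) many fires."';

print join(' ', by_match($text)), "\n";
print join(' ', by_split($text)), "\n";
```

The only asymmetry is at the start of the text, where split reports an
empty field if the input begins with unwanted characters.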

"It's all in the mind" :-)


------------------------------

Date: Mon, 18 May 2015 21:47:40 -0700 (PDT)
From: "C.DeRykus" <derykus@gmail.com>
Subject: Re: Extract all "words"
Message-Id: <29f3c630-dd2c-4eb4-ba44-967941a4b274@googlegroups.com>

On Monday, May 18, 2015 at 3:16:42 PM UTC-7, Rainer Weikusat wrote:
> "C.DeRykus" <derykus@gmail.com> writes:
> ...
> 
> This will at least prevent it from suing for grapheme discrimination ...
> 
> > Also IMO stylistically it's clearer to remove the unwanted's
> > separately rather than embedding them in the regex (even if you were to
> > stick 'em in a variable).
> 
> Assuming a certain mental model for viewing the problem (which could be
> described as "a text is a sequence of white-space separated words which
> may have been contaminated with punctuation") it probably is. But you
> can equally well just regard the input as an alternating sequence of
> "sequence of characters we want" and "sequence of characters we don't
> want". Hence, splitting the text on 'sequence of unwanted characters'
> will result in a list of wanted sequences, as will 'matching sequences
> of wanted characters'.
> 
> "It's all in the mind" :-)

[off topic]

Ah, the Zen of Perl practice. Smash a contaminated stack full of bad ch'i with a single split. Blindfold optional :)

[/off topic]


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 4435
***************************************

