[33153] in Perl-Users-Digest


Perl-Users Digest, Issue: 4432 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sat May 16 03:09:21 2015

Date: Sat, 16 May 2015 00:09:05 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Sat, 16 May 2015     Volume: 11 Number: 4432

Today's topics:
        Extract all "words" <noreply2me@yahoo.com>
    Re: Extract all "words" <jurgenex@hotmail.com>
    Re: Extract all "words" <gamo@telecable.es>
    Re: Extract all "words" <rweikusat@mobileactivedefense.com>
    Re: interior arrows <rweikusat@mobileactivedefense.com>
    Re: wordx....not_wordx...wordy   pattern matching. deangwilliam30@gmail.com
    Re: wordx....not_wordx...wordy   pattern matching. <jurgenex@hotmail.com>
    Re: wordx....not_wordx...wordy   pattern matching. <jurgenex@hotmail.com>
    Re: wordx....not_wordx...wordy   pattern matching. <jurgenex@hotmail.com>
    Re: wordx....not_wordx...wordy   pattern matching. deangwilliam30@gmail.com
    Re: wordx....not_wordx...wordy   pattern matching. <bauhaus@futureapps.invalid>
    Re: wordx....not_wordx...wordy   pattern matching. deangwilliam30@gmail.com
    Re: wordx....not_wordx...wordy   pattern matching. deangwilliam30@gmail.com
    Re: wordx....not_wordx...wordy   pattern matching. deangwilliam30@gmail.com
    Re: wordx....not_wordx...wordy   pattern matching. <rweikusat@mobileactivedefense.com>
    Re: wordx....not_wordx...wordy   pattern matching. deangwilliam30@gmail.com
    Re: wordx....not_wordx...wordy   pattern matching. <rweikusat@mobileactivedefense.com>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Fri, 15 May 2015 04:44:11 -0700
From: "Robert Crandal" <noreply2me@yahoo.com>
Subject: Extract all "words"
Message-Id: <trWdncrja7WRQcjInZ2dnUVZ5o6dnZ2d@giganews.com>

I would like to extract all "words" from a document, and output
in the order that they occur to a file named "out.txt".

For example, given this input text:

"His light's shone on the J2 building, making the
window-panes glow like so many fires."

Then, the outfile should be:

His
light's
shone
on
the
J2
building
making
the
window-panes
glow
like
so
many
fires

I prefer to keep hyphens (-) and apostrophes (') that occur
within words.

All other characters may be removed, such as commas, periods,
question marks, exclamation points, parentheses, whitespaces,
etc. etc. etc...

Also, I prefer to ignore words that contain ALL numbers, or
are a mix of numbers and non-alpha characters.  For example,
ignore:  12345, 05/05/15, (123)456-1564, 234-34-122, etc....

Is this best solved with a regular expression?





------------------------------

Date: Fri, 15 May 2015 04:59:18 -0700
From: Jürgen Exner <jurgenex@hotmail.com>
Subject: Re: Extract all "words"
Message-Id: <nhnbladathkqh3b1ajf59mcgcrh6vvlha4@4ax.com>

"Robert Crandal" <noreply2me@yahoo.com> wrote:
>I would like to extract all "words" from a document, and output
>in the order that they occur to a file named "out.txt".
>
>For example, given this input text:
>
>"His light's shone on the J2 building, making the
>window-panes glow like so many fires."
>
>Then, the outfile should be:
>
>His
>light's
>shone
>on
>the
>J2
>building
>making
>the
>window-panes
>glow
>like
>so
>many
>fires
>
>I prefer to keep hyphens (-) and apostrophes (') that occur
>within words.
>
>All other characters may be removed, such as commas, periods,
>question marks, exclamation points, parentheses, whitespaces,
>etc. etc. etc...
>
>Also, I prefer to ignore words that contain ALL numbers, or
>are a mix of numbers and non-alpha characters.  For example,
>ignore:  12345, 05/05/15, (123)456-1564, 234-34-122, etc....
>
>Is this best solved with a regular expression?

Thank you for providing a very good problem description together with a
sample input and desired output. It seems there are not many people
around who bother about those details any more.

Having said that: it looks to me as if split() does all of this already
except for the all-numbers part, which I would just do in a second step
with a simple grep() filter.
Is there some complication that I am missing with this simple solution?

jue 
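
A minimal sketch of the split()-plus-grep() idea described above, not a
tested solution for real documents. The sub name extract_words and the
sample text are made up for illustration, and it assumes that "letter"
means an ASCII letter. Because - and ' are not separators here, stray
ones at the edges of a token are trimmed in an extra step.

```perl
use strict;
use warnings;

# Split on runs of characters that cannot be part of a word, then keep
# only the tokens that contain at least one ASCII letter.
sub extract_words {
    my ($text) = @_;
    my @tokens = split /[^A-Za-z0-9'-]+/, $text;
    # Hyphens and apostrophes only count inside a word, so trim stray ones.
    s/^['-]+|['-]+$//g for @tokens;
    return grep { /[A-Za-z]/ } @tokens;
}

my $sample = qq{"His light's shone on the J2 building, making the\n}
           . qq{window-panes glow like so many fires." 12345 05/05/15\n};
print "$_\n" for extract_words($sample);
```

Writing the result to "out.txt" would only need the print redirected to
a filehandle opened for writing.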


------------------------------

Date: Fri, 15 May 2015 15:45:14 +0200
From: gamo <gamo@telecable.es>
Subject: Re: Extract all "words"
Message-Id: <mj4t99$cqr$1@speranza.aioe.org>

On 15/05/15 at 13:44, Robert Crandal wrote:
> I would like to extract all "words" from a document, and output
> in the order that they occur to a file named "out.txt".
>
> For example, given this input text:
>
> "His light's shone on the J2 building, making the
> window-panes glow like so many fires."
>
> Then, the outfile should be:
>
> His
> light's
> shone
> on
> the
> J2
> building
> making
> the
> window-panes
> glow
> like
> so
> many
> fires
>
> I prefer to keep hyphens (-) and apostrophes (') that occur
> within words.
>
> All other characters may be removed, such as commas, periods,
> question marks, exclamation points, parentheses, whitespaces,
> etc. etc. etc...
>
> Also, I prefer to ignore words that contain ALL numbers, or
> are a mix of numbers and non-alpha characters.  For example,
> ignore:  12345, 05/05/15, (123)456-1564, 234-34-122, etc....
>
> Is this best solved with a regular expression?
>
>
>

It could be something like this, to be improved by the community:

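#!/usr/bin/perl -w
use strict;

# $c:    running count of tokens read so far
# %hash: token => position of its first occurrence
# @word: position => token
my ($c, %hash, @word);

while (<>){
     chomp;
     # whitespace-separated tokens; punctuation stays attached to them
     my @words = split /\s+/;
     for (@words){
         $c++;
         # keep only the first position at which each token was seen
         $hash{$_} = ((defined $hash{$_}) ? $hash{$_} : $c) ;
         $word[$c] = $_;
     }
}

# print each distinct token once, in order of first occurrence,
# skipping tokens that contain no word character at all
for (sort {$a<=>$b} values %hash){
     if ($word[$_] =~ /\w/){
         print $word[$_], "\n";
     }
}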




-- 
http://www.telecable.es/personales/gamo/
The generation of random numbers is too important to be left to chance


------------------------------

Date: Fri, 15 May 2015 14:46:24 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: Extract all "words"
Message-Id: <87fv6yq7gv.fsf@doppelsaurus.mobileactivedefense.com>

Jürgen Exner <jurgenex@hotmail.com> writes:
> "Robert Crandal" <noreply2me@yahoo.com> wrote:
>>I would like to extract all "words" from a document, and output
>>in the order that they occur to a file named "out.txt".
>>
>>For example, given this input text:
>>
>>"His light's shone on the J2 building, making the
>>window-panes glow like so many fires."

[...]

>>I prefer to keep hyphens (-) and apostrophes (') that occur
>>within words.

[...]

>>Also, I prefer to ignore words that contain ALL numbers, or
>>are a mix of numbers and non-alpha characters.  For example,
>>ignore:  12345, 05/05/15, (123)456-1564, 234-34-122, etc....

[...]

> Having said that: it looks to me as if split() does all of this already
> except for the all-numbers part, which I would just do in a second step
> with a simple grep() filter.
> Is there some complication that I am missing with this simple solution?

Neither - nor ' is considered a word character (\w), so a split
expression would need to be something like 'a non-word character, except
for - or ' when preceded by a word character'. This may be possible
but it's much more complicated than necessary: Matching the 'wanted'
parts of the text is easier. Two possible solutions (which may and
likely will miss all kinds of corner cases; most prominently, they
assume that 'letter' equals 'roman letter from the US alphabet'):

-----
while (<STDIN>) {
    /\G(\w[-\w']*)/gc && do {
	print("$1\n") unless $1 =~ /^[^[:alpha:]]+$/;
	redo;
    };
	
    /\G\W+/g and redo;
}
-----

This processes the input line by line, extracting subsequent words from
lines as they appear in the input text. An alternate solution, which reads
all of the text into memory and employs a kind of pipeline for extracting
the output:

-----
@words = grep { /[A-Za-z]/ }  map { /\w[-\w']*/g } <STDIN>;
print(join("\n", @words), "\n");
-----

In addition to only considering 'POSIX letters' as letters, the 2nd also
assumes that letter encodings are contiguous. That's true for ASCII and
anything built upon that but ASCII is an opinion about how to encode
letters and not everyone agrees with it.

It may also be possible to create a regex which matches any 'word'
that's not solely composed of non-alphabetic characters although it may
contain some but IMHO, that's not worth the effort except 'for
educational purposes'.
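
For those educational purposes, one way such a single regex might look,
as a sketch only. The names $WORD and words_with_letters are made up
here, and 'letter' again means an ASCII letter. A lookahead checks that
the run of word characters, hyphens and apostrophes starting at the
current position contains a letter before the word itself is matched.

```perl
use strict;
use warnings;

# A word starts with \w and continues with \w, - or '. The lookahead
# requires an ASCII letter somewhere in that same run, so all-digit
# tokens such as 12345 or 234-34-122 never match.
my $WORD = qr/(?=[\w'-]*[A-Za-z])\w[-\w']*/;

sub words_with_letters {
    my ($text) = @_;
    return $text =~ /$WORD/g;    # list context: every match, in order
}

print "$_\n" for words_with_letters("the J2 building, 05/05/15, (123)456-1564");
```

Like the two solutions above, this keeps a trailing - or ' if one
directly follows a word.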


------------------------------

Date: Thu, 14 May 2015 16:24:44 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: interior arrows
Message-Id: <87k2wbrxkz.fsf@doppelsaurus.mobileactivedefense.com>

ruben safir <ruben@mrbrklyn.com> writes:
> On 03/15/2015 05:32 PM, Rainer Weikusat wrote:
>> Omitting arrows between bracketed subscripts is possibly not such a good
>> idea as this may adversely affect legibility. Case in point:
>> 
>> sub get_obj_at
>> {
>>     return $_[0]->[IN_ORDER]->[$_[1]];
>> }
>> 
>> versus
>> 
>> sub get_obj_at
>> {
>>     return $_[0][IN_ORDER][$_[1]];
>> }
>
> the second one is clearer.

This statement makes little sense as they're both semantically
identical. The second one is closer to C syntax so people more familiar
with that _may_ find it easier to read. OTOH, for me, the similar
characters immediately adjacent to each other (][) tend to blur into
each other.
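
A small sketch illustrating that the two spellings are interchangeable.
The constant value, the sub names and the sample data are made up for
the demonstration; only the two return expressions come from the quoted
code.

```perl
use strict;
use warnings;

use constant IN_ORDER => 0;    # hypothetical slot index, for illustration

# Explicit arrows between the bracketed subscripts.
sub get_obj_at_arrows { return $_[0]->[IN_ORDER]->[$_[1]]; }

# Arrows omitted: between subscripts the arrow is optional.
sub get_obj_at_plain  { return $_[0][IN_ORDER][$_[1]]; }

my $self = [ [ 'first', 'second', 'third' ] ];
print get_obj_at_arrows($self, 1), "\n";
print get_obj_at_plain($self, 1), "\n";
```

Both calls print the same element, so the choice is purely one of
legibility.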


------------------------------

Date: Thu, 14 May 2015 04:10:28 -0700 (PDT)
From: deangwilliam30@gmail.com
Subject: Re: wordx....not_wordx...wordy   pattern matching.
Message-Id: <d40b7c23-bc9a-4ce1-bdb4-5e164d66ef22@googlegroups.com>

Justin thank you for responding...I'd be grateful if you could please show me how to extract substrings like...

function blob1  closest function line preceding first begin
var
some var
begin

procedure blob2 closest proc preceding first begin
var
somevar...
somevar...
begin 

procedure blob3;
begin

from a Delphi unit i.e. the tightest procedure/function..begin pairings possible.
Again...the only words you can search for are "procedure", "function" and "begin" and only those that are at the start of a line save for perhaps <tab> and <space>.



------------------------------

Date: Thu, 14 May 2015 04:22:52 -0700
From: Jürgen Exner <jurgenex@hotmail.com>
Subject: Re: wordx....not_wordx...wordy   pattern matching.
Message-Id: <j719la9qkicq2vn52r2478lhkp6gubts12@4ax.com>

deangwilliam30@gmail.com wrote:
>The skeleton above and the pseudocode seem pretty clear to me.

There is/was nothing above this line of yours. No skeleton.
And your posting didn't contain any pseudo-code either.

>Three lines wouldn't explain that.
>Thanks for your input though.

Whose input? What are you referring to?
You didn't reference anything and you didn't quote anything. So how is
anybody to know what input you mean and whom you are thanking?

jue 


------------------------------

Date: Thu, 14 May 2015 04:26:23 -0700
From: Jürgen Exner <jurgenex@hotmail.com>
Subject: Re: wordx....not_wordx...wordy   pattern matching.
Message-Id: <mc19la102hna9voeq9lffmobcvgoa68jsm@4ax.com>

deangwilliam30@gmail.com wrote:
>btw asking for "lines" of "lines" is senseless. There's no such thing.

What are you talking about?
Usenet is a distributed, asynchronous medium. You can never assume that
a specific article has been delivered to a client, ever will be
delivered, or ever will be visible to the client.
Therefore for decades it has been a proven custom to quote enough
content such that each article is self-contained and can be understood
on its own content.

jue 


------------------------------

Date: Thu, 14 May 2015 04:28:47 -0700
From: Jürgen Exner <jurgenex@hotmail.com>
Subject: Re: wordx....not_wordx...wordy   pattern matching.
Message-Id: <4k19lallrb51rgd72gicam3adqc0evpibv@4ax.com>

deangwilliam30@gmail.com wrote:
>To re-iterate...all we can rely on is that there are the words "procedure", "function" and "begin" at the start of some of the lines albeit possibly preceded by <space> or <tab> and the substrings to be extracted are...

This is getting truly irritating. Would you please limit your line
length to ~75 characters as has been customary in Usenet for decades?

And would you please quote sufficient context such that your postings
make sense?

Thank you

jue


------------------------------

Date: Thu, 14 May 2015 04:40:48 -0700 (PDT)
From: deangwilliam30@gmail.com
Subject: Re: wordx....not_wordx...wordy   pattern matching.
Message-Id: <57136c8b-b7e0-4bd8-8182-0ab1ba1643be@googlegroups.com>

jue thank you for responding...
I'd be grateful 
if you could please show me how to extract substrings like...

function blob1  closest function line preceding first begin
var
some var
begin

procedure blob2 closest proc preceding first begin
var
somevar...
somevar...
begin

procedure blob3;
begin

from a Delphi unit i.e. the tightest procedure/function...
begin pairings possible.
Again...the only words you can search for are 
"procedure", "function" and "begin" 
and only those that are at the start of a line save for 
perhaps <tab> and <space>. 

BTW I'm seeing everything that you're not on 
https://groups.google.com/forum/#!topic/comp.lang.perl.misc
I know nothing about usenet whatsoever.


------------------------------

Date: Thu, 14 May 2015 13:42:25 +0200
From: Georg Bauhaus <bauhaus@futureapps.invalid>
Subject: Re: wordx....not_wordx...wordy   pattern matching.
Message-Id: <mj21kt$7qn$1@dont-email.me>

On 13.05.15 22:35, deangwilliam30@gmail.com wrote:
> The pseudocode is this...

Let's say it's a problem statement?

I'm guessing that you want to find all (non-?)nested procedures and
functions of a Pascal/Delphi/... program text, specifically,
those from the "implementation" modules. (Only the innermost has
a "block of interest"?)

You'd assume that "begin" never follows some other word, does
not appear in a string literal, and always as a whole word on its own?

> make sure there's only one word at the start of a line called "implementation" and ignore everything before it.

Read lines, in Perl, then.
   while ($line = <$filehandle>) {

Ignore lines until one line starts with "implementation"
   use a pattern that probably has '^' and '\s*' in addition to
   at least 'implementation'
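
A rough sketch of those two steps, under a few assumptions: the sub
name lines_after_implementation and the sample unit are invented, the
keyword is matched case-insensitively (Pascal keywords are), and the
line is expected to hold nothing but "implementation".

```perl
use strict;
use warnings;

# Read from an open filehandle and return every line that follows the
# first line consisting only of the word "implementation".
sub lines_after_implementation {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        last if $line =~ /^\s*implementation\s*$/i;
    }
    my @rest = <$fh>;    # empty if the keyword line was never found
    return @rest;
}

my $unit = "unit Demo;\ninterface\nimplementation\nprocedure Foo;\nbegin\nend;\n";
open my $in, '<', \$unit or die "cannot open in-memory file: $!";
print lines_after_implementation($in);
```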


What follows next would be alternating search of the text for
"procedure/function" and "begin", or something close to that. I'm not sure
how you want to handle:

  procedure N ...

     procedure M ...
     begin...
     end;

     function F ...
     begin...
     end;

begin  (* N *)

Does M not qualify as having a "block of interest"?

But seriously, what you are after, is, I think, tricky (if a Pascal
parser is), even when matching balanced "brackets". (Snobol-4 has a
BAL operator among its matchers, developed for use with context free
grammars, but limited to matching parentheses). A recursive approach
seems almost inescapable. Given that the input text seems to be some
dialect of Pascal, and if it is not very "regular", I'd expect to be
facing a few surprises. Professional services of translation or
analysis might be an existing option, if that's within reach.


I agree that the Googleian layout of your postings, none of them
showing structure other than very long lines, is difficult to read,
physically.



------------------------------

Date: Thu, 14 May 2015 04:56:23 -0700 (PDT)
From: deangwilliam30@gmail.com
Subject: Re: wordx....not_wordx...wordy   pattern matching.
Message-Id: <6b93ddbd-de5f-42fc-a18c-c58ee396fb09@googlegroups.com>

On Friday, May 8, 2015 at 9:07:20 AM UTC+1, deangwi...@gmail.com wrote:
> I've written a program in Powerbasic that matches the closest/least greedy "procedure/function" and "begin" pairs (in a Delphi program) so that I can insert a trace statement immediately after the first "begin" in procedures and functions.
> 
> I've looked at Perl and Tcl and see that there are things called negative lookaheads (that Powerbasic doesn't have) and can't help feeling that these might make the matching trivial but so far all of my efforts have failed.
> 
> What I'd like to do is match "proc2_edf_begin" in "proc1_abc_proc2_edf_begin_ghi_begin_end".
> 
> Any advice much appreciated.
> 
> BTW I'm not aware of the 2 in proc2 ahead of time...I'm just looking for the tightest procedure/function___begin couplings i.e. there is the odd stray procedure/function and there are potentially lots of begins in each proc/fn.

Georg
Thanks for your reply.
Sorry about my unfamiliarity with Usenet requirements.
As stated...I can see everything very well from Google Groups.
There won't be any nesting of procedures or functions.
It's just a question of (as you rightly say) getting past the 
line "implementation".
Then tracking the position of each "function/procedure" 
until you hit a "begin".
When you hit a begin...you want to return the text block
starting at the last procedure/function you saw up to the 
end of the begin.
You then want to repeat the process i.e. 
tracking all "procedure/function" words again 
until you hit another "begin".
Hope that explains.
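
For the one-line example in the quoted question, the negative lookahead
mentioned there could be used roughly as follows. This is a sketch
under the assumption that the literal markers are "proc" and "begin";
the sub name tightest_pair is made up.

```perl
use strict;
use warnings;

# Match "proc", then as few characters as possible up to "begin",
# refusing to step over any position where another "proc" starts.
sub tightest_pair {
    my ($text) = @_;
    return $text =~ /(proc(?:(?!proc).)*?begin)/s ? $1 : undef;
}

print tightest_pair("proc1_abc_proc2_edf_begin_ghi_begin_end"), "\n";
# prints "proc2_edf_begin"
```

The attempt starting at proc1 fails because the lookahead forbids
crossing proc2, so the first successful match starts at proc2 and stops
at the first begin after it.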


------------------------------

Date: Thu, 14 May 2015 05:07:39 -0700 (PDT)
From: deangwilliam30@gmail.com
Subject: Re: wordx....not_wordx...wordy   pattern matching.
Message-Id: <c8bc760b-c213-44e1-ac05-602e549e092d@googlegroups.com>

I'd be grateful 
if someone could please show me how to extract substrings like...

function blob1  closest function line preceding first begin
var
some var
begin

procedure blob2 closest proc preceding first begin
var
somevar...
somevar...
begin

procedure blob3;
begin

from a Delphi unit i.e. the tightest procedure/function..begin pairings possible. 

Here's the recipe I used...
read file
assert there's only one line containing the sole word "implementation"
ignore all words before the line containing only "implementation"
label1:
  record the position of all instances of "procedure" or "function"..
    that starts a line (save for space or tab)
  until you see a line solely containing "begin" (save for space or tab)
  when you do see that line...return all lines from the last "procedure"..
    or "function" line up to and including the "begin" line.
goto label1
Hope this helps


------------------------------

Date: Thu, 14 May 2015 05:12:43 -0700 (PDT)
From: deangwilliam30@gmail.com
Subject: Re: wordx....not_wordx...wordy   pattern matching.
Message-Id: <7f39b91f-fd2d-4e8e-90c9-7419c8fa33e0@googlegroups.com>

How do I do this in Perl
read file
assert there's only one line containing the sole word "implementation"
ignore all words before the line containing only "implementation"
label1:
	record the position of all instances of "procedure" or "function"..
		that starts a line (save for space or tab)
	until you see a line solely containing "begin" (save for space or tab)
	when you do see that line...return all lines from the last "procedure"..
		or "function" line up to and including the "begin" line.
goto label1
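
One way the recipe above might be written in Perl, as a sketch rather
than a Delphi parser. The sub name proc_begin_blocks and the demo lines
are made up. It assumes case-insensitive keywords, no nested routines,
and that "begin" never appears at the start of a line inside a comment
or string literal.

```perl
use strict;
use warnings;

# Return the tightest "procedure/function ... begin" blocks found after
# the single "implementation" line. Takes and returns lines with newlines.
sub proc_begin_blocks {
    my @lines = @_;

    # Assert there is exactly one line holding only "implementation".
    my @impl = grep { $lines[$_] =~ /^\s*implementation\s*$/i } 0 .. $#lines;
    die "expected exactly one 'implementation' line\n" unless @impl == 1;

    my (@blocks, $start);
    for my $i ($impl[0] + 1 .. $#lines) {
        if ($lines[$i] =~ /^[ \t]*(?:procedure|function)\b/i) {
            $start = $i;            # a later header replaces an earlier one
        }
        elsif (defined $start && $lines[$i] =~ /^[ \t]*begin[ \t]*$/i) {
            push @blocks, join '', @lines[$start .. $i];
            undef $start;           # wait for the next header
        }
    }
    return @blocks;
}

my @demo = map { $_ . "\n" } (
    'unit Demo;', 'implementation',
    'procedure blob0;',
    'procedure blob1;', 'var', 'begin', 'begin', 'end;', 'end;',
    'function blob2: integer;', 'begin', 'end;',
);
print $_, "---\n" for proc_begin_blocks(@demo);
```

The "goto label1" of the recipe becomes the for loop: after each block
is returned, $start is cleared and tracking starts over.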


------------------------------

Date: Thu, 14 May 2015 16:50:57 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: wordx....not_wordx...wordy   pattern matching.
Message-Id: <87d223rwda.fsf@doppelsaurus.mobileactivedefense.com>

deangwilliam30@gmail.com writes:
> here's a skeleton of the data file that gets slurped into the string.
> procedure blob0 a stray proc line
>
> procedure blob1 closest proc preceding first begin
> var
> somevar...
> begin
> <insert trace line here
>
> begin
> end;
>
> end;
>
>
> function blob2  closest function line preceding first begin
> var
> some var
> begin
> <insert trace line here
> begin
> end;
>
> end;

Below is a quickly done example which is good enough to process your
example (and almost certainly not good enough to process any real input)

---------
undef $/;
my $data = <STDIN>;
my $proc;

for ($data) {
    /\G([^pfb]+)/gc && do {
	print($1);
	redo;
    };
    
    /\G((?:procedure|function)\s+(\S+))/gc && do {
	$proc = $2;
	print($1);

	redo;
    };

    /\G(begin\s*?\n)/gc && do {
	print($1);

	if ($proc) {
	    print("trace(\'$proc\');\n");
	    $proc = undef;
	}

	redo;
    };

    /\G([pfb])/gc and print($1), redo;
}


------------------------------

Date: Fri, 15 May 2015 01:31:28 -0700 (PDT)
From: deangwilliam30@gmail.com
Subject: Re: wordx....not_wordx...wordy   pattern matching.
Message-Id: <ba335511-3209-4af5-bd0c-e9bda7cb1997@googlegroups.com>

below is a response to...
> here's a skeleton of the data file that gets slurped into the string.
> procedure blob0 a stray proc line
>
> procedure blob1 closest proc preceding first begin
> var
> somevar...
> begin
> <insert trace line here
>
> begin
> end;
>
> end;
>
>
> function blob2  closest function line preceding first begin
> var
> some var
> begin
> <insert trace line here
> begin
> end;
>
> end;

Below is a quickly done example which is good enough to process your
example (and almost certainly not good enough to process any real input)

---------
undef $/;
my $data = <STDIN>;
my $proc;

for ($data) {
    /\G([^pfb]+)/gc && do {
        print($1);
        redo;
    };
   
    /\G((?:procedure|function)\s+(\S+))/gc && do {
        $proc = $2;
        print($1);

        redo;
    };

    /\G(begin\s*?\n)/gc && do {
        print($1);

        if ($proc) {
            print("trace(\'$proc\');\n");
            $proc = undef;
        }

        redo;
    };

    /\G([pfb])/gc and print($1), redo;
}


now the response==========================================

Thank you very much for your program...Here's its output for the skeleton

perl test3.pl< test_fl.txt

procedure blob1 closest proc preceding first begin
trace('blob1');
var
somevar...
begin
<insert trace line here

begin
end;

end;


function blob2  closest function line preceding first begin
trace('line');
var
some var
begin
<insert trace line here
begin
end;

end;

Just to clarify I was hoping for... 

procedure blob1 closest proc preceding first begin
trace('blob1');
var
somevar...
begin

and 

function blob2  closest function line preceding first begin
trace('line');
var
some var
begin
 
Nonetheless...
I learned a lot from reading through your program 
to see how Perl worked so thank you very much 
for providing it. 
Best Regards Dean



------------------------------

Date: Fri, 15 May 2015 14:03:38 +0100
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: wordx....not_wordx...wordy   pattern matching.
Message-Id: <87mw16q9g5.fsf@doppelsaurus.mobileactivedefense.com>

deangwilliam30@gmail.com writes:
> below is a response to...
>> here's a skeleton of the data file that gets slurped into the string.
>> procedure blob0 a stray proc line
>>
>> procedure blob1 closest proc preceding first begin
>> var
>> somevar...
>> begin
>> <insert trace line here
>>
>> begin
>> end;
>>
>> end;
>>
>>
>> function blob2  closest function line preceding first begin
>> var
>> some var
>> begin
>> <insert trace line here
>> begin
>> end;
>>
>> end;
>
> Below is a quickly done example which is good enough to process your
> example (and almost certainly not good enough to process any real input)
>
> ---------

[...]

> function blob2  closest function line preceding first begin

[...]

Have you considered deleting the stupid inline comments, so that the
'function' preceding the first begin is not the second function on this
line and the first begin is not the begin at the end of it?
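
If the inline comments have to stay, a line-oriented variant that only
accepts keywords at the start of a line should tolerate them. This is a
sketch with made-up names (add_traces, $skeleton), not the program
posted earlier in the thread, and it is just as unlikely to survive
real Delphi input. It assumes that the routine name is the first word
after the keyword and that the wanted "begin" stands alone on its line.

```perl
use strict;
use warnings;

# Copy the text line by line; after the first stand-alone "begin" that
# follows a procedure/function header, add a trace line for that routine.
sub add_traces {
    my ($text) = @_;
    my ($proc, @out);
    for my $line (split /^/, $text) {    # keeps the newlines
        push @out, $line;
        if ($line =~ /^[ \t]*(?:procedure|function)[ \t]+(\w+)/i) {
            $proc = $1;                  # only a keyword at line start counts
        }
        elsif (defined $proc && $line =~ /^[ \t]*begin[ \t]*$/i) {
            $out[-1] .= "\n" unless $line =~ /\n\z/;
            push @out, "trace('$proc');\n";
            undef $proc;                 # later begins belong to the body
        }
    }
    return join '', @out;
}

my $skeleton = <<'END';
procedure blob0 a stray proc line

procedure blob1 closest proc preceding first begin
var
somevar...
begin

begin
end;

end;
END
print add_traces($skeleton);
```

Here the "begin" at the end of the header line is ignored because the
line matches the header pattern first, and the second "function" on a
header line is never looked at.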


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 4432
***************************************
