[24635] in Perl-Users-Digest
Perl-Users Digest, Issue: 6799 Volume: 10
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Tue Aug 3 12:25:57 2004
Date: Tue, 3 Aug 2004 09:25:33 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Tue, 3 Aug 2004 Volume: 10 Number: 6799
Today's topics:
best way to do this? (MJL)
Re: best way to do this? <noreply@gunnar.cc>
Re: best way to do this? (Anno Siegel)
Re: best way to do this? <abigail@abigail.nl>
Re: best way to do this? <bowsayge@nomail.afraid.org>
Re: best way to do this? <bernard.el-haginDODGE_THIS@lido-tech.net>
Re: best way to do this? (MJL)
Re: best way to do this? <bowsayge@nomail.afraid.org>
Re: best way to do this? <dwall@fastmail.fm>
Re: best way to do this? <bernard.el-haginDODGE_THIS@lido-tech.net>
Re: best way to do this? <dwall@fastmail.fm>
Bizarre PerlScript/WSH/UTF-8 problem <corff@cis.fu-berlin.de>
Re: Bizarre PerlScript/WSH/UTF-8 problem <corff@cis.fu-berlin.de>
Re: Bizarre PerlScript/WSH/UTF-8 problem <corff@cis.fu-berlin.de>
bootstrap ?? <bsinha@qualcomm.com>
Re: bootstrap ?? <nobull@mail.com>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: 26 Jul 2004 13:12:59 -0700
From: mail@affordablemedicalsoftware.com (MJL)
Subject: best way to do this?
Message-Id: <29c5bb08.0407261212.13f50d78@posting.google.com>
I'm sure this is not the most efficient way to accomplish my goal of
taking a file of text and converting it into a list of individual
words and punctuation symbols. It works, but I am curious about how
to do it differently. Thanks!
#!/usr/bin/perl
open INF, "./testfile1.txt";
while (<INF>)
{
    @words = split;
    push @list, @words;
}
foreach(@list)
{
    /\S+\w+/;
    if ($& ne "") {push @list2, "$&\n";}
    if ($' ne "") {push @list2, "$'\n";}
}
open OUTF, ">./testfile2.txt";
print OUTF @list2;
close INF;
close OUTF;
------------------------------
Date: Mon, 26 Jul 2004 23:38:52 +0200
From: Gunnar Hjalmarsson <noreply@gunnar.cc>
Subject: Re: best way to do this?
Message-Id: <2mlc30Fo3r25U1@uni-berlin.de>
MJL wrote:
> I'm sure this is not the most efficient way to accomplish my goal
> of taking a file of text and converting it into a list of
> individual words and punctuation symbols. It works, but I am
> curious about how to do it differently. Thanks!
>
> #!/usr/bin/perl
> open INF, "./testfile1.txt";
> while (<INF>)
> {
> @words = split;
> push @list, @words;
> }
>
> foreach(@list)
> {
> /\S+\w+/;
> if ($& ne "") {push @list2, "$&\n";}
> if ($' ne "") {push @list2, "$'\n";}
> }
>
>
> open OUTF, ">./testfile2.txt";
> print OUTF @list2;
> close INF;
> close OUTF;
Well, I think this accomplishes the same thing, but without the @arrays:
#!/usr/bin/perl
use strict;
use warnings;
open INF, './testfile1.txt' or die $!;
open OUTF, '> ./testfile2.txt' or die $!;
while (<INF>) {
    while( /(\S+\w+)(\S+)?/g ) {
        print OUTF "$1\n";
        print OUTF "$2\n" if $2;
    }
}
close INF;
close OUTF;
__END__
Another thing is whether it actually does what you want...
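To see what that remark may be pointing at, here is a minimal sketch that runs the same regex over one sample line. The sample text is made up for illustration, and the test uses `defined $2` rather than the plain truth test in the code above, so that a second capture of "0" would not be lost.

```perl
use strict;
use warnings;

# Hypothetical sample line; not taken from the original post.
my $line = "a (big) cat.";

my @out;
while ( $line =~ /(\S+\w+)(\S+)?/g ) {
    push @out, $1;                 # leading punctuation stays attached to the word
    push @out, $2 if defined $2;   # trailing non-space run, if there was one
}

# \S+\w+ needs at least two characters, so the one-letter word "a" never matches.
print "$_\n" for @out;
```

On this input the sketch prints "(big", ")", "cat" and ".", so a one-character word is dropped and an opening parenthesis is not split off.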
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
------------------------------
Date: 26 Jul 2004 22:03:06 GMT
From: anno4000@lublin.zrz.tu-berlin.de (Anno Siegel)
Subject: Re: best way to do this?
Message-Id: <ce3v2q$k62$1@mamenchi.zrz.TU-Berlin.DE>
MJL <mail@affordablemedicalsoftware.com> wrote in comp.lang.perl.misc:
> I'm sure this is not the most efficient way to accomplish my goal of
> taking a file of text and converting it into a list of individual
> words and punctuation symbols. It works, but I am curious about how
> to do it differently. Thanks!
>
> #!/usr/bin/perl
> open INF, "./testfile1.txt";
> while (<INF>)
> {
> @words = split;
> push @list, @words;
> }
>
> foreach(@list)
> {
> /\S+\w+/;
> if ($& ne "") {push @list2, "$&\n";}
> if ($' ne "") {push @list2, "$'\n";}
> }
>
>
> open OUTF, ">./testfile2.txt";
> print OUTF @list2;
> close INF;
> close OUTF;
You can get more out of the first split if you split not only on
white space but on word boundaries too. That way, the string neatly
separates into consecutive pieces of word characters and punctuation,
with blanks removed.
There is also no good reason to collect the parts first. You might
as well separate them right in the loop. So:
my ( @words, @punct);
while ( <DATA> ) {
    for ( split /\s+|\b/ ) {
        if ( /\w/ ) {
            push @words, $_;
        } else {
            push @punct, $_;
        }
    }
}
or, in more compact form
while ( <DATA> ) {
    push @{ /\w/ ? \ @words : \ @punct}, $_ for split /\s+|\b/;
}
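As a rough illustration of what the split on /\s+|\b/ yields, here is a small self-contained sketch. The sample line and the variable $line are assumptions for the example; the loop body is the same test on /\w/ used above.

```perl
use strict;
use warnings;

# Hypothetical sample line standing in for one line read from the file.
my $line = "Hello, world! It works.\n";

my ( @words, @punct );
for ( split /\s+|\b/, $line ) {
    if (/\w/) { push @words, $_ }   # piece contains word characters
    else      { push @punct, $_ }   # piece is punctuation only
}

print "words: @words\n";   # words: Hello world It works
print "punct: @punct\n";   # punct: , ! .
```

Runs of adjacent punctuation, such as ")." at the end of a sentence, come out as a single piece with this pattern.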
Anno
------------------------------
Date: 27 Jul 2004 20:39:25 GMT
From: Abigail <abigail@abigail.nl>
Subject: Re: best way to do this?
Message-Id: <slrncgdfap.6hs.abigail@alexandra.abigail.nl>
MJL (mail@affordablemedicalsoftware.com) wrote on MMMCMLXXXII September
MCMXCIII in <URL:news:29c5bb08.0407261212.13f50d78@posting.google.com>:
][ I'm sure this is not the most efficient way to accomplish my goal of
][ taking a file of text and converting it into a list of individual
][ words and punctuation symbols. It works, but I am curious about how
][ to do it differently. Thanks!
][
][ #!/usr/bin/perl
][ open INF, "./testfile1.txt";
][ while (<INF>)
][ {
][ @words = split;
][ push @list, @words;
][ }
][
][ foreach(@list)
][ {
][ /\S+\w+/;
][ if ($& ne "") {push @list2, "$&\n";}
][ if ($' ne "") {push @list2, "$'\n";}
][ }
I'd do it this way (untested):
while (<INF>) {
    push @list => map {"$_\n"} /(\w+|[^\w\s]+)/g;
}
Or you could do:
while (<INF>) {
    s/\s+//g;
    push @list => map {"$_\n"} split /(\w+)/;
}
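A minimal sketch of the first variant on a single line, assuming a made-up sample string in $line instead of a filehandle. In list context the /g match returns every captured token, and the map appends a newline to each one.

```perl
use strict;
use warnings;

# Hypothetical sample line; stands in for one line read from INF.
my $line = "Hello, world!\n";

my @list;
# Each token is either a run of word characters or a run of
# characters that are neither word characters nor white space.
push @list => map {"$_\n"} $line =~ /(\w+|[^\w\s]+)/g;

print @list;   # prints Hello , world ! on four separate lines
```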
Abigail
--
perl -wle '$, = " "; sub AUTOLOAD {($AUTOLOAD =~ /::(.*)/) [0];}
print+Just (), another (), Perl (), Hacker ();'
------------------------------
Date: Wed, 28 Jul 2004 10:34:36 GMT
From: bowsayge <bowsayge@nomail.afraid.org>
Subject: Re: best way to do this?
Message-Id: <0jLNc.590$Jp6.87@newsread3.news.atl.earthlink.net>
Abigail said to us:
[ Splitting a file into words and symbols question ]
> I'd do it this way (untested):
>
> while (<INF>) {
> push @list => map {"$_\n"} /(\w+|[^\w\s]+)/g;
> }
push @list => /(\w+|[^\w\s]+)/g;
will also work
>
> Or you could do:
>
> while (<INF>) {
> s/\s+//g;
The above line folds all consecutive words together.
Change to: s/\s+/ /g;
> push @list => map {"$_\n"} split /(\w+)/;
> }
Bowsayge might do it like this:
while (<INF>) {
    s/[^[:alnum:]]/ $& /g;
    push @list, (split /[[:space:]]+/,$_);
}
--
bowsayge
bow-say-ge?
------------------------------
Date: Wed, 28 Jul 2004 13:01:06 +0200
From: "Bernard El-Hagin" <bernard.el-haginDODGE_THIS@lido-tech.net>
Subject: Re: best way to do this?
Message-Id: <Xns953484D707FF1elhber1lidotechnet@62.89.127.66>
bowsayge <bowsayge@nomail.afraid.org> wrote:
> Abigail said to us:
>
> [ Splitting a file into words and symbols question ]
>> I'd do it this way (untested):
>>
>> while (<INF>) {
>> push @list => map {"$_\n"} /(\w+|[^\w\s]+)/g;
>> }
>
> push @list => /(\w+|[^\w\s]+)/g;
> will also work
But give an entirely different result.
>> Or you could do:
>>
>> while (<INF>) {
>> s/\s+//g;
>
> The above line folds all consecutive words together.
Yes, now that Bowsayge removed the map() which prevented this.
> Change to: s/\s+/ /g;
No, don't. Just leave the correct answer Abigail gave alone.
>> push @list => map {"$_\n"} split /(\w+)/;
>> }
>
> Bowsayge might do it like this:
You are talking about yourself in the third person. You do realise
that, right?
> while (<INF>) {
> s/[^[:alnum:]]/ $& /g;
> push @list, (split /[[:space:]]+/,$_);
> }
Oh the irony. *That* solution "folds all consecutive words together".
--
Cheers,
Bernard
------------------------------
Date: 28 Jul 2004 12:12:57 -0700
From: mail@affordablemedicalsoftware.com (MJL)
Subject: Re: best way to do this?
Message-Id: <29c5bb08.0407281112.1d915fce@posting.google.com>
Thanks to all for great alternatives! I am having a great time
running and dissecting all of these suggestions.
I should clarify my goal: I want to write a program that takes a text
file or a text string and turns it into an HTML file/string. Each
individual word is to become a link to a definition of that word.
Punctuation is to be excluded, of course, and each word is to be defined
only once. I wrote a version that works as a CGI program. It still
needs a little work. I apologize for any poor or inefficient use of
the language. This is not a homework assignment or anything. I'm
just playing around, trying to learn a little Perl. Thanks again!
#!/usr/bin/perl
# process a string and turn it into a webpage with internal links to definitions...
use CGI qw(:standard);
$_ = param("mytext");
@list = split;
foreach(@list)
{
    /\S+\w+/;
    if ($& ne "")
    {
        push @list2, "<a href=\"#defn_$&\">$&</a> \n";
        $ins = "<a name=defn_$&>definition of $&:</a>\n\n<p>\n\n\n</p>\n<hr>\n\n";
        $chk = 0;
        foreach(@list4)
        {
            # "last" leaves the loop; Perl has no "break" statement
            if ($_ eq $ins) {$chk = 1; last;}
        }
        if ($chk == 0)
        {
            push @list4, $ins;
        }
    }
    if ($' ne "") {push @list2, "$'\n";}
}
print header(), start_html("definitions"), h1("Definitions");
foreach(@list2) {print;}
print h1("definitions");
foreach(@list4) {print;}
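The "each word defined only once" requirement is commonly handled with a hash rather than a scan of the list. The following is only a sketch under assumptions: the sample text, the names %seen, @links and @defs, and the use of /(\w+)/g to pick out words are all illustrative and not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical input; the script above reads it from param("mytext").
my $text = "the cat saw the dog";

my ( %seen, @links, @defs );
for my $word ( $text =~ /(\w+)/g ) {
    # Every occurrence of a word becomes a link...
    push @links, qq{<a href="#defn_$word">$word</a>};
    # ...but only the first occurrence gets a definition entry.
    next if $seen{$word}++;
    push @defs, qq{<a name="defn_$word">definition of $word:</a>};
}

print "$_\n" for @links, @defs;
```

The hash lookup replaces the inner loop and the $chk flag, and because the pattern only captures word characters, punctuation never reaches the link list.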
------------------------------
Date: Fri, 30 Jul 2004 08:25:08 GMT
From: bowsayge <bowsayge@nomail.afraid.org>
Subject: Re: best way to do this?
Message-Id: <EBnOc.2342$Jp6.1221@newsread3.news.atl.earthlink.net>
Bernard El-Hagin said to us:
> bowsayge <bowsayge@nomail.afraid.org> wrote:
>
>> Abigail said to us:
>>
>> [ Splitting a file into words and symbols question ]
>>> I'd do it this way (untested):
>>>
>>> while (<INF>) {
>>> push @list => map {"$_\n"} /(\w+|[^\w\s]+)/g;
>>> }
>>
>> push @list => /(\w+|[^\w\s]+)/g;
>> will also work
>
>
> But give an entirely different result.
>
Oops, Bowsayge didn't even look at the OP's original code.
[...]
> You are talking about yourself in the third person. You do realise
> that, right?
>
Bowsayge knows.
[...]
> Oh the irony. *That* solution "folds all consecutive words together".
Not if you print the list with $" set to \n
:)
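For readers unfamiliar with the variable, $" is the separator Perl puts between array elements when an array is interpolated into a double-quoted string. A small sketch of that point, assuming a made-up sample line and using a capture group in place of $&:

```perl
use strict;
use warnings;

# Hypothetical sample line; stands in for one line read from INF.
my $line = "Hello, world!\n";

# Surround every non-alphanumeric character with spaces, then split on white space.
( my $spaced = $line ) =~ s/([^[:alnum:]])/ $1 /g;
my @list = split /[[:space:]]+/, $spaced;

# With $" set to a newline, "@list" interpolates one element per line.
my $out = do { local $" = "\n"; "@list\n" };
print $out;
```

Without the change to $", the elements would be joined by single spaces, or by nothing at all if the list were printed directly.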
--
bowsayge
------------------------------
Date: Fri, 30 Jul 2004 14:53:30 -0000
From: "David K. Wall" <dwall@fastmail.fm>
Subject: Re: best way to do this?
Message-Id: <Xns95366ECB8FAC3dkwwashere@216.168.3.30>
Bernard El-Hagin <bernard.el-haginDODGE_THIS@lido-tech.net> wrote in
message <news:Xns953484D707FF1elhber1lidotechnet@62.89.127.66>:
> bowsayge <bowsayge@nomail.afraid.org> wrote:
>
>> Abigail said to us:
>>
>> [ Splitting a file into words and symbols question ]
[snip]
>>> Or you could do:
>>>
>>> while (<INF>) {
>>> s/\s+//g;
>>
>> The above line folds all consecutive words together.
>
>
> Yes, now that Bowsayge removed the map() which prevented this.
What map()?
>> Change to: s/\s+/ /g;
>
>
> No, don't. Just leave the correct answer Abigail gave alone.
See below...
>>> push @list => map {"$_\n"} split /(\w+)/;
>>> }
How is that correct? If I change INF to DATA to make it self-
contained:
<code>
my @list;
while (<DATA>) {
    s/\s+//g;
    push @list => map {"$_\n"} split /(\w+)/;
}
print @list;
__DATA__
The language is intended to be practical (easy to use,
efficient, complete) rather than beautiful (tiny,
elegant, minimal).
</code>
...then the above code produces this output:
<output>
Thelanguageisintendedtobepractical
(
easytouse
,
efficient
,
complete
)
ratherthanbeautiful
(
tiny
,
elegant
,
minimal
).
</output>
That doesn't look correct, and I was careful to cut-and-paste the
code from Abigail's post (not the followup), making only the change
mentioned. (INF to DATA)
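One possible repair, offered only as a sketch and not as what Abigail intended: split on /(\w+)/ first, and remove white space from each piece afterwards, so that words are never glued together. The array @lines is an assumption that stands in for the DATA section, using the first line of the sample text.

```perl
use strict;
use warnings;

# First line of the sample text above, held in an array instead of __DATA__.
my @lines = ("The language is intended to be practical (easy to use,\n");

my @list;
for my $line (@lines) {
    for my $piece ( split /(\w+)/, $line ) {
        ( my $token = $piece ) =~ s/\s+//g;   # strip white space after splitting
        push @list, $token if length $token;  # skip pieces that were only blanks
    }
}

print "$_\n" for @list;
```

One caveat: punctuation marks separated only by white space, such as ") (", would still be merged into one token by this variant.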
------------------------------
Date: Mon, 2 Aug 2004 07:47:24 +0200
From: "Bernard El-Hagin" <bernard.el-haginDODGE_THIS@lido-tech.net>
Subject: Re: best way to do this?
Message-Id: <Xns95394FABD10BDelhber1lidotechnet@62.89.127.66>
"David K. Wall" <dwall@fastmail.fm> wrote:
> Bernard El-Hagin <bernard.el-haginDODGE_THIS@lido-tech.net> wrote
> in message <news:Xns953484D707FF1elhber1lidotechnet@62.89.127.66>:
>
>> bowsayge <bowsayge@nomail.afraid.org> wrote:
>>
>>> Abigail said to us:
>>>
>>> [ Splitting a file into words and symbols question ]
>
> [snip]
>
>>>> Or you could do:
>>>>
>>>> while (<INF>) {
>>>> s/\s+//g;
>>>
>>> The above line folds all consecutive words together.
>>
>>
>> Yes, now that Bowsayge removed the map() which prevented this.
>
> What map()?
The map() which he removed from Abigail's first example (which works
correctly).
>>> Change to: s/\s+/ /g;
>>
>>
>> No, don't. Just leave the correct answer Abigail gave alone.
>
> See below...
[...]
Yes, the second example is messed up.
--
Cheers,
Bernard
------------------------------
Date: Mon, 02 Aug 2004 13:41:47 -0000
From: "David K. Wall" <dwall@fastmail.fm>
Subject: Re: best way to do this?
Message-Id: <Xns953962A37C2EEdkwwashere@216.168.3.30>
Bernard El-Hagin <bernard.el-haginDODGE_THIS@lido-tech.net> wrote in
message <news:Xns95394FABD10BDelhber1lidotechnet@62.89.127.66>:
> "David K. Wall" <dwall@fastmail.fm> wrote:
>
>> Bernard El-Hagin <bernard.el-haginDODGE_THIS@lido-tech.net> wrote
>> in message
>> <news:Xns953484D707FF1elhber1lidotechnet@62.89.127.66>:
>>
>>> bowsayge <bowsayge@nomail.afraid.org> wrote:
>>>
>>>> Abigail said to us:
>>>>
>>>> [ Splitting a file into words and symbols question ]
>>
>> [snip]
>>
>>>>> Or you could do:
>>>>>
>>>>> while (<INF>) {
>>>>> s/\s+//g;
>>>>
>>>> The above line folds all consecutive words together.
>>>
>>>
>>> Yes, now that Bowsayge removed the map() which prevented this.
>>
>> What map()?
>
>
> The map() which he removed from Abigail's first example (which
> works correctly).
Ah, OK. I thought you meant the second example instead of the
first. Never mind. :-)
------------------------------
Date: 23 Jul 2004 20:19:40 GMT
From: <corff@cis.fu-berlin.de>
Subject: Bizarre PerlScript/WSH/UTF-8 problem
Message-Id: <2mda6sFlaquiU1@uni-berlin.de>
Hi All,
I try to put utf8 material into a browser page via a Perl script
embedded in an HTML page. The whole thing runs under Windows XP
Professional, I am using ActivePerl 5.8.0 and IE 6.0. A minimal
file exhibiting the problem is given here:
<HTML>
<HEAD>
<TITLE>PerlScript Minimal Test</TITLE>
</HEAD>
<BODY>
<H2>A Chinese Character: 一</H2><!-- test, works well -->
<SCRIPT LANGUAGE="PerlScript">
use utf8; # Doesn't seem to make any difference here
#
$abwwide="\x{0410}\x{0411}\x{0412}"; # Cyrillic ABW
$window->document->write($abwwide); # Doesn't work
#
$abw ="АБВ"; # Again Cyrillic ABW, but in utf8
$window->document->write($abw); # Doesn't work
#
# Direct approach
$window->document->write("АБВ"); # doesn't work, either
#
$htmlified_char='&#1040;&#1041;&#1042;'; # The same, ABW
$window->document->write($htmlified_char); # works!
</SCRIPT>
</BODY>
</HTML>
I think I've browsed the complete documentation of AS Perl as far
as it is at least remotely related to either Unicode or WSH; I've
been writing Perl code for Linux which successfully digests thousands
of lines of utf8-encoded text in the wildest array of languages
(e.g., Mongolian, Arabic, Chinese, Tibetan all in one document)
and it works. However I fail to understand where to search for a
solution to the above-mentioned problem.
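Since the only variant reported to work is the HTML-ified string, one possible workaround is to convert every non-ASCII character to a numeric character reference before handing the string to the browser. This is a sketch under the assumption that the HTML-ified string consisted of such references; the helper name to_ncr is hypothetical, and whether the result suits the PerlScript host has not been tested here.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each non-ASCII character with &#NNNN;
sub to_ncr {
    my ($s) = @_;
    $s =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord $1)/ge;
    return $s;
}

my $abwwide = "\x{0410}\x{0411}\x{0412}";   # Cyrillic ABW, as in the post
my $encoded = to_ncr($abwwide);
print "$encoded\n";                         # &#1040;&#1041;&#1042;
```

The result is plain ASCII, so it does not depend on how the script host or the browser guesses the page encoding.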
Thanks for any hints,
Oliver.
--
Dr. Oliver Corff e-mail: corff@zedat.fu-berlin.de
------------------------------
Date: 23 Jul 2004 20:40:42 GMT
From: <corff@cis.fu-berlin.de>
Subject: Re: Bizarre PerlScript/WSH/UTF-8 problem
Message-Id: <2mdbeaFlaquiU2@uni-berlin.de>
corff@cis.fu-berlin.de wrote:
: Hi All,
: I try to put utf8 material into a browser page via a Perl script
: embedded in an HTML page. The whole thing runs under Windows XP
: Professional, I am using ActivePerl 5.8.0 and IE 6.0. A minimal
: file exhibiting the problem is given here:
Of course I tried various settings of "View -> Encoding", and I tried
to set these as <META ...> statements, but this did not remove any
obstacle.
Oliver.
--
Dr. Oliver Corff e-mail: corff@zedat.fu-berlin.de
------------------------------
Date: 30 Jul 2004 20:14:54 GMT
From: <corff@cis.fu-berlin.de>
Subject: Re: Bizarre PerlScript/WSH/UTF-8 problem
Message-Id: <2mvohuFrfqeaU2@uni-berlin.de>
corff@cis.fu-berlin.de wrote:
With regard to the code below, I plead guilty to writing: "Doesn't work".
What I wanted to say instead is that the browser displays weird garbage
or question marks but not the desired output.
: Hi All,
: I try to put utf8 material into a browser page via a Perl script
: embedded in an HTML page. The whole thing runs under Windows XP
: Professional, I am using ActivePerl 5.8.0 and IE 6.0. A minimal
: file exhibiting the problem is given here:
: <HTML>
: <HEAD>
: <TITLE>PerlScript Minimal Test</TITLE>
: </HEAD>
: <BODY>
: <H2>A Chinese Character: 一</H2><!-- test, works well -->
: <SCRIPT LANGUAGE="PerlScript">
: use utf8; # Doesn't seem to make any difference here
: #
: $abwwide="\x{0410}\x{0411}\x{0412}"; # Cyrillic ABW
: $window->document->write($abwwide); # Doesn't work
: #
: $abw ="АБВ"; # Again Cyrillic ABW, but in utf8
: $window->document->write($abw); # Doesn't work
: #
: # Direct approach
: $window->document->write("АБВ"); # doesn't work, either
: #
: $htmlified_char='&#1040;&#1041;&#1042;'; # The same, ABW
: $window->document->write($htmlified_char); # works!
: </SCRIPT>
: </BODY>
: </HTML>
If anybody feels that this is not a perl-related question which is better
dealt with in a different newsgroup, I'll be happy to receive and follow
suggestions where to look/write.
Thanks,
Oliver.
--
Dr. Oliver Corff e-mail: corff@zedat.fu-berlin.de
------------------------------
Date: Thu, 29 Jul 2004 13:49:50 -0700
From: "Bharat Sinha" <bsinha@qualcomm.com>
Subject: bootstrap ??
Message-Id: <cebnt1$of4$1@fair.qualcomm.com>
Hi,
I got a package from CPAN, and it contains a
"bootstrap 'nameof module'" command. When I use this module, the perl
interpreter fails to recognize it unless I remove the bootstrap command.
Does anyone know what this command does, and why the interpreter doesn't
recognize the package with this command in it?
------------------------------
Date: 29 Jul 2004 22:44:52 +0100
From: Brian McCauley <nobull@mail.com>
Subject: Re: bootstrap ??
Message-Id: <u9y8l21oa3.fsf@wcl-l.bham.ac.uk>
"Bharat Sinha" <bsinha@qualcomm.com> writes:
> I got a package from cpan,
Define "got". Did you _install_ it?
> and it contains a "bootstrap 'nameof module'" command.
This loads the non-Perl part of the module.
> When I use this module perl interpreter
> fails to recognize this module unless I remove the bootstrap command.
Define "fails to recognise".
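To make "the non-Perl part" a little more concrete: bootstrap is a method a module inherits from DynaLoader, and it looks for the module's compiled shared object and loads it. The sketch below uses a made-up package name, My::Hypothetical, for which no compiled part exists, so the call is expected to fail. That is one plausible way a module that was copied but never built and installed could appear to be "not recognized", though the original report does not say what the actual error was.

```perl
use strict;
use warnings;

package My::Hypothetical;    # made-up name; no compiled object exists for it

require DynaLoader;
our @ISA     = ('DynaLoader');
our $VERSION = '0.01';

# Method-call spelling of: bootstrap My::Hypothetical $VERSION;
my $ok  = eval { My::Hypothetical->bootstrap($VERSION); 1 };
my $err = $ok ? '' : $@;

print $ok ? "compiled part loaded\n" : "bootstrap failed: $err";
```

Removing the bootstrap line makes the error go away but also leaves any functions implemented in the compiled part undefined.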
--
\\ ( )
. _\\__[oo
.__/ \\ /\@
. l___\\
# ll l\\
###LL LL\\
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc. For subscription or unsubscription requests, send
#the single line:
#
# subscribe perl-users
#or:
# unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.
NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V10 Issue 6799
***************************************