[24660] in Perl-Users-Digest


Perl-Users Digest, Issue: 6824 Volume: 10

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Tue Aug 3 14:26:58 2004

Date: Tue, 3 Aug 2004 11:26:28 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Tue, 3 Aug 2004     Volume: 10 Number: 6824

Today's topics:
        SOLUTION IDENTIFIED Re: Parsing form POST without CGI.p <aaron@deloachcorp.com>
    Re: SOLUTION IDENTIFIED Re: Parsing form POST without C <matthew.garrish@sympatico.ca>
    Re: SOLUTION IDENTIFIED Re: Parsing form POST without C <noreply@gunnar.cc>
    Re: SOLUTION IDENTIFIED Re: Parsing form POST without C <matthew.garrish@sympatico.ca>
    Re: SOLUTION IDENTIFIED Re: Parsing form POST without C <matthew.garrish@sympatico.ca>
    Re: SOLUTION IDENTIFIED Re: Parsing form POST without C <noreply@gunnar.cc>
    Re: SOLUTION IDENTIFIED Re: Parsing form POST without C <ceo@nospam.on.net>
    Re: SOLUTION IDENTIFIED Re: Parsing form POST without C <bik.mido@tiscalinet.it>
    Re: SOLUTION IDENTIFIED Re: Parsing form POST without C <flavell@ph.gla.ac.uk>
        SOLUTION: NEWBIE: Perls System -command and Cygwin bash <pekka.niiranen@wlanmail.com>
    Re: Sort Part of a string (David Combs)
        splitting paragraph into sentences <mr@sandman.net>
    Re: splitting paragraph into sentences <noreply@gunnar.cc>
    Re: splitting paragraph into sentences <mr@sandman.net>
    Re: splitting paragraph into sentences <noreply@gunnar.cc>
    Re: splitting paragraph into sentences <mritty@gmail.com>
    Re: splitting paragraph into sentences <bowsayge@nomail.afraid.org>
    Re: splitting paragraph into sentences (Anno Siegel)
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Mon, 2 Aug 2004 10:29:30 -0500
From: "Aaron DeLoach" <aaron@deloachcorp.com>
Subject: SOLUTION IDENTIFIED Re: Parsing form POST without CGI.pm on Win32
Message-Id: <hoidnWqoMPt7wJPcRVn-rw@eatel.net>

"Aaron DeLoach" <aaron@deloachcorp.com> wrote in message
news:v4udnXi4E8kxUpDcRVn-hQ@eatel.net...
> "Aaron DeLoach" <aaron@deloachcorp.com> wrote in message
> news:KM6dnaNCa5IO8pHcRVn-hA@eatel.net...
> > My Perl programs are developed in the Win32 environment. Some of my work
> > gets ported to the Unix OS.
> >
> > I use the CGI.pm module to 'paramitize' form post data. Everything works
> > well with this great module.
> >
> > However, I have a program that will be run every ten seconds or so (maybe
> > more?).  I use the CGI.pm just to parse the initial form post data into
> > parameters that I immediately place and work with in hashes (I love
> > hashes). We control the form post data, so I'm not terribly worried about
> > problems that the CGI.pm module tends to regarding such. This seems like
> > a bit of overkill just to parse parameters I know, but in Windows there
> > is no STDIN to parse form posts from like the Unix OS.
> >
> > Does anybody have a work-around/solution/tip/anything to get around using
> > the CGI.pm for this instance?
> >
> > Regards,
> > Aaron
> >
> >
> >
>
> I am posting this message to update the thread. Maybe it will help someone
> else.
>
> Throughout my trials with this subject I was led to believe that WinXP Home
> did not expose the STDIN object. The problem is an Internet Explorer 6 issue
> (I don't know about earlier versions). I could not read the STDIN via Perl
> when a form was submitted with IE 6. On NN and Opera the STDIN was
> available. Now I'll try to find the solution... (I'll update the ng)
>
> Regards,
> Aaron
>
>
>
I am updating this thread to share a solution to the original problem of
accessing the STDIN object.

It has been discussed in other newsgroups and is an IE issue. Here is a post
from microsoft.public.inetserver.iis

"The problem is that the data in STDIN is encoded in
application/x-www-form-urlencoded, not plain text. So, <STDIN> waits for a
CR/LF which never comes. The solution is to read() $ENV{'CONTENT_LENGTH'}
bytes. To be safe, e.g. when using multipart/form-data, binmode(STDIN) as
well."

So...
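{
 ...
my $len = $ENV{'CONTENT_LENGTH'};    # byte count supplied by the web server
binmode(STDIN);                      # no CR/LF translation on Win32
read(STDIN, $buffer, $len) == $len
    or die "Reading of posted data failed.\n";
 ...
}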

Works in IE, NN and Opera
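
For anyone landing on this thread later, the read-and-decode step could be
sketched roughly as below, assuming an application/x-www-form-urlencoded
body. The sub names (read_post_body, parse_urlencoded) and the 131072-byte
cap are illustrative assumptions, not part of the IIS post; the sketch does
not handle multipart/form-data, and a repeated field name keeps only its
last value.

```perl
use strict;
use warnings;

# Read exactly $len bytes from a filehandle (STDIN under CGI).
# Hypothetical helper; the 131072-byte cap is an arbitrary assumption.
sub read_post_body {
    my ($fh, $len) = @_;
    defined $len && $len =~ /^\d+$/ or die "Missing or bad CONTENT_LENGTH\n";
    $len <= 131072 or die "Too much data submitted\n";
    binmode $fh;                      # no CR/LF translation on Win32
    my ($buffer, $got) = ('', 0);
    while ($got < $len) {             # read() may return fewer bytes than asked
        my $n = read($fh, $buffer, $len - $got, $got);
        defined $n or die "Read failed: $!\n";
        last if $n == 0;              # premature end of input
        $got += $n;
    }
    $got == $len or die "Short read: got $got of $len bytes\n";
    return $buffer;
}

# Decode name=value pairs into a hash (last value wins for repeated names).
sub parse_urlencoded {
    my ($data) = @_;
    my %in;
    for my $pair (split /[&;]/, $data) {
        next unless length $pair;
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        for ($name, $value) {
            tr/+/ /;                              # '+' encodes a space
            s/%([0-9A-Fa-f]{2})/chr hex $1/ge;    # %XX encodes one byte
        }
        $in{$name} = $value;
    }
    return %in;
}

# Under CGI the call would look something like:
#   my %in = parse_urlencoded(read_post_body(\*STDIN, $ENV{CONTENT_LENGTH}));
```

Reading by byte count rather than by line is the point: the browser is not
obliged to end the body with a newline, so a line-oriented read may block.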

Thanks to all that have helped!

Regards,
Aaron





------------------------------

Date: Mon, 2 Aug 2004 14:45:51 -0400
From: "Matt Garrish" <matthew.garrish@sympatico.ca>
Subject: Re: SOLUTION IDENTIFIED Re: Parsing form POST without CGI.pm on Win32
Message-Id: <rZvPc.20055$Vm1.270103@news20.bellglobal.com>


"Aaron DeLoach" <aaron@deloachcorp.com> wrote in message
news:hoidnWqoMPt7wJPcRVn-rw@eatel.net...
>
> I am updating this thread to share a solution to the original problem of
> accessing the STDIN object.
>
> It has been discussed in other newsgroups and is an IE issue. Here is a
post
> from microsoft.public.inetserver.iis
>

So you huffed and you puffed and you insulted regulars here just to find out
that you don't have a Perl problem? My hunch is that you wouldn't have had
this problem had you used CGI.pm. Maybe now you'll understand why it's the
recommended method, and why few here care about your problems trying to do
it on your own.

Matt




------------------------------

Date: Mon, 02 Aug 2004 23:26:27 +0200
From: Gunnar Hjalmarsson <noreply@gunnar.cc>
Subject: Re: SOLUTION IDENTIFIED Re: Parsing form POST without CGI.pm on Win32
Message-Id: <2n7pt1Ftk3p8U1@uni-berlin.de>

Matt Garrish wrote:
> Aaron DeLoach wrote:
>> I am updating this thread to share a solution to the original
>> problem of accessing the STDIN object.
>> 
>> It has been discussed in other newsgroups and is an IE issue.
>> Here is a post from microsoft.public.inetserver.iis
> 
> So you huffed and you puffed and you insulted regulars here just to
> find out that you don't have a Perl problem? My hunch is that you
> wouldn't have had this problem had you used CGI.pm. Maybe now
> you'll understand why it's the recommended method, and why few here
> care about your problems trying to do it on your own.

Matt, if you had read the whole thread, you'd know that a failure to
parse POSTed data with CGI.pm was the starting-point of the thread.

To Aaron:
Thanks for the update! You may have found a few replies in this thread
somewhat odd. In that case, the explanation may be that a few persons
who post here are suffering from "Matt Wright phobia". One of the
symptoms of that disease is that they have no ability to imagine that
the CGI.pm module may have any shortcoming of any kind.

As regards the solution you mentioned, binmoding STDIN (which btw was
mentioned by Brian), I had a quick glance at the CGI.pm source, and it
seems to me that the module does that by default on Windows (but I'm
far from certain). Consequently, I'm slightly confused.

Could the explanation why CGI.pm didn't work for you possibly be that
you are using an old version? I for one would find it valuable if you
could make a try with the latest CGI.pm version, and let us know if it
makes a difference.

-- 
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl


------------------------------

Date: Mon, 2 Aug 2004 18:41:30 -0400
From: "Matt Garrish" <matthew.garrish@sympatico.ca>
Subject: Re: SOLUTION IDENTIFIED Re: Parsing form POST without CGI.pm on Win32
Message-Id: <qqzPc.11723$Jq2.496566@news20.bellglobal.com>


"Gunnar Hjalmarsson" <noreply@gunnar.cc> wrote in message
news:2n7pt1Ftk3p8U1@uni-berlin.de...
> Matt Garrish wrote:
> > Aaron DeLoach wrote:
> >> I am updating this thread to share a solution to the original
> >> problem of accessing the STDIN object.
> >>
> >> It has been discussed in other newsgroups and is an IE issue.
> >> Here is a post from microsoft.public.inetserver.iis
> >
> > So you huffed and you puffed and you insulted regulars here just to
> > find out that you don't have a Perl problem? My hunch is that you
> > wouldn't have had this problem had you used CGI.pm. Maybe now
> > you'll understand why it's the recommended method, and why few here
> > care about your problems trying to do it on your own.
>
> Matt, if you had read the whole thread, you'd know that a failure to
> parse POSTed data with CGI.pm was the starting-point of the thread.
>

I don't recall his having this problem with CGI.pm:

<quote>
I use the CGI.pm module to 'paramitize' form post data. Everything works
well with this great module.
</quote>

>
> To Aaron:
> Thanks for the update! You may have found a few replies in this thread
> somewhat odd. In that case, the explanation may be that a few persons
> who post here are suffering from "Matt Wright phobia". One of the
> symptoms of that disease is that they have no ability to imagine that
> the CGI.pm module may have any shortcoming of any kind.
>

It's hardly a new argument that CGI.pm has shortcomings. I even felt the
need to make it once upon a time myself:

http://www.perlmonks.org/index.pl?node_id=122267

The fact remains, however, that rolling one's own solution is fraught with
perils. No one has a problem with you wanting to learn or advocating
learning (or at least I don't), but the majority of posters who aren't
using CGI.pm should be using it. Whatever shortcomings the module has, I
would still use it in a production environment any day over a hand-rolled
solution.

And as I mentioned in another post, the days of the bloat argument are fast
passing if not gone. So until someone provides me with a better reason to
parse my own parameters, I'll keep using/advocating CGI.pm.

Matt




------------------------------

Date: Mon, 2 Aug 2004 18:50:19 -0400
From: "Matt Garrish" <matthew.garrish@sympatico.ca>
Subject: Re: SOLUTION IDENTIFIED Re: Parsing form POST without CGI.pm on Win32
Message-Id: <HyzPc.11726$Jq2.499070@news20.bellglobal.com>


"Matt Garrish" <matthew.garrish@sympatico.ca> wrote in message
news:qqzPc.11723$Jq2.496566@news20.bellglobal.com...
>
> It's hardly a new argument that CGI.pm has shortcomings. I even felt the
> need to make it once upon a time myself:
>
> http://www.perlmonks.org/index.pl?node_id=122267
>

And looking back, I was never a very well-behaved monk, I must admit... : )

Matt




------------------------------

Date: Tue, 03 Aug 2004 01:18:35 +0200
From: Gunnar Hjalmarsson <noreply@gunnar.cc>
Subject: Re: SOLUTION IDENTIFIED Re: Parsing form POST without CGI.pm on Win32
Message-Id: <2n80fbFu0airU1@uni-berlin.de>

Matt Garrish wrote:
> Gunnar Hjalmarsson wrote:
>> Matt Garrish wrote:
>>> So you huffed and you puffed and you insulted regulars here
>>> just to find out that you don't have a Perl problem? My hunch
>>> is that you wouldn't have had this problem had you used CGI.pm.
>>> Maybe now you'll understand why it's the recommended method,
>>> and why few here care about your problems trying to do it on
>>> your own.
>> 
>> Matt, if you had read the whole thread, you'd know that a failure
>> to parse POSTed data with CGI.pm was the starting-point of the
>> thread.
> 
> I don't recall his having this problem with CGI.pm:
> 
> <quote>
> I use the CGI.pm module to 'paramitize' form post data. Everything
> works well with this great module.
> </quote>

Alan made me realize that I had (probably) misunderstood that. Apologies.

-- 
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl


------------------------------

Date: Tue, 03 Aug 2004 02:51:07 GMT
From: ChrisO <ceo@nospam.on.net>
Subject: Re: SOLUTION IDENTIFIED Re: Parsing form POST without CGI.pm on Win32
Message-Id: <v4DPc.1780$TR5.1571@newssvr16.news.prodigy.com>

Gunnar Hjalmarsson wrote:

> Matt Garrish wrote:
> 
>> Aaron DeLoach wrote:
>>
>>> I am updating this thread to share a solution to the original
>>> problem of accessing the STDIN object.
>>>
>>> It has been discussed in other newsgroups and is an IE issue.
>>> Here is a post from microsoft.public.inetserver.iis
>>
>>
>> So you huffed and you puffed and you insulted regulars here just to
>> find out that you don't have a Perl problem? My hunch is that you
>> wouldn't have had this problem had you used CGI.pm. Maybe now
>> you'll understand why it's the recommended method, and why few here
>> care about your problems trying to do it on your own.
> 
> 
> Matt, if you had read the whole thread, you'd know that a failure to
> parse POSTed data with CGI.pm was the starting-point of the thread.
> 
> To Aaron:
> Thanks for the update! You may have found a few replies in this thread
> somewhat odd. In that case, the explanation may be that a few persons
> who post here are suffering from "Matt Wright phobia". One of the
> symptoms of that disease is that they have no ability to imagine that
> the CGI.pm module may have any shortcoming of any kind.
> 
> As regards the solution you mentioned, binmoding STDIN (which btw was
> mentioned by Brian), I had a quick glance at the CGI.pm source, and it
> seems to me that the module does that by default on Windows (but I'm
> far from certain). Consequently, I'm slightly confused.
> 
> Could the explanation why CGI.pm didn't work for you possibly be that
> you are using an old version? I for one would find it valuable if you
> could make a try with the latest CGI.pm version, and let us know if it
> makes a difference.
> 

I'm still curious -- IE completely aside -- why...

perl -e "print while (<STDIN>)"

was reported as NOT working.  This should have NOTHING to do with IE.
(Then again, under Windows, one never knows.)  I understand the OP's glee
that things are "working now."  But I'm really curious about the other
issue, and if it were me, there's NO WAY I would be able to rest until both
issues were resolved.  But that's just me...

-ceo


------------------------------

Date: Tue, 03 Aug 2004 10:28:28 +0200
From: Michele Dondi <bik.mido@tiscalinet.it>
Subject: Re: SOLUTION IDENTIFIED Re: Parsing form POST without CGI.pm on Win32
Message-Id: <4bctg01h3tccm9l16nm212cc8d2u5ttfmu@4ax.com>

On Mon, 2 Aug 2004 14:45:51 -0400, "Matt Garrish"
<matthew.garrish@sympatico.ca> wrote:

>> I am updating this thread to share a solution to the original problem of
>> accessing the STDIN object.
[snip]
>So you huffed and you puffed and you insulted regulars here just to find out
>that you don't have a Perl problem? My hunch is that you wouldn't have had
>this problem had you used CGI.pm. Maybe now you'll understand why it's the
>recommended method, and why few here care about your problems trying to do
>it on your own.

This is not at all an excuse for having insulted helping regulars here
and without any good (or even bad!) reason, which indeed he did, but
IIRC (and I only gave a brief peek into that thread) he may have
actually had reasonably good reasons *not* to use CGI.pm. In fact
Gunnar Hjalmarsson provided a non-CGI.pm solution even though he
regularly uses it per all the examples by him I've seen so far...


Michele
-- 
you'll see that it shouldn't be so. AND, the writting as usuall is
fantastic incompetent. To illustrate, i quote:
- Xah Lee trolling on clpmisc,
  "perl bug File::Basename and Perl's nature"


------------------------------

Date: Mon, 2 Aug 2004 22:48:27 +0100
From: "Alan J. Flavell" <flavell@ph.gla.ac.uk>
Subject: Re: SOLUTION IDENTIFIED Re: Parsing form POST without CGI.pm on Win32
Message-Id: <Pine.LNX.4.61.0408022232300.567@ppepc56.ph.gla.ac.uk>

On Mon, 2 Aug 2004, Gunnar Hjalmarsson wrote:

> Matt, if you had read the whole thread, you'd know that a failure to
> parse POSTed data with CGI.pm was the starting-point of the thread.

You must be reading an entirely different posting from what I'm 
seeing, then:

| I use the CGI.pm module to 'paramitize' form post data. Everything 
| works well with this great module.

Looks good so far.  It then went on to say:

|However, I have a program that will be run every ten seconds or so 
|(maybe more?).  I use the CGI.pm just to parse the initial form post 
|data into parameters that I immediately place and work with in hashes 
|(I love hashes). We control the form post data, so I'm not terribly 
|worried about problems that the CGI.pm module tends to regarding 
|such. This seems like a bit of overkill just to parse parameters I 
|know, but in Windows there is no STDIN to parse form posts from like 
|the Unix OS.
|
|Does anybody have a work-around/solution/tip/anything to get around 
|using the CGI.pm for this instance?

I don't see anything there which says "CGI.pm cannot parse submissions 
on a Windows server" - do you?  To me it reads like "let's optimise 
prematurely".  Since the poster decided to then apply the flame torch 
instead of the microscalpel, I suppose we'll never really know what 
was intended...

[muddle omitted for brevity]

> As regards the solution you mentioned, binmoding STDIN (which btw was
> mentioned by Brian), I had a quick glance at the CGI.pm source,

CGI.pm handles it just fine.  What you have to grasp, I think, is that 
the claim to be "using" CGI.pm in this sample code:

  ___
/
use CGI qw/:standard/;
$CGI::POST_MAX=1024 * 100;
$CGI::DISABLE_UPLOADS = 1;

my (%in, $buffer);
if ($ENV{REQUEST_METHOD} eq 'POST') {
  my $len = $ENV{CONTENT_LENGTH};
  $len <= 131072 or die "Too much data submitted.\n";
  read(STDIN, $buffer, $len) == $len
    or die "Reading of posted data failed.\n";
} else { $buffer = $ENV{QUERY_STRING};
}
\___

was - how shall I put it? - "much exaggerated".

> Could the explanation why CGI.pm didn't work for you

Please, where did it say that?  The code that was presented on this 
thread was hand-knitted attempts at decoding POST submissions.  The 
poster expressed surprise that forms submissions weren't transmitted 
as plain text, for heaven's sake!  Doesn't that tell you something?

ttfn


------------------------------

Date: Fri, 23 Jul 2004 10:24:20 GMT
From: pekka niiranen <pekka.niiranen@wlanmail.com>
Subject: SOLUTION: NEWBIE: Perls System -command and Cygwin bash-shell, More details
Message-Id: <4100E754.8060901@wlanmail.com>

Got it,

"readdir" does not return the full path;
I must build the full path to the file when calling "system":
system("chmod 660 <add path here>/$tmpfile")
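
A minimal sketch of the same fix without shelling out, assuming Perl's
built-in chmod is acceptable in place of the external command. The sub name
set_mode_in_dir is made up for illustration.

```perl
use strict;
use warnings;
use File::Spec;

# Apply $mode to every plain file in $dir; return the full paths changed.
# set_mode_in_dir is a hypothetical name used only for this example.
sub set_mode_in_dir {
    my ($dir, $mode) = @_;
    opendir(my $dh, $dir) or die "Can't open $dir: $!";
    my @changed;
    while (defined(my $name = readdir $dh)) {
        next if $name =~ /^\.\.?$/;                   # skip . and ..
        my $path = File::Spec->catfile($dir, $name);  # readdir gives bare names
        next unless -f $path;
        chmod($mode, $path) or die "chmod $path failed: $!";
        push @changed, $path;
    }
    closedir $dh;
    return @changed;
}
```

If the external command is kept, note that system() returns 0 on success,
so the usual check is "system(...) == 0 or die", not "system(...) or die".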

Thanks anyway,

-pekka-

pekka niiranen wrote:

> Ok here goes,
> 
> ---the perl script named "p" starts--
> #!/usr/local/bin/perl -w
> unless (opendir(TMPDIR, "/cygdrive/c/home/cygwin/tmp")) {
>     print "Can't open temporary directory !";
> }
> system("/cygdrive/c/home/cygwin/s");    # run the shell script
> while( defined ($tmpfile = readdir TMPDIR) ) {
>     next if $tmpfile =~ /^\.\.?$/;
>     system("chmod 660 $tmpfile") or die;   
> }
> system("ls -l ./tmp");
> closedir(TMPDIR);
> 
> ---the perl script named "p" stops---
> 
> ---the bash script named "s" starts---
> 
> #!/usr/bin/bash
> echo "Blah Blah" > ./tmp/tfile
> exit
> ---the bash script named "s" stops---
> 
> The perl script and shell script are run from the same directory.
> The file "tfile" is created into subdirectory "tmp"
> 
> "ls -l" from script directory gives (among other things):
> drwxr-xr-x+   2 treniira Administ        0 Jul 23 12:11 tmp/
> 
> Perl scripts output is:
> [vat58008:~] $ ./p
> chmod: getting attributes of `tfile': No such file or directory
> total 1
> -rw-r--r--    1 treniira Administ       10 Jul 23 12:23 tfile
> 
> 
> -pekka-
> 
> Anno Siegel wrote:
> 
>> Pekka Niiranen  <pekka.niiranen@wlanmail.com> wrote in 
>> comp.lang.perl.misc:
>>
>>> Hi there,
>>>
>>> I am having problem in W2K when using Cygwin's Perl.
>>> My Perl script starts Shell script with System -command; 
>>> system(scriptfile). The shell script creates a temporary file
>>> like this:
>>>
>>> #!/usr/bin/sh
>>> cat "blah blah" > tempfile
>>>
>>> and then exits back to Perl. When I then try to remove
>>> the created "tempfile" from the same Perl script with
>>> "unlink" I find out that Perl does not have access rights to the file.
>>
>>
>>
>> So what *are* the permissions and ownership of the file?  What
>> does "ls -l" say?  You are not giving us vital information.
>>
>>
>>> If I use "readdir" Perl finds the file but nor command
>>> "system(rm $file)" or "unlink($file)" does not work either. However, 
>>
>>
>>
>> Does $file contain what you think it does?
>>
>>
>>> when Perl script exits I can remove the file with
>>> normal Bash shell command: "rm tempfile".
>>>
>>> It seems that Perl is running with different user rights
>>> than the Shell script. My question is therefore: "How can I remove 
>>> file created by the Bash shell script started as subshell from Perl 
>>> script?"
>>
>>
>>
>> No idea.  Show complete code that demonstrates the problem.
>>
>> Anno


------------------------------

Date: Tue, 27 Jul 2004 05:00:13 +0000 (UTC)
From: dkcombs@panix.com (David Combs)
Subject: Re: Sort Part of a string
Message-Id: <ce4ngt$2ai$1@reader2.panix.com>

In article <cdnuhf$71d$1@mamenchi.zrz.TU-Berlin.DE>,
Anno Siegel <anno4000@lublin.zrz.tu-berlin.de> wrote:
>Randal L. Schwartz <merlyn@stonehenge.com> wrote in comp.lang.perl.misc:
>> >>>>> "David" == David Combs <dkcombs@panix.com> writes:
>> 
>> David> Aren't you actually the *child* of Schwartz, and thus sadly
>> David> *limited* by that fact, such that while out walking on 
>> David> especially dark nights, you come perilously close to
>> David> falling into deep holes?
>> 
>> This makes absolutely no sense to me.  If it's a reference
>> to a story somewhere, would you mind giving me a pointer?
>> 
>> Jokes only work when the audience understands the basis of the joke. :)
>
>Took me a while too.
>
>Karl Schwarzschild, 1873-1916, Astrophysicist, namesake of the
>Schwarzschild limit, which describes the conditions under which
>black holes are formed.  Now s/zs/tz/ and you're there.
>
>Anno

Some (many!) years back, Scientific American, for a while,
had it seems like a whole series of articles on the
I guess new idea of black holes, and his name was
all over the place.  Like a couple of decades ago (or more?).

I suppose that today it's strings, branes, all that stuff --
not that I understand anything about it.

From watching cspan's cspan-2's 48-hours each weekend "book-tv",
which has had Brian Greene on twice, for his two books

[] *** Brian Greene, "The Fabric of the Cosmos"


[] ***** "The Elegant Universe", by Brian Greene (not! layman's expl).

(Winner of the 2000 Aventis Prize for the best scientific book of
the year, and a Pulitzer finalist.)

This guy (Greene) is some super-duper physicist (max planck lab?)
who works on string theory, and seems to have written two
really good books on that and other maybe-related subjects.

The "reviews" at Amazon are worth looking at.

David





------------------------------

Date: Mon, 02 Aug 2004 14:17:06 +0200
From: Sandman <mr@sandman.net>
Subject: splitting paragraph into sentences
Message-Id: <mr-2F19A0.14170602082004@individual.net>

I've searched the docs, but I can't seem to get it right... I want to split a 
paragraph into sentences, but this doesn't work:

#!/usr/bin/perl
use strict;
use warnings;

my $string = "This is a sentence. This is also a sentence. This as well, right? 
Yes, it is!";

my @list = split / ?(?=[\.\?\!])/, $string;

foreach (@list){
    print "$_\n";
}

__END__
This is a sentence
 . This is also a sentence
 . This as well, right
? Yes, it is
!


The delimiter is kept, but to the wrong item - how do I keep it attached to the 
correct item?

Or is there a special var that keeps the matched delimiter in a split() 
operation, so I could do something like this:


#!/usr/bin/perl
use strict;
use warnings;

my $string = "This is a sentence. This is also a sentence. This as well, right? 
Yes, it is!";

my @list = split /[\.\?\!] /, $string;

foreach (@list){
    print "$_$SPECIALVARIABLE\n";
}


As you may have understood, the wanted output is:

This is a sentence.
This is also a sentence.
This as well, right?
Yes, it is!

-- 
Sandman[.net]


------------------------------

Date: Mon, 02 Aug 2004 14:39:37 +0200
From: Gunnar Hjalmarsson <noreply@gunnar.cc>
Subject: Re: splitting paragraph into sentences
Message-Id: <2n6r0pFtnnvbU1@uni-berlin.de>

Sandman wrote:
> I've searched the docs, but I can't seem to get it right... I want
> to split a paragraph into sentences, but this doesn't work:
> 
> #!/usr/bin/perl
> use strict;
> use warnings;
> 
> my $string = "This is a sentence. This is also a sentence. This as
> well, right? Yes, it is!";
> 
> my @list = split / ?(?=[\.\?\!])/, $string;
> 
> foreach (@list){
>     print "$_\n";
> }
> 
> __END__
> This is a sentence
> . This is also a sentence
> . This as well, right
> ? Yes, it is
> !
> 
> The delimiter is kept, but to the wrong item - how do I keep it
> attached to the correct item?

Try a look-behind instead. This may be what you want:

     my @list = split /(?<=[.?!])\s*/, $string;

(Note that '.', '?' etc. are not special in a character class, and do
therefore not need to be escaped.)
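
A small side-by-side sketch of the two assertions, using the string from
the question, to show where the punctuation ends up in each case:

```perl
use strict;
use warnings;

my $string = "This is a sentence. This is also a sentence. "
           . "This as well, right? Yes, it is!";

# Look-ahead: the split point is *before* the punctuation, so the mark
# ends up at the start of the next piece.
my @ahead  = split / ?(?=[.?!])/, $string;

# Look-behind: the split point is *after* the punctuation, so the mark
# stays with the sentence it ends.
my @behind = split /(?<=[.?!])\s*/, $string;

print "$_\n" for @behind;
```

The look-behind in perlre has to be fixed-width in the Perl versions
current at the time of this thread, which a single character class
satisfies.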

-- 
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl


------------------------------

Date: Mon, 02 Aug 2004 14:48:10 +0200
From: Sandman <mr@sandman.net>
Subject: Re: splitting paragraph into sentences
Message-Id: <mr-45EE5E.14481002082004@individual.net>

In article <2n6r0pFtnnvbU1@uni-berlin.de>,
 Gunnar Hjalmarsson <noreply@gunnar.cc> wrote:

> Sandman wrote:
> > I've searched the docs, but I can't seem to get it right... I want
> > to split a paragraph into sentences, but this doesn't work:
> > 
> > #!/usr/bin/perl
> > use strict;
> > use warnings;
> > 
> > my $string = "This is a sentence. This is also a sentence. This as
> > well, right? Yes, it is!";
> > 
> > my @list = split / ?(?=[\.\?\!])/, $string;
> > 
> > foreach (@list){
> >     print "$_\n";
> > }
> > 
> > __END__
> > This is a sentence
> > . This is also a sentence
> > . This as well, right
> > ? Yes, it is
> > !
> > 
> > The delimiter is kept, but to the wrong item - how do I keep it
> > attached to the correct item?
> 
> Try a look-behind instead. This may be what you want:
> 
>      my @list = split /(?<=[.?!])\s*/, $string;
> 
> (Note that '.', '?' etc. are not special in a character class, and do
> therefore not need to be escaped.)

Thanks again Gunnar, I didn't even know you could do a look-behind. I didn't 
find it documented anywhere I looked.

-- 
Sandman[.net]


------------------------------

Date: Mon, 02 Aug 2004 14:50:19 +0200
From: Gunnar Hjalmarsson <noreply@gunnar.cc>
Subject: Re: splitting paragraph into sentences
Message-Id: <2n6rkrFtngaoU1@uni-berlin.de>

Sandman wrote:
> Thanks again Gunnar, I didn't even know you could do a look-behind.
> I didn't find it documented anywhere I looked.

See "Extended Patterns" in "perldoc perlre".

-- 
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl


------------------------------

Date: Mon, 2 Aug 2004 08:51:07 -0400
From: Paul Lalli <mritty@gmail.com>
Subject: Re: splitting paragraph into sentences
Message-Id: <20040802084259.D14831@barbara.cs.rpi.edu>

On Mon, 2 Aug 2004, Sandman wrote:

> I've searched the docs, but I can't seem to get it right... I want to split a
> paragraph into sentences, but this doesn't work:
>
> #!/usr/bin/perl
> use strict;
> use warnings;
>
> my $string = "This is a sentence. This is also a sentence. This as well, right?
> Yes, it is!";
>
> my @list = split / ?(?=[\.\?\!])/, $string;
>
> foreach (@list){
>     print "$_\n";
> }
>
> __END__
> This is a sentence
> . This is also a sentence
> . This as well, right
> ? Yes, it is
> !
>
>
> The delimiter is kept, but to the wrong item - how do I keep it attached to the
> correct item?

You haven't clearly defined the characters you actually want to split on.
The terms you want to capture are separated by one or more whitespace
characters, but only those that follow a punctuation mark.  This sounds
like a good job for look-behind assertions:

#!/usr/bin/perl
use strict;
use warnings;

my $string = "This is a sentence. This is also a sentence. This as well, right? Yes, it is!";

my @list = split /(?<=[.!?])\s+/, $string;
print "$_\n" for @list;
__END__
This is a sentence.
This is also a sentence.
This as well, right?
Yes, it is!
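
One practical difference between \s+ here and a \s* variant is that \s* can
also match with no whitespace at all, so it splits anywhere a punctuation
mark appears, including inside a decimal number. A small sketch with a
made-up test string:

```perl
use strict;
use warnings;

my $text = "Pi is about 3.14. Really?";

my @star = split /(?<=[.!?])\s*/, $text;   # may split with no whitespace
my @plus = split /(?<=[.!?])\s+/, $text;   # needs at least one whitespace

print "star: [$_]\n" for @star;
print "plus: [$_]\n" for @plus;
```

Neither form understands what a sentence is; \s+ merely avoids the most
obvious false splits.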


Alternatively, you could try to define what you want to capture, rather
than what you want to throw away...

my @list = $string =~ /([^.!?]+[.!?])\s*/g;

But that's a little messier.


Paul Lalli


------------------------------

Date: Mon, 02 Aug 2004 14:40:20 GMT
From: bowsayge <bowsayge@nomail.afraid.org>
Subject: Re: splitting paragraph into sentences
Message-Id: <onsPc.6858$Jp6.2652@newsread3.news.atl.earthlink.net>

Sandman said to us:

> I've searched the docs, but I can't seem to get it right... I want to
> split a paragraph into sentences

I hope this helps you, but, as you will see below, determining when a
sentence ends might require you to know all of the abbreviations:

use strict;
use warnings;

my $string = "This is a sentence. This is also a sentence. 
This as well, right? 
Yes, it is! Dr. Montgomery will see you now.";

my @list = split /([\.\?\!])[\s]*/, $string; 
for (my $m = 0; $m < $#list; $m += 2) {
    $list[$m] .= $list[$m+1];
    undef($list[$m+1]);
}

foreach (@list){
    next if !defined($_);
    print "$_\n";
}


-- 
bowsayge



------------------------------

Date: 2 Aug 2004 19:06:55 GMT
From: anno4000@lublin.zrz.tu-berlin.de (Anno Siegel)
Subject: Re: splitting paragraph into sentences
Message-Id: <cem3cf$pvl$1@mamenchi.zrz.TU-Berlin.DE>

bowsayge  <bowsayge@nomail.afraid.org> wrote in comp.lang.perl.misc:
> Sandman said to us:
> 
> > I've searched the docs, but I can't seem to get it right... I want to
> > split a paragraph into sentences
> 
> I hope this helps you, but, as you will see below, determining when a
> sentence ends might require you to know all of the abbreviations:

That's only the beginning of it.  If an abbreviation is the last word of
a sentence, lexical analysis won't do.  One would have to parse enough
of the language to understand where the sentence ends.
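
To make the limitation concrete, here is a rough sketch of the
"known abbreviations" heuristic. The abbreviation list and the sub name
split_sentences are assumptions for the demo, and the list is deliberately
incomplete. It glues a piece onto the next one when it ends in a listed
abbreviation, which fixes "Dr. Montgomery" but wrongly merges two sentences
when the abbreviation really is the last word.

```perl
use strict;
use warnings;

# Tiny, deliberately incomplete abbreviation list (an assumption for the demo).
my %abbrev = map { $_ => 1 } qw(Dr. Mr. Mrs. Ms. Prof. etc. e.g. i.e.);

# Hypothetical helper: split on whitespace after . ? ! and then re-join
# pieces that end in a known abbreviation. Pieces are re-joined with a
# single space, so the original whitespace is not preserved.
sub split_sentences {
    my ($text) = @_;
    my @sentences;
    my $pending = '';
    for my $piece (split /(?<=[.?!])\s+/, $text) {
        $pending = length $pending ? "$pending $piece" : $piece;
        my ($last_word) = $pending =~ /(\S+)$/;
        next if defined $last_word && $abbrev{$last_word};  # probably not an end
        push @sentences, $pending;
        $pending = '';
    }
    push @sentences, $pending if length $pending;
    return @sentences;
}

print "$_\n" for split_sentences("Yes, it is! Dr. Montgomery will see you now.");
```

Feeding it "I like fruit, nuts, etc. They are healthy." returns a single
sentence, which is exactly the case where lexical analysis is not enough.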

> use strict;
> use warnings;
> 
> my $string = "This is a sentence. This is also a sentence. 
> This as well, right? 
> Yes, it is! Dr. Montgomery will see you now.";
> 
> my @list = split /([\.\?\!])[\s]*/, $string; 

The escapes in the character class are not necessary, [.?!] is valid.

> for (my $m = 0; $m < $#list; $m += 2) {
>     $list[$m] .= $list[$m+1];
>     undef($list[$m+1]);
> }

It would be easier to declare another array and collect the sentences
there.  That way, you don't have to weed out undef's later.  You can
also avoid all index arithmetic.  Like this (untested):

    my @sentences;
    while ( @list ) {
        push @sentences, shift( @list) . shift( @list);
    }

But see below for a solution that doesn't need another variable.

> foreach (@list){
>     next if !defined($_);
>     print "$_\n";
> }

If you have to weed out undefined elements from a list, there's an idiom:

    foreach ( grep defined, @list ) { ...

The "grep" function has many useful applications.  See "perldoc -f grep"
for how it works.
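
A self-contained illustration of that idiom, with a made-up list:

```perl
use strict;
use warnings;

my @list = ('This is a sentence.', undef, 'Yes, it is!', undef);

# grep evaluates the expression with $_ aliased to each element and
# returns the elements for which it is true; "defined" with no
# argument tests $_.
my @kept = grep defined, @list;

print "$_\n" for @kept;
```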

The basis of your method is sound.  You use split with capturing to
get a list of alternating a sentence and the closing punctuation.
If I wanted to join each sentence with the punctuation in place, I'd use
splice():

    my @list = split /([.?!])[\s]*/, $string;
    $list[ $_] .= splice @list, $_ + 1, 1 for 0 .. @list/2 - 1;
    print "$_\n" for @list;

In the unlikely case that the sequence of the sentences doesn't matter,
a hash can do the pairing:

    my %h = split /([.?!])[\s]*/, $string;
    print "$_$h{ $_}\n" for keys %h;

That is not a serious suggestion for the given situation, but it
is a technique worth considering if you have a list of pairs.
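
A short sketch of why the hash version is only safe when the first member
of each pair is unique. The input string is made up for the demo:

```perl
use strict;
use warnings;

my $string = "Stop! Go on. Stop!";

# split with a capturing group returns text, mark, text, mark, ...
my @pairs = split /([.?!])\s*/, $string;

# Assigning that flat list to a hash pairs each text with its mark,
# but a repeated text is stored only once and key order is not preserved.
my %mark_for = @pairs;

printf "%d pieces, %d keys\n", scalar(@pairs) / 2, scalar keys %mark_for;
```

Here three sentences collapse to two keys, because "Stop" occurs twice.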

Anno


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc.  For subscription or unsubscription requests, send
#the single line:
#
#	subscribe perl-users
#or:
#	unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.  

NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice. 

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V10 Issue 6824
***************************************

