[31708] in Perl-Users-Digest


Perl-Users Digest, Issue: 2971 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Jun 3 00:09:23 2010

Date: Wed, 2 Jun 2010 21:09:06 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Wed, 2 Jun 2010     Volume: 11 Number: 2971

Today's topics:
        General natural language analysis question: where do I  <r.ted.byers@gmail.com>
    Re: General natural language analysis question: where d <cartercc@gmail.com>
    Re: General natural language analysis question: where d <r.ted.byers@gmail.com>
    Re: General natural language analysis question: where d <jurgenex@hotmail.com>
    Re: How to generate random number without replacement? <tzz@lifelogs.com>
    Re: How to take two input streams? <derykus@gmail.com>
    Re: How to take two input streams? <cdalten@gmail.com>
    Re: How to take two input streams? <RedGrittyBrick@spamweary.invalid>
        PDF::Template - Create cascade PDF <kalyanrajsista@gmail.com>
    Re: reload sub but call an old one from the new <peter@vereshagin.org>
    Re: Where to install perl modules? <spamtrap@shermpendley.com>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Wed, 2 Jun 2010 09:10:52 -0700 (PDT)
From: Ted Byers <r.ted.byers@gmail.com>
Subject: General natural language analysis question: where do I start?
Message-Id: <5cc41f11-003d-4e48-89fc-1c58f507bd6a@y12g2000vbg.googlegroups.com>

At this point, I don't even know what sort of query to submit to
google to find resources to help find an automated solution to this
problem.  I can do it manually, but that is quite tedious as I have a
couple thousand distinct strings to process, and for all I know, I
could have thousands more a month from now.

This is a business problem, in that the data represents company data
in which the company has provided a description of the business.
E.g.:

"barber & hair salon"
"barber /beauty salon"
"barber college"
"barber salon"
"barber school"
"barber shop "
"barber shop & hair salon"
"barber shop and beauty salon"
"barber shop"
"barber shop/ bar & grill "
"barber shop/ hair salon"
"barber shop/natural hair salon"
"barbershop"
"barbershop/hair salon"
"hair salon  "
"hair salon "
"hair salon & day spa"
"hair salon and spa"
"hair salon"
"hair salon, nails, tanning, products, bistro, crafts & food
consignments"
"hair salon, spa, herbal clinic, boutique all in 1"
"hair salon/ club"
"hair salon/ spa"
"hair salon/nail shop"
"hair school"
"hair store"
"hair studio and hair product distribution"
"hair supply store"

What I need to do is reduce the number of "business types" in the data
to a few rational choices.  I can tell, from visual inspection, that
the businesses with most of the above listed labels, can be grouped as
"personal grooming services".  However, the school/college type
businesses would not be appropriately included in such a group.
Neither would those with the last three labels be appropriately
included in such a group.

This task, as I said, is rather easy, but tedious and time consuming,
to handle manually.

The question is, "Is there a perl package or other resource that would
make this task something I can automate?"  Or, if you have experience
with this sort of thing, can you advise on a suitable search in google
that will produce more useful information than random noise?  I ask
here because this strikes me as a kind of task that perl would be
particularly good at (I have already made a start, using perl, to
clean up the data: e.g. to remove irrelevant characters, spelling
mistakes, &c.).
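
A minimal sketch of that kind of clean-up, assuming the only goal is to
make trivially different spellings of the same description compare
equal. The function names normalize_label and distinct_labels are
hypothetical, and the rules (treat "&" as "and", uniform spacing around
"/") are illustrative assumptions rather than anything taken from the
real data set.

```perl
use strict;
use warnings;

# Normalise one free-text business description (hypothetical helper).
sub normalize_label {
    my ($label) = @_;
    $label = lc $label;
    $label =~ s/&/ and /g;         # assume "&" and "and" mean the same
    $label =~ s{\s*/\s*}{ / }g;    # uniform spacing around "/"
    $label =~ s/\s+/ /g;           # collapse runs of whitespace
    $label =~ s/^ | $//g;          # trim what is left at either end
    return $label;
}

# Reduce a list of raw labels to the distinct normalised forms,
# keeping the order in which they first appear.
sub distinct_labels {
    my %seen;
    return grep { !$seen{$_}++ } map { normalize_label($_) } @_;
}
```

With this, "hair salon  ", "hair salon " and "hair salon" collapse to a
single entry, as do "barber shop & hair salon" and "barber shop and
hair salon".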

Any information you can provide would be appreciated.

Thanks

Ted


------------------------------

Date: Wed, 2 Jun 2010 14:12:48 -0700 (PDT)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: General natural language analysis question: where do I start?
Message-Id: <573c4e51-d1ca-48c6-b78d-172b3d177c95@u7g2000vbq.googlegroups.com>

On Jun 2, 12:10 pm, Ted Byers <r.ted.by...@gmail.com> wrote:

Ted,

Looking at your data, I see that every row contains either 'barber' or
'hair' and that it would be trivial to filter your data according to
this criterion, like this maybe:

push @grooming, $_ if $_ =~ /(barber|hair)/;

Obviously, you need some sets of eyes to decide if a 'barber school'
or 'hair supply store' should be included. My approach might be to use
automation to do some gross sorting and use humans to fine tune your
data.

At the same time, you might develop some heuristics to improve your
automation, realizing that you can't depend on automation for absolute
perfection.
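
One way the "gross sorting plus human fine tuning" idea might look, as
a sketch only. The ordered rule list, the sub names classify and
sort_labels, and the group names other than "personal grooming
services" are all hypothetical; the exclusions follow the original
post (schools and colleges, and the store/supply/distribution labels,
do not belong with the grooming businesses).

```perl
use strict;
use warnings;

# Ordered rules: the first pattern that matches decides the group.
# Exclusions come first so that "barber school" is not caught by "barber".
my @rules = (
    [ qr/\b(?:school|college)\b/               => 'training'                   ],
    [ qr/\b(?:store|supply|distribution)\b/    => 'retail and supply'          ],
    [ qr/\b(?:barber|barbershop|hair|salon)\b/ => 'personal grooming services' ],
);

# Return the group for one lower-cased label, or undef if no rule matches.
sub classify {
    my ($label) = @_;
    for my $rule (@rules) {
        my ($pattern, $group) = @$rule;
        return $group if $label =~ $pattern;
    }
    return;
}

# Split labels into automatically grouped ones and ones left for a human.
sub sort_labels {
    my (%by_group, @review);
    for my $label (@_) {
        my $group = classify(lc $label);
        if (defined $group) { push @{ $by_group{$group} }, $label }
        else                { push @review, $label }
    }
    return (\%by_group, \@review);
}
```

Anything that lands in the review list is exactly the part that still
needs a pair of eyes.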

CC
> "barber & hair salon"
> "barber /beauty salon"
> "barber college"
> "barber salon"
> "barber school"
> "barber shop "
> "barber shop & hair salon"
> "barber shop and beauty salon"
> "barber shop"
> "barber shop/ bar & grill "
> "barber shop/ hair salon"
> "barber shop/natural hair salon"
> "barbershop"
> "barbershop/hair salon"
> "hair salon  "
> "hair salon "
> "hair salon & day spa"
> "hair salon and spa"
> "hair salon"
> "hair salon, nails, tanning, products, bistro, crafts & food
> consignments"
> "hair salon, spa, herbal clinic, boutique all in 1"
> "hair salon/ club"
> "hair salon/ spa"
> "hair salon/nail shop"
> "hair school"
> "hair store"
> "hair studio and hair product distribution"
> "hair supply store"


------------------------------

Date: Wed, 2 Jun 2010 16:17:33 -0700 (PDT)
From: Ted Byers <r.ted.byers@gmail.com>
Subject: Re: General natural language analysis question: where do I start?
Message-Id: <6c683c5c-d0bc-47e3-881b-c81a0c35806d@o39g2000vbd.googlegroups.com>

On Jun 2, 5:12 pm, ccc31807 <carte...@gmail.com> wrote:
> On Jun 2, 12:10 pm, Ted Byers <r.ted.by...@gmail.com> wrote:
>
> Ted,
>
> Looking at your data, I see that every row contains either 'barber' or
> 'hair' and that it would be trivial to filter your data according to
> this criterion, like this maybe:
>
> push @grooming, $_ if $_ =~ /(barber|hair)/;
>
> Obviously, you need some sets of eyes to decide if a 'barber school'
> or 'hair supply store' should be included. My approach might be to use
> automation to do some gross sorting and use humans to fine tune your
> data.
>
> At the same time, you might develop some heuristics to improve your
> automation, realizing that you can't depend on automation for absolute
> perfection.
>
> CC
>
> > "barber & hair salon"
> > "barber /beauty salon"
> > "barber college"
> > "barber salon"
> > "barber school"
> > "barber shop "
> > "barber shop & hair salon"
> > "barber shop and beauty salon"
> > "barber shop"
> > "barber shop/ bar & grill "
> > "barber shop/ hair salon"
> > "barber shop/natural hair salon"
> > "barbershop"
> > "barbershop/hair salon"
> > "hair salon  "
> > "hair salon "
> > "hair salon & day spa"
> > "hair salon and spa"
> > "hair salon"
> > "hair salon, nails, tanning, products, bistro, crafts & food
> > consignments"
> > "hair salon, spa, herbal clinic, boutique all in 1"
> > "hair salon/ club"
> > "hair salon/ spa"
> > "hair salon/nail shop"
> > "hair school"
> > "hair store"
> > "hair studio and hair product distribution"
> > "hair supply store"
>
>

Thanks.

I had noticed, but that was only one illustrative selection.  Going
through the rest of the data since I originally posted, I found other
items that ought to be grouped with barber shops but which include
neither "hair" nor "barber".  I have, in fact, a
file with almost 3000 records covering every imaginable kind of
business, and some for which I have no idea what the business actually
does.

As we're looking at a "simple" classification with something of the
order of 100 logical groups, it would be at least as time consuming to
manually come up with a filter for each group as it is to simply
manually reclassify each using any decent spreadsheet.  I was hoping
that there was a package, with a dictionary, that was able to produce
a relation between a set of phrases and a set of synonymous words that
would accelerate the process.
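
In the absence of a ready-made dictionary, a hand-maintained keyword
table can play that role. The sketch below is an assumption-laden
illustration, not an existing package: the table %group_for, its
entries, the group name "food and drink" and the sub best_group are all
hypothetical. Each recognised word votes for its group and the group
with the most votes wins.

```perl
use strict;
use warnings;

# Hypothetical hand-maintained table: keyword => group.
my %group_for = (
    barber     => 'personal grooming services',
    barbershop => 'personal grooming services',
    hair       => 'personal grooming services',
    salon      => 'personal grooming services',
    spa        => 'personal grooming services',
    nails      => 'personal grooming services',
    tanning    => 'personal grooming services',
    bar        => 'food and drink',
    grill      => 'food and drink',
    bistro     => 'food and drink',
);

# Return the group whose keywords occur most often in $label,
# or undef when no word in the label is in the table.
sub best_group {
    my ($label) = @_;
    my %votes;
    for my $word (lc($label) =~ /[a-z]+/g) {
        my $group = $group_for{$word} or next;
        $votes{$group}++;
    }
    # Highest vote count first; ties are broken alphabetically.
    my ($best) = sort { $votes{$b} <=> $votes{$a} || $a cmp $b } keys %votes;
    return $best;
}
```

Because the table is data rather than code, covering a new group means
adding rows instead of writing another filter. Labels for which
best_group returns undef still need manual classification.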

Thanks again,

Ted


------------------------------

Date: Wed, 02 Jun 2010 18:17:38 -0700
From: Jürgen Exner <jurgenex@hotmail.com>
Subject: Re: General natural language analysis question: where do I start?
Message-Id: <2f0e06prdo2lmjhlcqs7t43oeefetvht3d@4ax.com>

Ted Byers <r.ted.byers@gmail.com> wrote:
>At this point, I don't even know what sort of query to submit to
>google to find resources to help find an automated solution to this
>problem.  I can do it manually, but that is quite tedious as I have a
>couple thousand distinct strings to process, and for all I know, I
>could have thousands more a month from now.
>
>This is a business problem, in that the data represents company data
>in which the company has provided a description of the business.
>E.g.:
>
>"barber & hair salon"
>"barber /beauty salon"
>"barber college"
>"barber salon"
>"barber school"
>"barber shop "
>"barber shop & hair salon"
>"barber shop and beauty salon"
>"barber shop"
>"barber shop/ bar & grill "
>"barber shop/ hair salon"
>"barber shop/natural hair salon"
>"barbershop"
>"barbershop/hair salon"
>"hair salon  "
>"hair salon "
>"hair salon & day spa"
>"hair salon and spa"
>"hair salon"
>"hair salon, nails, tanning, products, bistro, crafts & food
>consignments"
>"hair salon, spa, herbal clinic, boutique all in 1"
>"hair salon/ club"
>"hair salon/ spa"
>"hair salon/nail shop"
>"hair school"
>"hair store"
>"hair studio and hair product distribution"
>"hair supply store"
>
>What I need to do is reduce the number of "business types" in the data
>to a few rational choices. 

There are people who have done that already. You can find their
classification and "business types" in any yellow pages book.

>I can tell, from visual inspection, that
>the businesses with most of the above listed labels, can be grouped as
>"personal grooming services".  However, the school/college type
>businesses would not be appropriately included in such a group.
>Neither would those with the last three labels be appropriately
>included in such a group.

No chance but to manually classify them. You might be able to automate
some of it (e.g. for "barber shop"), but otherwise you need semantic
knowledge.

jue


------------------------------

Date: Wed, 02 Jun 2010 08:37:19 -0500
From: Ted Zlatanov <tzz@lifelogs.com>
Subject: Re: How to generate random number without replacement?
Message-Id: <87y6exwo6o.fsf@lifelogs.com>

On Tue, 01 Jun 2010 20:54:29 -0700 Xho Jingleheimerschmidt <xhoster@gmail.com> wrote: 

XJ> Ted Zlatanov wrote:
>> On Tue, 01 Jun 2010 04:50:36 -0400 "Uri Guttman"
>> <uri@StemSystems.com> wrote: 
>> 
UG> my %seen ;
UG> while( 1 ) { $x = int rand( 100_000_000 ) ; $seen{$x} and next ;
UG> $seen{$x} = 1;  print $x }
>> 
>> This will grow pretty quickly with a hash.  Bit::Vector already has
>> Bit_On($index) and bit_test($index) so memory usage and probably
>> performance will be a bit (heh) better.

XJ> I'd be very surprised if Bit::Vector had faster performance, at least
XJ> until the other method started swapping.

It's C internally so performance is pretty good.  But you're probably
right anyhow, I wasn't thinking.

On Tue, 01 Jun 2010 15:37:34 -0400 "Uri Guttman" <uri@StemSystems.com> wrote: 

UG> he said he wanted 1k random numbers out of a large range so a hash would
UG> be fine for that.

You're right, I wasn't paying enough attention.
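
For the case described in the thread (about 1k values out of a large
range), the hash approach might be wrapped up as follows. This is a
sketch: the sub name sample_without_replacement and the guard against
asking for more values than the range holds are additions that do not
appear in the quoted code.

```perl
use strict;
use warnings;

# Return $count distinct random integers from 0 .. $range - 1.
sub sample_without_replacement {
    my ($count, $range) = @_;
    die "Cannot draw $count distinct values from a range of $range\n"
        if $count > $range;

    my %seen;
    my @picked;
    while (@picked < $count) {
        my $x = int(rand($range));
        next if $seen{$x}++;    # already drawn, try again
        push @picked, $x;
    }
    return @picked;
}
```

The hash only ever holds $count keys, so memory stays small as long as
the sample is small relative to the range; retries become frequent only
when $count approaches $range.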

Ted


------------------------------

Date: Wed, 2 Jun 2010 03:35:40 -0700 (PDT)
From: "C.DeRykus" <derykus@gmail.com>
Subject: Re: How to take two input streams?
Message-Id: <f54b82cf-b10b-4909-bda8-7bf9d9e06a93@j12g2000pri.googlegroups.com>

On Jun 1, 2:26 am, Peng Yu <pengyu...@gmail.com> wrote:
> On Jun 1, 3:24 am, Martijn Lievaart <m...@rtij.nl.invlalid> wrote:
>
 ...
>
> #!/usr/bin/env perl
>
> use warnings;
>
> open(IN1, $ARGV[0]);
> open(IN2, $ARGV[1]);
>
> while(<IN1>) {
>   print
>
> }
>
> print "------\n";
>
> while(<IN2>) {
>   print
>
> }
>
>

Perl provides a handy command line shortcut
if that's all you need (perldoc perlrun):


perl -nwe 'print; print "------\n" if eof' file1 file2 ...

--
Charles DeRykus
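
A script-level sketch of the same task, for cases where the separator
should appear only between files, as in the quoted script. It uses
three-argument open with lexical filehandles and checks each open for
failure; the sub name print_files_with_separator and its argument order
are hypothetical.

```perl
use strict;
use warnings;

# Copy each named file to $out, writing $sep between consecutive files
# (but not after the last one).
sub print_files_with_separator {
    my ($out, $sep, @files) = @_;
    for my $i (0 .. $#files) {
        open my $in, '<', $files[$i]
            or die "Cannot open '$files[$i]': $!";
        while (my $line = <$in>) {
            print {$out} $line;
        }
        close $in;
        print {$out} $sep if $i < $#files;
    }
    return;
}

# Possible use from a script:
#   print_files_with_separator(\*STDOUT, "------\n", @ARGV);
```

Since the arguments are opened as ordinary paths, this should also work
with the /proc/self/fd/N or /dev/fd/N names that the shell passes for
<(cmd).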



------------------------------

Date: Wed, 2 Jun 2010 08:08:13 -0700 (PDT)
From: chad <cdalten@gmail.com>
Subject: Re: How to take two input streams?
Message-Id: <f3a09dae-01d6-4f14-91e7-287007f05c6e@a20g2000vbc.googlegroups.com>

On Jun 1, 10:22 am, Martijn Lievaart <m...@rtij.nl.invlalid> wrote:
> On Tue, 01 Jun 2010 02:26:59 -0700, Peng Yu wrote:
> > On Jun 1, 3:24 am, Martijn Lievaart <m...@rtij.nl.invlalid> wrote:
> >> On Mon, 31 May 2010 20:47:22 -0700, Peng Yu wrote:
> >> > diff can take two input streams in the following example (if my
> >> > interpretation is correct).
>
> >> > diff <(gunzip <a.gz) <(gunzip b.gz)
>
> >> > I'm wondering how to take two streams in a perl program.
>
> >> This has nothing to do with diff or with perl, it's a function of your
> >> shell. So it works the same for diff as for perl.
>
> > I think that I understand what you mean. <(cmd) is just like a filename,
> > right?
>
> It actually gets passed to your program as a filename, although it really
> is a pipe to the command between the brackets.
>
> [martijn@cow t]$ perl -e 'print "@ARGV\n"' <(cat t.pl) <(cat t.pl~)
> /proc/self/fd/63 /proc/self/fd/62
> [martijn@cow t]$
>

Why do you use the brackets in '<(cmd)'? Ie, why can't you just do
something like '<cmd' ?


------------------------------

Date: Wed, 02 Jun 2010 16:55:53 +0100
From: RedGrittyBrick <RedGrittyBrick@spamweary.invalid>
Subject: Re: How to take two input streams?
Message-Id: <4c067f0b$0$12167$fa0fcedb@news.zen.co.uk>

On 02/06/2010 16:08, chad wrote:
> On Jun 1, 10:22 am, Martijn Lievaart <m...@rtij.nl.invlalid> wrote:
>> On Tue, 01 Jun 2010 02:26:59 -0700, Peng Yu wrote:
>>> On Jun 1, 3:24 am, Martijn Lievaart <m...@rtij.nl.invlalid> wrote:
>>>> On Mon, 31 May 2010 20:47:22 -0700, Peng Yu wrote:
>>>>> diff can take two input streams in the following example (if my
>>>>> interpretation is correct).
>>
>>>>> diff <(gunzip <a.gz) <(gunzip b.gz)
>>
>>>>> I'm wondering how to take two streams in a perl program.
>>
>>>> This has nothing to do with diff or with perl, it's a function of your
>>>> shell. So it works the same for diff as for perl.
>>
>>> I think that I understand what you mean. <(cmd) is just like a filename,
>>> right?
>>
>> It actually gets passed to your program as a filename, although it really
>> is a pipe to the command between the brackets.
>>
>> [martijn@cow t]$ perl -e 'print "@ARGV\n"' <(cat t.pl) <(cat t.pl~)
>> /proc/self/fd/63 /proc/self/fd/62
>> [martijn@cow t]$
>>
>
> Why do you use the brackets in '<(cmd)'? Ie, why can't you just do
> something like '<cmd' ?

Because the shell would look for a data file named 'cmd' in the current 
directory and would not execute it as a command.

$ wc -l <ls
-bash: ls: No such file or directory

$ wc -l <(ls)
     105 /dev/fd/63
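
If relying on the shell's <(cmd) syntax is not an option, a Perl
program can start the commands itself and read their output through
pipes. A minimal sketch, assuming a Unix-like system; the sub name
open_command is hypothetical, and the gunzip invocations in the usage
comment assume gunzip is on the PATH and reuse the file names from the
thread.

```perl
use strict;
use warnings;

# Start @cmd and return a filehandle that reads its standard output.
# The list form of open bypasses the shell, so arguments need no quoting.
sub open_command {
    my @cmd = @_;
    open my $fh, '-|', @cmd
        or die "Cannot run '@cmd': $!";
    return $fh;
}

# Hypothetical usage with two compressed inputs:
#   my $in1 = open_command('gunzip', '-c', 'a.gz');
#   my $in2 = open_command('gunzip', '-c', 'b.gz');
#   while (defined(my $line = <$in1>)) { ... }
```

Each handle is an independent stream, so the two inputs can be read in
whatever order the program needs.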


-- 
RGB


------------------------------

Date: Wed, 2 Jun 2010 02:48:06 -0700 (PDT)
From: alwaysonnet <kalyanrajsista@gmail.com>
Subject: PDF::Template - Create cascade PDF
Message-Id: <f24b1d84-fa22-4622-9eb6-5805f539b1a4@k31g2000vbu.googlegroups.com>

Hello

I've been using PDF::Template Module to generate PDF based on my input
datahash. Please check the working code below

I want to generate one more set of data in another page without
changing the XML File. As data comes dynamically I'm not sure of how
many times I must include the LOOP attribute in my XML File

Any ideas or suggestions are appreciated.
Many Thanks

use strict;
use warnings;
use PDF::Template;
my $rpt = PDF::Template->new( FILENAME => 'sample.xml' );

my %Inputs = (
    'svctype'     => 'Voice',
    'fromcur'     => 'EUR',
    'LOOPDETAIL2' => [
        {
            'seqnum'  => 878,
            'date'    => '01-Apr-2010',
            'tax'     => '0.000',
            'posttax' => '0.000',
            'pretax'  => '0.000',
            'israp'   => 'FALSE'
        },
        {
            'seqnum'  => 879,
            'date'    => '02-Apr-2010',
            'tax'     => '0.000',
            'posttax' => '0.000',
            'pretax'  => '0.000',
            'israp'   => 'FALSE'
        },
        {
            'seqnum'  => 880,
            'date'    => '03-Apr-2010',
            'tax'     => '0.000',
            'posttax' => '0.000',
            'pretax'  => '0.000',
            'israp'   => 'FALSE'
        },
    ],
    'pretax_totals'  => '0.000',
    'tax_total'      => '0.000',
    'posttax_totals' => '0.000',
);

$rpt->param(%Inputs);

$rpt->write_file('/export/home/kars/sample.pdf');

Input XML File is as follows

<pdftemplate name="test">
	<pagedef margins="1i" pagesize="A4" nopagenumber="0">
		<font face="Helvetica" h="10">
			<if name="svctype" op="ne" value="SMS">
				<font face="Helvetica-Bold">
					<row>
						<textbox w="100%" h="*2" text=""/>
					</row>
					<row h="*1.5">
						<textbox w="20%" border="1" justify="center">Date</textbox>
						<textbox w="20%" border="1" justify="center">Sequence Number Range</textbox>
						<textbox w="20%" border="1" justify="center">Pre-Tax Value in <var name="fromcur"/>
						</textbox>
						<textbox w="20%" border="1" justify="center">  Tax Value in        <var name="fromcur"/>
						</textbox>
						<textbox w="20%" border="1" justify="center">Post-Tax Value in <var name="fromcur"/>
						</textbox>
					</row>
				</font>
				<loop name="LOOPDETAIL2">
					<row h="*1.5">
						<if name="israp" op="eq" value="FALSE">
							<textbox w="20%" border="1" justify="center" text="$date"/>
							<textbox w="20%" border="1" justify="center" text="$seqnum"/>
							<textbox w="20%" border="1" justify="right" text="$pretax"/>
							<textbox w="20%" border="1" justify="right" text="$tax"/>
							<textbox w="20%" border="1" justify="right" text="$posttax"/>
						</if>
					</row>
				</loop>
				<font face="Helvetica-Bold">
					<row h="*1.5">
						<textbox w="20%" border="1" justify="center" text=""/>
						<textbox w="20%" border="1" justify="center" text="Total"/>
						<textbox w="20%" border="1" justify="right" text="$pretax_totals"/>
						<textbox w="20%" border="1" justify="right" text="$tax_total"/>
						<textbox w="20%" border="1" justify="right" text="$posttax_totals"/>
					</row>
				</font>
			</if>
		</font>
	</pagedef>
</pdftemplate>


------------------------------

Date: Wed, 2 Jun 2010 09:38:31 +0000 (UTC)
From: Peter Vereshagin <peter@vereshagin.org>
Subject: Re: reload sub but call an old one from the new
Message-Id: <20100602093829.GA52760@screwed.box>

God love is hard to find. You got lucky Ted!
2010/06/01 13:27:19 -0500 Ted Zlatanov <tzz@lifelogs.com> =>  comp.lang.perl.misc:

PV>>   *old_sub = *some_sub;
PV>>   sub some_sub{
PV>>     $ENV{ FOO } = 'BAR';
PV>>     &old_sub( @_ );
PV>>   }


TZ> Have you tried Aspect from CPAN?  Its "advice" facility is, I think,
TZ> what you're looking for.

Very likely, thanks a lot!
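
For comparison, a plain-Perl sketch of wrapping a sub without any
module: take a code reference to the current implementation first, then
install the replacement into the glob. Aliasing whole globs, as in the
quoted snippet, leaves old_sub and some_sub naming the same sub, so the
new one ends up calling itself. The body of some_sub below is a
hypothetical stand-in, and using local on $ENV{FOO} is a design choice
that differs from the quoted code, which sets it permanently.

```perl
use strict;
use warnings;

# Stand-in for the original implementation (hypothetical body).
sub some_sub {
    return "FOO=" . ($ENV{FOO} // 'unset') . " args=@_";
}

{
    # Capture a reference to the current implementation before replacing it.
    my $old_sub = \&some_sub;

    no warnings 'redefine';
    *some_sub = sub {
        local $ENV{FOO} = 'BAR';    # restored when the call returns
        return $old_sub->(@_);
    };
}
```

After this block runs, calls to some_sub go through the wrapper, which
in turn calls the saved original.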

73! Peter pgp: A0E26627 (4A42 6841 2871 5EA7 52AB  12F8 0CE1 4AAC A0E2 6627)
-- 
http://vereshagin.org


------------------------------

Date: Wed, 02 Jun 2010 09:59:03 -0400
From: Sherm Pendley <spamtrap@shermpendley.com>
Subject: Re: Where to install perl modules?
Message-Id: <m2hblly1qw.fsf@shermpendley.com>

Peng Yu <pengyu.ut@gmail.com> writes:

> I'm wondering where the package should install the perl modules to.

Why do you think you need to specify the install location? The normal
"perl Makefile.PL; make; make test; make install" will do the right
thing, no matter what install prefix you used when you installed your
copy of Perl.

sherm--

-- 
Sherm Pendley                <www.shermpendley.com>
                             <www.camelbones.org>
Cocoa Developer


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 2971
***************************************

