[31755] in Perl-Users-Digest


Perl-Users Digest, Issue: 3018 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Tue Jul 6 11:26:54 2010

Date: Tue, 6 Jul 2010 08:16:13 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Tue, 6 Jul 2010     Volume: 11 Number: 3018

Today's topics:
        [OT] Re: Parsing file names with spaces <m@rtij.nl.invlalid>
        Archive::Tar, difference in size of output file <justin.1007@purestblue.com>
    Re: Archive::Tar, difference in size of output file <peter@makholm.net>
    Re: Archive::Tar, difference in size of output file <ben@morrow.me.uk>
    Re: Are there any MySQL queries or software packages fo <cartercc@gmail.com>
    Re: Are there any MySQL queries or software packages fo <erick.use-net@ardane.c.o.m>
    Re: Are there any MySQL queries or software packages fo <axel.schwenke@gmx.de>
    Re: Are there any MySQL queries or software packages fo <axel.schwenke@gmx.de>
    Re: Are there any MySQL queries or software packages fo <pDOTpagel@wzw.tum.de>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Mon, 5 Jul 2010 16:51:40 +0200
From: Martijn Lievaart <m@rtij.nl.invlalid>
Subject: [OT] Re: Parsing file names with spaces
Message-Id: <s489g7-2sd.ln1@news.rtij.nl>

On Mon, 05 Jul 2010 13:55:34 +0200, Dr.Ruud wrote:

> Martijn Lievaart wrote:
> 
>> The bigger the project
> 
> , the bigger the failure.

:-)

One customer I work at has the ability to start a huge project to unify 
the N tools in use for a certain purpose. With the next round of budget 
cuts, the project is reduced in scope and we end up with N+1 tools for 
the same purpose.

M4



------------------------------

Date: Tue, 6 Jul 2010 13:37:59 +0100
From: Justin C <justin.1007@purestblue.com>
Subject: Archive::Tar, difference in size of output file
Message-Id: <slrni368t7.q1e.justin.1007@zem.masonsmusic.co.uk>

I'm working on a program to create .tgz archives of catalogue images for
our customers to download. Initially I was doing this:

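use Archive::Tar;   # also exports the COMPRESS_GZIP constant

my $tar = Archive::Tar->new();
foreach my $dir (0..9, 'a'..'z') {
    $tar->add_files(glob "$dir/*jpg");
}
# The third argument is a prefix prepended to every path in the archive.
$tar->write($fname, COMPRESS_GZIP, "catalogue_images");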

This was creating .tgz files much, much larger than the total
uncompressed size of images. I decided to try a different way of
creating the archive, and now do this:

my @files;
foreach my $dir (0..9, 'a'..'z') {
    push @files, glob "$dir/*jpg";
}
Archive::Tar->create_archive($fname, COMPRESS_GZIP, @files);

and the file sizes are, as I would expect, much smaller. 

Can someone tell me why this is?

	Justin.

-- 
Justin C, by the sea.


------------------------------

Date: Tue, 06 Jul 2010 15:21:47 +0200
From: Peter Makholm <peter@makholm.net>
Subject: Re: Archive::Tar, difference in size of output file
Message-Id: <87630soidw.fsf@vps1.hacking.dk>

Justin C <justin.1007@purestblue.com> writes:

> my $tar = Archive::Tar->new();
> foreach my $dir (0..9, 'a'..'z') {
>     $tar->add_files(glob "$dir/*jpg");
> }
> $tar->write($fname, COMPRESS_GZIP, "catalogue_images")
>
> This was creating .tgz files much, much larger than the total
> uncompressed size of images.

If you examine the resulting file, does it contain exactly what you
expect?

> I decided to try a different way of creating the archive, and now do
> this:
>
> my @files;
> foreach my $dir (0..9, 'a'..'z') {
>     push @files, glob "$dir/*jpg";
> }
> Archive::Tar->create_archive($fname, COMPRESS_GZIP, @files);

When I look up the implementation of create_archive, it looks exactly
like this:

sub create_archive {
    my $class = shift;

    my $file    = shift; return unless defined $file;
    my $gzip    = shift || 0;
    my @files   = @_;

    unless( @files ) {
        return $class->_error( qq[Cowardly refusing to create empty archive!] );
    }

    my $tar = $class->new;
    $tar->add_files( @files );
    return $tar->write( $file, $gzip );
}

That is almost the same as your initial version, except for the use of
a prefix.

//Makholm
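
One way to answer the question above (does the archive contain exactly
what you expect?) is to build both variants from the same inputs and
list what each one holds. The following is only a minimal sketch under
stated assumptions: the img1.jpg .. img3.jpg files are hypothetical
stand-ins created in a temporary directory, and the archive names are
made up. It does not reproduce the size difference reported in the
original post; it only shows how the two results could be compared.

```perl
use strict;
use warnings;
use Archive::Tar;
use Cwd qw(getcwd);
use File::Temp qw(tempdir);

my $start_dir = getcwd();
my $work_dir  = tempdir(CLEANUP => 1);
chdir $work_dir or die "chdir $work_dir: $!";

# Hypothetical stand-ins for the catalogue images.
my @files;
for my $n (1 .. 3) {
    my $name = "img$n.jpg";
    open my $fh, '>:raw', $name or die "open $name: $!";
    print {$fh} 'x' x (1000 * $n);
    close $fh or die "close $name: $!";
    push @files, $name;
}

# Variant 1: explicit object, written with a path prefix.
my $tar = Archive::Tar->new;
$tar->add_files(@files);
$tar->write('with_prefix.tgz', COMPRESS_GZIP, 'catalogue_images')
    or die $tar->error;

# Variant 2: the create_archive shortcut, no prefix.
Archive::Tar->create_archive('plain.tgz', COMPRESS_GZIP, @files)
    or die Archive::Tar->error;

# Read each archive back and record its members and on-disk size.
my (%members, %bytes);
for my $archive ('with_prefix.tgz', 'plain.tgz') {
    my $check = Archive::Tar->new($archive) or die Archive::Tar->error;
    $bytes{$archive}   = -s $archive;
    $members{$archive} = [ map { $_->full_path } $check->get_files ];
    print "$archive: $bytes{$archive} bytes\n";
    for my $entry ($check->get_files) {
        printf "  %-30s %6d\n", $entry->full_path, $entry->size;
    }
}

chdir $start_dir or die "chdir $start_dir: $!";
```

If the member lists differ in anything other than the prefix (duplicate
entries, unexpected files), that would be the first place to look.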


------------------------------

Date: Tue, 6 Jul 2010 14:30:23 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Archive::Tar, difference in size of output file
Message-Id: <fonbg7-fas2.ln1@osiris.mauzo.dyndns.org>


Quoth Justin C <justin.1007@purestblue.com>:
> I'm working on a program to create .tgz archives of catalogue images for
> our customers to download. Initially I was doing this:
> 
> my $tar = Archive::Tar->new();
> foreach my $dir (0..9, 'a'..'z') {
>     $tar->add_files(glob "$dir/*jpg");
> }
> $tar->write($fname, COMPRESS_GZIP, "catalogue_images")
> 
> This was creating .tgz files much, much larger than the total
> uncompressed size of images. I decided to try a different way of
> creating the archive, and now do this:
> 
> my @files;
> foreach my $dir (0..9, 'a'..'z') {
>     push @files, glob "$dir/*jpg";
> }
> Archive::Tar->create_archive($fname, COMPRESS_GZIP, @files);
> 
> and the file sizes are, as I would expect, much smaller. 
> 
> Can someone tell me why this is?

I don't see that here: the two files are exactly the same size. What
version of Archive::Tar are you using? Can you see what the difference
is between the two files: is one of them simply not compressed?

Ben
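
Whether a file is "simply not compressed" can be checked by looking at
its first two bytes: a gzip stream starts with 0x1f 0x8b. A minimal
sketch, assuming a hypothetical helper named looks_gzipped and a
made-up demo payload written to a temporary directory:

```perl
use strict;
use warnings;
use Archive::Tar;
use File::Temp qw(tempdir);

# True if the file starts with the two gzip magic bytes, 0x1f 0x8b.
sub looks_gzipped {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    my $count = read $fh, my $magic, 2;
    close $fh;
    return ( defined $count && $count == 2 && $magic eq "\x1f\x8b" ) ? 1 : 0;
}

my $dir   = tempdir(CLEANUP => 1);
my $gz    = "$dir/demo.tgz";
my $plain = "$dir/demo.tar";

# Write the same in-memory entry twice: once gzipped, once as plain tar.
my $tar = Archive::Tar->new;
$tar->add_data('demo.txt', 'hello ' x 1000);   # hypothetical payload
$tar->write($gz, COMPRESS_GZIP) or die $tar->error;
$tar->write($plain)             or die $tar->error;

for my $path ($gz, $plain) {
    my $size = -s $path;
    printf "%s gzip=%d size=%d\n", $path, looks_gzipped($path), $size;
}
```

Running the helper against both .tgz files from the original post would
show quickly whether one of them is a plain tar file with a misleading
extension.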



------------------------------

Date: Tue, 6 Jul 2010 08:02:45 -0700 (PDT)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: Are there any MySQL queries or software packages for "finding similar items"
Message-Id: <ecad6db4-5bc5-4ec3-8317-2fe8e952912e@t10g2000yqg.googlegroups.com>

On Jul 5, 4:16 pm, Ignoramus12110 <ignoramus12...@NOSPAM.12110.invalid> wrote:
> I have a MySQL database of answered algebra questions. Questions are
> stored as text strings.
> When students ask questions, often (if not usually) there is already
> something similar answered in the database. Note that I am not
> defining what is "similar" and I do realize that it is a difficult
> definition to make.

As strong as Perl is at string manipulation, this is the kind of
problem domain that Lisp is ideally suited for. At least one
introduction, Lisp 3rd (Winston and Horn), devotes the last half of
the book to these kinds of problems, and can be had for $1.68 as a
used book on amazon.com, half.com, etc.

I don't know what kind of time you have to devote to solving the
problem, or the strength of your interest, or your previous
experience, but I would strongly suggest that if you have the time,
interest, and experience, that you would do well to read through W&H,
Lisp 3rd.

If you want something meatier, Paradigms of Artificial Intelligence
Programming: Case Studies in Common Lisp (Norvig) seems to make the
top ten list of everyone's Best Books in computer science.

Essentially, what you would want to do is parse the student query for
key words, perhaps building a database of common search terms, and
match them against your database, perhaps iteratively using random
permutations, using the standard Lisp pattern matching techniques.

CC.


------------------------------

Date: 5 Jul 2010 20:38:32 GMT
From: "Erick T. Barkhuis" <erick.use-net@ardane.c.o.m>
Subject: Re: Are there any MySQL queries or software packages for "finding similar items"
Message-Id: <89eu68F28qU1@mid.individual.net>

Ignoramus12110:
 ...
>So... Any suggestion for software to rank strings by similarity and
>provide "top 5" or something like that?

All I can come up with is Levenshtein (not much experience using it,
though).
May I suggest you use "levenshtein mysql" or "levenshtein php" as a
search phrase?


-- 
Erick
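
For readers who have not met it, Levenshtein distance counts the
single-character insertions, deletions and substitutions needed to turn
one string into another. The following is a minimal pure-Perl sketch of
the textbook dynamic-programming version plus a "top N" ranking; the
function names levenshtein and top_similar are hypothetical, and a CPAN
module would normally be preferable to hand-rolled code.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Edit distance between two strings, keeping only two rows of the table.
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);          # distance from '' to each prefix of $t
    for my $i (1 .. scalar @s) {
        my @cur = ($i);                   # distance from prefix of $s to ''
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,            # deletion
                $cur[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $cost,    # substitution or match
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Return up to $limit candidates, closest to $query first.
sub top_similar {
    my ($query, $limit, @candidates) = @_;
    my @ranked = sort { $a->[0] <=> $b->[0] or $a->[1] cmp $b->[1] }
                 map  { [ levenshtein(lc $query, lc $_), $_ ] } @candidates;
    splice @ranked, $limit if @ranked > $limit;
    return map { $_->[1] } @ranked;
}

my @stored = (
    'A flag pole casts a shadow of 32 feet. Nearby, a 10 foot tree casts a shadow of 2 ft. Find the height of the flag pole?',
    'Solve 2x + 3 = 11 for x.',
);
my $asked = 'A flagpole casts a shadow of 32 ft, Nearby, a 10-ft tree casts a shadow of 2 ft. What is the height of the flag pole?';
print "$_\n" for top_similar($asked, 5, @stored);
```

Each comparison costs time proportional to the product of the two
string lengths, so scanning every stored question this way only suits
small candidate sets.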


------------------------------

Date: Mon, 5 Jul 2010 23:33:13 +0200
From: Axel Schwenke <axel.schwenke@gmx.de>
Subject: Re: Are there any MySQL queries or software packages for "finding similar items"
Message-Id: <plv9g7-5sa.ln1@xl.homelinux.org>

Ignoramus12110 <ignoramus12110@NOSPAM.12110.invalid> wrote:
>
> I am hoping that, perhaps, there is some free package that could take
> a few hundreds of thousands of text strings and could provide me with
> "find similar" functionality.
>
> Realizing the potential difficulty of the task, I would be content if
> it worked only moderately well. I just want something along the lines.
>
> Are there any MySQL functions or other software packages or perl
> modules that provide something of the sort.

CPAN has some packages for approximate string matching. Levenshtein has
been named. And virtually all SQL databases have SOUNDEX(). Another
approach is trigram counting.

The problem is hard, especially when you look for a solution that runs
faster than O(n). Outside the database you cannot be faster than O(n)
anyway. For a "few thousand" candidates, however, it will be fast enough.


XL
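
Trigram counting can be sketched as follows: break each string into
overlapping three-character pieces and compare the resulting sets. This
is only one possible variant, chosen here as an assumption (Jaccard
overlap of trigram sets after lower-casing and stripping punctuation);
the names trigrams and trigram_similarity are hypothetical.

```perl
use strict;
use warnings;

# Return a hash ref whose keys are the trigrams of the normalised text.
sub trigrams {
    my ($text) = @_;
    my $norm = lc $text;
    $norm =~ s/[^a-z0-9]+/ /g;      # collapse punctuation and whitespace
    $norm =~ s/^ +| +$//g;          # trim
    $norm = "  $norm ";             # pad so word starts and ends count too
    my %seen;
    $seen{ substr $norm, $_, 3 } = 1 for 0 .. length($norm) - 3;
    return \%seen;
}

# Jaccard similarity of the two trigram sets: 0 (disjoint) to 1 (identical).
sub trigram_similarity {
    my ($x, $y) = map { trigrams($_) } @_;
    my $shared = grep { exists $y->{$_} } keys %$x;
    my $union  = keys(%$x) + keys(%$y) - $shared;
    return $union ? $shared / $union : 0;
}

my $q1 = 'A flagpole casts a shadow of 32 ft, Nearby, a 10-ft tree casts a shadow of 2 ft.';
my $q2 = 'A flag pole casts a shadow of 32 feet. Nearby, a 10 foot tree casts a shadow of 2 ft.';
my $q3 = 'Solve 2x + 3 = 11 for x.';

printf "similar pair:   %.2f\n", trigram_similarity($q1, $q2);
printf "unrelated pair: %.2f\n", trigram_similarity($q1, $q3);
```

Unlike edit distance, trigrams can be stored in an indexed table, which
is what makes this approach attractive for larger candidate sets.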


------------------------------

Date: Tue, 6 Jul 2010 07:30:05 +0200
From: Axel Schwenke <axel.schwenke@gmx.de>
Subject: Re: Are there any MySQL queries or software packages for "finding similar items"
Message-Id: <tjrag7-qas.ln1@xl.homelinux.org>

Ignoramus12110 <ignoramus12110@NOSPAM.12110.invalid> wrote:
> On 2010-07-05, Axel Schwenke <axel.schwenke@gmx.de> wrote:
>>
>> CPAN has some packages for approximate string matching. Levenshtein has
>> been named. And virtually all SQL databases have SOUNDEX(). Another
>> approach is trigram counting.
>
> Thanks. Do you know any package names?

CPAN does:

http://search.cpan.org/search?query=levenshtein&mode=all

This also lists some non-Levenshtein implementations.

>> The problem is hard, especially when you look for a solution that runs
>> faster than O(n). Outside the database you cannot be faster than O(n)
>> anyway. For "few thousands" candidates it will however be fast enough.
>
> Right now I have 208,919 candidates and the number is growing by
> appx. 200 per day.

Then non-indexed solutions might be a little slow.

Approximate matching is one facet of fulltext search engines. So you
might want to try one of those. MySQL itself comes with a (limited)
implementation of a FULLTEXT index. And there is a wealth of fulltext
engines: Sphinx, Lucene, Xapian, Swish++, mnogosearch, ...

I named SOUNDEX() only for completeness. The nice thing about it is
that it is available virtually everywhere. Its usefulness is quite
limited, though, because it is very coarse and works only for English.


XL


------------------------------

Date: Tue, 6 Jul 2010 13:15:24 +0000 (UTC)
From: Philipp Pagel <pDOTpagel@wzw.tum.de>
Subject: Re: Are there any MySQL queries or software packages for "finding similar items"
Message-Id: <i0va9c$ktf$1@news.lrz-muenchen.de>

In comp.os.linux.misc Ignoramus12110 <ignoramus12110@nospam.12110.invalid> wrote:

> ``A flagpole casts a shadow of 32 ft, Nearby, a 10-ft tree casts a
> shadow of 2 ft. What is the height of the flag pole?''

> ``A flag pole casts a shadow of 32 feet. Nearby, a 10 foot tree
> casts a shadow of 2 ft. Find the height of the flag pole?''

> are similar. 
>
> I am hoping that, perhaps, there is some free package that could take
> a few hundreds of thousands of text strings and could provide me with
> "find similar" functionality. 

The BLAST software does roughly what you are looking for, but outside
of MySQL, and it was written specifically for finding similar sequences
in DNA or protein databases. That said, it should be possible to tweak
it to accept input with arbitrary alphabets. Another program from the
bioinformatics world would be FASTA. At least BLAST can be found as a
Debian package (and probably as packages for other distributions, too).

ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
http://faculty.virginia.edu/wrpearson/fasta/

For a few hundred or thousand comparisons and a little patience, you
may also consider using the sequence alignment algorithms by Needleman
and Wunsch or Smith and Waterman -- depending on your needs. If you
would like to explore this route and are willing to do some code
tweaking/expanding I could provide a simple implementation written in
Python (just email me). And of course you can find tons of other
implementations on the web because it's a classic for bioinformatics
students.


cu
	Philipp

-- 
Dr. Philipp Pagel
Lehrstuhl f. Genomorientierte Bioinformatik
Technische Universität München
http://webclu.bio.wzw.tum.de/~pagel/
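
To give a flavour of the alignment approach mentioned above, here is a
minimal Perl sketch of the Smith-Waterman local alignment score, applied
to plain characters. It is not the Python implementation offered in the
post. The function name smith_waterman_score, the scoring values (match
+2, mismatch -1, gap -1) and the linear gap penalty are all assumptions
made for illustration.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Best local alignment score between two strings (score only, no traceback).
sub smith_waterman_score {
    my ($s, $t, %opt) = @_;
    my $match    = defined $opt{match}    ? $opt{match}    :  2;
    my $mismatch = defined $opt{mismatch} ? $opt{mismatch} : -1;
    my $gap      = defined $opt{gap}      ? $opt{gap}      : -1;

    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0) x (@t + 1);      # first row of the table is all zeros
    my $best = 0;

    for my $i (1 .. scalar @s) {
        my @cur = (0);              # first column is all zeros
        for my $j (1 .. scalar @t) {
            my $diag  = $prev[$j - 1]
                      + ($s[$i - 1] eq $t[$j - 1] ? $match : $mismatch);
            my $score = max(0, $diag, $prev[$j] + $gap, $cur[$j - 1] + $gap);
            push @cur, $score;
            $best = $score if $score > $best;
        }
        @prev = @cur;
    }
    return $best;
}

printf "%d\n", smith_waterman_score('a flagpole casts a shadow',
                                    'a flag pole casts a shadow');
```

Because the score never drops below zero, a long shared passage scores
well even when the surrounding text differs, which suits questions that
share a core sentence but vary in wording. The cost per comparison is
again proportional to the product of the two lengths.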


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 3018
***************************************
