[23843] in Perl-Users-Digest


home	help	back	first	fref	pref	prev	next	nref	lref	last	post
Perl-Users Digest, Issue: 6046 Volume: 10

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Jan 29 22:36:59 2004

Date: Thu, 29 Jan 2004 19:31:18 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Thu, 29 Jan 2004     Volume: 10 Number: 6046

Today's topics:
        SMTP/Bad File descripter problem (marx)
    Re: SMTP/Bad File descripter problem <emschwar@pobox.com>
    Re: SMTP/Bad File descripter problem <gnari@simnet.is>
    Re: SMTP/Bad File descripter problem <jwillmore@remove.adelphia.net>
        Sorry! 70352 s44@spam.mis.stu.edu.tw
        SpreadSheet::WriteExcel support string > 255 characters <wkwang@cisco.com>
    Re: SpreadSheet::WriteExcel support string > 255 charac <glex_nospam@qwest.invalid>
    Re: SQL insert/update duplicate entries <trammell+usenet@hypersloth.invalid>
    Re: SQL insert/update duplicate entries (Cheez)
    Re: SQL insert/update duplicate entries <tore@aursand.no>
    Re: SQL insert/update duplicate entries <jgibson@mail.arc.nasa.gov>
    Re: SQL insert/update duplicate entries (Cheez)
    Re: SQL insert/update duplicate entries ctcgag@hotmail.com
    Re: SQL insert/update duplicate entries (Cheez)
    Re: SQL insert/update duplicate entries <spamblock@nodomain.nodomain.us>
    Re: SQL insert/update duplicate entries ctcgag@hotmail.com
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: 26 Jan 2004 13:48:02 -0800
From: marx@idiom.com (marx)
Subject: SMTP/Bad File descripter problem
Message-Id: <afb5bbb9.0401261348.22b988e@posting.google.com>

I'm having difficulty sending e-mail via SMTP.  I keep getting the
following error.  Attached is my code and below that is the form, any
help would be much apprecieated.


Error: 
Bad file descriptor




#!c:\perl\bin -w
#
# Edit here          =
#
# Set only ONE of either $mail_path which should go to a 
# mail program like sendmail, or $smtp_server which will use
# a SMTP server. Do not set both at once!

#$mail_path='/usr/sbin/sendmail';  #example /use/lib/sendmail
$smtp_server='localhost';		 #example smtp.mydomain.com

#Email address (Who will be receiving completed forms)

$toemail='testemail@yahoo.com';

#Check access to script? If yes, please specify your domain as
mydomain.com
$domain='localhost';

#Collect email addresses in file? if yes, specify unix/nt path to file
where emails will be stored
$store='';

#Enable autoresponder?

$autoresponder=0;

#email message for responder

$aresp=qq~ Hello!

Thank you for form filling! I check it later and reply to you.

~;

#Default subject

$subject='Form submitted!';

#Default redirect

$redirect="http://my.yahoo.com";

#It all!
#See info about usage at readme.txt
################################################################################
$CRLF="\015\012"; 
eval { &main; };
if ($@) { error("Content-type: text/html\n\n <html>fatal error:
$@</html>"); }
sub main{
%in=&parse;
if ($smtp_server) {use Socket;$proto =
(getprotobyname('tcp'))[2];	$port  = getservbyname('smtp',
'tcp');	$smtp_server =~ s/^\s+//g;	$smtp_server =~
s/\s+$//g;	$smtpaddr = ($smtp_server =~
/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/) ?
pack('C4',$1,$2,$3,$4) : (gethostbyname($smtp_server))[4];	if
(!$smtpaddr) {die "SMTP host not found!";exit;}}

#Check for requied fields as $email, subject, message

if ($in{'fromemail'} !~ /.+@.+\..+/) {error("Invalid email address");}
if (!$in{'subject'}) {$in{'subject'} = $subject;}

#Check access if need

if ($domain) {
		if ($ENV{'HTTP_REFERER'} !~ /$domain/is) {
								error("Access forbidden by administrator");
							}
}

#Collect email address if need

if ($store) {
 	open (OUT, ">>$store") or error("Cannot write to file : $!");
	print OUT "$in{'fromemail'}\n";
	close OUT;
}



foreach $item (keys %in) {
$var="$item";
$val="$in{$item}";
if ($var =~ /\_val/is) {
	if ($val eq "") {$var=~ s#\_val##is;$var=~ s#^(.*?)\_##is;error("$var
cannot be empty!");}

		}
}



#Definition of variables

$fromemail=$in{'fromemail'};         #Sended email
$subject = $in{'subject'};

$url=$in{'redirect'}; # redirect if success

if (!$url) {$url=$redirect;}

$ccopy=$in{'cc'};     # carbon copy
$from=$in{'from'};    # From (name)

if ($ccopy) {
               if ($ccopy !~ /.+@.+\..+/) {error("Invalid CC email
address");}
}

$efrom=$fromemail;

if ($from) {$efrom="$from <$fromemail>";}


delete $in{'cc'}; delete $in{'from'};
delete $in{'redirect'};
delete $in{'fromemail'};

#Generation of Message

$etext=qq~Hello!
This is results of submitted form by $ENV{'REMOTE_ADDR'}.

Submitter Email : $efrom

Fields:
~;

foreach $key (sort keys %in) {
$etext.="$key = $in{$key}\n";
}
$etext.="\n\n";

#send email



if (!$smtp_server) {
 open (SM, "|$mail_path -t") or error("Error sending mail! Reason :
$!");
 print SM "From: ",     $efrom,    "\n";
 print SM "To: $toemail",      "\n";
 print SM "Cc: $ccopy\n", if ($ccopy);
 print SM "X-Mailer: Email manager (http://localhost/)\n";     
 print SM "Subject: ",  $subject, "\n\n";
 print SM $etext;
 close SM;

if ($autoresponder) {

 open (SM, "|$mail_path -t") or error("Error sending mail! Reason :
$!");
 print SM "From: ",     $toemail,    "\n";
 print SM "To: $efrom",      "\n";
 print SM "X-Mailer: Email manager (http://localhost/)\n";     
 print SM "Subject: ",  $subject, "\n\n";
 print SM $aresp;
 close SM;
}


}else{
init_smtp($smtp_server);
 socket(SOCK, AF_INET, SOCK_STREAM, $proto)  or error ("Socket
operation failed : Reason $!");
 connect(SOCK, pack('Sna4x8', AF_INET, $port, $smtpaddr)) or error
("Connection failed : Reason $!");
 $oldfh = select(SOCK);
 $| = 1; select($oldfh);$_ = <SOCK>;
 if (/^[45]/) { close SOCK; error("Service not available : Reason
$!"); }
 print SOCK "helo localhost$CRLF"; $_ = <SOCK>;
 if (/^[45]/) { close SOCK; error ("Communication error : Reason $!");
}
 print SOCK "mail from: <", $efrom, ">$CRLF"; $_ = <SOCK>; 
 if (/^[45]/) { close SOCK; error ("Communication error : $proto
$toemail $mail_path $smtp_server  Reason $!"); }
 foreach (split(/,/, $toemail)) {
  (/<(.*)>/) ? print SOCK "rcpt to: $1$CRLF" : print SOCK "rcpt to:
<$_>$CRLF";
  $_ = <SOCK>; if (/^[45]/) { close SOCK;error("Unknown user. Email
address not valid"); }
 }
 print SOCK "data$CRLF";$_ = <SOCK>; 
 if (/^[45]/) { close SOCK; error("Communication error : Reason $port
$!"); }
 print SOCK "To: $toemail",      $CRLF;
 print SOCK "From: $efrom",$CRLF;
 print SOCK "Cc: $ccopy\n", if ($ccopy);
 print SOCK "X-Mailer: Email manager (http://localhost/)$CRLF";
 print SOCK "Subject: $subject",$CRLF,$CRLF; 
 $message  =~ s/\r//g;
 $message  =~ s/\n/$CRLF/g;
 print SOCK $etext;
 print SOCK $CRLF, '.', $CRLF;$_ = <SOCK>; 
 if (/^[45]/) { close SOCK;error("Transfer failed : Reason $!"); }
 print SOCK "quit", $CRLF;$_ = <SOCK>;close SOCK;


if ($autoresponder) {

 socket(SOCK, AF_INET, SOCK_STREAM, $proto)  or error("Socket
operation failed : Reason $!");
 connect(SOCK, pack('Sna4x8', AF_INET, $port, $smtpaddr)) or
error("Connection failed : Reason $!");
 $oldfh = select(SOCK);
 $| = 1; select($oldfh);$_ = <SOCK>;
 if (/^[45]/) { close SOCK; error("Service not available : Reason
$!"); }
 print SOCK "helo localhost$CRLF"; $_ = <SOCK>;
 if (/^[45]/) { close SOCK; error("Communication error : Reason $!");
}
 print SOCK "mail from: <", $toemail, ">$CRLF"; $_ = <SOCK>; 
 if (/^[45]/) { close SOCK; error("Communication error : Reason $!");
}
 foreach (split(/,/, $efrom)) {
  (/<(.*)>/) ? print SOCK "rcpt to: $1$CRLF" : print SOCK "rcpt to:
<$_>$CRLF";
  $_ = <SOCK>; if (/^[45]/) { close SOCK;error("Unknown user. Email
address not valid"); }
 }
 print SOCK "data$CRLF";$_ = <SOCK>; 
 if (/^[45]/) { close SOCK; error("Communication error : Reason $!");
}
 print SOCK "To: $efrom",      $CRLF;
 print SOCK "From: $toemail",$CRLF;
 print SOCK "X-Mailer: Email manager (http://localhost/)$CRLF";
 print SOCK "Subject: $subject",$CRLF,$CRLF; 
 $aresp  =~ s/\r//g;
 $aresp  =~ s/\n/$CRLF/g;
 print SOCK $aresp;
 print SOCK $CRLF, '.', $CRLF;$_ = <SOCK>; 
 if (/^[45]/) { close SOCK;error("Transfer failed : Reason $!"); }
 print SOCK "quit", $CRLF;$_ = <SOCK>;close SOCK;


}


}




}

if ($url) {
	if (!$smtp_server) {
		print "Location: $url \n\n"; exit;
		}else{
		print "Content-type: text/html\n\n"; #for some servers.
		print qq~<html><META HTTP-EQUIV="Refresh" Content=
"0;	URL=$url"></html>~;exit;
		}
}
print "Content-type: text/html\n\n";

print qq~
<html><head><title>Email results</title></head>
<body>
<font face="Verdana">
<center><font size="3"><b>Email is sent</b><font
face="2"><br></center>
<br><br><br>
<small><center>(c)-2001, <a
href="localhost">localhost</a></center></font>

</body>
</html>
~;
exit;

sub error{
$mes=shift;
print qq~Content-type: text/html\n\n
<font face="Verdana" size="2"><center><b>Error</b> : $mes<br><br>
<a href="javascript:history.go(-1)">Back</a>.<br><br><br>
<small><a href="http://localhost">localhost</a>
~;
exit;
}
sub parse{my (@pairs, %in);my (@pairs, %in);my ($buffer, $pair, $name,
$value);if ($ENV{'REQUEST_METHOD'} eq 'GET') {@pairs = split(/&/,
$ENV{'QUERY_STRING'});}elsif($ENV{'REQUEST_METHOD'} eq 'POST')
{read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});@pairs = split(/&/,
$buffer);}PAIR: foreach $pair (@pairs) {($name, $value) = split(/=/,
$pair);$name =~ tr/+/ /;$name =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C",
hex($1))/eg;$value =~ tr/+/ /;$value =~
s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;($value eq "---")
and next PAIR;exists $in{$name} ? ($in{$name} .= "~~$value") :
($in{$name}  = $value);}return %in;}

sub init_smtp{$smtp_server=shift;use Socket;$proto =
(getprotobyname('tcp'))[2];$port  = getservbyname('smtp',
'tcp');$smtp_server =~ s/^\s+//g;$smtp_server =~ s/\s+$//g;$smtpaddr =
($smtp_server =~ /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/) ?
pack('C4',$1,$2,$3,$4) : (gethostbyname($smtp_server))[4];	if
(!$smtpaddr) {return "SMTP host not found!";}return "ok";}

#form 
<form method="post" action="/mail/formmail/formmail.pl">
  
  <tr>
    <td width="32%"><font size="2" face="Verdana">Your
name</font></td>
    <td width="68%"><input type="text" name="from" size="20"></td>
  </tr>
  <tr>
    <td width="32%"><font size="2" face="Verdana">Your
email</font></td>
    <td width="68%"><input type="text" name="fromemail"
size="20"></td>
  </tr>
  <tr>
    <td width="32%"><font size="2" face="Verdana">Subject</font></td>
    <td width="68%"><select size="1" name="subject">
        <option selected>Please contact me tomorrow</option>
        <option>Please contact me today</option>
      </select></td>
  </tr>
  <tr>
    <td width="100%" colspan="2">
      <p align="center"><font size="2"
face="Verdana">Fields</font></td>
  </tr>
  <tr>
    <td width="50%"><font size="2" face="Verdana">Age</font></td>
    <td width="50%"><input type="text" name="01_AGE_val"
size="20"></td>
  </tr>
  <tr>
    <td width="50%"><font size="2" face="Verdana">Town</font></td>
    <td width="50%"><input type="text" name="02_Town" size="20"></td>
  </tr>
  <tr>
    <td width="50%"><font size="2" face="Verdana">Income</font></td>
    <td width="50%"><select size="1" name="03_Income">
        <option>Under $1000 / month</option>
        <option>$1000 - 2000 per month</option>
        <option>$2000 - $3000 per month</option>
        <option>$3000 and more per month</option>
        <option selected>Not specifed</option>
      </select></td>
  </tr>
  <tr>
    <td width="50%"><font size="2" face="Verdana">Interesting in our
services?</font></td>
    <td width="50%"><select size="1"
name="04_Interested_in_our_services">
        <option selected>Yes</option>
        <option>No</option>
      </select></td>
  </tr>
  <tr>
    <td width="50%"><font size="2" face="Verdana">Country</font></td>
    <td width="50%"><input type="text" name="05_Country_val"
size="20"></td>
  </tr>
  <tr>
    <td width="50%"></td>
    <td width="50%"><input type="submit" value="Send it!"
name="B1"></td>
  </tr>
</form>


------------------------------

Date: Mon, 26 Jan 2004 15:00:12 -0700
From: Eric Schwartz <emschwar@pobox.com>
Subject: Re: SMTP/Bad File descripter problem
Message-Id: <eto7jze1ic3.fsf@fc.hp.com>

marx@idiom.com (marx) writes:
> I'm having difficulty sending e-mail via SMTP.  I keep getting the
> following error.  Attached is my code and below that is the form, any
> help would be much apprecieated.
>
>
> Error: 
> Bad file descriptor

<snip ensuing 366 lines>

Your bug is that you're not using Net::SMTP.  It is cross-platform, so
should work on any system you can run Perl on.  I believe it's bundled
with Perl, but if not, I know you can get it on CPAN.

Your problem, which is separate from your bug, is that you either
haven't read the Posting Guidelines, which are posted regularly here,
and archived on groups.google.com, or you don't know how to post a
*small* (20 lines or so should do the trick) complete, runnable
program that exhibits the problem.

Though clpm has a bad rap for it, people here really do like to help
others out.  But nobody wants to wade through an excessively long
script like that for such a vaguely-described problem.

-=Eric
-- 
Come to think of it, there are already a million monkeys on a million
typewriters, and Usenet is NOTHING like Shakespeare.
		-- Blair Houghton.


------------------------------

Date: Mon, 26 Jan 2004 22:37:43 -0000
From: "gnari" <gnari@simnet.is>
Subject: Re: SMTP/Bad File descripter problem
Message-Id: <bv451a$k1f$1@news.simnet.is>

"marx" <marx@idiom.com> wrote in message
news:afb5bbb9.0401261348.22b988e@posting.google.com...
[snipped perl4 cgi/smtp script
that does not use CGI, Net:SMTP, nor strict]

there is everything wrong with your post

you probably are not running a SMTP server either.

simplify your problem by cutting out everything that does not have
to do with SMTP. replace the rest with something that uses Net::SMTP.
fix your indenting in that, and add use strict.
if it compiles, but still does not run, post it here with any new info,
like error line, and so on.

gnari






------------------------------

Date: Tue, 27 Jan 2004 00:35:32 -0500
From: James Willmore <jwillmore@remove.adelphia.net>
Subject: Re: SMTP/Bad File descripter problem
Message-Id: <pan.2004.01.27.05.35.31.176721@remove.adelphia.net>

On Mon, 26 Jan 2004 13:48:02 -0800, marx wrote:

> I'm having difficulty sending e-mail via SMTP.  I keep getting the
> following error.  Attached is my code and below that is the form, any
> help would be much apprecieated.
> 
> 
> Error: 
> Bad file descriptor

<snip>

In short - this is a mess.

Let me offer a few suggestions:
1) enable strict and warnings
2) since you appear to want to run this as a CGI script, enable taint
checking as well
3) for the love of everything, please use the CGI module

Finally - *please* start reading up on how to write CGI scripts.  Visit
http://www.w3.org/Security/Faq/ for how to write secure CGI scripts.  And
use Google to search USENET (aka Google Groups) for writing CGI scripts.

HTH

-- 
Jim

Copyright notice: all code written by the author in this post is
 released under the GPL. http://www.gnu.org/licenses/gpl.txt 
for more information.

a fortune quote ...
Ask five economists and you'll get five different explanations
<(six if one went to Harvard).   -- Edgar R. Fiedler 



------------------------------

Date: Tuesday, 27 Jan 2004 15:56:27 -0600
From: s44@spam.mis.stu.edu.tw
Subject: Sorry! 70352
Message-Id: <27010415.5627@spam.mis.stu.edu.tw>

Sorry sir!
     I'm a student now.I have a studying about Spam Mail issuse case.Therefore I need more and more spam mail for this study.

s1@spam.mis.stu.edu.tw
s2@spam.mis.stu.edu.tw
s3@spam.mis.stu.edu.tw
s4@spam.mis.stu.edu.tw
s5@spam.mis.stu.edu.tw
s6@spam.mis.stu.edu.tw
s7@spam.mis.stu.edu.tw
s8@spam.mis.stu.edu.tw
s9@spam.mis.stu.edu.tw
s10@spam.mis.stu.edu.tw
s11@spam.mis.stu.edu.tw
s12@spam.mis.stu.edu.tw
s13@spam.mis.stu.edu.tw
s14@spam.mis.stu.edu.tw
s15@spam.mis.stu.edu.tw
s16@spam.mis.stu.edu.tw
s17@spam.mis.stu.edu.tw
s18@spam.mis.stu.edu.tw
s19@spam.mis.stu.edu.tw
s20@spam.mis.stu.edu.tw
s21@spam.mis.stu.edu.tw
s22@spam.mis.stu.edu.tw
s23@spam.mis.stu.edu.tw
s24@spam.mis.stu.edu.tw
s25@spam.mis.stu.edu.tw
s26@spam.mis.stu.edu.tw
s27@spam.mis.stu.edu.tw
s28@spam.mis.stu.edu.tw
s29@spam.mis.stu.edu.tw
s30@spam.mis.stu.edu.tw
s31@spam.mis.stu.edu.tw
s32@spam.mis.stu.edu.tw
s33@spam.mis.stu.edu.tw
s34@spam.mis.stu.edu.tw
s35@spam.mis.stu.edu.tw
s36@spam.mis.stu.edu.tw
s37@spam.mis.stu.edu.tw
s38@spam.mis.stu.edu.tw
s39@spam.mis.stu.edu.tw
s40@spam.mis.stu.edu.tw
s41@spam.mis.stu.edu.tw
s42@spam.mis.stu.edu.tw
s43@spam.mis.stu.edu.tw
s44@spam.mis.stu.edu.tw
s45@spam.mis.stu.edu.tw
s46@spam.mis.stu.edu.tw
s47@spam.mis.stu.edu.tw
s48@spam.mis.stu.edu.tw
s49@spam.mis.stu.edu.tw
s50@spam.mis.stu.edu.tw



------------------------------

Date: Thu, 29 Jan 2004 16:26:28 -0500
From: "Peter Wang" <wkwang@cisco.com>
Subject: SpreadSheet::WriteExcel support string > 255 characters
Message-Id: <1075411739.874109@sj-nntpcache-3>

Hi All,

Do you which version of SpreadSheet::WriteExcel module could support a
string more than 255 characters?
Could you please provide the link if you know this?


Thanks
Peter




------------------------------

Date: Thu, 29 Jan 2004 16:13:04 -0600
From: "J. Gleixner" <glex_nospam@qwest.invalid>
Subject: Re: SpreadSheet::WriteExcel support string > 255 characters
Message-Id: <RzfSb.54$6W6.78678@news.uswest.net>

Peter Wang wrote:
> Hi All,
> 
> Do you which version of SpreadSheet::WriteExcel module could support a
> string more than 255 characters?
> Could you please provide the link if you know this?

Spent about a minute looking at the documentation for this module, on 
http://search.cpan.org, in the TO DO area for this module, there's a 
link to a version on freshmeat.net that supports > 255.

Pretty easy to find it, once you read the documentation. :-)



------------------------------

Date: Wed, 21 Jan 2004 14:05:28 +0000 (UTC)
From: "John J. Trammell" <trammell+usenet@hypersloth.invalid>
Subject: Re: SQL insert/update duplicate entries
Message-Id: <slrnc0t1p8.bde.trammell+usenet@hypersloth.el-swifto.com.invalid>

On 20 Jan 2004 21:02:57 -0800, Cheez <cheez2112@yahoo.com> wrote:
> Hi there, I am using a pretty basic Perl script to read data from a
> file into a mysql table. I have no provisions in my SQL statement to
> deal with duplicate entries.  In this situation, I want to increment a
> frequency counter.

You might want to use the "REPLACE INTO" statement.  Hard to say
from what you posted.



------------------------------

Date: 21 Jan 2004 16:47:50 -0800
From: cheez2112@yahoo.com (Cheez)
Subject: Re: SQL insert/update duplicate entries
Message-Id: <1e85f7c8.0401211647.1171e595@posting.google.com>

"John J. Trammell" <trammell+usenet@hypersloth.invalid> wrote in message news:<slrnc0t1p8.bde.trammell+usenet@hypersloth.el-swifto.com.invalid>...
> On 20 Jan 2004 21:02:57 -0800, Cheez <cheez2112@yahoo.com> wrote: [snip]
>
> You might want to use the "REPLACE INTO" statement.  Hard to say
> from what you posted.

Howdy, 

I use a fairly basic Perl script to parse 16 flatfiles.  For each
parse I create a tab delimited dataset made up of a 7 digit
identifier, followed by <tab>, followed by a 10-character sequence. 
For instance:

6001456     GHIOGLKJNJ
6001234     SDRQIOPMNB
6310098     CCVNEEQPLL
 ...(100,000 lines long)

I then INSERT this data into a mySQL table using the script from my
original post.  My 4 fields in the table are id, sequence, frequency,
identifier.  Upon duplicate sequence entries, I wanted to use an SQL
command to increment the frequency counter by one and also concatanate
the existing identifier string with the identifier.  I could have a
table that looks like:

id    sequence       frequency       identifier
1     GHIOGLKJNJ     1               6001456
2     CCVNEEQPLL     3               6310098, 7890123, 4567812

As it stands right now, when I do a simple INSERT, I do not check for
duplicates and they simply get inserted with a unique id.

A previous poster had suggested performing a SELECT query for each
line of my dataset to check for duplicates.  'IF duplicates exist
UPDATE frequency++, ELSE INSERT new data' will be my approach.

How long will that take?  Right now, to INSERT this data without
checking for duplicates takes about 30sec/100,000 records.  Performing
a SELECT I imagine is going to be a time consuming process?

As it stands right now, duplicates and all, my table has 3.5 million
records.  I bet 20% are duplicates.

I find Perl to be such a liberating language (compared to Java for
instance).  I have both the skill and enthusiasm of a newbie.  Thanks
for bearing with me if this whole post comes across a bit basic.

Cheers,
Cheez


------------------------------

Date: Thu, 22 Jan 2004 10:25:45 +0100
From: Tore Aursand <tore@aursand.no>
Subject: Re: SQL insert/update duplicate entries
Message-Id: <pan.2004.01.22.08.57.57.165961@aursand.no>

On Wed, 21 Jan 2004 16:47:50 -0800, Cheez wrote:
> I use a fairly basic Perl script to parse 16 flatfiles.  For each
> parse I create a tab delimited dataset made up of a 7 digit
> identifier, followed by <tab>, followed by a 10-character sequence. 
> For instance:
> 
> 6001456     GHIOGLKJNJ
> 6001234     SDRQIOPMNB
> 6310098     CCVNEEQPLL
> ...(100,000 lines long)
> 
> I then INSERT this data into a mySQL table using the script from my
> original post.  My 4 fields in the table are id, sequence, frequency,
> identifier.  Upon duplicate sequence entries, I wanted to use an SQL
> command to increment the frequency counter by one and also concatanate
> the existing identifier string with the identifier.  I could have a
> table that looks like:
> 
> id    sequence       frequency       identifier
> 1     GHIOGLKJNJ     1               6001456
> 2     CCVNEEQPLL     3               6310098, 7890123, 4567812

Do you really save the data in the 'identifier' as a string separated with
comma?  That means you have to parse it _again_ when you SELECT from the
table, you know.  Not the best solution I would say.

Personally, I would have used more than one table for storing the data,
and I'm not really sure that I would have had the 'frequency' field stored
in any of them either;

  CREATE TABLE sequences (
      id          int unsigned auto_increment primary key,
      sequence    varchar(10) not null,

      INDEX(sequence)
  );

  CREATE TABLE sequence_identifier (
      id          int unsigned not null,
      identifier  int unsigned not null,

      INDEX(identifier)
  );

Inserting the data in the database would proceed something like this:

  my $stInsSequence = $dbh->prepare('INSERT INTO sequence (sequence)
                                     VALUES (?)' );
  my $stInsSeqIdent = $dbh->prepare('INSERT INTO sequence_identifier (id, identifier)
                                     VALUES (?, ?)' );

  my %lookup = ();
  while ( <DATA> ) {
      chomp;
      my ( $identifier, $sequence ) = split( /\t+/, $_, 2 );
      my $id = $lookup{$sequence} || 0;
      unless ( $id > 0 ) {
          $stInsSequence->execute( $sequence );
          $id = $stInsSequence->{'mysql_insertid'};
          $lookup{$sequence} = $id;
      }
      $stInsSeqIdent->execute( $id, $identifier );
  }

  $stInsSeqIdent->finish();
  $stInsSequence->finish();

If I read your specification correctly, this should be all you need.  A
note on the %lookup hash, though:  It's intended for caching the id for
each sequence, thus it can grow quite large if you have a lot of data.

To get a list of all the identifiers for a specified sequence:

  my $stSeqIdent = $dbh->prepare('SELECT si.identifier
                                  FROM sequences s, sequence_identifier si
                                  WHERE s.sequence = ? AND s.id = si.id');
  $stSeqIdent->execute( $sequence );
  while ( my ( $identifier ) = $stSeqIdent->fetchrow_array() ) {
      print $identifier . "\n";
  }
  $stSeqIdent->finish();

To get frequence for each sequence, you just count the occurrances of it
in the 'sequence_identifier' table, or you could list all and group them
by the sequence.

> A previous poster had suggested performing a SELECT query for each
> line of my dataset to check for duplicates.  'IF duplicates exist
> UPDATE frequency++, ELSE INSERT new data' will be my approach.

With the approach described above (totally untested, by the way), you
don't need to check for that as long as the %lookup hash will fit into
memory.  If the %lookup hash grows too big, you will need the lookup.

> How long will that take?

Try it. :)


-- 
Tore Aursand <tore@aursand.no>
"First, God created idiots. That was just for practice. Then He created
 school boards." -- Mark Twain


------------------------------

Date: Thu, 22 Jan 2004 10:40:25 -0800
From: Jim Gibson <jgibson@mail.arc.nasa.gov>
Subject: Re: SQL insert/update duplicate entries
Message-Id: <220120041040258942%jgibson@mail.arc.nasa.gov>

In article <1e85f7c8.0401211647.1171e595@posting.google.com>, Cheez
<cheez2112@yahoo.com> wrote:

> "John J. Trammell" <trammell+usenet@hypersloth.invalid> wrote in message
> news:<slrnc0t1p8.bde.trammell+usenet@hypersloth.el-swifto.com.invalid>...
> > On 20 Jan 2004 21:02:57 -0800, Cheez <cheez2112@yahoo.com> wrote: [snip]
> >
> > You might want to use the "REPLACE INTO" statement.  Hard to say
> > from what you posted.
> 
> Howdy, 
> 
> I use a fairly basic Perl script to parse 16 flatfiles.  For each
> parse I create a tab delimited dataset made up of a 7 digit
> identifier, followed by <tab>, followed by a 10-character sequence. 
> For instance:
> 
> 6001456     GHIOGLKJNJ
> 6001234     SDRQIOPMNB
> 6310098     CCVNEEQPLL
> ...(100,000 lines long)
> 
> I then INSERT this data into a mySQL table using the script from my
> original post.  My 4 fields in the table are id, sequence, frequency,
> identifier.  Upon duplicate sequence entries, I wanted to use an SQL
> command to increment the frequency counter by one and also concatanate
> the existing identifier string with the identifier.  I could have a
> table that looks like:
> 
> id    sequence       frequency       identifier
> 1     GHIOGLKJNJ     1               6001456
> 2     CCVNEEQPLL     3               6310098, 7890123, 4567812
> 

[ SQL discussion snipped ]

Why don't you do the counting and concatenating of the identifier field
in Perl? (Provided, of course, that your data will all fit in memory.)
It if does, it will definitely be faster that doing it in SQL. After
reading and parsing all 16 files, you can then insert the results into
the database.


------------------------------

Date: 22 Jan 2004 11:15:07 -0800
From: cheez2112@yahoo.com (Cheez)
Subject: Re: SQL insert/update duplicate entries
Message-Id: <1e85f7c8.0401221115.4b47c9f6@posting.google.com>

Tore Aursand <tore@aursand.no> wrote in message news:<pan.2004.01.22.08.57.57.165961@aursand.no>...
> On Wed, 21 Jan 2004 16:47:50 -0800, Cheez wrote: [snip]

> To get frequence for each sequence, you just count the occurrances of it
> in the 'sequence_identifier' table, or you could list all and group them
> by the sequence.


In general, wow, thanks for the code.  I did end up making up a script
posted here:

http://tinyurl.com/ywphd

Your approach doesn't include the incredibly inefficient step of doing
a SELECT preceeding a INSERT or UPDATE.  However, the frequency value
is the one that I am using to identify sequences of interest.  I don't
know 'a priori' which sequences are of interest to me therefor I want
to sort by frequency number not by sequence.  I did experiment with
your strategy, knowing a particular sequence I can SELECT * FROM table
WHERE sequence LIKE '%XXXXXXXX%'.  This will return all records and of
course, by virtue of the number of these records I will know
frequency.  Again, I do not know which of these 3 million sequences is
the interesting one for me.  That's why the added step of doing an
intial SELECT.

I suppose I could make a Perl script to anaylze the table, with
duplicates,  after I populate it.  Undoubtedly I would have to do this
by creating a huge hash or array.   Not sure if I can do this with 3
million records.
Thanks again for your comments and the rest of the replies on this
post.

Cheers,
Cheez


------------------------------

Date: 22 Jan 2004 20:59:15 GMT
From: ctcgag@hotmail.com
Subject: Re: SQL insert/update duplicate entries
Message-Id: <20040122155915.272$23@newsreader.com>

cheez2112@yahoo.com (Cheez) wrote:
>
> In general, wow, thanks for the code.  I did end up making up a script
> posted here:
>
> http://tinyurl.com/ywphd
>
> Your approach doesn't include the incredibly inefficient step of doing
> a SELECT preceeding a INSERT or UPDATE.

If that is incredibly inefficient, you apparently aren't using indices.
If you aren't using indices, what's the point of using mySQL for this
in the first place?


> However, the frequency value
> is the one that I am using to identify sequences of interest.  I don't
> know 'a priori' which sequences are of interest to me therefor I want
> to sort by frequency number not by sequence.

I'd consider forgetting the identifiers for the moment.  Just count
the frequencies, and once you now which sequences are interesting, you
can go back re-parse the files to get the identifiers of just those
sequences.


> I did experiment with
> your strategy, knowing a particular sequence I can SELECT * FROM table
> WHERE sequence LIKE '%XXXXXXXX%'.  This will return all records and of
> course, by virtue of the number of these records I will know
> frequency.  Again, I do not know which of these 3 million sequences is
> the interesting one for me.  That's why the added step of doing an
> intial SELECT.

select sequence, count(*) from table group by sequence.

>
> I suppose I could make a Perl script to anaylze the table, with
> duplicates,  after I populate it.  Undoubtedly I would have to do this
> by creating a huge hash or array.

If I were going to do it this way, I wouldn't bother to put into the
database table in the first place.  Just analyze it in Perl directly
from the text files.

Or I'd just do it from the linux command line:
"cat all_my_files|cut -f2|sort|uniq -c|sort -n > out"

> Not sure if I can do this with 3
> million records.

I can fit a 3 million member hash (10 char key, 8 char value) in ~300M
memory, or in ~220M if I squeeze (value is integer count only)

Xho

-- 
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service              New Rate! $9.95/Month 50GB


------------------------------

Date: 22 Jan 2004 22:24:15 -0800
From: cheez2112@yahoo.com (Cheez)
Subject: Re: SQL insert/update duplicate entries
Message-Id: <1e85f7c8.0401222224.6f3512b4@posting.google.com>

ctcgag@hotmail.com wrote in message news:<20040122155915.272$23@newsreader.com>...
> cheez2112@yahoo.com (Cheez) wrote:
> >
> > In general, wow, thanks for the code.  I did end up making up a script
> > posted here:
> >
> > http://tinyurl.com/ywphd
> >
> > Your approach doesn't include the incredibly inefficient step of doing
> > a SELECT preceeding a INSERT or UPDATE.
> 
> If that is incredibly inefficient, you apparently aren't using indices.
> If you aren't using indices, what's the point of using mySQL for this
> in the first place?

My 'sequence' field is a primary index in this particular table. I
could very well be misinterpreting your question.
> 
> > However, the frequency value
> > is the one that I am using to identify sequences of interest.  I don't
> > know 'a priori' which sequences are of interest to me therefor I want
> > to sort by frequency number not by sequence.
> 
> I'd consider forgetting the identifiers for the moment.  Just count
> the frequencies, and once you now which sequences are interesting, you
> can go back re-parse the files to get the identifiers of just those
> sequences.

Excellent idea...thx.
 
> Or I'd just do it from the linux command line:
> "cat all_my_files|cut -f2|sort|uniq -c|sort -n > out"

I am literally using 100s of lines of code for this task!  
 
> > Not sure if I can do this with 3
> > million records.
> 
> I can fit a 3 million member hash (10 char key, 8 char value) in ~300M
> memory, or in ~220M if I squeeze (value is integer count only)

I'll give it shot.  Thanks for the comments.

Cheers,
Cheez


------------------------------

Date: Fri, 23 Jan 2004 17:09:05 +0000
From: Bruce Horrocks <spamblock@nodomain.nodomain.us>
Subject: Re: SQL insert/update duplicate entries
Message-Id: <p13TudFxUVEAFwNg@nodomain.nodomain.us>

In message <1e85f7c8.0401202102.28c33407@posting.google.com>, Cheez 
<cheez2112@yahoo.com> writes
># I would like to do this (but doesn't work):
>
>     INSERT INTO mytable (wherefound, sequence) VALUES (?,?)
>         -> ON DUPLICATE KEY UPDATE frequency=frequency+1

Have you tried:

INSERT INTO mytable (wherefound, sequence, frequency)
VALUES (?, ?, 1)
-> ON DUPLICATE KEY UPDATE frequency = frequency + 1;

The difference being that you've explicitly told it about the frequency 
column.

You also need to uniquely index mytable on sequence for the on duplicate 
statement to work.

Regards,
-- 
Bruce Horrocks
Surrey
England
<firstname>@<surname>.plus.com -- fix the obvious for email


------------------------------

Date: 24 Jan 2004 00:19:54 GMT
From: ctcgag@hotmail.com
Subject: Re: SQL insert/update duplicate entries
Message-Id: <20040123191954.307$Nt@newsreader.com>

cheez2112@yahoo.com (Cheez) wrote:
> ctcgag@hotmail.com wrote in message
> news:<20040122155915.272$23@newsreader.com>...
> > cheez2112@yahoo.com (Cheez) wrote:
> > >
> > > In general, wow, thanks for the code.  I did end up making up a
> > > script posted here:
> > >
> > > http://tinyurl.com/ywphd
> > >
> > > Your approach doesn't include the incredibly inefficient step of
> > > doing a SELECT preceeding a INSERT or UPDATE.
> >
> > If that is incredibly inefficient, you apparently aren't using indices.
> > If you aren't using indices, what's the point of using mySQL for this
> > in the first place?
>
> My 'sequence' field is a primary index in this particular table. I
> could very well be misinterpreting your question.

Are you absolutely sure that it is a primary key?  You reported about 20,
000 rows processed per hour, I processed 3 million rows 33 minutes. (The
first 20,000 took 7 seconds.  The last 20,000 took 27 seconds).  Unless you
are running a 33.3MHz machine, something seems very wrong.

Xho


use DBI ;
use strict;
$|++;

my $dbh = DBI->connect("DBI:mysql:x:localhost","me",
          "meme", { RaiseError => 1 })  ;

$dbh->do('drop table if exists xx_test');
$dbh->do('create table xx_test (sequence varchar(20) primary key, freq int,
identifiers text)');

my $check=$dbh->prepare('select identifiers from xx_test where
sequence=?'); my $insert=$dbh->prepare('insert into xx_test
        (sequence, freq, identifiers) values (?,1,?)');
my $update=$dbh->prepare('update xx_test set freq=freq+1, identifiers=
         concat(identifiers,",",?) where sequence=?');

timer();

foreach (1..3_000_000) {
  my $seq = join "", map {chr(65+rand(20))} 1..10;
  if (rand()< 0.01) { #force some duplicates to occur
    push @x,$seq unless @x>5000;
    $seq=$x[rand(@x)] if @x>500;
  };
  my $id=rand(1000000);

  $check->execute($seq);
  if (my $ident=$check->fetchrow_array()) {
    $update->execute($id,$seq);
  } else {
     $insert->execute($seq,$id);
  };
  print "$_\t",timer(),"\n" unless $_%10000;
};

-- 
-------------------- http://NewsReader.Com/ --------------------
Usenet Newsgroup Service              New Rate! $9.95/Month 50GB


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

#The Perl-Users Digest is a retransmission of the USENET newsgroup
#comp.lang.perl.misc.  For subscription or unsubscription requests, send
#the single line:
#
#	subscribe perl-users
#or:
#	unsubscribe perl-users
#
#to almanac@ruby.oce.orst.edu.  

NOTE: due to the current flood of worm email banging on ruby, the smtp
server on ruby has been shut off until further notice. 

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

#To request back copies (available for a week or so), send your request
#to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
#where x is the volume number and y is the issue number.

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V10 Issue 6046
***************************************

home	help	back	first	fref	pref	prev	next	nref	lref	last	post
[23843] in Perl-Users-Digest

Perl-Users Digest, Issue: 6046 Volume: 10

daemon@ATHENA.MIT.EDU (Perl-Users Digest)Thu Jan 29 22:36:59 2004

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Jan 29 22:36:59 2004