[13411] in Perl-Users-Digest

Perl-Users Digest, Issue: 821 Volume: 9

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Sep 16 16:07:18 1999

Date: Thu, 16 Sep 1999 13:05:14 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Message-Id: <937512313-v9-i821@ruby.oce.orst.edu>
Content-Type: text

Perl-Users Digest           Thu, 16 Sep 1999     Volume: 9 Number: 821

Today's topics:
    Re: ? about bare word. <aqumsieh@matrox.com>
    Re: CGI cannot open relative path <ehpoole@ingress.com>
    Re: Diff of string arrays. <makkulka@cisco.com>
    Re: Encrypting (and decrypting) password <ehpoole@ingress.com>
    Re: Help with a regular expression <koharik@primenet.com>
    Re: How can I know if string have points? <makkulka@cisco.com>
    Re: How to assign filehandle to scalar? (Kragen Sitaker)
    Re: How to assign filehandle to scalar? <tchrist@mox.perl.com>
    Re: How to match with m/$variable/ ?? (Michael Friendly)
    Re: How to match with m/$variable/ ?? (Kragen Sitaker)
    Re: IIS 3.0 - CGI - @INC - problem finding libraries <ehpoole@ingress.com>
        Labelling warns and dies <brundlefly76@hotmail.com>
    Re: need to write www search engine <uri@sysarch.com>
    Re: need to write www search engine (Randal L. Schwartz)
    Re: PERL (cgi) and Databases -> How To? (Kragen Sitaker)
    Re: PERL (cgi) and Databases -> How To? <aqumsieh@matrox.com>
        perl prog for big/little indian conversion <nwsread@cloudband.com>
    Re: Some e-mails get sent, some don't (Randal L. Schwartz)
    Re: testing data types <makkulka@cisco.com>
    Re: testing data types <sariq@texas.net>
    Re: trimming spaces from a string (Greg Bacon)
    Re: Where do I get perl2exe for Win32? <tbornhol@prioritytech.com>
    Re: Where do I get perl2exe for Win32? <makkulka@cisco.com>
        Digest Administrivia (Last modified: 1 Jul 99) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Thu, 16 Sep 1999 13:28:18 -0400
From: Ala Qumsieh <aqumsieh@matrox.com>
Subject: Re: ? about bare word.
Message-Id: <x3y9066hikt.fsf@tigre.matrox.com>


scmpoper <scmpoper@scmp.com> writes:

> Does anyone know what's wrong with the code below:
> print "\${\"value\"} is ${"value"}\n";
                            ^     ^
                            ^     ^
                            ^     ^

> With -w option, found error message:
> Missing right bracket at cd-ref line 25, within string
> syntax error at cd-ref line 25, at EOF
> Execution of cd-ref aborted due to compilation errors.

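The unescaped double quotes in ${"value"} sit inside a double-quoted
string, so the first one ends the string early. Have a look at qq//,
which lets you choose a different delimiter.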
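
A minimal sketch of the qq// approach, assuming $value is a package
variable (a symbolic reference such as ${"value"} cannot see a "my"
variable) and that strict refs is switched off for that one statement.
The variable and its contents are made up for illustration:

```perl
use strict;
use warnings;

our $value = 42;    # hypothetical package variable for illustration

my $line;
{
    no strict 'refs';    # ${"value"} is a symbolic reference
    # qq() delimits the string with parentheses, so the inner
    # double quotes no longer terminate it.
    $line = qq(\${"value"} is ${"value"}\n);
}
print $line;
```

Any delimiter that does not occur in the string would do equally well;
parentheses are just one choice.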

--Ala



------------------------------

Date: Thu, 16 Sep 1999 15:29:17 -0400
From: "Ethan H. Poole" <ehpoole@ingress.com>
Subject: Re: CGI cannot open relative path
Message-Id: <37E1450D.DEC6F09C@ingress.com>

Randal L. Schwartz wrote:
> 
> >>>>> "Ethan" == Ethan H Poole <ehpoole@ingress.com> writes:
> 
> Ethan> In many instances, you can calculate the base directory of your
> Ethan> executing script by examining $0 ($ zero).
> 
> This is merely a hint.  It's user-defined data, and so the moment
> you trust it, you will lose when someone trivially defeats that.

Actually it is user *definable* not user-defined.  At startup it contains
the name of the script.

Yes, the user *can* change it (on some platforms only).  However, in most
instances the 'user' is the script and the programmer has control over
that possibility.  If Perl later decides to change the behavior of $0, the
inclusion of a user-defined override path provides continued
compatibility.

Most scripts do not modify the data in $0 unless doing so would actually
provide useful status information.

The original poster was looking for some way to determine which directory
their current script was executing in.  There really isn't any other
way to obtain this information transparently, since cwd() from the Cwd
module only returns the current working directory, which is by no means
guaranteed to be the directory the script is actually located in.  It
would be my hope that a programmer using $0 would at least be smart enough
not to destroy it before obtaining the needed information.  If they aren't
that smart, they, too, deserve to shoot themselves in the foot.
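
A rough sketch of that idea, assuming $0 still holds the script's path
when this code runs (which is exactly the caveat raised above). The file
name data.txt is a placeholder, not something from the thread:

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Spec;

# Capture the script's directory first, before anything modifies $0.
my $script_dir = dirname(File::Spec->rel2abs($0));

# Hypothetical data file that lives beside the script.
my $data_file = File::Spec->catfile($script_dir, 'data.txt');

print "script directory: $script_dir\n";
print "would open:       $data_file\n";
```

Treat the result as a hint rather than a guarantee, for the reasons
discussed in this thread.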

Just my $0.02 on the subject.

-- 
Ethan H. Poole           ****   BUSINESS   ****
ehpoole@ingress.com      ==Interact2Day, Inc.==
(personal)               http://www.interact2day.com/


------------------------------

Date: Thu, 16 Sep 1999 12:36:23 -0700
From: Makarand Kulkarni <makkulka@cisco.com>
Subject: Re: Diff of string arrays.
Message-Id: <37E146B7.FAFE4A7B@cisco.com>

Andy Cragg wrote:

> Anyone know how to do a quick comparison between two arrays (of strings) and
> return the differences in another array?  Perhaps using cmp?

This is in perlfaq4, under the question: "How do I compute the difference
of two arrays? How do I compute the intersection of two arrays?"
You cannot use cmp here, since it compares two scalars, not two lists.
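
For illustration only, here is one common hash-based way to get the
differences. It is a sketch in the spirit of the FAQ answer, not a copy
of it, and it assumes no string is repeated within an array. The sample
data is invented:

```perl
use strict;
use warnings;

my @old = qw(apple banana cherry);    # sample data, not from the thread
my @new = qw(banana cherry date);

# Record which strings occur in each array.
my %in_old = map { $_ => 1 } @old;
my %in_new = map { $_ => 1 } @new;

# Elements present in one array but not the other.
my @only_old = grep { !$in_new{$_} } @old;
my @only_new = grep { !$in_old{$_} } @new;

print "only in old: @only_old\n";    # apple
print "only in new: @only_new\n";    # date
```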
--



------------------------------

Date: Thu, 16 Sep 1999 15:46:43 -0400
From: "Ethan H. Poole" <ehpoole@ingress.com>
Subject: Re: Encrypting (and decrypting) password
Message-Id: <37E14923.DE1C6921@ingress.com>

Steve Button wrote:
> 
> Hello,
> I'm trying to encrypt a password that a user types in, so that I can pass it
> from program to program without it being intercepted in between. I can see
> that there are lots of modules on CPAN that can sort of do things like this,
> but I'm not sure of the best thing to use. I had a look at crypt, but that
> is something quite different (may be useful for other things though, as I'm
> also a Unix sysadm).
> 
> I confess that it is for a CGI program, so that I can pass the password as
> part of the URL (is that POST or GET, I can never remember*) when someone
> has "logged in" to my web site. I know that this is not ideal, but we're not
> dealing with money or credit card numbers or anything (yet) I just don't
> want people messing around with other people's adverts.
> 
> Any pointers, help would be much appreciated. If you could reply to me
> directly, that would be nice as I don't often get time to check usenet (got
> a lot of real work to do :-(  ) but please post to the group too.

Ideally, you should be letting the server and client handle the encryption
using SSL, otherwise your initial login presents a weak link.

For simple encryption (and there isn't much point to anything more complex
if you are still passing everything cleartext including the initial
login), you could use XOR encryption, making certain that the key
contained in the script's code does not become compromised.

However, you had better make certain that there is some random data (but
known to the script to authenticate the encrypted data against) being
encrypted with each login or the encrypted username/password itself
becomes the plain-text key.  Even then, you will be left with the
equivalent of a session id which can be used to 'break in' until it
expires.
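
As a sketch of what XOR with a per-login random value might look like
(obfuscation only, not real security): the key, the credentials, and the
nonce format below are all made up, and the script is assumed to keep
the nonce somewhere so it can check it later.

```perl
use strict;
use warnings;

# XOR every byte of $data with a repeating key. Applying the same
# function twice with the same key restores the original bytes.
sub xor_crypt {
    my ($data, $key) = @_;
    my $pad = $key x (int(length($data) / length($key)) + 1);
    return $data ^ substr($pad, 0, length $data);
}

my $key   = 'not-a-real-secret';                       # placeholder key
my $nonce = join '', map { chr(97 + int(rand(26))) } 1 .. 8;
my $plain = "$nonce:steve:hunter2";                    # made-up credentials

my $cipher = xor_crypt($plain, $key);
my $back   = xor_crypt($cipher, $key);
print "round trip ok\n" if $back eq $plain;
```

The ciphertext contains arbitrary bytes, so it would still need some
encoding (hex, for instance) before it could travel in a URL.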

If you are looking for genuine encryption and not just something that
makes life a tiny bit more complicated to ward off crackers you should be
using SSL and allowing the client and server to handle encrypting the
entire data stream.

-- 
Ethan H. Poole           ****   BUSINESS   ****
ehpoole@ingress.com      ==Interact2Day, Inc.==
(personal)               http://www.interact2day.com/


------------------------------

Date: Thu, 16 Sep 1999 12:55:12 -0700
From: Chris Koharik <koharik@primenet.com>
Subject: Re: Help with a regular expression
Message-Id: <Pine.BSI.3.96.990916125116.24696A-100000@usr05.primenet.com>

> Date: Thu, 16 Sep 1999 18:48:32 GMT
> From: Kragen Sitaker <kragen@dnaco.net>
> 
> In article <Pine.BSI.3.96.990916102741.19323C-100000@usr05.primenet.com>,
> Chris Koharik  <koharik@primenet.com> wrote:
> >Here is a little test script I tried.  I'm sure the others out there have
> >better suggestions.
> 
> Good regex.  Let me comment on your script, as others have done on
> mine, so people will know what not to emulate.

Please do.  I just love to get bashed. ;)

> >#!/bin/perl -w
> >use strict;
> 
> Good!
> 
> >my $string1 = "img src=\"/dir1/graphic.gif\"";
> >my $string2 = "img src=\"http://www.server.com/dir1/dir2/graphic.gif\"";
> >my @strings = (${string1},${string2});
> 
> Easier:
> my @strings = ( 
> 	q(img src="/dir1/graphic.gif"),
> 	q(img src="http://www.example.com/dir1/dir2/graphic.gif") 
> );

Yeah, yeah, I know.  Just not in the habit.  Will do better in the future.

> >foreach (@strings) {
> >  print STDOUT "$_\n";
> 
> You can just say print "$_\n", or print $_, "\n" if you like.

Another habit I've picked up because I usually have several file handles
being used.  One more you forgot to mention is print $_ . "\n"

> >  $_ =~ s/.*\//$newsrc/;
> 
> s/// operates on $_ by default; no need for the $_ =~.

Bad habit 3.  But it never hurts to be specific.  Makes it a bit more
readable to someone who is having problems. <--lame excuse

> Very nice script, though.  It uses strict and -w, and appears to have
> been tested :)

Thanks.

-Chris



------------------------------

Date: Thu, 16 Sep 1999 11:57:43 -0700
From: Makarand Kulkarni <makkulka@cisco.com>
Subject: Re: How can I know if string have points?
Message-Id: <37E13DA7.37112A5B@cisco.com>

Abel Almazán wrote:

> How can I know if a string has points?

Use tr///, which returns the number of characters it matched:

$str = "../whatever/.......";
$count = ($str =~ tr/.//);    # 9 dots here; nonzero means the string has points
--



------------------------------

Date: Thu, 16 Sep 1999 19:12:15 GMT
From: kragen@dnaco.net (Kragen Sitaker)
Subject: Re: How to assign filehandle to scalar?
Message-Id: <jibE3.14835$N77.1102121@typ11.nn.bcandid.com>

In article <7rred2$fav$1@nntp.itservices.ubc.ca>,
Rod B. Nussbaumer <bomr@lin01.triumf.ca> wrote:
>perl Gurus:

I am not a guru, but I will answer your question anyway.

>        $OutFile = FH1;     #Switch back to File1

$OutFile = \*FH1;

That's a reference to a "typeglob".  I don't understand typeglobs,
sorry.  I just know I can use a reference to a typeglob containing a
filehandle whenever I need to use a filehandle, and the reference can
be stored in a variable.
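
A small sketch of switching output between two handles this way. The
in-memory files (which need Perl 5.8 or later) merely stand in for the
poster's real File1 and File2:

```perl
use strict;
use warnings;

# In-memory files stand in for the poster's real output files.
my ($buf1, $buf2) = ('', '');
open(FH1, '>', \$buf1) or die "FH1: $!";
open(FH2, '>', \$buf2) or die "FH2: $!";

my $OutFile = \*FH1;              # reference to the typeglob
print $OutFile "first record\n";

$OutFile = \*FH2;                 # switch to File2
print $OutFile "second record\n";

$OutFile = \*FH1;                 # switch back to File1
print $OutFile "third record\n";

close(FH1);
close(FH2);
print "File1 got:\n$buf1", "File2 got:\n$buf2";
```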
-- 
<kragen@pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
Thu Sep 16 1999
53 days until the Internet stock bubble bursts on Monday, 1999-11-08.
<URL:http://www.pobox.com/~kragen/bubble.html>


------------------------------

Date: 16 Sep 1999 13:15:03 -0700
From: Tom Christiansen <tchrist@mox.perl.com>
Subject: Re: How to assign filehandle to scalar?
Message-Id: <37e141b7@cs.colorado.edu>

     [courtesy cc of this posting mailed to cited author]

In comp.lang.perl.misc, 
    "Rod B. Nussbaumer" <bomr@lin01.triumf.ca> writes:
:I have a need to read data from a file, and according to
:flags embedded within the file, copy said data to any of
:an arbitrary number of files.  

% man perlfaq7
 ...

  How can I make a filehandle local to a subroutine?  How do I pass
  filehandles between subroutines?  How do I make an array of filehandles?

    The fastest, simplest, and most direct way is to localize the
    typeglob of the filehandle in question:

        local *TmpHandle;

    Typeglobs are fast (especially compared with the alternatives)
    and reasonably easy to use, but they also have one subtle
    drawback. If you had, for example, a function named
    TmpHandle(), or a variable named %TmpHandle, you just hid it
    from yourself.

        sub findme {
            local *HostFile;
            open(HostFile, "</etc/hosts") or die "no /etc/hosts: $!";
            local $_;               # <- VERY IMPORTANT
            while (<HostFile>) {
                print if /\b127\.(0\.0\.)?1\b/;
            }
            # *HostFile automatically closes/disappears here
        }

    Here's how to use this in a loop to open and store a bunch of
    filehandles. We'll use as values of the hash an ordered pair
    to make it easy to sort the hash in insertion order.

        @names = qw(motd termcap passwd hosts);
        my $i = 0;
        foreach $filename (@names) {
            local *FH;
            open(FH, "/etc/$filename") || die "$filename: $!";
            $file{$filename} = [ $i++, *FH ];
        }

        # Using the filehandles in the array
        foreach $name (sort { $file{$a}[0] <=> $file{$b}[0] } keys %file) {
            my $fh = $file{$name}[1];
            my $line = <$fh>;
            print "$name $. $line";
        }

    For passing filehandles to functions, the easiest way is to
    preface them with a star, as in func(*STDIN). See the section
    on "Passing Filehandles" in the perlfaq7 manpage for details.

    If you want to create many anonymous handles, you should check
    out the Symbol, FileHandle, or IO::Handle (etc.) modules.
    Here's the equivalent code with Symbol::gensym, which is
    reasonably light-weight:

        foreach $filename (@names) {
            use Symbol;
            my $fh = gensym();
            open($fh, "/etc/$filename") || die "open /etc/$filename: $!";
            $file{$filename} = [ $i++, $fh ];
        }

    Or here using the semi-object-oriented FileHandle module,
    which certainly isn't light-weight:

        use FileHandle;

        foreach $filename (@names) {
            my $fh = FileHandle->new("/etc/$filename") or die "$filename: $!";
            $file{$filename} = [ $i++, $fh ];
        }

    Please understand that whether the filehandle happens to be a
    (probably localized) typeglob or an anonymous handle from one
    of the modules, in no way affects the bizarre rules for
    managing indirect handles. See the next question.

  How can I use a filehandle indirectly?

    An indirect filehandle is using something other than a symbol
    in a place that a filehandle is expected. Here are ways to get
    those:

        $fh =   SOME_FH;       # bareword is strict-subs hostile
        $fh =  "SOME_FH";      # strict-refs hostile; same package only
        $fh =  *SOME_FH;       # typeglob
        $fh = \*SOME_FH;       # ref to typeglob (bless-able)
        $fh =  *SOME_FH{IO};   # blessed IO::Handle from *SOME_FH typeglob

    Or to use the `new' method from the FileHandle or IO modules
    to create an anonymous filehandle, store that in a scalar
    variable, and use it as though it were a normal filehandle.

        use FileHandle;
        $fh = FileHandle->new();

        use IO::Handle;                     # 5.004 or higher
        $fh = IO::Handle->new();

    Then use any of those as you would a normal filehandle.
    Anywhere that Perl is expecting a filehandle, an indirect
    filehandle may be used instead. An indirect filehandle is just
    a scalar variable that contains a filehandle. Functions like
    `print', `open', `seek', or the `<FH>' diamond operator will
    accept either a real filehandle or a scalar variable
    containing one:

        ($ifh, $ofh, $efh) = (*STDIN, *STDOUT, *STDERR);
        print $ofh "Type it: ";
        $got = <$ifh>;
        print $efh "What was that: $got";

    If you're passing a filehandle to a function, you can write
    the function in two ways:

        sub accept_fh {
            my $fh = shift;
            print $fh "Sending to indirect filehandle\n";
        }

    Or it can localize a typeglob and use the filehandle directly:

        sub accept_fh {
            local *FH = shift;
            print  FH "Sending to localized filehandle\n";
        }

    Both styles work with either objects or typeglobs of real
    filehandles. (They might also work with strings under some
    circumstances, but this is risky.)

        accept_fh(*STDOUT);
        accept_fh($handle);

    In the examples above, we assigned the filehandle to a scalar
    variable before using it. That is because only simple scalar
    variables, not expressions or subscripts into hashes or
    arrays, can be used with built-ins like `print', `printf', or
    the diamond operator. These are illegal and won't even
    compile:

        @fd = (*STDIN, *STDOUT, *STDERR);
        print $fd[1] "Type it: ";                           # WRONG
        $got = <$fd[0]>;                                    # WRONG
        print $fd[2] "What was that: $got";                 # WRONG

    With `print' and `printf', you get around this by using a
    block and an expression where you would place the filehandle:

        print  { $fd[1] } "funny stuff\n";
        printf { $fd[1] } "Pity the poor %x.\n", 3_735_928_559;
        # Pity the poor deadbeef.

    That block is a proper block like any other, so you can put
    more complicated code there. This sends the message out to one
    of two places:

        $ok = -x "/bin/cat";                
        print { $ok ? $fd[1] : $fd[2] } "cat stat $ok\n";
        print { $fd[ 1+ ($ok || 0) ]  } "cat stat $ok\n";           

    This approach of treating `print' and `printf' like object
    method calls doesn't work for the diamond operator. That's
    because it's a real operator, not just a function with a
    comma-less argument. Assuming you've been storing typeglobs in
    your structure as we did above, you can use the built-in
    function named `readline' to read a record just as `<>' does.
    Given the initialization shown above for @fd, this would work,
    but only because readline() requires a typeglob. It doesn't
    work with objects or strings, which might be a bug we haven't
    fixed yet.

        $got = readline($fd[0]);

    Let it be noted that the flakiness of indirect filehandles is
    not related to whether they're strings, typeglobs, objects, or
    anything else. It's the syntax of the fundamental operators.
    Playing the object game doesn't help you at all here.
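
    The two workarounds can be tried together in one small program. This
    is only a sketch: the handle names IN and OUT are made up, and the
    in-memory files assume Perl 5.8 or later.

```perl
use strict;
use warnings;

# In-memory handles keep the sketch self-contained.
my $input = "hello\n";
my $out   = '';
open(IN,  '<', \$input) or die "IN: $!";
open(OUT, '>', \$out)   or die "OUT: $!";

my @fd = (*IN, *OUT);    # typeglobs stored in an array

my $got = readline($fd[0]);            # instead of <$fd[0]>
print  { $fd[1] } "read: $got";        # block form instead of print $fd[1] ...
printf { $fd[1] } "%x\n", 3_735_928_559;

close(IN);
close(OUT);
print $out;
```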

--tom
-- 
    "Most of what I've learned over the years has come from signatures."
    	--Larry Wall


------------------------------

Date: 16 Sep 1999 19:03:42 GMT
From: friendly@hotspur.psych.yorku.ca (Michael Friendly)
Subject: Re: How to match with m/$variable/ ??
Message-Id: <7rreue$ma1$1@sunburst.ccs.yorku.ca>

In article <0hSD3.13262$N77.956658@typ11.nn.bcandid.com>  writes:
|In article <7roph7$1a8$1@sunburst.ccs.yorku.ca>,
|Michael Friendly <friendly@hotspur.psych.yorku.ca> wrote:
|>I've obviously forgotten something, because I can't get variable patterns
|>to work in matches.  E.g., I want to use
|>
|># Patterns for include-type statements, leaving filename in #1
|>$include_pat = join('|',
|>      ('\\input\b\{?(\S+)\}',
|>       '\\include\s*\{(\S+)\}'
|>      ));
|
|perl -e "print '\\\\', qq(\n)" outputs
|\
|. . . i.e. \\ in '' becomes \.  So by the time the regex sees it, it 
|thinks it's looking for \i, not \ followed by i.  (What's \i?)
|


Actually,
perl -e "print '\\\\', qq(\n)" outputs
\\

as do other forms:
 % perl -e "print '\\\\', qq(\n)"
\\
 % perl -e "print qq(\\\\), qq(\n)"
\\
 % perl -e "print q(\\\\), qq(\n)"
\\

I think I see the problem, but I still can't tell how to do what I want, which
is to define $include_pat so that I can match strings like
   \input{filename}
   \include{filename}

using
   if (m#$include_pat#) {
	$file = $1;
   }

--
Michael Friendly     Email: friendly@yorku.ca (NeXTmail OK)
Psychology Dept
York University      Voice: 416 736-5115  Fax: 416 736-5814
4700 Keele Street    http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT  M3J 1P3 CANADA



------------------------------

Date: Thu, 16 Sep 1999 19:40:50 GMT
From: kragen@dnaco.net (Kragen Sitaker)
Subject: Re: How to match with m/$variable/ ??
Message-Id: <6JbE3.14859$N77.1105112@typ11.nn.bcandid.com>

In article <7rreue$ma1$1@sunburst.ccs.yorku.ca>,
Michael Friendly <friendly@hotspur.psych.yorku.ca> wrote:
>|Michael Friendly <friendly@hotspur.psych.yorku.ca> wrote:
>|># Patterns for include-type statements, leaving filename in #1
>|>$include_pat = join('|',
>|>      ('\\input\b\{?(\S+)\}',
>|>       '\\include\s*\{(\S+)\}'
>|>      ));
>
>Actually,
>perl -e "print '\\\\', qq(\n)" outputs
>\\

Oops!  Don't know what I was thinking.  Sorry.

Nevertheless, \\ does become \ in ''.

Cut and pasted:
kirk:/home/kragen/ perl
print '\\', "\n";
\D
kirk:/home/kragen/

(The D is from me hitting control-D.)

#!/usr/bin/perl -w
use strict;
my $string = '\input{foogledy}';
print "string is $string\n";
my $pat = '\\\\input\b\{?(\S+)\}';
print "matches $pat\n" if $string =~ /$pat/;

This outputs:

string is \input{foogledy}
matches \\input\b\{?(\S+)\}

BTW, if you have a ? after \{, you might want one after \} too.

>I think I see the problem, but I still can't tell how to do what I want, which
>is to define $include_pat so that I can match strings like
>   \input{filename}
>   \include{filename}
>
>using
>   if (m#$include_pat#) {
>	$file = $1;
>   }

Does this help?
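
One possible way to fold both commands into a single pattern, offered as
a sketch rather than the definitive answer. It uses qr// (Perl 5.005 or
later) and assumes the braces are always present. Joining two patterns
with '|' as in the original gives each alternative its own capture
group, so the second one would fill $2 rather than $1; a single shared
group avoids that. The sample lines are invented:

```perl
use strict;
use warnings;

# One capture group shared by both commands, so the filename is always $1.
my $include_pat = qr/\\(?:input|include)\s*\{([^{}\s]+)\}/;

my @files;
for my $line ('\input{chapter1}', '\include{appendix}', 'plain text') {
    if ($line =~ m#$include_pat#) {
        push @files, $1;
    }
}
print "found: @files\n";
```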
-- 
<kragen@pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
Thu Sep 16 1999
53 days until the Internet stock bubble bursts on Monday, 1999-11-08.
<URL:http://www.pobox.com/~kragen/bubble.html>


------------------------------

Date: Thu, 16 Sep 1999 15:36:18 -0400
From: "Ethan H. Poole" <ehpoole@ingress.com>
Subject: Re: IIS 3.0 - CGI - @INC - problem finding libraries
Message-Id: <37E146B2.C0892A84@ingress.com>

Alain BORGO wrote:
> 
> This is not a beginner question. It is just an IIS problem!
> 
> Normally, when your script is launched, the working directory is the
> directory where the script resides. Unfortunately, with IIS, the
> working directory is ALWAYS the root directory of your scripts.
> 
> URL = /cgi-bin/survey           PHYSICAL PATH :
> c:\somewhere\else\cgi-bin\survey
> When you call /cgi-bin/survey/mysurvey.pl, the current directory is
> c:\somewhere\else\cgi-bin (the same as before !) and the require
> "mysurveylib.pl" requires that mysurveylib.pl is in this directory.
> And then your script failed !
> 
> In the last case, you have to say  :
> require "/somewhere/else/cgi-bin/survey/mysurveylib.pl";

Or you can simply define a script alias for
"/somewhere/else/cgi-bin/survey" and the working directory will be that of
".../somewhere/else/cgi-bin/survey/".

This isn't an "unfortunate" IIS behavior, it is the result of CGI
specifications which never defined what the working directory should be. 
As such, each server is free to use whatever it wishes.  Be glad that IIS
was kind enough to give you a predictable working directory -- the
standards would allow for a totally random or unrelated working directory.

-- 
Ethan H. Poole           ****   BUSINESS   ****
ehpoole@ingress.com      ==Interact2Day, Inc.==
(personal)               http://www.interact2day.com/


------------------------------

Date: Thu, 16 Sep 1999 19:42:11 GMT
From: Brundle <brundlefly76@hotmail.com>
Subject: Labelling warns and dies
Message-Id: <7rrh6f$rg2$1@nnrp1.deja.com>

One can fairly straightforwardly trap warns and dies, and label them as
such, but I think this would be an excellent built-in feature.

It is of great significance to the programmer whether an error stopped
execution or not.
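
For reference, the trapping mentioned above might be sketched as follows
with the %SIG hooks. The [WARN] and [FATAL] labels and the sample
messages are invented for illustration. Note that the __DIE__ hook also
fires for a die inside an eval, which is why the caught message carries
the label too:

```perl
use strict;
use warnings;

my @log;    # kept only so the labelled messages can be inspected

# Label non-fatal messages.
$SIG{__WARN__} = sub {
    push @log, "[WARN] $_[0]";
    print STDERR $log[-1];
};

# Label fatal ones, then let the exception continue on its way.
$SIG{__DIE__} = sub { die "[FATAL] $_[0]" };

warn "disk nearly full\n";

eval { die "cannot open config\n" };    # eval keeps the sketch running
push @log, $@;
print STDERR "caught: $@";
```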



Sent via Deja.com http://www.deja.com/
Share what you know. Learn what you don't.


------------------------------

Date: 16 Sep 1999 15:07:55 -0400
From: Uri Guttman <uri@sysarch.com>
Subject: Re: need to write www search engine
Message-Id: <x71zbyk73o.fsf@home.sysarch.com>

>>>>> "RJ" == Roger Jacques <channel@metalab.unc.edu> writes:

  RJ> Thank you very much for your help so far.  I do have financial
  RJ> backing for this project and would be able to farm it out to a
  RJ> professional service, or perhaps find an ISP with the required
  RJ> resources willing to take it on.  Therefore, could you give me
  RJ> figures for the disk space needed and the amount of time it would
  RJ> take through a fractional T3 with a reasonably efficient crawler,
  RJ> plus any other specifications I would need to provide to such a
  RJ> service.

i may be the only one (or one of very few) here who has actually written
a high end web crawler (northern light's). that one currently has
slurped the largest index on the web and it still has seen only 16%
of the whole mess. the project took 6 months of 3 full time people to do
it. this was only crawling, not indexing or searching which were done by
other groups. this is not a trivial project when you scale it up to the
full size of the web. here are some of the major issues:

supporting 2000 parallel page fetches.

parsing fetched pages for links, meta tags, base tags, etc.

storing the fetched pages and/or parsed data in a format that can be
indexed.

doing dns lookups so you don't crawl a site both by name and by ip. the
dns lookups have to be non-blocking to the crawler.

managing a database of what sites and pages you have seen and what to
fetch next. this was a very difficult design issue. you can't keep all
the info you need in ram so you have to create a complex set of in ram
and on disk structures. integrating newly parsed links is especially
tricky.

the architecture was 2 processes, the first did the fetching and
parsing. the second managed the sites and url database. they
communicated via a socket and the commands and responses were in a
flexible format. the system was written in c for speed. i wouldn't dare
try this in perl (though many parts would have been much easier to
code). both processes ran under the same generic event engine. 

the crawler could slurp over 10GB a day and at one point was throttled
back since the indexer couldn't keep up. that has since changed.

  RJ> What I want is a massive database with all URLs and the contents
  RJ> of each <title>, meta:title and first headers <h1/h2>.  I have
  RJ> found a use for such a database that would definitely fill a
  RJ> needed gap, and that would be popular and profitable to my group.

that involves at least reading about 1k (or more) from each page. and
too short reads could ruin parsing those pages. also you will miss
many/some links unless you fetch and parse the whole page. we did
truncate pages after some largish amount (i forget the size).

  RJ> Also I could use your further recommendations and suggestions as
  RJ> to the best crawler (harvester agent?) to use or to write.

i don't know of any large scale crawlers on the market. this is usually
write your own since each one has different needs. the url and site
management is critical and varies from crawler to crawler. also the
depth of page parsing and the output of pages for the index vary
greatly. 

you need an experienced systems hacker for this. don't try it yourself.

uri


-- 
Uri Guttman  -----------------  SYStems ARCHitecture and Software Engineering
uri@sysarch.com  ---------------------------  Perl, Internet, UNIX Consulting
Have Perl, Will Travel  -----------------------------  http://www.sysarch.com
The Best Search Engine on the Net -------------  http://www.northernlight.com
"F**king Windows 98", said the general in South Park before shooting Bill.


------------------------------

Date: 16 Sep 1999 12:19:24 -0700
From: merlyn@stonehenge.com (Randal L. Schwartz)
Subject: Re: need to write www search engine
Message-Id: <m1n1um64w3.fsf@halfdome.holdit.com>

>>>>> "Roger" == Roger Jacques <channel@metalab.unc.edu> writes:

Roger> What I want is a massive database with all URLs and the contents of each
Roger> <title>, meta:title and first headers <h1/h2>.

Wooo hooo.  If you think you'll find things in h1/h2, then you have
definitely pulled something over on those "investors".

I *rarely* see h1/h2 as first and second order heads.  Instead, I see
CSS tagging or <font size=18> stuff to make it "BIG" because the
author wants it "BIG".  Never mind that there must be a *reason* that
it is big, and thus should have been an <h2> or something.

So, again, I'll join the voices that have already spoken here...  if
you are clueless, get a clue.  If you are clueful, you are being
dishonest with your investors.  In either case, the world doesn't need
another spider hitting 800 million pages.  Look at Dogpile or
Metacrawler about ways to query the union of these databases, or even
WWW::Search from Perl to see how to query the individual databases.

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!


------------------------------

Date: Thu, 16 Sep 1999 19:08:27 GMT
From: kragen@dnaco.net (Kragen Sitaker)
Subject: Re: PERL (cgi) and Databases -> How To?
Message-Id: <LebE3.14827$N77.1101799@typ11.nn.bcandid.com>

In article <0b51f39f.f8b3f1be@usw-ex0102-012.remarq.com>,
LynchMan  <lynchmanNOpaSPAM@cyberenet.net> wrote:
>I am currently developing some Perl CGI scripts, and I want
>to get into using a database with my scripts.  I was
>wondering where I should start.  I use my ISP to host my
>pages (it is a unix box).  So basically I was wondering what
>database program I should use, and how do I connect to it
>using perl.

If it's a relational database, you connect to it using DBI and DBD.
You may want to look into something like mod_perl or FastCGI though, as
opening database connections tends to be slow.

>  I am running a Linux box at home, so what I am
>really looking for is some database that I can create
>locally, then upload it to the server and be able to have
>the scripts (which are running on the server) hit it on a
>needs be basis.

Most relational databases can export entire tables as SQL statements;
some of them generate portable enough SQL that you can just run that
SQL on another kind of RDBMS in order to copy the data over.

Of course, even without this feature, relational databases contain
fairly simple data; copying it from one kind of RDBMS to another should
be easy anyway.

You probably won't want to copy the raw database space.  :)

>My main concern is what my ISP will
>support.  I don't think they will install any modules for me
>or a new database driver, so what types would a typical unix
>ISP box support? 

You can install modules in your own directory.  The same is not usually
true of database servers.

Typical Unix ISPs don't have database servers.  You probably need to
find one who does.

>If it would work I could deal with an MS access DB.  I do
>have a windows box also so I could make the DB in that, but
>then will I be able to access it using Perl and on a unix
>ISP without having anything installed onto the server?

No.  You could export the data into CSV or SQL files, though, which you
could either access directly (if you don't need fancy stuff like
atomicity, consistency, isolation, durability, or fast queries) or load
into some other database.

HTH.
-- 
<kragen@pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
Thu Sep 16 1999
53 days until the Internet stock bubble bursts on Monday, 1999-11-08.
<URL:http://www.pobox.com/~kragen/bubble.html>


------------------------------

Date: Thu, 16 Sep 1999 13:31:19 -0400
From: Ala Qumsieh <aqumsieh@matrox.com>
Subject: Re: PERL (cgi) and Databases -> How To?
Message-Id: <x3y7llqhifs.fsf@tigre.matrox.com>


jonceramic@nospammiesno.earthlink.net (Jon S.) writes:

> On 15 Sep 1999 21:52:08 -0500, abigail@delanet.com (Abigail) wrote:
> >Do you however not
> >need your ISP to install modules, but you knew that, because you
> >read the FAQ, didn't you?
> 
> Abigail, I too am a newbie working on this, and, given the immense
> numbers of FAQs, I haven't been able to find this one yet.
> 
> Can you direct me to where I should look?  Is it in the FAQs at
> www.perl.com?

[Note: I am not Abigail ;-)]

Yes. Look in perlfaq8:

	How do I keep my own module/library directory?

and

	How do I add the directory my program lives in to 
	the module/library search path?
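
In short, those two entries come down to something like the following sketch. The directory names are hypothetical; FindBin and lib are standard modules.

```perl
use strict;
use warnings;

# FindBin sets $Bin to the directory this script lives in.
use FindBin qw($Bin);

# Hypothetical layout: modules kept in a "lib" directory next to the
# script, plus a private module directory under your own account.
use lib "$Bin/lib";
use lib '/home/you/perl/lib';

# Both directories are now searched before the system-wide ones.
print "$_\n" for @INC;
```

Setting the PERL5LIB environment variable has a similar effect without editing the script.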

--Ala



------------------------------

Date: 16 Sep 1999 20:39:30 +0100
From: <nwsread@cloudband.com>
Subject: perl prog for big/little indian conversion
Message-Id: <37e14772@glitch.nildram.co.uk>

	Hi, I am looking for a Perl program to convert some
big-endian UltraSPARC gdbm files to little-endian 32-bit
gdbm files.  Does someone have something in Perl that does
byte-order conversion?
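
For the byte conversion itself, pack and unpack can do the swap. This is a minimal sketch that assumes the data is a plain run of 32-bit unsigned integers. A gdbm file's internal layout is not that simple, so dumping the key/value pairs on the SPARC and reloading them on the target machine may well be the more reliable route.

```perl
use strict;
use warnings;

# Swap the byte order of each 32-bit unsigned integer in a buffer.
# "N" is big-endian ("network") order, "V" is little-endian ("VAX") order.
sub be32_to_le32 {
    my ($buf) = @_;
    die "length is not a multiple of 4\n" if length($buf) % 4;
    return pack('V*', unpack('N*', $buf));
}

my $big    = pack('N*', 0x01020304, 0xCAFEBABE);
my $little = be32_to_le32($big);
print unpack('H*', $big), "\n";      # 01020304cafebabe
print unpack('H*', $little), "\n";   # 04030201bebafeca
```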


------------------------------

Date: 16 Sep 1999 12:24:13 -0700
From: merlyn@stonehenge.com (Randal L. Schwartz)
Subject: Re: Some e-mails get sent, some don't
Message-Id: <m1iu5a64o2.fsf@halfdome.holdit.com>

>>>>> "Greg" == Greg Miller <gmiller@iglou.com> writes:

Greg> 	I'm using code similar to the following:

Greg> if(!open(M,"|mail $addr")){

Danger, Will Robinson!

What happens on the day that someone comes along with an email
address of:

	'merlyn@stonehenge.com </etc/passwd'

Egah!  I just got your password file.

Or what about:

	'; rm -rf /'

Egah!  You just got spanked badly.

And if you're saying "but I filter out bad email addresses", do you
*PERMIT* all valid email addresses?  Can you send mail to

	fred&barney@stonehenge.com

which is a valid email address (try it... it has an autoresponder)?

<sigh>

This is another CERT bug waiting to happen.

THINK, people... THINK!
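
One way to keep the address away from the shell is sketched below. It assumes a sendmail-compatible binary at /usr/sbin/sendmail that accepts -t (take recipients from the headers) and -oi, and a Perl recent enough (5.8 or later) to support the list form of a piped open. The helper names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: a header value must not contain line breaks,
# or the sender could smuggle in extra headers.
sub safe_header_value {
    my ($value) = @_;
    return $value !~ /[\r\n]/;
}

# Sketch: the recipient travels in the To: header, never on a command line.
sub send_mail {
    my ($addr, $subject, $body) = @_;
    die "bad address\n" unless safe_header_value($addr);
    die "bad subject\n" unless safe_header_value($subject);

    # List-form open runs the program directly; no shell ever sees $addr,
    # so "<", ";" and "&" in an address are just characters.
    open(my $mail, '|-', '/usr/sbin/sendmail', '-t', '-oi')
        or die "cannot start sendmail: $!";
    print $mail "To: $addr\nSubject: $subject\n\n$body\n";
    close($mail) or die "sendmail failed: $! (status $?)";
    return 1;
}
```

With this shape there is no need to filter out shell metacharacters, so an address like fred&barney@stonehenge.com is passed through unchanged.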

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!


------------------------------

Date: Thu, 16 Sep 1999 12:16:20 -0700
From: Makarand Kulkarni <makkulka@cisco.com>
Subject: Re: testing data types
Message-Id: <37E14203.35C97314@cisco.com>

Kevin Howe wrote:

> How do you test if a variable is an array, or a hash, or scalar?

If you have an array @x, a scalar $x and a hash %x, there is no need
for ref(): the sigil already tells you the type. But if $s is set to
one of \@x, \$x or \%x, you can find out what kind of reference $s
holds by passing it to ref() and checking the return value. If ref()
returns false, then $s is not a reference.

For example, ref() cannot differentiate between $str, @str and %str
in the following program:

$str = "../whatever/....\.\.\." ;
@str = ( 1, 2, 3 );
%str = ( 1 => 2 );

# ref() takes a scalar argument, so @str and %str are evaluated in
# scalar context here; none of the three arguments is a reference.
if (!ref( $str )) {
    print "\$str is not a reference at all.\n";
}
if (!ref( @str )) {
    print "\@str is not a reference at all.\n";
}
if (!ref( %str )) {
    print "%str is not a reference at all.\n";
}

-- prints --
$str is not a reference at all.
@str is not a reference at all.
%str is not a reference at all.

However, you can differentiate between references to these variables
using ref().
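
A minimal sketch of that last point, with a hypothetical helper name:

```perl
use strict;
use warnings;

my $x = "a string";
my @x = (1, 2, 3);
my %x = (1 => 2);

# Hypothetical helper: describe what $s refers to, if anything.
sub kind_of {
    my ($s) = @_;
    my $type = ref($s);   # "" (false) when $s is not a reference
    return $type ? "reference to $type" : "not a reference";
}

print kind_of(\$x), "\n";   # reference to SCALAR
print kind_of(\@x), "\n";   # reference to ARRAY
print kind_of(\%x), "\n";   # reference to HASH
print kind_of($x),  "\n";   # not a reference
```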



------------------------------

Date: Thu, 16 Sep 1999 14:35:00 -0500
From: Tom Briles <sariq@texas.net>
Subject: Re: testing data types
Message-Id: <37E14664.9ADE9E15@texas.net>

Kevin Howe wrote:
> 
> How do you test if a variable is an array, or a hash, or scalar?

I assume you mean 'reference'.

perldoc -f ref

- Tom


------------------------------

Date: 16 Sep 1999 19:44:34 GMT
From: gbacon@itsc.uah.edu (Greg Bacon)
Subject: Re: trimming spaces from a string
Message-Id: <7rrhb2$5g8$1@info2.uah.edu>

In article <X6bE3.14819$N77.1100736@typ11.nn.bcandid.com>,
	kragen@dnaco.net (Kragen Sitaker) writes:

: If I were writing Unicode, I'd be miffed if \s didn't match the various
: non-ASCII spaces in Unicode.

Such as?
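
For reference, a few candidates are listed in the sketch below. It assumes a Perl with Unicode string support (5.8 or later), where \s does match them; the Perl current when this was posted predates that.

```perl
use strict;
use warnings;

# All of these are above U+00FF, so Perl applies Unicode rules to \s.
my %spaces = (
    "\x{1680}" => 'OGHAM SPACE MARK',
    "\x{2003}" => 'EM SPACE',
    "\x{2028}" => 'LINE SEPARATOR',
    "\x{3000}" => 'IDEOGRAPHIC SPACE',
);

for my $char (sort keys %spaces) {
    my $verdict = $char =~ /\s/ ? 'matches' : 'does not match';
    printf "U+%04X %s %s \\s\n", ord($char), $spaces{$char}, $verdict;
}

# The usual trimming idiom then strips them along with ASCII blanks.
(my $trimmed = "\x{3000} text\x{2003}") =~ s/^\s+|\s+$//g;
```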

Greg
-- 
Got Mole problems?
Call Avogadro: 6.02 x 10^23


------------------------------

Date: Thu, 16 Sep 1999 14:25:46 -0500
From: "Tim Bornholtz" <tbornhol@prioritytech.com>
Subject: Re: Where do I get perl2exe for Win32?
Message-Id: <46E203E3ED26A361.465B65CBCBCA4B76.4E177AA4B05775EB@lp.airnews.net>


Andy Cragg <andrew_cragg@csi.com> wrote in message
news:7rra50$pv4$1@ssauraaa-i-1.production.compuserve.com...
> Hi,
>
> Erm, where do I get perl2exe for Win32?

Well, I did a quick search on google.com and would you believe it, the first
hit was to the "Perl2Exe Home Page".  Wow, search engines are amazing
things.

Try http://www.demobuilder.com/perl2exe.htm

hth,
Tim Bornholtz
tbornhol@prioritytech.com







------------------------------

Date: Thu, 16 Sep 1999 12:21:59 -0700
From: Makarand Kulkarni <makkulka@cisco.com>
Subject: Re: Where do I get perl2exe for Win32?
Message-Id: <37E14356.606ECB85@cisco.com>

Andy Cragg wrote:

> Erm, where do I get perl2exe for Win32?

http://www.perl2exe.com/




------------------------------

Date: 1 Jul 99 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 1 Jul 99)
Message-Id: <null>


Administrivia:

The Perl-Users Digest is a retransmission of the USENET newsgroup
comp.lang.perl.misc.  For subscription or unsubscription requests, send
the single line:

	subscribe perl-users
or:
	unsubscribe perl-users

to almanac@ruby.oce.orst.edu.  

To submit articles to comp.lang.perl.misc (and this Digest), send your
article to perl-users@ruby.oce.orst.edu.

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

To request back copies (available for a week or so), send your request
to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
where x is the volume number and y is the issue number.

The Meta-FAQ, an article containing information about the FAQ, is
available by requesting "send perl-users meta-faq" from
almanac@ruby.oce.orst.edu. The real FAQ, as it appeared last in the
newsgroup, can be retrieved with the request "send perl-users FAQ" from
almanac@ruby.oce.orst.edu. Due to their sizes, neither the Meta-FAQ nor
the FAQ is included in the digest.

The "mini-FAQ", which is an updated version of the Meta-FAQ, is
available by requesting "send perl-users mini-faq" from
almanac@ruby.oce.orst.edu. 

For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending Perl questions to the -request address; I don't have time to
answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V9 Issue 821
*************************************

