[19603] in Perl-Users-Digest


Perl-Users Digest, Issue: 1798 Volume: 10

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Mon Sep 24 00:10:30 2001

Date: Sun, 23 Sep 2001 21:10:11 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Message-Id: <1001304611-v10-i1798@ruby.oce.orst.edu>
Content-Type: text

Perl-Users Digest           Sun, 23 Sep 2001     Volume: 10 Number: 1798

Today's topics:
        Regular Expression Problem (Scott)
    Re: Regular Expression Problem <jeffplus@mediaone.net>
    Re: Regular Expression Problem <jeffplus@mediaone.net>
    Re: Regular Expression Problem (Logan Shaw)
    Re: Regular Expression Problem <please@no.spam>
    Re: Regular Expression Problem <please@no.spam>
    Re: search, replace, functions, text wrapping (hard que <goldbb2@earthlink.net>
    Re: search, replace, functions, text wrapping (hard que (Martien Verbruggen)
    Re: search, replace, functions, text wrapping (hard que <goldbb2@earthlink.net>
        sub key index sorting <swessels@usgn.net>
    Re: win32 stat in directory with 4682 files <goldbb2@earthlink.net>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: 23 Sep 2001 19:23:17 -0700
From: scott_hill2@hotmail.com (Scott)
Subject: Regular Expression Problem
Message-Id: <7e0b3308.0109231823.5b7ea19e@posting.google.com>

Hello, I'm just learning Regular Expressions, and could use some help.

I have some bits of data in a long string of newline terminated text,
and I need to pull it out. The string looks like this.

(please forgive the Java style code...i am using a library)

myString = " ....multi-lines of text i don't need...\n"
myString += "#DATA"     // this flags my data section
myString += "..multi-lines of data i DO need....\n"
myString += "#0"        // this flags the end of my data.


I am trying something like /#DATA([^#0]+)/

but this isn't working. I am trying to use the ^ to mean 'not #0', but I
think Perl is reading it as 'not #'. 

Can anyone help? It would be much appreciated.


------------------------------

Date: Mon, 24 Sep 2001 03:13:36 GMT
From: Jeff <jeffplus@mediaone.net>
Subject: Re: Regular Expression Problem
Message-Id: <Axxr7.6758$xG6.2143028@typhoon.ne.mediaone.net>

Scott-

Perl is interpreting that to mean:

"The text '#DATA' followed by one or more characters that are neither '#' 
nor '0'"

You want this:

$_ = $the_string_with_the_data;
/^#DATA(.*)^#0/ms;
my $data = $1;

The 's' modifier will let the '.' match newlines, and the 'm'  modifier 
will let the caret ('^') match the position right _after_ any embedded 
newline.
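
A minimal sketch of the match above, run against a made-up string shaped
like the one in the original post (the two data lines are invented for
illustration). It assumes, as in that sample, that both #DATA and #0 sit
at the start of a line.

```perl
use strict;
use warnings;

# Invented sample, same shape as the string in the original post.
my $string = " ....multi-lines of text i don't need...\n"
           . "#DATA"
           . "first line of data\nsecond line of data\n"
           . "#0";

# /m: ^ also matches just after each embedded newline.
# /s: . also matches newline, so (.*) can span lines.
my ($data) = $string =~ /^#DATA(.*)^#0/ms;

if ( defined $data ) {
    print $data;
}
else {
    print "no data section found\n";
}
```

Capturing in list context and testing for definedness avoids reading a
stale $1 when the match fails.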

-jms

Scott wrote:

> Hello, I'm just learning Regular Expressions, and could use some help.
> 
> I have some bits of data in a long string of newline terminated text,
> and I need to pull it out. The string looks like this.
> 
> (please forgive the Java style code...i am using a library)
> 
> myString = " ....multi-lines of text i don't need...\n"
> myString += "#DATA"     // this flags my data section
> myString += "..multi-lines of data i DO need....\n"
> myString += "#0"        // this flags the end of my data.
> 
> 
> I am trying something like /#DATA([^#0]+)/
> 
> but this isn't working. I am trying to use the ^ to mean 'not #0', but I
> think Perl is reading it as 'not #'.
> 
> Can anyone help? It would be much appreciated.
> 




------------------------------

Date: Mon, 24 Sep 2001 03:19:20 GMT
From: Jeff <jeffplus@mediaone.net>
Subject: Re: Regular Expression Problem
Message-Id: <YCxr7.6771$xG6.2144890@typhoon.ne.mediaone.net>

Forgot to mention...

If the code you wrote is Perl, you'll want to change '+=' to '.=', and of 
course the comment sequence is '#' instead of '//'.  If you want to keep 
the Java-style variable name, it's your prerogative :) .
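
A small sketch of why the operator matters, using made-up strings: in
Perl, '+=' is numeric addition, so non-numeric strings are treated as 0
instead of being joined.

```perl
use strict;
use warnings;

my $joined = " ....text i don't need...\n";
$joined .= "#DATA";          # .= appends; comments start with '#'

my $added = "abc";
{
    no warnings 'numeric';   # silence the "isn't numeric" warning
    $added += "def";         # numeric addition: both strings count as 0
}

print "joined ends with #DATA\n" if $joined =~ /#DATA\z/;
print "added is $added\n";
```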

Jeff wrote:

> Scott-
> 
> Perl is interpreting that to mean:
> 
> "The text '#DATA' followed by one or more characters that are neither '#'
> nor '0'"
> 
> You want this:
> 
> $_ = $the_string_with_the_data;
> /^#DATA(.*)^#0/ms;
> my $data = $1;
> 
> The 's' modifier will let the '.' match newlines, and the 'm'  modifier
> will let the caret ('^') match the position right _after_ any embedded
> newline.
> 
> -jms
> 
> Scott wrote:
> 
>> Hello, I'm just learning Regular Expressions, and could use some help.
>> 
>> I have some bits of data in a long string of newline terminated text,
>> and I need to pull it out. The string looks like this.
>> 
>> (please forgive the Java style code...i am using a library)
>> 
>> myString = " ....multi-lines of text i don't need...\n"
>> myString += "#DATA"     // this flags my data section
>> myString += "..multi-lines of data i DO need....\n"
>> myString += "#0"        // this flags the end of my data.
>> 
>> 
>> I am trying something like /#DATA([^#0]+)/
>> 
>> but this isn't working. I am trying to use the ^ to mean 'not #0', but I
>> think Perl is reading it as 'not #'.
>> 
>> Can anyone help? It would be much appreciated.
>> 
> 
> 
> 



------------------------------

Date: 23 Sep 2001 22:20:46 -0500
From: logan@cs.utexas.edu (Logan Shaw)
Subject: Re: Regular Expression Problem
Message-Id: <9om8qe$lml$1@charity.cs.utexas.edu>

>Scott wrote:
>> myString = " ....multi-lines of text i don't need...\n"
>> myString += "#DATA"     // this flags my data section
>> myString += "..multi-lines of data i DO need....\n"
>> myString += "#0"        // this flags the end of my data.
>> 
>> 
>> I am trying something like /#DATA([^#0]+)/

In article <Axxr7.6758$xG6.2143028@typhoon.ne.mediaone.net>,
Jeff  <jeffplus@mediaone.net> wrote:
>Perl is interpreting that to mean:
>
>"The text '#DATA' followed by one or more characters that are neither '#' 
>nor '0'"
>
>You want this:
>
>$_ = $the_string_with_the_data;
>/^#DATA(.*)^#0/ms;
>my $data = $1;

It might be a good idea to make the Kleene star non-greedy by adding a
question mark, like this:

	$string =~ /^#DATA(.*?)^#0/ms;

Whether that's necessary or not depends on whether "^#0" occurs only
once after "^#DATA" occurs.  If it does occur more than once, the
non-greedy version will match all the text up to the first one, whereas
the (default) greedy version will match all of the text up to the last
one.
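
To make the difference concrete, here is a sketch against an invented
string that contains two data sections. The sample text is an
assumption; only the two patterns come from the discussion above.

```perl
use strict;
use warnings;

# Invented sample with two data sections.
my $string = "junk\n#DATA\none\n#0\nmore junk\n#DATA\ntwo\n#0\n";

# Greedy: runs from the first #DATA to the LAST line starting with #0.
my ($greedy) = $string =~ /^#DATA(.*)^#0/ms;

# Non-greedy: stops at the FIRST line starting with #0.
my ($lazy) = $string =~ /^#DATA(.*?)^#0/ms;

# With /g in list context, the non-greedy form returns every section.
my @sections = $string =~ /^#DATA(.*?)^#0/msg;

printf "greedy: %d chars, lazy: %d chars, sections: %d\n",
    length($greedy), length($lazy), scalar(@sections);
```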

  - Logan
-- 
"Everybody
 Loves to see              
 Justice done
 On somebody else"     ( Bruce Cockburn, "Justice", 1981 )


------------------------------

Date: Mon, 24 Sep 2001 03:34:58 GMT
From: Andrew Cady <please@no.spam>
Subject: Re: Regular Expression Problem
Message-Id: <878zf59sbf.fsf@homer.cghm>

scott_hill2@hotmail.com (Scott) writes:

> Hello, I'm just learning Regular Expressions, and could use some
> help.
> 
> I have some bits of data in a long string of newline terminated
> text, and I need to pull it out. The string looks like this.
> 
> (please forgive the Java style code...i am using a library)
> 
> myString = " ....multi-lines of text i don't need...\n"
> myString += "#DATA"     // this flags my data section
> myString += "..multi-lines of data i DO need....\n"
> myString += "#0"        // this flags the end of my data.
> 
> 
> I am trying something like /#DATA([^#0]+)/
> 
> but this isn't working. I am trying to use the ^ to mean 'not #0',
> but I think Perl is reading it as 'not #'.
> 
> Can anyone help? It would be much appreciated.

In a regular expression, [] creates a character class.  A character
class means "match any SINGLE character listed" (or, when ^ is used,
any SINGLE character NOT listed).  IOW, [abcdefg] will match a, or b,
or c, or d...  But it will match at most one character (of course, it
is also subject to modifiers such as * or + in which case the compound
expression will match multiple times).

There are lots of ways to do what you want.  The closest to what
you're trying to use character classes for is with negative lookahead,
but that's not the best.
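
For reference, a sketch of what the negative-lookahead version might
look like, next to the character-class attempt. The sample string is
invented, and deliberately contains a lone '#' and a lone '0' inside the
data.

```perl
use strict;
use warnings;

# Invented sample: the data itself contains '#' and '0', but never "#0".
my $string = "junk\n#DATAline one\nprice: #5, qty: 0\n#0\ntrailer\n";

# Character class: stops at the first '#' OR '0', whichever comes first.
my ($by_class) = $string =~ /#DATA([^#0]+)/;

# Negative lookahead: take one character at a time, but only while the
# two-character sequence "#0" does not start at the current position.
my ($by_lookahead) = $string =~ /#DATA((?:(?!#0).)*)/s;

print "class:     [$by_class]\n";
print "lookahead: [$by_lookahead]\n";
```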

/#DATA(.*?)#0/s is probably the easiest (the /s lets . match the
newlines in your data).  The *? is like regular *
except that after matching each . (or whatever it's modifying) it
checks to see if the rest of the regex matches, and if it does it will
stop.  The regular * will match as many .'s (or whatever's) as it can;
this is called greediness.  As you can imagine, greediness will
prevent you from having multiple text/data sections following one
after another.  It will match right through from the beginning of
the first data section to the end of the last.  Non-greediness can be
slow, though, because it has to check the rest of the regex for every
character (although if the rest of the regex is just "#0" that won't
be too bad).

Iff you know for sure the #0 you want is the last #0 in the entire
string, you can just use /#DATA(.*)#0/s, which may be faster than the
non-greedy *?  (or, since you're writing the data yourself, just leave
off the #0 and use /#DATA(.*)/s ).

HTH.


------------------------------

Date: Mon, 24 Sep 2001 03:41:39 GMT
From: Andrew Cady <please@no.spam>
Subject: Re: Regular Expression Problem
Message-Id: <873d5d9s0a.fsf@homer.cghm>

Jeff <jeffplus@mediaone.net> writes:

> Scott-
> 
> Perl is interpreting that to mean:
> 
> "The text '#DATA' followed by one or more characters that are
> neither '#' nor '0'"
> 
> You want this:
> 
> $_ = $the_string_with_the_data;
> /^#DATA(.*)^#0/ms;
> my $data = $1;
> 
> The 's' modifier will let the '.' match newlines, and the 'm'
> modifier will let the caret ('^') match the position right _after_
> any embedded newline.
> 
> -jms

Yeesh, somehow I was under the impression that the entire string was
terminated by a newline.


------------------------------

Date: Sun, 23 Sep 2001 19:29:07 -0400
From: Benjamin Goldberg <goldbb2@earthlink.net>
Subject: Re: search, replace, functions, text wrapping (hard question! I think!)
Message-Id: <3BAE7043.AB552BFC@earthlink.net>

Richard Lawrence wrote:
[snip]
> Now my regexp utterly breaks because there are two spaces in there
> that have to be taken into consideration. What I'm looking for is the
> input of
> 
> this is a really good site i
>   found it at http://www.fish
>   andchips.com and its great.
> 
> to become:
> 
> this is a really good site i
>   found it at <a href="http://www.fishandchips.com">http://www.fish
>   andchips.com and its great.
> 
> (note how the fishandchips domain within the href is lacking the
> newline and takes into account the double spacing).

my $thefile = ....; # read whole document.

use URI ();

# Even though file: is a valid file prefix, we *don't* want
# something like File::Find to be considered a url...
# so the first char after "scheme:" must be any valid url
# character *except* a colon.

(my $cheat = $URI::uric) =~ tr/://d;

$thefile =~ s/&/&amp;/g; $thefile =~ s/</&lt;/g;
$thefile =~ s[
    ($URI::scheme_re\:[$cheat][$URI::uric#]*)
    ((?:\n[ ]{2}[$URI::uric#]+)*)
][
    (my $continued = $2) =~ tr/ \n//d;
    qq[<A href="$1$continued">$1</A>$2];
]oxge;
print "<PRE>$thefile</PRE>\n";

The regular expression is loosely based on the one in URI::Find.

This code might end up thinking something is a continuation which isn't,
for example:

this is a wrapped line with
  a complete url followed by
  some words: http://www.foo.com
  is my favorite place to go!

It will end up thinking the url is "http://www.foo.comis", which is
wrong.  There's no simple way of fixing that problem... so I just hope
that it doesn't happen often.

Also, you might prefer to change "$1</A>$2" to "$1$2</A>" ... the first
one matches your example, but the second may make more sense [it does to
me, anyway].

NB, this code is untested.
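
A self-contained sketch of the same idea that does not reach into
$URI::uric. The character class below is a hand-written approximation of
the characters allowed in a URI, and link_urls is a hypothetical helper
name; neither comes from URI.pm or URI::Find.

```perl
use strict;
use warnings;

# Assumption: hand-written approximation of URI characters, plus '#'.
my $uric = q{A-Za-z0-9;/?:@&=+$,_.!~*'()%#-};
(my $first = $uric) =~ s/://;    # same class, minus the colon

# Hypothetical helper: escape the text, then wrap each URL in a link.
sub link_urls {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s{
        ( [A-Za-z][A-Za-z0-9+.-]* : [$first] [$uric]* )
        ( (?: \n [ ]{2} [$uric]+ )* )
    }{
        (my $continued = $2) =~ tr/ \n//d;
        qq{<A href="$1$continued">$1</A>$2};
    }xge;
    return $text;
}

my $wrapped = "this is a really good site i\n"
            . "  found it at http://www.fish\n"
            . "  andchips.com and its great.\n";
my $html = link_urls($wrapped);
print "<PRE>$html</PRE>\n";
```

The first group captures scheme:rest on the current line; the second
captures any continuation lines that start with exactly two spaces. The
false-continuation problem described above applies to this sketch as
well.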

-- 
"I think not," said Descartes, and promptly disappeared.


------------------------------

Date: Sun, 23 Sep 2001 23:46:28 GMT
From: mgjv@tradingpost.com.au (Martien Verbruggen)
Subject: Re: search, replace, functions, text wrapping (hard question! I think!)
Message-Id: <slrn9qst2j.eug.mgjv@verbruggen.comdyn.com.au>

On Sun, 23 Sep 2001 19:29:07 -0400,
	Benjamin Goldberg <goldbb2@earthlink.net> wrote:
> Richard Lawrence wrote:
> [snip]
>> Now my regexp utterly breaks because there are two spaces in there
>> that have to be taken into consideration. What I'm looking for is the
>> input of
> 
> (my $cheat = $URI::uric) =~ tr/://d;

Hmm. Aren't you getting a bit too chummy with the URI implementation
here? I mean, the $uric variable isn't exactly documented, so one
would need to assume that it may disappear at some time in the future.

Maybe a warning should be added..

Martien
-- 
Martien Verbruggen              | 
Interactive Media Division      | In the fight between you and the
Commercial Dynamics Pty. Ltd.   | world, back the world - Franz Kafka
NSW, Australia                  | 


------------------------------

Date: Sun, 23 Sep 2001 23:00:27 -0400
From: Benjamin Goldberg <goldbb2@earthlink.net>
Subject: Re: search, replace, functions, text wrapping (hard question! I think!)
Message-Id: <3BAEA1CB.1C1CC4A6@earthlink.net>

Martien Verbruggen wrote:
> 
> On Sun, 23 Sep 2001 19:29:07 -0400,
>         Benjamin Goldberg <goldbb2@earthlink.net> wrote:
> > Richard Lawrence wrote:
> > [snip]
> >> Now my regexp utterly breaks because there are two spaces in there
> >> that have to be taken into consideration. What I'm looking for is
> >> the input of
> >
> > (my $cheat = $URI::uric) =~ tr/://d;
> 
> Hmm. Aren't you getting a bit too chummy with the URI implementation
> here? I mean, the $uric variable isn't exactly documented, so one
> would need to assume that it may disappear at some time in the future.

Ehh, I took this from URI::Find, more or less.  To be honest, I haven't
a clue as to what's actually in it.  Well, I assume that it's a string
with all of the characters which are valid in the somethingorother part
of scheme:somethingorother, but I didn't get that from looking
into URI.pm.

> Maybe a warning should be added..

You mean add a warning in my code?  Or in URI.pm?

-- 
"I think not," said Descartes, and promptly disappeared.


------------------------------

Date: Mon, 24 Sep 2001 03:21:35 GMT
From: "Scott Wessels" <swessels@usgn.net>
Subject: sub key index sorting
Message-Id: <3Fxr7.51033$aZ6.12809181@news1.rdc1.az.home.com>

I was wondering if I might improve upon a sort routine for the following
data structure:

my $logData = {
    1 => {
        _create => '20010919',
        _modify => '20010919',
    },
    2 => {
        _create => '20010920',
        _modify => '20010921'
    },
    ...
};

Where the primary sort key will be $logData->{$a}->{_create} and the
secondary will be the key itself ($a). This needs to be a numeric sort and
the data returned needs to be an index of $logData's keys.

I've appended my current attempts at this sort to the end of this post.
Presently the sprintf routine beats the others by over two-fold, though I
have concerns that it may not be the most efficient of routines.

Any help, thoughts, tips would be greatly appreciated.
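
To pin down the ordering being asked for, here is a tiny worked example
with three invented records. It shows the plain two-level sort next to a
packed-string sort in the spirit of the sprintf routine; the variable
names are illustrative only.

```perl
use strict;
use warnings;

# Invented records; keys 2 and 10 share a _create date.
my $logData = {
    1  => { _create => '20010920', _modify => '20010921' },
    2  => { _create => '20010919', _modify => '20010919' },
    10 => { _create => '20010919', _modify => '20010925' },
};

# Reference ordering: _create first, then the key itself, both numeric.
my @plain = sort {
    $logData->{$a}{_create} <=> $logData->{$b}{_create}
        || $a <=> $b
} keys %$logData;

# Packed-string variant: date followed by the zero-padded key, so a
# default string sort gives the same order.
my $width = 0;
for my $key ( keys %$logData ) {
    $width = length($key) if length($key) > $width;
}
my @packed = map { sprintf '%s%0*d', $logData->{$_}{_create}, $width, $_ }
    keys %$logData;
my @idx = map { int( substr $_, 8 ) } sort @packed;

print "plain:  @plain\n";
print "packed: @idx\n";
```

The padding width has to be the length of the longest key; otherwise key
10 would sort before key 2 as a string.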

Thanks,
Scott

-----Code-----
#!/usr/bin/perl -w

use strict;
use Benchmark qw/ cmpthese /;

#my $type = 'create';

my $logData = randData(100_000);

cmpthese(5, {
    simple      => sub {
        my @idx = sort {
                $logData->{$a}->{_create} <=> $logData->{$b}->{_create}
                    ||
                $a <=> $b
            } keys %$logData;
    },
    ref         => sub {
        my @idx = map { $_->[0] }
            sort {
                $a->[1] <=> $b->[1]
                        ||
                $a->[0] <=> $b->[0]
            }
            map { [ $_, $logData->{$_}->{_create} ] }
            keys %$logData;
    },
    sprintf     => sub {
        # Width of the longest key, so keys are zero-padded uniformly.
        my $length = 0;
        foreach (keys %$logData) {
            $length = length($_) if length($_) > $length;
        }

        my @idx = map { int(substr $_, 8) }
            sort
            map { sprintf qq{$logData->{$_}->{_create}%0${length}d}, $_ }
            keys %$logData;
    }
});

sub randData {
    my $count = shift || 50;

    my %data;

    # sprintf is parenthesized so it cannot swallow the _modify pair
    # that follows it in the list.
    for my $x (1 .. $count) {
        $data{$x} = {
            _create => sprintf('2001%02d%02d', int(rand 12) + 1, int(rand 31) + 1),
            _modify => sprintf('2001%02d%02d', int(rand 12) + 1, int(rand 31) + 1),
        };
    }

    return \%data;
}




------------------------------

Date: Sun, 23 Sep 2001 20:52:49 -0400
From: Benjamin Goldberg <goldbb2@earthlink.net>
Subject: Re: win32 stat in directory with 4682 files
Message-Id: <3BAE83E1.1DB49809@earthlink.net>

vze26dpn wrote:
[snip]
> > > 10:15:56.......................10:17:17
> >
> > So you're stat()ing 4600 files in 81 seconds, or about 56.8
> > stats/sec.
> >
> > Since you're not doing anything with the information anyway, I'd
> > take out the call to stat() :-)
> 
> So would I if this was not just a test case. I narrowed it down to
> stat() being the pig. But stat is what I really need to do. I didn't
> see anything  in the 3rd edition camel book or in 'a mess o' google
> searches to suggest a way around it.
> 
> > There's nothing else you can do from inside Perl to make stat() any
> > faster. Try a different OS? Under NT, the filesystem runs as a
> > separate process, so each stat() causes at least two (and maybe
> > more) context switches.
> 
> I would gladly get divorced from all Micro$oft code. I just can't
> afford the alimony payments;-)
> 
> > Perhaps Win32::File provides an interface that would let you get the
> > information you need all in one go. But then your script would be
> > tied to Win32 machines (and maybe that's OK with you).
> 
> If runtime penalty is no more than 3x, I think that is an acceptable
> price to pay for portability for a utility that only taxes my
> patience. But this is taking about 30x as long as what I consider
> reasonable. My feeling is that I should be able to do all the stats in
> less time than it takes dos to complete a 'dir /b /s' on the same
> directory tree.

And how long is that?

Also, when you say 'tree' that implies that you're recursively going
through directories... If you're using File::Find to do this, then keep
in mind that it already stat()s each file once, and if you stat() it
again yourself, that significantly slows things down.

> What I am doing is a recursive descent with lstat and then running
> md5 checksums on all the files in the tree. That takes a long time
> even when stat or lstat are not pigs.

Aha, yup you are doing this recursively.

Are you using File::Find?  If so, remember that just before your wanted
sub is called, a stat() was just done... so there will be a valid stat
structure stored in the _ filehandle.  Call stat(_) instead of
stat($File::Find::name), and you should get significantly faster
results.
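
A sketch of reusing the stat buffer through the special _ filehandle.
The File::Find documentation, as I read it, only promises a prior lstat
when follow or follow_fast is set, so this example does one lstat of its
own per entry and then reuses _ for the later tests. The temporary
directory and file names are made up for the demonstration.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Throwaway tree for the demo.
my $dir = tempdir( CLEANUP => 1 );
for my $name (qw(a.txt b.txt)) {
    open my $fh, '>', "$dir/$name" or die "Can't create $dir/$name: $!";
    binmode $fh;
    print {$fh} "hello";
    close $fh;
}

my ( %size, %mtime );
find(
    sub {
        my @st = lstat($_) or return;   # one real lstat per entry
        return unless -f _;             # reuses the buffer, no new lstat
        $size{$_}  = -s _;              # likewise
        $mtime{$_} = $st[9];
    },
    $dir
);

print "$_: $size{$_} bytes\n" for sort keys %size;
```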

If you aren't using File::Find...

On any half-decent OS, calling opendir and having it fail should take no
more time than calling stat on the same filename.  Thus, you could use
opendir as your directory test and pass the resulting handle to the
recursive call.  Also, you can stat a filehandle, which may be faster
than stating a filename... opening+stating might be slower than stating
directly, but if you're going to open the file anyway to do the
digest...

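use Digest::MD5;

my $ctx = Digest::MD5->new;
sub recurse {
	my ($path, $dirfh) = @_;
	while( defined( my $filename = readdir $dirfh ) ) {
		# Skip the self and parent entries, or we recurse forever.
		next if $filename eq '.' or $filename eq '..';
		my $fullpath = "$path\\$filename";
		if( opendir( my ($fh), $fullpath ) ) {
			recurse($fullpath, $fh);	# same order as recurse() unpacks
			closedir $fh;
		} elsif( open( $fh, "<$fullpath" ) ) {
			binmode $fh;	# digest the raw bytes; matters on Win32
			my @x = stat($fh);
			my $d = $ctx->addfile($fh)->digest;
			# do stuff with @x and $d.
			close $fh;
		} else {
			warn "Couldn't open $fullpath: $!";
		}
	}
}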


> Getting (l)stats of all the files in one swell foop is kinda what I'd
> like to do.

That may be slower than getting the stat just before opening the file,
due to caching.  You're likely best off doing the stat and opening the
file within a line or two of each other.

The reason I think that stating the filehandle might be faster than
stating the filename, is that it's possible [I'm not sure, since I don't
know much if anything about Win32 internals] that when you open a file,
its stat structure is all pulled off of disk and put into memory
somewhere, so stating the filehandle might get it from memory, and
stating the filename might get it off of disk again.

> Win32::File has only 2 methods documented in the ActiveState
> distribution:
> 
> GetAttributes(filename, returnedAttributes)
> SetAttributes (filename, newAttributes)

How does the speed of GetAttributes compare to stat?
Also... if returnedAttributes is a structure [I haven't looked at the
docs] which you would have to unpack(), consider that you can use the
"@" and "x" features of unpack to skip those parts of the structure
which you don't need.
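
A short sketch of the "x" and "@" templates. Whether GetAttributes
returns anything that needs unpacking is not established here; the
12-byte record layout below is invented purely to show the skipping.

```perl
use strict;
use warnings;

# Invented record: 32-bit, 16-bit, 16-bit, 32-bit big-endian fields.
my $record = pack 'N n n N', 0xDEADBEEF, 7, 42, 1000;

# "x4" skips four bytes, "x2" skips two: read only fields 2 and 4.
my ( $second, $fourth ) = unpack 'x4 n x2 N', $record;

# "@6" jumps to absolute byte offset 6: read only field 3.
my ($third) = unpack '@6 n', $record;

print "second=$second third=$third fourth=$fourth\n";
```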

> Are there any methods that ActiveState is not listing?
> 
> My other option is to look into Win32::API hooks. That would
> be really ugly. Before I do that I wanna make sure there is not
> a cleaner solution.

If "dir" is running faster than the fastest perl code you can write, see
if you can learn how it does it.

As a last resort, consider running dir in lieu of calling stat
multiple times... after all, if it does it faster than you can, and the
overhead of io is low, you may come out ahead [open2/open3, if it works,
should have much faster throughput than open(...|), due to Win32's
emulation of popen, so try it].

Another possibility is to use fork to create two threads, one which does
stat and writes to a pipe, the other which reads from the pipe and does
the rest of the work.  If the slowdown of stat is because it does IO
with a 'filesystem process' of some sort, as someone elsewhere in this
thread said, then this may allow us to remove some of the latency.

> I guess this could indicate a need for another perl module, but
> 4000+ entry directories are not the norm anywhere I've ever
> been, so I don't know if the module would serve any wide spread
> need.

-- 
"I think not," said Descartes, and promptly disappeared.


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

The Perl-Users Digest is a retransmission of the USENET newsgroup
comp.lang.perl.misc.  For subscription or unsubscription requests, send
the single line:

	subscribe perl-users
or:
	unsubscribe perl-users

to almanac@ruby.oce.orst.edu.  

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

To request back copies (available for a week or so), send your request
to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
where x is the volume number and y is the issue number.

For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending perl questions to the -request address, I don't have time to
answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V10 Issue 1798
***************************************

