[32095] in Perl-Users-Digest
Perl-Users Digest, Issue: 3359 Volume: 11
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Wed Apr 20 21:09:26 2011
Date: Wed, 20 Apr 2011 18:09:11 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Wed, 20 Apr 2011 Volume: 11 Number: 3359
Today's topics:
Re: A regex to search for numeric ranges... sln@netherlands.com
Re: FAQ 3.8 Is there a pretty-printer (formatter) for P <ac.russell@live.com>
for @{ my $x } on Perl 5.10 (bug?) <john@castleamber.com>
Re: for @{ my $x } on Perl 5.10 (bug?) <derykus@gmail.com>
Re: parsing a command line, but not the usual problem <*@eli.users.panix.com>
Re: Perl RegExp question <cartercc@gmail.com>
Re: Perl RegExp question <tzz@lifelogs.com>
Re: Perl RegExp question <tzz@lifelogs.com>
Re: Perl RegExp question <cartercc@gmail.com>
Re: Perl RegExp question <tzz@lifelogs.com>
Re: Perl RegExp question <cartercc@gmail.com>
Re: Perl RegExp question <tzz@lifelogs.com>
Re: Perl RegExp question <cartercc@gmail.com>
Re: Perl RegExp question <jimsgibson@gmail.com>
Re: Perl RegExp question <tzz@lifelogs.com>
Web Scraping Proxy <Groleau+news@FreeShell.org>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Wed, 20 Apr 2011 16:02:47 -0700
From: sln@netherlands.com
Subject: Re: A regex to search for numeric ranges...
Message-Id: <9jnuq6ld71rg0r4i0q19f1su8vake1445h@4ax.com>
On Tue, 19 Apr 2011 12:35:56 -0700 (PDT), Mr P <misterperl@gmail.com> wrote:
>I read up on this on the www and I found ideas like
>
>if ( /\b([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\b/ ) ...
>
>which is pretty indecipherable at a glance and just in general not
>elegant in any sense.
>
>I generally do something like
>
> if ( /(\d+)/ && $1 > 256 && $1 < 1024 )
>
>
>Which to me is a lot more readable at a glance, but like the example
>above not overly elegant..
>
>But what I'd REALLY like to do is, similar to the trick for numeric
>sort, a way to do it in the regex like
>
>/[256-1024]/ # but force it to be numeric, not literal perhaps with a
>switch
>
>Thoughts, Masters?
Something like /[256-1024]/ is possible in general.
It has limitations that affect the surrounding expressions, but these
can be worked around and the approach generalized (again, within
specific limitations).
-sln
-----------------------
use strict;
use warnings;
my $str = '0001023 widgets';
# Inline code is going to be a thing of the future and definitely
# going to happen (see perl 6 regex).
# This allows parameter checking and is useful when the source
# has extended data to be regex analyzed in one expression.
if ($str =~ / \b (\d+) \b
      (?(?{$^N > 256 && $^N < 1024})  # is this number between 256-1024?
                                      # yes, continue processing
        |
          (*FAIL)                     # no, fail outright
      )
      # more expressions here ..
      \s*
      (.+)
    /x )
{
    print "Number: '$1', Type: '$2'\n";
}
else {
    print "failed\n";
}
print "\n";
# This does a source conversion of \d+ to a single utf8 character.
# It then allows checking it in a HEX numeric range character class.
# Even though the source is decimal, '1023', when magically assumed to
# be hex and converted to a utf8 char like "\x{1023}", its code point
# will be corectly matched within a regex character class range.
# Example: "\x{1023}" =~ /[\x{257}-\x{1023}]/ will match.
# And, only "\x{N}" where N is between 257-1023 will match.
for (0 .. 4096)
{
    # Construct a fake string using the current counter.
    # In reality, you have to parse the source string and do the conversion
    # so that you end up doing something like this:
    # $src =~ /^(.*?)\b(\d+)\b(.*?)$/
    # eval "\$temp_src = \"$1\\x{$2}$3\" ";
    # Then use the $temp_src in place of the $str below.
    my $padded_string = "000$_"; # the extra '000' padding is just a test
    eval "\$str = \"\\x{$padded_string} widgets\" ";
    if ( $str =~ /^ ([\x{257}-\x{1023}])
                    \s*
                    (.+)
                /x )
    {
        print "Number: '$padded_string', Type: '$2'\n";
    }
}
__END__
Output
------------
Number: '0001023', Type: 'widgets'
Number: '000257', Type: 'widgets'
Number: '000258', Type: 'widgets'
Number: '000259', Type: 'widgets'
Number: '000260', Type: 'widgets'
Number: '000261', Type: 'widgets'
Number: '000262', Type: 'widgets'
Number: '000263', Type: 'widgets'
Number: '000264', Type: 'widgets'
Number: '000265', Type: 'widgets'
Number: '000266', Type: 'widgets'
Number: '000267', Type: 'widgets'
...
...
Number: '0001012', Type: 'widgets'
Number: '0001013', Type: 'widgets'
Number: '0001014', Type: 'widgets'
Number: '0001015', Type: 'widgets'
Number: '0001016', Type: 'widgets'
Number: '0001017', Type: 'widgets'
Number: '0001018', Type: 'widgets'
Number: '0001019', Type: 'widgets'
Number: '0001020', Type: 'widgets'
Number: '0001021', Type: 'widgets'
Number: '0001022', Type: 'widgets'
Number: '0001023', Type: 'widgets'
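A minimal sketch of the same code-point trick without string eval,
assuming the number consists of plain ASCII digits. hex() reads the
digit string as hexadecimal and chr() builds the character, so no part
of the source text is ever evaluated as code. The helper name
match_in_range is hypothetical. The trick preserves order because a
string of decimal digits, read as hex, compares the same way the
decimal number does.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the matched digits if the first number in
# $text lies in 257..1023, and undef otherwise.
sub match_in_range {
    my ($text) = @_;

    # [0-9] rather than \d, so only ASCII digits reach hex().
    my ($digits, $rest) = $text =~ /\b([0-9]+)\b(.*)/s
        or return undef;

    # Read the decimal digits as hex and turn them into one character,
    # e.g. '0001023' becomes "\x{1023}".
    my $probe = chr(hex($digits)) . $rest;

    # The range test now happens inside an ordinary character class.
    return $probe =~ /^[\x{257}-\x{1023}]/ ? $digits : undef;
}

for my $sample ('256 widgets', '257 widgets', '0001023 widgets', '1024 widgets') {
    my $hit = match_in_range($sample);
    printf "%-18s %s\n", $sample, defined $hit ? "matched $hit" : 'no match';
}
```

Numbers with many digits would overflow hex() or go past the code
points chr() is meant for, so this only suits small ranges like the one
in the question.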
------------------------------
Date: Wed, 20 Apr 2011 21:05:41 -0400
From: Adam Russell <ac.russell@live.com>
Subject: Re: FAQ 3.8 Is there a pretty-printer (formatter) for Perl?
Message-Id: <966b2$4daf82e7$ad30ae40$30387@news.eurofeeds.com>
Also, there is a pretty good Perl plugin for NetBeans
http://netbeans.mojgorod.ru/perl.html
that formats Perl code within the NetBeans IDE.
------------------------------
Date: Wed, 20 Apr 2011 14:53:23 -0500
From: John Bokma <john@castleamber.com>
Subject: for @{ my $x } on Perl 5.10 (bug?)
Message-Id: <871v0wwvgs.fsf@castleamber.com>
perl -e 'use strict; use warnings; print for @{ my $x }'
Can't use an undefined value as an ARRAY reference at -e line 1.
This is perl, v5.8.8 built for x86_64-linux-thread-multi
perl -e 'use strict; use warnings; print for @{ my $x }'
This is perl, v5.10.0 built for x86_64-linux-gnu-thread-multi
Is this a known bug? At least, I assume that the latter working is a bug.
--
John Bokma j3b
Blog: http://johnbokma.com/ Facebook: http://www.facebook.com/j.j.j.bokma
Freelance Perl & Python Development: http://castleamber.com/
------------------------------
Date: Wed, 20 Apr 2011 15:52:18 -0700 (PDT)
From: "C.DeRykus" <derykus@gmail.com>
Subject: Re: for @{ my $x } on Perl 5.10 (bug?)
Message-Id: <1b5b3080-8a6b-485c-b1c1-2bd4d7e40cc9@j35g2000prb.googlegroups.com>
On Apr 20, 12:53 pm, John Bokma <j...@castleamber.com> wrote:
> perl -e 'use strict; use warnings; print for @{ my $x }'
> Can't use an undefined value as an ARRAY reference at -e line 1.
>
> This is perl, v5.8.8 built for x86_64-linux-thread-multi
>
> perl -e 'use strict; use warnings; print for @{ my $x }'
>
> This is perl, v5.10.0 built for x86_64-linux-gnu-thread-multi
>
> Is this a known bug? At least, I assume that the latter working is a bug.
>
Same with 5.12.2:
perl -Mstrict -wle "print if @{my $x}"
Can't use an undefined value as an ARRAY reference at -e line 1.
perl -Mstrict -wle "print for @{my $x}"
At the very least it seems quirky that the
former fails and the latter doesn't.
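
The difference between the two one-liners can be reproduced in a
script. This is a sketch, assuming a perl that behaves like the 5.10
and 5.12.2 results quoted above. As far as I can tell the cause is that
foreach aliases its list elements, so the dereference is treated as
modifiable and the undefined scalar is autovivified into an empty array
reference, whereas the if test only reads the array and dies under
strict refs.

```perl
use strict;
use warnings;

# foreach: the dereference is in a modifiable context.
my $x;
for my $item (@{ $x }) { }          # loop body never runs
print "after for: ", (ref($x) || 'undef'), "\n";

# if: the dereference is only read, so strict refs objects.
my $y;
my $ok = eval { if (@{ $y }) { } 1 };
(my $error = $@) =~ s/ at .*//s;    # drop the "at FILE line N." tail
print "after if:  ", ($ok ? 'no error' : $error), "\n";
```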
--
Charles DeRykus
------------------------------
Date: Wed, 20 Apr 2011 23:01:04 +0000 (UTC)
From: Eli the Bearded <*@eli.users.panix.com>
Subject: Re: parsing a command line, but not the usual problem
Message-Id: <eli$1104201901@qz.little-neck.ny.us>
In comp.lang.perl.misc, Marc Girod <marc.girod@gmail.com> wrote:
> On Apr 19, 11:11 pm, Eli the Bearded <*...@eli.users.panix.com> wrote:
> > It's not exactly what I wanted, but it is a good start. I might just
> > borrow code from it rather than use it directly.
> What about Text::ParseWords, then?
Shell::Parser uses Text::ParseWords behind the scenes, but I'm
beginning to think Parse::Lex and/or Parse::Yapp are more my style
in the absence of a readymade parser. I've used lex(1) for C code
generation in the past. Regular grammars convert easily to lex/yacc.
Offhand, I think I'm looking for something like:
# Lexer bits
DIGIT     [0-9]
NUMBER    [-]?{DIGIT}+([.]{DIGIT}+)?
LEAD      [a-zA-Z_]
TRAIL     [a-zA-Z0-9_]
IDENT     {LEAD}{TRAIL}*
WS        [ \t\n]+
ENDCMD    [;]
SQUOTE    '(\\.|[^\\'])*'
DQUOTE    "(\\.|[^\\"])*"
VAR       [$]{IDENT}
# Yapp bits
# Note this allows 'concatenation of '"adjoining $strings"' with different'
# variable expansion rules inside the strings.
STRING    ({SQUOTE}|{DQUOTE})+
COMMAND   {IDENT}({WS}{VAR}
                 |{WS}{STRING}
                 |{WS}{IDENT}
                 |{WS}{NUMBER}
                 |{WS}
                 )*{ENDCMD}
ASSIGN    {IDENT}[=]{STRING}{ENDCMD}
STATEMENT {WS}*({ASSIGN}|{COMMAND})
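
One hedged way to prototype the lexer half in plain Perl, before
committing to a lexer or parser module, is a \G-anchored scanner. This
sketch takes TRAIL to cover all letters, treats = as its own token, and
discards whitespace (so it cannot tell adjoining strings from separated
ones). The token table and tokenize() are illustrative names, not part
of any module mentioned here.

```perl
use strict;
use warnings;

# Token classes, tried in order at the current position.
my @TOKENS = (
    [ WS     => qr/[ \t\n]+/ ],
    [ NUMBER => qr/-?[0-9]+(?:\.[0-9]+)?/ ],
    [ VAR    => qr/\$[a-zA-Z_][a-zA-Z0-9_]*/ ],
    [ IDENT  => qr/[a-zA-Z_][a-zA-Z0-9_]*/ ],
    [ SQUOTE => qr/'(?:\\.|[^\\'])*'/ ],
    [ DQUOTE => qr/"(?:\\.|[^\\"])*"/ ],
    [ EQUALS => qr/=/ ],
    [ ENDCMD => qr/;/ ],
);

# Returns a list of [NAME, TEXT] pairs; dies on input no class accepts.
sub tokenize {
    my ($input) = @_;
    my @out;
    pos($input) = 0;
    TOKEN: while (pos($input) < length($input)) {
        for my $token (@TOKENS) {
            my ($name, $re) = @$token;
            # /gc keeps pos() where it was when this class does not match.
            if ($input =~ /\G($re)/gc) {
                push @out, [ $name, $1 ] unless $name eq 'WS';
                next TOKEN;
            }
        }
        die sprintf("Unrecognized input at offset %d\n", pos($input));
    }
    return @out;
}

print join(' ', map { $_->[0] } tokenize(q{set $x "a b" 'c' 42;})), "\n";
```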
Elijah
------
has started the code, but isn't finished
------------------------------
Date: Wed, 20 Apr 2011 06:26:52 -0700 (PDT)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: Perl RegExp question
Message-Id: <d0398968-8464-4a64-8728-436a49185016@j13g2000pro.googlegroups.com>
On Apr 20, 8:44 am, Keith <keithdlee2...@gmail.com> wrote:
> Azra:
> Yes, I have learned about the limitations of Perl RegExp or RegExp in
> general. Thank you.
>
> Keith
This isn't a limitation of regular expressions. A regular expression
is a pattern, and the regular expression language is a pattern-matching
language, like Prolog, for example. In order to use it, you must have
patterns.
To illustrate, HTML is also a 'pattern matching language' in a sense.
Look at examples (a) and (b):
(a)
<html>
<head>
<title>My Home Page</title>
</head>
<body>
<h1>Keith's Home Page</h1>
<p>How do you like it?</p>
</body>
</html>
(b)
My Home Page Keith's Home Page How do you like it?
Now, suppose you had (b)? Would you say that it's valid HTML? Of
course not! The same is true for your data, YOU DON'T HAVE PATTERNS TO
MATCH.
At the risk of being a little insulting (you may have earned it)
you've been slow on the uptake. GIGO. Your input is garbage, so your
output will be garbage.
I'm not saying that your data is invalid necessarily -- I can't
determine that and make no judgment on that point. What I'm saying is
that you do not have valid input to feed to a regular expression to
generate the kind of output that you want.
If you want my advice, I would input the data with some kind of
delimited file format, and then use split() or related to break it
apart.
my $input = q(King of Royal Mounted:Murderer's Row:Chapter 2.AVI);
my ($show, $episode, $avi) = split(/:/, $input);
$avi =~ /(.+)\.AVI/;   # escape the dot so it matches a literal '.'
my $chapter = $1;
my $vid = "$episode.AVI";
print qq(show: $show\nepisode: $episode\navi: $avi\nchapter: $chapter\nvid: $vid\n);
outputs this:
show: King of Royal Mounted
episode: Murderer's Row
avi: Chapter 2.AVI
chapter: Chapter 2
vid: Murderer's Row.AVI
CC.
------------------------------
Date: Wed, 20 Apr 2011 09:29:54 -0500
From: Ted Zlatanov <tzz@lifelogs.com>
Subject: Re: Perl RegExp question
Message-Id: <87aaflf125.fsf@lifelogs.com>
On Tue, 19 Apr 2011 21:56:44 +0000 (UTC) Keith <keithdlee2000@gmail.com> wrote:
K> What if you don't know what the title is exactly in an .avi file?
K> That is, you know that it's the first word(s) of the file name but
K> nothing more?
Assuming you're talking about TV shows, try
http://thetvdb.com/?tab=advancedsearch (it's completely open and has a
developer API).
Another approach is, if you can't match any known shows, ask for a name
and add it to the show list, then save the list. So the list grows as
you encounter more shows.
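
A minimal sketch of the known-show approach, assuming file names laid
out as "<show> <episode> Chapter <n>.avi" (inferred from the examples
in this thread) and a hypothetical in-memory show list. split_name() is
an illustrative name.

```perl
use strict;
use warnings;

# Hypothetical list of known shows; in practice it would be loaded from
# a saved file or looked up in a show database.
my @known_shows = ('King of Royal', 'King of Royal Mounted');

# Returns (show, episode, chapter), or an empty list if no show matches.
sub split_name {
    my ($filename, @shows) = @_;
    return unless @shows;

    # Longest names first, so 'King of Royal Mounted' wins over
    # 'King of Royal'.
    my $alternation = join '|',
        map  { quotemeta $_ }
        sort { length($b) <=> length($a) } @shows;

    return $filename =~ /^($alternation)\s+(.+?)\s+(Chapter\s+\d+)\.avi$/i
        ? ($1, $2, $3)
        : ();
}

my ($show, $episode, $chapter) =
    split_name("King of Royal Mounted Murderer's Row Chapter 2.AVI", @known_shows);
print "show: $show\nepisode: $episode\nchapter: $chapter\n";
```

When split_name() returns an empty list, the caller can ask for the
show name, push it onto the list, and save the list for the next run.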
Ted
------------------------------
Date: Wed, 20 Apr 2011 09:37:14 -0500
From: Ted Zlatanov <tzz@lifelogs.com>
Subject: Re: Perl RegExp question
Message-Id: <8739ldf0px.fsf@lifelogs.com>
On Wed, 20 Apr 2011 06:26:52 -0700 (PDT) ccc31807 <cartercc@gmail.com> wrote:
c> On Apr 20, 8:44 am, Keith <keithdlee2...@gmail.com> wrote:
>> Yes, I have learned about the limitations of Perl RegExp or RegExp in general. Thank you.
c> This isn't a limitation of regular expressions. A regular expression
c> is a pattern and the regular expression is a pattern matching
c> languages, like Prolog, for example. In order to use it, you must have
c> patterns.
...
c> The same is true for your data, YOU DON'T HAVE PATTERNS TO MATCH.
Sure he does. It's not as if he's looking for DNA patterns that need to
be statistically determined. There are only so many shows he can have
in his database.
Perl regular expressions are definitely not just about matching
patterns. Especially if you consider that you can embed Perl code right
in them. Many regexp limitations just don't apply.
Ted
------------------------------
Date: Wed, 20 Apr 2011 08:53:54 -0700 (PDT)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: Perl RegExp question
Message-Id: <fd91314e-0af8-45c8-8547-c17ba154b51d@v36g2000prm.googlegroups.com>
On Apr 20, 10:37 am, Ted Zlatanov <t...@lifelogs.com> wrote:
> c> The same is true for your data, YOU DON'T HAVE PATTERNS TO MATCH.
>
> Sure he does. It's not as if he's looking for DNA patterns that need to
> be statistically determined. There are only so many shows he can have
> in his database.
No he doesn't. He wants to match a title, an episode, and a file
name, and all he has are space-delimited tokens.
If he wanted to match the tokens, I'd agree with you. 'Pattern' is a
concept we impose on the data, not something inherent in the data.
What is the title of "King of Royal Mounted Murderer's Row"? 'King of
Royal'? 'King of Royal Mounted'? 'King of Royal Mounted Murderer's'?
'King of Royal Mounted Murderer's Row'?
My point was that he wants to impose order on an essentially unordered
collection of tokens and does not have anything by which to collect
the words into groups. The fact that a pattern language recognizes
patterns isn't a limitation of the language, it's a description of the
language. He's trying to use the wrong tool for the job, and blames
the tool when it fails to do the job.
Besides all of that, he's basically doing data munging, and REs aren't
particularly suited to data munging.
CC.
------------------------------
Date: Wed, 20 Apr 2011 11:16:34 -0500
From: Ted Zlatanov <tzz@lifelogs.com>
Subject: Re: Perl RegExp question
Message-Id: <87liz4ew4d.fsf@lifelogs.com>
On Wed, 20 Apr 2011 08:53:54 -0700 (PDT) ccc31807 <cartercc@gmail.com> wrote:
c> On Apr 20, 10:37 am, Ted Zlatanov <t...@lifelogs.com> wrote:
c> The same is true for your data, YOU DON'T HAVE PATTERNS TO MATCH.
>>
>> Sure he does. It's not as if he's looking for DNA patterns that need to
>> be statistically determined. There are only so many shows he can have
>> in his database.
c> No he doesn't. He wants to match a title, and episode, and a file
c> name, and all he has are space delimited tokens.
c> If he wanted to match the tokens, I'd agree with you. 'Pattern' is a
c> concept we impose on the data, not something inherent in the data.
c> What is the title of "King of Royal Mounted Murderer's Row"? 'King of
c> Royal'? 'King of Royal Mounted'? 'King of Royal Mounted Murderer's'?
c> 'King of Royal Mounted Murderer's Row'?
There are patterns inherent in most data and context helps establish
them. Take English, for example. Sentences are terribly ambiguous
without context. The programming distinction is between lexing ("where
are the words?") and parsing ("what are the words saying?").
Another good example is Perl code itself. Does this:
map { "no $_" } qw/1 2 3/, qw/4 5 6/;
mean to map across 1-6 or 1-3, then leave 4-6 alone? At a glance it's
confusing, so that's where the parser comes in and determines how the
expressions will be grouped.
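
For the record, a short sketch of how that line actually groups: map
takes everything to its right as one list, so parentheses are needed to
stop it after the first qw. The variable names are illustrative.

```perl
use strict;
use warnings;

# Without parentheses, map takes both qw lists as its input.
my @all = map { "no $_" } qw/1 2 3/, qw/4 5 6/;

# Parentheses around the map call limit it to the first list.
my @some = ((map { "no $_" } qw/1 2 3/), qw/4 5 6/);

print scalar(@all),  " items: @all\n";
print scalar(@some), " items: @some\n";
```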
Incidentally this is one of the things I like about Lisp: there is very
little parsing ambiguity in the language, and in fact writing a Lisp
parser is a famously easy task. You can say that generally the more
"natural" a language is (approximately meaning "the syntax is looser"),
the harder it is to parse. Obviously Perl is pretty "natural" by design.
If you're interested in more on this topic, read up on lexing and
parsing. Perl 6 has *very* extensive support for those, way beyond what
regular expressions can provide (more like the Perl 5 Parse::RecDescent
module). Whether that's a good or a bad thing depends on who you ask.
Anyhow, the context here is "names of TV show episodes." The file name
is not just space-delimited tokens, it's the name of a TV show followed
by the episode name and then the chapter number. So to parse out the TV
show name, you can either use feedback training (where the user teaches
the parser the names of the TV shows) or a knowledge database (the list
of all TV shows).
c> Besides all of that, he's basically doing data munging, and REs aren't
c> particularly suited to data munging.
I disagree. Well, either this is false or I've been using regular
expressions wrong for the last 15 years or so. Could be the latter.
Ted
------------------------------
Date: Wed, 20 Apr 2011 09:46:42 -0700 (PDT)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: Perl RegExp question
Message-Id: <5178ea4e-3bfe-4643-ac48-106a02e892e5@o15g2000prn.googlegroups.com>
On Apr 20, 12:16 pm, Ted Zlatanov <t...@lifelogs.com> wrote:
> Incidentally this is one of the things I like about Lisp: there is very
> little parsing ambiguity in the language, and in fact writing a Lisp
> parser is a famously easy task.
An S-expression is by itself an abstract syntax tree. You can take
(* 3 (+ 4 (- 5 6)) (/ 7 8)) and write that as an AST directly. IMO,
this is what makes Lisp both very powerful and very difficult.
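
To make that concrete, here is a sketch in Perl that writes the same
expression as nested array references, which already is the tree, and
walks it. The %OPS table and evaluate() are illustrative names, and
each operator is folded left to right over its arguments.

```perl
use strict;
use warnings;

# Binary operators, applied left to right over a node's arguments.
my %OPS = (
    '+' => sub { $_[0] + $_[1] },
    '-' => sub { $_[0] - $_[1] },
    '*' => sub { $_[0] * $_[1] },
    '/' => sub { $_[0] / $_[1] },
);

sub evaluate {
    my ($node) = @_;
    return $node unless ref $node;          # a bare number is a leaf
    my ($op, @args) = @$node;
    my $fn = $OPS{$op} or die "Unknown operator '$op'\n";
    my @values = map { evaluate($_) } @args;
    my $result = shift @values;
    $result = $fn->($result, $_) for @values;
    return $result;
}

# (* 3 (+ 4 (- 5 6)) (/ 7 8)) written directly as a tree.
my $ast = [ '*', 3, [ '+', 4, [ '-', 5, 6 ] ], [ '/', 7, 8 ] ];
print evaluate($ast), "\n";
```

The tree evaluates to 3 * 3 * 0.875, that is 7.875.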
> Anyhow, the context here is "names of TV show episodes." The file name
> is not just space-delimited tokens, it's the name of a TV show followed
> by the episode name and then the chapter number.
I very rarely watch TV, and had no context within which to understand
the question. Obviously, if you have particular strings you want to
match, you can do that several ways. Still, to me the data appears to
be scrambled, even though to someone familiar with the names of TV
programs it might be pretty clear.
> c> Besides all of that, he's basically doing data munging, and REs aren't
> c> particularly suited to data munging.
>
> I disagree. Well, either this is false or I've been using regular
> expressions wrong for the last 15 years or so. Could be the latter.
This may be a point where individual experience colors our opinions.
For the past six years or so, I've earned my living as a data munger
(my job title is Database Manager but my responsibilities are mostly
creating reports based on the results of queries).
I see data as discrete values arranged in rows and columns, which
often results in multi-dimensional structures, e.g., reporting on
students by college, zip code, academic level, program, and credit
hours completed. To me, 'data munging' means reading input files,
rearranging the data, and writing output files --- and regular
expressions don't help at all with this kind of job. Instead, I use
hashes of hash refs a great deal.
I use regular expressions a lot, but mostly in connection with
operations on individual datums, NOT what I would consider data
munging.
To me, the OP's problem is a typical data munging task, and the OP is
using the wrong tool to do it but criticizing that tool for the
inability to get the job done. It's kind of like using a wrench to
drive nails, and faulting the wrench for not being able to drive nails
very well. You can do it, yes, but a hammer is much better for the
job.
CC.
------------------------------
Date: Wed, 20 Apr 2011 13:23:29 -0500
From: Ted Zlatanov <tzz@lifelogs.com>
Subject: Re: Perl RegExp question
Message-Id: <87hb9seq8u.fsf@lifelogs.com>
On Wed, 20 Apr 2011 09:46:42 -0700 (PDT) ccc31807 <cartercc@gmail.com> wrote:
c> I see data as discrete values arranged in rows and columns, which
c> often results in multi-dimensional structures, e.g., reporting on
c> students by college, zip code, academic level, program, and credit
c> hours completed. To me, 'data munging' means reading input files,
c> rearranging the data, and writing output files --- and regular
c> expressions don't help at all with this kind of job. Instead, I use
c> hashes of hash refs a great deal.
Oh boy, you're missing out on half the fun then. Regular expressions
are very good for manipulating and rearranging data, especially
line-based data. When each piece of data spans multiple lines it can
get a little harder to manipulate it.
For any data manipulation task, I try to do it in this sequence:
- for any line, ALWAYS produce the same output. Else,
- for any line, produce some output based on the input so far. Else,
- once all the data is consumed, produce some aggregate output.
Your usage seems to be mostly the third case, but the first two are very
useful as well. The first one (stateless data manipulation) is
especially useful because it can be done on any chunk of the data (which
in turn makes it easiest to parallelize and possibly map-reduce).
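
The three cases can be sketched side by side. This is only an
illustration on made-up "name count" lines; squeeze(), first_seen() and
totals() are hypothetical helpers, one per case.

```perl
use strict;
use warnings;

my @lines = ('alpha   1', 'beta 2', 'alpha   1', 'gamma    3');

# 1. Stateless: the output for a line depends on that line alone.
sub squeeze {
    my ($line) = @_;
    $line =~ s/\s+/ /g;
    return $line;
}

# 2. Stateful: the output depends on the input seen so far
#    (here, repeated lines are dropped).
sub first_seen {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# 3. Aggregate: nothing is produced until all input is consumed.
sub totals {
    my %sum;
    for my $line (@_) {
        my ($key, $count) = $line =~ /^(\w+) (\d+)$/ or next;
        $sum{$key} += $count;
    }
    return \%sum;
}

my @clean  = map { squeeze($_) } @lines;
my @unique = first_seen(@clean);
my $sum    = totals(@clean);
print "$_\n" for @unique;
print "$_ = $sum->{$_}\n" for sort keys %$sum;
```

Only squeeze() could be handed an arbitrary chunk of the input and
still give the same answer, which is what makes the first case the
easy one to split across workers.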
Ted
------------------------------
Date: Wed, 20 Apr 2011 12:20:44 -0700 (PDT)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: Perl RegExp question
Message-Id: <8c783752-b999-425d-8e1f-8dc33a080aaf@22g2000prx.googlegroups.com>
On Apr 20, 2:23 pm, Ted Zlatanov <t...@lifelogs.com> wrote:
> Regular expressions
> are very good for manipulating and rearranging data, especially
> line-based data.
What about delimited files that are 80 columns wide and 25000 rows
deep?
> For any data manipulation task, I try to do it in this sequence:
I typically do the following:
while (<DATA>)
{
    next unless /\w/;
    chomp;
    my ($var1, $var2, $var3 ... as appropriately named ) =
        some_split_function($_);
    next if $var2 =~ /unwanted value/;
    $hash{$var1}{$var2}{$var3}{$var4} = $var5;
}
I then have all my desired data in a hash that I can sort and
manipulate, and print, resulting in this pattern:
foreach my $k1 (sort keys %hash)
{
    foreach my $k2 (sort keys %{$hash{$k1}})
    {
        ... as many levels as I need
    }
}
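
A runnable sketch of that read, split, store, and report pattern, on
hypothetical pipe-delimited rows (college|level|credit hours). The
field names, the data, and the helpers build_totals() and report() are
assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical pipe-delimited rows: college|level|credit hours
my @rows = (
    "Arts|Graduate|9\n",
    "Arts|Undergraduate|12\n",
    "\n",
    "Science|Undergraduate|15\n",
    "Arts|Undergraduate|3\n",
);

# Read every row into a two-level hash, summing hours per college/level.
sub build_totals {
    my %hash;
    for my $row (@_) {
        next unless $row =~ /\w/;          # skip blank lines
        (my $line = $row) =~ s/\n\z//;     # strip the newline from a copy
        my ($college, $level, $hours) = split /\|/, $line;
        $hash{$college}{$level} += $hours;
    }
    return \%hash;
}

# Walk the hash in sorted key order, one output line per leaf.
sub report {
    my ($hash) = @_;
    my @out;
    for my $k1 (sort keys %$hash) {
        for my $k2 (sort keys %{ $hash->{$k1} }) {
            push @out, sprintf('%s, %s: %d', $k1, $k2, $hash->{$k1}{$k2});
        }
    }
    return @out;
}

print "$_\n" for report(build_totals(@rows));
```

Note that split's first argument is itself a pattern, so a regular
expression is still doing the field separation here.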
>
> - for any line, ALWAYS produce the same output. Else,
>
> - for any line, produce some output based on the input so far. Else,
>
> - once all the data is consumed, produce some aggregate output.
>
> Your usage seems to be mostly the third case, but the first two are very
> useful as well.
I usually have to aggregate data, so I sum or count datums on each
row. Don't get me wrong -- I use REs all the time, but for the task of
reading input, building data structures, and writing output, I don't
find them particularly useful.
CC.
------------------------------
Date: Wed, 20 Apr 2011 13:00:08 -0700
From: Jim Gibson <jimsgibson@gmail.com>
Subject: Re: Perl RegExp question
Message-Id: <200420111300085586%jimsgibson@gmail.com>
In article
<8c783752-b999-425d-8e1f-8dc33a080aaf@22g2000prx.googlegroups.com>,
ccc31807 <cartercc@gmail.com> wrote:
> I usually have to aggregate data, so I sum or count datums on each
> row. Don't get me wrong -- I use REs all the time, but for the task of
> reading input, building data structures, and writing output, I don't
> find them particularly useful.
Well, I do. Perhaps in your case the reason you don't need regular
expressions to parse your data is because your data is contained within
a well-structured database.
I typically use Perl to extract information from program output and log
files. These files are semi-structured, containing some fixed bits and
some variable bits. I am usually interested in the variable bits and
use regular expressions and the fixed bits to extract the data I want.
The format of the log files is sometimes under my control and sometimes
not. For these cases, regular expressions are invaluable.
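
As an illustration of anchoring on the fixed bits to pull out the
variable ones, here is a sketch against a made-up log layout. The
format, the $LINE pattern, and parse_line() are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical layout: "<date> <time> <LEVEL> job <name> finished in <n>s"
# The literal words are the fixed bits; the captures are the variable bits.
my $LINE = qr/^(\S+) \s+ (\S+) \s+ ([A-Z]+) \s+
              job \s+ (\S+) \s+ finished \s+ in \s+ (\d+) s $/x;

# Returns a hash ref of fields, or undef when the line has another shape.
sub parse_line {
    my ($line) = @_;
    my ($date, $time, $level, $job, $seconds) = $line =~ $LINE
        or return undef;
    return {
        date    => $date,
        time    => $time,
        level   => $level,
        job     => $job,
        seconds => $seconds,
    };
}

for my $line ('2011-04-20 13:00:08 INFO job nightly finished in 42s',
              'some unrelated chatter') {
    my $rec = parse_line($line);
    print $rec ? "$rec->{job}: $rec->{seconds}s\n" : "skipped: $line\n";
}
```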
--
Jim Gibson
------------------------------
Date: Wed, 20 Apr 2011 18:42:01 -0500
From: Ted Zlatanov <tzz@lifelogs.com>
Subject: Re: Perl RegExp question
Message-Id: <87zknk32ye.fsf@lifelogs.com>
On Wed, 20 Apr 2011 12:20:44 -0700 (PDT) ccc31807 <cartercc@gmail.com> wrote:
c> On Apr 20, 2:23 pm, Ted Zlatanov <t...@lifelogs.com> wrote:
>> Regular expressions
>> are very good for manipulating and rearranging data, especially
>> line-based data.
c> What about delimited files that are 80 columns wide and 25000 rows
c> deep?
Sure, depending on the manipulation needed of course. 2MB or so of data
is hardly large, though.
c> I typically do the following:
c> while (<DATA>)
c> {
c> next unless /\w/;
c> chomp;
c> my ($var1, $var2, $var3 ... as appropriately named ) =
c> some_split_function($_);
c> next if $var2 =~ /unwanted value/;
c> $hash{$var1}{$var2}{$var3}{$var4} = $var5;
c> }
some_split_function() almost definitely uses regular expressions :)
c> I then have all my desired data in a hash that I can sort and
c> manipulate, and print, resulting in this pattern:
c> foreach my $k1 (sort keys %hash)
c> {
c> foreach my $k2 (sort keys %{$hash{$k1}})
c> {
c> ... as many levels as I need
c> }
c> }
Sure. This works great until %hash gets big. It also produces output
only after all the input is consumed, as opposed to line-based
processing which tends to be much more responsive. So you choose the
approach depending on the task you need to do and the size of your
input, although all other things being equal, go with stateless
line-by-line processing if you can. But I'm repeating myself...
c> I usually have to aggregate data, so I sum or count datums on each
c> row. Don't get me wrong -- I use REs all the time, but for the task of
c> reading input, building data structures, and writing output, I don't
c> find them particularly useful.
OK. Many of us do, so I think it's simply that you haven't had the
opportunity and need to try it, rather than a fundamental shortcoming of
regular expressions as a data processing and munging tool.
Ted
------------------------------
Date: Wed, 20 Apr 2011 10:05:22 -0400
From: Wes Groleau <Groleau+news@FreeShell.org>
Subject: Web Scraping Proxy
Message-Id: <iomp70$rfv$1@dont-email.me>
Is there a newsgroup or forum suitable for questions regarding wsp.pl
(by Katcheff at AT&T)? A web search gives me dozens of places to
download it and a general perl forum with a single unresolved question
on the wsp.
I _think_ I have the required libraries installed, and it isn't
complaining about anything, but it takes one request and quits.
And that request is failing because FireFox won't allow me
to accept wsp's SSL certificate. (Which is presumably a FireFox
issue, but I wish WSP would keep recording while I fight with FF.)
--
Wes Groleau
It's the Law!
http://Ideas.Lang-Learn.us/WWW?itemid=93
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V11 Issue 3359
***************************************