[31439] in Perl-Users-Digest
Perl-Users Digest, Issue: 2691 Volume: 11
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Tue Nov 24 16:09:37 2009
Date: Tue, 24 Nov 2009 13:09:06 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Tue, 24 Nov 2009 Volume: 11 Number: 2691
Today's topics:
Re: help with regex sln@netherlands.com
Re: perl hash: low-level implementation details? <Peter@PSDT.com>
Re: perl hash: low-level implementation details? <tzz@lifelogs.com>
Problem parsing HTML <nickli2000@gmail.com>
Re: Problem parsing HTML <news@danrumney.co.uk>
Re: Problem parsing HTML <nickli2000@gmail.com>
Re: Problem parsing HTML <jurgenex@hotmail.com>
Re: Problem parsing HTML <nickli2000@gmail.com>
Re: Problem parsing HTML <tadmc@seesig.invalid>
Re: Problem parsing HTML <nickli2000@gmail.com>
Why is "use 5.010" necessary <abc@def.com>
Re: Why is "use 5.010" necessary <ben@morrow.me.uk>
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Tue, 24 Nov 2009 12:40:15 -0800
From: sln@netherlands.com
Subject: Re: help with regex
Message-Id: <0mgog592u1fnh96ndc19i4i1t7mo132jmc@4ax.com>
On Sat, 21 Nov 2009 13:42:48 -0800 (PST), Obama <cyrusgreats@gmail.com> wrote:
>On Nov 21, 1:07 pm, Martien Verbruggen
><martien.verbrug...@invalid.see.sig> wrote:
>> On Sat, 21 Nov 2009 08:19:05 -0800 (PST),
>> If you have any questions, don't hesitate to ask.
>> Martien
>
>Martien,
>thanks for your help. I have a log file which contains more than
>3000 records, but after running the program it records only 23. The
>reason, I guess, is that the code overwrites the entry if one server has
>more than one 'Start|End'. One server could 'Start' at, say, Nov 15
>00:03:45 and 'End' at Nov 15 00:03:55, and have another session later in
>the log, say a 'Start' at Nov 18 00:08:41 and an 'End' at Nov 18 00:08:59.
>The last one overwrites the earlier ones! Again, thanks for any help
>you can give me on this!
>
Since you refuse to treat this like a database, here is a
hybrid, putting the record in fixed strings, sorting, then
extracting. All in a fixed way, since you refuse everything else.
-sln
---------
Output:
Hercules:sm_fv_emba
Tue Nov 12 00:15:04 - Wed Nov 13 00:00:13 (6098287 KB)
Sun Nov 15 00:15:04 - Sun Nov 15 00:15:13 (1900 KB)
Hercules:sm_fv_faculty
Sun Nov 15 00:15:04
Hercules:sm_fv_phd
Sun Nov 15 00:00:03 - Sun Nov 15 00:00:26 (4528 KB)
Hercules:sm_fv_researchdata
Sun Nov 15 00:15:04 - Sun Nov 15 00:15:18 (1820 KB)
Hercules:sm_fv_servicedata
- Tue Nov 15 00:00:58 (53664 KB)
- Sun Nov 14 00:00:55 (53664 KB)
- Sun Nov 14 00:02:01 (53664 KB)
- Sun Nov 15 00:00:01 (53664 KB)
Sun Nov 15 00:00:03 - Sun Nov 15 00:01:00 (4445554 KB)
Hercules:sm_fv_students
Sun Nov 15 00:00:03 - Sun Nov 15 00:00:25 (3368 KB)
Hercules:sm_galaxy_root
Sun Nov 15 00:15:04 - Sun Nov 15 00:15:32 (39128 KB)
network-1:Test1
Fri May 25 00:13:20 - Fri May 25 00:13:49 (2048 KB)
network-2:Test2
Sat May 26 00:15:20 - Sat May 26 00:15:49 (212048 KB) - Sat May 26 00:15:50 (212048 KB)
Sat May 26 00:16:20
Sat May 26 00:16:22 - Sat May 26 00:16:49 (212048 KB)
---------
## misc_parse13.pl, sln
##
use strict;
use warnings;
my %servers;
my %day2num = (
    mon=>'1', tue=>'2', wed=>'3', thu=>'4',
    fri=>'5', sat=>'6', sun=>'7');
my %month2num = (
    jan=>'01', feb=>'02', mar=>'03', apr=>'04',
    may=>'05', jun=>'06', jul=>'07', aug=>'08',
    sep=>'09', oct=>'10', nov=>'11', dec=>'12');
my %num2day = reverse ( %day2num );
my %num2month = reverse ( %month2num );
while (<DATA>)
{
    my @all = split /[()\s]+/;
    next if (@all<9 or $all[8] !~ /^(?:start|end)/i);
    $all[9] = '' if !defined ($all[9]);
    $all[10] = '' if !defined ($all[10]);
    my $rec =
        $day2num{lc $all[1]}. # day of week/year
        $month2num{lc $all[2]}. # month
        $all[3]. # day
        $all[4]. # time
        '-'. # -
        $all[8]. # start/end
        ' ('.$all[9].' '.$all[10].')'; # ( usage size )
    push @{$servers{$all[7]}}, $rec;
}
## sort by server
for my $srv (sort keys %servers)
{
    print "\n\n$srv";
    my $nostart = "\n ";
    ## sort by year/month/day/time
    for (sort @{$servers{$srv}})
    {
        ## print results
        (/start/i .. /start/i) and (print "\n", $nostart = '');
        if ( /(\d)(\d\d)(\d\d)(.+)-start/i )
        {
            print ' '.
                ucfirst($num2day{$1}).' '.ucfirst($num2month{$2}).
                ' '.$3.' '.$4.' ';
        }
        elsif ( /(\d)(\d\d)(\d\d)(.+)-end (.*)/i )
        {
            print $nostart.
                '- '.ucfirst($num2day{$1}).' '.ucfirst($num2month{$2}).
                ' '.$3.' '.$4.' '.$5.' ';
        }
    }
}
__DATA__
src Fri May 25 00:13:49 EDT myserver1:Test1 network-1:Test1 End (2048 KB)
src Fri May 25 00:13:20 EDT myserver1:Test1 network-1:Test1 Start
src Sat May 26 00:15:20 EDT myserver2:Test2 network-2:Test2 Start
src Sat May 26 00:15:49 EDT myserver2:Test2 network-2:Test2 End (212048 KB)
src Sat May 26 00:15:50 EDT myserver2:Test2 network-2:Test2 End (212048 KB)
src Sat May 26 00:16:20 EDT myserver2:Test2 network-2:Test2 Start
src Sat May 26 00:16:22 EDT myserver2:Test2 network-2:Test2 Start
src Sat May 26 00:16:49 EDT myserver2:Test2 network-2:Test2 End (212048 KB)
dst Sun Nov 15 00:00:03 EST galaxy.fuqua.duke.edu:fv_servicedata Hercules:sm_fv_servicedata Start
dst Sun Nov 15 00:00:03 EST galaxy.fuqua.duke.edu:fv_phd Hercules:sm_fv_phd Start
dst Sun Nov 15 00:00:03 EST galaxy.fuqua.duke.edu:fv_students Hercules:sm_fv_students Start
dst Sun Nov 15 00:00:25 EST galaxy.fuqua.duke.edu:fv_students Hercules:sm_fv_students End (3368 KB)
dst Sun Nov 15 00:00:26 EST galaxy.fuqua.duke.edu:fv_phd Hercules:sm_fv_phd End (4528 KB)
dst Tue Nov 15 00:00:58 EST galaxy.fuqua.duke.edu:fv_servicedata Hercules:sm_fv_servicedata End (53664 KB)
dst Sun Nov 14 00:00:55 EST galaxy.fuqua.duke.edu:fv_servicedata Hercules:sm_fv_servicedata End (53664 KB)
dst Sun Nov 14 00:02:01 EST galaxy.fuqua.duke.edu:fv_servicedata Hercules:sm_fv_servicedata End (53664 KB)
dst Sun Nov 15 00:00:01 EST galaxy.fuqua.duke.edu:fv_servicedata Hercules:sm_fv_servicedata End (53664 KB)
dst Sun Nov 15 00:01:00 EST galaxy.fuqua.duke.edu:fv_servicedata Hercules:sm_fv_servicedata End (4445554 KB)
dst Wed Nov 13 00:00:13 EST galaxy.fuqua.duke.edu:fv_emba Hercules:sm_fv_emba End (6098287 KB)
dst Sun Nov 15 00:01:00 EST andromeda.fuqua.duke.edu:esx_fc_nfs2 Hercules:sm_esx_fc_nfs2 Request (Retry)
dst Sun Nov 15 00:15:04 EST galaxy.fuqua.duke.edu:fv_faculty Hercules:sm_fv_faculty Start
dst Sun Nov 15 00:15:04 EST galaxy.fuqua.duke.edu:fv_researchdata Hercules:sm_fv_researchdata Start
dst Sun Nov 15 00:15:04 EST galaxy.fuqua.duke.edu:fv_emba Hercules:sm_fv_emba Start
dst Sun Nov 15 00:15:04 EST galaxy.fuqua.duke.edu:root Hercules:sm_galaxy_root Start
dst Sun Nov 15 00:15:13 EST galaxy.fuqua.duke.edu:fv_emba Hercules:sm_fv_emba End (1900 KB)
dst Sun Nov 15 00:15:18 EST galaxy.fuqua.duke.edu:fv_researchdata Hercules:sm_fv_researchdata End (1820 KB)
dst Sun Nov 15 00:15:32 EST galaxy.fuqua.duke.edu:root Hercules:sm_galaxy_root End (39128 KB)
dst Sun Nov 15 00:16:00 EST andromeda.fuqua.duke.edu:esx_fc_nfs2 Hercules:sm_esx_fc_nfs2 Request (Retry)
dst Tue Nov 12 00:15:04 EST galaxy.fuqua.duke.edu:fv_emba Hercules:sm_fv_emba Start
------------------------------
Date: Tue, 24 Nov 2009 14:14:44 GMT
From: Peter Scott <Peter@PSDT.com>
Subject: Re: perl hash: low-level implementation details?
Message-Id: <oTROm.58438$rE5.9984@newsfe08.iad>
On Mon, 23 Nov 2009 13:47:55 -0800, zikester wrote:
> I've written a perl program that takes 3GB worth of key/value pairs (
> each are numbers in the range of 0-60million ), and builds a hash with
> them. The hash itself seems to be taking more than 130GB ( linux 64-bit
> ) and counting--I had to kill the program b/c it was growing too large (
> we have some 130Gb memory machines at our lab ).
>
> Is there an article/book that describes the inner workings of perl data
> structures / hashes in particular? I just want to know why it's taking
> so much memory.
To find out more about Perl's internal data representations, google
"perlguts illustrated".
To store a lot of numbers compactly, consider the PDL module
(http://pdl.perl.org/).
--
Peter Scott
http://www.perlmedic.com/
http://www.perldebugged.com/
http://www.informit.com/store/product.aspx?isbn=0137001274
------------------------------
Date: Tue, 24 Nov 2009 09:00:32 -0600
From: Ted Zlatanov <tzz@lifelogs.com>
Subject: Re: perl hash: low-level implementation details?
Message-Id: <87vdh0ynlb.fsf@lifelogs.com>
On Mon, 23 Nov 2009 21:23:14 -0800 Xho Jingleheimerschmidt <xhoster@gmail.com> wrote:
XJ> zikester wrote:
>>
>> Thanks for the alternate approaches: I realize the impracticality of
>> my approach, but I just want to understand what is going on :)
>>
>>
>> The following simplified code generates 16Gb of RAM usage for me.
>> Since we have 50 million pairs, given a hash of size 50 million slots
>> we get about 320 bytes of memory usage per slot. Each slot on average
>> would have close to 1 member in it,
XJ> Each value slot will have exactly one value in it--that is how Perl
XJ> hashes work. However, in your code that value will be a reference to
XJ> an array, which array will on average have close to 1 element in it.
XJ> And there goes your memory. You have about 50 million tiny arrays,
XJ> each one using a lot of overhead.
XJ> I'd guess roughly it comes up to something like: 48 bytes for the key
XJ> and associated structure, 40 bytes for the value-scalar (which holds
XJ> an arrayref), 160 bytes for the array overhead, and 48 bytes for each
XJ> scalar (usually 1) inside each array.
In fact my #1 recommendation after the above discussion would be to make
your values strings instead of arrays with some fixed delimiter (e.g. a
0 byte can be used with pack/unpack). If the values can be packed more
efficiently in binary (e.g. they are always floats), do that in the
value.
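A minimal sketch of the packed-string idea, assuming the values are
unsigned integers that fit in 32 bits (which holds for the 0-60 million
range mentioned in the original post). The add_value and values_for
helpers are made-up names for illustration, not part of any module:

```perl
use strict;
use warnings;

my %packed;   # key => string of packed 32-bit unsigned integers

# Append one value to the packed string stored under $key.
sub add_value {
    my ($key, $value) = @_;
    $packed{$key} = '' unless defined $packed{$key};
    $packed{$key} .= pack 'N', $value;   # 'N' = 32-bit big-endian unsigned
}

# Return all values stored under $key as a list (empty if none).
sub values_for {
    my ($key) = @_;
    return unless exists $packed{$key};
    return unpack 'N*', $packed{$key};
}

add_value(42, 7);
add_value(42, 59_999_999);
add_value(43, 0);

print join(',', values_for(42)), "\n";
```

Each stored value then costs 4 bytes inside one string per key, instead
of a separate array and scalar per value.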
I would use SQLite to do this, however. With an indexed primary key,
the lookup will be very fast and, more importantly, can be parallelized.
Running more than one core is very valuable with this kind of large
problem. You can even make a RAM disk and keep the database file there.
It may be slightly slower than a pure Perl solution, but I would
benchmark it regardless.
There are also dedicated key-value databases like TDB and many many
others that are designed for large data sets and scale well across
multiple machines. What you should use depends on the end purpose of
your key-value store.
Ted
------------------------------
Date: Tue, 24 Nov 2009 08:42:12 -0800 (PST)
From: Ninja Li <nickli2000@gmail.com>
Subject: Problem parsing HTML
Message-Id: <c7827b03-b7a8-410b-bb01-0c6268b35f14@m11g2000vbo.googlegroups.com>
Hi,
I am trying to parse HTML from the website http://biz.yahoo.com/c/e.html
using HTML::TreeBuilder module and generate a comma-delimited file.
However, I am getting an extra "," at the first line and I would also
like to get rid of the "," at the end of the each line.
Please advise why that happens and the fix. The code is at the end
of the post.
Thanks in advance.
Nick
-----------------------------------------------
Source code:
use strict;
use LWP::Simple;
use HTML::Tree;
use warnings;
my $url = 'http://biz.yahoo.com/c/e.html';
my $content = get($url);
my $tree = HTML::TreeBuilder->new_from_content($content);
my @tr = $tree->look_down('_tag' => 'tr',
    sub { $_[0]->as_text !~ m/My Yahoo!/ &&
          $_[0]->as_text !~ m/Welcome/i &&
          $_[0]->as_text !~ m/Economic Calendar/i &&
          $_[0]->as_text !~ m/Last Week/i });
foreach my $tr (@tr)
{
    if ($tr)
    {
        my @detail = $tr->look_down('_tag' => 'td');
        foreach my $detail (@detail)
        {
            print $detail->as_text . ",";
        }
        print "\n";
    }
    else
    {
        warn "No detail data";
    }
}
$tree->delete;
------------------------------
Date: Tue, 24 Nov 2009 11:29:44 -0600
From: Dan Rumney <news@danrumney.co.uk>
Subject: Re: Problem parsing HTML
Message-Id: <heh556$r9$1@aioe.org>
> my @detail = $tr->look_down('_tag' => 'td');
>
> foreach my $detail (@detail)
> {
> print $detail->as_text . ",";
> }
> print "\n";
There's your problem.
For every element in @detail, you print that element, plus a comma.
You've programmed it to put a comma after every $detail, so you're going
to get a comma after the last $detail on each line.
Try
> my @detail = $tr->look_down('_tag' => 'td');
> print join(',',@detail)."\n";
Dan
------------------------------
Date: Tue, 24 Nov 2009 10:25:36 -0800 (PST)
From: Ninja Li <nickli2000@gmail.com>
Subject: Re: Problem parsing HTML
Message-Id: <62a4fcc0-53d2-4c0a-866f-3dd891f257df@p19g2000vbq.googlegroups.com>
On Nov 24, 12:29 pm, Dan Rumney <n...@danrumney.co.uk> wrote:
> > my @detail = $tr->look_down('_tag' => 'td');
>
> > foreach my $detail (@detail)
> > {
> > print $detail->as_text . ",";
> > }
> > print "\n";
>
> There's your problem.
>
> For every element in @detail, you print that element, plus a comma.
> You've programmed it to put a comma after every $detail, so you're going
> to get a comma after the last $detail on each line.
>
> Try
>
> > my @detail = $tr->look_down('_tag' => 'td');
> > print join(',',@detail)."\n";
>
> Dan
Dan,
I tried your solution but I got error messages.
Thanks.
Nick
------------------------------
Date: Tue, 24 Nov 2009 10:46:32 -0800
From: Jürgen Exner <jurgenex@hotmail.com>
Subject: Re: Problem parsing HTML
Message-Id: <eeaog5hauuh6sdml92hokgokfersteb7b0@4ax.com>
Ninja Li <nickli2000@gmail.com> wrote:
>On Nov 24, 12:29 pm, Dan Rumney <n...@danrumney.co.uk> wrote:
>> > my @detail = $tr->look_down('_tag' => 'td');
>>
>> > foreach my $detail (@detail)
>> > {
>> > print $detail->as_text . ",";
>> > }
>> > print "\n";
>>
>> There's your problem.
>>
>> For every element in @detail, you print that element, plus a comma.
>> You've programmed it to put a comma after every $detail, so you're going
>> to get a comma after the last $detail on each line.
>>
>> Try
>>
>> > my @detail = $tr->look_down('_tag' => 'td');
>> > print join(',',@detail)."\n";
>
> I tried your solution but I got error messages.
And? Are you keeping those error messages a secret? Kind of difficult to
correct code without knowing the code _and_ the error messages.
jue
------------------------------
Date: Tue, 24 Nov 2009 11:13:02 -0800 (PST)
From: Ninja Li <nickli2000@gmail.com>
Subject: Re: Problem parsing HTML
Message-Id: <edf20f27-4d20-4213-ba25-2a51dd5c42fc@p23g2000vbl.googlegroups.com>
On Nov 24, 1:46 pm, Jürgen Exner <jurge...@hotmail.com> wrote:
> Ninja Li <nickli2...@gmail.com> wrote:
> >On Nov 24, 12:29 pm, Dan Rumney <n...@danrumney.co.uk> wrote:
> >> > my @detail = $tr->look_down('_tag' => 'td');
>
> >> > foreach my $detail (@detail)
> >> > {
> >> > print $detail->as_text . ",";
> >> > }
> >> > print "\n";
>
> >> There's your problem.
>
> >> For every element in @detail, you print that element, plus a comma.
> >> You've programmed it to put a comma after every $detail, so you're going
> >> to get a comma after the last $detail on each line.
>
> >> Try
>
> >> > my @detail = $tr->look_down('_tag' => 'td');
> >> > print join(',',@detail)."\n";
>
> > I tried your solution but I got error messages.
>
> And? Are you keeping those error messages a secret? Kind of difficult to
> correct code without knowing the code _and_ the error messages.
>
> jue
Jurgen,
Thanks for pointing this out. Here is the error message after the code
change. The new code is at the end of the post:
Thanks.
------------------------------------
HTML::Element=HASH(0xdd3228)
HTML::Element=HASH(0xdee864),HTML::Element=HASH
(0xdee924),HTML::Element=HASH(0xdee9a8),HTML::Element=HASH
(0xdeea44),HTML::Element=HASH(0xdeeb04),HTML::Element=HASH
(0xdeebac),HTML::Element=HASH(0xdeec54),HTML::Element=HASH
(0xdf2bd8),HTML::Element=HASH(0xdf2c80)
HTML::Element=HASH(0xdf2d58),HTML::Element=HASH
(0xdf2dd0),HTML::Element=HASH(0xdf2e0c),HTML::Element=HASH
(0xdf2eb4),HTML::Element=HASH(0xdf2f2c),HTML::Element=HASH
(0xdf2f8c),HTML::Element=HASH(0xdf2fec),HTML::Element=HASH
(0xdf304c),HTML::Element=HASH(0xdf30ac)
HTML::Element=HASH(0xdf313c),HTML::Element=HASH
(0xdf31b4),HTML::Element=HASH(0xdf31f0),HTML::Element=HASH
(0xdf32a4),HTML::Element=HASH(0xdf331c),HTML::Element=HASH
(0xdf337c),HTML::Element=HASH(0xdf33dc),HTML::Element=HASH
(0xdf343c),HTML::Element=HASH(0xdf349c)
HTML::Element=HASH(0xdf352c),HTML::Element=HASH
(0xdf35a4),HTML::Element=HASH(0xdf35e0),HTML::Element=HASH
(0xdf3694),HTML::Element=HASH(0xdf370c),HTML::Element=HASH
(0xdf376c),HTML::Element=HASH(0xdf37cc),HTML::Element=HASH
(0xdf382c),HTML::Element=HASH(0xdf388c)
HTML::Element=HASH(0xdf391c),HTML::Element=HASH
(0xdf3994),HTML::Element=HASH(0xdf39d0),HTML::Element=HASH
(0xdf3a84),HTML::Element=HASH(0xdf3afc),HTML::Element=HASH
(0xdf6480),HTML::Element=HASH(0xdf64e0),HTML::Element=HASH
(0xdf6540),HTML::Element=HASH(0xdf65a0)
HTML::Element=HASH(0xdf6630),HTML::Element=HASH
(0xdf66a8),HTML::Element=HASH(0xdf66e4),HTML::Element=HASH
(0xdf6798),HTML::Element=HASH(0xdf6810),HTML::Element=HASH
(0xdf6870),HTML::Element=HASH(0xdf68d0),HTML::Element=HASH
(0xdf6930),HTML::Element=HASH(0xdf6990)
HTML::Element=HASH(0xdf6a20),HTML::Element=HASH
(0xdf6a98),HTML::Element=HASH(0xdf6ad4),HTML::Element=HASH
(0xdf6b28),HTML::Element=HASH(0xdf6ba0),HTML::Element=HASH
(0xdf6c00),HTML::Element=HASH(0xdf6c60),HTML::Element=HASH
(0xdf6cc0),HTML::Element=HASH(0xdf6d20)
HTML::Element=HASH(0xdf6db0),HTML::Element=HASH
(0xdf6e28),HTML::Element=HASH(0xdf6e64),HTML::Element=HASH
(0xdf6f0c),HTML::Element=HASH(0xdf6f84),HTML::Element=HASH
(0xdf6fe4),HTML::Element=HASH(0xdf7044),HTML::Element=HASH
(0xdf70a4),HTML::Element=HASH(0xdf7104)
HTML::Element=HASH(0xdf7194),HTML::Element=HASH
(0xdf720c),HTML::Element=HASH(0xdf7248),HTML::Element=HASH
(0xdf72f0),HTML::Element=HASH(0xdf7368),HTML::Element=HASH
(0xdf73c8),HTML::Element=HASH(0xdf7428),HTML::Element=HASH
(0xdfae6c),HTML::Element=HASH(0xdfaecc)
HTML::Element=HASH(0xdfaf5c),HTML::Element=HASH
(0xdfafd4),HTML::Element=HASH(0xdfb010),HTML::Element=HASH
(0xdfb064),HTML::Element=HASH(0xdfb0dc),HTML::Element=HASH
(0xdfb13c),HTML::Element=HASH(0xdfb19c),HTML::Element=HASH
(0xdfb1fc),HTML::Element=HASH(0xdfb25c)
HTML::Element=HASH(0xdfb2ec),HTML::Element=HASH
(0xdfb364),HTML::Element=HASH(0xdfb3a0),HTML::Element=HASH
(0xdfb3f4),HTML::Element=HASH(0xdfb46c),HTML::Element=HASH
(0xdfb4cc),HTML::Element=HASH(0xdfb52c),HTML::Element=HASH
(0xdfb58c),HTML::Element=HASH(0xdfb5ec)
HTML::Element=HASH(0xdfb67c),HTML::Element=HASH
(0xdfb6f4),HTML::Element=HASH(0xdfb730),HTML::Element=HASH
(0xdfb7d8),HTML::Element=HASH(0xdfb850),HTML::Element=HASH
(0xdfb8b0),HTML::Element=HASH(0xdfb910),HTML::Element=HASH
(0xdfb970),HTML::Element=HASH(0xdfb9d0)
HTML::Element=HASH(0xdfba60),HTML::Element=HASH
(0xdfbad8),HTML::Element=HASH(0xdfbb14),HTML::Element=HASH
(0xdfbb68),HTML::Element=HASH(0xdfbbe0),HTML::Element=HASH
(0xdfbc40),HTML::Element=HASH(0xdfbca0),HTML::Element=HASH
(0xdfbd00),HTML::Element=HASH(0xdfbd60)
HTML::Element=HASH(0xdfbdf0),HTML::Element=HASH
(0xdfe75c),HTML::Element=HASH(0xdfe798),HTML::Element=HASH
(0xdfe84c),HTML::Element=HASH(0xdfe8c4),HTML::Element=HASH
(0xdfe924),HTML::Element=HASH(0xdfe984),HTML::Element=HASH
(0xdfe9e4),HTML::Element=HASH(0xdfea44)
HTML::Element=HASH(0xdfead4),HTML::Element=HASH
(0xdfeb4c),HTML::Element=HASH(0xdfeb88),HTML::Element=HASH
(0xdfec3c),HTML::Element=HASH(0xdfecb4),HTML::Element=HASH
(0xdfed14),HTML::Element=HASH(0xdfed74),HTML::Element=HASH
(0xdfedd4),HTML::Element=HASH(0xdfee34)
HTML::Element=HASH(0xdfeec4),HTML::Element=HASH
(0xdfef3c),HTML::Element=HASH(0xdfef78),HTML::Element=HASH
(0xdff020),HTML::Element=HASH(0xdff098),HTML::Element=HASH
(0xdff0f8),HTML::Element=HASH(0xdff158),HTML::Element=HASH
(0xdff1b8),HTML::Element=HASH(0xdff218)
HTML::Element=HASH(0xdff2a8),HTML::Element=HASH
(0xdff320),HTML::Element=HASH(0xdff35c),HTML::Element=HASH
(0xdff3b0),HTML::Element=HASH(0xdff428),HTML::Element=HASH
(0xdff488),HTML::Element=HASH(0xdff4e8),HTML::Element=HASH
(0xdff548),HTML::Element=HASH(0xdff5a8)
HTML::Element=HASH(0xdff638),HTML::Element=HASH
(0xdff6b0),HTML::Element=HASH(0xdff6ec),HTML::Element=HASH
(0xe040e0),HTML::Element=HASH(0xe04158),HTML::Element=HASH
(0xe041b8),HTML::Element=HASH(0xe04218),HTML::Element=HASH
(0xe04278),HTML::Element=HASH(0xe042d8)
HTML::Element=HASH(0xe04368),HTML::Element=HASH
(0xe043e0),HTML::Element=HASH(0xe0441c),HTML::Element=HASH
(0xe044d0),HTML::Element=HASH(0xe04548),HTML::Element=HASH
(0xe045a8),HTML::Element=HASH(0xe04608),HTML::Element=HASH
(0xe04668),HTML::Element=HASH(0xe046c8)
HTML::Element=HASH(0xe04758),HTML::Element=HASH
(0xe047d0),HTML::Element=HASH(0xe0480c),HTML::Element=HASH
(0xe04860),HTML::Element=HASH(0xe048d8),HTML::Element=HASH
(0xe04938),HTML::Element=HASH(0xe04998),HTML::Element=HASH
(0xe049f8),HTML::Element=HASH(0xe04a58)
HTML::Element=HASH(0xe04ae8),HTML::Element=HASH
(0xe04b60),HTML::Element=HASH(0xe04b9c),HTML::Element=HASH
(0xe04c44),HTML::Element=HASH(0xe04cbc),HTML::Element=HASH
(0xe04d1c),HTML::Element=HASH(0xe04d7c),HTML::Element=HASH
(0xe04ddc),HTML::Element=HASH(0xe04e3c)
HTML::Element=HASH(0xe04ecc),HTML::Element=HASH
(0xe04f44),HTML::Element=HASH(0xe04f80),HTML::Element=HASH
(0xe04fd4),HTML::Element=HASH(0xe089d8),HTML::Element=HASH
(0xe08a38),HTML::Element=HASH(0xe08a98),HTML::Element=HASH
(0xe08af8),HTML::Element=HASH(0xe08b58)
-------------------------
Source Code:
use strict;
use LWP::Simple;
use HTML::Tree;
use warnings;
my $file = "economic_calendar.dat";
unlink($file);
open FILE, ">$file" or die $!;
my $url = 'http://biz.yahoo.com/c/e.html';
my $content = get($url);
my $tree = HTML::TreeBuilder->new_from_content($content);
my @tr = $tree->look_down('_tag' => 'tr',
    sub { $_[0]->as_text !~ m/My Yahoo!/ &&
          $_[0]->as_text !~ m/Welcome/i &&
          $_[0]->as_text !~ m/Economic Calendar/i &&
          $_[0]->as_text !~ m/Last Week/i });
foreach my $tr (@tr)
{
    if ($tr)
    {
        my @detail = $tr->look_down('_tag' => 'td');
        print join(',',@detail)."\n";
    }
    else
    {
        warn "No detail data";
    }
}
close FILE;
$tree->delete;
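
The output above is what Perl prints when an object reference is used
as a string: join was handed the HTML::Element objects themselves rather
than their text. A minimal sketch of the difference, using a
hypothetical stand-in class (My::Cell, which is not part of HTML::Tree)
so that it runs without any CPAN modules. With the real objects, the
same map { $_->as_text } step should apply:

```perl
use strict;
use warnings;

# Hypothetical stand-in for HTML::Element: a blessed hash with an
# as_text method. Not part of HTML::Tree.
package My::Cell;
sub new     { my ($class, $text) = @_; return bless { text => $text }, $class }
sub as_text { return $_[0]{text} }

package main;

my @detail = map { My::Cell->new($_) } ('Nov 24', '8:30 AM', 'GDP');

# Joining the objects stringifies each reference: My::Cell=HASH(0x...).
my $refs = join ',', @detail;

# Joining the text of each object gives the intended CSV line,
# with no trailing comma.
my $csv = join ',', map { $_->as_text } @detail;

print "$refs\n";
print "$csv\n";
```

Note also that the script above opens FILE but prints to standard
output; writing to economic_calendar.dat would need print FILE ...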
------------------------------
Date: Tue, 24 Nov 2009 13:32:38 -0600
From: Tad McClellan <tadmc@seesig.invalid>
Subject: Re: Problem parsing HTML
Message-Id: <slrnhgod11.igk.tadmc@tadbox.sbcglobal.net>
Ninja Li <nickli2000@gmail.com> wrote:
> Hi,
>
> I am trying to parse HTML from the website http://biz.yahoo.com/c/e.html
> using HTML::TreeBuilder module and generate a comma-delimited file.
> use HTML::Tree;
> my @tr = $tree->look_down('_tag' => 'tr',
If the data that you want to scrape is in an HTML table,
then TableExtract is likely to make for prettier and more
robust code:
----------------------------------
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
my $html = get 'http://biz.yahoo.com/c/e.html';
my @headers = (
    'Date',
    "Time",
    'Statistic',
    'For',
    'Actual',
    'Briefing Forecast',
    'Market Expects',
    'Prior',
    "Revised\nFrom",
);
my $te = HTML::TableExtract->new( headers => \@headers );
$te->parse($html);
foreach my $ts ( $te->tables ) {
    foreach my $row ($ts->rows) {
        my $csv = join ',', @$row;
        print "$csv\n";
    }
}
----------------------------------
--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
------------------------------
Date: Tue, 24 Nov 2009 12:52:48 -0800 (PST)
From: Ninja Li <nickli2000@gmail.com>
Subject: Re: Problem parsing HTML
Message-Id: <f907af6b-d43d-4497-8af9-33bcd1ca6e6d@e20g2000vbb.googlegroups.com>
On Nov 24, 2:32 pm, Tad McClellan <ta...@seesig.invalid> wrote:
> Ninja Li <nickli2...@gmail.com> wrote:
> > Hi,
>
> > I am trying to parse HTML from the website http://biz.yahoo.com/c/e.html
> > using HTML::TreeBuilder module and generate a comma-delimited file.
> > use HTML::Tree;
> > my @tr = $tree->look_down('_tag' => 'tr',
>
> If the data that you want to scrape is in an HTML table,
> then TableExtract is likely to make for prettier and more
> robust code:
>
> ----------------------------------
> #!/usr/bin/perl
> use warnings;
> use strict;
> use LWP::Simple;
> use HTML::TableExtract;
>
> my $html = get 'http://biz.yahoo.com/c/e.html';
>
> my @headers = (
>     'Date',
>     "Time",
>     'Statistic',
>     'For',
>     'Actual',
>     'Briefing Forecast',
>     'Market Expects',
>     'Prior',
>     "Revised\nFrom",
> );
>
> my $te = HTML::TableExtract->new( headers => \@headers );
> $te->parse($html);
>
> foreach my $ts ( $te->tables ) {
>     foreach my $row ($ts->rows) {
>         my $csv = join ',', @$row;
>         print "$csv\n";
>     }
> }
>
> ----------------------------------
>
> --
> Tad McClellan
> email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
Tad,
Thanks for your response. HTML::TableExtract looks to be a better
option for dealing with HTML. I tried to apply your code to another
web link, however I didn't get any output. Please advise what might be
wrong. The code is at the end of the post.
Thanks.
Nick
------------------------
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
my $html = 'http://www.earnings.com/conferencecall.asp?client=cb';
my @headers =
(
    'SYMBOL',
    'COMPANY',
    'EVENT TITLE',
    'WEBCAST',
    'TRANSCRIPT',
    'TIME'
);
my $te = HTML::TableExtract->new( headers => \@headers );
$te->parse($html);
foreach my $ts ( $te->tables )
{
    foreach my $row ($ts->rows)
    {
        my $csv = join ',', @$row;
        print "$csv\n";
    }
}
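
One likely cause, offered as a guess rather than a confirmed diagnosis:
$html here holds the URL string itself, because the get call from the
earlier example was dropped, so parse never sees any HTML. A minimal
sketch with the fetch restored. rows_as_csv is a made-up helper name,
the URL is only fetched when passed on the command line, and whether
these header names match that page is untested:

```perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TableExtract;

# Return one CSV line per data row of every table whose header row
# contains all of the given column names.
sub rows_as_csv {
    my ($html, @headers) = @_;
    my $te = HTML::TableExtract->new( headers => \@headers );
    $te->parse($html);
    my @lines;
    foreach my $ts ( $te->tables ) {
        foreach my $row ( $ts->rows ) {
            # Treat any undefined cell as an empty field.
            push @lines, join ',', map { defined $_ ? $_ : '' } @$row;
        }
    }
    return @lines;
}

# Usage: perl script.pl 'http://www.earnings.com/conferencecall.asp?client=cb'
if (@ARGV) {
    my $url  = shift @ARGV;
    my $html = get($url);              # fetch the page, not just name it
    die "Could not fetch $url\n" unless defined $html;
    print "$_\n" for rows_as_csv($html,
        'SYMBOL', 'COMPANY', 'EVENT TITLE', 'WEBCAST', 'TRANSCRIPT', 'TIME');
}
```

If the fetch succeeds and there is still no output, the header strings
probably do not match the page's table headings.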
------------------------------
Date: Tue, 24 Nov 2009 16:36:17 +0000
From: zaphod <abc@def.com>
Subject: Why is "use 5.010" necessary
Message-Id: <zaudnY1MvYkclpHWnZ2dnUVZ8gCdnZ2d@brightview.co.uk>
Why is it necessary to state:
use 5.010;
... simply to have access to Perl 5.10's features? If I didn't want to
use its features I wouldn't have installed it in the first place. 5.10
is backwards compatible in any case so it all seems a bit pointless.
zaphod
------------------------------
Date: Tue, 24 Nov 2009 17:57:07 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: Why is "use 5.010" necessary
Message-Id: <jcjtt6-qt52.ln1@osiris.mauzo.dyndns.org>
Quoth zaphod <abc@def.com>:
> Why is it necessary to state:
>
> use 5.010;
>
> ... simply to have access to Perl 5.10's features? If I didn't want to
> use its features I wouldn't have installed it in the first place. 5.10
> is backwards compatible in any case so it all seems a bit pointless.
It's only necessary to say that if you want a feature that might be
incompatible with code written for previous versions of Perl. For
instance, the '//' operator is always available, but the 'say' operator
isn't, in case you have some code that uses a sub called 'say'.
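
A minimal sketch of that distinction, assuming perl 5.10 or later. The
'//' operator works with no declaration, while 'say' has to be enabled,
here with "use feature 'say'" in a limited scope ("use 5.010;" would
enable it as well). The greeting sub is a made-up name for illustration:

```perl
use strict;
use warnings;

# Defined-or needs no declaration on perl 5.10 and later.
sub greeting {
    my ($name) = @_;
    return 'Hello, ' . ($name // 'world');
}

print greeting(), "\n";
print greeting('zaphod'), "\n";

{
    # 'say' is only a keyword inside the scope that asks for it, so older
    # code that defines its own sub called 'say' keeps working elsewhere.
    use feature 'say';
    say greeting('Ben');
}
```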
Ben
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V11 Issue 2691
***************************************