
Perl-Users Digest, Issue: 2890 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Sat Mar 27 06:09:25 2010

Date: Sat, 27 Mar 2010 03:09:07 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Sat, 27 Mar 2010     Volume: 11 Number: 2890

Today's topics:
    Re: logic question for text file updates sln@netherlands.com
    Re: logic question for text file updates <cartercc@gmail.com>
    Re: logic question for text file updates <cartercc@gmail.com>
    Re: logic question for text file updates sln@netherlands.com
    Re: logic question for text file updates <ben@morrow.me.uk>
    Re: logic question for text file updates <cartercc@gmail.com>
    Re: logic question for text file updates <m@rtij.nl.invlalid>
    Re: logic question for text file updates sln@netherlands.com
    Re: logic question for text file updates <xhoster@gmail.com>
    Re: Perl / cgi / include file a la #include <uri@StemSystems.com>
    Re: Perl / cgi / include file a la #include <cartercc@gmail.com>
    Re: socket transmission <derykus@gmail.com>
    Re: socket transmission <derykus@gmail.com>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Fri, 26 Mar 2010 13:20:27 -0700
From: sln@netherlands.com
Subject: Re: logic question for text file updates
Message-Id: <be5qq5tipahmq4vkc2q5n4hc52uqtb5elv@4ax.com>

On Fri, 26 Mar 2010 19:38:03 +0000, Ben Morrow <ben@morrow.me.uk> wrote:

>
[snip]
>Is there any way of 'blanking' a record? Normal CSV doesn't support
>comments, and if you're importing into Excel you can't extend it to do
>so; what does Excel do if you give it a file like
>
>    one|two|three
>    ||||||||||||||
>    four|five|six

In this case, the newline is the record delimiter, '|' is the
field delimiter.
You can have Excel treat consecutive field delimiters as one.
In this case, the |||||||||||| produces a blank record.
This is the way Excel 2002 works; I don't know whether you can
auto-remove blank records in newer versions.

-sln


------------------------------

Date: Fri, 26 Mar 2010 13:42:08 -0700 (PDT)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: logic question for text file updates
Message-Id: <7ba05202-c63a-4da4-9aaa-7e582d1ffb14@b33g2000yqc.googlegroups.com>

On Mar 26, 3:42 pm, s...@netherlands.com wrote:
> So, if you have a source file, delimited by | that you eventually make
> a dbl quote comma delimited csv file, you could make that status field
> fixed width (what 20 chars tops?) in the source. When you generate the
> dat, csv file, just strip white space from the beginning and end of the
> field before you double quote it to a csv file.

Working backwards, my ultimate output file looks like this:
"id","field2","field3","status","field5","field6"\n

The 'status' field should be the current status, which rarely changes,
but it's critical to use the most current status.

I get about ten update files a year with the current status and a
number of other fields that I don't care about. I take these files,
strip out everything except the ID and the STATUS, and write that data
into memory.

Working frontwards, I build a source file with the two fields I
referenced, like this: [id|status]

I went ahead and bit the bullet, since I had to do something. I (1)
save the source file to a backup, (2) read in the source file and save
it to a hash on the ids, (3) read in the update file the same way, and
(4) print out the hash to the source file. It's reasonably quick, less
than a second (although I haven't benchmarked it) and seems to be
reliable.

That said, I'd like to learn a more elegant way to do it.

CC.

> You know the file position of the previous EOR. Use index() to find
> the pipe '|' char of the status field of the current record (4th in
> the example), add that to the previous EOR to get the write() position
> for the new status (if it changed).
>
> To find out if the status changed, do your split /'|'/ to get all the
> fields, check the ID/status from the update file, write out new "fixed
> width" status (format with printf or something) to the source file.
>
> When it comes time to generate the csv from the source, just trim
> spaces before you write it out.

That sounds a lot more complicated than the brute force approach I
used. But I appreciate your suggestion as treating the files as fixed
width, and I will explore that later.

CC


------------------------------

Date: Fri, 26 Mar 2010 13:49:12 -0700 (PDT)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: logic question for text file updates
Message-Id: <dcbc2ae0-a386-4247-b588-0e2b6be5377b@z3g2000yqz.googlegroups.com>

On Mar 26, 3:38 pm, Ben Morrow <b...@morrow.me.uk> wrote:
> Is there any way of 'blanking' a record? Normal CSV doesn't support
> comments, and if you're importing into Excel you can't extend it to do
> so; what does Excel do if you give it a file like

Actually, I comment CSV files all the time, not for use by Excel, but
for use by my scripts. The 'comments' are on interspersed lines
beginning with #, so I can do this:
while (<INPUT>)
{
  next if /^#/;
  ...
}

>     - read the update file(s) into a hash,
>     - open the source file read/write,
>     - go through it looking for the appropriate records,
>     - when you find one, wipe it out without changing the length or
>       removing the newline,
>     - add the changed records onto the end of the file, since the
>       records weren't in order anyway.

I don't see any real difference between this and reading the entire
file into memory, at least for the size of files I'm dealing with. IO is
always a bottleneck, and unless space is limited it's better to use
space than time.

> It's generally not worth messing around with approaches like this,
> though. Rewriting a file of a few MB doesn't exactly take long, and it's
> much easier to get right.

Yeah, I'm beginning to think that the investment of my time isn't worth
the results.

Thanks, CC.


------------------------------

Date: Fri, 26 Mar 2010 14:15:15 -0700
From: sln@netherlands.com
Subject: Re: logic question for text file updates
Message-Id: <rr8qq5t0r0f7gkd2nva5708736dse4jjjr@4ax.com>

On Fri, 26 Mar 2010 13:42:08 -0700 (PDT), ccc31807 <cartercc@gmail.com> wrote:

>
>That sounds a lot more complicated than the brute force approach I
>used. But I appreciate your suggestion as treating the files as fixed
>width, and I will explore that later.
>
>CC

Actually, the brute force method you cite is far and away the
more complicated approach.

Good luck!

-sln


------------------------------

Date: Fri, 26 Mar 2010 21:26:46 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: logic question for text file updates
Message-Id: <mdlv77-2511.ln1@osiris.mauzo.dyndns.org>


Quoth ccc31807 <cartercc@gmail.com>:
> 
> Working backwards, my ultimate output file looks like this:
> "id","field2","field3","status","field5","field6"\n
> 
> The 'status' field should be the current status, which rarely changes,
> but it's critical to use the most current status.
> 
> I get about ten update files a year with the current status and a
> number of other fields that I don't care about. I take these files,
> strip out everything except the ID and the STATUS, and write that data
> into memory.
> 
> Working frontwards, I build a source file with the two fields I
> referenced, like this: [id|status]
> 
> I went ahead and bit the bullet, since I had to do something. I (1)
> save the source file to a backup, (2) read in the source file and save
> it to a hash on the ids, (3) read in the update file the same way, and
> (4) print out the hash to the source file. It's reasonably quick, less
> than a second (although I haven't benchmarked it) and seems to be
> reliable.
> 
> That said, I'd like to learn a more elegant way to do it.

Umm... that's about the simplest way possible, where the files are small
enough for it to be workable. The only real potential improvement would
be to use a DBM instead of a flat file, since that can be updated
directly rather than rewritten. See DB_File &c. in the core
distribution.
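
The DBM idea in a minimal sketch. DB_File is the module named above;
SDBM_File (also in the core distribution) shares the tie interface and
is used here as a fallback when DB_File isn't installed. The sub name
and file name are invented for illustration:

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);

# Keep id => status in a tied hash so an update touches only the
# changed keys instead of rewriting the whole flat file.
my $dbm_class = eval { require DB_File;   'DB_File' }
             || do   { require SDBM_File; 'SDBM_File' };

sub apply_updates {
    my ($dbfile, %updates) = @_;
    tie my %status, $dbm_class, $dbfile, O_RDWR | O_CREAT, 0666
        or die "tie $dbfile: $!";
    $status{$_} = $updates{$_} for keys %updates;   # in-place update
    untie %status;
}
```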

Ben



------------------------------

Date: Fri, 26 Mar 2010 14:33:36 -0700 (PDT)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: logic question for text file updates
Message-Id: <952f620c-1711-42d3-bccb-1ddecb13e830@g28g2000yqh.googlegroups.com>

On Mar 26, 5:15 pm, s...@netherlands.com wrote:
> Actually, the brute force method you cite is far and away the
> more complicated approach.

I would be very interested in why you think this. It may depend on
your definition of 'complicated.'

In terms of writing the code, it was pretty simple. First, open the
source file and read it into a hash. Second, open the update file and
read it into the SAME(!) hash (thereby overwriting the old values
where the hash keys are duplicated.) Third, write the hash back out to
the source file.
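
For concreteness, the three steps read like this as a minimal sketch;
the id|status layout comes from earlier in the thread, and the sub name
and the comment/blank-line skipping are my own assumptions:

```perl
use strict;
use warnings;

sub merge_updates {
    my ($source, $update) = @_;
    my %status_of;

    # Read source first, update second, so duplicate ids in the update
    # file overwrite the old status.
    for my $file ($source, $update) {
        open my $fh, '<', $file or die "open $file: $!";
        while (my $line = <$fh>) {
            chomp $line;
            next if $line =~ /^#/ or $line =~ /^\s*$/;
            my ($id, $st) = split /\|/, $line, 2;
            $status_of{$id} = $st;
        }
        close $fh;
    }

    # Step three: write the merged hash back out over the source file.
    open my $out, '>', $source or die "open $source: $!";
    print {$out} "$_|$status_of{$_}\n" for sort keys %status_of;
    close $out;
}
```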

As to 'complicated', I have known people who use Access for processing
data files, spending hours on end creating and building Access
databases, queries, and reports to manipulate data. It takes them a
lot longer to generate a report using Access than it does me, using
Perl to munge the data. They say that my way is more 'complicated'
because I use Perl (which is 'harder') and Access is easier. I say my
way is less 'complicated' because I don't have to mess around with
Access. Frankly, when I read some of the scripts you post to c.l.p.m.,
I have a very hard time understanding them, and (from my POV) I would
say that you have a weird conception of 'complicated.'

CC.


------------------------------

Date: Fri, 26 Mar 2010 23:31:10 +0100
From: Martijn Lievaart <m@rtij.nl.invlalid>
Subject: Re: logic question for text file updates
Message-Id: <e6pv77-5d6.ln1@news.rtij.nl>

On Fri, 26 Mar 2010 12:00:17 -0700, ccc31807 wrote:

> The key will always be a seven character integer. The value will always
> be a string with fewer than 20 characters. I COULD use a fixed width
> format, but my current format (for the source file) is pipe separated
> (e.g. 0059485|Current) and all my logic splits input on the pipe symbol.

Hurray for encapsulation. If you had encapsulated this from the start 
(i.e. get a line, call a sub split_to_fields), you would only have had 
to update one sub.

Not much help now, but something to keep in the back of your mind when 
designing your next program.
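
The encapsulation in question, sketched; split_to_fields is the name
used above, and join_fields is an invented counterpart for the write
side:

```perl
use strict;
use warnings;

# Every reader and writer goes through these two subs, so a format
# change (pipe-separated today, fixed width tomorrow) means editing
# one place.
sub split_to_fields {
    my ($line) = @_;
    chomp $line;
    return split /\|/, $line;
}

sub join_fields {
    my (@fields) = @_;
    return join('|', @fields) . "\n";
}
```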

M4


------------------------------

Date: Fri, 26 Mar 2010 16:38:42 -0700
From: sln@netherlands.com
Subject: Re: logic question for text file updates
Message-Id: <rvgqq593ubfqkp6jkmg064q4edfn9gkbfk@4ax.com>

On Fri, 26 Mar 2010 14:33:36 -0700 (PDT), ccc31807 <cartercc@gmail.com> wrote:

>On Mar 26, 5:15 pm, s...@netherlands.com wrote:
>> Actually, the brute force method you cite is far and away the
>> more complicated approach.
>
>I would be very interested in why you think this. It may depend on
>your definition of 'complicated.'
>
>In terms of writing the code, it was pretty simple. First, open the
>source file and read it into a hash. Second, open the update file and
>read it into the SAME(!) hash (thereby overwriting the old values
>where the hash keys are duplicated.) Third, write the hash back out to
>the source file.
>
>As to 'complicated', I have known people who use Access for processing
>data files, spending hours on end creating and building Access
>databases, queries, and reports to manipulate data. It takes them a
>lot longer to generate a report using Access than it does me, using
>Perl to munge the data. They say that my way is more 'complicated'
>because I use Perl (which is 'harder') and Access is easier. I say my
>way is less 'complicated' because I don't have to mess around with
>Access. Frankly, when I read some of the scripts you post to c.l.p.m.,
>I have a very hard time understanding them, and (from my POV) I would
>say that you have a weird conception of 'complicated.'
>
>CC.

Unfortunately, I can't find any scripts you have posted here.

However, if you would like to make a technical comment on anything
I write and post here, feel free to do so. And I will be glad to help
if you run into difficulty understanding it.

Cheers.

-sln


------------------------------

Date: Fri, 26 Mar 2010 20:11:12 -0700
From: Xho Jingleheimerschmidt <xhoster@gmail.com>
Subject: Re: logic question for text file updates
Message-Id: <4bad7a15$0$29696$ed362ca5@nr5-q3a.newsreader.com>

ccc31807 wrote:
> We have a csv source file of many thousands of records, with two
> columns, the ID and a
> status field. It has very recently come to my attention that
> occasionally the status of a record will change, with the change being
> significant enough that the record must be updated before the process
> runs. The update files consist of a small subset, sometimes a very
> small subset, of the records in the source file. (The update file has
> a number of other fields that can change also, but I'm only concerned
> with the status field.)

Is the status field of fixed length?  If so, it can be changed in place.
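
An in-place patch along those lines might look like this. It is a
sketch only: the 7-char id and 20-char padded status are assumptions
taken from earlier posts in the thread, not a real file spec.

```perl
use strict;
use warnings;

# With fixed-length records you can seek straight to the status field
# and overwrite it, padded to the same width, so the file length and
# record positions never change.
use constant REC_LEN    => 7 + 1 + 20 + 1;   # id, '|', status, "\n"
use constant STATUS_OFF => 7 + 1;            # status starts after "id|"

sub patch_status {
    my ($file, $recno, $new_status) = @_;
    open my $fh, '+<', $file or die "open $file: $!";
    seek $fh, $recno * REC_LEN + STATUS_OFF, 0 or die "seek: $!";
    printf {$fh} '%-20s', $new_status;       # pad: length never changes
    close $fh or die "close: $!";
}
```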

> My first inclination is to open the update file, create a hash with
> the ID as the key and the status as value, then open the source file,
> read each line, update the line if it exists in the hash, and write
> each line to a new output file. However, I can think of several
> different ways to do this -- I just don't know which way would be
> best. I don't particularly want to read every line and write every
> line of a source file when only a few lines (if any) need to be
> modified.

When you say the source file has "many thousand" records, how many 
thousand are you talking?  Unless you are talking hundreds of thousands 
or thousands of thousands, I think that even spending the time to worry 
about alternatives to rewriting the file, much less implementing those 
alternatives, is a false economy.

But why write it out at all?  Read the exception file into a hash, 
read in the source file applying the exceptions in memory, and do 
whatever you need to do with the now-accurate in-memory records.  Leave 
the source file as it is, and next time you need to do something with 
it, just re-apply the exception file, again in memory.
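
The no-rewrite approach, sketched; the id|status layout and the sub
name are assumptions from this thread:

```perl
use strict;
use warnings;

# Load the exception file into a hash, then apply it while streaming
# the source, leaving the source untouched on disk.
sub read_with_exceptions {
    my ($source, $exceptions) = @_;

    my %fix;
    open my $ex, '<', $exceptions or die "open $exceptions: $!";
    while (my $line = <$ex>) {
        chomp $line;
        my ($id, $st) = split /\|/, $line, 2;
        $fix{$id} = $st;
    }
    close $ex;

    my @records;
    open my $src, '<', $source or die "open $source: $!";
    while (my $line = <$src>) {
        chomp $line;
        my ($id, $st) = split /\|/, $line, 2;
        $st = $fix{$id} if exists $fix{$id};   # exception wins, in memory
        push @records, [$id, $st];
    }
    close $src;
    return @records;
}
```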

> My second inclination would be to use a database and write an update
> query for the records in the update file. But this seems a heavy
> weight solution to a light weight problem -- I would only be using the
database to modify records, not to do any of the things we ordinarily
> use databases for.

If you already have a database that is maintained and backed up, etc., 
using it for an additional purpose may not be very heavyweight.

Xho


------------------------------

Date: Fri, 26 Mar 2010 17:53:14 -0400
From: "Uri Guttman" <uri@StemSystems.com>
Subject: Re: Perl / cgi / include file a la #include
Message-Id: <87zl1ug379.fsf@quad.sysarch.com>

>>>>> "c" == ccc31807  <cartercc@gmail.com> writes:

  c> On Mar 25, 10:34 pm, me <noem...@nothere.com> wrote:
  >> I have a perl script generating HTML code. I'd like to include some
  >> existing HTML files (e.g. header.htm, footer.htm) in the output,
  >> while the perl program will generate the majority of the html code.

  c> This is a very, very common need.

  c> I create Perl modules which output HTML, and call the functions that
  c> return the HTML in my scripts. Like this:

that doesn't return the html, it prints it directly.

  c>    print qq(<?xml version="1.0" encoding="UTF-8" ?>

  c> HTML::print_header($page);

a better style i teach is called "print rarely, print late". instead of
printing directly, return the string you make up. build up the final
page in a buffer (using .= is easiest) and then you can decide what to
do with it. you can print to stdout as usual, print to a file for later
use by the web server, print to a socket if desired. or more than one of
those at the same time. by printing directly from the html subs you lose
the flexibility. it is also a bit faster to call print one time than
multiple times (.= is faster than print). and you can decide where to
print at the higher level which keeps the logic cleaner. i have seen
requests for dual handles which can print to a file and to stdout (i
think IO::Tee does that) but this is even easier and faster. so the rule
print rarely means build up a full string in a buffer before you print
it. print late means print only when you are done and can decide where
to print it.
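
a minimal sketch of the rule (sub names invented for illustration, not
anyone's actual HTML module):

```perl
use strict;
use warnings;

# "print rarely, print late": the page-building subs return strings, a
# buffer accumulates them with .=, and a single print at the end
# decides where the output goes.
sub page_header { my ($title) = @_; return "<html><head><title>$title</title></head><body>\n" }
sub page_para   { my ($text)  = @_; return "<p>$text</p>\n" }
sub page_footer { return "</body></html>\n" }

sub build_page {
    my ($title, @paras) = @_;
    my $page = '';
    $page .= page_header($title);
    $page .= page_para($_) for @paras;
    $page .= page_footer();
    return $page;   # caller decides: STDOUT, a file, a socket, or all three
}
```

at the top level the caller then does a single `print build_page(...)`,
or writes the same string to a file or socket.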

uri

-- 
Uri Guttman  ------  uri@stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------


------------------------------

Date: Fri, 26 Mar 2010 15:21:01 -0700 (PDT)
From: ccc31807 <cartercc@gmail.com>
Subject: Re: Perl / cgi / include file a la #include
Message-Id: <74805479-f5f9-46c9-8178-a05863581fa1@e7g2000yqf.googlegroups.com>

On Mar 26, 5:53 pm, "Uri Guttman" <u...@StemSystems.com> wrote:
> a better style i teach is called "print rarely, print late". instead of
> printing directly, return the string you make up.

In Paul Graham's book 'On Lisp' in chapter 3, 'Functional
Programming', he makes an extended case for this, which I find
compelling but not necessarily persuasive. I've been slowly making my
way through the dead tree version of HOP by MJD, and I also have found
a lot to like in that. My problems are (as Graham notes) an imperative
habit, a lack of opportunity, and little discretionary time.

> build up the final
> page in a buffer (using .= is easiest) and then you can decide what to
> do with it.

I actually use this quite a bit in my day job. I almost always find
myself constructing and deconstructing strings and arrays (and
hashes), moving between them.
(I've also found the ||= operator useful.)

> you can print to stdout as usual, print to a file for later
> use by the web server, print to a socket if desired. or more than one of
> those at the same time. by printing directly from the html subs you lose
> the flexibility. it is also a bit faster to call print one time than
> multiple times (.= is faster than print). and you can decide where to
> print at the higher level which keeps the logic cleaner.

As an explanation if not a defense of what I wrote previously, most of
my HTML consists of front ends to databases, and most of the heavy
lifting is the SQL part. When I write my HTML files, I use variables
to trigger both the SQL and the HTML. I quite frequently end up with a
cgi script that looks something like this (illustration only):

HTML::print_header(@vars1);
HTML::print_banner(@vars2);
HTML::print_menu(@vars3);
my $hashref = SQL::get_list($var1);
HTML::print_content($hashref);
$hashref = SQL::get_calendar($var2);
HTML::print_content($hashref);
HTML::print_footer();
exit;

> i have seen
> requests for dual handles which can print to a file and to stdout (i
> think IO::Tee does that) but this is even easier and faster. so the rule
> print rarely means build up a full string in a buffer before you print
> it. print late means print only when you are done and can decide where
> to print it.

In this case, I am 'printing' a response to an HTTP request, not a
physical device. I also do some JSP, and for some reason, the JSP
doesn't seem any faster than my CGI, and sometimes is noticeably
slower.

CC


------------------------------

Date: Fri, 26 Mar 2010 17:39:39 -0700 (PDT)
From: "C.DeRykus" <derykus@gmail.com>
Subject: Re: socket transmission
Message-Id: <9bf6c69b-8a71-44f9-9a36-d160aeabe8f7@c20g2000prb.googlegroups.com>

On Mar 26, 10:28 am, j...@toerring.de (Jens Thoms Toerring) wrote:

>> ...

>  And if you decide
> you want to send the '\0' character anyway try
>
> print $client "POK\000";
>

Hm, although not the OP's C client, a Perl
one would've picked it up:

server: print $socket "POK\0\n";

client: printf "%s %d %d\n", substr($ack,0,3),
        ord(substr $ack,3,1),ord(substr $ack,4,1);

   --> POK 0 10


--
Charles DeRykus


------------------------------

Date: Fri, 26 Mar 2010 18:04:20 -0700 (PDT)
From: "C.DeRykus" <derykus@gmail.com>
Subject: Re: socket transmission
Message-Id: <3243625d-c4e0-4750-9171-b0d86a46c4b5@k4g2000prh.googlegroups.com>

On Mar 26, 5:39 pm, "C.DeRykus" <dery...@gmail.com> wrote:
> On Mar 26, 10:28 am, j...@toerring.de (Jens Thoms Toerring) wrote:
>
> >> ...
> > And if you decide
> > you want to send the '\0' character anyway try
>
> > print $client "POK\000";
>
> Hm, although not the OP's C client, a Perl
> one would've picked it up:
>
> server: print $socket "POK\0\n";
>
> client: printf "%s %d %d\n", substr($ack,0,3),
>         ord(substr $ack,3,1),ord(substr $ack,4,1);
>
>    --> POK 0 10

No, I see that I added "\n" which changes the
scenario.

--
Charles DeRykus



------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending perl questions to the -request address, I don't have time to
answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 2890
***************************************

