[102420] in RedHat Linux List


RE: parsing text files

daemon@ATHENA.MIT.EDU (Charles Galpin)
Wed Dec 2 23:56:25 1998

Date: Wed, 2 Dec 1998 23:53:33 -0500
From: Charles Galpin <cgalpin@lighthouse-software.com>
To: "Rick L. Mantooth" <redhat-list@redhat.com>
Resent-From: redhat-list@redhat.com
Reply-To: redhat-list@redhat.com

Hi all

This is fun. I get to show you how Perl really shines in these situations.

First, I agree with Jan Carlson that if you had a fixed delimiter like a 
tab, you could just do this with

cut -f1 datafile > firstcol.out
cut -f2 datafile > secondcol.out
cut -f3 datafile > thirdcol.out
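
As a quick sanity check, something like the following sketch shows the same cut call on a small tab-separated sample. The nth_col name and the sample lines are only for illustration; cut splits on TAB by default, which is why no -d option is needed.

```shell
# nth_col N: print the Nth tab-separated field of each line read from stdin.
# (nth_col is a hypothetical helper name, not something cut provides.)
nth_col() {
    cut -f"$1"
}

# Two sample lines; fields are separated by a single TAB, and the first
# field contains a space, which cut leaves alone.
printf 'ab c\t123\tLineOne\nde f\t456\tLineTwo\n' | nth_col 1
```

Because only the tab counts as a delimiter here, the space inside the first field does no harm.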

And he is correct that for one-off jobs, most editors like emacs and vi will 
let you select columns.

But as Rick Mantooth said

>Oh for the world to be easier...;)

right before scaring me with all these crazy ways of attacking this problem.

Since there is no fixed delimiter, what you want is a bit of logic that 
says: if there are X or more consecutive whitespace characters, consider 
them a delimiter. This is easily done in the Perl script further down. I've 
written it verbosely so you can read it, but it is pretty much a one-liner. 
If I chose 2 spaces as the delimiter, it would just be

perl -n -e ' print "$1\n" if /^(.*?)\s{2,}(.*?)\s{2,}(.*?)$/;' datafile > firstcol.out
perl -n -e ' print "$2\n" if /^(.*?)\s{2,}(.*?)\s{2,}(.*?)$/;' datafile > secondcol.out
perl -n -e ' print "$3\n" if /^(.*?)\s{2,}(.*?)\s{2,}(.*?)$/;' datafile > thirdcol.out

Anyway here it is

----- cut ------
#!/usr/bin/perl -n

# the column you want to cut
# for this data set, you would want to run this with the values 1,2,3
$column = 1;

# the minimum number of whitespace characters to call a delimiter
# in this example 2 is sufficient, but you could increase this if need be
$minSpaces = 2;

# if we match the form of DATA<2 or more spaces>DATA<2 or more spaces>DATA
# then print the column we are interested in
if ( /^(.*?)\s{$minSpaces,}(.*?)\s{$minSpaces,}(.*?)$/ )
{
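    # ${$column} is a symbolic reference: with $column = 1 it means $1
    # (this only works when "use strict" is not in effect)
    print "${$column}\n";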
}
----- cut ------

Here are the key features of Perl used.

o The -n switch tells perl to treat all arguments to this script as 
filenames, open each one, and run the script over every line of each 
file, printing nothing by default. -p would do the same but also print 
every line.

o The () are capturing parentheses: whatever matches inside them is saved 
in the variables $1, $2, etc.

o A ? right after a quantifier such as * turns off Perl's naturally greedy 
matching. Without it, .* would grab as much of the line as it could, so the 
first column would run up to the last delimiter instead of the first.

o \s matches any whitespace character.

o {n,} after an expression says match it at least n times.

Try it!

hth
Charles

Here is the rest of Rick's email to compare with:

>
>File one:
>abc	123	LineOne
>def	456	LineTwo
>ghi	789	LineThree
>
>$ nawk '{print $1}' one
>abc
>def
>ghi
>(Ah, life is grand)
>
>File two:
>ab c	123	LineOne
>de f	456	LineTwo
>gh i	789	LineThree
>  ^ is a SPACE
>
>$ nawk 'BEGIN{FS="     "}{print $1}' < two
>                ^^^^^ is a TAB
>will give the results:
>ab c
>de f
>gh i
>
>Now for a little fun.
>Assuming TAB is the Field Separator for the columns and words
>with spaces between them belong in the same field. (ass_u_me)
>(Above nawk would work here also)
>
>File three:
>a b c   123     Line One
>de f    456     LineTwo
>gh i    7 89    Line Three
>
>$ sed -e 's/ /_/g' three | nawk '{print $1}' | sed -e 's/_/ /g'
>a b c
>de f
>gh i
>
>Let sed swap the SPACES to _ , hand it off to nawk and then let
>sed swap _ back to SPACE.
>
>Note:
>Not *thoroughly tested*, just some quickies.
>
>Rick
>
>On Wed, 2 Dec 1998, Ed Lazor wrote:
>
>>
>> You're right and you're wrong.  It turns out that the column of
>> information I was actually pulling information from only had one word -
>> so I lucked out *grin*.  This wouldn't have worked in the example I gave
>> though if I were trying to pull the person's last and first name because
>> they are separate words and would have been parsed into separate
>> variables.  So out of curiosity, how would have you gone about
>> separating the column if it had multiple words and the number of words
>> varied?
>>
>--
>Rick L. Mantooth	rickdman@cyberramp.net
>http://www.cyberramp.net/~rickdman
>I live with FEAR and sometimes she lets me go fishing!
>

-- Charles Galpin   <cgalpin@lighthouse-software.com>



