[102420] in RedHat Linux List
RE: parsing text files
daemon@ATHENA.MIT.EDU (Charles Galpin)
Wed Dec 2 23:56:25 1998
Date: Wed, 2 Dec 1998 23:53:33 -0500
From: Charles Galpin <cgalpin@lighthouse-software.com>
To: "Rick L. Mantooth" <redhat-list@redhat.com>
Resent-From: redhat-list@redhat.com
Reply-To: redhat-list@redhat.com
Hi all
This is fun. I get to show you how Perl really shines in these situations.
First, I agree with Jan Carlson that if you had a fixed delimiter like a
tab, you could just do this with
cut -f1 datafile > firstcol.out
cut -f2 datafile > secondcol.out
cut -f3 datafile > thirdcol.out
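To see why the tab case is the easy one, here is a small sketch. The sample
data is made up for illustration; only the cut invocations come from the
lines above. Because cut splits on TAB by default, spaces inside a field
are left alone.

```shell
# Build a small tab-delimited sample file (contents are made up for illustration).
printf 'ab c\t123\tLine One\nde f\t456\tLineTwo\n' > datafile

# cut splits on TAB by default, so embedded spaces stay inside their field.
cut -f1 datafile > firstcol.out
cut -f3 datafile > thirdcol.out

cat firstcol.out
cat thirdcol.out
```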
And he is correct that for one-off jobs, most editors like emacs and vi will
let you select columns.
But as Rick Mantooth said
>Oh for the world to be easier...;)
right before scaring me with all these crazy ways of attacking this problem.
Since there is no fixed delimiter, what you want is a bit of logic that
says: if there are X or more consecutive whitespace characters, consider
them a delimiter. This is easily done in the following Perl script. I've
written it verbosely so you can read it, but it is pretty much a one-liner.
If I chose 2 spaces as the minimum delimiter width, it would just be
perl -n -e 'print "$1\n" if /^(.*?)\s{2,}(.*?)\s{2,}(.*?)$/;' datafile > firstcol.out
perl -n -e 'print "$2\n" if /^(.*?)\s{2,}(.*?)\s{2,}(.*?)$/;' datafile > secondcol.out
perl -n -e 'print "$3\n" if /^(.*?)\s{2,}(.*?)\s{2,}(.*?)$/;' datafile > thirdcol.out
Anyway here it is
----- cut ------
#!/usr/bin/perl -n
# the column you want to cut
# for this data set, you would want to run this with the values 1,2,3
$column = 1;
# the minimum number of whitespace characters to call a delimiter
# in this example 2 is sufficient, but you could increase this if need be
$minSpaces = 2;
# if we match the form of DATA<2 or more spaces>DATA<2 or more spaces>DATA
# then print the column we are interested in
if ( /^(.*?)\s{$minSpaces,}(.*?)\s{$minSpaces,}(.*?)$/ )
{
    # ${$column} is a symbolic reference to $1, $2 or $3
    # (this only works without "use strict")
    print "${$column}\n";
}
----- cut ------
Here are the key features of Perl used.
o The -n switch tells perl to treat all arguments to the script as
filenames, open each one, and run the script over every line of each
file, printing nothing by default. -p would do the same but also print
every line.
o The () are capturing groups; whatever matches inside them is saved
in the variables $1, $2, etc.
o The ? after a quantifier (as in .*?) turns off Perl's naturally greedy
matching. Without it, .* grabs as much as it can and only gives characters
back when the rest of the pattern fails to match, so a column could swallow
extra spaces or even a neighbouring column.
o \s matches any whitespace character.
o The {n,} after an expression says match it at least n times.
Try it!
hth
Charles
Here is the rest of Rick's email to compare with:
>
>File one:
>abc 123 LineOne
>def 456 LineTwo
>ghi 789 LineThree
>
>$ nawk '{print $1}' one
>abc
>def
>ghi
>(Ah, life is grand)
>
>File two:
>ab c 123 LineOne
>de f 456 LineTwo
>gh i 789 LineThree
> ^ is a SPACE
>
>$ nawk 'BEGIN{FS="\t"}{print $1}' < two
>                  ^^ is a TAB
>will give the results:
>ab c
>de f
>gh i
>
>Now for a little fun.
>Assuming TAB is the Field Separator for the columns and words
>with spaces between them belong in the same field. (ass_u_me)
>(Above nawk would work here also)
>
>File three:
>a b c 123 Line One
>de f 456 LineTwo
>gh i 7 89 Line Three
>
>$ sed -e 's/ /_/g' three | nawk '{print $1}' | sed -e 's/_/ /g'
>a b c
>de f
>gh i
>
>Let sed swap the SPACES to _ , hand it off to nawk and then let
>sed swap _ back to SPACE.
>
>Note:
>Not *thoroughly tested*, just some quickies.
>
>Rick
>
>On Wed, 2 Dec 1998, Ed Lazor wrote:
>
>>
>> You're right and you're wrong. It turns out that the column of
>> information I was actually pulling information from only had one word -
>> so I lucked out *grin*. This wouldn't have worked in the example I gave
>> though if I were trying to pull the person's last and first name because
>> they are separate words and would have been parsed into separate
>> variables. So out of curiosity, how would have you gone about
>> separating the column if it had multiple words and the number of words
>> varied?
>>
>--
>Rick L. Mantooth rickdman@cyberramp.net
>http://www.cyberramp.net/~rickdman
>I live with FEAR and sometimes she lets me go fishing!
>
-- Charles Galpin <cgalpin@lighthouse-software.com>