[17783] in Perl-Users-Digest
Perl-Users Digest, Issue: 5203 Volume: 9
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Tue Dec 26 21:10:31 2000
Date: Tue, 26 Dec 2000 18:10:14 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Message-Id: <977883014-v9-i5203@ruby.oce.orst.edu>
Content-Type: text
Perl-Users Digest Tue, 26 Dec 2000 Volume: 9 Number: 5203
Today's topics:
Re: Parsing mountain_arts
Re: Parsing (Garry Williams)
Re: Parsing (Abigail)
Re: Parsing (Tad McClellan)
perl/c call <bionic@engineer.com>
Re: perl/c call <danny@lennon.postino.com>
Re: regexp with NULLs <Juha.Laiho@iki.fi>
Re: subroutines and references <bh_ent@my-deja.com>
Re: testing if $_ is equal to some string? (Tim Hammerquist)
Re: Text Conversion in Perl <bart.lateur@skynet.be>
Re: Text Conversion in Perl (Martijn Lievaart)
Re: Text Conversion in Perl <JD@.Tallorno.net>
Digest Administrivia (Last modified: 16 Sep 99) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Tue, 26 Dec 2000 21:30:46 GMT
From: mountain_arts
Subject: Re: Parsing
Message-Id: <3a490d4d.9113558@news.dreamsoft.com>
abigail@foad.org (Abigail) wrote:
>Simon Stiefel (SiStie@nuclear-network.de) wrote on MMDCLXXIII September
>MCMXCIII in <URL:news:Pine.LNX.4.31.0012252154170.3664-100000@server.stiefel.priv>:
>// Hi,
>//
>// I want to parse the output of "finger".
>// The output could be something like this:
>//
>// root root 1 1 Mon 20:10
>// root root 2 - Mon 20:10
>// sistie Simon Stiefel *3 - Mon 20:12
>// ^^^^^^^^^^|^^^^^^^^^^^^^^^^^^^^|^^^^^^^^^^^^|^^^^^|^^^^^^^^^^
>// $1 $2 $3 $4 $5
>//
>// Now, I want to parse those "areas" and save them in different variables
>// (see above).
>//
>// So, uhm, does anybody have an idea how to realize that? =)
>
>
>unpack
>
>
>
>Abigail
why not split?
------------------------------
Date: Tue, 26 Dec 2000 23:06:08 GMT
From: garry@zvolve.com (Garry Williams)
Subject: Re: Parsing
Message-Id: <Av926.1400$Kk5.68009@eagle.america.net>
On Tue, 26 Dec 2000 21:30:46 GMT, mountain_arts <mountain_arts> wrote:
>abigail@foad.org (Abigail) wrote:
>
>>Simon Stiefel (SiStie@nuclear-network.de) wrote on MMDCLXXIII September
>>MCMXCIII in <URL:news:Pine.LNX.4.31.0012252154170.3664-100000@server.stiefel.priv>:
>>// Hi,
>>//
>>// I want to parse the output of "finger".
>>// The output could be something like this:
>>//
>>// root root 1 1 Mon 20:10
>>// root root 2 - Mon 20:10
>>// sistie Simon Stiefel *3 - Mon 20:12
>>// ^^^^^^^^^^|^^^^^^^^^^^^^^^^^^^^|^^^^^^^^^^^^|^^^^^|^^^^^^^^^^
>>// $1 $2 $3 $4 $5
>>//
>>// Now, I want to parse those "areas" and save them in different variables
>>// (see above).
>>//
>>// So, uhm, does anybody have an idea how to realize that? =)
>>
>>unpack
>>
>>Abigail
>why not split?
Because there exists no pattern to consistently split upon. (E.g.,
check the second field.)
--
Garry Williams
------------------------------
Date: 26 Dec 2000 23:14:54 GMT
From: abigail@foad.org (Abigail)
Subject: Re: Parsing
Message-Id: <slrn94i9je.mi8.abigail@tsathoggua.rlyeh.net>
mountain_arts (mountain_arts) wrote on MMDCLXXIV September MCMXCIII in
<URL:news:3a490d4d.9113558@news.dreamsoft.com>:
[] abigail@foad.org (Abigail) wrote:
[]
[] >Simon Stiefel (SiStie@nuclear-network.de) wrote on MMDCLXXIII September
[] >MCMXCIII in <URL:news:Pine.LNX.4.31.0012252154170.3664-100000@server.stiefel.priv>:
[] >// Hi,
[] >//
[] >// I want to parse the output of "finger".
[] >// The output could be something like this:
[] >//
[] >// root root 1 1 Mon 20:10
[] >// root root 2 - Mon 20:10
[] >// sistie Simon Stiefel *3 - Mon 20:12
[] >// ^^^^^^^^^^|^^^^^^^^^^^^^^^^^^^^|^^^^^^^^^^^^|^^^^^|^^^^^^^^^^
[] >// $1 $2 $3 $4 $5
[] >//
[] >// Now, I want to parse those "areas" and save them in different variables
[] >// (see above).
[] >//
[] >// So, uhm, does anybody have an idea how to realize that? =)
[] >
[] >
[] >unpack
[] >
[]
[] why not split?
Because that's a lot easier.
Or could you give a correct, understandable regex to split on? Look
carefully at the third line of the example.
Abigail
--
$"=$,;*{;qq{@{[(A..Z)[qq[0020191411140003]=~m[..]g]]}}}=*_;
sub _ {push @_ => /::(.*)/s and goto &{ shift}}
sub shift {print shift; @_ and goto &{+shift}}
Hack ("Just", "Perl ", " ano", "er\n", "ther "); # 20001226
------------------------------
Date: Tue, 26 Dec 2000 15:18:27 -0500
From: tadmc@metronet.com (Tad McClellan)
Subject: Re: Parsing
Message-Id: <slrn94hv8j.crb.tadmc@magna.metronet.com>
mountain_arts <mountain_arts> wrote:
>abigail@foad.org (Abigail) wrote:
>
>>Simon Stiefel (SiStie@nuclear-network.de) wrote on MMDCLXXIII September
>>MCMXCIII in <URL:news:Pine.LNX.4.31.0012252154170.3664-100000@server.stiefel.priv>:
>>// I want to parse the output of "finger".
>>// The output could be something like this:
>>//
>>// sistie Simon Stiefel *3 - Mon 20:12
>>// ^^^^^^^^^^|^^^^^^^^^^^^^^^^^^^^|^^^^^^^^^^^^|^^^^^|^^^^^^^^^^
>>// $1 $2 $3 $4 $5
>>unpack
>why not split?
Maybe because it will not work?
Working is a nice feature in a "solution" :-)
split() is for data with separators, finger(1)'s data does not
have separators.
unpack() is for fixed-width fields, finger(1)'s data is
fixed-width fields.
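A minimal sketch of the unpack() approach follows. The column widths in the template (10, 20, 12, 5, then the rest) are an assumption read off the caret ruler in the original question, and parse_finger_line() is a made-up name; measure the real widths against finger(1) output on your own system.

```perl
use strict;
use warnings;

# Assumed column widths, for illustration only.
my $template = 'A10 A20 A12 A5 A*';

sub parse_finger_line {
    my ($line) = @_;
    # 'A' takes a fixed number of characters and strips trailing spaces.
    return unpack $template, $line;
}

# Build a sample line padded to the assumed widths.
my $line = sprintf '%-10s%-20s%-12s%-5s%s',
    'sistie', 'Simon Stiefel', '*3', '-', 'Mon 20:12';

my ($login, $name, $tty, $idle, $when) = parse_finger_line($line);
print join('|', $login, $name, $tty, $idle, $when), "\n";

# For comparison: splitting on whitespace breaks the real name apart.
my @words = split ' ', $line;
print scalar(@words), " fields from split\n";
```

With unpack() the name "Simon Stiefel" stays in one field, while split on whitespace cuts it in two and shifts every later field.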
--
Tad McClellan SGML consulting
tadmc@metronet.com Perl programming
Fort Worth, Texas
------------------------------
Date: Tue, 26 Dec 2000 15:12:47 -0600
From: bionic_man <bionic@engineer.com>
Subject: perl/c call
Message-Id: <3A4909CF.7CDBF4F9@engineer.com>
Hello all:
What is the quickest way to do the following:
-I'd like to be able to write a C function that is accessible from a Perl
script (I suppose they call it an xsub)
-being able to call another C function from Perl.
-what would the makefile look like?
It would be great if someone can give me an example.
Thanks all
------------------------------
Date: Tue, 26 Dec 2000 22:24:44 GMT
From: Danny Aldham <danny@lennon.postino.com>
Subject: Re: perl/c call
Message-Id: <92b5c6$af5$1@lennon.postino.com>
bionic_man <bionic@engineer.com> wrote:
> What is the quickest way to do the following:
> -I'd like to be able to write a C function that is accessible from a Perl
> script (I suppose they call it an xsub)
> -being able to call another C function from Perl.
> -what would the makefile look like?
> It would be great if someone can give me an example.
You might want to try out the Inline module, eg lifted from man page:
#!/usr/bin/perl -w
print " 9 + 16 = ", add(9,16) , "\n";
print " 9 - 16 = ", subtract(9,16) , "\n";
use Inline C => <<'END_OF_C_CODE';
int add(int x, int y) {
return x + y ;
}
int subtract(int x, int y) {
return x - y ;
}
END_OF_C_CODE
exit ;
--
Danny Aldham Providing Certified Internetworking Solutions to Business
www.postino.com E-Mail, Web Servers, Web Databases, SQL PHP & Perl
------------------------------
Date: 26 Dec 2000 19:14:09 +0200
From: Juha Laiho <Juha.Laiho@iki.fi>
Subject: Re: regexp with NULLs
Message-Id: <92ajl1$t0q$1@ichaos.ichaos-int>
Steven Fletcher <flec@flec.co.uk> said:
[having NUL-terminated strings in a db]
>However the function that reads the data simply takes 32 bytes from the
>filehandle and returns it. I attempted to write a function that would
>strip anything following a \00 from the data supplied, which went
>something like this:
>
>sub read_memo_sender {
> ($string) = @_;
> if ($string =~ /^(\w+)\00/) {
> return($1);
> } else {
> return($string);
> }
>}
Almost good. I think the RE should be
/^(.*?)\0/
I.e. "non-greedily" match anything (even an empty string) preceding a NUL.
Non-greediness means that, unlike /^(.*)\0/, the above RE will stop at the
first NUL encountered.
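As a sketch, the function rewritten around that RE might look like this. The /s modifier is an addition not in the suggestion above: it is there on the assumption that the field could contain a newline, which "." would otherwise refuse to match. The sample data is made up.

```perl
use strict;
use warnings;

sub read_memo_sender {
    my ($string) = @_;
    # Non-greedy match up to the first NUL; /s lets . match newlines too.
    # If there is no NUL at all, return the string unchanged.
    return $string =~ /^(.*?)\0/s ? $1 : $string;
}

my $field = "steven\0\0junk\0more";
print read_memo_sender($field), "\n";
print read_memo_sender("no terminator"), "\n";
```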
--
Wolf a.k.a. Juha Laiho Espoo, Finland
(GC 3.0) GIT d- s+: a- C++ UH++++$ UL++++ P+@ L+++ E(-) W+$@ N++ !K w !O
!M V PS(+) PE Y+ PGP(+) t- 5? !X R tv--- b+ DI? D G e+ h--- r+++ y+
"...cancel my subscription to the resurrection!" (Jim Morrison)
------------------------------
Date: Tue, 26 Dec 2000 19:01:31 GMT
From: Drew Myers <bh_ent@my-deja.com>
Subject: Re: subroutines and references
Message-Id: <92apu7$oqg$1@nnrp1.deja.com>
In article <Wp526.23435$bw.1592656@news.flash.net>,
egwong@netcom.com wrote:
[ cut ]
> ...do you have any particular reason why you're returning a list of
> references to scalars rather than just a list of scalars themselves?
> Something like
> return ($high, $low, $avg);
The original reason I used references to scalars was because not all
the data is returned as scalars (although they were all scalars in the
sub I asked about). I thought it would be more readable if I was
consistent in my variable-passing usage if I referenced the scalars as
well as the arrays, rather than just the arrays. Form-wise, is it
better to use the references just for the arrays? What's the
compilation cost difference?
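For concreteness, the mixed style being asked about (plain scalars, with a reference only for the array) might look like the sketch below. summarize() and its data are hypothetical and not taken from the original code.

```perl
use strict;
use warnings;

# Hypothetical stats routine: scalars come back as plain values,
# and only the array comes back as a reference.
sub summarize {
    my @samples = @_;
    my @sorted  = sort { $a <=> $b } @samples;
    my $sum     = 0;
    $sum += $_ for @samples;
    my $avg = @samples ? $sum / @samples : 0;
    return ($sorted[-1], $sorted[0], $avg, \@sorted);
}

my ($high, $low, $avg, $sorted) = summarize(3, 9, 6);
print "high=$high low=$low avg=$avg sorted=@$sorted\n";
```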
Thanks again,
Drew
------------------------------
Date: Wed, 27 Dec 2000 01:24:13 GMT
From: tim@degree.ath.cx (Tim Hammerquist)
Subject: Re: testing if $_ is equal to some string?
Message-Id: <slrn94iha3.9cd.tim@degree.ath.cx>
Jerome O'Neil <jerome@activeindexing.com> wrote:
> > open(XMLFILE, "<blah.xml");
> > $i=0;
> > while(<XMLFILE>) {
> > @xmlFileArray[$i] = $_;
> > $i++;
> > }
> > close(XMLFILE);
>
> Is $i needed here? How about something realy easy like
>
> @xmlFileArray = <XMLFILE>; # Whats with the StuDlyCaps?
The syntax used to call functions (ie, func(args) vs. func args;) along
with the unnecessary use of indexing indicates experience with a much
less Perlish language, possibly Python, but more likely C/C++ or Java.
The variable capitalisation makes me lean more toward Java.
Perl's derivation from many sources, its multifunctional culture, and its
"natural language" design have brought with them the concept of idioms.
These often cause culture shock for people coming from highly anal...I
mean, um, pragmatic languages. One example where Perl makes simple jobs
easy:
A C programmer might be tempted to process each file provided as an
argument one line at a time (but now in Perl) thus:
: for $file (@ARGV) {
: open(FILE, "<$file") || die("Can't open $file: $!\n");
: while(<FILE>) {
: # do stuff
: }
: close(FILE);
: }
...but this is a fairly common task, especially in systems programming
or log parsing (we mustn't forget Perl's roots!) =) Why do all that?
: while(<>) {
: # do stuff; $_ will be each subsequent line
: }
By no means do you _have_ to use the idioms, but you can save yourself a
lot of work (and evil typing!) if you become familiar with them. =)
HTH
--
-Tim Hammerquist <timmy@cpan.org>
A child of five would understand this.
Send someone to fetch a child of five.
-- Groucho Marx
------------------------------
Date: Tue, 26 Dec 2000 22:45:52 GMT
From: Bart Lateur <bart.lateur@skynet.be>
Subject: Re: Text Conversion in Perl
Message-Id: <9t6i4t87hesmccfh7jbrrcudbbmb11ocrd@4ax.com>
Mario Thomas wrote:
>The problem i have is that the file contains all
>sorts of strange characters. I can only
>assume they have been put there by Quark on the Mac. Is there anyway i can
>convert these characters to PC format using Perl?
Yup, you're right. The reason is that the Mac's character set and the
PC's (Windows) character set are different. Now since you want to place
the texts on the Internet, converting the characters to ISO-Latin-1 will
be good enough. And: ISO-Latin-1 is a subset of Unicode. Somebody
suggested getting the character set tables from Unicode.org's FTP site,
and that's a good suggestion. All you need is this file:
<ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT>.
Now, about the file format: these are plain (ASCII) text files,
with per line either just a comment, starting with "#", or a line with
the encoding for one character. Let's take an example:
0xCA 0x00A0 # NO-BREAK SPACE
There are three columns in each line, separated by tabs. The first
column is the character code in the proprietary set, the second column
is the character code in Unicode (ISO-Latin-1, if it's less than 256),
both in hexadecimal; and the third column is a comment, a description of
the character in plain text. This one character is the non-breaking space,
0x00A0 = 160 in ISO-Latin-1, and 0xCA = 202, on the Mac. So what you
need to do, is replace characters with code 202 by characters with code
160. Amongst others.
So, here's some more complete code to do this conversion for you.
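open IN, "Apple/Roman.txt" or die "Cannot open file: $!";
# substitute the correct path for the file you downloaded
while(<IN>) {
    # skip comment lines; keep "Mac code <tab> Unicode code <tab>" lines
    /^\s*(0x[0-9a-fA-F]+)\t(0x[0-9a-fA-F]+)\t/ or next;
    $replace{chr hex $1} = chr hex $2;
}
close IN;
# convert the data file:
while(<>) {
    # characters without a mapping are left untouched
    s/([\200-\377])/exists $replace{$1} ? $replace{$1} : $1/ge;
    print;
}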
Some final remarks:
* the Mac's characters with codes below 128 are plain ASCII, i.e. the
same as ISO-Latin-1. No need to convert anything there, apart from the
line-ends: the Mac uses CR only, the PC takes CR+LF, Unix takes only LF.
Note that in Perl on PC, "\n" is LF only, but this gets converted into
CR+LF when printing to a text file (the normal file mode).
* I do not check for characters that are not in ISO-Latin-1 (codes 256
and above), or even in Unicode (e.g. the Apple symbol). You won't
encounter too many of those, I hope.
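The line-end remark above can be handled separately. Here is a rough sketch, assuming the whole text is already in one string; normalize_line_ends() is a made-up name, and the octal escapes \015 (CR) and \012 (LF) are used instead of "\r" and "\n" so the meaning does not depend on the platform.

```perl
use strict;
use warnings;

# Turn Mac (CR) and PC (CR+LF) line ends into Unix (LF) line ends.
sub normalize_line_ends {
    my ($text) = @_;
    # A CR, optionally followed by an LF, becomes a single LF.
    $text =~ s/\015\012?/\012/g;
    return $text;
}

my $mixed = "one\015two\015\012three\012";
my $clean = normalize_line_ends($mixed);
print $clean eq "one\012two\012three\012" ? "ok\n" : "not ok\n";
```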
--
Bart.
------------------------------
Date: 27 Dec 2000 00:17:01 GMT
From: xnews.public.home@bogus.rtij.nl (Martijn Lievaart)
Subject: Re: Text Conversion in Perl
Message-Id: <Xns90171B51youdontwannaknow@194.109.6.74>
"Mario Thomas" <mario@alamar.net> wrote in <Xd%16.2760$I5.29687@stones>:
[ Requoted, please answer below the quotes. Thanks ]
[ Please reply to the article you are replying to, makes it easier for
others to see what exactly you are replying to. Thanks. ]
>
>"Mario Thomas" <mario@alamar.net> wrote in message
>news:J6Z16.2530$I5.28530@stones...
>> Hi All,
>>
>> I'm receiving a text file from another department in my company which
>> contains an article which has previously been published in a
>> newspaper. The purpose of the file is to upload it to the net - thereby
>> replicating our newspaper content online. The problem i have is that
>> the file contains all sorts of strange characters. I can only
>> assume they have been put there by Quark on the Mac. Is there anyway i
>> can convert these characters to PC format using Perl? I have pasted in
>> a sample below:
>> <START TEXT>
>>
>> The Phillips Report into the BSE crisis found there had been Òa clear
>> policy restricting the disclosure of information about BSEÓ that robbed
>> those with an interest of any power to react. MP Tony Benn reckons
>> backbench MPs need a Freedom of Information Act as much as anybody,
>> such is the ethos of secrecy around the higher Zchelons of government.
>> But, say disappointed campaigners, this Bill is fairly toothless.
>>
>> <ENDTEXT>
>>
>>
>I thought about using tr/// the problem with that is that i will have to
>create the mappings as they happen. I would prefer to find out what
>character sets i am translating between. Is there a way to do this or is
>there a list of character sets somewhere?
There are many character sets out there, and you should get acquainted with
them. However, the example you give doesn't seem to be in any particular
character set, there are just funny characters in between the text. Ask the
people who give you these data sets what the meaning of those characters is
(but first get it clear what character set they are using!).
Point is, you are the user of their input, and they should be able to
describe what input they are delivering to you. Alternatively, you could
state what you want to receive, but that is a luxury we don't often have.
:-(
OK, short primer on character sets. There are a lot of them, but only ASCII
and EBCDIC survived as the basic A-Za-z0-9 and funny characters sets. Of
those, only ASCII is worth talking about as it is the basis for all other
current character sets[1][2].
Now ASCII only defines codes 0-127. So others have filled in the gap from
128-255. One character set that was used very often is IBM-extended, which
was burned into the ROMs of all IBM compatibles. That is not used a lot
anymore. ANSI/ISO defined some character sets, which are used a lot.
ISO 8859-1 in particular is the current standard on the internet, but
others do exist. I'm talking about character sets that have ASCII as their
lower 128 character codes, they redefine the upper 128 codes.
This doesn't work too well in practice (talk to any European who has had to
convert between character sets), so they thought up something else: Unicode.
That is a 16 bit code[3] that encodes most commonly used characters. The
idea is good and it pretty much works as advertised. Some problems do
surface though. Most Unices don't have much support for it yet, nor does
any other OS but Windows NT. Most people don't understand all the issues
surrounding Unicode (I certainly don't), but then most people stay
blissfully ignorant of character sets even if they hit them in the face!
Another problem with Unicode is that no font will show all possible
characters, so there will be some trouble displaying random Unicode strings.
Look to your OS for solutions.
However, Unicode is probably the way to go, and Perl support for Unicode is
probably one of the most exciting things that happened in computing over the
past decade. Oh, by the way, don't get hung up on UTF-8 (or UTF-x in
general), it is just a funny way of encoding Unicode. It's an encoding
only, down below it's just Unicode. You'll probably encounter it more, just
be aware of it.
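To make the "encoding only" point concrete, here is a small sketch using the Encode module. That is an assumption about your Perl: Encode ships with perl 5.8 and later, so it will not be available on older installations. The character chosen, U+00A0 NO-BREAK SPACE, is just an example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One Unicode character: U+00A0 NO-BREAK SPACE.
my $char = "\x{00A0}";

# Encoded as UTF-8 it becomes two bytes.
my $utf8_bytes = encode('UTF-8', $char);
my $hex = join ' ', map { sprintf '%02X', ord } split //, $utf8_bytes;
print "UTF-8 bytes: $hex\n";

# Encoded as ISO 8859-1 it is a single byte.
my $latin1_bytes = encode('ISO-8859-1', $char);
printf "Latin-1 byte: %02X\n", ord $latin1_bytes;

# Decoding either byte string gives back the same character.
my $same = decode('UTF-8', $utf8_bytes) eq decode('ISO-8859-1', $latin1_bytes);
print $same ? "same character\n" : "different characters\n";
```

The bytes differ between the two encodings, but the character underneath is the same one.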
You asked for a list of character sets. Just search the web for the terms I
gave above, it should turn up plenty. Note that the ISO standards do cost
money, so if you really need them you probably have to buy them. But they
(or the gist) should probably be on the web somewhere.
OK, back to your problem. All those character sets inherited from ASCII.
Although there are some funny Unicode end-of-line, end-of-whatever, etc
characters, basically nowadays you get either 8-bit or 16-bit input, with 16
bit being very rare. If it is 8 bit, it is either UTF-8 encoded Unicode, or
it is an extended ASCII, probably 8859-1. But the people who give you the
input should be able to tell you what character set the input is in, and if
applicable, what the funny characters mean. But in your case, it seems that
there is just ASCII with some funny characters.
I may be a sucker for these cases, but I always /demand/ to know what input
I get. That normally means knowing the character set, and the end-of-line
character(s). Without that knowledge (possibly learned from sample files),
I cannot go to work. (There are many more issues, like what is the end of
string separator, if any, what does an "empty field" mean, etc).
HTH,
M4
[1] This does not mean that EBCDIC (and even others!) are not used anymore,
just that their usage is rare and conversions to ASCII and derivatives
always exist.
[2] I once was project leader for a big project involving exchange of data
files. After inquiring what character set they used, I learned that the
input would be translated from IBM-extended to some form of EBCDIC
extended. Just to be translated back at the other end to IBM-extended.
Needless to say that 1) I short-circuited things and 2) the other end said
they worked with IBM-extended while in reality they worked with 8859-1.
Duh!
[3] There also seems to be a 32 bit Unicode, but I don't know anything
about that.
------------------------------
Date: Wed, 27 Dec 2000 00:34:04 +0000
From: John Delacour <JD@.Tallorno.net>
Subject: Re: Text Conversion in Perl
Message-Id: <v03130300b66ee026d3d9@e>
At 9:03 am +0000 26/12/00, Mario Thomas wrote:
>I'm receiving a text file from another department in my company which
>contains an article ...Quark on the Mac.... a sample below:
><START TEXT>
>The Phillips Report into the BSE crisis found there had been Òa clear
>policy restricting the disclosure of information about BSEÓ that...
><ENDTEXT>
Some people have suggested using tr/// to convert these curly quotes etc.
from Mac to Latin-1. The problem is that there are no curly quotes in
Latin-1. The curly quotes you now have in Windows belong to the charset
windows-1252.
The normal default for browsers is Latin-1 and unless you write your <HEAD>
correctly with the proper META tag for the charset, many people will not be
able to see the curly quotes properly without manually adjusting the
encoding, which is beyond the wit of most people. Older browsers won't
interpret them properly anyway. If you want curly quotes, you need to use
&#8216; &#8217; &#8220; &#8221; for ÔÕÒÓ respectively. Otherwise you can
use Unicode, UTF-7 or UTF-8 but older browsers will produce garbage from
that.
tr/// is no good for anything but transliteration from one table to another
of the same size.
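##1
#!perl -w
# Method 1: tr/// maps single characters to single characters
$mac = 'ÒHeÕs here again.Ó, she said.';
(my $windows1252 = $mac) =~ tr/ÔÕÒÓ/‘’“”/;
print "\n$windows1252\n";
##2
# Method 2: s/// can map one character to a multi-character entity
$_ = $mac;
s/Ô/&#8216;/g;
s/Õ/&#8217;/g;
s/Ò/&#8220;/g;
s/Ó/&#8221;/g;
print "\n$mac\n$windows1252\n$_\n";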
Method 2 is what you need. I've asked before for a perl routine to map
from array to array, rather than charlist to charlist, but I've never got
an answer. There MUST be a quicker and easier way.
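One possible quicker way, offered as a sketch rather than a definitive answer: keep the mapping in a hash and do every replacement in a single s///g pass. The names %entity_for and mac_quotes_to_entities() are made up for illustration. The byte values assume Mac Roman input, where 0xD2 to 0xD5 are the four curly quotes (they display as ÒÓÔÕ when read as Latin-1).

```perl
use strict;
use warnings;

# Mac Roman quote bytes mapped to HTML numeric character references.
my %entity_for = (
    "\xD4" => '&#8216;',    # left single quote
    "\xD5" => '&#8217;',    # right single quote
    "\xD2" => '&#8220;',    # left double quote
    "\xD3" => '&#8221;',    # right double quote
);

sub mac_quotes_to_entities {
    my ($text) = @_;
    # One pass over the text; each matched byte is looked up in the hash.
    $text =~ s/([\xD2-\xD5])/$entity_for{$1}/g;
    return $text;
}

my $mac = "\xD2He\xD5s here again.\xD3, she said.";
print mac_quotes_to_entities($mac), "\n";
```

Adding another character then means adding a hash entry and widening the character class, rather than writing another s/// line.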
JD
------------------------------
Date: 16 Sep 99 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 16 Sep 99)
Message-Id: <null>
Administrivia:
The Perl-Users Digest is a retransmission of the USENET newsgroup
comp.lang.perl.misc. For subscription or unsubscription requests, send
the single line:
subscribe perl-users
or:
unsubscribe perl-users
to almanac@ruby.oce.orst.edu.
| NOTE: The mail to news gateway, and thus the ability to submit articles
| through this service to the newsgroup, has been removed. I do not have
| time to individually vet each article to make sure that someone isn't
| abusing the service, and I no longer have any desire to waste my time
| dealing with the campus admins when some fool complains to them about an
| article that has come through the gateway instead of complaining
| to the source.
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
To request back copies (available for a week or so), send your request
to almanac@ruby.oce.orst.edu with the command "send perl-users x.y",
where x is the volume number and y is the issue number.
For other requests pertaining to the digest, send mail to
perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
sending perl questions to the -request address, I don't have time to
answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V9 Issue 5203
**************************************