[31445] in Perl-Users-Digest


Perl-Users Digest, Issue: 2697 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Fri Nov 27 21:09:43 2009

Date: Fri, 27 Nov 2009 18:09:08 -0800 (PST)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Fri, 27 Nov 2009     Volume: 11 Number: 2697

Today's topics:
    Re: Avoiding Perl warning "uninitialized value" <justin.0911@purestblue.com>
    Re: DLL unload question for embedded Perl on Windows <ben@morrow.me.uk>
    Re: Good Golly Miss Molly Perl. Been so long. (Randal L. Schwartz)
        Parsing HTML with HTML::TableExtract <nickli2000@gmail.com>
    Re: Parsing HTML with HTML::TableExtract sln@netherlands.com
    Re: Parsing HTML with HTML::TableExtract <martien.verbruggen@invalid.see.sig>
    Re: perl hash: low-level implementation details? <source@netcom.com>
    Re: Quick CGI question (specific to the CGI package) <hjp-usenet2@hjp.at>
    Re: Quick CGI question (specific to the CGI package) <hjp-usenet2@hjp.at>
    Re: Quick CGI question (specific to the CGI package) <uri@StemSystems.com>
        regexp for removing {} around latin1 characters <friendly@yorku.ca>
    Re: regexp for removing {} around latin1 characters <glennj@ncf.ca>
    Re: regexp for removing {} around latin1 characters <hjp-usenet2@hjp.at>
    Re: regexp for removing {} around latin1 characters <friendly@yorku.ca>
    Re: regexp for removing {} around latin1 characters sln@netherlands.com
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Fri, 27 Nov 2009 23:19:20 +0000
From: Justin C <justin.0911@purestblue.com>
Subject: Re: Avoiding Perl warning "uninitialized value"
Message-Id: <oc36u6-skl.ln1@purestblue.com>

In article <slrnhgtc8v.2h4.tadmc@tadbox.sbcglobal.net>, Tad McClellan wrote:
> Justin C <justin.0911@purestblue.com> wrote:
>> On 2009-11-26, neilsolent <n@solenttechnology.co.uk> wrote:
> 
>>>         if (/\d+\s+\d+\s+(\d+)\s+(\d+)\%\s+(?!\/cdrom)/)
> 
> 
>> I don't know why the ? is inside the () in the above regex.
> 
> 
> Then read the "Extended Patterns" section in:
> 
>     perldoc perlre

I did, and it's not pretty! I read:

The stability of these extensions varies widely.  Some have been part
of the core language for many years.  Others are experimental and may
change without warning or be completely removed.  Check the documentation
on an individual feature to verify its current status.


 .... and decided I'd leave this alone for a while. There's some very
complex pattern matching available in there. Maybe I'll get my head
around it sometime, maybe I'll find another way to do it!

Thanks for the pointer.

   Justin.

-- 
Justin C, by the sea.
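
For anyone puzzling over the same construct: (?!...) is a zero-width
negative lookahead, so the ? is part of the group syntax rather than a
quantifier. A minimal sketch, assuming df-style input; the sample lines
and the sub name parse_df_line are made up for illustration and are not
from the thread:

```perl
use strict;
use warnings;

# Hypothetical sample lines in a df-like layout; not from the thread.
my @lines = (
    '/dev/sda1  1000  400  600  40% /home',
    '/dev/sr0   700   700  0    100% /cdrom',
);

# Returns (available, percent) unless "/cdrom" follows the percent field.
sub parse_df_line {
    my ($line) = @_;
    # (?!\/cdrom) consumes nothing; it only asserts that "/cdrom"
    # does NOT follow at this position.
    if ($line =~ /\d+\s+\d+\s+(\d+)\s+(\d+)\%\s+(?!\/cdrom)/) {
        return ($1, $2);
    }
    return;
}

for my $line (@lines) {
    my @fields = parse_df_line($line);
    print @fields ? "matched: @fields\n" : "skipped: $line\n";
}
```

One caveat with the pattern as posted: if more than one space separates
the percent sign from the mount point, \s+ can backtrack by one space,
and the lookahead then succeeds because what follows starts with a
space rather than with /cdrom.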


------------------------------

Date: Fri, 27 Nov 2009 20:37:29 +0000
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: DLL unload question for embedded Perl on Windows
Message-Id: <9tp5u6-5a61.ln1@osiris.mauzo.dyndns.org>


Quoth cyl <u8526505@gmail.com>:
> I ran into some problems when executing the sample code from perldoc
> about embedding Perl in a C program. Here is my code
> 
> ---- embeddedperl.c begin ----
> #include <EXTERN.h>               /* from the Perl distribution     */
> #include <perl.h>                 /* from the Perl distribution     */
> 
> EXTERN_C void boot_DynaLoader (pTHX_ CV* cv);
> 
> EXTERN_C void xs_init(pTHX)
> {
>         char *file = __FILE__;
>         /* DynaLoader is a special case */
>         newXS("DynaLoader::boot_DynaLoader", boot_DynaLoader, file);
> }
> 
> void runperl()
> {
> 	int ARGC = 2;
> 	char *ARGV[]={"","test.pl"};
> 	PerlInterpreter *my_perl;
> 
> 	PERL_SYS_INIT3(&ARGC,(char ***)&ARGV,NULL);

This is wrong. The PERL_SYS macros must each be called exactly once (per
process), and must be called with the actual argv and env that were
passed to main. You can pass a different argv/env to perl_parse if you
need to.

> 	my_perl = perl_alloc();
> 	perl_construct(my_perl);
> 	PL_exit_flags |= PERL_EXIT_DESTRUCT_END;
> 	perl_parse(my_perl, xs_init, ARGC, ARGV, (char **)NULL);
> 	perl_run(my_perl);
> 	perl_destruct(my_perl);
> 	perl_free(my_perl);
> 	PERL_SYS_TERM();
> 
> }
> 
> int main(int argc, char **argv, char **env)
> {
> 	int i=0;
> 	for (i=0;i<2;i++)
> 	     runperl();
> }
> ---- embeddedperl.c end ----
> 
> ---- test.pl begin ----
> use Cwd;
> 
> print cwd,"\n";
> ---- test.pl end ----
> 
> Here are the problems I got during execution
> 
> 1. The loaded DLLs do not unload after the Perl interpreter is
> shutdown
>     In my example, after perl_parse() Cwd.dll will be loaded. I
> expected this dll will be unloaded after calling
>     perl_run() but it was not. How do I force all DLLs to be unloaded
> after a script finishes?

I would not expect Cwd.dll to be unloaded until after perl_free is
called. It is not normal for a perl interpreter to ever unload a loaded
extension dll.

> 2. perl_destruct() always throws exception. I have to comment out it
> for my program to run. I suspect if it can run
>     without problem, maybe my 1st question can be solved.

I suspect this may have something to do with your misuse of SYS_INIT3
and SYS_TERM, but I don't know. If fixing that doesn't help, build perl
with -DDEBUGGING and see if you get more information.

> 3. After commenting out perl_destruct(), my program throws exception
> after calling runperl() the 2nd time. It
>     actually died on perl_parse(). Since I wrap everything in a sub-
> routine runperl() I thought everything starts from
>     scratch. However It seems not. I have no idea how come it is OK
> the 1st time but not the 2nd.

This is expected. Your perl is built with threads (since you're on
Win32) and you are effectively trying to use two different interpreters
from the same thread. Since you are not using the PERL_SET_CONTEXT
macros, this isn't going to work. Once you get perl_destruct working, I
suspect this will go away.

Ben



------------------------------

Date: Fri, 27 Nov 2009 08:37:14 -0800
From: merlyn@stonehenge.com (Randal L. Schwartz)
Subject: Re: Good Golly Miss Molly Perl. Been so long.
Message-Id: <86ocmnudol.fsf@blue.stonehenge.com>

>>>>> "RedGrittyBrick" == RedGrittyBrick  <RedGrittyBrick@spamweary.invalid> writes:

RedGrittyBrick> I've never found a situation where I thought it would be
RedGrittyBrick> useful to use goto. When modifying other people's code that
RedGrittyBrick> contains gotos it has almost invariably made my job
RedGrittyBrick> harder. Much harder.

In the first Camel, we had precisely *one* program that included a goto, and
it was a contributed program at that (not one that either Larry or I had
written).

The amount of flak we took for that far exceeded anything else we had done
poorly on the book. :)

print "Just another Perl hacker,"; # the original

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
See http://methodsandmessages.vox.com/ for Smalltalk and Seaside discussion


------------------------------

Date: Fri, 27 Nov 2009 14:57:07 -0800 (PST)
From: Ninja Li <nickli2000@gmail.com>
Subject: Parsing HTML with HTML::TableExtract
Message-Id: <24573ad2-fa16-4061-8332-c4fabd5f76d5@p23g2000vbl.googlegroups.com>

Hi,

    I am trying to create a comma-delimited file by parsing HTML from the
website "http://www.earnings.com/conferencecall.asp?client=cb"
using the HTML::TableExtract module (thanks to Tad McClellan for the
introduction). However, I got the following error message when running
my script at the end of the post:
----------------------
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
Use of uninitialized value in join or string at conference.pl line 25.
HOGGF.PK
					 ,HOGG ROBINSON GROUP PLC,Half- Year HOGG ROBINSON GROUP PLC
Earnings Conference Call,,,4:00 AM
 ...............
----------------------

   Also notice the large spaces between first value "HOGGF.PK" and
second "HOGG ROBINSON GROUP PLC". There are only a few spaces after
the first field in the original HTML. For what I could see so far, it
seems the empty values in the fields are not handled correctly. The
source code is at the end of the post.

   Please advise the root cause and the fix.

   Thanks in advance.

   Nick

----------------------------------------------
Source code:

use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;

my $html = get 'http://www.earnings.com/conferencecall.asp?client=cb';

my @headers =
(
  'SYMBOL',
  'COMPANY',
  'EVENT TITLE',
  'WEBCAST',
  'TRANSCRIPT',
  'TIME'
);

my $te = HTML::TableExtract->new( headers => \@headers );
$te->parse($html);

foreach my $ts ( $te->tables )
{
  foreach my $row ( $ts->rows )
  {
    my $csv = join ',', @$row;
    print "$csv\n";
  }
}


------------------------------

Date: Fri, 27 Nov 2009 15:26:01 -0800
From: sln@netherlands.com
Subject: Re: Parsing HTML with HTML::TableExtract
Message-Id: <gln0h5p9g8q1pfnvd0jk54e5037dpf5fj6@4ax.com>

On Fri, 27 Nov 2009 14:57:07 -0800 (PST), Ninja Li <nickli2000@gmail.com> wrote:

>Hi,
>
>    I am trying to create a comma-delimited file by parsing HTML from the
>website "http://www.earnings.com/conferencecall.asp?client=cb"
>using the HTML::TableExtract module (thanks to Tad McClellan for the
>introduction). However, I got the following error message when running
>my script at the end of the post:
>----------------------
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>Use of uninitialized value in join or string at conference.pl line 25.
>HOGGF.PK
>					 ,HOGG ROBINSON GROUP PLC,Half- Year HOGG ROBINSON GROUP PLC
>Earnings Conference Call,,,4:00 AM
>...............
>----------------------
>
>   Also notice the large spaces between first value "HOGGF.PK" and
>second "HOGG ROBINSON GROUP PLC". There are only a few spaces after
>the first field in the original HTML. For what I could see so far, it
>seems the empty values in the fields are not handled correctly. The
>source code is at the end of the post.
>
>   Please advise the root cause and the fix.
>
>   Thanks in advance.
>
>   Nick
>
What have you done to find out what caused this ridiculous
number of warnings? Nothing, judging from your code.
Something is off, WAY off! Something is wrong with your content or
headers. You have to learn the module, which means you actually have to
read the docs for it. Then plan ahead. Look at the source of the HTML.

This is not rocket science.

-sln


------------------------------

Date: Sat, 28 Nov 2009 11:43:47 +1100
From: Martien Verbruggen <martien.verbruggen@invalid.see.sig>
Subject: Re: Parsing HTML with HTML::TableExtract
Message-Id: <3orpeh.tph.ln@news.heliotrope.home>

On Fri, 27 Nov 2009 14:57:07 -0800 (PST),
	Ninja Li <nickli2000@gmail.com> wrote:
> Hi,
>
>     I am trying to create a comma-delimited file by parsing HTML from the
> website "http://www.earnings.com/conferencecall.asp?client=cb"
> using the HTML::TableExtract module (thanks to Tad McClellan for the
> introduction). However, I got the following error message when running
> my script at the end of the post:
> ----------------------
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> Use of uninitialized value in join or string at conference.pl line 25.
> HOGGF.PK
> 					 ,HOGG ROBINSON GROUP PLC,Half- Year HOGG ROBINSON GROUP PLC
> Earnings Conference Call,,,4:00 AM
> ...............

That is not the only output. I get more.

>    Also notice the large spaces between first value "HOGGF.PK" and
> second "HOGG ROBINSON GROUP PLC". There are only a few spaces after
> the first field in the original HTML. For what I could see so far, it

Check the 'original' HTML again. What's currently at that URL has the
spaces that you see. I guess they must have changed it since you last
looked at it.

> seems the empty values in the fields are not handled correctly. The
> source code is at the end of the post.

Define 'correctly'. Or rather, find out what HTML::TableExtract defines
as correctly, and adjust your expectations to that. Cells without text
content seem to be returned as undefined values. It's your job to deal
with that in whichever way you think it should be dealt with.
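
One way to deal with it, as a minimal sketch: map undefined cells to
empty strings and collapse whitespace (including the non-breaking
space, \xA0) before joining. The sample row and the sub name clean_row
are invented for illustration; with HTML::TableExtract the rows would
come from $ts->rows as in your script. Fields that contain commas would
still need quoting, which this sketch ignores.

```perl
use strict;
use warnings;

# Hypothetical row shaped like the ones in the post: an array ref whose
# empty cells are undef and whose text may carry stray whitespace,
# including a non-breaking space (\xA0).
my @rows = (
    [ "HOGGF.PK\n\t\t\t\t\t\xA0", 'HOGG ROBINSON GROUP PLC',
      'Half- Year Earnings Conference Call', undef, undef, '4:00 AM' ],
);

# Turn one row into a comma-separated line without triggering
# "uninitialized value" warnings.
sub clean_row {
    my ($row) = @_;
    my @cells = map {
        my $cell = defined $_ ? $_ : '';   # undef -> empty string
        $cell =~ s/[\s\xA0]+/ /g;          # collapse runs of whitespace
        $cell =~ s/^ | $//g;               # trim leading/trailing space
        $cell;
    } @$row;
    return join ',', @cells;
}

print clean_row($_), "\n" for @rows;
```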

>    Please advise the root cause and the fix.

If you want, I can send you a contract and rate card.

Martien
-- 
                             | 
Martien Verbruggen           | 
first.last@heliotrope.com.au | Can't say that it is, 'cause it ain't.
                             | 


------------------------------

Date: Fri, 27 Nov 2009 10:50:17 -0800
From: David Harmon <source@netcom.com>
Subject: Re: perl hash: low-level implementation details?
Message-Id: <BPKdnUQB3NMHgo3WnZ2dnUVZ_sOdnZ2d@earthlink.com>

On Mon, 23 Nov 2009 21:23:14 -0800 in comp.lang.perl.misc, Xho
Jingleheimerschmidt <xhoster@gmail.com> wrote,
>Each value slot will have exactly one value in it--that is how Perl 
>hashes work.  However, in your code that value will be a reference to an 
>array, which array will on average have close to 1 element in it.
>
>And there goes your memory.  You have about 50 million tiny arrays, each 
>one using a lot of overhead.

I'm thinking a good way to store it would be the actual value in the
hash as long as there is only one for that key, or an array reference as
soon as it grows to more than one.  You would need a sub to insert and
to access, to keep complexity under control.  At which point maybe it
makes sense to make it a class module.  Am I going down a bad path here?
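
A rough sketch of what I mean, with made-up sub names (add_value,
get_values). It assumes the stored values are never array references
themselves, since ref() is what tells the two cases apart:

```perl
use strict;
use warnings;

my %store;

# Insert a value: keep a plain scalar while the key has one value,
# and promote to an array reference on the second insert.
sub add_value {
    my ($hash, $key, $value) = @_;
    if (!exists $hash->{$key}) {
        $hash->{$key} = $value;                     # first value: scalar
    }
    elsif (ref($hash->{$key}) eq 'ARRAY') {
        push @{ $hash->{$key} }, $value;            # already promoted
    }
    else {
        $hash->{$key} = [ $hash->{$key}, $value ];  # promote on second
    }
    return;
}

# Always return a list, whichever form is stored.
sub get_values {
    my ($hash, $key) = @_;
    return unless exists $hash->{$key};
    my $slot = $hash->{$key};
    return ref($slot) eq 'ARRAY' ? @$slot : ($slot);
}

add_value(\%store, 'a', 1);
add_value(\%store, 'b', 2);
add_value(\%store, 'b', 3);
print "a: @{[ get_values(\%store, 'a') ]}\n";
print "b: @{[ get_values(\%store, 'b') ]}\n";
```

Keeping both subs as the only way in and out of the hash is what stops
the scalar-or-array distinction from leaking into the rest of the code.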
  
 


------------------------------

Date: Fri, 27 Nov 2009 18:03:36 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Quick CGI question (specific to the CGI package)
Message-Id: <slrnhh01jb.kvh.hjp-usenet2@hrunkner.hjp.at>

On 2009-11-26 19:13, Ted Byers <r.ted.byers@gmail.com> wrote:
[generating videos (or actually any content-type) from CGI]
> There are still aspects of the behaviour I saw previously that I don't
> understand.  For example, once I used video/mpeg as the content type
> (not using Mumia's example) the client believed the file name was
> 'my.cgi.script.cgi.mpg' and knew enough to try to open it using
> Windows Media Player, but with all the other content types, the same
> client believed the file name was 'my.cgi.script.cgi'.  Why the
> difference?

As Ben already noted, the "file name" in a URI is supposed to be
completely immaterial to the browser. Whether the URL ends in
"video.cgi" or "video.mpg" or "video.html" should not make any
difference. The only thing that is important for the browser is the
content-type. When the browser recognizes the content-type, it knows how
to handle the file, e.g., to call ms media player. It also knows (on
Windows) which extension a file of this type is supposed to have, so it
can add a proper extension.

(Unfortunately, Firefox subscribes to the "the truth is much too
complicated for the average user, so we lie to them and confuse the heck
out of them" school of thought - so you can't believe anything it
displays in dialog boxes. But at least it does the right thing
internally, unlike IE, which both ignores the content type whenever it
feels like it and lies to the user)

	hp



------------------------------

Date: Fri, 27 Nov 2009 18:26:30 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: Quick CGI question (specific to the CGI package)
Message-Id: <slrnhh02u7.kvh.hjp-usenet2@hrunkner.hjp.at>

On 2009-11-25 22:11, Uri Guttman <uri@StemSystems.com> wrote:
>>>>>> "TB" == Ted Byers <r.ted.byers@gmail.com> writes:
>  TB>   my $fcontent;
>  TB>   read FIN, $fcontent, $flength;
>
> where is $flength set? i assume you would do a -s to get the file size
>
>  TB>   print $fcontent;
>
> if you want more speed, use sysread and syswrite.

sysread/syswrite probably aren't much faster than read/print. The latter
have a bit more buffer handling overhead but that is almost certainly
negligible when you read data from a disk and send it over the network.

However, if the files are large (and videos can be quite large), you can
save quite a lot of time by reading the file in smallish chunks (a few
kB to a few MB) and send each chunk immediately. If you read the whole
file into memory first and then send it to the client the times for
reading from disk and sending over the net add up. Otherwise they
overlap resulting in a shorter total time.
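
A minimal sketch of the chunked variant. The sub name send_in_chunks
and the 64 kB chunk size are arbitrary choices for illustration; in a
real CGI script the output handle would be STDOUT, after the headers
have been printed:

```perl
use strict;
use warnings;

# Copy $in to $out in fixed-size chunks, so that sending can start
# before the whole file has been read. Returns the bytes copied.
sub send_in_chunks {
    my ($in, $out, $chunk_size) = @_;
    $chunk_size ||= 64 * 1024;
    binmode $in;
    binmode $out;
    my $total = 0;
    while (1) {
        my $n = read($in, my $buf, $chunk_size);
        die "read failed: $!" unless defined $n;
        last if $n == 0;                        # end of file
        print {$out} $buf or die "write failed: $!";
        $total += $n;
    }
    return $total;
}

# Demo with in-memory handles standing in for the file and the client.
my $data = 'x' x 200_000;
my $sent = '';
open my $src, '<', \$data or die "open: $!";
open my $dst, '>', \$sent or die "open: $!";
my $copied = send_in_chunks($src, $dst, 64 * 1024);
close $dst;
print "copied $copied bytes\n";
```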

	hp



------------------------------

Date: Fri, 27 Nov 2009 13:49:45 -0500
From: "Uri Guttman" <uri@StemSystems.com>
Subject: Re: Quick CGI question (specific to the CGI package)
Message-Id: <87fx7zajli.fsf@quad.sysarch.com>

>>>>> "PJH" == Peter J Holzer <hjp-usenet2@hjp.at> writes:

  PJH> On 2009-11-25 22:11, Uri Guttman <uri@StemSystems.com> wrote:
  >>>>>>> "TB" == Ted Byers <r.ted.byers@gmail.com> writes:
  TB> my $fcontent;
  TB> read FIN, $fcontent, $flength;
  >> 
  >> where is $flength set? i assume you would do a -s to get the file size
  >> 
  TB> print $fcontent;
  >> 
  >> if you want more speed, use sysread and syswrite.

  PJH> sysread/syswrite probably aren't much faster than read/print. The latter
  PJH> have a bit more buffer handling overhead but that is almost certainly
  PJH> negligible when you read data from a disk and send it over the network.

they both avoid stdio (or perl's version) so they are faster. how much
depends on the amount of i/o and how many calls are made. this is why
file::slurp uses sysread/write. see its benchmarks to see the difference
from read/print.

  PJH> However, if the files are large (and videos can be quite large),
  PJH> you can save quite a lot of time by reading the file in smallish
  PJH> chunks (a few kB to a few MB) and send each chunk immediately. If
  PJH> you read the whole file into memory first and then send it to the
  PJH> client the times for reading from disk and sending over the net
  PJH> add up. Otherwise they overlap resulting in a shorter total time.

for some definition of large and small! :)

uri

-- 
Uri Guttman  ------  uri@stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------


------------------------------

Date: Fri, 27 Nov 2009 12:05:38 -0500
From: Michael Friendly <friendly@yorku.ca>
Subject: regexp for removing {} around latin1 characters
Message-Id: <hep0t2$mg$1@sunburst.ccs.yorku.ca>

I have BibTeX files containing accented characters in forms like
{Johann Peter S{\"u}ssmilch}
{Johann Peter S\"ussmilch}
where, in BibTeX, the {} are optional.

To export these to, e.g., EndNote, I have to translate these latex
encodings to latin1, which I can largely do with the unix recode tool.
However, recode cheerfully copies the {}s which mess up things when
I import them.

% echo '{Johann Peter S{\"u}ssmilch},' | recode latex..latin1
{Johann Peter S{ü}ssmilch},

So, I'm looking to complete the process by finding a regexp to remove
the braces around single accented latin1 characters.

recode latex..latin1 < my.bib | perl -pe "s|\{([WHATGOESHERE])\}|$1|g"


-- 
Michael Friendly     Email: friendly@yorku.ca
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT  M3J 1P3 CANADA


------------------------------

Date: 27 Nov 2009 17:39:46 GMT
From: Glenn Jackman <glennj@ncf.ca>
Subject: Re: regexp for removing {} around latin1 characters
Message-Id: <slrnhh03n3.fc5.glennj@smeagol.ncf.ca>

At 2009-11-27 12:05PM, "Michael Friendly" wrote:
>  I have BibTeX files containing accented characters in forms like
>  {Johann Peter S{\"u}ssmilch}
>  {Johann Peter S\"ussmilch}
>  where, in BibTex, the {} are optional.
[...]  
>  So, I'm looking to complete the process by finding a regexp to remove
>  the braces around single accented latin1 characters.
>  
>  recode latex..latin1 < my.bib | perl -pe "s|\{([WHATGOESHERE])\}|$1|g"

Maybe:
    
    s#{(\\.(?:{.+?}|.+?))}#$1#g

-- 
Glenn Jackman
    Write a wise saying and your name will live forever. -- Anonymous


------------------------------

Date: Fri, 27 Nov 2009 19:43:33 +0100
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: regexp for removing {} around latin1 characters
Message-Id: <slrnhh07el.v64.hjp-usenet2@hrunkner.hjp.at>

On 2009-11-27 17:39, Glenn Jackman <glennj@ncf.ca> wrote:
> At 2009-11-27 12:05PM, "Michael Friendly" wrote:
>>  I have BibTeX files containing accented characters in forms like
>>  {Johann Peter S{\"u}ssmilch}
>>  {Johann Peter S\"ussmilch}
>>  where, in BibTex, the {} are optional.
> [...]  
>>  So, I'm looking to complete the process by finding a regexp to remove
>>  the braces around single accented latin1 characters.
>>  
>>  recode latex..latin1 < my.bib | perl -pe "s|\{([WHATGOESHERE])\}|$1|g"
>
> Maybe:
>     
>     s#{(\\.(?:{.+?}|.+?))}#$1#g

more likely:

perl -pe "s|\{([\xA0-\xFF])\}|$1|g"


I think you are trying to replace the recode, too, but for that you need
a lookup table with all the accented characters.
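
The same substitution as a small script, which is easier to test than
a one-liner. This is only a sketch: the sub name strip_braces is made
up, and the input is assumed to be Latin-1 bytes as produced by recode.

```perl
use strict;
use warnings;

# Remove braces around a single byte in the range 0xA0-0xFF,
# i.e. the non-ASCII printable part of Latin-1.
sub strip_braces {
    my ($text) = @_;
    $text =~ s|\{([\xA0-\xFF])\}|$1|g;
    return $text;
}

# "\xFC" is u-umlaut in Latin-1, written as an escape so that the
# example does not depend on the encoding of the script file.
my $line = "author = {Johann Peter S{\xFC}ssmilch},\n";
binmode STDOUT;    # write the bytes unchanged
print strip_braces($line);
```

Braces around ASCII text, such as the outer pair of the field value,
are left alone because the character class only covers 0xA0-0xFF.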

	hp



------------------------------

Date: Fri, 27 Nov 2009 15:59:59 -0500
From: Michael Friendly <friendly@yorku.ca>
To: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: regexp for removing {} around latin1 characters
Message-Id: <4B103DCF.3010702@yorku.ca>

Peter J. Holzer wrote:
> On 2009-11-27 17:39, Glenn Jackman <glennj@ncf.ca> wrote:
>> At 2009-11-27 12:05PM, "Michael Friendly" wrote:
>>>  I have BibTeX files containing accented characters in forms like
>>>  {Johann Peter S{\"u}ssmilch}
>>>  {Johann Peter S\"ussmilch}
>>>  where, in BibTex, the {} are optional.
>> [...]  
>>>  So, I'm looking to complete the process by finding a regexp to remove
>>>  the braces around single accented latin1 characters.
>>>  
>>>  recode latex..latin1 < my.bib | perl -pe "s|\{([WHATGOESHERE])\}|$1|g"
>> Maybe:
>>     
>>     s#{(\\.(?:{.+?}|.+?))}#$1#g
> 
> more likely:
> 
> perl -pe "s|\{([\xA0-\xFF])\}|$1|g"
> 
> 
> I think you are trying to replace the recode, too, but for that you need
> a lookup table with all the accented characters.
> 
> 	hp
> 

No, all I want to do is to strip the {} around the accented characters;
recode does the conversion well.  With the small test bib file below,
here's what I get using only recode, vs. recode + perl


  % recode latex..latin1 < timeref.bib | grep ssmilch
@BOOK{Sussmilch:1741,
   author = {Johann Peter S{ü}ssmilch},

  % recode latex..latin1 < timeref.bib | perl -pe 
"s|\{([\xA0-\xFF])\}|$1|g" | grep ssmilch
@BOOK{Sussmilch:1741,
   author = {Johann Peter Sssmilch},

Note that the ü just disappears.

---begin timeref.bib ---
@ARTICLE{Buache:1752,
   author = {Buache, Phillippe},
   title = {Essai De G{\'e}ographie Physique},
   journal = {M{\'e}moires de L'Acad{\'e}mie Royale des Sciences},
   year = {1752},
   pages = {399--416},
   note = {\Loc{BNF: Ge.FF-8816-8822}},
   annote = {Contour map},
   oldnum = {2}
}

@BOOK{Crome:1785,
   title = {{\"U}ber die Gr{\"o}sse and Bev{\"o}lkerung der 
S{\"a}mtlichen Europ{\"a}schen
	Staaten},
   publisher = {Weygand},
   year = {1785},
   author = {Crome, August F. W.},
   address = {Leipzig},
   annote = {Superimposed squares to compare areas (of European states)},
   oldnum = {5}
}

@BOOK{Sussmilch:1741,
   title = {Die g{\"o}ttliche Ordnung in den Ver\"anderungen des 
menschlichen Geschlechts,
	aus der Geburt, Tod, und Fortpflantzung},
   publisher = {n.p.},
   year = {1741},
   author = {Johann Peter S{\"u}ssmilch},
   address = {Germany},
   note = {(published in French translation as \emph{L'ordre divin. dans les
	changements de l'esp\`ece humaine, d{\'e}montr{\'e} par la naissance,
	la mort et la propagation de celle-ci}, trans: Jean-Marc Rohrbasser,
	Paris: INED, 1998, ISBN 2-7332-1019-X)},
   url = {http://www.ined.fr/publicat/collections/classiques/Ordivin.htm}
}
--- end timeref.bib -----



-- 
Michael Friendly     Email: friendly@yorku.ca
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT  M3J 1P3 CANADA


------------------------------

Date: Fri, 27 Nov 2009 15:12:57 -0800
From: sln@netherlands.com
Subject: Re: regexp for removing {} around latin1 characters
Message-Id: <ocl0h59qht8nlnfc7boceo0p377srt2jhu@4ax.com>

On Fri, 27 Nov 2009 15:59:59 -0500, Michael Friendly <friendly@yorku.ca> wrote:

>Peter J. Holzer wrote:
>> On 2009-11-27 17:39, Glenn Jackman <glennj@ncf.ca> wrote:
>>> At 2009-11-27 12:05PM, "Michael Friendly" wrote:
>>>>  I have BibTeX files containing accented characters in forms like
>>>>  {Johann Peter S{\"u}ssmilch}
>>>>  {Johann Peter S\"ussmilch}
>>>>  where, in BibTex, the {} are optional.
>>> [...]  
>>>>  So, I'm looking to complete the process by finding a regexp to remove
>>>>  the braces around single accented latin1 characters.
>>>>  
>>>>  recode latex..latin1 < my.bib | perl -pe "s|\{([WHATGOESHERE])\}|$1|g"
>>> Maybe:
>>>     
>>>     s#{(\\.(?:{.+?}|.+?))}#$1#g
>> 
>> more likely:
>> 
>> perl -pe "s|\{([\xA0-\xFF])\}|$1|g"
>> 
>> 
>> I think you are trying to replace the recode, too, but for that you need
>> a lookup table with all the accented characters.
>> 
>> 	hp
>> 
>
>No, all I want to do is to strip the {} around the accented characters;
>recode does the conversion well.  With the small test bib file below,
>here's what I get using only recode, vs. recode + perl
>
>
>  % recode latex..latin1 < timeref.bib | grep ssmilch
>@BOOK{Sussmilch:1741,
>   author = {Johann Peter S{ü}ssmilch},
>
>  % recode latex..latin1 < timeref.bib | perl -pe 
>"s|\{([\xA0-\xFF])\}|$1|g" | grep ssmilch
>@BOOK{Sussmilch:1741,
>   author = {Johann Peter Sssmilch},
>
>Note that the ü just disappears.
>
I didn't have a problem with the substitution when run as
a stand-alone Perl program. The one liner may need its STDOUT
adjusted with binmode().

----------------------
perl gg.pl > itt.txt

itt.txt (from word):
252
unix crlf encoding(iso-8859-1) utf8
{Johann Peter Süssmilch}
::
However, I don't need to set the STDOUT
encoding. The default does the same thing,
probably because the string stayed a byte
string during the regex, and the Latin-1
values 0-255 are the same as the first 256
Unicode code points.

-sln
--------------------

use strict;
use warnings;

print ord('ü'),"\n";

my $str = "{Johann Peter S{\xFC}ssmilch}";

$str =~ s/\{([\xC0-\xFF])\}/$1/g;

# try one of these:
binmode (STDOUT, ":encoding(latin-1)");
#binmode (STDOUT);
#binmode (STDOUT, ":raw");

print "@{[PerlIO::get_layers(STDOUT)]}\n";

print "$str\n";
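
Another possible cause of the vanishing character, not verified
against the original setup: inside double quotes the shell expands $1
before perl ever sees it, normally to an empty string, so perl would
effectively be running s|...||g. That would match the reported output
"Sssmilch". The sketch below simulates what perl receives in each
case; the sub names are made up for illustration.

```perl
use strict;
use warnings;

my $input = "{Johann Peter S{\xFC}ssmilch}";

# What perl sees when the one-liner is in double quotes and the shell
# has already replaced $1 with an empty string.
sub as_seen_with_double_quotes {
    my ($text) = @_;
    $text =~ s|\{([\xA0-\xFF])\}||g;
    return $text;
}

# What perl sees when the one-liner is in single quotes.
sub as_seen_with_single_quotes {
    my ($text) = @_;
    $text =~ s|\{([\xA0-\xFF])\}|$1|g;
    return $text;
}

binmode STDOUT;
print as_seen_with_double_quotes($input), "\n";   # umlaut is dropped
print as_seen_with_single_quotes($input), "\n";   # umlaut is kept
```

If that is the cause, putting the -e script in single quotes, as in
perl -pe 's|\{([\xA0-\xFF])\}|$1|g', should be enough.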



------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 2697
***************************************

