[32504] in Perl-Users-Digest


Perl-Users Digest, Issue: 3769 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Thu Aug 30 21:09:24 2012

Date: Thu, 30 Aug 2012 18:09:12 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Thu, 30 Aug 2012     Volume: 11 Number: 3769

Today's topics:
    Re: check for exact # of digits <rweikusat@mssgmbh.com>
    Re: check for exact # of digits <hjp-usenet2@hjp.at>
        spaceless text engine <gigagoth@gmail.com>
        String parsing (2 questions) <rcranz143101@gmail.com>
    Re: String parsing (2 questions) <rweikusat@mssgmbh.com>
    Re: String parsing (2 questions) <*@eli.users.panix.com>
    Re: String parsing (2 questions) <ben@morrow.me.uk>
    Re: String parsing (2 questions) <jurgenex@hotmail.com>
    Re: String parsing (2 questions) <jurgenex@hotmail.com>
    Re: String parsing (2 questions) <rweikusat@mssgmbh.com>
    Re: String parsing (2 questions) (Seymour J.)
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Tue, 28 Aug 2012 13:01:30 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: check for exact # of digits
Message-Id: <87fw77kyxx.fsf@sapphire.mobileactivedefense.com>

"John W. Krahn" <jwkrahn@example.com> writes:
> Ben Morrow wrote:
>> Quoth jwkrahn@shaw.ca:

[...]

>>>      if ( 8 == $checkDate =~ tr/0-9// ) {
>>>           return "Date ($checkDate) must be YYYYMMDD\n";
>>>      }

[...]

>> For one thing I believe you have the condition the
>> wrong way around;
>
> No, it is correct.

As Robert Pike once quipped: If the program compiles, the machine is
happy. For that matter,

	if (abs(sqrt(64) - $checkDate =~ tr/0-9//) <= 0.05)

would be just as correct. And it even avoids having an lvalue on
the left side of ==.



------------------------------

Date: Thu, 30 Aug 2012 09:57:25 +0200
From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Subject: Re: check for exact # of digits
Message-Id: <slrnk3u775.6f3.hjp-usenet2@hrunkner.hjp.at>

On 2012-08-28 02:25, John W. Krahn <jwkrahn@example.com> wrote:
> Ben Morrow wrote:
>> Quoth jwkrahn@shaw.ca:
>>> bjlockie wrote:
>>>> I have this at the beginning of a sub ($checkDate is an input parameter).
>>>> I want to check for exactly 8 digits.
>>>> This works for less than 8 but doesn't work for more than 8.
>>>>
>>>>      if ($checkDate !~ /^\d{1,8}/) {
>>>>           return "Date ($checkDate) must be YYYYMMDD\n";
>>>>      }

I don't see how this "works for less than 8 but doesn't work for more
than 8". It accepts "1234567" just as it accepts "123456789".
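A minimal sketch of that behaviour, next to one anchored alternative. The
sub names op_check and strict_check are made up for illustration; only the
first pattern comes from the original post, and the second is just one
possible reading of "exactly 8 digits".

```perl
use strict;
use warnings;

# The OP's test: it only complains when the string does NOT start
# with at least one digit, so length is never really checked.
sub op_check {
    my ($checkDate) = @_;
    return $checkDate !~ /^\d{1,8}/ ? 'rejected' : 'accepted';
}

# Anchored at both ends: exactly eight ASCII digits and nothing else.
sub strict_check {
    my ($checkDate) = @_;
    return $checkDate !~ /\A[0-9]{8}\z/ ? 'rejected' : 'accepted';
}

for my $s ('1234567', '12345678', '123456789') {
    printf "%-10s op: %s  strict: %s\n", $s, op_check($s), strict_check($s);
}
```

Neither check verifies that the digits form a real calendar date.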


>>>      if ( 8 == $checkDate =~ tr/0-9// ) {
>>>           return "Date ($checkDate) must be YYYYMMDD\n";
>>>      }
>>
>> That's not the same.
>
> Duh!
>
>> For one thing I believe you have the condition the
>> wrong way around;
>
> No, it is correct.

No. Your code complains if $checkDate contains 8 digits and accepts all
other formats:

| jwkrahn: Date (12345678) must be YYYYMMDD

which is obviously the opposite of what the OP was trying to accomplish.


>> for another, the OP's pattern is anchored (at the
>> beginning, and should be at the end), which cannot be emulated with
>> tr///.
>
> The OP should have been more explicit in their specification.
>
> "I want to check for exactly 8 digits"

That's not the whole specification. You forgot:

"Date ($checkDate) must be YYYYMMDD\n";


> Which is accomplished by my solution.

The sentence you quoted didn't specify whether the string should consist
of 8 digits or contain 8 digits. If that string was the only information
you had, both interpretations might be valid. But you had more
information and deliberately chose the one which was contradicted by the
additional information. (I'm not sure whether you also deliberately
inverted the consequence of the check).
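To illustrate the two readings, a small sketch (both sub names are
hypothetical): counting with tr/// tests "contains 8 digits", while a match
anchored at both ends tests "consists of 8 digits".

```perl
use strict;
use warnings;

# "contains exactly 8 digits": tr/// with an empty replacement list
# only counts matching characters; it cannot anchor anything.
sub contains_8_digits {
    my ($s) = @_;
    return ($s =~ tr/0-9//) == 8 ? 1 : 0;
}

# "consists of exactly 8 digits": the whole string must be 8 digits.
sub consists_of_8_digits {
    my ($s) = @_;
    return $s =~ /\A[0-9]{8}\z/ ? 1 : 0;
}

for my $s ('20120830', '2012-08-30', 'x20120830y') {
    printf "%-12s contains: %d  consists: %d\n",
        $s, contains_8_digits($s), consists_of_8_digits($s);
}
```

Only the second reading fits a message that demands YYYYMMDD.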

	hp


-- 
   _  | Peter J. Holzer    | Deprecating human carelessness and
|_|_) | Sysadmin WSR       | ignorance has no successful track record.
| |   | hjp@hjp.at         | 
__/   | http://www.hjp.at/ |  -- Bill Code on asrg@irtf.org


------------------------------

Date: Tue, 28 Aug 2012 22:20:34 -0700 (PDT)
From: Renee P <gigagoth@gmail.com>
Subject: spaceless text engine
Message-Id: <fd1fa397-2481-4a66-8536-6046c17deae0@googlegroups.com>

Announcing a new open source spaceless text engine project on GitHub.

https://github.com/singularian/respace

It is an alpha spaceless text parser that will attempt to address a data compression necessity.

The engine consists of an interpreter, a command line parser and a file parser. It is written in C++, runs under Linux, and uses the concept of a coordinate string to create the spaced text.

http://www.indiegogo.com/projects/190454?c=home&a=947604
http://interblock.org/testprocessor.html


------------------------------

Date: Tue, 28 Aug 2012 14:40:51 -0700
From: "Robert Crandal" <rcranz143101@gmail.com>
Subject: String parsing (2 questions)
Message-Id: <a7Sdnepkmut7p6DNnZ2dnUVZ5tCdnZ2d@giganews.com>

My text file contains a list of strings that look similar to this:

ID1   ID2      ID3            Full name          ID4
-----  ---  ------------ ----------------    -----
6523 222 000000564 Adams Cody J     9999
1113 532 000000642 Barnes II Bob R   9999
786   123 000000441 Carter Sr James   9999
3333 245 000000994 Jones Jr Roy J     9999
5644 370 000000234 Martin Tom R      9999
2331 333 000000111 Van Horn Tim R  9999
etc. etc....

Question 1:
I will be running a while(<>) loop that reads each line
one at a time.  How can I extract just the ID1, ID2,
and the "full name" into string variables during each
loop iteration???

Question 2: (difficult)
Suppose we have extracted the "full name" string into
a variable named $full_name.  How can I extract
the "first name", "middle initial", and "last name plus suffix"?
This seems complicated, because some names are missing
middle initials.  Also, some last names contain an optional
suffix (such as Jr, Sr, or II, or III).  Also, some last names
have two parts, such as "Van Horn".

Thanks for your help.






------------------------------

Date: Tue, 28 Aug 2012 23:00:41 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: String parsing (2 questions)
Message-Id: <87sjb6r81i.fsf@sapphire.mobileactivedefense.com>

"Robert Crandal" <rcranz143101@gmail.com> writes:
> My text file contains a list of strings that look similar to this:
>
> ID1   ID2      ID3            Full name          ID4
> -----  ---  ------------ ----------------    -----
> 6523 222 000000564 Adams Cody J     9999
> 1113 532 000000642 Barnes II Bob R   9999
> 786   123 000000441 Carter Sr James   9999
> 3333 245 000000994 Jones Jr Roy J     9999
> 5644 370 000000234 Martin Tom R      9999
> 2331 333 000000111 Van Horn Tim R  9999
> etc. etc....
>
> Question 1:
> I will be running a while(<>) loop that reads each line
> one at a time.  How can I extract just the ID1, ID2,
> and the "full name" into string variables during each
> loop iteration???

Without knowing more about the line format, this can't really be
answered. Untested suggestion:

if (/^([0-9]+)\s+([0-9]+)\s+[0-9]+\s+([A-Za-z].*)\s+[0-9]/) {
        $id1 = $1;
        $id2 = $2;
        # the greedy .* can leave trailing whitespace in $3
        $full_name = $3;
} else {
        # complain about wrong format?
        next;
}

> Question 2: (difficult)
> Suppose we have extracted the "full name" string into
> a variable named $full_name.  How can I extract
> the "first name", "middle initial", and "last name plus suffix"?

Split the string on whitespace. Complain if there aren't at least two
elements in the resulting array. Set 'first name' to element #0. If
there are exactly two elements, set last name to element #1. If there
are more than two and the second is a capital letter followed by a
dot, assume it's a middle initial and use everything after it as the
last name. Otherwise, use everything after the first name.

Example code for that (parse the first commandline argument):

---------------
@name = split(/\s+/, $ARGV[0]);

die("must have at least two parts") unless @name > 1;

$first = shift(@name);

$middle = shift(@name)
    if  @name > 1 && $name[0] =~/^[A-Z]\.$/;

print('first ', $first, "\n", 'middle ', $middle, "\n",
      'last ', join(' ', @name), "\n");


------------------------------

Date: Tue, 28 Aug 2012 22:09:16 +0000 (UTC)
From: Eli the Bearded <*@eli.users.panix.com>
Subject: Re: String parsing (2 questions)
Message-Id: <eli$1208281755@qz.little-neck.ny.us>

In comp.lang.perl.misc, Robert Crandal <rcranz143101@gmail.com> wrote:
> My text file contains a list of strings that look similar to this:
> 
> ID1   ID2      ID3            Full name          ID4
> -----  ---  ------------ ----------------    -----
> 6523 222 000000564 Adams Cody J     9999
> 1113 532 000000642 Barnes II Bob R   9999
> 786   123 000000441 Carter Sr James   9999
> 3333 245 000000994 Jones Jr Roy J     9999
> 5644 370 000000234 Martin Tom R      9999
> 2331 333 000000111 Van Horn Tim R  9999
> etc. etc....
> 
> Question 1:
> I will be running a while(<>) loop that reads each line
> one at a time.  How can I extract just the ID1, ID2,
> and the "full name" into string variables during each
> loop iteration???

If it is like you show it, using spaces between columns and between
words in the name, you weep. Remember that some full names contain
numbers (consider former NY Times reporter Jennifer 8 Lee).

I'd start something like this:

/^(\d+)\s+(\d+)\s+\d+\s+(.*[^\s])\s+\d+$/
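A sketch of how that pattern might be applied per line. The sub name
parse_line is hypothetical, the first two sample lines come from the
original post, and the 'Lee Jennifer 8' line is an invented test case.

```perl
use strict;
use warnings;

my @lines = (
    "6523 222 000000564 Adams Cody J     9999",
    "786   123 000000441 Carter Sr James   9999",
    "1000 111 000000001 Lee Jennifer 8   9999",   # invented sample
);

# Returns (ID1, ID2, full name), or an empty list if the line
# does not have the expected shape.
sub parse_line {
    my ($line) = @_;
    # ID1, ID2, ID3 (not captured), name ending in a non-space, ID4
    if ($line =~ /^(\d+)\s+(\d+)\s+\d+\s+(.*[^\s])\s+\d+$/) {
        return ($1, $2, $3);
    }
    return;
}

for my $line (@lines) {
    my ($id1, $id2, $full_name) = parse_line($line);
    unless (defined $full_name) {
        warn "unparsable line: $line\n";
        next;
    }
    print "$id1|$id2|$full_name\n";
}
```

Because the trailing ID4 is anchored with $, a digit inside the name stays
in the name field. This relies on ID4 always being present.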

> Question 2: (difficult)
> Suppose we have extracted the "full name" string into
> a variable named $full_name.  How can I extract
> the "first name", "middle initial", and "last name plus suffix"?
> This seems complicated, because some names are missing

Seems complicated is an understatement.

> middle initials.  Also, some last names contain an optional
> suffix (such as Jr, Sr, or II, or III).  Also, some last names
> have two parts, such as "Van Horn".

Names are an extremely complicated problem. For ANY assumption
you can come up with about a name, someone in the world breaks it.

With luck, you don't have them in your database. Without luck,
expect Jennifer 8 Lee, people with a single name (eg Cher),
someone with the family Xi, but entered as XI to trick your parser
into thinking "the eleventh", someone with a given name of Junior,
someone with a two part first name.

Elijah
------
there are, even, people without names (obvious case: abandoned infants)


------------------------------

Date: Tue, 28 Aug 2012 23:51:42 +0100
From: Ben Morrow <ben@morrow.me.uk>
Subject: Re: String parsing (2 questions)
Message-Id: <ukuvg9-3542.ln1@anubis.morrow.me.uk>


Quoth "Robert Crandal" <rcranz143101@gmail.com>:
> My text file contains a list of strings that look similar to this:
> 
> ID1   ID2      ID3            Full name          ID4
> -----  ---  ------------ ----------------    -----
> 6523 222 000000564 Adams Cody J     9999
> 1113 532 000000642 Barnes II Bob R   9999
> 786   123 000000441 Carter Sr James   9999
> 3333 245 000000994 Jones Jr Roy J     9999
> 5644 370 000000234 Martin Tom R      9999
> 2331 333 000000111 Van Horn Tim R  9999
> etc. etc....

What have you tried? You are likely to get better help here if you make
an attempt to solve your problem yourself, and post that attempt with an
explanation of what wasn't working.

> Question 1:
> I will be running a while(<>) loop that reads each line
> one at a time.  How can I extract just the ID1, ID2,
> and the "full name" into string variables during each
> loop iteration???

How are the fields delimited from each other? You need to be able to
answer this question before you can write a program to separate them.
Assuming the ID fields are always going to be numbers, and the name
field is always going to be letters and spaces, you might use something
like this:

    my ($ID1, $ID2, $full_name) =
        /^ (\d+) \s+ (\d+) \s+ \d+ \s+ ([A-Za-z ]+[A-Za-z]) /x;

See 'perldoc perlretut'. The final [A-Za-z] is necessary to prevent the
'name' field from including the spaces after it.

> Question 2: (difficult)
> Suppose we have extracted the "full name" string into
> a variable named $full_name.  How can I extract
> the "first name", "middle initial", and "last name plus suffix"?
> This seems complicated, because some names are missing
> middle initials.  Also, some last names contain an optional
> suffix (such as Jr, Sr, or II, or III).  Also, some last names
> have two parts, such as "Van Horn".

Parsing names is extremely difficult, because they are extremely
variable. If at all possible you want to design your systems so that you
don't need to, and instead ask your users questions like 'what is your
full name' and 'how would you like us to address you'.

In this case, how would you distinguish between these two?

    Van Horn Tim
    Watson Mary Jane

Can you make a complete list of 'von's which might start a two-part
surname, or do you have to handle cases like married women who have
taken both surnames without a hyphen (and, potentially, their children)?
Alternatively, can you make a complete list of valid forenames? Do you
have a complete list of possible suffixes? Are you certain you won't
need to handle any names from cultures which put the family name first,
or use a patronymic system, or are otherwise 'unusual' from an English
perspective?

Is this a job you have to do once, where you can check the results and,
if necessary, clean them up afterwards, or will it be ongoing? Is it
possible/acceptable for the program to stop and ask what to do when a
name is ambiguous, or for it to put them in a file of 'names I couldn't
handle' to be dealt with manually? Is there any chance of changing the
input format to include the parts of the name separately in the first
place?

If you can't answer 'yes' to enough of these questions, what you want
may not be possible. This is nothing to do with Perl, or indeed with
programming: the only reason a human can distinguish the two names above
is because we assume that 'Mary' is a valid forename and 'Horn' isn't.
If that sort of guesswork is acceptable, you can give the program a list
of guesses to make; if it isn't, then the data format you are using
isn't good enough.
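If guesswork is acceptable, a list-driven heuristic could look roughly like
the sketch below. Everything in it is an assumption made for illustration:
the sub name split_name, the short prefix and suffix lists, and the input
layout "Last [Suffix] First [Initial]" inferred from the sample data. It
will misparse names that fall outside those lists.

```perl
use strict;
use warnings;

# Guess lists: illustrative only and certainly incomplete.
my %is_prefix = map { $_ => 1 } qw(Van Von De);
my %is_suffix = map { $_ => 1 } qw(Jr Sr II III);

# Returns (first name(s), middle initial, last name plus suffix),
# or an empty list when there are fewer than two words.
sub split_name {
    my ($full_name) = @_;
    my @w = split ' ', $full_name;
    return unless @w >= 2;

    # The surname starts with the first word; a listed prefix pulls
    # in the next word, and a listed suffix is attached after that.
    my @last = (shift(@w));
    push @last, shift(@w) if $is_prefix{ $last[0] } && @w > 1;
    push @last, shift(@w) if @w > 1 && $is_suffix{ $w[0] };

    # A trailing single capital letter is taken as the middle initial.
    my $middle = (@w > 1 && $w[-1] =~ /\A[A-Z]\z/) ? pop(@w) : '';

    return (join(' ', @w), $middle, join(' ', @last));
}

for my $n ('Van Horn Tim R', 'Barnes II Bob R', 'Watson Mary Jane') {
    my ($first, $middle, $last) = split_name($n);
    print "first=$first middle=$middle last=$last\n";
}
```

With these lists 'Van Horn Tim' and 'Watson Mary Jane' come out differently
only because 'Van' is listed and 'Watson' is not, which is exactly the kind
of guess described above.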

Ben



------------------------------

Date: Tue, 28 Aug 2012 17:06:46 -0700
From: Jürgen Exner <jurgenex@hotmail.com>
Subject: Re: String parsing (2 questions)
Message-Id: <15nq38lpmopmbnat83rm79s3mdiop7quv8@4ax.com>

"Robert Crandal" <rcranz143101@gmail.com> wrote:
>My text file contains a list of strings that look similar to this:
>
>ID1   ID2      ID3            Full name          ID4
>-----  ---  ------------ ----------------    -----
>6523 222 000000564 Adams Cody J     9999
>1113 532 000000642 Barnes II Bob R   9999
>786   123 000000441 Carter Sr James   9999
>3333 245 000000994 Jones Jr Roy J     9999
>5644 370 000000234 Martin Tom R      9999
>2331 333 000000111 Van Horn Tim R  9999
>etc. etc....
>
>Question 1:
>I will be running a while(<>) loop that reads each line
>one at a time.  How can I extract just the ID1, ID2,
>and the "full name" into string variables during each
>loop iteration???

perldoc -f split

>Question 2: (difficult)
>Suppose we have extracted the "full name" string into
>a variable named $full_name.  How can I extract
>the "first name", "middle initial", and "last name plus suffix"?
>This seems complicated, because some names are missing
>middle initials.  Also, some last names contain an optional
>suffix (such as Jr, Sr, or II, or III).  Also, some last names
>have two parts, such as "Van Horn".

By employing a data typist who can leverage human intelligence and
therefore may get it right for maybe 90% of cases.

jue


------------------------------

Date: Tue, 28 Aug 2012 17:10:36 -0700
From: Jürgen Exner <jurgenex@hotmail.com>
Subject: Re: String parsing (2 questions)
Message-Id: <2anq38d2964d2bsm280qk8ilrn8gnkro9p@4ax.com>

Eli the Bearded <*@eli.users.panix.com> wrote:
>In comp.lang.perl.misc, Robert Crandal <rcranz143101@gmail.com> wrote:
>> My text file contains a list of strings that look similar to this:
>> 
>> ID1   ID2      ID3            Full name          ID4
>> -----  ---  ------------ ----------------    -----
>> 6523 222 000000564 Adams Cody J     9999
>> 1113 532 000000642 Barnes II Bob R   9999
>> 786   123 000000441 Carter Sr James   9999
>> 3333 245 000000994 Jones Jr Roy J     9999
>> 5644 370 000000234 Martin Tom R      9999
>> 2331 333 000000111 Van Horn Tim R  9999
>> etc. etc....
>> 
>> Question 1:
>> I will be running a while(<>) loop that reads each line
>> one at a time.  How can I extract just the ID1, ID2,
>> and the "full name" into string variables during each
>> loop iteration???
>
>If it is like you show it, using spaces between columns and between
>words in the name, you weep. Remember that some full names contain
>numbers (consider former NY Times reporter Jennifer 8 Lee).

Actually, it is very simple:
- split() on space character
- grab the first 3 items, from these throw away the third
- and from the remaining items throw away the last item and re-combine
the others for the name
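The steps above can be sketched as follows. The sub name fields_from_line
is hypothetical, and the sketch assumes every data line has ID1, ID2, ID3,
at least one name word, and a trailing ID4.

```perl
use strict;
use warnings;

# Returns (ID1, ID2, full name), or an empty list if the line
# has too few fields.
sub fields_from_line {
    my ($line) = @_;
    my @f = split ' ', $line;      # split on runs of whitespace
    return unless @f >= 5;         # ID1 ID2 ID3 name... ID4
    my ($id1, $id2) = @f[0, 1];    # the third item (ID3) is not kept
    pop @f;                        # throw away the last item (ID4)
    my $full_name = join ' ', @f[3 .. $#f];
    return ($id1, $id2, $full_name);
}

my ($id1, $id2, $full_name) =
    fields_from_line("2331 333 000000111 Van Horn Tim R  9999");
print "$id1|$id2|$full_name\n";
```

This does not check that the ID fields are numeric, so the two header lines
of the file would have to be skipped separately.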

>I'd start something like this:
>
>/^(\d+)\s+(\d+)\s+\d+\s+(.*[^\s])\s+\d+$/

You don't have to use a hammer to drive a screw. Perl's toolbox contains
many more tools than just REs.

jue


------------------------------

Date: Wed, 29 Aug 2012 11:44:58 +0100
From: Rainer Weikusat <rweikusat@mssgmbh.com>
Subject: Re: String parsing (2 questions)
Message-Id: <87ehmqq8np.fsf@sapphire.mobileactivedefense.com>

Ben Morrow <ben@morrow.me.uk> writes:

[...]

> Parsing names is extremely difficult, because they are extremely
> variable. If at all possible you want to design your systems so that you
> don't need to, and instead ask your users questions like 'what is your
> full name' and 'how would you like us to address you'.
>
> In this case, how would you distinguish between these two?
>
>     Van Horn Tim
>     Watson Mary Jane
>
> Can you make a complete list of 'von's which might start a two-part
> surname, or do you have to handle cases like married women who have
> taken both surnames without a hyphen (and, potentially, their
> children)?

The solution to this is to define a grammar for 'supported name
formats' which catches the expected cases and live with the fact that
any heuristic fails in some situations. This means that 'Watson Mary
Jane' may have to decide if he is 'Watson M Jane' or if she is 'Mary J
Watson' or any other permutation of the given set of letters and
spaces.






------------------------------

Date: Wed, 29 Aug 2012 16:02:18 -0400
From: Shmuel (Seymour J.) Metz <spamtrap@library.lspace.org.invalid>
Subject: Re: String parsing (2 questions)
Message-Id: <503e754a$18$fuzhry+tra$mr2ice@news.patriot.net>

In <a7Sdnepkmut7p6DNnZ2dnUVZ5tCdnZ2d@giganews.com>, on 08/28/2012
   at 02:40 PM, "Robert Crandal" <rcranz143101@gmail.com> said:

>Question 1:

You will need to do some requirements analysis before anybody will be
able to answer your questions accurately.

>How can I extract just the ID1, ID2,
>and the "full name" into string variables during each
>loop iteration???

Is ID4 always present? Is it always 9999? Can a name terminate in a
string of digits?

>Suppose we have extracted the "full name" string into
>a variable named $full_name.  How can I extract
>the "first name", "middle initial", and "last name plus suffix"?
>This seems complicated,

It is, hence the need for requirements analysis. First determine all
of the forms that you need to support, check for ambiguity and only
then worry about how to parse it.

-- 
Shmuel (Seymour J.) Metz, SysProg and JOAT  <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action.  I reserve the
right to publicly post or ridicule any abusive E-mail.  Reply to
domain Patriot dot net user shmuel+news to contact me.  Do not
reply to spamtrap@library.lspace.org



------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 3769
***************************************

