[33112] in Perl-Users-Digest

Perl-Users Digest, Issue: 4388 Volume: 11

daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Fri Mar 13 05:22:18 2015

Date: Fri, 13 Mar 2015 02:22:09 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)

Perl-Users Digest           Fri, 13 Mar 2015     Volume: 11 Number: 4388

Today's topics:
        This RE isn't working as expected. <see.my.sig@for.my.address>
    Re: This RE isn't working as expected. <see.my.sig@for.my.address>
    Re: This RE isn't working as expected. <news@todbe.com>
    Re: This RE isn't working as expected. <bauhaus@futureapps.invalid>
    Re: This RE isn't working as expected. <gamo@telecable.es>
    Re: This RE isn't working as expected. <rweikusat@mobileactivedefense.com>
    Re: This RE isn't working as expected. (Seymour J.)
    Re: This RE isn't working as expected. <rweikusat@mobileactivedefense.com>
    Re: This RE isn't working as expected. <*@eli.users.panix.com>
    Re: This RE isn't working as expected. <rweikusat@mobileactivedefense.com>
    Re: This RE isn't working as expected. <see.my.sig@for.my.address>
        Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)

----------------------------------------------------------------------

Date: Wed, 11 Mar 2015 21:56:19 -0700
From: Robbie Hatley <see.my.sig@for.my.address>
Subject: This RE isn't working as expected.
Message-Id: <_4-dnd9rzJzsgZzInZ2dnUVZ572dnZ2d@giganews.com>


I'm fishing through headers in a bunch of emails, trying to weed out
extraneous headers and only keep Date, Subject, To, and From.
To do that I cooked up the following RE, but it only partly works:

LINE: while (<EHANDLE>) {
    ...
    next LINE if not m/^(?:Da)|(?:Su)|(?:To)|(?:Fr)/;
    ...
}


It does print the desired headers and rejects most others, yes...
but it also allows through some extraneous headers as well.
For example:


	h=From:From:Subject:Date:To:MIME-Version:Content-Type;
	0X8F8tPfdSuLmlAIPmVjsQ==;ro
Date: Wed, 15 Oct 2014 17:25:16 -0700
Subject: Notification of something important
To: Name Changed <blahblah@blahblah.com>
From: "Some Company" <sales@acme.com>
Reply-To: support@gargoyle.com
X-XPT-XSL-Name: email_pimp/default/en_US/ECheckPayment.xsl
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=windows-1252


Which is much better than the original 44 lines of gibberish, but
how did lines get through which start with tab characters, "Re",
"X-", and "Co"? Those should have been filtered out! Is there
something fundamentally wrong with my regular expression?



-- 
Puzzled,
Robbie Hatley
Midway City, CA, USA
perl -le 'print "\154o\156e\167o\154f\100w\145ll\56c\157m"'
http://www.well.com/user/lonewolf/
https://www.facebook.com/robbie.hatley


------------------------------

Date: Wed, 11 Mar 2015 23:46:23 -0700
From: Robbie Hatley <see.my.sig@for.my.address>
Subject: Re: This RE isn't working as expected.
Message-Id: <ytudnasIrJKgq5zInZ2dnUVZ572dnZ2d@giganews.com>


On 3/11/2015 9:56 PM, I had written:

> I'm fishing through headers in a bunch of emails, trying to weed out
> extraneous headers and only keep Date, Subject, To, and From.
> To do that I cooked up the following RE, but it only partly works:
>
> LINE: while (<EHANDLE>) {
>     ...
>     next LINE if not m/^(?:Da)|(?:Su)|(?:To)|(?:Fr)/;
>     ...
> }
>
>
> It does print the desired headers and rejects most others, yes...
> but it also allows through some extraneous headers as well.
> For example:
>
>
>      h=From:From:Subject:Date:To:MIME-Version:Content-Type;
>      0X8F8tPfdSuLmlAIPmVjsQ==;ro
> Date: Wed, 15 Oct 2014 17:25:16 -0700
> Subject: Notification of something important
> To: Name Changed <blahblah@blahblah.com>
> From: "Some Company" <sales@acme.com>
> Reply-To: support@gargoyle.com
> X-XPT-XSL-Name: email_pimp/default/en_US/ECheckPayment.xsl
> Content-Transfer-Encoding: quoted-printable
> Content-Type: text/plain; charset=windows-1252


I finally got that to work. Some of the bugs had to do with
logic which bypassed the RE for some input lines. But I think
some of it had to do with how my RE was structured.

QUESTION: in the original version of my RE, does the ^ metacharacter
bind more tightly to the first group than the | alternation operator?

m/^(?:Da)|(?:Su)|(?:To)|(?:Fr)/

I note that the first bogus line to get through did have the words
"From", "Subject", and "To" in it, just not at the beginning, as if
the ^ was being bypassed. When I changed it to the following, that
first bogus line went away:

m/^(?:Da|Su|To|Fr)/
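
A minimal sketch of the difference between the two patterns, run against
lines from the sample output above. The helper names loose_match and
tight_match are made up for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two patterns from this post.
# Original: '^' belongs only to the first alternative, (?:Da).
sub loose_match { return $_[0] =~ /^(?:Da)|(?:Su)|(?:To)|(?:Fr)/ ? 1 : 0 }

# Revised: a single group, so '^' applies to every alternative.
sub tight_match { return $_[0] =~ /^(?:Da|Su|To|Fr)/ ? 1 : 0 }

# Lines taken from the sample output quoted above.
my @lines = (
    "\th=From:From:Subject:Date:To:MIME-Version:Content-Type;",
    "\t0X8F8tPfdSuLmlAIPmVjsQ==;ro",
    'Date: Wed, 15 Oct 2014 17:25:16 -0700',
    'Reply-To: support@gargoyle.com',
    'Content-Type: text/plain; charset=windows-1252',
);

for my $line (@lines) {
    printf "loose=%d tight=%d %s\n",
        loose_match($line), tight_match($line), $line;
}
```

With the original pattern, the two tab-indented lines and "Reply-To:" pass
because they contain "Fr", "Su", or "To" somewhere in the line. The
"Content-Type:" line matches neither pattern, which fits the observation
that some lines were getting through by bypassing the RE altogether.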



-- 
Cheers,
Robbie Hatley
Midway City, CA, USA
perl -le 'print "\154o\156e\167o\154f\100w\145ll\56c\157m"'
http://www.well.com/user/lonewolf/
https://www.facebook.com/robbie.hatley


------------------------------

Date: Wed, 11 Mar 2015 23:51:26 -0700
From: "$Bill" <news@todbe.com>
Subject: Re: This RE isn't working as expected.
Message-Id: <5501376E.4010803@todbe.com>

On 3/11/2015 23:46, Robbie Hatley wrote:
>
> On 3/11/2015 9:56 PM, I had written:
>
>> I'm fishing through headers in a bunch of emails, trying to weed out
>> extraneous headers and only keep Date, Subject, To, and From.
>> To do that I cooked up the following RE, but it only partly works:
>>
>> LINE: while (<EHANDLE>) {
>>     ...
>>     next LINE if not m/^(?:Da)|(?:Su)|(?:To)|(?:Fr)/;
>>     ...
>> }
 ...
> I finally got that to work. Some of the bugs had to do with
> logic which bypassed the RE for some input lines. But I think
> some of it had to do with how my RE was structured.
>
> QUESTION: in the original version of my RE, does the ^ metacharacter
> bind more tightly to the first group than the | alternation operator?
>
> m/^(?:Da)|(?:Su)|(?:To)|(?:Fr)/
>
> I note that the first bogus line to get through did have the words
> "From", "Subject", and "To" in it, just not at the beginning, as if
> the ^ was being bypassed. When I changed it to the following, that
> first bogus line went away:
>
> m/^(?:Da|Su|To|Fr)/

I go more like this (or 'next if not' if you prefer):

while (<EHANDLE>) {
	if (/^(Date|Subj|To\b|From\b)/) {
		print $_;
	}
}


------------------------------

Date: Thu, 12 Mar 2015 12:38:13 +0100
From: "G.B." <bauhaus@futureapps.invalid>
Subject: Re: This RE isn't working as expected.
Message-Id: <mdrtpk$h5d$1@dont-email.me>

On 12.03.15 07:46, Robbie Hatley wrote:

> QUESTION: in the original version of my RE, does the ^ metacharacter
> bind more tightly to the first group than the | alternation operator?


Yes. From perlrequick(1):

   /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere
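
A small sketch that exercises the perlrequick pattern. The sub name hits is
a made-up helper and the three test strings are chosen for illustration.

```perl
use strict;
use warnings;

# '^' sits inside the first alternative, so only the 'a' branch is anchored.
sub hits { return $_[0] =~ /(^a|b)c/ ? 'match' : 'no match' }

for my $s ('ac', 'xac', 'xbc') {
    printf "%-4s %s\n", $s, hits($s);
}
# ac   match
# xac  no match
# xbc  match
```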



------------------------------

Date: Thu, 12 Mar 2015 15:37:27 +0100
From: gamo <gamo@telecable.es>
Subject: Re: This RE isn't working as expected.
Message-Id: <mds8b6$v3r$1@speranza.aioe.org>

El 12/03/15 a las 07:51, $Bill escribió:
> I go more like this (or 'next if not' if you prefer):
>
> while (<EHANDLE>) {
>      if (/^(Date|Subj|To\b|From\b)/) {
>          print $_;
>      }
> }

But why leave out characters that must be there anyway?

I propose this (untested):

if (/^(Date|Subject|To|From)\:\s/){


-- 
http://www.telecable.es/personales/gamo/
The generation of random numbers is too important to be left to chance


------------------------------

Date: Thu, 12 Mar 2015 16:29:31 +0000
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: This RE isn't working as expected.
Message-Id: <87wq2mtc0k.fsf@doppelsaurus.mobileactivedefense.com>

gamo <gamo@telecable.es> writes:
> El 12/03/15 a las 07:51, $Bill escribió:
>> I go more like this (or 'next if not' if you prefer):
>>
>> while (<EHANDLE>) {
>>      if (/^(Date|Subj|To\b|From\b)/) {
>>          print $_;
>>      }
>> }
>
> But why leave out characters that must be there anyway?
>
> I propose this (untested):
>
> if (/^(Date|Subject|To|From)\:\s/){

If this is supposed to match headers reliably it really has to match the
complete name up to the trailing colon, otherwise, it might always catch
a different header. : is not a meta-character, however, so it doesn't
have to be escaped. There's also one problem with the general idea:
Header values can continue over multiple lines. Syntactically, this
is signified by a line starting with a whitespace character.

IMHO, it would also be nice to use a list of headers as input and let
some code worry about building a regex from that.

Miniature e-mail header cleaner:

-----------
my @wanted = qw(Date Subject To	From Received);

my $want_re = '^(?:'.join('|', @wanted).'):\s';
my $wanted;

while (<>) {
    print, last if /^\n$/;
    print, next if /^\s/ && $wanted;
    $wanted = 0, next unless /$want_re/;

    print;
    $wanted = 1;
}

print while <>;
------------

 ... sooner or later, the wrath of the CPAN surfers ought to hit this
thread ...



------------------------------

Date: Thu, 12 Mar 2015 08:57:09 -0400
From: Shmuel (Seymour J.) Metz <spamtrap@library.lspace.org.invalid>
Subject: Re: This RE isn't working as expected.
Message-Id: <55018d25$4$fuzhry+tra$mr2ice@news.patriot.net>

In <_4-dnd9rzJzsgZzInZ2dnUVZ572dnZ2d@giganews.com>, on 03/11/2015
   at 09:56 PM, Robbie Hatley <see.my.sig@for.my.address> said:

>LINE: while (<EHANDLE>) {
>    ...
>    next LINE if not m/^(?:Da)|(?:Su)|(?:To)|(?:Fr)/;
>    ...
>}

That doesn't handle CFWS correctly. Test with a To: that takes multiple
lines.

-- 
Shmuel (Seymour J.) Metz, SysProg and JOAT  <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action.  I reserve the
right to publicly post or ridicule any abusive E-mail.  Reply to
domain Patriot dot net user shmuel+news to contact me.  Do not
reply to spamtrap@library.lspace.org



------------------------------

Date: Thu, 12 Mar 2015 18:02:25 +0000
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: This RE isn't working as expected.
Message-Id: <87oanyt7pq.fsf@doppelsaurus.mobileactivedefense.com>

Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:

[...]

> print while <>;

Functionally identical: print <>;


------------------------------

Date: Thu, 12 Mar 2015 19:22:04 +0000 (UTC)
From: Eli the Bearded <*@eli.users.panix.com>
Subject: Re: This RE isn't working as expected.
Message-Id: <eli$1503121521@qz.little-neck.ny.us>

In comp.lang.perl.misc, Robbie Hatley  <see.my.sig@for.my.address> wrote:
> I'm fishing through headers in a bunch of emails, trying to weed out
> extraneous headers and only keep Date, Subject, To, and From.
> To do that I cooked up the following RE, but it only partly works:
>     next LINE if not m/^(?:Da)|(?:Su)|(?:To)|(?:Fr)/;
 ...
> Is there something fundamentally wrong with my regular expression?

Yes.

#1:
Email headers are case insensitive, but it is very rare for modern
tools to use non-standard capitalization. It DOES however happen.
Perl specific example:

     Date: Wed, 29 Jul 2009 22:58:41 +0200
     MIME-Version: 1.0
     Content-Type: text/plain; charset="us-ascii"; format="flowed"
     from: modules@perl.org
     to: ELIJAH@cpan.org
     subject: Perl Monks compromised, PAUSE accounts at risk
     
     Dear CPAN author,

     This email is being sent to inform you that all passwords on the popular
     Perl Monks website were compromised.  Many CPAN authors have accounts
     there and in some cases have used the same password for PAUSE.

     [...]

#2:
Email headers can span multiple lines. This is particularly common for
To: and CC: lists, but it is not too rare in Subject:. I've never seen
it in a Date:. 

#3:
Your regular expression has broken anchoring. This is the error that you
have observed. Your "^" is only applied to the "Da" substring.

#4:
Your regular expression has broken end-string filtering. The "Summary:"
header is rare but legitimate. In the case of posted and mailed news,
you might rarely see a "Supersedes:" header. Future email standards may
add other headers that you'd also be matching.
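
A minimal sketch of a filter that takes points #1 through #4 into account.
The sub name keep_headers and the sample message are hypothetical, and the
sketch assumes the lines have already been chomped.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the header lines whose field name is in
# @names, together with their folded continuation lines. Stops at the
# first empty line, which ends the header block.
sub keep_headers {
    my ($lines, @names) = @_;
    my $alt = join '|', map { quotemeta } @names;
    my $re  = qr/^(?:$alt):/i;   # '^' covers all names; full name plus colon; /i
    my $keeping = 0;
    my @kept;
    for my $line (@$lines) {
        last if $line =~ /^\r?$/;        # blank line: end of headers
        if ($line =~ /^[ \t]/) {         # continuation of the previous header
            push @kept, $line if $keeping;
            next;
        }
        $keeping = $line =~ $re ? 1 : 0;
        push @kept, $line if $keeping;
    }
    return @kept;
}

# Made-up sample message for illustration.
my @msg = (
    'Received: from example.invalid',
    "\tby mx.example.invalid",
    'from: modules@perl.org',
    'To: a@example.invalid,',
    ' b@example.invalid',
    'Summary: not wanted',
    'subject: Hello',
    '',
    'To: this line is body text',
);

print "$_\n" for keep_headers(\@msg, qw(Date Subject To From));
```

The continuation line of "Received:" is dropped along with its header, the
folded "To:" is kept whole, "Summary:" no longer slips through, and nothing
after the blank line is treated as a header.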

Elijah
------
welcomes finding out what #5 he has missed


------------------------------

Date: Thu, 12 Mar 2015 19:59:49 +0000
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Subject: Re: This RE isn't working as expected.
Message-Id: <878uf2t2a2.fsf@doppelsaurus.mobileactivedefense.com>

Eli the Bearded <*@eli.users.panix.com> writes:
> In comp.lang.perl.misc, Robbie Hatley  <see.my.sig@for.my.address> wrote:
>> I'm fishing through headers in a bunch of emails, trying to weed out
>> extraneous headers and only keep Date, Subject, To, and From.
>> To do that I cooked up the following RE, but it only partly works:
>>     next LINE if not m/^(?:Da)|(?:Su)|(?:To)|(?:Fr)/;
> ...
>> Is there something fundamentally wrong with my regular expression?
>
> Yes.
>
> #1:
> Email headers are case insensitive, but it is very rare for modern
> tools to use non-standard capitalization. It DOES however happen.

Since this was the only one not mentioned so far: This means 'the regex'
(any of them) should use /i for robustness.


------------------------------

Date: Fri, 13 Mar 2015 01:14:59 -0700
From: Robbie Hatley <see.my.sig@for.my.address>
Subject: Re: This RE isn't working as expected.
Message-Id: <G-2dneZKIp3iAZ_InZ2dnUVZ57ydnZ2d@giganews.com>


It appears I got some very relevant and helpful replies to this post.
My thanks to all who replied.

I'll reply to a couple specifically.

On 3/12/2015 9:29 AM, Rainer Weikusat wrote:

> If this is supposed to match headers reliably it really has to match the
> complete name up to the trailing colon, otherwise, it might always catch
> a different header. : is not a meta-character, however, so it doesn't
> have to be escaped....

Yep, I found that to make it work reliably I had to change it to:

next LINE if not m/^(?:Date:|Subject:|To:|From:)/;

Though, as "Eli The Bearded" pointed out in his response, this needs
to be case-insensitive. And as you pointed out in another response,
that could be done with i:

next LINE if not m/^(?:Date:|Subject:|To:|From:)/i;

> There's also one problem with the general idea:
> Header values can continue over multiple lines. Syntactically, this
> is signified by a line starting with a whitespace character.

In general, yes. But in the emails I was processing, the important
headers were all quite short.

> IMHO, it would also be nice to use a list of headers as input and let
> some code worry about building a regex from that.
>
> Miniature e-mail header cleaner:
>
> -----------
> my @wanted = qw(Date Subject To	From Received);
>
> my $want_re = '^(?:'.join('|', @wanted).'):\s';
> my $wanted;
>
> while (<>) {
>      print, last if /^\n$/;
>      print, next if /^\s/ && $wanted;
>      $wanted = 0, next unless /$want_re/;
>
>      print;
>      $wanted = 1;
> }
>
> print while <>;

That's more general, but I was in a hurry to process a bunch of emails
quickly, and they were all addressed to the same person, and the header
lines I really wanted to print ( From: To: Date: Subject: ) were all
quite short, so I wrote a program called "qp-to-ascii.perl" to process
these hundreds of emails (part of evidence in a hearing).

(See program below if you're curious.)

> .... sooner or later, the wrath of the CPAN surfers ought to hit this
> thread ...

Yep, I probably should have looked in CPAN to see if there's a
"email cleaner" module that strips out extraneous headers, HTML
sections, and attachments, converts text from "quoted printable"
to ASCII, then prints the "cleaned" email.



My own crude "let's hack that together in a hurry" version follows,
for your amusement.



#! /usr/bin/perl
#  /rhe/scripts/util/qp-to-ascii.perl

use v5.14;
use strict;
use warnings;
use MIME::QuotedPrint;
use Cwd;

our @filenames  = ();      # names of email files to be processed
our $dirname    = '';      #   name   of    current working directory
our $section    = 'head';  # section indicator ('head' or 'body')
our $blflag     = 0;       # previous-line-was-blank  flag

$dirname = getcwd();

opendir(DHANDLE, $dirname) or die "Can\'t open directory \"$dirname\". $!.";

# Iterate through current directory, collecting info on all "*.eml" files:
FILE: while (my $filename=readdir(DHANDLE))
{
    # We're only interested in "regular" files (not directories, symbolic links,
    # etc), so if current file isn't a regular file, move on to next file:
    next FILE if not -f $filename;

    # We're only interested in "*.eml" files, so if $filename is less than
    # 5 characters in length, move on to next file:
    next FILE if (length($filename) < 5);

    # We're only interested in "*.eml" files, so if last 4 characters of
    # file name are not ".eml", move on to next file:
    next FILE if (not(substr($filename,-4,4) eq '.eml'));

    # If we get to here, push current file name onto list:
    push(@filenames, $filename);
};

closedir(DHANDLE);

EMAIL: foreach my $emlname (@filenames)
{
    say "Processing file $emlname.";
    my $txtname = $emlname;
    substr($txtname,-4,4,'.txt');
    open(EHANDLE, '< :encoding(windows-1252)', $emlname)
       or warn "Cannot open email file $emlname for  input."
       and next EMAIL;
    open(THANDLE, '> :encoding(windows-1252)', $txtname)
       or warn "Cannot open text  file $txtname for output."
       and close(EHANDLE)
       and next EMAIL;

    $section = 'head';  # The first few lines are always the header.
    $blflag  = 0;       # We haven't yet printed any blank lines.

    # Process each line of text in current email, converting from
    # "quoted printable" to ASCII and deleting junk lines:
    LINE: while (<EHANDLE>)
    {
       # INITIAL GLOBAL ACTIONS (actions I take regardless of section):

       # Windows-chomp. I can't use regular "chomp" here because it only
       # gets rid of the final \x0a and leaves a trouble-causing \x0d at the
       # end of every line. So instead, I first remove the \x0a from the end
       # of each line, then remove the \x0d from the end of each line:
       s/\x0a$//g; # get rid of LF
       s/\x0d$//g; # get rid of CR

       # SECTION SWITCH (take different actions depending on section):

       # Header section:
       if ($section eq 'head') {
          # If this line contains 'Content-Type: text/plain',
          # next line is beginning of body section:
          if (m<Content-Type: text/plain>) {
             $section = 'body';
          }

          # Skip unnecessary headers (header names are case-insensitive):
          next LINE if not m/^(?:Date:|Subject:|To:|From:)/i;
       }

       # Body section:
       else {
          # If this line starts with 'PPID' do not print this line and exit
          # LINE loop here, because the first line marked 'PPID' is the first
          # line of the lengthy HTML gibberish section which comes after the
          # legible "plain text" section:
          last LINE if 'PPID' eq substr($_, 0, 4);

          # Otherwise, decode from quoted-printable to ASCII.
          #
          # Note that I'm purposely mis-using decode_qp() here. Normally, it converts
          # each " =\x0d\x0a" at line ends into a single space, so that the lines of
          # each paragraph are merged into one paragraph with one "\x0d\x0a" at the end.
          #
          # But the emails I'm printing already have lines chopped to just the length
          # I like, so I'm retaining "one CRLF per line" instead of going over to
          # "one CRLF per paragraph". So to purposely sabotage decode_qp() from merging
          # lines, I chomp-off the \x0d\x0a at the ends of all lines (see "INITIAL GLOBAL
          # ACTIONS" section above).
          #
          # This does necessitate manually getting rid of the " =" at the ends of lines,
          # but that is easily accomplished:

          $_ = decode_qp($_); # Decode quoted-printable to ASCII.
          s/ =$//g;           # Get rid of " =" line endings.

       }

       # FINAL GLOBAL ACTIONS (actions I take regardless of section):

       # Strip all leading and trailing whitespace from current line.
       # (If line consists only of whitespace, line will now become ''.)
       s/^\s+//g; # strip all leading  whitespace
       s/\s+$//g; # strip all trailing whitespace

       if ($blflag) {      # if the previous line we printed was blank
          if ($_ eq '') {  #    if current line is also blank,
             next LINE;    #       skip current line.
          }                #
          else {           #    otherwise,
             $blflag = 0;  #       reset "blank line" flag to false
          }
       }
       else {              # else if previous line was NOT blank,
          if ($_ eq '') {  #    if current line IS blank,
             $blflag = 1;  #       set the "blank line" flag to true
          }
          else {           #    otherwise,
             ;             #       do nothing.
          }
       }

       # Print line:
       print(THANDLE "$_\x0d\x0a");
    }                      # end LINE loop
    close(EHANDLE);        # close *.eml handle
    close(THANDLE);        # close *.txt handle
}                         # end EMAIL loop
exit 0;                   # Exit program and return "success" code.
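
For comparison, a minimal sketch of what decode_qp() from MIME::QuotedPrint
does when the line ending is left in place. The variable names and sample
strings are made up for illustration.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# A soft line break ("=" directly before the newline) is removed, so the
# two physical lines come back as one logical line.
my $joined = decode_qp("first half =\nsecond half\n");
print $joined;    # first half second half

# "=XX" is decoded to the byte with that hexadecimal value.
my $byte = decode_qp("caf=E9");
printf "%vd\n", $byte;    # 99.97.102.233
```

This is the line-joining behaviour the script above deliberately avoids by
stripping the line ending before calling decode_qp().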





-- 
Cheers,
Robbie Hatley
Midway City, CA, USA
perl -le 'print "\154o\156e\167o\154f\100w\145ll\56c\157m"'
http://www.well.com/user/lonewolf/
https://www.facebook.com/robbie.hatley


------------------------------

Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin) 
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>


Administrivia:

To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.

Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests. 

#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.


------------------------------
End of Perl-Users Digest V11 Issue 4388
***************************************

