[32905] in Perl-Users-Digest
Perl-Users Digest, Issue: 4183 Volume: 11
daemon@ATHENA.MIT.EDU (Perl-Users Digest)
Mon Mar 31 16:09:26 2014
Date: Mon, 31 Mar 2014 13:09:03 -0700 (PDT)
From: Perl-Users Digest <Perl-Users-Request@ruby.OCE.ORST.EDU>
To: Perl-Users@ruby.OCE.ORST.EDU (Perl-Users Digest)
Perl-Users Digest Mon, 31 Mar 2014 Volume: 11 Number: 4183
Today's topics:
Re: Google spreadsheets with WWW::Mechanize <sun_tong_001@users.sourceforge.net>
Re: Google spreadsheets with WWW::Mechanize <marc.girod@gmail.com>
Re: Google spreadsheets with WWW::Mechanize <marc.girod@gmail.com>
The best approach to simplify/clean-up html code <sun_tong_001@users.sourceforge.net>
Re: The best approach to simplify/clean-up html code <john@castleamber.com>
Re: The best approach to simplify/clean-up html code <sun_tong_001@users.sourceforge.net>
Re: The best approach to simplify/clean-up html code (Tim McDaniel)
Digest Administrivia (Last modified: 6 Apr 01) (Perl-Users-Digest Admin)
----------------------------------------------------------------------
Date: Sun, 30 Mar 2014 02:12:28 GMT
From: * Tong * <sun_tong_001@users.sourceforge.net>
Subject: Re: Google spreadsheets with WWW::Mechanize
Message-Id: <g4LZu.8300$a35.3611@fx02.iad>
On Sat, 29 Mar 2014 05:16:05 -0700, Marc Girod wrote:
> I have still not quite solved all my problems, but the package is not to
> blame.
I didn't catch the OP, but FYI, if you do not absolutely have to use
Perl/WWW::Mechanize, Google has a better scripting language for its
spreadsheets, Google Docs, etc. It is Google Apps Script.
http://www.google.com/script/
Very powerful. I use it to deal with Google Docs, spreadsheets, etc.
------------------------------
Date: Sun, 30 Mar 2014 03:01:03 -0700 (PDT)
From: Marc Girod <marc.girod@gmail.com>
Subject: Re: Google spreadsheets with WWW::Mechanize
Message-Id: <59a4106b-8069-4d64-a9be-70459d59f561@googlegroups.com>
On Sunday, 30 March 2014 03:12:28 UTC+1, * Tong * wrote:
> Google has a better scripting language for its
> spreadsheets, Google Docs, etc. It is Google Apps Script.
I found it as well, after noticing that the API would not give me access to the colour of cells.
I still have some difficulty finding it 'very powerful', but maybe that is only because it is new to me.
I haven't yet found hashes (or dictionaries...), and even subscripting pseudo-arrays (e.g. those returned by getDataRange) escapes me now.
But we'd go out-of-scope here...
Thanks anyway!
Marc
------------------------------
Date: Sun, 30 Mar 2014 04:41:20 -0700 (PDT)
From: Marc Girod <marc.girod@gmail.com>
Subject: Re: Google spreadsheets with WWW::Mechanize
Message-Id: <7031ddfc-5bda-44d2-abdd-6fe05ec65b9b@googlegroups.com>
On Sunday, 30 March 2014 11:01:03 UTC+1, Marc Girod wrote:
>... and even subscripting pseudo arrays (e.g. returned by getDataRange) escapes me now.
OK. Got that now.
Marc
------------------------------
Date: Sun, 30 Mar 2014 02:06:37 GMT
From: * Tong * <sun_tong_001@users.sourceforge.net>
Subject: The best approach to simplify/clean-up html code
Message-Id: <N_KZu.8299$a35.506@fx02.iad>
Basically a repost of the same request as
http://stackoverflow.com/questions/838512/automatic-html-simplifier-tool
because I don't like the C# solution as the answer. I prefer Perl.
I.e., I'm looking for a tool to simplify HTML markup as much as
possible. If there isn't one already out there, I don't mind rolling up my
sleeves and writing one. I did find one, HTML::Clean::Human --
<blockquote>
This is an HTML syntax filter/reformatter.
My initial temptation was to simply seek a solution such as html
to text. But then I realized html code may have links and other ephemera
that would be desirable to keep.
What I want it to get rid of; all the stupid html things such as
inline font declarations etc.
This code is useful if you edit html, but you have to do it maybe
from already existing html that some whacko wysiwyg junk spat out. Run it
through this and voila.
</blockquote>
However, UTSL reveals that it cannot simplify html mark up like this:
<a href="http://packages.debian.org/source/unstable/xclip"
style="color: rgb(7, 85, 215); text-decoration: none;
font-family: 'DejaVu Sans', 'Bitstream Vera Sans', sans-serif;
font-size: 16px; font-style: normal; font-variant: normal;
font-weight: normal; letter-spacing: normal; line-height: normal;
orphans: auto; text-align: left; text-indent: 0px;
text-transform: none; white-space: normal; widows: auto;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);"><span class="srcversion"
title="unstable">0.12</span></a>
So, what's the best approach to simplify/clean up HTML markup?
Are there any better ways than the approach used in
https://metacpan.org/source/LEOCHARRE/HTML-Clean-Human-1.07/lib/HTML/Clean/Human.pm?
PS. My ultimate goal is to come up with something that satisfies the
restrictions on HTML tags allowed on Stack Exchange sites --
http://meta.stackoverflow.com/questions/1777/what-html-tags-are-allowed-on-stack-exchange-sites
Thanks
------------------------------
Date: Sun, 30 Mar 2014 12:11:51 -0600
From: John Bokma <john@castleamber.com>
Subject: Re: The best approach to simplify/clean-up html code
Message-Id: <87ob0npndk.fsf@castleamber.com>
* Tong * <sun_tong_001@users.sourceforge.net> writes:
> My initial temptation was to simply seek a solution such as html
> to text. But then I realized html code may have links and other ephemera
> that would be desirable to keep.
How about HTML to Markdown, and then from Markdown back to HTML?
http://johnmacfarlane.net/pandoc/
Moreover, Pandoc can also generate JSON which can be post-processed by
Perl, and then converted to any format you would like.
http://johnmacfarlane.net/pandoc/scripting.html section JSON filters.
dirty HTML -> (pandoc) -> JSON -> (your Perl script ) -> JSON ->
(pandoc) -> clean HTML.
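
The middle step of that pipeline might be sketched as below. This is in
Python rather than Perl, purely for illustration, and it assumes the
JSON layout used by recent pandoc releases: every node is an object
with "t" (type) and "c" (contents), and Link, Span and Div carry an
attribute triple [id, classes, key-value pairs] as the first item of
"c". The helper name strip_attrs and the hand-written sample tree
(including its version number) are hypothetical.

```python
# Node types whose first content field is an attribute triple
# (assumption about the pandoc JSON layout, see above).
ATTR_FIRST = {"Link", "Span", "Div"}

def strip_attrs(node):
    """Return a copy of a pandoc JSON tree with Link/Span/Div attributes blanked."""
    if isinstance(node, list):
        return [strip_attrs(child) for child in node]
    if isinstance(node, dict):
        cleaned = {key: strip_attrs(value) for key, value in node.items()}
        if cleaned.get("t") in ATTR_FIRST and isinstance(cleaned.get("c"), list):
            # Replace [id, classes, key-values] with an empty triple.
            cleaned["c"] = [["", [], []]] + cleaned["c"][1:]
        return cleaned
    return node

# Hand-written stand-in for what "pandoc -f html -t json" might emit.
doc = {
    "pandoc-api-version": [1, 23],  # illustrative value only
    "meta": {},
    "blocks": [{"t": "Para", "c": [{"t": "Link", "c": [
        ["", [], [["style", "color: rgb(7, 85, 215)"]]],
        [{"t": "Str", "c": "0.12"}],
        ["http://packages.debian.org/source/unstable/xclip", ""],
    ]}]}],
}

cleaned = strip_attrs(doc)
link = cleaned["blocks"][0]["c"][0]
print(link["c"][0])
```

To use it as a real filter, one would read the tree with json.load
from standard input and write the result back with json.dump, so that
it can sit between the two pandoc invocations.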
--
John Bokma j3b
Blog: http://johnbokma.com/ Perl Consultancy: http://castleamber.com/
Perl for books: http://johnbokma.com/perl/help-in-exchange-for-books.html
------------------------------
Date: Sun, 30 Mar 2014 19:01:39 GMT
From: * Tong * <sun_tong_001@users.sourceforge.net>
Subject: Re: The best approach to simplify/clean-up html code
Message-Id: <nSZZu.17850$ft3.9859@fx24.iad>
On Sun, 30 Mar 2014 12:11:51 -0600, John Bokma wrote:
> How about HTML to Markdown, and then from Markdown back to HTML?
> http://johnmacfarlane.net/pandoc/
Thanks for your answer John,
No, pandoc is too big. I don't mind coding one myself; in fact, I'd rather.
I'm just not sure whether to use the regex hack, like the one in my OP,
or to go the formal route of parsing the HTML and then simplifying it.
Thanks all the same.
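
For comparison, the regex route could look roughly like the sketch
below. It is in Python rather than Perl, purely for illustration, and
it is not the approach HTML::Clean::Human takes; the pattern and the
function name strip_inline_styles are made up for this example. It
only handles quoted style="..." attributes and knows nothing about
HTML structure.

```python
import re

# Match a quoted style attribute, including values that span several lines.
STYLE_ATTR = re.compile(r"""\s+style\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE)

def strip_inline_styles(html):
    """Remove every quoted style="..." attribute found in the string."""
    return STYLE_ATTR.sub("", html)

sample = (
    '<a href="http://packages.debian.org/source/unstable/xclip"\n'
    'style="color: rgb(7, 85, 215);\n'
    "font-family: 'DejaVu Sans', sans-serif;\">0.12</a>"
)
print(strip_inline_styles(sample))
# prints: <a href="http://packages.debian.org/source/unstable/xclip">0.12</a>

# The weakness: the pattern cannot tell markup from text, so prose that
# merely mentions the attribute is altered too.
print(strip_inline_styles('<p>use style="x" here</p>'))
# prints: <p>use here</p>
```

The second call is the usual argument for the parsing route: a regex
sees only characters, so it cannot distinguish an attribute from text
that happens to look like one.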
------------------------------
Date: Mon, 31 Mar 2014 03:06:56 +0000 (UTC)
From: tmcd@panix.com (Tim McDaniel)
Subject: Re: The best approach to simplify/clean-up html code
Message-Id: <lham4g$cls$1@reader1.panix.com>
In article <N_KZu.8299$a35.506@fx02.iad>,
* Tong * <sun_tong_001@users.sourceforge.net> wrote:
>I.e., I'm looking for a tool to simplify HTML markup as much as
>possible. If there isn't one already out there, I don't mind rolling up my
>sleeves and writing one.
...
> My initial temptation was to simply seek a solution such as
>html to text. But then I realized html code may have links and other
>ephemera that would be desirable to keep.
>
> What I want it to get rid of; all the stupid html things such
>as inline font declarations etc.
I've not looked at CPAN HTML parser modules. Perhaps you might use
one, walk the tree, and either blacklist or whitelist elements and
attributes?
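
A minimal sketch of the whitelist idea follows. It is in Python rather
than Perl, purely for illustration, and it filters parser events as
they stream by instead of building and walking a full tree. The
ALLOWED table is a made-up example, not the Stack Exchange list, and
the names Whitelister and simplify are hypothetical.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> set of attributes to keep.
ALLOWED = {
    "a": {"href", "title"},
    "p": set(), "b": set(), "i": set(), "em": set(), "strong": set(),
    "code": set(), "pre": set(), "blockquote": set(),
    "ul": set(), "ol": set(), "li": set(),
}

class Whitelister(HTMLParser):
    """Re-emit only whitelisted tags and attributes; keep all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag itself, but its text content still passes through
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
            if name in ALLOWED[tag]
        )
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

def simplify(html):
    parser = Whitelister()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

sample = (
    '<a href="http://packages.debian.org/source/unstable/xclip" '
    'style="color: rgb(7, 85, 215); text-decoration: none;">'
    '<span class="srcversion" title="unstable">0.12</span></a>'
)
print(simplify(sample))
# prints: <a href="http://packages.debian.org/source/unstable/xclip">0.12</a>
```

Note that this keeps the text inside every dropped element, so the
contents of script or style elements would need extra handling. On the
Perl side, a parser module from CPAN could presumably play the role
that html.parser plays here.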
--
Tim McDaniel, tmcd@panix.com
------------------------------
Date: 6 Apr 2001 21:33:47 GMT (Last modified)
From: Perl-Users-Request@ruby.oce.orst.edu (Perl-Users-Digest Admin)
Subject: Digest Administrivia (Last modified: 6 Apr 01)
Message-Id: <null>
Administrivia:
To submit articles to comp.lang.perl.announce, send your article to
clpa@perl.com.
Back issues are available via anonymous ftp from
ftp://cil-www.oce.orst.edu/pub/perl/old-digests.
#For other requests pertaining to the digest, send mail to
#perl-users-request@ruby.oce.orst.edu. Do not waste your time or mine
#sending perl questions to the -request address, I don't have time to
#answer them even if I did know the answer.
------------------------------
End of Perl-Users Digest V11 Issue 4183
***************************************