[7893] in cryptography@c2.net mail archive


Re: Lowercase compresses better?

daemon@ATHENA.MIT.EDU (Matt Blaze)
Fri Sep 29 18:54:33 2000

Message-Id: <200009292047.QAA04928@fbi.crypto.com>
To: rsalz@CaveoSystems.com
Cc: cryptography@c2.net
In-Reply-To: Message from rsalz@CaveoSystems.com 
   of "Fri, 29 Sep 2000 14:29:50 EDT." <200009291829.OAA22268@os390.caveosystems.com> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Fri, 29 Sep 2000 16:47:32 -0400
From: Matt Blaze <mab@research.att.com>

> In reading
>     http://apachetoday.com/news_story.php3?ltsn=2000-09-27-001-01-OP-CY-LF
> 
> I came across the following guideline for writing Apache documentation:
>     HTML tags should be lowercase wherever possible. In other
>     words, '<a href="foo.html">Link</a>' is preferred over
>     '<A HREF="foo.html">Link</A>'. This is because lowercase
>     letters result in more efficient space savings when documents
>     are compressed.
> 
> I'm trying to figure out how this could be true.
> 	/r$
> 

While I can't imagine that there's anything special about lowercase
per se, I can certainly imagine a compression scheme giving a shorter
encoding to characters in the dominant case of the overall document.
A monocase document also has less information in it than the same
document in mixed case, since the case of each letter no longer has
to be encoded.
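
One rough way to check the information argument is to compare the
empirical per-character entropy of a text before and after folding it
to one case. The sketch below assumes Python 3; the HTML sample and
the helper name char_entropy are made up for illustration. Zero-order
entropy is only a crude proxy for what a real compressor achieves, but
merging 'A' with 'a' can never raise it.

```python
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Empirical zero-order entropy of text, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

# Hypothetical sample mixing upper- and lowercase tags.
sample = '<A HREF="foo.html">Link</A> <a href="bar.html">Other</a>\n' * 20

mixed = char_entropy(sample)
lower = char_entropy(sample.lower())
print(f"mixed case: {mixed:.3f} bits/char")
print(f"lowercase:  {lower:.3f} bits/char")
```

The lowercase figure comes out smaller because the folded text draws
on fewer distinct symbols.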

This seems to be true in practice as well as in theory (at least for
GNU zip, where lowercasing saves about 5% on this file):

crypto$ ls -ld xyzzy.txt
-rw-rw-r--   1 mab      mab         46552 Sep 29 16:39 xyzzy.txt
crypto$ tr A-Z a-z < xyzzy.txt > xyzzy.lower.txt
crypto$ gzip xyzzy.txt
crypto$ gzip xyzzy.lower.txt 
crypto$ ls -ld xyzzy*
-rw-rw-r--   1 mab      mab         13451 Sep 29 16:40 xyzzy.lower.txt.gz
-rw-rw-r--   1 mab      mab         14171 Sep 29 16:39 xyzzy.txt.gz
crypto$ 
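
The same comparison can be sketched without a test file. The example
below assumes Python 3 and uses zlib, which implements the DEFLATE
algorithm that gzip also uses (minus the gzip file header). The text
is synthetic: a fixed sentence repeated, with about half the words
capitalised at random from a fixed seed. The helper name deflated_size
is made up for illustration. The random capitalisation exaggerates the
effect, so real documents should show a much smaller gap, like the one
in the transcript above.

```python
import random
import zlib

def deflated_size(text: str) -> int:
    """Size in bytes of text after DEFLATE compression at level 9."""
    return len(zlib.compress(text.encode("ascii"), 9))

# Synthetic sample: fixed seed so the run is repeatable.
rng = random.Random(0)
words = "the quick brown fox jumps over the lazy dog".split()
sample = " ".join(
    w.capitalize() if rng.random() < 0.5 else w
    for w in words * 200
)

mixed_size = deflated_size(sample)
lower_size = deflated_size(sample.lower())
print(f"original:   {len(sample)} bytes")
print(f"mixed case: {mixed_size} bytes compressed")
print(f"lowercase:  {lower_size} bytes compressed")
```

Each random capitalisation is a bit of information the compressor has
to carry, so the mixed-case version cannot shrink as far as the
lowercased one.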

-matt



