[7893] in cryptography@c2.net mail archive
Re: Lowercase compresses better?
daemon@ATHENA.MIT.EDU (Matt Blaze)
Fri Sep 29 18:54:33 2000
Message-Id: <200009292047.QAA04928@fbi.crypto.com>
To: rsalz@CaveoSystems.com
Cc: cryptography@c2.net
In-Reply-To: Message from rsalz@CaveoSystems.com
of "Fri, 29 Sep 2000 14:29:50 EDT." <200009291829.OAA22268@os390.caveosystems.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Fri, 29 Sep 2000 16:47:32 -0400
From: Matt Blaze <mab@research.att.com>
> In reading
> http://apachetoday.com/news_story.php3?ltsn=2000-09-27-001-01-OP-CY-LF
>
> I came across the following guideline for writing Apache documentation:
> HTML tags should be lowercase wherever possible. In other
> words, '<a href="foo.html">Link</a>' is preferred over
> '<A HREF="foo.html">Link</A>'. This is because lowercase
> letters result in more efficient space savings when documents
> are compressed.
>
> I'm trying to figure out how this could be true.
> /r$
>
While I can't imagine that there's anything special about lowercase
per se, I can certainly imagine a compression scheme giving a more favorable
encoding to characters in the dominant case of the overall document.
Certainly a monocase document has less information in it than a mixed case
one.
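
To make the "less information" point concrete, here is a rough sketch that measures the zero-order Shannon entropy of a string before and after folding it to lowercase. The sample string and the helper name entropy_bits_per_char are made up for illustration; they are not from the Apache guideline or from any real document. Zero-order entropy only looks at character frequencies, not repeated strings, so treat it as a crude proxy for what a real compressor sees.

```python
import math
from collections import Counter


def entropy_bits_per_char(text: str) -> float:
    """Zero-order Shannon entropy of the character distribution, in bits/char."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical sample mixing uppercase and lowercase tags (an assumption,
# chosen only so that both cases of the same letters occur).
sample = '<A HREF="foo.html">Link</A> and <a href="bar.html">link</a>'

mixed = entropy_bits_per_char(sample)
lower = entropy_bits_per_char(sample.lower())

print(f"mixed case: {mixed:.3f} bits/char")
print(f"lowercase:  {lower:.3f} bits/char")
```

Folding case merges symbols, and merging symbols can never raise this entropy; it lowers it whenever both cases of some letter actually occur.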
This seems to hold in practice as well as in theory (at least for
GNU zip):
crypto$ ls -ld xyzzy.txt
-rw-rw-r-- 1 mab mab 46552 Sep 29 16:39 xyzzy.txt
crypto$ tr A-Z a-z < xyzzy.txt > xyzzy.lower.txt
crypto$ gzip xyzzy.txt
crypto$ gzip xyzzy.lower.txt
crypto$ ls -ld xyzzy*
-rw-rw-r-- 1 mab mab 13451 Sep 29 16:40 xyzzy.lower.txt.gz
-rw-rw-r-- 1 mab mab 14171 Sep 29 16:39 xyzzy.txt.gz
crypto$
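
For anyone without xyzzy.txt handy, the same kind of comparison can be sketched in Python with the standard zlib module, which uses the same DEFLATE algorithm as gzip (the container differs, so sizes will not match gzip output byte for byte). The sample document is synthetic and the names make_sample and deflate_size are hypothetical helpers; this is an illustration under those assumptions, not a reproduction of the numbers above.

```python
import random
import zlib


def deflate_size(text: str) -> int:
    """Size in bytes after zlib (DEFLATE) compression at the maximum level."""
    return len(zlib.compress(text.encode("ascii"), 9))


def make_sample(lines: int = 300, seed: int = 0) -> str:
    """Synthetic HTML-like text whose tags are upper- or lowercase at random."""
    rng = random.Random(seed)
    out = []
    for i in range(lines):
        if rng.random() < 0.5:
            out.append(f'<A HREF="page{i}.html">Link {i}</A>')
        else:
            out.append(f'<a href="page{i}.html">Link {i}</a>')
    return "\n".join(out) + "\n"


mixed_doc = make_sample()
lower_doc = mixed_doc.lower()  # same effect as: tr A-Z a-z

mixed_size = deflate_size(mixed_doc)
lower_size = deflate_size(lower_doc)

print(f"uncompressed:       {len(mixed_doc)} bytes")
print(f"mixed case, zlib:   {mixed_size} bytes")
print(f"lowercase, zlib:    {lower_size} bytes")
```

Note that this folds the whole document, as the tr command does, rather than only the tags; the saving comes from the text becoming monocase, not from lowercase as such.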
-matt