data compression

updated 2009-04-29.


"It is my ambition to say in ten sentences what others say in a whole book." -- Friedrich Nietzsche

"Short words are best and the old words when short are best of all." -- Winston Churchill

The most valuable talent is that of never using two words when one will do. -- Thomas Jefferson

If you would be pungent, be brief; for it is with words as with sunbeams. The more they are condensed, the deeper they burn. - Robert Southey

"In the multitude of words there wanteth not sin: but he that refraineth his lips is wise." -- Proverbs 10:19

data compression, in general

There are several different orthogonal ways to categorize data compression:

Most people have assumed that the decompressor has access to the complete compressed data file from the beginning. However, DAV has been thinking about ways to synchronize a decoder in the middle of a data stream -- for example, immediately after you turn on your TV, you want to be able to start watching a compressed digital video signal almost immediately. This is similar to ``medium-latency'', except it requires the TV to be able to decode the last part of the TV program even if it completely missed out on the first part. Some fractal compression techniques have this property ... I've been calling this ``history-limited decompression''.

introductions to data compression: beginner tutorials

More links to data compression, in general (Other pages that, like this one, have lots of links to compression information):

2D Image Compression

specifically designed to compress 2D images.

The 2 hottest topics in 2D image compression (circa 1999, when DAV spent a lot of time in this area) are "wavelets" and "fractals" (which really have many similarities).

The goal here is to try to improve on computer_graphics_tools.html#png .

other 2D image compression

fractal image compression

see also computer_graphics_tools.html#fractals for fractals in general (as artistic and mathematical objects).

(FIXME: move to http://en.wikipedia.org/wiki/Fractal_Compression )

[FIXME: write fractal image compression with extra ``erode'' and ``dilate'' parameters]

wavelet image compression

Introductions to Wavelets

Wavelets applied to Image Compression

color image compression

Color coordinate systems and conversion between them.

See also clustering and color quantization machine_vision.html#cluster .

I'm most interested in simple lossless stuff.

1D data compression

Specifically designed to compress 1D streams of symbols, such as English text. 2D images are often rearranged into a linear list of pixels (e.g., by scanning one row at a time, or walking a Hilbert path), then compressed with one of these tools.

Some of these (such as Huffman) don't even take advantage of the 1D correlation between adjacent symbols; they only take advantage of the statistics of the entire file taken as an unordered set. Such an algorithm would compress any file with identical statistics (for example, the same file with its items shuffled into an arbitrary order) just as well.

Others of these (LZ77 and descendants) completely ignore the unordered statistics, and simply copy repeated phrases.

Theory:

Source code available at:

Huffman

[FIXME: Not all my Huffman links are here; some of the 1D_compression links describe several compression ideas including Huffman compression. Perhaps I should copy them here and directly link to their Huffman page].

Reverse Huffman coding

David Cary developed "reverse Huffman coding" to minimize transmission cost when different symbols have different costs.

(Other people, including Peter Wayner, have independently discovered/used reverse Huffman for other purposes)

This requires a Huffman compression algorithm such that decompressing *any* "compressed" file, then compressing that "plaintext", regenerates exactly the same "compressed" file. See David Scott http://bijective.dogma.net/ for another application that requires this same property.

Say you have Q unique symbols that can be transmitted, each one x with cost c(x). For example, you have a battery-powered device that turns an LED off and on, where each 9-bit symbol always starts with the LED on for one bit time, and the total cost is the battery drain, which only occurs while the LED is turned on. In other situations, the "cost" may also involve the transmission time.

Theorem: To minimize transmission cost, you want to equalize the cost per bit of information transmitted, for each symbol.

Proof: [FIXME]

We assume that the data to be transmitted has already been compressed into a bitstream, such that no matter what values the previous bits have been, the probability that the next bit is a 1 is always approximately 1/2.

The cost can be any arbitrary function of the symbol x. This would commonly include some combination of

If every one of the unique symbols can be transmitted in the same amount of time, and the "time to transmit" dominates the cost, then the minimum cost occurs when all symbols are transmitted (approximately) equally often.

Using Arithmetic coding isn't much better than Huffman at equalizing cost/bit when there are lots of symbols, but it can be far superior when there are only 2 or 3 symbols.

In some cases, it may be a good idea to concatenate "simple" symbols to make a "big" symbol. For example, if you have 3 "simple" symbols to work with, you might get better cost/bit if you concatenate 3 of them to create a symbol set of 27 "big" symbols, and then feed the cost of each of those 27 symbols into the reverse Huffman algorithm. Concatenation also simplifies convoluted transmission constraints such as "when transmitting these -1, 0, +1 values, the average must always be 0".
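As a concrete illustration, here is a minimal Python sketch (the per-symbol costs are hypothetical, picked only for illustration) that builds the 27 "big" symbols from the three -1, 0, +1 "simple" symbols and keeps only the big symbols that respect an "average must always be 0" constraint; the resulting cost table is what would be fed into the reverse Huffman algorithm:

  from itertools import product

  # hypothetical costs for the 3 "simple" symbols -1, 0, +1
  simple_cost = {-1: 1.0, 0: 0.5, +1: 1.0}

  # concatenate 3 simple symbols into one "big" symbol;
  # the cost of a big symbol is the sum of the costs of its parts
  big_cost = {triple: sum(simple_cost[s] for s in triple)
              for triple in product(simple_cost, repeat=3)}
  assert len(big_cost) == 27

  # one simple way to respect an "average must always be 0" constraint:
  # keep only the big symbols whose parts sum to zero
  balanced = {t: c for t, c in big_cost.items() if sum(t) == 0}

  # big_cost (or balanced) is what gets fed to the reverse Huffman algorithm
  print(len(balanced), "of 27 big symbols satisfy the constraint")   # 7 of 27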

Simple example: We have an LED transmitter / photodiode receiver pair that transmit 4-bit symbols to each other. Every symbol is required to have a minimum of 1 "on" bit to maintain synchronization between transmitter and receiver, so that leaves 15 valid symbols.

The cost for this application is entirely the battery drain when the LEDs are on, a constant 1 unit for each time period. We neglect the time needed to transmit information and the battery power used by the transmitter and receiver while the LED is off.

  LED pattern   cost   encoded bits
  0000          0      (never transmitted)
  0001          1      001
  0010          1      01
  0011          2      000001
  0100          1      10
  0101          2      000010
  0110          2      000011
  0111          3      000000000
  1000          1      11
  1001          2      000100
  1010          2      000101
  1011          3      00000001
  1100          2      00011
  1101          3      00000010
  1110          3      00000011
  1111          4      000000001

Without reverse Huffman transmission, you might transmit each symbol equally often, so each symbol encodes approximately log2(15) ~= 3.9 bits of information (using bit stuffing), at an average cost of (average cost)/(average bits) = (32/15)/(log2(15)) = 0.546 cost/bit.

Or you might transmit only 3 bits per symbol, using the 8 lowest-power symbols, to get a slightly better ratio of (average cost)/(average bits) = (12/8)/(3) = 0.500 cost/bit.

But the optimum cost/bit comes from using reverse Huffman transmission. Then the probability of each symbol being transmitted becomes [ 64 128 8 128 8 8 1 128 8 8 2 16 2 2 1]/512. The length in bits for each symbol is [ 3 2 6 2 6 6 9 2 6 6 8 5 8 8 9]. So the average(cost/bit) = sum( probability(i).*cost(i)./bit(i) ) = 0.4611 cost/bit.

We can see how well this choice of bit patterns fits our goal of equal cost/bit by calculating cost ./ bit = [0.3333 0.5000 0.3333 0.5000 0.3333 0.3333 0.3333 0.5000 0.3333 0.3333 0.3750 0.4000 0.3750 0.3750 0.4444] = [120 180 120 180 120 120 120 180 120 120 135 144 135 135 160]./360 .

Well, it's a lot closer to a constant cost/bit than the simpler strategies above.
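Here is a minimal Python sketch (using the cost and code-length tables above, and assuming each symbol's probability is 2^-length, which is what feeding an incompressible bitstream through the Huffman decoder gives) that reproduces the 0.546, 0.500, and 0.4611 figures:

  import math

  # cost = battery drain (number of "on" bits) of each valid symbol 0001..1111
  cost = [1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4]
  # bits = length of the reverse-Huffman code assigned to each symbol (table above)
  bits = [3, 2, 6, 2, 6, 6, 9, 2, 6, 6, 8, 5, 8, 8, 9]

  # decoding an incompressible bitstream makes each symbol appear
  # with probability 2**-bits; check that the code is a complete prefix code
  prob = [2.0 ** -b for b in bits]
  assert abs(sum(prob) - 1.0) < 1e-12

  equal = (sum(cost) / 15) / math.log2(15)    # all 15 symbols equally often
  low8  = (12 / 8) / 3                        # only the 8 cheapest symbols
  rev   = sum(p * c / b for p, c, b in zip(prob, cost, bits))

  print(round(equal, 3), round(low8, 3), round(rev, 4))   # 0.546 0.5 0.4611
  # per-symbol cost/bit, cf. the [0.3333 0.5000 ...] vector above:
  # [round(c / b, 4) for c, b in zip(cost, bits)]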

Open questions: Is it possible to have a "reverse adaptive Huffman" adapt to changing line conditions ?

[FIXME: not a good explanation of end effects. See bijection ] End effects: The above neglects what happens at the end of a transmission. What if you are nearing the end of transmitting a message, and you find that the remaining bits to be transmitted are "1100" ? Obviously you find that you need to transmit a "1000" pattern (interpreted as "11"); then what could you send that would be interpreted as "00" ? One simple protocol would be to append a virtually infinite string of zeros after the message; this protocol would then transmit a "0111" pattern (interpreted as "000000000") to transmit the remaining "00" data bits and a bunch of bogus "0" bits. Hopefully the receiver would properly ignore the extra 0 bits. Perhaps if the message to be transmitted ended in zeros, the transmitter could halt transmissions after the last 1 bit was transmitted, and the receiver would somehow fill in the missing 0 bits. (Appending a virtually infinite string of ones after the message could also be interesting). Is it possible for the receiver to truncate extra zeros (or even fill in missing zeros) even if it doesn't know the exact length of the original text ? Maybe if it knows the original text was an exact multiple of 8 bits in length ... no, that wouldn't work, because in the special case where we lacked 1 bit to make up a full byte, and the final transmitted symbol was "0111", then the 9 "0" bits that it decodes into would be ambiguous as to whether they were intended as the final "0" bit of the message, or whether the message ended with another full byte of zeros. (Appending a virtually infinite string of ones looks like it would work in this case ... will it always work ?)

What is the maximum height of a Huffman tree ?

from a thread at http://www.deja.com/[ST_rn=md]/threadmsg_md.xp?AN=487814206 and more info at http://www.compressconsult.com/huffman/#maxlength

As you can see from the above, if you have 2^15 symbols (or less) in your data file, you can refer to any one of them using a 15 bit offset from the start of the file. I was incorrect in concluding that therefore the maximum code length would be 15 bits, since Huffman *inserts* lots of symbols (in that mythical sorted file), so the Huffman offset may be much longer than 15 bits.

Alternatively, you can lose a *tiny* amount of compression by using a code table that is slightly less than optimal, to force code lengths to a maximum of 15 bits.

Q: What is the maximum height
of a Huffman tree with an alphabet of 8 bit (256 different characters)?
A: The worst case height is 255.

The worst case is, of course, exceedingly improbable in practice.

If you have control of the Huffman code assignment (which you probably
do as an encoder), then you can set an a-priori limit on the code
length.  This means that the lowest-probability symbols will have
slightly-less-than-optimal code assignments, but since they occur
infrequently by definition, the loss in compression ratio is tiny.
The resulting code simplifications may be well worth that price.

The JPEG standard, for example, requires that no code be longer than
16 bits.  (Since this is a requirement of the standard, both encoders
and decoders are able to take shortcuts that assume this is the
maximum code length.)  The standard offers a simple approximate
algorithm for adjusting a true Huffman code tree to meet this maximum
depth.
                        regards, tom lane
                        organizer, Independent JPEG Group


but,

From: "DI Michael Schindler" 
Subject: Re: huffman code length
Date: 10 Jun 1999 00:00:00 GMT
Organization: Compression Consulting
Newsgroups: comp.compression

hi Tom,

you are correct.

Two things you did not mention:
JPEG uses a very small alphabet; in the case here a maximum codelength
of 8 bit is not practical (we have a 256 symbol alphabet here!)

To estimate the loss, the factor by which the number of bits for the longest code
is greater than the number of bits needed for simply storing symbols
is a good measure.


The original poster also wanted to know codetable size for a
*canonical* code:
If you omit the leading zeros/ones (which one depends on whether the
shortest code is all zeros or ones) you can store codelength+trailing bits
pairs in memory. The number of trailing bits needed is limited by
log2(alphabetsize); see my huffman coding webpage
http://www.compressconsult.com/huffman/ for details.

Michael
From: "DI Michael Schindler" 
Subject: Re: huffman code length
Date: 16 Jul 1999 00:00:00 GMT
Organization: Compression Consulting
Newsgroups: comp.compression,alt.comp.compression,sci.crypt,sci.math

hi!

the formula below is right only if you take the longest subtrees
in case of equality of weights. If you prefer shorter trees
the # of samples raises faster.
The reason is that in the case of this worst-case sample
distribution you add just about half the number of samples you
already have but need an additional bit.

Note the codelength is not important; important is the part of
the code that is nonconstant for coding and decoding issues.
This can be limited to ceil(log2(ALPHABETSIZE)) bits when using a
canonical code (which you should for decoding speed reasons).

Michael

For the "standard" code (where we prefer shorter trees by merging the least-recently-merged tree when there is equality of weights), the worst-case (pathological) codelength is still (ALPHABETSIZE-1) bits. The pathological case for 7 symbols (a tree with depth 6) is ((((((1 1) 1) 3) 4) 7) 11)

    length of file (# of samples)
               maximum codelength
     0..1       0
     2          1
     3..4       2
     5..8       3
     9..14      4
    15..24      5
    25..40      6
    41..66      7
    67..108     8
   109..176     9
   177..286    10
   287..464    11
   465..752    12
   753..1218   13
  1219..1972   14
  1973..3192   15
  3193..5166   16
  5167..8360   17
The width of each range is 1 more than the lower end of the previous range.
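As a cross-check, here is a small Python sketch (my own illustration, using an "older nodes first" tie-break, which prefers shorter trees) that builds a Huffman tree and reports its maximum depth; fed the pathological frequencies quoted above, it reports depth 6 for 7 symbols:

  import heapq

  def huffman_max_depth(freqs):
      """Build a Huffman tree from symbol frequencies; return its maximum depth."""
      # heap entries are (weight, age, depth-of-deepest-leaf-in-subtree);
      # the "age" tie-break merges older nodes first when weights are equal
      heap = [(w, i, 0) for i, w in enumerate(freqs)]
      heapq.heapify(heap)
      age = len(freqs)
      while len(heap) > 1:
          w1, _, d1 = heapq.heappop(heap)
          w2, _, d2 = heapq.heappop(heap)
          heapq.heappush(heap, (w1 + w2, age, max(d1, d2) + 1))
          age += 1
      return heap[0][2]

  # the pathological 7-symbol case quoted above: depth 6
  print(huffman_max_depth([1, 1, 1, 3, 4, 7, 11]))        # -> 6
  # 2**15 equally-likely symbols give a balanced tree of depth 15
  print(huffman_max_depth([1] * (2 ** 15)))                # -> 15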


I already knew about preferring shorter trees (easy to do by preferring
the least-frequently-merged nodes when building the tree during
compression), but somehow I neglected to take that into account when
building that table. Thanks for pointing out that flaw. Using
standard (prefer shorter trees) encoding,

    length of file (# of samples)
               maximum code length (standard encoding)
     0..1      0
     2         1
     3..4      2
     5..8      3
     9..14     4
    15..24     5
    25..40     6
    41..66     7
    67..108    8
     ...
  1973..3192  15
  3193..5166  16
  5167..8360  17

The width of each range is the lower end of the previous range.
(Did I get it right this time ?)

Unfortunately, this doesn't seem to make a significant difference (5166
characters vs. 4180 characters).

Using standard (prefer shorter trees) encoding,
the worst-case (pathological) case for 8 unique symbols is a depth of 7,
  (((((((1+1)+1)+3)+4)+7)+11)+18) = 46 symbols
which is almost, but not quite, the standard Fibonacci sequence.
Each frequency is the sum of the 2 previous frequencies,
but this sequence starts out "1 1 1 3" rather than "1 1".

But a depth of 7 can occur with only 41 symbols,
albeit with 9 unique symbols:
  (((((((1+1)+(1+1))+1)+4)+6)+10)+16) = 41
This is even less like the standard Fibonacci sequence.
Each frequency is the sum of the 2 previous frequencies,
but this sequence starts out
"1 1 1 1 1 4 6" rather than "1 1".

Thanks for telling me that, after the initial zeros, even the worst-
case canonical codes are relatively short. I didn't know that. I can
see that that would definitely simplify Huffman compressors and
decompressors. If I have time later, I will see if that really makes my
software (which already uses canonical codes) go any faster (or slower).

"DI Michael Schindler" 
mentioned:
> the formula below is right only if you take the longest subtrees
> in case of equality of weights. If you prefer shorter trees
> the # of samples raises faster.
> The reason is that in the case of this worst-case sample
> distribution you add just about half the number of samples you
> already have but need an additional bit.
>
> Note the codelength is not important; important is the part of
> the code that is nonconstant for coding and decoding issues.
> This can be limited to ceil(log2(ALPHABETSIZE)) bits when using a
> canonical code (which you should for decoding speed reasons).

...
> d_cary at my-deja.com wrote
...
> > the Fibonacci sequence of frequencies
> > gives the maximum possible code length.
> >
> >     length of file (# of samples)
> >                maximum code length
> >          2      1
> >      3...4      2
> >      5...7      3
> >      8...12     4
> >     13...20     5
> >     21...33     6
> >   ......      ...
> >   1597...2583  15
> >   2584...4180  16
> >   4181...      17

obsolete:

Date:  Mon, 31 May 1999 01:30:46 GMT
From:  d_cary 
Subject:  Re: Frequency distribution of Huffman encoding tree

Mok-Kong Shen  wrote:
...
> My example given for a balanced tree of 4 symbols appears to
> contradict your formula. (Your range is [0.25, 0.5]).
>
> M. K. Shen

You're right. The formula I posted was wrong. I would really like to
know the correct formula (and maybe even a URI or paper reference).

Maybe the correct formula is
    2^(-Hi-1) <= fi <= 2^(1-Hi)
where
    fi = ni/L = number of times "i" occurs / total length of file
    Hi = number of bits assigned by Huffman to the "i" code.
?

No, that's not right either. For the special case of only 2 symbols,
Huffman always assigns 1 bit per symbol, unless the probability of one
of the symbols is identically zero (and the other is identically 1). So
for the case of 2 symbols, one cannot find a tighter bound than
  0 < fi < 1.

I'm pretty sure a tighter bound can be found if one has more than 2
symbols with non-zero probability.

So the formula must depend not only on the frequency that a symbol
occurs, but also on how many symbols there are.


LZ77 and other copy-repeat algorithms

[FIXME: todo: write some code to try out the speculative ideas] [post a summary to the data compression wiki, leaving out the speculative ideas]

2009-01-06:DAV: ideas for short data compression: ... a variant of LZRW: rather than a fixed size for all files, perhaps allow the compressor to adapt the "offset" and "length" field sizes for the particular file to be compressed. When compressing short files, it's important to be able to compress byte pairs, which repeat far more often than byte triplets. (But I can imagine byte pairs being so rare when compressing large files that it's not worth dealing with them)

LZRW1-A and LZJB both use (in effect)

Contrary to what both Williams and Bonwick claim, 2 byte copies actually *would* save space (albeit only 1 bit each) -- 2 literals take 18 bits, but a 2 byte copy takes 17 bits.

Perhaps the tradeoff between "length" and "offset" should be made on a file-by-file basis.

Or perhaps go even further, and make "length" and "offset" independently adjustable on a file by file basis. This slows down the decompressor (because the "items" in the compressed file are no longer byte aligned, although we maintain byte alignment in the decompressed data). And it slows down the compressor a lot, because it must search for the optimum "offset" and "length" field sizes.

Does it make sense to somehow make the compressed phrase merely 8 bits, so even a 1 byte copy might take less space than a 1 byte literal (9 bits)?

Rather than the compressor attempting all possible variations of "offset" and "length", is there any way to more quickly calculate the optimum size?

... also: perhaps mash in a "universal code" (variable length) for either the offset or the length or both ... ... also: perhaps the complex "escape code" system used in pucrunch in order to reduce the expansion of completely random data ...

For example, using 10 bits for literals, and 9 bits for compressed phrases and for the most common single bytes (not the most common bytes in the source; the most common bytes that would otherwise be encoded as 10 bit literals). (Using 1 or 2 bits in the "control byte", and a single byte for the remainder, would keep the compressed code byte aligned.)

DAV's example for 9 bit compressed phrases:

compressed file:
  * special ID code
  * other metadata
  * # of "common single bytes" used
  * table of "most common single bytes"
  * # of lengths used
  * table of "most common lengths"
  * #bits used in offset field
  * zero or more items
  * CRC

Each item is either a literal item, a "substitute single byte" item, or a copy item. If the length field is, for example, 3 bits, then

  000: copy 1 byte
  001: copy 2 bytes
  010: copy 3 bytes
  011: copy 4 bytes
  100: if offset is one of the L special values, use a substitute literal. Otherwise, copy 6 bytes.
  101: copy 8 bytes.
  11x: x and the following 7 bits are a literal byte.

The "substitute single byte" only makes sense when the offset field is short enough that a "substitute single byte" is shorter than a literal item.

A single byte copy can be shorter than a literal -- for example, when the "length" is 3 bits and the "offset" is 5 bits (8 bits total), but a literal is 9 bits. (I'm not sure this happens often enough to actually save any bits. With short source texts, the overhead of the table may make it not worthwhile. With long source texts, where the overhead of the table is negligible, using those codes for more "copy" options may give a shorter compressed file than using those codes for "substitute bytes".)

To extend the range of those 5 bits of offset, the offset is not a literal offset in bytes, but a context offset -- given the previous byte, the Nth time the previous byte occurred.
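To make the item format above concrete, here is a minimal Python decoder sketch (my own illustration; it ignores the "substitute single byte" special values and the context-offset refinement, takes x as the most-significant bit of a literal, treats the offset as a plain distance back into the output, and assumes the decompressed size is known or the stream ends exactly on an item boundary). The default offset_bits=6 matches the "9 bits for compressed phrases" example above:

  class BitReader:
      def __init__(self, data):
          self.bits = "".join(format(b, "08b") for b in data)
          self.pos = 0
      def read(self, n):                        # read n bits, MSB first
          v = int(self.bits[self.pos:self.pos + n], 2)
          self.pos += n
          return v
      def left(self):
          return len(self.bits) - self.pos

  # length-field opcodes from the sketch above (opcode 100 is treated as a
  # plain "copy 6 bytes" here, since the substitute-literal case is skipped)
  COPY_LEN = {0b000: 1, 0b001: 2, 0b010: 3, 0b011: 4, 0b100: 6, 0b101: 8}

  def decode_items(data, offset_bits=6, out_size=None):
      r, out = BitReader(data), bytearray()
      while r.left() >= 3 and (out_size is None or len(out) < out_size):
          op = r.read(3)
          if op >= 0b110:                       # 11x: x plus the next 7 bits form a literal byte
              out.append(((op & 1) << 7) | r.read(7))
          else:                                 # copy item: 3-bit length code + offset field
              length, dist = COPY_LEN[op], r.read(offset_bits)
              for _ in range(length):           # byte at a time, so copies may overlap
                  out.append(out[-dist])        # dist is assumed to be a distance >= 1
      return bytes(out)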

Unlike Huffman compression of symbols emitted by a memoryless source, there are several valid ways to choose the "length" fields. For example, say a block 301 bytes long is exactly repeated: it can be broken up into copies of the available lengths in many different ways.

Various ways to optimize "continue copying from where we left off". Such a thing is common in English text. Some common cases include:

If we use special "extra long" lengths, the obvious thing is to make them powers of 2 ... but is that really optimum? Perhaps some other lengths would be better. For example, rather than 2^n (natural binary), use (2^n)-1 instead (Booth encoding). In particular, I think I would like to try the lengths used in Combsort11 : 1, 2, 3, 4, 6, 8, 11, ... with a shrink factor of around 1.24 ... 1.5 .

For fast decoding, look up lengths in a table or use an easily-calculated function (see the sketch below):
  * f(x) = (1<<x) ("obvious powers of 2"): 1, 2, 4, 8, 16, 32, ...
  * f(x) = floor(x*x / 4) (squaring, rather than the exponential I want ... but still better than linear): 0, 0, 1, 2, 4, 6, 9, 12, 16, 20, 25, 30, 36, ...
  * f(x) = floor(x*x / 8) (squaring, rather than the exponential I want ... but still better than linear): 0, 0, 0, 1, 2, 3, 4, 6, 8, 10, 12, 15, 18, ...
  * f(x) = ( 2 + (1&x) ) << (x>>1) (approximately powers of sqrt 2): 2, 3, 4, 6, 8, 12, 16, 24, 32 ...

If I hit a phrase that isn't exactly the same length as the available lengths, then I break it up into pieces, starting with the next-shortest phrase length and working down. But should I emit those pieces longest-to-shortest or shortest-to-longest? For example, should I break a phrase of length 10 up into 8, 2 or 2, 8 ? And if we're not doing powers-of-two, there seem to be several different ways to break them up: a phrase of length 10 could also be broken into 6, 4. Of course, this is all moot if we use a variable-length "universal code" for the length.
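Here is a quick Python sketch that just tabulates the candidate length functions listed above, so their growth rates can be compared side by side:

  length_functions = {
      "1 << x           ": lambda x: 1 << x,                      # obvious powers of 2
      "floor(x*x / 4)    ": lambda x: (x * x) // 4,               # quadratic
      "floor(x*x / 8)    ": lambda x: (x * x) // 8,               # slower quadratic
      "(2+(1&x))<<(x>>1) ": lambda x: (2 + (1 & x)) << (x >> 1),  # ~ powers of sqrt(2)
  }

  for name, f in length_functions.items():
      print(name, [f(x) for x in range(13)])
  # 1 << x            [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
  # floor(x*x / 4)     [0, 0, 1, 2, 4, 6, 9, 12, 16, 20, 25, 30, 36]
  # floor(x*x / 8)     [0, 0, 0, 1, 2, 3, 4, 6, 8, 10, 12, 15, 18]
  # (2+(1&x))<<(x>>1)  [2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 128]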

Benchmark Images; Benchmark Files

Lots of standard test images and standard test text files, plus comparisons of various compression programs.

short data compression

Many data compression algorithms (1D and 2D) are designed to work with large amounts of data, and compress them all-or-nothing.

These actually *expand* the data (by a few bytes) if one tries to use them for "short" snippets of text or other data less than a few hundred bytes long.

If one has a large database, those data compression algorithms can give excellent compression, but it takes a very long time to decompress the database and find something in it.

Here I've listed algorithms that are a bit less efficient on "large" blocks of data, but are appropriate for "small" amounts of data (or for random-access indexing into large databases with small amounts of data at each entry). Perhaps you could think of them as 0D.

One might think that low-latency algorithms must all be 0D, operating on small blocks of data, but [elsewhere] there are many "single-pass" and "adaptive" algorithms with low latency (such as LZW compression, adaptive Huffman compression, etc.).

[FIXME: split out into separate section] I also have some "small" data compression algorithms, designed to be used on slow machines with very little RAM and ROM. There are several applications, and several variants that trade off speed, RAM, and ROM. One semi-popular application is "I want to squeeze all kinds of cool stuff into my video game, but I want it all to fit on one disk / one ROM. Oh, and I want it to boot fast, too.". We want to create a bootable executable on that disk that loads the rest of the info off the disk and decompresses it. In this case, we want to minimize the total (decompression binary + compressed video game data) size -- since smaller size means it takes less time to get it off the disk. On one hand, we want the algorithm very simple (so that the decompression binary is small); on the other hand, we want the algorithm very sophisticated (so that the compressed video game data is small).

representing integers

What are the different ways of representing integers ?

see fixme - floating point; endian

[FIXME: this leaves out gray codes, PN shift codes ...]

[FIXME: point to different ways of representing real numbers -- IEEE floating point, continued fraction representation, factorial notation, etc.]

[FIXME: summarize this conversation]


From: CBFalconer 
Subject: Re: compact storage of integer values
Date: 01 Feb 2001 00:00:00 GMT
Approved: clc at plethora.net
Organization: AT&T Worldnet
Reply-To: cbfalconer at worldnet.att.net
Newsgroups: comp.lang.c.moderated

Harald Kirsch wrote:
> > Situation: a HUGE (no HHHUUUGGGEEE) number of mostly small data
> elements must be kept in memory. Their size is stored along with
> them. 7 bit are sufficient to hold the size most of the
> time. Nevertheless larger sizes can be expected.
> > I came up with the following scheme, where 7 should be read (CHAR_BIT-1):
> > If size fits in 7 bit, store it.
> If size fits in n*7 bits:
> store (n-1) times 7 bit in a char and set the 8th bit
> store the last 7 bit in the last char (without 8th bit set)
> > The idea is to use the high-bit of a char to indicate if more bytes
> follow.

I suggest you first evaluate what the MAX value can be.  I suspect
you don't need the infinite linking.  If MAX is expressible in N
bytes, then use 0..255-N+1 as values, and M = N+2 .. 255 to
signify that the following M-N bytes hold a value.

I think this has less wastage, crams more into the 1 byte
representation, etc.

--
Chuck F (cbfalconer at my-deja.com) (cbfalconer@XXXXworldnet.att.net)
http://www.qwikpages.com/backstreets/cbfalconer
   (Remove "NOSPAM." from reply address. my-deja works unmodified)
   mailto:uce@ftc.gov  (for spambots to harvest)
--
comp.lang.c.moderated - moderation address: clcm at plethora.net

[This is in response to a suggestion by
Hans-Bernhard Broeker and Andy Isaacson
 who (independently ?) proposed a specialized version
 of this general scheme with
N = 128
which is simple to decode:
 hi bit = 0: this byte is a small literal (0...127).
 hi bit = 1: clear the hi bit, then use the result as the number of 8 bit chunks following.
]
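A minimal Python sketch of that simpler scheme (hi bit clear = literal 0..127; hi bit set = count of 8-bit chunks that follow); the function names are mine:

  def encode_uint(n):
      """Encode a non-negative integer using the scheme above."""
      if n < 128:
          return bytes([n])                     # hi bit clear: literal 0..127
      chunks = (n.bit_length() + 7) // 8        # number of 8-bit chunks needed
      return bytes([0x80 | chunks]) + n.to_bytes(chunks, "big")

  def decode_uint(buf, pos=0):
      """Return (value, position of the next encoded integer)."""
      first = buf[pos]
      if first < 0x80:
          return first, pos + 1
      chunks = first & 0x7F                     # hi bit set: chunk count
      return int.from_bytes(buf[pos + 1:pos + 1 + chunks], "big"), pos + 1 + chunks

  for n in (0, 127, 128, 300, 70000, 2 ** 40):
      enc = encode_uint(n)
      assert decode_uint(enc)[0] == n
      print(n, enc.hex())      # e.g. 300 -> "82012c" (1 length byte + 2 value bytes)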



To: "John S. Fine" <johnfine at erols.com>, Pasi Ojala <albert at cs.tut.fi>, Antaeus Feldspar <feldspar at cryogen.com>
From: David Cary <d.cary at ieee.org>
Subject: "little workspace" decompressors


Dear "John S. Fine" <johnfine at erols.com>
and Pasi Ojala <albert at cs.tut.fi>
and Antaeus Feldspar <feldspar at cryogen.com>,

I saw you all in the comp.compression newsgroup.

You all seem to be interested in "little workspace" decompressors.
I just thought I'd point out your web pages to each other.
I am really impressed with the clever scheme Pasi Ojala came up with.

How does the Elias Gamma codes (that Pasi Ojala re-invented) compare
with the functionally-similar variable-length code John Fine invented ?
Both map more common (smaller value) integers to shorter (in bits) prefix codes.
(Of course, knowing the actual value frequencies and applying huffman
gives one of several optimum encodings. What are the actual value frequencies for typical files ?).
For this application,
we also want a code that a very small/fast routine can decode
(slightly less-than-optimum encoding can trade off for smaller/faster decode routine).

>From:         "John S. Fine" <johnfine at erols.com>
>Take bits in triplets from high order to low order, with the rightmost
>bit in each triplet used as a stop bit:
>001=0  011=1 ... 111=3 000001=4, 000011=5 ... 010001=8 ... 110111=19
>... 000000001=20 ... 110110111=83 etc.  It may not be obvious, but
>the above can be decoded in a trivial amount of x86 assembler code
>with no tables or workspace.

Pasi Ojala < albert at cs.tut.fi >
http://www.cs.tut.fi/~albert/Dev/pucrunch/
some interesting thoughts on data compression --
-- optimizing for very low-RAM decompression agents -- minimizing *both*
decompression *code* as well as intermediate data used in decompression.
("in-place" decompression; the compressed data includes the decoder program;
RAM is smaller than the total compressed data + total decompressed data)
Includes source code implementing his ideas.

>Subject:      Comment on these compression ideas, please
>From:         "John S. Fine" <johnfine at erols.com>
>Date:         1998/06/27
>Newsgroups:   comp.compression
>...
>http://www.erols.com/johnfine/

I'm thinking about building very tiny embedded processors,
that instead of communicating status via LEDs or a tiny LCD panel,
use the serial port to
send full-fledged HTML pages (with pointers to pretty graphics on some other machine)
as status; perhaps even feedback forms for control.
Since these processors have only a tiny amount of RAM (but slightly larger amounts of ROM),
Antaeus Feldspar has a good idea to try to apply a bit of compression.

p.s.: Given a device spewing HTML pages out its serial port,
do any of you know the techy details involved in actually displaying them
on a commodity PC using some web browser ?
I *could* use a terminal program, capture-as-text, save,
then load into my web browser -- but surely there's a better way.












From: "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
Subject: Re: "little workspace" decompressors
To: d.cary at ieee.org (David Cary)
Date: Fri, 10 Jul 1998 11:53:14 +0300 (EET DST)
Cc: johnfine at erols.com, albert at cs.tut.fi, feldspar at cryogen.com
MIME-Version: 1.0

> How does the Elias Gamma codes (that Pasi Ojala re-invented) compare
> with the functionally-similar variable-length code John Fine invented ?

(Note: do not swallow before biting a bit. I may jump to conclusions here.)

John's code assumes (and benefits from) more flat distribution.
The code can be transformed into:
   0^n 1 (bb)^(n+1)
by rearranging the "stop bits" to the beginning of the code. So, in fact
we have a unary code which tells us how many actual bits to read (times 2).
(Btw, rearranging the code like this may even make it easier and faster
 to decode.)

We can compare this to the Elias Gamma Code:
   0^n 1 b^n

The code lengths for the first couple of values become
  Gamma:  1 3 3 5 5 5 5 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 ...
  John's: 3 3 3 3 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 9 9 9 ...
So, the result depends entirely on the value distribution.

> What are the actual value frequencies for typical files ?

_My_ typical frequencies were very skewed, that's why I selected
Gamma Code. Your typical files probably differ :-)

Oh, and btw, my literal tag system automatically takes advantage
of 7-bit literals and literal locality.

-Pasi
--
"We believe that no race can be truly intelligent without laughter."
	-- Delenn to Sheridan in Babylon 5:"A Race Through Dark Places"








Date: Sun, 12 Jul 1998 08:52:08 -0400
From: John Fine <johnfine at erols.com>
Reply-To: johnfine at erols.com
MIME-Version: 1.0
To: "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
CC: David Cary <d.cary at ieee.org>, feldspar at cryogen.com
Subject: Re: "little workspace" decompressors

Ojala Pasi 'Albert' wrote:
>
> > How does the Elias Gamma codes (that Pasi Ojala re-invented) compare
> > with the functionally-similar variable-length code John Fine invented ?

  There are actually a whole range of variable-length codes
that you can get by unary coding a number U and having
(aU+b) data bits.  Note that some of the value is carried
in the length bits, because you never use a longer code
than is needed for a given value.

  If I understand correctly, Elias Gamma uses (a=1, b=-1).
My code has (a=2, b=0).  (Note, I am taking U to be the
total number of length bits, not the value encoded by them).

> John's code assumes (and benefits from) more flat distribution.

  Actually I used that flatter one for offset and a code more
like Elias Gamma (which I had never heard of) for length.

> by rearranging the "stop bits" to the beginning of the code. So, in fact
> we have a unary code which tells us how many actual bits to read (times 2).
> (Btw, rearranging the code like this may even make it easier and faster
>  to decode.)

  I am fairly sure that keeping the length bits and content
bits mixed results in smaller and simpler code (probably
faster as well).  Even at the one to one ratio of Elias
Gamma, I think mixing the bits would give simpler code.

--
http://www.erols.com/johnfine/
http://www.geocities.com/SiliconValley/Peaks/8600/







Date: Mon, 13 Jul 1998 11:45:56 -0400
From: "John S. Fine" <johnfine at erols.com>
Reply-To: johnfine at erols.com
MIME-Version: 1.0
To: David Cary <d.cary at ieee.org>
CC: "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
Subject: Re: "little workspace" decompressors

David Cary wrote:

> They all unary code a number U >= 0 with U zeros, then a one, followed by
> (aU+b) data bits (where b >= 0, and usually a > 0), for a total length of
>   (U(a+1) + 1 + b) bits.

  I was using U > 0 including the "stop" bit;  But it describes the
same thing, so we can use your form.

> This plain flavor of the Elias Gamma (a=1, b=0) code seems easy to decode
> since the code *is* the value of the code; one just needs to know how many
> bits to grab. But I suspect rearranging a la John's method creates a decode
> routine much shorter and faster than the routine needed to decode the plain

  I would guess a LITTLE shorter and MAYBE faster.

> On the other hand, I just discovered some codes [*] that are *better* than
> the these (aU+b) codes in the sense that, no matter what the distribution
> of integers (as long as it is monotonic), the compressed file will be
> smaller than if we used any of these (aU+b) codes.

  I don't think that is possible.

> Perhaps something like this would be close enough to fibonacci,
> yet still be easy to decode:
> 000 to 101 are the 6 different terminal codes
> (use base 6: multiply accumulated total by 6, add 0..5 to make final result)
> 110 to 111 are the 2 different non-terminal codes.
> (use base 2: multiply accumulated total by 2, then add 0 or 1)

> Since (we assume) smaller numbers are always *more* common than large numbers,
> the 2 codes where "base 6" is worse than John's "stop bit" code
> are less common than the 2 codes where "base 6" is better than John's "stop
> bit" code -- the net effect is using this "base 6" code compresses to fewer
> bits than using the "stop bit" code -- no matter what the actual
> distribution is.

  You stopped too early in examining the distribution.  You have simply
picked a code that does low numbers slightly better and high numbers
MUCH worse (you thought it did high numbers SLIGHTLY worse because you
stopped too soon).

  Compare 3-bit codes with 6 terminal values to three bit codes with
4 terminal values.  This table shows the total number of values that
can be represented in N or fewer bits.

bits   6T   4T
 3      6    4
 6     18   20
 9     42   84
12     90  340

--
http://www.erols.com/johnfine/
http://www.geocities.com/SiliconValley/Peaks/8600/










From: "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
Subject: Re: "little workspace" decompressors
To: johnfine at erols.com
Date: Wed, 15 Jul 1998 13:26:21 +0300 (EET DST)
Cc: albert at cs.tut.fi, d.cary at ieee.org, feldspar at cryogen.com
MIME-Version: 1.0

> I am fairly sure that keeping the length bits and content
> bits mixed results in smaller and simpler code (probably
> faster as well).

Well, of course this depends on the architecture of the chip
performing the decode. In my case the most effective way to
read bits is one at a time (through Carry), thus they _are_
read one at a time.

There is of course the loop overhead, which could be reduced
by making two calls to the getbit routine. The decode code would
become slightly larger (6 bytes ~ 2%) because I still need the
b^N routine for the linear code decode.

And after really thinking about it, if and when you can easily
handle multiple bits at a time (barrel shifter available),
using this stop-bit-arrangement does indeed seem faster (less
jumps per bit).

> I think mixing the bits would give simpler code.

Unary + linear is simpler code. If you mean "faster/simpler to decode",
then you are absolutely right. I may even try it myself, although my
main concern is the size of the decoder. I would need a definite
speed increase (and there are several ways to increase the speed
of my decoder, but they need a longer decoder).

-Pasi







To: johnfine at erols.com, "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
From: David Cary <d.cary at ieee.org>
Subject: Re: "little workspace" decompressors

Re: "little workspace" decompressors

Yes, rearranging the bits makes no difference in how many bits there are
(how big the output file will be).
I think John is right in saying that, given any particular code,
if we scramble up the bits we can make the decoder program shorter and faster.

I never considered scrambling the bits before,
but now I think it's cool.
I think that all the (aU+b) codes John mentioned
have a (at least one) "easy-to-decode" scrambled version.

They all unary code a number U >= 0 with U zeros,
then a one, followed by (aU+b) data bits (where b >= 0, and usually a > 0),
for a total length of
  (U(a+1) + 1 + b) bits.
(Ojala Pasi uses the equivalent U ones,
then a zero -- I guess that was easier to decode for him).
The Elias gamma code has a = 1, b = 0. Plain 7-bit ASCII uses a=0, b=7. John's code has a=b=2.

Perhaps it would be interesting
for the compressor program to select some "optimum" value of a and b
for the particular file being compressed.
Especially when, as with the Pasi Ojala algorithm,
the decompressor program is embedded in the compressed data
and can be optimized for that particular value of a and b.
Perhaps the compressor would only try a couple of values for a and b
that are particularly easy to decode.
For example, we could generalize John's code to all the b=a codes:
scramble 1 "continue" bit with (a) bits of data to make fixed-length blocks of (a+1) bits
(where a=2 is John's code).
(One could brute-force try compressing with a=1, then a=2, then a=3.
Perhaps there is an efficient algorithm for the compressor to decide which a is optimal ...
depends heavily on the distribution of values ...
we want a code that "looks like" the optimal huffman code)

This plain flavor of the Elias Gamma (a=1, b=0) code seems easy to decode
since the code *is* the value of the code;
one just needs to know how many bits to grab.
But I suspect rearranging a la John's method
creates a decode routine much shorter and faster
than the routine needed to decode the plain version.

         plain      scrambled (1x = last block)
  1          1              1
  2        010           0 10
  3        011           0 11
  4      00100        0 00 10
  5      00101        0 00 11
  6      00110        0 01 10
  7      00111        0 01 11
  8    0001000     0 00 00 10
  9    0001001     0 00 00 11
  A    0001010     0 00 01 10
  B    0001011     0 00 01 11
  C    0001100     0 01 00 10
  D    0001101     0 01 00 11
  E    0001110     0 01 01 10
  F    0001111     0 01 01 11
 10  000010000  0 00 00 00 10
 11  000010001  0 00 00 00 11

Both codes are, of course, identical in bit length,
number of zeros, number of ones, etc. --
the only difference is in the size and speed of the decompression program.
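A short Python sketch (my own reconstruction from the table above) of the plain Elias gamma code and the "scrambled" rearrangement; it reproduces both columns and round-trips the scrambled form:

  def gamma_plain(n):
      """Plain Elias gamma: floor(log2 n) zeros, then n in binary."""
      body = bin(n)[2:]                      # binary of n, starts with a 1
      return "0" * (len(body) - 1) + body

  def gamma_scrambled(n):
      """The same bits rearranged: each data bit is prefixed by a flag,
      0 = another block follows, 1 = this is the last block."""
      data = bin(n)[3:]                      # the bits of n after the leading 1
      if not data:
          return "1"                         # n == 1: no data blocks at all
      flags = "0" * (len(data) - 1) + "1"
      return "0" + "".join(f + d for f, d in zip(flags, data))

  def gamma_scrambled_decode(bits):
      """Decode one scrambled codeword from a string (or stream) of '0'/'1'."""
      it = iter(bits)
      n = 1
      if next(it) == "1":
          return n
      while True:
          flag, d = next(it), next(it)
          n = n * 2 + int(d)
          if flag == "1":
              return n

  for n in range(1, 18):
      assert gamma_scrambled_decode(gamma_scrambled(n)) == n
      print("%3X  %9s  %s" % (n, gamma_plain(n), gamma_scrambled(n)))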

On the other hand, I just discovered some codes [*]
that are *better* than these (aU+b) codes in the sense that,
no matter what the distribution of integers (as long as it is monotonic),
the compressed file will be smaller than if we used any of these (aU+b) codes.
Unfortunately, I have not (yet) figured out a small routine to decode these codes.
I suppose I could just have a lookup table for the first N codes,
then revert back to John's code if it is one of the rare items not in the lookup table.
This may be a win in size (if the extra compaction in the data compensates for the larger program)
and speed (if the most common table-lookup is faster than bit-by-bit decoding),
but it just doesn't seem elegant.

I've been trying to think of how to take advantage of "edge effects"
in small-file compression.
As a particular example, after the decompressor has decoded, say, 200 bytes,
if it were really intelligent then it would "know" that an offset value in an (offset, length) LZ77 code
couldn't possibly be more than 200 bytes;
so perhaps we can somehow shorten the length (in bits) of the codes that *are* possible.
More generally, it doesn't seem "right" to me
that most compression algorithms get significantly different compressed file sizes
on a file vs. the time-reversed image of that file.
Huffman is the only algorithm I know of that gets exactly the same compressed file size either way.
If I could somehow make the conditions when the decompressor is near the end of the file
"more like" the conditions when it is near the start of the file,
perhaps I could squeeze out a few more bits of compression.
The length of the uncompressed file is usually known to the compressor before compression even begins;
the decoder often has a rough estimate
(and sometimes the exact value) of the total compressed code length and the total decompressed data length;
perhaps it can take advantage of this meta-information...

[*] The Elias delta and the Fibonacci codes.
2 radically different ways of overcoming the inherent inefficiency of the unary number encoding.
Elias delta has 3 parts: a unary number encoding U, then 2^N bits encoding V,
then 2^V bits encoding the actual desired number.
I think I remember reading that Elias delta was already "optimal", in some sense --
there was no advantage to going to 4 parts.
(But I don't really understand why that was true).

Fibonacci codes,
rather than having a unary code of 0 bits that ends at the 1st 1 bit,
instead has a more compact code that ends the first time it hits 2 consecutive 1 bits.
This is similar to the "bit-stuffing" in most long-distance communication protocols
(and CD-ROM format and some magnetic disk formats),
where the data is not allowed to have N 0 bits in a row (because the receiver would lose clock sync).
*Every* time there are (N-1) "0 bits" in a row, the transmitter *always* stuffs a "1 bit".
When decoding, every time there are (N-1) "0 bits", always throw away the following "1 bit".
Is there an efficient way of decoding these values (perhaps by scrambling the bits) ?
Or a code similar to this (better than (aU+b) ) that can be decoded with a short program ?
(perhaps read pairs of bits, and stop when both bits equal 11 -- meanwhile decode using base 3 ?).

I first read about Elias delta and Fibonacci codes at
  http://www.ics.uci.edu/~dan/pubs/DC-Sec3.html#Sec_3.3
>3.3 Universal Codes and Representations of the Integers
...
>The first Elias code is one which is simple but not optimal. This code, gamma
...
>Elias delta
[is optimal]
...
>While the Fibonacci codes are not asymptotically optimal,
they compare well to the Elias codes as long as the number of source messages is not too large.
Fibonacci codes have the additional attribute of robustness,
which manifests itself by the local containment of errors.
...
>a Fibonacci code provides better compression than the Elias code
until the size of the source language becomes very large.
...
>We describe only the order 2 Fibonacci code;
the extension to higher orders is straightforward.
>
>N             R(N)               F(N)
>
> 1                           1   11
> 2                       1   0   011
> 3                   1   0   0   0011
> 4                   1   0   1   1011
> 5               1   0   0   0   00011
> 6               1   0   0   1   10011
> 7               1   0   1   0   01011
> 8           1   0   0   0   0   000011
...
>16       1   0   0   1   0   0   0010011
...
>32   1   0   1   0   1   0   0   00101011
>
>    21  13   8   5   3   2   1
>
>Figure 3.7 -- Fibonacci Representations and Fibonacci Codes.
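A small Python sketch of the order-2 Fibonacci code (greedy Zeckendorf decomposition, least-significant Fibonacci digit first, then an appended 1 that completes the "11" terminator) which reproduces Figure 3.7:

  def fib_encode(n):
      """Order-2 Fibonacci code of a positive integer n."""
      fibs = [1, 2]                          # the Fibonacci numbers used for coding
      while fibs[-1] < n:
          fibs.append(fibs[-1] + fibs[-2])
      if fibs[-1] > n:
          fibs.pop()
      bits, remainder = [], n
      for f in reversed(fibs):               # greedy Zeckendorf decomposition
          if f <= remainder:
              bits.append("1")
              remainder -= f
          else:
              bits.append("0")
      # least-significant Fibonacci digit first; the appended 1 completes the "11"
      return "".join(reversed(bits)) + "1"

  for n in (1, 2, 3, 4, 5, 6, 7, 8, 16, 32):
      print(n, fib_encode(n))
  # 1 11 / 2 011 / 3 0011 / 4 1011 / 5 00011 / 6 10011 / 7 01011
  # 8 000011 / 16 0010011 / 32 00101011   (matches Figure 3.7)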


Perhaps something like this would be close enough to fibonacci,
yet still be easy to decode:
000 to 101 are the 6 different terminal codes
(use base 6: multiply accumulated total by 6, add 0..5 to make final result)
110 to 111 are the 2 different non-terminal codes.
(use base 2: multiply accumulated total by 2, then add 0 or 1)

      base 6   "stop bit"

         000          001
         001          011
         010          101
         011          111
         100      000 001
         101      000 011
     110 000      000 101
     110 001      000 111
     110 010      010 001
     110 011      010 011
     110 100      010 101
     110 101      010 111
     111 000      100 001
     111 001      100 011
     111 010      100 101
     111 011      100 111
     111 100      110 001
     111 101      110 011
 110 110 000      110 101
 110 110 001      110 111
 110 110 010  000 000 001
 110 110 011  000 000 011
 110 110 100  000 000 101
 110 110 101  000 000 111
 110 111 000  000 010 001
...
Since (we assume) smaller numbers are always *more* common than large numbers,
the 2 codes where "base 6" is worse than John's "stop bit" code
are less common than the 2 codes where "base 6" is better than John's "stop bit" code --
the net effect is using this "base 6" code compresses to fewer bits than using the "stop bit" code --
no matter what the actual distribution is.
But this probably increases the complexity of the decoder.

>From: John Fine <johnfine at erols.com>
...
>  There are actually a whole range of variable-length codes
>that you can get by unary coding a number U and having
>(aU+b) data bits.  Note that some of the value is carried
>in the length bits, because you never use a longer code
>than is needed for a given value.
...
>  Actually I used that flatter one for offset and a code more
>like Elias Gamma (which I had never heard of) for length.
...
>  I am fairly sure that keeping the length bits and content
>bits mixed results in smaller and simpler code (probably
>faster as well).  Even at the one to one ratio of Elias
>Gamma, I think mixing the bits would give simpler code.
>
>--
>http://www.erols.com/johnfine/
>http://www.geocities.com/SiliconValley/Peaks/8600/


>From: "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
...
>The code lengths for the first couple of values become
>  Gamma:  1 3 3 5 5 5 5 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 ...
>  John's: 3 3 3 3 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 9 9 9 ...
>So, the result depends entirely on the value distribution.
...
>-Pasi

(posted to comp.compression; also mailed direct)

--
+ David Cary "http://www.ionet.net/~caryd_osu/david"
| Future Tech, Unknowns, PCMCIA, digital hologram, <*> O-











To: "John S. Fine" <johnfine at erols.com>, Pasi Ojala <albert at cs.tut.fi>, Antaeus Feldspar <feldspar at cryogen.com>
From: David Cary <d.cary at ieee.org>
Subject: Re: "little workspace" decompressors


Yes, John is quite right-- the 6T code I just made up gets worse very quickly at high numbers.
If I were lucky enough to have an input file that needed these large numbers extremely rarely,
then my 6T code would do a better job compression than the 4T code;
but if the data has a more even distribution, then the 4T code would be better.

Is this true for the Fibonacci codes as well --
are they always better, as I originally claimed, or is it data-dependent ?

1st hand:
It *seems* like Fibonacci would always be better --
if one looks at all the length N unary codes (there is only 1: (N-1) zeroes, followed by a 1),
and compares it to all the length N Fibonacci codes
(which include the unary code,
plus some other codes with lots of zeros and single "1" bits scattered through them,
followed by a "11"),
it seems there are always more Fibonacci codes of a specific length
(which means it can represent more data).
(I'm having a little difficulty coming up with
a formula predicting exactly how many Fibonacci codes there are of length N).

2nd hand:
On the other hand, one can use the "pigeon-hole",
"counting" argument --
since, when decoding, all possible bit sequences mean *something* unique,
there is no redundancy.
It would then seem that one cannot guarantee that one code will always get better compression than another.
In fact, one can pick any code,
then synthesize a artificial data file to make that code look the best
by using numbers with a frequency proportional to 1/(#bits needed to represent that number in this code).

How do I resolve the contradiction between these 2 hands ?

U
	U-bit unary codes
			U-bit Fibonacci codes

1	1: 1		0
2	1: 01		1: 11
3	1: 001		1: 011
4	1: 0001		2: 0011, 1011
5	1: 00001	3: 00011, 01011, 10011
6	1: 000001	5: 000011, 001011, 010011, 100011, 101011
u	1: 0...01	f(u) (is there a formula ?)

(neglecting the aU+b data bits that follow the code)

Hm... if there were *always* at least as many Fibonacci codes as unary codes
(the column at right always had at least 1, instead of that pesky 0 at the top)
then the 1st hand paragraph would be true.
What sorts of distributions would compress better using unary codes ?
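A brute-force Python sketch (my own) to fill in the f(u) column of the table above: enumerate the u-bit strings that end in "11" and contain no earlier "11". The counts come out 0, 1, 1, 2, 3, 5, 8, 13, 21, ... which suggests f(u) is simply a Fibonacci number:

  from itertools import product

  def fib_codewords(u):
      """All u-bit Fibonacci codewords: strings ending in '11'
      with no earlier occurrence of '11'."""
      words = []
      for bits in product("01", repeat=u):
          w = "".join(bits)
          if w.endswith("11") and "11" not in w[:-1]:
              words.append(w)
      return words

  for u in range(1, 10):
      print(u, len(fib_codewords(u)), " ".join(fib_codewords(u)[:5]))
  # counts: 0, 1, 1, 2, 3, 5, 8, 13, 21 -- the Fibonacci sequence again,
  # so f(u) appears to be the (u-1)th Fibonacci number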

Which of these codes
(Fibonacci, Fibonacci aU+b, unary aU+b, Elias delta)
has the best match with the 1/f distribution
typical of the distribution of words in natural languages --
and, so I hear, the distribution of patterns in DNA ?

  http://www.ics.uci.edu/~dan/pubs/DC-Sec3.html#Sec_3.3
>3.3 Universal Codes and Representations of the Integers
...
>We describe only the order 2 Fibonacci code; the extension to higher orders is straightforward.
>
>N             R(N)               F(N)
>
> 1                           1   11
> 2                       1   0   011
> 3                   1   0   0   0011
> 4                   1   0   1   1011
> 5               1   0   0   0   00011
> 6               1   0   0   1   10011
> 7               1   0   1   0   01011
> 8           1   0   0   0   0   000011
...
>16       1   0   0   1   0   0   0010011
...
>32   1   0   1   0   1   0   0   00101011
>
>    21  13   8   5   3   2   1
>
>Figure 3.7 -- Fibonacci Representations and Fibonacci Codes.


>Date: Mon, 13 Jul 1998 11:45:56 -0400
>From: "John S. Fine" <johnfine at erols.com>
>To: David Cary <d.cary at ieee.org>
>CC: "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
>Subject: Re: "little workspace" decompressors
>
>David Cary wrote:
...
>> On the other hand, I just discovered some codes [*] that are *better* than
>> the these (aU+b) codes in the sense that, no matter what the distribution
>> of integers (as long as it is monotonic), the compressed file will be
>> smaller than if we used any of these (aU+b) codes.
>
>  I don't think that is possible.
>
>> Perhaps something like this would be close enough to fibonacci,
>> yet still be easy to decode:
>> 000 to 101 are the 6 different terminal codes
>> (use base 6: multiply accumulated total by 6, add 0..5 to make final result)
>> 110 to 111 are the 2 different non-terminal codes.
>> (use base 2: multiply accumulated total by 2, then add 0 or 1)
>
>> Since (we assume) smaller numbers are always *more* common than large numbers,
>> the 2 codes where "base 6" is worse than John's "stop bit" code
>> are less common than the 2 codes where "base 6" is better than John's "stop
>> bit" code -- the net effect is using this "base 6" code compresses to fewer
>> bits than using the "stop bit" code -- no matter what the actual
>> distribution is.
>
>  You stopped too early in examining the distribution.  You have simply
>picked a code that does low numbers slightly better and high numbers
>MUCH worse (you thought it did high numbers SLIGHTLY worse because you
>stopped too soon).
>
>  Compare 3-bit codes with 6 terminal values to three bit codes with
>4 terminal values.  This table shows the total number of values that
>can be represented in N or fewer bits.
>
>bits   6T   4T
> 3      6    4
> 6     18   20
> 9     42   84
>12     90  340
>
>--
>http://www.erols.com/johnfine/
>http://www.geocities.com/SiliconValley/Peaks/8600/









Date: Thu, 30 Jul 1998 10:11:51 -0400
From: "John S. Fine" <johnfine at erols.com>
To: David Cary <d.cary at ieee.org>
Subject: Re: "little workspace" decompressors

David Cary wrote:

> Is this true for the Fibonacci codes as well -- are they always better, as
> I originally claimed, or is it data-dependent ?

bits   Fib   6T    4T
1        -    -     -
2        1    -     -
3        2    6     4
4        4    -     -
5        7    -     -
6       12   18    20
7       20    -     -
8       33    -     -
9       54   42    84
. . .
12     232   90   340
15     986  154  1364

  Maybe I misunderstand "better".  The values in the
table above are cumulative.  In 9 or fewer bits, fib
can represent 54 different values; 4T can represent
84 different values.

--
http://www.erols.com/johnfine/
http://www.geocities.com/SiliconValley/Peaks/8600/


lossy text compression

[this section is very unfinished ...]

Currently, good-quality lossy text compression can only be done by humans. It's one of the few remaining areas where humans are clearly superior to silicon-based computers.

random ideas

I've been thinking about "conflation" (``confounded'' ?), mixing together unrelated items, and how to undo this (in hopes that such a transformation could lead to better compression).

From: d_cary < at my-dejanews.com >
Subject: undoing conflation
Date: 28 Oct 1998 00:00:00 GMT
Newsgroups: comp.compression



Assume that I have some pseudo-English text such that each word is a
standard English word and occurs with pretty much the standard frequency for
that word in English text, but each word has absolutely no correlation to the
next ("memoryless").
(There are, of course, strong correlations between the letters inside a word,
as well as common word prefixes, common word suffixes, etc).

The best possible compression I could hope for would be to
use each word as a "symbol" and compress one symbol at a time
with arithmetic coding (marginally better than Huffman coding), right ?
(I assume the file is very big relative to the size of the frequency table).

If I scramble each word with any reversible word->integer transformation
before compressing,
the compressed file will be exactly the same size
as the one I created without this transformation, right ?

If I just randomly selected clumps of letters
(say, 4 letters at a time), paying no attention to the word boundaries,
compressing one clump (one symbol) at a time with arithmetic (or Huffman)
compression, I think the compressed file would be much worse (right ?)
than the previous 2 files.
If I scramble each clump with any reversible clump->integer transformation,
creating a "conflated file",
the compressed file will be exactly the same ("much worse") size, right ?
(Is there a better term for this than "conflated file" ?).

Here's the question: Say I have a "conflated file" -- a sequence of (short)
integers with the above characteristics. How do I compress it (losslessly) as
small as possible ? Rather than immediately hitting it with arithmetic coding,
the best thing to do is the inverse integer->clump transformation, then
compress one word (symbol) at a time with Huffman coding. But I don't know
which clump->integer transformation was used. Now what ? Is there some way I
could un-do (some of) the damage, somehow break the integers up into
variable-length strings (doesn't have to be the original text), re-group them
into "better" multi-byte words ? If I could do that, I could use the
above-mentioned "optimal" arithmetic coding.

If this were possible, one might even be able to apply it to compressing
standard text -- -- ideally, it would recognize that the letter "T" is
conflating the idea of "next letter is capitalized" with the idea of the
letter "t". Once we un-did that conflation, we would have a text file without
any Upper Case letters, but we would have to add a new symbol (I arbitrarily
choose A) to indicate that the next letter is a capital.

(conflated)
This is a test. If this were ...
(unconflated)
Athis is a test. Aif this were ...

Of course, now we have the extra overhead of having to stick extra
information in the compressed file telling the uncompressor, once the
"unconflated file" has been decompressed, how to re-conflate it so we can
recover the original conflated file I was given.

But the compressor can now recognize that both "This" and "this" are the same
word, and that the "A" symbol is almost always preceded by a space or a
newline, allowing it to create a smaller file.

Is it possible to get a net benefit on the
  Canterbury Corpus
? I hope so :-).

--
David Cary
http://www.rdrop.com/~cary/html/data_compression.html

-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/       Search, Read, Discuss, or Start Your Own
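A tiny Python sketch of the capitalization un-conflation described in the post above (I use "^" as the shift marker here instead of "A", just to keep the toy example unambiguous):

  MARK = "^"                 # hypothetical "next letter is a capital" marker

  def unconflate_caps(text):
      """Replace each capital letter with MARK + its lowercase form."""
      return "".join(MARK + c.lower() if c.isupper() else c for c in text)

  def reconflate_caps(text):
      """Inverse transformation: MARK + letter -> capital letter."""
      out, i = [], 0
      while i < len(text):
          if text[i] == MARK:
              out.append(text[i + 1].upper())
              i += 2
          else:
              out.append(text[i])
              i += 1
      return "".join(out)

  original = "This is a test. If this were ..."
  shifted = unconflate_caps(original)
  print(shifted)                          # ^this is a test. ^if this were ...
  assert reconflate_caps(shifted) == original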

Other compression-related ideas:

error detection and correction

ECC: error correcting codes

EDAC: error detection and correction

[FIXME: move all this information to "Data Link Error Detection / Correction Methods" http://massmind.org/techref/method/errors.htm ]

CRC: cyclic redundancy check (see spin_dictionary.html#crc for other CRC acronyms )

Data compression is often preparatory to communicating data over some "limited-bandwidth" real-time communications channel or storing it on limited storage media. After squeezing out most of the redundancy, we *deliberately* add more redundancy. All real-time communications channels add enough redundancy (in the form of headers, trailers, and CRC codes) to allow the errors to be detected, so the receiver can ask the transmitter to re-send the data. Some communications channels and most storage devices add even more redundancy (Hamming codes for DRAMs, Reed-Solomon codes for CD-ROMs, and other ECCs) to let us detect and correct errors at the receiver or media reader.

The simple hardware to do CRCs looks very similar to the simple hardware to do LFSR machine_vision.html#spread_spectrum
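For example, here is a minimal bit-at-a-time CRC-32 sketch in Python (the standard reflected polynomial 0xEDB88320, as used by zip and Ethernet); the shift-and-conditional-XOR inner loop is the same structure as an LFSR:

  import zlib

  def crc32_bitwise(data):
      """Bit-at-a-time CRC-32 (reflected, polynomial 0xEDB88320) -- a software LFSR."""
      crc = 0xFFFFFFFF
      for byte in data:
          crc ^= byte
          for _ in range(8):
              # shift the register; feed the polynomial back in when a 1 falls out
              crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
      return crc ^ 0xFFFFFFFF

  assert crc32_bitwise(b"123456789") == zlib.crc32(b"123456789") == 0xCBF43926
  print(hex(crc32_bitwise(b"123456789")))   # 0xcbf43926, the standard check value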

program compression

2001-12-01:DAV: started section. (Thanks to "Jerry Lim Han Sin" who encouraged me to write down my thoughts on program compression)

Lots of people complain about huge, bloated executables. What can we do about it ?

related local files:

Normally I (DAV dav_info.html) utterly despise machine/OS dependent stuff -- because this makes it far too difficult to *improve* my machine. But in this special case, it's cool -- executable files are inherently machine/OS dependent *anyway*, so it doesn't make things any worse to use techniques that also are machine/OS dependent.

Ways of compressing programs: [FIXME: is there a better terminology I could use than ``functional'' vs. ``non-functional'' ? ] ( "Metaprogramming and Free Availability of Sources: Two Challenges for Computing Today" article by François-René Rideau http://fare.tunes.org/articles/ll99/index.en.html mentions

)

"The Code Compression Bibliography" (currently maintained by Mario Latendresse) http://www.iro.umontreal.ca/~latendre/codeCompression/ uses the term "code compression" for what I call "program compression" or "executable compression". (Unfortunately, the term "code" conflicts with the jargon used in data compression).

DAV: The P. Koopman quote directly points to 3 ways of decreasing program size:

space-optimizing compiler

If you're using gcc or a similar compiler, consider using the

-Os

(optimize for size) compiler option.

[FIXME: move information on subroutine threading, direct threading, and indirect threading to here] [or to http://en.wikipedia.org/wiki/Threaded_code ]

In some cases (when the CALL instruction on this particular architecture is very compact, and ... it's just as compact to handle control structures in-line as it is to CALL subroutines to handle them), subroutine threading can be just as compact as the other kinds of threading.

compact instruction set

DAV wonders what a very compact instruction set would look like, and what a good tradeoff in size/speed/power would be.

[FIXME: how to properly split this topic between data compression and computer_architecture.html#considerations ?]

A compact instruction set can either be implemented directly in hardware, or (trading off some run-time speed) emulated/interpreted (for example, the BASIC Stamp and Pcode), or (trading off some start-up time) recompiled before running (for example, JIT Java).

There are 3 main variants that I know of for creating a compact instruction set:

I think this illustrates "premature optimization" -- the misplaced focus on making individual instructions small (cb) interferes with the real goal of making the entire program small (cc). If you're only allowed to look at one instruction at a time (cb), the best you can do is a Huffman-style compression. But if you're able to look at even 2 or 3 instructions at a time (cc), you can often get much better compression -- LZ, LZW, and similar kinds of compression. (But I keep thinking that LZ, LZW compression won't work if the program is already well-factored).

related links:

other functional compression

I am especially interested in ``lossy'' program compression. (When an embedded system lacks enough RAM to run the full-up version, ... a smaller program that functionally *does* all the same stuff, although it's impossible to recover the exact original file from the compressed version, and it may have slightly worse performance.) Example: recognize ``unrolled loops'', then shrink them back down to a single cycle of the loop. Example: "refactoring" .

However, "functional compression" is really ...

Tradeoff between performance and size:

program compression links:

file formats

[FIXME: should I combine all my file format information in one place ? Or does it make sense to divide it into 2 parts:

]

[FIXME: html-ify "file format considerations"; add use the Unicode trick of "FFFE" to detect and correct byte-swap. ]

unsorted


Original Author: David Cary. This page split off from computer_graphics_tools.html on 1998-07-10.

comments, suggestions, errors, bug reports to

David Cary feedback.html
d.cary@ieee.org.

Return to index // end http://david.carybros.com/html/data_compression.html /* was http://rdrop.com/~cary/html/data_compression.html */