data compression

updated 2009-04-29.


"It is my ambition to say in ten sentences what others say in a whole book." -- Friedrich Nietzsche

"Short words are best and the old words when short are best of all." -- Winston Churchill

The most valuable talent is that of never using two words when one will do. -- Thomas Jefferson

If you would be pungent, be brief; for it is with words as with sunbeams. The more they are condensed, the deeper they burn. - Robert Southey

"In the multitude of words there wanteth not sin: but he that refraineth his lips is wise." -- Proverbs 10:19

data compression, in general

There are several different orthogonal ways to categorize data compression:

Most people have assumed that the decompressor has access to the complete compressed data file from the beginning. However, DAV has been thinking about ways to synchronize a decoder in the middle of a data stream -- for example, immediately after you turn on your TV, you want to be able to start watching a compressed digital video signal almost immediately. This is similar to ``medium-latency'', except it requires the TV to be able to decode the last part of the TV program even if it completely missed out on the first part. Some fractal compression techniques have this property ... I've been calling this ``history-limited decompression''.

introductions to data compression: beginner tutorials

More links to data compression, in general (Other pages that, like this one, have lots of links to compression information):

2D Image Compression

specifically designed to compress 2D images.

The 2 hottest topics in 2D image compression (circa 1999, when DAV spent a lot of time in this area) are "wavelets" and "fractals" (which really have many similarities).

The goal here is to try to improve on computer_graphics_tools.html#png .

other 2D image compression

fractal image compression

see also computer_graphics_tools.html#fractals for fractals in general (as artistic and mathematical objects).

(FIXME: move to http://en.wikipedia.org/wiki/Fractal_Compression )

[FIXME: write fractal image compression with extra ``erode'' and ``dilate'' parameters]

wavelet image compression

Introductions to Wavelets

Wavelets applied to Image Compression

color image compression

Color coordinate systems and conversion between them.

See also clustering and color quantization machine_vision.html#cluster .

I'm most interested in simple lossless stuff.

1D data compression

Specifically designed to compress 1D streams of symbols, such as English text. 2D images are often rearranged into a linear list of pixels (e.g., by scanning one row at a time, or walking a Hilbert path), then compressed with one of these tools.

Some of these (such as Huffman) don't even take advantage of the 1D correlation between adjacent symbols; they only take advantage of the statistics of the entire file taken as an unordered set. Such an algorithm would compress any file with identical statistics (for example, the same file with its items shuffled into an arbitrary order) just as well.

Others of these (LZ77 and descendants) completely ignore the unordered statistics, and simply copy repeated phrases.

Theory:

Source code available at:

Huffman

[FIXME: Not all my Huffman links are here; some of the 1D_compression links describe several compression ideas including Huffman compression. Perhaps I should copy them here and directly link to their Huffman page].

Reverse Huffman coding

David Cary developed "reverse Huffman coding" to minimize transmission cost when different symbols have different costs.

(Other people, including Peter Wayner, have independently discovered/used reverse Huffman for other purposes)

This requires a Huffman compression algorithm such that decompressing *any* "compressed" file, then compressing that "plaintext", regenerates exactly the same "compressed" file. See David Scott http://bijective.dogma.net/ for another application that requires this same property.

Say you have Q unique symbols that can be transmitted, each one x with cost c(x). For example, you have a battery-powered device that turns an LED off and on, where each 9-bit symbol always starts with the LED on for one bit time, and the total cost is the battery drain, which only occurs while the LED is turned on. In other situations, the "cost" may also involve the transmission time.

Theorem: To minimize transmission cost, you want to equalize the cost per bit of information transmitted, for each symbol.

Proof: [FIXME]

We assume that the data to be transmitted has already been compressed into a bitstream, such that no matter what values the previous bits have been, the probability that the next bit is a 1 is always approximately 1/2.

The cost can be any arbitrary function of the symbol x. This would commonly include some combination of

If every one of the unique symbols can be transmitted in the same amount of time, and the "time to transmit" dominates the cost, then the minimum cost occurs when all symbols are transmitted (approximately) equally often.

Using Arithmetic coding isn't much better than Huffman at equalizing cost/bit when there are lots of symbols, but it can be far superior when there are only 2 or 3 symbols.

In some cases, it may be a good idea to concatenate "simple" symbols to make a "big" symbol. For example, if you have 3 "simple" symbols to work with, you might get better cost/bit if you concatenate 3 of them to create a symbol set of 27 "big" symbols, and then feed the cost of each of those 27 symbols into the reverse Huffman algorithm. Concatenation also simplifies convoluted transmission constraints such as "when transmitting these -1, 0, +1 values, the average must always be 0".
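As a concrete illustration, here is a minimal Python sketch (the per-symbol costs are hypothetical, picked only for illustration) that builds the 27 "big" symbols from the three -1, 0, +1 "simple" symbols and keeps only the big symbols that respect an "average must always be 0" constraint; the resulting cost table is what would be fed into the reverse Huffman algorithm:

  from itertools import product

  # hypothetical costs for the 3 "simple" symbols -1, 0, +1
  simple_cost = {-1: 1.0, 0: 0.5, +1: 1.0}

  # concatenate 3 simple symbols into one "big" symbol;
  # the cost of a big symbol is the sum of the costs of its parts
  big_cost = {triple: sum(simple_cost[s] for s in triple)
              for triple in product(simple_cost, repeat=3)}
  assert len(big_cost) == 27

  # one simple way to respect an "average must always be 0" constraint:
  # keep only the big symbols whose parts sum to zero
  balanced = {t: c for t, c in big_cost.items() if sum(t) == 0}

  # big_cost (or balanced) is what gets fed to the reverse Huffman algorithm
  print(len(balanced), "of 27 big symbols satisfy the constraint")   # 7 of 27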

Simple example: We have an LED transmitter / photodiode receiver pair that transmit 4-bit symbols to each other. Every symbol is required to have a minimum of 1 "on" bit to maintain synchronization between transmitter and receiver, so that leaves 15 valid symbols.

The cost for this application is entirely the battery drain when the LEDs are on, a constant 1 unit for each time period. We neglect the time needed to transmit information and the battery power used by the transmitter and receiver while the LED is off.

  LED pattern   cost   encoded bits
  0000          0      (never transmitted)
  0001          1      001
  0010          1      01
  0011          2      000001
  0100          1      10
  0101          2      000010
  0110          2      000011
  0111          3      000000000
  1000          1      11
  1001          2      000100
  1010          2      000101
  1011          3      00000001
  1100          2      00011
  1101          3      00000010
  1110          3      00000011
  1111          4      000000001

Without reverse Huffman transmission, you might transmit each symbol equally often, so each symbol encodes approximately log2(15) ~= 3.9 bits of information (using bit stuffing), at an average cost of (average cost)/(average bits) = (32/15)/(log2(15)) = 0.546 cost/bit.

Or you might transmit only 3 bits per symbol, using the 8 lowest-power symbols, to get a slightly better ratio of (average cost)/(average bits) = (12/8)/(3) = 0.500 cost/bit.

But the optimum cost/bit comes from using reverse Huffman transmission. Then the probability of each symbol being transmitted becomes [ 64 128 8 128 8 8 1 128 8 8 2 16 2 2 1]/512. The length in bits for each symbol is [ 3 2 6 2 6 6 9 2 6 6 8 5 8 8 9]. So the average(cost/bit) = sum( probability(i).*cost(i)./bit(i) ) = 0.4611 cost/bit.

We can see how well this choice of bit patterns fits our goal of equal cost/bit by calculating cost ./ bit = [0.3333 0.5000 0.3333 0.5000 0.3333 0.3333 0.3333 0.5000 0.3333 0.3333 0.3750 0.4000 0.3750 0.3750 0.4444] = [120 180 120 180 120 120 120 180 120 120 135 144 135 135 160]./360 .

Well, it's a lot closer to a constant cost/bit than the simpler strategies above.
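Here is a minimal Python sketch (using the cost and code-length tables above, and assuming each symbol's probability is 2^-length, which is what feeding an incompressible bitstream through the Huffman decoder gives) that reproduces the 0.546, 0.500, and 0.4611 figures:

  import math

  # cost = battery drain (number of "on" bits) of each valid symbol 0001..1111
  cost = [1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4]
  # bits = length of the reverse-Huffman code assigned to each symbol (table above)
  bits = [3, 2, 6, 2, 6, 6, 9, 2, 6, 6, 8, 5, 8, 8, 9]

  # decoding an incompressible bitstream makes each symbol appear
  # with probability 2**-bits; check that the code is a complete prefix code
  prob = [2.0 ** -b for b in bits]
  assert abs(sum(prob) - 1.0) < 1e-12

  equal = (sum(cost) / 15) / math.log2(15)    # all 15 symbols equally often
  low8  = (12 / 8) / 3                        # only the 8 cheapest symbols
  rev   = sum(p * c / b for p, c, b in zip(prob, cost, bits))

  print(round(equal, 3), round(low8, 3), round(rev, 4))   # 0.546 0.5 0.4611
  # per-symbol cost/bit, cf. the [0.3333 0.5000 ...] vector above:
  # [round(c / b, 4) for c, b in zip(cost, bits)]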

Open questions: Is it possible to have a "reverse adaptive Huffman" adapt to changing line conditions ?

[FIXME: not a good explanation of end effects. See bijection ] End effects: The above neglects what happens at the end of a transmission. What if you are nearing the end of transmitting a message, and you find that the remaining bits to be transmitted are "1100" ? Obviously you find that you need to transmit a "1000" pattern (interpreted as "11"); then what could you send that would be interpreted as "00" ? One simple protocol would be to append a virtually infinite string of zeros after the message; this protocol would then transmit a "0111" pattern (interpreted as "000000000") to transmit the remaining "00" data bits and a bunch of bogus "0" bits. Hopefully the receiver would properly ignore the extra 0 bits. Perhaps if the message to be transmitted ended in zeros, the transmitter could halt transmissions after the last 1 bit was transmitted, and the receiver would somehow fill in the missing 0 bits. (Appending a virtually infinite string of ones after the message could also be interesting). Is it possible for the receiver to truncate extra zeros (or even fill in missing zeros) even if it doesn't know the exact length of the original text ? Maybe if it knows the original text was an exact multiple of 8 bits in length ... no, that wouldn't work, because in the special case where we lacked 1 bit to make up a full byte, and the final transmitted symbol was "0111", then the 9 "0" bits that it decodes into would be ambiguous as to whether they were intended as the final "0" bit of the message, or whether the message ended with another full byte of zeros. (Appending a virtually infinite string of ones looks like it would work in this case ... will it always work ?)

What is the maximum height of a Huffman tree ?

from a thread at http://www.deja.com/[ST_rn=md]/threadmsg_md.xp?AN=487814206 and more info at http://www.compressconsult.com/huffman/#maxlength

As you can see from the above, if you have 2^15 symbols (or less) in your data file, you can refer to any one of them using a 15 bit offset from the start of the file. I was incorrect in concluding that therefore the maximum code length would be 15 bits, since Huffman *inserts* lots of symbols (in that mythical sorted file), so the Huffman offset may be much longer than 15 bits.

Alternatively, you can lose a *tiny* amount of compression by using a code table that is slightly less than optimal, to force code lengths to a maximum of 15 bits.

Q: What is the maximum height
of a Huffman tree with an alphabet of 8 bit (256 different characters)?
A: The worst case height is 255.

The worst case is, of course, exceedingly improbable in practice.

If you have control of the Huffman code assignment (which you probably
do as an encoder), then you can set an a-priori limit on the code
length.  This means that the lowest-probability symbols will have
slightly-less-than-optimal code assignments, but since they occur
infrequently by definition, the loss in compression ratio is tiny.
The resulting code simplifications may be well worth that price.

The JPEG standard, for example, requires that no code be longer than
16 bits.  (Since this is a requirement of the standard, both encoders
and decoders are able to take shortcuts that assume this is the
maximum code length.)  The standard offers a simple approximate
algorithm for adjusting a true Huffman code tree to meet this maximum
depth.
                        regards, tom lane
                        organizer, Independent JPEG Group


but,

From: "DI Michael Schindler" 
Subject: Re: huffman code length
Date: 10 Jun 1999 00:00:00 GMT
Organization: Compression Consulting
Newsgroups: comp.compression

hi Tom,

you are correct.

Two things you did not mention:
JPEG uses a very small alphabet; in the case here a maximum codelength
of 8 bit is not practical (we have a 256 symbol alphabet here!)

To estimate the loss, the factor by which the number of bits for the longest code
is greater than the number of bits needed for simply storing symbols
is a good measure.


The original poster also wanted to know codetable size for a
*canonical* code:
If you omit the leading zeros/ones (which one depends on whether the
shortest code is all zeros or ones) you can store codelength+trailing bits
pairs in memory. The number of trailing bits needed is limited by
log2(alphabetsize); see my huffman coding webpage
http://www.compressconsult.com/huffman/ for details.

Michael
From: "DI Michael Schindler" 
Subject: Re: huffman code length
Date: 16 Jul 1999 00:00:00 GMT
Organization: Compression Consulting
Newsgroups: comp.compression,alt.comp.compression,sci.crypt,sci.math

hi!

the formula below is right only if you take the longest subtrees
in case of equality of weights. If you prefer shorter trees
the # of samples raises faster.
The reason is that in the case of this worst-case sample
distribution you add just about half the number of samples you
already have but need an additional bit.

Note the codelength is not important; important is the part of
the code that is nonconstant for coding and decoding issues.
This can be limited to ceil(log2(ALPHABETSIZE)) bits when using a
canonical code (which you should for decoding speed reasons).

Michael

For the "standard" code (where we prefer shorter trees by merging the least-recently-merged tree when there is equality of weights), the worst-case (pathological) codelength is still (ALPHABETSIZE-1) bits. The pathological case for 7 symbols (a tree with depth 6) is ((((((1 1) 1) 3) 4) 7) 11)

    length of file (# of samples)
               maximum codelength
     0..1       0
     2          1
     3..4       2
     5..8       3
     9..14      4
    15..24      5
    25..40      6
    41..66      7
    67..108     8
   109..176     9
   177..286    10
   287..464    11
   465..752    12
   753..1218   13
  1219..1972   14
  1973..3192   15
  3193..5166   16
  5167..8360   17
The width of each range is 1 more than the lower end of the previous range.
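As a cross-check, here is a small Python sketch (my own illustration, using an "older nodes first" tie-break, which prefers shorter trees) that builds a Huffman tree and reports its maximum depth; fed the pathological frequencies quoted above, it reports depth 6 for 7 symbols:

  import heapq

  def huffman_max_depth(freqs):
      """Build a Huffman tree from symbol frequencies; return its maximum depth."""
      # heap entries are (weight, age, depth-of-deepest-leaf-in-subtree);
      # the "age" tie-break merges older nodes first when weights are equal
      heap = [(w, i, 0) for i, w in enumerate(freqs)]
      heapq.heapify(heap)
      age = len(freqs)
      while len(heap) > 1:
          w1, _, d1 = heapq.heappop(heap)
          w2, _, d2 = heapq.heappop(heap)
          heapq.heappush(heap, (w1 + w2, age, max(d1, d2) + 1))
          age += 1
      return heap[0][2]

  # the pathological 7-symbol case quoted above: depth 6
  print(huffman_max_depth([1, 1, 1, 3, 4, 7, 11]))        # -> 6
  # 2**15 equally-likely symbols give a balanced tree of depth 15
  print(huffman_max_depth([1] * (2 ** 15)))                # -> 15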


I already knew about preferring shorter trees (easy to do by preferring
the least-frequently-merged nodes when building the tree during
compression), but somehow I neglected to take that into account when
building that table. Thanks for pointing out that flaw. Using
standard (prefer shorter trees) encoding,

    length of file (# of samples)
               maximum code length (standard encoding)
     0..1      0
     2         1
     3..4      2
     5..8      3
     9..14     4
    15..24     5
    25..40     6
    41..66     7
    67..108    8
     ...
  1973..3192  15
  3193..5166  16
  5167..8360  17

The width of each range is the lower end of the previous range.
(Did I get it right this time ?)

Unfortunately, this doesn't seem to make a significant difference (5166
characters vs. 4180 characters).

Using standard (prefer shorter trees) encoding,
the worst-case (pathological) case for 8 unique symbols is a depth of 7,
  (((((((1+1)+1)+3)+4)+7)+11)+18) = 46 symbols
which is almost, but not quite, the standard Fibonacci sequence.
Each frequency is the sum of the 2 previous frequencies,
but this sequence starts out "1 1 1 3" rather than "1 1".

But a depth of 7 can occur with only 41 symbols,
albeit with 9 unique symbols:
  (((((((1+1)+(1+1))+1)+4)+6)+10)+16) = 41
This is even less like the standard Fibonacci sequence.
Each frequency is the sum of the 2 previous frequencies,
but this sequence starts out
"1 1 1 1 1 4 6" rather than "1 1".

Thanks for telling me that, after the initial zeros, even the worst-
case canonical codes are relatively short. I didn't know that. I can
see that that would definitely simplify Huffman compressors and
decompressors. If I have time later, I will see if that really makes my
software (which already uses canonical codes) go any faster (or slower).

"DI Michael Schindler" 
mentioned:
> the formula below is right only if you take the longest subtrees
> in case of equality of weights. If you prefer shorter trees
> the # of samples raises faster.
> The reason is that in the case of this worst-case sample
> distribution you add just about half the number of samples you
> already have but need an additional bit.
>
> Note the codelength is not important; important is the part of
> the code that is nonconstant for coding and decoding issues.
> This can be limited to ceil(log2(ALPHABETSIZE)) bits when using a
> canonical code (which you should for decoding speed reasons).

...
> d_cary at my-deja.com wrote
...
> > the Fibonacci sequence of frequencies
> > gives the maximum possible code length.
> >
> >     length of file (# of samples)
> >                maximum code length
> >          2      1
> >      3...4      2
> >      5...7      3
> >      8...12     4
> >     13...20     5
> >     21...33     6
> >   ......      ...
> >   1597...2583  15
> >   2584...4180  16
> >   4181...      17

obsolete:

Date:  Mon, 31 May 1999 01:30:46 GMT
From:  d_cary 
Subject:  Re: Frequency distribution of Huffman encoding tree

Mok-Kong Shen  wrote:
...
> My example given for a balanced tree of 4 symbols appears to
> contradict your formula. (Your range is [0.25, 0.5]).
>
> M. K. Shen

You're right. The formula I posted was wrong. I would really like to
know the correct formula (and maybe even a URI or paper reference).

Maybe the correct formula is
    2^(-Hi-1) <= fi <= 2^(1-Hi)
where
    fi = ni/L = number of times "i" occurs / total length of file
    Hi = number of bits assigned by Huffman to the "i" code.
?

No, that's not right either. For the special case of only 2 symbols,
Huffman always assigns 1 bit per symbol, unless the probability of one
of the symbols is identically zero (and the other is identically 1). So
for the case of 2 symbols, one cannot find a tighter bound than
  0 < fi < 1.

I'm pretty sure a tighter bound can be found if one has more than 2
symbols with non-zero probability.

So the formula must depend not only on the frequency that a symbol
occurs, but also on how many symbols there are.


LZ77 and other copy-repeat algorithms

[FIXME: todo: write some code to try out the speculative ideas] [post a summary to the data compression wiki, leaving out the speculative ideas]

2009-01-06:DAV: ideas for short data compression: ... a variant of LZRW: rather than a fixed size for all files, perhaps allow the compressor to adapt the "offset" and "length" field sizes for the particular file to be compressed. When compressing short files, it's important to be able to compress byte pairs, which repeat far more often than byte triplets. (But I can imagine byte pairs being so rare when compressing large files that it's not worth dealing with them)

LZRW1-A and LZJB both use (in effect)

Contrary to what both Williams and Bonwick claim, 2 byte copies actually *would* save space (albeit only 1 bit each) -- 2 literals take 18 bits, but a 2 byte copy takes 17 bits.

Perhaps the tradeoff between "length" and "offset" should be made on a file-by-file basis.

Or perhaps go even further, and make "length" and "offset" independently adjustable on a file by file basis. This slows down the decompressor (because the "items" in the compressed file are no longer byte aligned, although we maintain byte alignment in the decompressed data). And it slows down the compressor a lot, because it must search for the optimum "offset" and "length" field sizes.

Does it make sense to somehow make the compressed phrase merely 8 bits, so even a 1 byte copy might take less space than a 1 byte literal (9 bits)?

Rather than the compressor attempting all possible variations of "offset" and "length", is there any way to more quickly calculate the optimum size?

... also: perhaps mash in a "universal code" (variable length) for either the offset or the length or both ... ... also: perhaps the complex "escape code" system used in pucrunch in order to reduce the expansion of completely random data ...

For example, using 10 bits for literals, and 9 bits for compressed phrases and for the most common single bytes (not the most common bytes in the source; the most common bytes that would otherwise be encoded as 10 bit literals). (Using 1 or 2 bits in the "control byte", and a single byte for the remainder, would keep the compressed code byte aligned.)

DAV's example for 9 bit compressed phrases:

compressed file:
  * special ID code
  * other metadata
  * # of "common single bytes" used
  * table of "most common single bytes"
  * # of lengths used
  * table of "most common lengths"
  * #bits used in offset field
  * zero or more items
  * CRC

Each item is either a literal item, a "substitute single byte" item, or a copy item. If the length field is, for example, 3 bits, then

  000: copy 1 byte
  001: copy 2 bytes
  010: copy 3 bytes
  011: copy 4 bytes
  100: if offset is one of the L special values, use a substitute literal. Otherwise, copy 6 bytes.
  101: copy 8 bytes.
  11x: x and the following 7 bits are a literal byte.

The "substitute single byte" only makes sense when the offset field is short enough that a "substitute single byte" is shorter than a literal item.

A single byte copy can be shorter than a literal -- for example, when the "length" is 3 bits and the "offset" is 5 bits (8 bits total), but a literal is 9 bits. (I'm not sure this happens often enough to actually save any bits. With short source texts, the overhead of the table may make it not worthwhile. With long source texts, where the overhead of the table is negligible, using those codes for more "copy" options may give a shorter compressed file than using those codes for "substitute bytes".)

To extend the range of those 5 bits of offset, the offset is not a literal offset in bytes, but a context offset -- given the previous byte, the Nth time the previous byte occurred.
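To make the item format above concrete, here is a minimal Python decoder sketch (my own illustration; it ignores the "substitute single byte" special values and the context-offset refinement, takes x as the most-significant bit of a literal, treats the offset as a plain distance back into the output, and assumes the decompressed size is known or the stream ends exactly on an item boundary). The default offset_bits=6 matches the "9 bits for compressed phrases" example above:

  class BitReader:
      def __init__(self, data):
          self.bits = "".join(format(b, "08b") for b in data)
          self.pos = 0
      def read(self, n):                        # read n bits, MSB first
          v = int(self.bits[self.pos:self.pos + n], 2)
          self.pos += n
          return v
      def left(self):
          return len(self.bits) - self.pos

  # length-field opcodes from the sketch above (opcode 100 is treated as a
  # plain "copy 6 bytes" here, since the substitute-literal case is skipped)
  COPY_LEN = {0b000: 1, 0b001: 2, 0b010: 3, 0b011: 4, 0b100: 6, 0b101: 8}

  def decode_items(data, offset_bits=6, out_size=None):
      r, out = BitReader(data), bytearray()
      while r.left() >= 3 and (out_size is None or len(out) < out_size):
          op = r.read(3)
          if op >= 0b110:                       # 11x: x plus the next 7 bits form a literal byte
              out.append(((op & 1) << 7) | r.read(7))
          else:                                 # copy item: 3-bit length code + offset field
              length, dist = COPY_LEN[op], r.read(offset_bits)
              for _ in range(length):           # byte at a time, so copies may overlap
                  out.append(out[-dist])        # dist is assumed to be a distance >= 1
      return bytes(out)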

Unlike Huffman compression of symbols emitted by a memoryless source, there are several valid ways to choose the "length" fields. For example, say a block 301 bytes long is exactly repeated: it can be broken up into copies of the available lengths in many different ways.

Various ways to optimize "continue copying from where we left off". Such a thing is common in English text. Some common cases include:

If we use special "extra long" lengths, the obvious thing is to make them powers of 2 ... but is that really optimum? Perhaps some other lengths would be better. For example, rather than 2^n (natural binary), use (2^n)-1 instead (Booth encoding). In particular, I think I would like to try the lengths used in Combsort11 : 1, 2, 3, 4, 6, 8, 11, ... with a shrink factor of around 1.24 ... 1.5 .

For fast decoding, look up lengths in a table or use an easily-calculated function (see the sketch below):
  * f(x) = (1<<x) ("obvious powers of 2"): 1, 2, 4, 8, 16, 32, ...
  * f(x) = floor(x*x / 4) (squaring, rather than the exponential I want ... but still better than linear): 0, 0, 1, 2, 4, 6, 9, 12, 16, 20, 25, 30, 36, ...
  * f(x) = floor(x*x / 8) (squaring, rather than the exponential I want ... but still better than linear): 0, 0, 0, 1, 2, 3, 4, 6, 8, 10, 12, 15, 18, ...
  * f(x) = ( 2 + (1&x) ) << (x>>1) (approximately powers of sqrt 2): 2, 3, 4, 6, 8, 12, 16, 24, 32 ...

If I hit a phrase that isn't exactly the same length as the available lengths, then I break it up into pieces, starting with the next-shortest phrase length and working down. But should I emit those pieces longest-to-shortest or shortest-to-longest? For example, should I break a phrase of length 10 up into 8, 2 or 2, 8 ? And if we're not doing powers-of-two, there seem to be several different ways to break them up: a phrase of length 10 could also be broken into 6, 4. Of course, this is all moot if we use a variable-length "universal code" for the length.
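Here is a quick Python sketch that just tabulates the candidate length functions listed above, so their growth rates can be compared side by side:

  length_functions = {
      "1 << x           ": lambda x: 1 << x,                      # obvious powers of 2
      "floor(x*x / 4)    ": lambda x: (x * x) // 4,               # quadratic
      "floor(x*x / 8)    ": lambda x: (x * x) // 8,               # slower quadratic
      "(2+(1&x))<<(x>>1) ": lambda x: (2 + (1 & x)) << (x >> 1),  # ~ powers of sqrt(2)
  }

  for name, f in length_functions.items():
      print(name, [f(x) for x in range(13)])
  # 1 << x            [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
  # floor(x*x / 4)     [0, 0, 1, 2, 4, 6, 9, 12, 16, 20, 25, 30, 36]
  # floor(x*x / 8)     [0, 0, 0, 1, 2, 3, 4, 6, 8, 10, 12, 15, 18]
  # (2+(1&x))<<(x>>1)  [2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 128]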

Benchmark Images; Benchmark Files

Lots of standard test images and standard test text files, plus comparisons of various compression programs.

short data compression

Many data compression algorithms (1D and 2D) are designed to work with large amounts of data, and compress them all-or-nothing.

These actually *expand* the data (by a few bytes) if one tries to use them for "short" snippets of text or other data less than a few hundred bytes long.

If one has a large database, those data compression algorithms can give excellent compression, but it takes a very long time to decompress the database and find something in it.

Here I've listed algorithms that are a bit less efficient on "large" blocks of data, but are appropriate for "small" amounts of data (or for random-access indexing into large databases with small amounts of data at each entry). Perhaps you could think of them as 0D.

One might think that low-latency algorithms must all be 0D, operating on small blocks of data, but [elsewhere] there are many "single-pass" and "adaptive" algorithms with low latency (such as LZW compression, adaptive Huffman compression, etc.).

[FIXME: split out into separate section] I also have some "small" data compression algorithms, designed to be used on slow machines with very little RAM and ROM. There are several applications, and several variants that trade off speed, RAM, and ROM. One semi-popular application is "I want to squeeze all kinds of cool stuff into my video game, but I want it all to fit on one disk / one ROM. Oh, and I want it to boot fast, too.". We want to create a bootable executable on that disk that loads the rest of the info off the disk and decompresses it. In this case, we want to minimize the total (decompression binary + compressed video game data) size -- since smaller size means it takes less time to get it off the disk. On one hand, we want the algorithm very simple (so that the decompression binary is small); on the other hand, we want the algorithm very sophisticated (so that the compressed video game data is small).

representing integers

What are the different ways of representing integers ?

see fixme - floating point; endian

[FIXME: this leaves out gray codes, PN shift codes ...]

[FIXME: point to different ways of representing real numbers -- IEEE floating point, continued fraction representation, factorial notation, etc.]

[FIXME: summarize this conversation]


From: CBFalconer 
Subject: Re: compact storage of integer values
Date: 01 Feb 2001 00:00:00 GMT
Approved: clc at plethora.net
Organization: AT&T Worldnet
Reply-To: cbfalconer at worldnet.att.net
Newsgroups: comp.lang.c.moderated

Harald Kirsch wrote:
> > Situation: a HUGE (no HHHUUUGGGEEE) number of mostly small data
> elements must be kept in memory. Their size is stored along with
> them. 7 bit are sufficient to hold the size most of the
> time. Nevertheless larger sizes can be expected.
> > I came up with the following scheme, where 7 should be read (CHAR_BIT-1):
> > If size fits in 7 bit, store it.
> If size fits in n*7 bits:
> store (n-1) times 7 bit in a char and set the 8th bit
> store the last 7 bit in the last char (without 8th bit set)
> > The idea is to use the high-bit of a char to indicate if more bytes
> follow.

I suggest you first evaluate what the MAX value can be.  I suspect
you don't need the infinite linking.  If MAX is expressible in N
bytes, then use 0..255-N+1 as values, and M = N+2 .. 255 to
signify that the following M-N bytes hold a value.

I think this has less wastage, crams more into the 1 byte
representation, etc.

--
Chuck F (cbfalconer at my-deja.com) (cbfalconer@XXXXworldnet.att.net)
http://www.qwikpages.com/backstreets/cbfalconer
   (Remove "NOSPAM." from reply address. my-deja works unmodified)
   mailto:uce@ftc.gov  (for spambots to harvest)
--
comp.lang.c.moderated - moderation address: clcm at plethora.net

[This is in response to a suggestion by
Hans-Bernhard Broeker and Andy Isaacson
 who (independently ?) proposed a specialized version
 of this general scheme with
N = 128
which is simple to decode:
 hi bit = 0: this byte is a small literal (0...127).
 hi bit = 1: clear the hi bit, then use the result as the number of 8 bit chunks following.
]
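A minimal Python sketch of that simpler scheme (hi bit clear = literal 0..127; hi bit set = count of 8-bit chunks that follow); the function names are mine:

  def encode_uint(n):
      """Encode a non-negative integer using the scheme above."""
      if n < 128:
          return bytes([n])                     # hi bit clear: literal 0..127
      chunks = (n.bit_length() + 7) // 8        # number of 8-bit chunks needed
      return bytes([0x80 | chunks]) + n.to_bytes(chunks, "big")

  def decode_uint(buf, pos=0):
      """Return (value, position of the next encoded integer)."""
      first = buf[pos]
      if first < 0x80:
          return first, pos + 1
      chunks = first & 0x7F                     # hi bit set: chunk count
      return int.from_bytes(buf[pos + 1:pos + 1 + chunks], "big"), pos + 1 + chunks

  for n in (0, 127, 128, 300, 70000, 2 ** 40):
      enc = encode_uint(n)
      assert decode_uint(enc)[0] == n
      print(n, enc.hex())      # e.g. 300 -> "82012c" (1 length byte + 2 value bytes)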



To: "John S. Fine" <johnfine at erols.com>, Pasi Ojala <albert at cs.tut.fi>, Antaeus Feldspar <feldspar at cryogen.com>
From: David Cary <d.cary at ieee.org>
Subject: "little workspace" decompressors


Dear "John S. Fine" <johnfine at erols.com>
and Pasi Ojala <albert at cs.tut.fi>
and Antaeus Feldspar <feldspar at cryogen.com>,

I saw you all in the comp.compression newsgroup.

You all seem to be interested in "little workspace" decompressors.
I just thought I'd point out your web pages to each other.
I am really impressed with the clever scheme Pasi Ojala came up with.

How does the Elias Gamma codes (that Pasi Ojala re-invented) compare
with the functionally-similar variable-length code John Fine invented ?
Both map more common (smaller value) integers to shorter (in bits) prefix codes.
(Of course, knowing the actual value frequencies and applying huffman
gives one of several optimum encodings. What are the actual value frequencies for typical files ?).
For this application,
we also want a code that a very small/fast routine can decode
(slightly less-than-optimum encoding can trade off for smaller/faster decode routine).

>From:         "John S. Fine" <johnfine at erols.com>
>Take bits in triplets from high order to low order, with the rightmost
>bit in each triplet used as a stop bit:
>001=0  011=1 ... 111=3 000001=4, 000011=5 ... 010001=8 ... 110111=19
>... 000000001=20 ... 110110111=83 etc.  It may not be obvious, but
>the above can be decoded in a trivial amount of x86 assembler code
>with no tables or workspace.

Pasi Ojala < albert at cs.tut.fi >
http://www.cs.tut.fi/~albert/Dev/pucrunch/
some interesting thoughts on data compression --
-- optimizing for very low-RAM decompression agents -- minimizing *both*
decompression *code* as well as intermediate data used in decompression.
("in-place" decompression; the compressed data includes the decoder program;
RAM is smaller than the total compressed data + total decompressed data)
Includes source code implementing his ideas.

>Subject:      Comment on these compression ideas, please
>From:         "John S. Fine" <johnfine at erols.com>
>Date:         1998/06/27
>Newsgroups:   comp.compression
>...
>http://www.erols.com/johnfine/

I'm thinking about building very tiny embedded processors,
that instead of communicating status via LEDs or a tiny LCD panel,
use the serial port to
send full-fledged HTML pages (with pointers to pretty graphics on some other machine)
as status; perhaps even feedback forms for control.
Since these processors have only a tiny amount of RAM (but slightly larger amounts of ROM),
Antaeus Feldspar has a good idea to try to apply a bit of compression.

p.s.: Given a device spewing HTML pages out its serial port,
do any of you know the techy details involved in actually displaying them
on a commodity PC using some web browser ?
I *could* use a terminal program, capture-as-text, save,
then load into my web browser -- but surely there's a better way.












From: "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
Subject: Re: "little workspace" decompressors
To: d.cary at ieee.org (David Cary)
Date: Fri, 10 Jul 1998 11:53:14 +0300 (EET DST)
Cc: johnfine at erols.com, albert at cs.tut.fi, feldspar at cryogen.com
MIME-Version: 1.0

> How does the Elias Gamma codes (that Pasi Ojala re-invented) compare
> with the functionally-similar variable-length code John Fine invented ?

(Note: do not swallow before biting a bit. I may jump to conclusions here.)

John's code assumes (and benefits from) more flat distribution.
The code can be transformed into:
   0^n 1 (bb)^(n+1)
by rearranging the "stop bits" to the beginning of the code. So, in fact
we have a unary code which tells us how many actual bits to read (times 2).
(Btw, rearranging the code like this may even make it easier and faster
 to decode.)

We can compare this to the Elias Gamma Code:
   0^n 1 b^n

The code lengths for the first couple of values become
  Gamma:  1 3 3 5 5 5 5 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 ...
  John's: 3 3 3 3 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 9 9 9 ...
So, the result depends entirely on the value distribution.

> What are the actual value frequencies for typical files ?

_My_ typical frequencies were very skewed, that's why I selected
Gamma Code. Your typical files probably differ :-)

Oh, and btw, my literal tag system automatically takes advantage
of 7-bit literals and literal locality.

-Pasi
--
"We believe that no race can be truly intelligent without laughter."
	-- Delenn to Sheridan in Babylon 5:"A Race Through Dark Places"








Date: Sun, 12 Jul 1998 08:52:08 -0400
From: John Fine <johnfine at erols.com>
Reply-To: johnfine at erols.com
MIME-Version: 1.0
To: "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
CC: David Cary <d.cary at ieee.org>, feldspar at cryogen.com
Subject: Re: "little workspace" decompressors

Ojala Pasi 'Albert' wrote:
>
> > How does the Elias Gamma codes (that Pasi Ojala re-invented) compare
> > with the functionally-similar variable-length code John Fine invented ?

  There are actually a whole range of variable-length codes
that you can get by unary coding a number U and having
(aU+b) data bits.  Note that some of the value is carried
in the length bits, because you never use a longer code
than is needed for a given value.

  If I understand correctly, Elias Gamma uses (a=1, b=-1).
My code has (a=2, b=0).  (Note, I am taking U to be the
total number of length bits, not the value encoded by them).

> John's code assumes (and benefits from) more flat distribution.

  Actually I used that flatter one for offset and a code more
like Elias Gamma (which I had never heard of) for length.

> by rearranging the "stop bits" to the beginning of the code. So, in fact
> we have a unary code which tells us how many actual bits to read (times 2).
> (Btw, rearranging the code like this may even make it easier and faster
>  to decode.)

  I am fairly sure that keeping the length bits and content
bits mixed results in smaller and simpler code (probably
faster as well).  Even at the one to one ratio of Elias
Gamma, I think mixing the bits would give simpler code.

--
http://www.erols.com/johnfine/
http://www.geocities.com/SiliconValley/Peaks/8600/







Date: Mon, 13 Jul 1998 11:45:56 -0400
From: "John S. Fine" <johnfine at erols.com>
Reply-To: johnfine at erols.com
MIME-Version: 1.0
To: David Cary <d.cary at ieee.org>
CC: "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
Subject: Re: "little workspace" decompressors

David Cary wrote:

> They all unary code a number U >= 0 with U zeros, then a one, followed by
> (aU+b) data bits (where b >= 0, and usually a > 0), for a total length of
>   (U(a+1) + 1 + b) bits.

  I was using U > 0 including the "stop" bit;  But it describes the
same thing, so we can use your form.

> This plain flavor of the Elias Gamma (a=1, b=0) code seems easy to decode
> since the code *is* the value of the code; one just needs to know how many
> bits to grab. But I suspect rearranging a la John's method creates a decode
> routine much shorter and faster than the routine needed to decode the plain

  I would guess a LITTLE shorter and MAYBE faster.

> On the other hand, I just discovered some codes [*] that are *better* than
> the these (aU+b) codes in the sense that, no matter what the distribution
> of integers (as long as it is monotonic), the compressed file will be
> smaller than if we used any of these (aU+b) codes.

  I don't think that is possible.

> Perhaps something like this would be close enough to fibonacci,
> yet still be easy to decode:
> 000 to 101 are the 6 different terminal codes
> (use base 6: multiply accumulated total by 6, add 0..5 to make final result)
> 110 to 111 are the 2 different non-terminal codes.
> (use base 2: multiply accumulated total by 2, then add 0 or 1)

> Since (we assume) smaller numbers are always *more* common than large numbers,
> the 2 codes where "base 6" is worse than John's "stop bit" code
> are less common than the 2 codes where "base 6" is better than John's "stop
> bit" code -- the net effect is using this "base 6" code compresses to fewer
> bits than using the "stop bit" code -- no matter what the actual
> distribution is.

  You stopped too early in examining the distribution.  You have simply
picked a code that does low numbers slightly better and high numbers
MUCH worse (you thought it did high numbers SLIGHTLY worse because you
stopped too soon).

  Compare 3-bit codes with 6 terminal values to three bit codes with
4 terminal values.  This table shows the total number of values that
can be represented in N or fewer bits.

bits   6T   4T
 3      6    4
 6     18   20
 9     42   84
12     90  340

--
http://www.erols.com/johnfine/
http://www.geocities.com/SiliconValley/Peaks/8600/










From: "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
Subject: Re: "little workspace" decompressors
To: johnfine at erols.com
Date: Wed, 15 Jul 1998 13:26:21 +0300 (EET DST)
Cc: albert at cs.tut.fi, d.cary at ieee.org, feldspar at cryogen.com
MIME-Version: 1.0

> I am fairly sure that keeping the length bits and content
> bits mixed results in smaller and simpler code (probably
> faster as well).

Well, of course this depends on the architecture of the chip
performing the decode. In my case the most effective way to
read bits is one at a time (through Carry), thus they _are_
read one at a time.

There is of course the loop overhead, which could be reduced
by making two calls to the getbit routine. The decode code would
become slightly larger (6 bytes ~ 2%) because I still need the
b^N routine for the linear code decode.

And after really thinking about it, if and when you can easily
handle multiple bits at a time (barrel shifter available),
using this stop-bit-arrangement does indeed seem faster (less
jumps per bit).

> I think mixing the bits would give simpler code.

Unary + linear is simpler code. If you mean "faster/simpler to decode",
then you are absolutely right. I may even try it myself, although my
main concern is the size of the decoder. I would need a definite
speed increase (and there are several ways to increase the speed
of my decoder, but they need a longer decoder).

-Pasi







To: johnfine at erols.com, "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
From: David Cary <d.cary at ieee.org>
Subject: Re: "little workspace" decompressors

Re: "little workspace" decompressors

Yes, rearranging the bits makes no difference in how many bits there are
(how big the output file will be).
I think John is right in saying that, given any particular code,
if we scramble up the bits we can make the decoder program shorter and faster.

I never considered scrambling the bits before,
but now I think it's cool.
I think that all the (aU+b) codes John mentioned
have a (at least one) "easy-to-decode" scrambled version.

They all unary code a number U >= 0 with U zeros,
then a one, followed by (aU+b) data bits (where b >= 0, and usually a > 0),
for a total length of
  (U(a+1) + 1 + b) bits.
(Ojala Pasi uses the equivalent U ones,
then a zero -- I guess that was easier to decode for him).
The Elias gamma code has a = 1, b = 0. Plain 7-bit ASCII uses a=0, b=7. John's code has a=b=2.

Perhaps it would be interesting
for the compressor program to select some "optimum" value of a and b
for the particular file being compressed.
Especially when, as with the Pasi Ojala algorithm,
the decompressor program is embedded in the compressed data
and can be optimized for that particular value of a and b.
Perhaps the compressor would only try a couple of values for a and b
that are particularly easy to decode.
For example, we could generalize John's code to all the b=a codes:
scramble 1 "continue" bit with (a) bits of data to make fixed-length blocks of (a+1) bits
(where a=2 is John's code).
(One could brute-force try compressing with a=1, then a=2, then a=3.
Perhaps there is an efficient algorithm for the compressor to decide which a is optimal ...
depends heavily on the distribution of values ...
we want a code that "looks like" the optimal huffman code)

This plain flavor of the Elias Gamma (a=1, b=0) code seems easy to decode
since the code *is* the value of the code;
one just needs to know how many bits to grab.
But I suspect rearranging a la John's method
creates a decode routine much shorter and faster
than the routine needed to decode the plain version.

         plain      scrambled (1x = last block)
  1          1              1
  2        010           0 10
  3        011           0 11
  4      00100        0 00 10
  5      00101        0 00 11
  6      00110        0 01 10
  7      00111        0 01 11
  8    0001000     0 00 00 10
  9    0001001     0 00 00 11
  A    0001010     0 00 01 10
  B    0001011     0 00 01 11
  C    0001100     0 01 00 10
  D    0001101     0 01 00 11
  E    0001110     0 01 01 10
  F    0001111     0 01 01 11
 10  000010000  0 00 00 00 10
 11  000010001  0 00 00 00 11

Both codes are, of course, identical in bit length,
number of zeros, number of ones, etc. --
the only difference is in the size and speed of the decompression program.
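A short Python sketch (my own reconstruction from the table above) of the plain Elias gamma code and the "scrambled" rearrangement; it reproduces both columns and round-trips the scrambled form:

  def gamma_plain(n):
      """Plain Elias gamma: floor(log2 n) zeros, then n in binary."""
      body = bin(n)[2:]                      # binary of n, starts with a 1
      return "0" * (len(body) - 1) + body

  def gamma_scrambled(n):
      """The same bits rearranged: each data bit is prefixed by a flag,
      0 = another block follows, 1 = this is the last block."""
      data = bin(n)[3:]                      # the bits of n after the leading 1
      if not data:
          return "1"                         # n == 1: no data blocks at all
      flags = "0" * (len(data) - 1) + "1"
      return "0" + "".join(f + d for f, d in zip(flags, data))

  def gamma_scrambled_decode(bits):
      """Decode one scrambled codeword from a string (or stream) of '0'/'1'."""
      it = iter(bits)
      n = 1
      if next(it) == "1":
          return n
      while True:
          flag, d = next(it), next(it)
          n = n * 2 + int(d)
          if flag == "1":
              return n

  for n in range(1, 18):
      assert gamma_scrambled_decode(gamma_scrambled(n)) == n
      print("%3X  %9s  %s" % (n, gamma_plain(n), gamma_scrambled(n)))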

On the other hand, I just discovered some codes [*]
that are *better* than these (aU+b) codes in the sense that,
no matter what the distribution of integers (as long as it is monotonic),
the compressed file will be smaller than if we used any of these (aU+b) codes.
Unfortunately, I have not (yet) figured out a small routine to decode these codes.
I suppose I could just have a lookup table for the first N codes,
then revert back to John's code if it is one of the rare items not in the lookup table.
This may be a win in size (if the extra compaction in the data compensates for the larger program)
and speed (if the most common table-lookup is faster than bit-by-bit decoding),
but it just doesn't seem elegant.

I've been trying to think of how to take advantage of "edge effects"
in small-file compression.
As a particular example, after the decompressor has decoded, say, 200 bytes,
if it were really intelligent then it would "know" that an offset value in an (offset, length) LZ77 code
couldn't possibly be more than 200 bytes;
so perhaps we can somehow shorten the length (in bits) of the codes that *are* possible.
More generally, it doesn't seem "right" to me
that most compression algorithms get significantly different compressed file sizes
on a file vs. the time-reversed image of that file.
Huffman is the only algorithm I know of that gets exactly the same compressed file size either way.
If I could somehow make the conditions when the decompressor is near the end of the file
"more like" the conditions when it is near the start of the file,
perhaps I could squeeze out a few more bits of compression.
The length of the uncompressed file is usually known to the compressor before compression even begins;
the decoder often has a rough estimate
(and sometimes the exact value) of the total compressed code length and the total decompressed data length;
perhaps it can take advantage of this meta-information...

[*] The Elias delta and the Fibonacci codes.
2 radically different ways of overcoming the inherent inefficiency of the unary number encoding.
Elias delta has 3 parts: a unary number encoding U, then 2^N bits encoding V,
then 2^V bits encoding the actual desired number.
I think I remember reading that Elias delta was already "optimal", in some sense --
there was no advantage to going to 4 parts.
(But I don't really understand why that was true).

Fibonacci codes,
rather than having a unary code of 0 bits that ends at the 1st 1 bit,
instead has a more compact code that ends the first time it hits 2 consecutive 1 bits.
This is similar to the "bit-stuffing" in most long-distance communication protocols
(and CD-ROM format and some magnetic disk formats),
where the data is not allowed to have N 0 bits in a row (because the receiver would lose clock sync).
*Every* time there are (N-1) "0 bits" in a row, the transmitter *always* stuffs a "1 bit".
When decoding, every time there are (N-1) "0 bits", always throw away the following "1 bit".
Is there an efficient way of decoding these values (perhaps by scrambling the bits) ?
Or a code similar to this (better than (aU+b) ) that can be decoded with a short program ?
(perhaps read pairs of bits, and stop when both bits equal 11 -- meanwhile decode using base 3 ?).

I first read about Elias delta and Fibonacci codes at
  http://www.ics.uci.edu/~dan/pubs/DC-Sec3.html#Sec_3.3
>3.3 Universal Codes and Representations of the Integers
...
>The first Elias code is one which is simple but not optimal. This code, gamma
...
>Elias delta
[is optimal]
...
>While the Fibonacci codes are not asymptotically optimal,
they compare well to the Elias codes as long as the number of source messages is not too large.
Fibonacci codes have the additional attribute of robustness,
which manifests itself by the local containment of errors.
...
>a Fibonacci code provides better compression than the Elias code
until the size of the source language becomes very large.
...
>We describe only the order 2 Fibonacci code;
the extension to higher orders is straightforward.
>
>N             R(N)               F(N)
>
> 1                           1   11
> 2                       1   0   011
> 3                   1   0   0   0011
> 4                   1   0   1   1011
> 5               1   0   0   0   00011
> 6               1   0   0   1   10011
> 7               1   0   1   0   01011
> 8           1   0   0   0   0   000011
...
>16       1   0   0   1   0   0   0010011
...
>32   1   0   1   0   1   0   0   00101011
>
>    21  13   8   5   3   2   1
>
>Figure 3.7 -- Fibonacci Representations and Fibonacci Codes.
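A small Python sketch of the order-2 Fibonacci code (greedy Zeckendorf decomposition, least-significant Fibonacci digit first, then an appended 1 that completes the "11" terminator) which reproduces Figure 3.7:

  def fib_encode(n):
      """Order-2 Fibonacci code of a positive integer n."""
      fibs = [1, 2]                          # the Fibonacci numbers used for coding
      while fibs[-1] < n:
          fibs.append(fibs[-1] + fibs[-2])
      if fibs[-1] > n:
          fibs.pop()
      bits, remainder = [], n
      for f in reversed(fibs):               # greedy Zeckendorf decomposition
          if f <= remainder:
              bits.append("1")
              remainder -= f
          else:
              bits.append("0")
      # least-significant Fibonacci digit first; the appended 1 completes the "11"
      return "".join(reversed(bits)) + "1"

  for n in (1, 2, 3, 4, 5, 6, 7, 8, 16, 32):
      print(n, fib_encode(n))
  # 1 11 / 2 011 / 3 0011 / 4 1011 / 5 00011 / 6 10011 / 7 01011
  # 8 000011 / 16 0010011 / 32 00101011   (matches Figure 3.7)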


Perhaps something like this would be close enough to fibonacci,
yet still be easy to decode:
000 to 101 are the 6 different terminal codes
(use base 6: multiply accumulated total by 6, add 0..5 to make final result)
110 to 111 are the 2 different non-terminal codes.
(use base 2: multiply accumulated total by 2, then add 0 or 1)

      base 6   "stop bit"

         000          001
         001          011
         010          101
         011          111
         100      000 001
         101      000 011
     110 000      000 101
     110 001      000 111
     110 010      010 001
     110 011      010 011
     110 100      010 101
     110 101      010 111
     111 000      100 001
     111 001      100 011
     111 010      100 101
     111 011      100 111
     111 100      110 001
     111 101      110 011
 110 110 000      110 101
 110 110 001      110 111
 110 110 010  000 000 001
 110 110 011  000 000 011
 110 110 100  000 000 101
 110 110 101  000 000 111
 110 111 000  000 010 001
...
Since (we assume) smaller numbers are always *more* common than large numbers,
the 2 codes where "base 6" is worse than John's "stop bit" code
are less common than the 2 codes where "base 6" is better than John's "stop bit" code --
the net effect is using this "base 6" code compresses to fewer bits than using the "stop bit" code --
no matter what the actual distribution is.
But this probably increases the complexity of the decoder.

>From: John Fine <johnfine at erols.com>
...
>  There are actually a whole range of variable-length codes
>that you can get by unary coding a number U and having
>(aU+b) data bits.  Note that some of the value is carried
>in the length bits, because you never use a longer code
>than is needed for a given value.
...
>  Actually I used that flatter one for offset and a code more
>like Elias Gamma (which I had never heard of) for length.
...
>  I am fairly sure that keeping the length bits and content
>bits mixed results in smaller and simpler code (probably
>faster as well).  Even at the one to one ratio of Elias
>Gamma, I think mixing the bits would give simpler code.
>
>--
>http://www.erols.com/johnfine/
>http://www.geocities.com/SiliconValley/Peaks/8600/


>From: "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
...
>The code lengths for the first couple of values become
>  Gamma:  1 3 3 5 5 5 5 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 ...
>  John's: 3 3 3 3 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 9 9 9 ...
>So, the result depends entirely on the value distribution.
...
>-Pasi

(posted to comp.compression; also mailed direct)

--
+ David Cary "http://www.ionet.net/~caryd_osu/david"
| Future Tech, Unknowns, PCMCIA, digital hologram, <*> O-











To: "John S. Fine" <johnfine at erols.com>, Pasi Ojala <albert at cs.tut.fi>, Antaeus Feldspar <feldspar at cryogen.com>
From: David Cary <d.cary at ieee.org>
Subject: Re: "little workspace" decompressors


Yes, John is quite right-- the 6T code I just made up gets worse very quickly at high numbers.
If I were lucky enough to have an input file that needed these large numbers extremely rarely,
then my 6T code would do a better job compression than the 4T code;
but if the data has a more even distribution, then the 4T code would be better.

Is this true for the Fibonacci codes as well --
are they always better, as I originally claimed, or is it data-dependent ?

1st hand:
It *seems* like Fibonacci would always be better --
if one looks at all the length N unary codes (there is only 1: (N-1) zeroes, followed by a 1),
and compares it to all the length N Fibonacci codes
(which include the unary code,
plus some other codes with lots of zeros and single "1" bits scattered through them,
followed by a "11"),
it seems there are always more Fibonacci codes of a specific length
(which means it can represent more data).
(I'm having a little difficulty coming up with
a formula predicting exactly how many Fibonacci codes there are of length N).

2nd hand:
On the other hand, one can use the "pigeon-hole",
"counting" argument --
since, when decoding, all possible bit sequences mean *something* unique,
there is no redundancy.
It would then seem that one cannot guarantee that one code will always get better compression than another.
In fact, one can pick any code,
then synthesize a artificial data file to make that code look the best
by using numbers with a frequency proportional to 1/(#bits needed to represent that number in this code).

How do I resolve the contradiction between these 2 hands ?

U
	U-bit unary codes
			U-bit Fibonacci codes

1	1: 1		0
2	1: 01		1: 11
3	1: 001		1: 011
4	1: 0001		2: 0011, 1011
5	1: 00001	3: 00011, 01011, 10011
6	1: 000001	5: 000011, 001011, 010011, 100011, 101011
u	1: 0...01	f(u) (is there a formula ?)

(neglecting the aU+b data bits that follow the code)

Hm... if there were *always* at least as many Fibonacci codes as unary codes
(the column at right always had at least 1, instead of that pesky 0 at the top)
then the 1st hand paragraph would be true.
What sorts of distributions would compress better using unary codes ?
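A brute-force Python sketch (my own) to fill in the f(u) column of the table above: enumerate the u-bit strings that end in "11" and contain no earlier "11". The counts come out 0, 1, 1, 2, 3, 5, 8, 13, 21, ... which suggests f(u) is simply a Fibonacci number:

  from itertools import product

  def fib_codewords(u):
      """All u-bit Fibonacci codewords: strings ending in '11'
      with no earlier occurrence of '11'."""
      words = []
      for bits in product("01", repeat=u):
          w = "".join(bits)
          if w.endswith("11") and "11" not in w[:-1]:
              words.append(w)
      return words

  for u in range(1, 10):
      print(u, len(fib_codewords(u)), " ".join(fib_codewords(u)[:5]))
  # counts: 0, 1, 1, 2, 3, 5, 8, 13, 21 -- the Fibonacci sequence again,
  # so f(u) appears to be the (u-1)th Fibonacci number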

Which of these codes
(Fibonacci, Fibonacci aU+b, unary aU+b, Elias delta)
has the best match with the 1/f distribution
typical of the distribution of words in natural languages --
and, so I hear, the distribution of patterns in DNA ?

  http://www.ics.uci.edu/~dan/pubs/DC-Sec3.html#Sec_3.3
>3.3 Universal Codes and Representations of the Integers
...
>We describe only the order 2 Fibonacci code; the extension to higher orders is straightforward.
>
>N             R(N)               F(N)
>
> 1                           1   11
> 2                       1   0   011
> 3                   1   0   0   0011
> 4                   1   0   1   1011
> 5               1   0   0   0   00011
> 6               1   0   0   1   10011
> 7               1   0   1   0   01011
> 8           1   0   0   0   0   000011
...
>16       1   0   0   1   0   0   0010011
...
>32   1   0   1   0   1   0   0   00101011
>
>    21  13   8   5   3   2   1
>
>Figure 3.7 -- Fibonacci Representations and Fibonacci Codes.


>Date: Mon, 13 Jul 1998 11:45:56 -0400
>From: "John S. Fine" <johnfine at erols.com>
>To: David Cary <d.cary at ieee.org>
>CC: "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
>Subject: Re: "little workspace" decompressors
>
>David Cary wrote:
...
>> On the other hand, I just discovered some codes [*] that are *better* than
>> the these (aU+b) codes in the sense that, no matter what the distribution
>> of integers (as long as it is monotonic), the compressed file will be
>> smaller than if we used any of these (aU+b) codes.
>
>  I don't think that is possible.
>
>> Perhaps something like this would be close enough to fibonacci,
>> yet still be easy to decode:
>> 000 to 101 are the 6 different terminal codes
>> (use base 6: multiply accumulated total by 6, add 0..5 to make final result)
>> 110 to 111 are the 2 different non-terminal codes.
>> (use base 2: multiply accumulated total by 2, then add 0 or 1)
>
>> Since (we assume) smaller numbers are always *more* common than large numbers,
>> the 2 codes where "base 6" is worse than John's "stop bit" code
>> are less common than the 2 codes where "base 6" is better than John's "stop
>> bit" code -- the net effect is using this "base 6" code compresses to fewer
>> bits than using the "stop bit" code -- no matter what the actual
>> distribution is.
>
>  You stopped too early in examining the distribution.  You have simply
>picked a code that does low numbers slightly better and high numbers
>MUCH worse (you thought it did high numbers SLIGHTLY worse because you
>stopped too soon).
>
>  Compare 3-bit codes with 6 terminal values to three bit codes with
>4 terminal values.  This table shows the total number of values that
>can be represented in N or fewer bits.
>
>bits   6T   4T
> 3      6    4
> 6     18   20
> 9     42   84
>12     90  340
>
>--
>http://www.erols.com/johnfine/
>http://www.geocities.com/SiliconValley/Peaks/8600/









Date: Thu, 30 Jul 1998 10:11:51 -0400
From: "John S. Fine" <johnfine at erols.com>
To: David Cary <d.cary at ieee.org>
Subject: Re: "little workspace" decompressors

David Cary wrote:

> Is this true for the Fibonacci codes as well -- are they always better, as
> I originally claimed, or is it data-dependent ?

bits   Fib   6T    4T
1        -    -     -
2        1    -     -
3        2    6     4
4        4    -     -
5        7    -     -
6       12   18    20
7       20    -     -
8       33    -     -
9       54   42    84
. . .
12     232   90   340
15     986  154  1364

  Maybe I misunderstand "better".  The values in the
table above are cumulative.  In 9 or fewer bits, fib
can represent 54 different values; 4T can represent
84 different values.

--
http://www.erols.com/johnfine/
http://www.geocities.com/SiliconValley/Peaks/8600/


lossy text compression

[this section is very unfinished ...]

Currently, good-quality lossy text compression can only be done by humans. It's one of the few remaining areas where humans are clearly superior to silicon-based computers.

random ideas

I've been thinking about "conflation" (``confounded'' ?), mixing together unrelated items, and how to undo this (in hopes that such a transformation could lead to better compression).

From: d_cary < at my-dejanews.com >
Subject: undoing conflation
Date: 28 Oct 1998 00:00:00 GMT
Newsgroups: comp.compression



Assume that I have some pseudo-English text such that each word is a
standard English word and occurs with pretty much the standard frequency for
that word in English text, but each word has absolutely no correlation to the
next ("memoryless").
(There are, of course, strong correlations between the letters inside a word,
as well as common word prefixes, common word suffixes, etc).

The best possible compression I could hope for would be to
use each word as a "symbol" and compress one symbol at a time
with arithmetic coding (marginally better than Huffman coding), right ?
(I assume the file is very big relative to the size of the frequency table).

If I scramble each word with any reversible word->integer transformation
before compressing,
the compressed file will be exactly the same size
as the one I created without this transformation, right ?

If I just randomly selected clumps of letters
(say, 4 letters at a time), paying no attention to the word boundaries,
compressing one clump (one symbol) at a time with arithmetic (or Huffman)
compression, I think the compressed file would be much worse (right ?)
than the previous 2 files.
If I scramble each clump with any reversible clump->integer transformation,
creating a "conflated file",
the compressed file will be exactly the same ("much worse") size, right ?
(Is there a better term for this than "conflated file" ?).

Here's the question: Say I have a "conflated file" -- a sequence of (short)
integers with the above characteristics. How do I compress it (losslessly) as
small as possible ? Rather than immediately hitting it with arithmetic coding,
the best thing to do is the inverse integer->clump transformation, then
compress one word (symbol) at a time with Huffman coding. But I don't know
which clump->integer transformation was used. Now what ? Is there some way I
could un-do (some of) the damage, somehow break the integers up into
variable-length strings (doesn't have to be the original text), re-group them
into "better" multi-byte words ? If I could do that, I could use the
above-mentioned "optimal" arithmetic coding.

If this were possible, one might even be able to apply it to compressing
standard text -- -- ideally, it would recognize that the letter "T" is
conflating the idea of "next letter is capitalized" with the idea of the
letter "t". Once we un-did that conflation, we would have a text file without
any Upper Case letters, but we would have to add a new symbol (I arbitrarily
choose A) to indicate that the next letter is a capital.

(conflated)
This is a test. If this were ...
(unconflated)
Athis is a test. Aif this were ...

Of course, now we have the extra overhead of having to stick extra
information in the compressed file telling the uncompressor, once the
"unconflated file" has been decompressed, how to re-conflate it so we can
recover the original conflated file I was given.

But the compressor can now recognize that both "This" and "this" are the same
word, and that the "A" symbol is almost always preceded by a space or a
newline, allowing it to create a smaller file.

Is it possible to get a net benefit on the
  Canterbury Corpus
? I hope so :-).

--
David Cary
http://www.rdrop.com/~cary/html/data_compression.html

-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/       Search, Read, Discuss, or Start Your Own
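A tiny Python sketch of the capitalization un-conflation described in the post above (I use "^" as the shift marker here instead of "A", just to keep the toy example unambiguous):

  MARK = "^"                 # hypothetical "next letter is a capital" marker

  def unconflate_caps(text):
      """Replace each capital letter with MARK + its lowercase form."""
      return "".join(MARK + c.lower() if c.isupper() else c for c in text)

  def reconflate_caps(text):
      """Inverse transformation: MARK + letter -> capital letter."""
      out, i = [], 0
      while i < len(text):
          if text[i] == MARK:
              out.append(text[i + 1].upper())
              i += 2
          else:
              out.append(text[i])
              i += 1
      return "".join(out)

  original = "This is a test. If this were ..."
  shifted = unconflate_caps(original)
  print(shifted)                          # ^this is a test. ^if this were ...
  assert reconflate_caps(shifted) == original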

Other compression-related ideas:

error detection and correction

ECC: error correcting codes

EDAC: error detection and correction

[FIXME: move all this information to "Data Link Error Detection / Correction Methods" http://massmind.org/techref/method/errors.htm ]

CRC: cyclic redundancy check (see spin_dictionary.html#crc for other CRC acronyms )

Data compression is often preparatory to communicating data over some "limited-bandwidth" real-time communications channel or storing it on limited storage media. After squeezing out most of the redundancy, we *deliberately* add more redundancy. All real-time communications channels add enough redundancy (in the form of headers, trailers, and CRC codes) to allow the errors to be detected, so the receiver can ask the transmitter to re-send the data. Some communications channels and most storage devices add even more redundancy (Hamming codes for DRAMs, Reed-Solomon codes for CD-ROMs, and other ECCs) to let us detect and correct errors at the receiver or media reader.

The simple hardware to do CRCs looks very similar to the simple hardware to do LFSR machine_vision.html#spread_spectrum
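For example, here is a minimal bit-at-a-time CRC-32 sketch in Python (the standard reflected polynomial 0xEDB88320, as used by zip and Ethernet); the shift-and-conditional-XOR inner loop is the same structure as an LFSR:

  import zlib

  def crc32_bitwise(data):
      """Bit-at-a-time CRC-32 (reflected, polynomial 0xEDB88320) -- a software LFSR."""
      crc = 0xFFFFFFFF
      for byte in data:
          crc ^= byte
          for _ in range(8):
              # shift the register; feed the polynomial back in when a 1 falls out
              crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
      return crc ^ 0xFFFFFFFF

  assert crc32_bitwise(b"123456789") == zlib.crc32(b"123456789") == 0xCBF43926
  print(hex(crc32_bitwise(b"123456789")))   # 0xcbf43926, the standard check value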

program compression

2001-12-01:DAV: started section. (Thanks to "Jerry Lim Han Sin" who encouraged me to write down my thoughts on program compression)

Lots of people complain about huge, bloated executables. What can we do about it ?

related local files:

Normally I (DAV dav_info.html) utterly despise machine/OS dependent stuff -- because this makes it far too difficult to *improve* my machine. But in this special case, it's cool -- executable files are inherently machine/OS dependent *anyway*, so it doesn't make things any worse to use techniques that also are machine/OS dependent.

Ways of compressing programs: [FIXME: is there a better terminology I could use than ``functional'' vs. ``non-functional'' ? ] ( "Metaprogramming and Free Availability of Sources: Two Challenges for Computing Today" article by François-René Rideau http://fare.tunes.org/articles/ll99/index.en.html mentions

)

"The Code Compression Bibliography" (currently maintained by Mario Latendresse) http://www.iro.umontreal.ca/~latendre/codeCompression/ uses the term "code compression" for what I call "program compression" or "executable compression". (Unfortunately, the term "code" conflicts with the jargon used in data compression).

DAV: The P. Koopman quote directly points to 3 ways of decreasing program size:

space-optimizing compiler

If you're using gcc or a similar compiler, consider using the

-Os

(optimize for size) compiler option.

[FIXME: move information on subroutine threading, direct threading, and indirect threading to here] [or to http://en.wikipedia.org/wiki/Threaded_code ]

In some cases (when the CALL instruction on this particular architecture is very compact, and ... it's just as compact to handle control structures in-line as it is to CALL subroutines to handle them), subroutine threading can be just as compact as the other kinds of threading.

compact instruction set

DAV wonders what a very compact instruction set would look like, and what a good tradeoff in size/speed/power would be.

[FIXME: how to properly split this topic between data compression and computer_architecture.html#considerations ?]

A compact instruction set can either be implemented directly in hardware, or (trading off some run-time speed) emulated/interpreted (for example, the BASIC Stamp and Pcode), or (trading off some start-up time) recompiled before running (for example, JIT Java).

There are 3 main variants that I know of for creating a compact instruction set:

I think this illustrates "premature optimization" -- the misplaced focus on making individual instructions small (cb) interferes with the real goal of making the entire program small (cc). If you're only allowed to look at one instruction at a time (cb), the best you can do is a Huffman-style compression. But if you're able to look at even 2 or 3 instructions at a time (cc), you can often get much better compression -- LZ, LZW, and similar kinds of compression. (But I keep thinking that LZ, LZW compression won't work if the program is already well-factored).

related links:

other functional compression

I am especially interested in ``lossy'' program compression. (When an embedded system lacks enough RAM to run the full-up version, ... a smaller program that functionally *does* all the same stuff, although it's impossible to recover the exact original file from the compressed version, and it may have slightly worse performance.) Example: recognize ``unrolled loops'', then shrink them back down to a single cycle of the loop. Example: "refactoring" .

However, "functional compression" is really ...

Tradeoff between performance and size:

program compression links:

file formats

[FIXME: should I combine all my file format information in one place ? Or does it make sense to divide it into 2 parts:

]

[FIXME: html-ify "file format considerations"; add use the Unicode trick of "FFFE" to detect and correct byte-swap. ]

unsorted


Original Author: David Cary. This page split off from computer_graphics_tools.html on 1998-07-10.

comments, suggestions, errors, bug reports to

David Cary feedback.html
d.cary@ieee.org.

Return to index // end http://david.carybros.com/html/data_compression.html /* was http://rdrop.com/~cary/html/data_compression.html */