MIME-Version: 1.0
In-Reply-To: <90EF4E6C-9A71-4A35-A938-EAFC1A24DD24@mattcorallo.com>
References: <565CD7D8.3070102@gmail.com>
	<90EF4E6C-9A71-4A35-A938-EAFC1A24DD24@mattcorallo.com>
From: =?UTF-8?Q?Emin_G=C3=BCn_Sirer?= <el33th4x0r@gmail.com>
Date: Wed, 2 Dec 2015 13:57:46 -0500
Message-ID: <CAPkFh0t9SwVOLrPnL7z80s-Rriezhqxn_3vXKYRxr6JVGNiUZQ@mail.gmail.com>
To: Matt Corallo <lf-lists@mattcorallo.com>
Content-Type: multipart/alternative; boundary=001a113a75b2c7c7f80525eedafd
Cc: Bitcoin Dev <bitcoin-dev@lists.linuxfoundation.org>
Subject: Re: [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and
	Transactions
Precedence: list

--001a113a75b2c7c7f80525eedafd
Content-Type: text/plain; charset=UTF-8

Thanks Peter for the careful, quantitative work.

I want to bring one additional issue to everyone's consideration, related
to the choice of the Lempel-Ziv family of compressors.

While I'm not familiar with every single compression engine tested, the
Lempel-Ziv family of compressors are generally based on "compression
tables." Essentially, they assign a short unique number to every new
subsequence they encounter, and when they re-encounter a sequence like "ab"
in "abcdfdcdabcdfabcdf" they replace it with that short integer (say, in
this case, 9-bit constant 256). So this example sequence may turn into
"abcdfd<258 for cd><256 for ab><258 for cd>f<261 for abc><259 for df>"
which is slightly shorter than the original (I'm doing this off the top of
my head so the counts may be off, but it's meant to be illustrative). Note
that the sequence "abc" got added into the table only after it was
encountered twice in the input.

This is nice and generic and works well for English text where certain
letter sequences (e.g. "it" "th" "the" "this" "are" "there" etc) are
repeated often, but it is nowhere as compact as it could possibly be for
mostly binary data -- there are opportunities for much better compression,
made possible by the structured reuse of certain byte sequences in the
Bitcoin wire protocol.

On a Bitcoin wire connection, we might see several related transactions
reorganizing cash in a set of addresses, and therefore, several reuses of a
20-byte address. Or we might see a 200-byte transaction get transmitted,
followed by the same transaction, repeated in a block. Ideally, we'd learn
the sequence that may be repeated later on, all at once (e.g. a Bitcoin
address or a transaction), and replace it with a short number, referring
back to the long sequence. In the example above, if we knew that "abcdf"
was a UNIT that would likely be repeated, we would put it into the
compression table as a whole, instead of relying on repetition to get it
into the table one extra byte at a time. That may let us compress the
original sequence down to "abcdfd<257 for cd><256 for abcdf><256 for
abcdf>" from the get go.

Yet the LZ variants I know of will need to see a 200-byte sequence repeated
**199 times** in order to develop a single, reusable, 200-byte long
subsequence in the compression table.

So, a Bitcoin-specific compressor can perhaps do significantly better, but
is it a good idea? Let's argue both sides.

Cons:

On the one hand, Bitcoin-specific compressors will be closely tied to the
contents of messages, which might make it difficult to change the wire
format later on -- changes to the wire format may need corresponding
changes to the compressor.  If the compressor cannot be implemented
cleanly, then the protocol-agnostic, off-the-shelf compressors have a
maintainability edge, which comes at the expense of the compression ratio.

Another argument is that compression algorithms of any kind should be
tested thoroughly before inclusion, and brand new code may lack the
maturity required. While this argument has some merit, all outputs are
verified separately later on during processing, so
compression/decompression errors can potentially be detected. If the
compressor/decompressor can be structured in a way that isolates bitcoind
from failure (e.g. as a separate process for starters), this concern can be
remedied.

Pros:

The nature of LZ compressors leads me to believe that much higher
compression ratios are possible by building a custom, Bitcoin-aware
compressor. If I had to guess, I would venture that compression ratios of
2X or more are possible in some cases. In some sense, the "O(1) block
propagation" idea that Gavin proposed a while ago can be seen as extreme
example of a Bitcoin-specific compressor, albeit one that constrains the
order of transactions in a block.

Compression can buy us some additional throughput at zero cost, modulo code
complexity.
Given the amount of acrimonious debate over the block size we have all had
to endure, it seems
criminal to leave potentially free improvements on the table. Even if the
resulting code is
deemed too complex to include in the production client right now, it would
be good to understand
the potential for improvement.

How to Do It

If we want to compress Bitcoin, a programming challenge/contest would be
one of the best ways to find the best possible, Bitcoin-specific
compressor. This is the kind of self-contained exercise that bright young
hackers love to tackle. It'd bring in new programmers into the ecosystem,
and many of us would love to discover the limits of compressibility for
Bitcoin bits on a wire. And the results would be interesting even if the
final compression engine is not enabled by default, or not even merged.

--001a113a75b2c7c7f80525eedafd
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><span style=3D"font-size:12.8px">Thanks Peter for the care=
ful, quantitative work.</span><div style=3D"font-size:12.8px"><br></div><di=
v><span style=3D"font-size:12.8px">I want to bring one additional issue to =
everyone&#39;s consideration, related to the choice of the Lempel-Ziv famil=
y of compressors.=C2=A0</span><div style=3D"font-size:12.8px"><br></div><di=
v style=3D"font-size:12.8px">While I&#39;m not familiar with every single c=
ompression engine tested, the Lempel-Ziv family of compressors are generall=
y based on &quot;compression tables.&quot; Essentially, they assign a short=
 unique number to every new subsequence they encounter, and when they re-en=
counter a sequence like &quot;ab&quot; in &quot;abcdfdcdabcdfabcdf&quot; th=
ey replace it with that short integer (say, in this case, 9-bit constant 25=
6). So this example sequence may turn into &quot;abcdfd&lt;258 for cd&gt;&l=
t;256 for ab&gt;&lt;258 for cd&gt;f&lt;261 for abc&gt;&lt;259 for df&gt;&qu=
ot; which is slightly shorter than the original (I&#39;m doing this off the=
 top of my head so the counts may be off, but it&#39;s meant to be illustra=
tive). Note that the sequence &quot;abc&quot; got added into the table only=
 after it was encountered twice in the input.=C2=A0</div><div style=3D"font=
-size:12.8px"><br></div><div style=3D"font-size:12.8px">This is nice and ge=
neric and works well for English text where certain letter sequences (e.g. =
&quot;it&quot; &quot;th&quot; &quot;the&quot; &quot;this&quot; &quot;are&qu=
ot; &quot;there&quot; etc) are repeated often, but it is nowhere as compact=
 as it could possibly be for mostly binary data -- there are opportunities =
for much better compression, made possible by the structured reuse of certa=
in byte sequences in the Bitcoin wire protocol.</div><div><div style=3D"fon=
t-size:12.8px"><br></div><div style=3D"font-size:12.8px">On a Bitcoin wire =
connection, we might see several related transactions reorganizing cash in =
a set of addresses, and therefore, several reuses of a 20-byte address. Or =
we might see a 200-byte transaction get transmitted, followed by the same t=
ransaction, repeated in a block. Ideally, we&#39;d learn the sequence that =
may be repeated later on, all at once=C2=A0(e.g. a Bitcoin address or a tra=
nsaction), and replace it with a short number, referring back to the long s=
equence. In the example above, if we knew that &quot;abcdf&quot; was a UNIT=
 that would likely be repeated, we would put it into the compression table =
as a whole, instead of relying on repetition to get it into the table one e=
xtra byte at a time. That may let us compress the original sequence down to=
 &quot;abcdfd&lt;257 for cd&gt;&lt;256 for abcdf&gt;&lt;256 for abcdf&gt;&q=
uot; from the get go.</div><div style=3D"font-size:12.8px"><br></div><div s=
tyle=3D"font-size:12.8px">Yet the LZ variants I know of will need to see a =
200-byte sequence repeated **199 times** in order to develop a single, reus=
able, 200-byte long subsequence in the compression table.=C2=A0</div><div><=
div style=3D"font-size:12.8px"><br></div><div style=3D"font-size:12.8px">So=
, a Bitcoin-specific compressor can perhaps do significantly better, but is=
 it a good idea? Let&#39;s argue both sides.</div><div style=3D"font-size:1=
2.8px"><br></div><div><div><span style=3D"font-size:12.8px">Cons:</span></d=
iv><div><span style=3D"font-size:12.8px"><br></span></div><div><span style=
=3D"font-size:12.8px">On the one hand, Bitcoin-specific compressors will be=
 closely tied to the contents of messages, which might make it difficult to=
 change the wire format later on -- changes to the wire format may need cor=
responding changes to the compressor.=C2=A0 If the compressor cannot be imp=
lemented cleanly, then the protocol-agnostic, off-the-shelf compressors hav=
e a maintainability edge, which comes at the expense of the compression rat=
io.=C2=A0</span></div><div><span style=3D"font-size:12.8px"><br></span></di=
v><div><span style=3D"font-size:12.8px">Another argument is that compressio=
n algorithms of any kind should be tested thoroughly before inclusion, and =
brand new code may lack the maturity required. While this argument has some=
 merit, all outputs are verified separately later on during processing, so =
compression/decompression errors can potentially be detected. If the compre=
ssor/decompressor can be structured in a way that isolates bitcoind from fa=
ilure (e.g. as a separate process for starters), this concern can be remedi=
ed.</span></div><div><span style=3D"font-size:12.8px"><br></span></div><div=
><span style=3D"font-size:12.8px">Pros:</span></div><div><span style=3D"fon=
t-size:12.8px"><br></span></div><div><span style=3D"font-size:12.8px">The n=
ature of LZ compressors leads me to believe that much higher compression ra=
tios are possible by building a custom, Bitcoin-aware compressor. If I had =
to guess, I would venture that compression ratios of 2X or more are possibl=
e in some cases. In some sense, the &quot;O(1) block propagation&quot; idea=
 that Gavin proposed a while ago can be seen as extreme example of a Bitcoi=
n-specific compressor, albeit one that constrains the order of transactions=
 in a block.</span></div><div><span style=3D"font-size:12.8px"><br></span><=
/div><div><span style=3D"font-size:12.8px">Compression can buy us some addi=
tional throughput at zero cost, modulo code complexity.=C2=A0</span></div><=
div><span style=3D"font-size:12.8px">Given the amount of acrimonious debate=
 over the block size we have all had to endure, it seems=C2=A0</span></div>=
<div><span style=3D"font-size:12.8px">criminal to leave potentially free im=
provements on the table. Even if the resulting code is</span></div><div><sp=
an style=3D"font-size:12.8px">deemed too complex to include in the producti=
on client right now, it would be good to understand</span></div><div><span =
style=3D"font-size:12.8px">the potential for improvement.</span></div><div>=
<span style=3D"font-size:12.8px"><br></span></div><div><span style=3D"font-=
size:12.8px">How to Do It</span></div><div><span style=3D"font-size:12.8px"=
><br></span></div><div><span style=3D"font-size:12.8px">If we want to compr=
ess Bitcoin, a</span><span style=3D"font-size:12.8px">=C2=A0programming cha=
llenge/contest would be one of the best ways to find the best possible, Bit=
coin-specific compressor. This is the kind of self-contained exercise that =
bright </span><span style=3D"font-size:12.8px">young hackers love to tackle=
. It&#39;d bring in new programmers into the ecosystem, and many of us woul=
d love to discover the limits of compressibility for Bitcoin bits on a wire=
. And the results would be interesting even if the final compression engine=
 is not enabled by default, or not even merged.</span></div><div><br></div>=
</div></div></div></div></div>

--001a113a75b2c7c7f80525eedafd--