96/50062b0fe522c466e51792f212a2231cb2f1d2


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363

Return-Path: <peter.tschipper@gmail.com>
Received: from smtp1.linuxfoundation.org (smtp1.linux-foundation.org
	[172.17.192.35])
	by mail.linuxfoundation.org (Postfix) with ESMTPS id 2D9798EE
	for <bitcoin-dev@lists.linuxfoundation.org>;
	Wed,  2 Dec 2015 20:16:22 +0000 (UTC)
X-Greylist: whitelisted by SQLgrey-1.7.6
Received: from mail-pa0-f43.google.com (mail-pa0-f43.google.com
	[209.85.220.43])
	by smtp1.linuxfoundation.org (Postfix) with ESMTPS id 4DB0E148
	for <bitcoin-dev@lists.linuxfoundation.org>;
	Wed,  2 Dec 2015 20:16:21 +0000 (UTC)
Received: by pacej9 with SMTP id ej9so50186269pac.2
	for <bitcoin-dev@lists.linuxfoundation.org>;
	Wed, 02 Dec 2015 12:16:21 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=subject:to:references:cc:from:message-id:date:user-agent
	:mime-version:in-reply-to:content-type;
	bh=bOTesNyNFOK3lNr3a85lIEhar5KgCqnO3tgfX9105G8=;
	b=mURcrhbcz3WoWHYLeFGPA2cgJ1kjqCwql9DjgDrPPv57MtPCOjutPCgBlG00sO9NUZ
	Fs7m53CRrFWBNQUJSruB9G7dYZG4zOYo5zlfCd+SmZ8YndBsaE5zjePKuiqUY/UQyZ9P
	QnoSGwkIKVUoAOP/RlM157dbBUKzXcO+vJ4AThBOrPtK9CRwtPGeIDrOSK9GSL796FYy
	8ED7vS86exs7929dP56HkjzhsOcPbsPhZUeOQEnaZd9FJWKsOP1z89iOAXymRyxWBTUx
	CGS5YCQgkOhdsVlXKOytYz3o8lXMVdZEMx23QvSP3i9avME2I2iayfARhN4uJ5u9RFER
	Tf6Q==
X-Received: by 10.66.155.38 with SMTP id vt6mr7477728pab.20.1449087381046;
	Wed, 02 Dec 2015 12:16:21 -0800 (PST)
Received: from [192.168.0.132] (S0106bcd165303d84.cc.shawcable.net.
	[96.54.102.88]) by smtp.googlemail.com with ESMTPSA id
	kh9sm6068489pad.11.2015.12.02.12.16.19
	(version=TLSv1/SSLv3 cipher=OTHER);
	Wed, 02 Dec 2015 12:16:20 -0800 (PST)
To: =?UTF-8?Q?Emin_G=c3=bcn_Sirer?= <el33th4x0r@gmail.com>
References: <565CD7D8.3070102@gmail.com>
	<90EF4E6C-9A71-4A35-A938-EAFC1A24DD24@mattcorallo.com>
	<CAPkFh0t9SwVOLrPnL7z80s-Rriezhqxn_3vXKYRxr6JVGNiUZQ@mail.gmail.com>
From: Peter Tschipper <peter.tschipper@gmail.com>
X-Enigmail-Draft-Status: N1110
Message-ID: <565F5193.1070802@gmail.com>
Date: Wed, 2 Dec 2015 12:16:19 -0800
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101
	Thunderbird/38.3.0
MIME-Version: 1.0
In-Reply-To: <CAPkFh0t9SwVOLrPnL7z80s-Rriezhqxn_3vXKYRxr6JVGNiUZQ@mail.gmail.com>
Content-Type: multipart/alternative;
	boundary="------------010207060208020007080302"
X-Spam-Status: No, score=-2.7 required=5.0 tests=BAYES_00,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,HTML_MESSAGE,RCVD_IN_DNSWL_LOW
	autolearn=ham version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	smtp1.linux-foundation.org
Cc: Bitcoin Dev <bitcoin-dev@lists.linuxfoundation.org>
Subject: Re: [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and
 Transactions
X-BeenThere: bitcoin-dev@lists.linuxfoundation.org
X-Mailman-Version: 2.1.12
Precedence: list
List-Id: Bitcoin Development Discussion <bitcoin-dev.lists.linuxfoundation.org>
List-Unsubscribe: <https://lists.linuxfoundation.org/mailman/options/bitcoin-dev>,
	<mailto:bitcoin-dev-request@lists.linuxfoundation.org?subject=unsubscribe>
List-Archive: <http://lists.linuxfoundation.org/pipermail/bitcoin-dev/>
List-Post: <mailto:bitcoin-dev@lists.linuxfoundation.org>
List-Help: <mailto:bitcoin-dev-request@lists.linuxfoundation.org?subject=help>
List-Subscribe: <https://lists.linuxfoundation.org/mailman/listinfo/bitcoin-dev>,
	<mailto:bitcoin-dev-request@lists.linuxfoundation.org?subject=subscribe>
X-List-Received-Date: Wed, 02 Dec 2015 20:16:22 -0000

This is a multi-part message in MIME format.
--------------010207060208020007080302
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Building a compressor from scratch may yeild some better compression
ratios, or not, but having trust and faith in whether it will stand up
against attack vectors another matter.  LZO has been around for 20 years
with very few problems and no current issues.  Maybe something better
can be built, but when and how much testing will need to be done before
it can be trusted?  Right now there is something that provides a benefit
and in the future if something better is found it's not that difficult
to add it.  We could easily support multiple compression libraries.


On 02/12/2015 10:57 AM, Emin Gün Sirer wrote:
> Thanks Peter for the careful, quantitative work.
>
> I want to bring one additional issue to everyone's consideration,
> related to the choice of the Lempel-Ziv family of compressors. 
>
> While I'm not familiar with every single compression engine tested,
> the Lempel-Ziv family of compressors are generally based on
> "compression tables." Essentially, they assign a short unique number
> to every new subsequence they encounter, and when they re-encounter a
> sequence like "ab" in "abcdfdcdabcdfabcdf" they replace it with that
> short integer (say, in this case, 9-bit constant 256). So this example
> sequence may turn into "abcdfd<258 for cd><256 for ab><258 for
> cd>f<261 for abc><259 for df>" which is slightly shorter than the
> original (I'm doing this off the top of my head so the counts may be
> off, but it's meant to be illustrative). Note that the sequence "abc"
> got added into the table only after it was encountered twice in the
> input. 
>
> This is nice and generic and works well for English text where certain
> letter sequences (e.g. "it" "th" "the" "this" "are" "there" etc) are
> repeated often, but it is nowhere as compact as it could possibly be
> for mostly binary data -- there are opportunities for much better
> compression, made possible by the structured reuse of certain byte
> sequences in the Bitcoin wire protocol.
>
> On a Bitcoin wire connection, we might see several related
> transactions reorganizing cash in a set of addresses, and therefore,
> several reuses of a 20-byte address. Or we might see a 200-byte
> transaction get transmitted, followed by the same transaction,
> repeated in a block. Ideally, we'd learn the sequence that may be
> repeated later on, all at once (e.g. a Bitcoin address or a
> transaction), and replace it with a short number, referring back to
> the long sequence. In the example above, if we knew that "abcdf" was a
> UNIT that would likely be repeated, we would put it into the
> compression table as a whole, instead of relying on repetition to get
> it into the table one extra byte at a time. That may let us compress
> the original sequence down to "abcdfd<257 for cd><256 for abcdf><256
> for abcdf>" from the get go.
>
> Yet the LZ variants I know of will need to see a 200-byte sequence
> repeated **199 times** in order to develop a single, reusable,
> 200-byte long subsequence in the compression table. 
>
> So, a Bitcoin-specific compressor can perhaps do significantly better,
> but is it a good idea? Let's argue both sides.
>
> Cons:
>
> On the one hand, Bitcoin-specific compressors will be closely tied to
> the contents of messages, which might make it difficult to change the
> wire format later on -- changes to the wire format may need
> corresponding changes to the compressor.  If the compressor cannot be
> implemented cleanly, then the protocol-agnostic, off-the-shelf
> compressors have a maintainability edge, which comes at the expense of
> the compression ratio. 
>
> Another argument is that compression algorithms of any kind should be
> tested thoroughly before inclusion, and brand new code may lack the
> maturity required. While this argument has some merit, all outputs are
> verified separately later on during processing, so
> compression/decompression errors can potentially be detected. If the
> compressor/decompressor can be structured in a way that isolates
> bitcoind from failure (e.g. as a separate process for starters), this
> concern can be remedied.
>
> Pros:
>
> The nature of LZ compressors leads me to believe that much higher
> compression ratios are possible by building a custom, Bitcoin-aware
> compressor. If I had to guess, I would venture that compression ratios
> of 2X or more are possible in some cases. In some sense, the "O(1)
> block propagation" idea that Gavin proposed a while ago can be seen as
> extreme example of a Bitcoin-specific compressor, albeit one that
> constrains the order of transactions in a block.
>
> Compression can buy us some additional throughput at zero cost, modulo
> code complexity. 
> Given the amount of acrimonious debate over the block size we have all
> had to endure, it seems 
> criminal to leave potentially free improvements on the table. Even if
> the resulting code is
> deemed too complex to include in the production client right now, it
> would be good to understand
> the potential for improvement.
>
> How to Do It
>
> If we want to compress Bitcoin, a programming challenge/contest would
> be one of the best ways to find the best possible, Bitcoin-specific
> compressor. This is the kind of self-contained exercise that bright
> young hackers love to tackle. It'd bring in new programmers into the
> ecosystem, and many of us would love to discover the limits of
> compressibility for Bitcoin bits on a wire. And the results would be
> interesting even if the final compression engine is not enabled by
> default, or not even merged.
>


--------------010207060208020007080302
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 8bit

<html>
  <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <div class="moz-cite-prefix">Building a compressor from scratch may
      yeild some better compression ratios, or not, but having trust and
      faith in whether it will stand up against attack vectors another
      matter.  LZO has been around for 20 years with very few problems
      and no current issues.  Maybe something better can be built, but
      when and how much testing will need to be done before it can be
      trusted?  Right now there is something that provides a benefit and
      in the future if something better is found it's not that difficult
      to add it.  We could easily support multiple compression
      libraries.<br>
      <br>
      <br>
      On 02/12/2015 10:57 AM, Emin Gün Sirer wrote:<br>
    </div>
    <blockquote
cite="mid:CAPkFh0t9SwVOLrPnL7z80s-Rriezhqxn_3vXKYRxr6JVGNiUZQ@mail.gmail.com"
      type="cite">
      <div dir="ltr"><span style="font-size:12.8px">Thanks Peter for the
          careful, quantitative work.</span>
        <div style="font-size:12.8px"><br>
        </div>
        <div><span style="font-size:12.8px">I want to bring one
            additional issue to everyone's consideration, related to the
            choice of the Lempel-Ziv family of compressors. </span>
          <div style="font-size:12.8px"><br>
          </div>
          <div style="font-size:12.8px">While I'm not familiar with
            every single compression engine tested, the Lempel-Ziv
            family of compressors are generally based on "compression
            tables." Essentially, they assign a short unique number to
            every new subsequence they encounter, and when they
            re-encounter a sequence like "ab" in "abcdfdcdabcdfabcdf"
            they replace it with that short integer (say, in this case,
            9-bit constant 256). So this example sequence may turn into
            "abcdfd&lt;258 for cd&gt;&lt;256 for ab&gt;&lt;258 for
            cd&gt;f&lt;261 for abc&gt;&lt;259 for df&gt;" which is
            slightly shorter than the original (I'm doing this off the
            top of my head so the counts may be off, but it's meant to
            be illustrative). Note that the sequence "abc" got added
            into the table only after it was encountered twice in the
            input. </div>
          <div style="font-size:12.8px"><br>
          </div>
          <div style="font-size:12.8px">This is nice and generic and
            works well for English text where certain letter sequences
            (e.g. "it" "th" "the" "this" "are" "there" etc) are repeated
            often, but it is nowhere as compact as it could possibly be
            for mostly binary data -- there are opportunities for much
            better compression, made possible by the structured reuse of
            certain byte sequences in the Bitcoin wire protocol.</div>
          <div>
            <div style="font-size:12.8px"><br>
            </div>
            <div style="font-size:12.8px">On a Bitcoin wire connection,
              we might see several related transactions reorganizing
              cash in a set of addresses, and therefore, several reuses
              of a 20-byte address. Or we might see a 200-byte
              transaction get transmitted, followed by the same
              transaction, repeated in a block. Ideally, we'd learn the
              sequence that may be repeated later on, all at once (e.g.
              a Bitcoin address or a transaction), and replace it with a
              short number, referring back to the long sequence. In the
              example above, if we knew that "abcdf" was a UNIT that
              would likely be repeated, we would put it into the
              compression table as a whole, instead of relying on
              repetition to get it into the table one extra byte at a
              time. That may let us compress the original sequence down
              to "abcdfd&lt;257 for cd&gt;&lt;256 for abcdf&gt;&lt;256
              for abcdf&gt;" from the get go.</div>
            <div style="font-size:12.8px"><br>
            </div>
            <div style="font-size:12.8px">Yet the LZ variants I know of
              will need to see a 200-byte sequence repeated **199
              times** in order to develop a single, reusable, 200-byte
              long subsequence in the compression table. </div>
            <div>
              <div style="font-size:12.8px"><br>
              </div>
              <div style="font-size:12.8px">So, a Bitcoin-specific
                compressor can perhaps do significantly better, but is
                it a good idea? Let's argue both sides.</div>
              <div style="font-size:12.8px"><br>
              </div>
              <div>
                <div><span style="font-size:12.8px">Cons:</span></div>
                <div><span style="font-size:12.8px"><br>
                  </span></div>
                <div><span style="font-size:12.8px">On the one hand,
                    Bitcoin-specific compressors will be closely tied to
                    the contents of messages, which might make it
                    difficult to change the wire format later on --
                    changes to the wire format may need corresponding
                    changes to the compressor.  If the compressor cannot
                    be implemented cleanly, then the protocol-agnostic,
                    off-the-shelf compressors have a maintainability
                    edge, which comes at the expense of the compression
                    ratio. </span></div>
                <div><span style="font-size:12.8px"><br>
                  </span></div>
                <div><span style="font-size:12.8px">Another argument is
                    that compression algorithms of any kind should be
                    tested thoroughly before inclusion, and brand new
                    code may lack the maturity required. While this
                    argument has some merit, all outputs are verified
                    separately later on during processing, so
                    compression/decompression errors can potentially be
                    detected. If the compressor/decompressor can be
                    structured in a way that isolates bitcoind from
                    failure (e.g. as a separate process for starters),
                    this concern can be remedied.</span></div>
                <div><span style="font-size:12.8px"><br>
                  </span></div>
                <div><span style="font-size:12.8px">Pros:</span></div>
                <div><span style="font-size:12.8px"><br>
                  </span></div>
                <div><span style="font-size:12.8px">The nature of LZ
                    compressors leads me to believe that much higher
                    compression ratios are possible by building a
                    custom, Bitcoin-aware compressor. If I had to guess,
                    I would venture that compression ratios of 2X or
                    more are possible in some cases. In some sense, the
                    "O(1) block propagation" idea that Gavin proposed a
                    while ago can be seen as extreme example of a
                    Bitcoin-specific compressor, albeit one that
                    constrains the order of transactions in a block.</span></div>
                <div><span style="font-size:12.8px"><br>
                  </span></div>
                <div><span style="font-size:12.8px">Compression can buy
                    us some additional throughput at zero cost, modulo
                    code complexity. </span></div>
                <div><span style="font-size:12.8px">Given the amount of
                    acrimonious debate over the block size we have all
                    had to endure, it seems </span></div>
                <div><span style="font-size:12.8px">criminal to leave
                    potentially free improvements on the table. Even if
                    the resulting code is</span></div>
                <div><span style="font-size:12.8px">deemed too complex
                    to include in the production client right now, it
                    would be good to understand</span></div>
                <div><span style="font-size:12.8px">the potential for
                    improvement.</span></div>
                <div><span style="font-size:12.8px"><br>
                  </span></div>
                <div><span style="font-size:12.8px">How to Do It</span></div>
                <div><span style="font-size:12.8px"><br>
                  </span></div>
                <div><span style="font-size:12.8px">If we want to
                    compress Bitcoin, a</span><span
                    style="font-size:12.8px"> programming
                    challenge/contest would be one of the best ways to
                    find the best possible, Bitcoin-specific compressor.
                    This is the kind of self-contained exercise that
                    bright </span><span style="font-size:12.8px">young
                    hackers love to tackle. It'd bring in new
                    programmers into the ecosystem, and many of us would
                    love to discover the limits of compressibility for
                    Bitcoin bits on a wire. And the results would be
                    interesting even if the final compression engine is
                    not enabled by default, or not even merged.</span></div>
                <div><br>
                </div>
              </div>
            </div>
          </div>
        </div>
      </div>
    </blockquote>
    <br>
  </body>
</html>

--------------010207060208020007080302--