From: "G. Andrew Stone" <g.andrew.stone@gmail.com>
Date: Tue, 28 Feb 2017 11:43:29 -0500
Message-ID: <CAHUwRvtseXUx_ShfHd9r9LW1_8cJYcofQ4s1vEpkpKJEniDTzA@mail.gmail.com>
To: Bram Cohen <bram@bittorrent.com>, 
	Bitcoin Protocol Discussion <bitcoin-dev@lists.linuxfoundation.org>
Subject: Re: [bitcoin-dev] A Better MMR Definition

I can understand how Bram's UTXO set patricia trie, keyed by the
transaction's double-SHA256 hash, allows a client to quickly validate
inputs, because the inputs of a transaction are specified in the same
manner.  So to verify that an input is unspent, the client simply
traverses the patricia trie.
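
As a rough sketch of what I mean (this uses a plain, uncompressed binary
trie rather than Bram's actual MerkleSet structure, and a made-up outpoint
key derivation, so all the names here are placeholders), the membership
check is just a bit-by-bit walk down the trie:

import hashlib

def outpoint_key(txid: bytes, vout: int) -> bytes:
    # Hypothetical key derivation: double-SHA256 of the serialized
    # outpoint.  Bram's real keying may differ; illustration only.
    data = txid + vout.to_bytes(4, "little")
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

class TrieNode:
    def __init__(self):
        self.children = {}     # bit (0 or 1) -> TrieNode
        self.terminal = False  # a key ends at this node

def _bits(key: bytes):
    for byte in key:
        for i in range(7, -1, -1):
            yield (byte >> i) & 1

def insert(root: TrieNode, key: bytes) -> None:
    node = root
    for b in _bits(key):
        node = node.children.setdefault(b, TrieNode())
    node.terminal = True

def is_unspent(root: TrieNode, key: bytes) -> bool:
    # The traversal described above: follow the key's bits from the
    # root; cost grows with the key length, not the size of the set.
    node = root
    for b in _bits(key):
        node = node.children.get(b)
        if node is None:
            return False
    return node.terminal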

It also makes sense that if transaction inputs were specified by a [block
height, tx index, output index] triple we'd have a much more size-efficient
transaction format.  This format would make lookups pretty simple in
Peter's pruned time-ordered TXO merkle mountain range, although you'd have
to translate the triple to an index, which means we'd have to at a minimum
keep track of the number of TXOs in each block, and then probably do a
linear search starting from the location where the block's TXOs begin in
the MMR.  (The ultimate option, I guess, is to specify transaction inputs
by a single number which is essentially the index of the TXO in a (never
actually created) insertion-ordered TXO array...)
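
To make the translation step concrete, here's a hypothetical sketch (the
per-block metadata used below is invented for illustration; it's exactly
the kind of extra bookkeeping I mean):

from itertools import accumulate

def build_block_offsets(txo_counts_per_block):
    # Cumulative number of TXOs created before each block: the minimum
    # extra state mentioned above.  offsets[h] is the index in the
    # insertion-ordered TXO "array" of block h's first output.
    offsets = [0]
    offsets.extend(accumulate(txo_counts_per_block))
    return offsets

def triple_to_txo_index(height, tx_index, output_index,
                        block_offsets, outputs_per_tx_in_block):
    # Map a [block height, tx index, output index] triple to its
    # position in the (never actually created) insertion-ordered TXO
    # array, scanning linearly from the block's first TXO.
    idx = block_offsets[height]
    for outputs in outputs_per_tx_in_block[:tx_index]:
        idx += outputs
    return idx + output_index

# Example: blocks 0..2 created 4, 3 and 5 TXOs; in block 2, tx 0 has
# 3 outputs, so output 0 of tx 1 is TXO number 7 + 3 = 10.
offsets = build_block_offsets([4, 3, 5])
assert triple_to_txo_index(2, 1, 0, offsets, [3, 2]) == 10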

But since transactions' prevouts are not specified by [block height, tx
index, output index] or by TXO index, I don't understand how an
insertion-ordered TXO tree can result in efficient lookups.  Can you help
me understand this?



On Sat, Feb 25, 2017 at 1:23 AM, Bram Cohen via bitcoin-dev <
bitcoin-dev@lists.linuxfoundation.org> wrote:

> On Fri, Feb 24, 2017 at 8:12 PM, Peter Todd <pete@petertodd.org> wrote:
>
>>
>> So to be clear, what you're proposing there is to use the insertion order
>> as
>> the index - once you go that far you've almost entirely re-invented my
>> proposal!
>>
>
> I'm not 'proposing' this; I'm saying it could be done simply, but I'm
> skeptical of the utility. Probably the most compelling argument for it is
> that the insertion-indexed values are much smaller, so they can be
> compacted down a lot, resulting in less memory use, more locality, and
> fewer hashes, but your implementation doesn't take advantage of that.
>
>
>> Your merkle-set implementation is 1500 lines of densely written Python
>
>
> The reference implementation which is included in those 1500 lines is less
> than 300 lines and fairly straightforward. The non-reference implementation
> always behaves semantically identically to the reference implementation;
> it just does so faster and using less memory.
>
>
>> with
>> almost no comments,
>
>
> The comments at the top explain both the proof format and the in-memory
> data structures very precisely. The whole codebase was reviewed by a
> coworker of mine and comments were added explaining the subtleties which
> tripped him up.
>
>
>> and less than 100 lines of (also uncommented) tests.
>
>
> Those tests get 98% code coverage and extensively hit not only the lines
> of code but the semantic edge cases as well. The lines which aren't hit are
> convenience functions and error conditions of the parsing code for when
> it's passed bad data.
>
>
>> By
>> comparison, my Python MMR implementation is 300 lines of very readable
>> Python
>> with lots of comments, a 200 line explanation at the top, and 200 lines of
>> (commented) tests. Yet no-one is taking the (still considerable) effort to
>> understand and comment on my implementation. :)
>>
>
> Given that maaku's Merkle prefix trees were shelved because of performance
> problems despite being written in C and operating in basically the same way
> as your code and my reference code, it's clear that non-optimized Python
> won't be touching the bitcoin codebase any time soon.
>
>
>>
>> Fact is, what you've written is really daunting to review, and given
>> that it's not in the final language anyway, it's unclear on what basis
>> to review it.
>
>
> It should be reviewed based on semantic correctness and performance.
> Performance can only be accurately and convincingly determined by porting
> it to C and optimizing it, which mostly involves experimenting with
> different values for the two passed-in magic numbers.
>
>
>> I suspect you'd get more feedback if the codebase were better commented,
>> in a production language, and you had actual real-world benchmarks and
>> performance figures.
>>
>
> Porting to C should be straightforward. Several people have already
> expressed interest in doing so, and it's written in intentionally C-ish
> Python, resulting in some rather odd idioms, which is a big part of why
> you think it looks 'dense'. A lot of that weird offset math should be much
> more readable in C, because there it's all structs and x.y notation can be
> used instead of adding offsets.
>
>
>> In particular, while at the top of merkle_set.py you have a list of
>> advantages,
>> and a bunch of TODO's, you don't explain *why* the code has any of these
>> advantages. To figure that out, I'd have to read and understand 1500
>> lines of
>> densely written Python. Without a human-readable pitch, not many people
>> are
>> going to do that, myself included.
>>
>
> It's all about cache coherence. When doing operations it pulls in a bunch
> of things which are near each other in memory instead of jumping all over
> the place. The improvements it gets should be much greater than the ones
> gained from insertion ordering, although the two could be accretive.
>
>
> _______________________________________________
> bitcoin-dev mailing list
> bitcoin-dev@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/bitcoin-dev
>
>
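
P.S. As a rough picture of the cache-coherence point quoted above (purely
an illustration of my own, not Bram's actual block-based layout), packing
a hash tree's nodes into one contiguous buffer keeps an update inside a
small, dense region of memory instead of chasing pointers between
separately allocated node objects:

import hashlib

HASH = 32  # bytes per node hash

def make_tree(num_leaves):
    # Complete binary tree in 1-indexed heap layout: node i's children
    # are 2*i and 2*i + 1, leaves sit in slots num_leaves..2*num_leaves-1.
    assert num_leaves > 0 and num_leaves & (num_leaves - 1) == 0
    return bytearray(2 * num_leaves * HASH)

def update_leaf(buf, num_leaves, leaf_index, new_hash):
    # Write the leaf, then recompute its ancestors in place; all the
    # work stays inside the one flat bytearray.
    i = num_leaves + leaf_index
    buf[i * HASH:(i + 1) * HASH] = new_hash
    while i > 1:
        i //= 2
        left = bytes(buf[2 * i * HASH:(2 * i + 1) * HASH])
        right = bytes(buf[(2 * i + 1) * HASH:(2 * i + 2) * HASH])
        buf[i * HASH:(i + 1) * HASH] = hashlib.sha256(left + right).digest()

# Example: 4-leaf tree, update leaf 2, root ends up in slot 1.
tree = make_tree(4)
update_leaf(tree, 4, 2, hashlib.sha256(b"some txo").digest())
root = bytes(tree[HASH:2 * HASH])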
