a3/1d5cb2686a6db8e7fe234e5f09d567d476a517


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331

Return-Path: <pete@petertodd.org>
Received: from smtp1.linuxfoundation.org (smtp1.linux-foundation.org
	[172.17.192.35])
	by mail.linuxfoundation.org (Postfix) with ESMTPS id 59327483
	for <bitcoin-dev@lists.linuxfoundation.org>;
	Sat, 18 Jun 2016 23:01:52 +0000 (UTC)
X-Greylist: from auto-whitelisted by SQLgrey-1.7.6
Received: from outmail149077.authsmtp.com (outmail149077.authsmtp.com
	[62.13.149.77])
	by smtp1.linuxfoundation.org (Postfix) with ESMTP id E91F18E
	for <bitcoin-dev@lists.linuxfoundation.org>;
	Sat, 18 Jun 2016 23:01:50 +0000 (UTC)
Received: from mail-c247.authsmtp.com (mail-c247.authsmtp.com [62.13.128.247])
	by punt24.authsmtp.com (8.14.2/8.14.2/) with ESMTP id u5IN1nYd053524;
	Sun, 19 Jun 2016 00:01:49 +0100 (BST)
Received: from petertodd.org (ec2-52-5-185-120.compute-1.amazonaws.com
	[52.5.185.120]) (authenticated bits=0)
	by mail.authsmtp.com (8.14.2/8.14.2/) with ESMTP id u5IN1kRb058747
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sun, 19 Jun 2016 00:01:46 +0100 (BST)
Received: from [127.0.0.1] (localhost [127.0.0.1])
	by petertodd.org (Postfix) with ESMTPSA id 9B7904011C;
	Sat, 18 Jun 2016 22:59:46 +0000 (UTC)
Received: by localhost (Postfix, from userid 1000)
	id 469F720278; Sat, 18 Jun 2016 19:01:43 -0400 (EDT)
Date: Sat, 18 Jun 2016 19:01:43 -0400
From: Peter Todd <pete@petertodd.org>
To: Bram Cohen <bram@bittorrent.com>
Message-ID: <20160618230143.GA25017@fedora-21-dvm>
References: <CA+KqGkosGrWQ2Lpg3Ptc+UyidK7H4_HLQ_QWx2DOAFRmLkv4zQ@mail.gmail.com>
	<20160616001040.GA5026@fedora-21-dvm>
	<CA+KqGkqAHcU2PzEX6OmorXRBQ22eF_QBYYqUDS1v_ZvevhLCuQ@mail.gmail.com>
	<20160616032612.GA7792@fedora-21-dvm>
	<CA+KqGkqA2WnvBEck3kv6p2no-9wzNVCTNA-MGw=Jg=gMGfrxUQ@mail.gmail.com>
	<20160617043435.GA12800@fedora-21-dvm>
	<CA+KqGkpRmeKyo6TFpe+uUCdJSina+ARraNd0dkHSb2Hpx5dYuw@mail.gmail.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
	protocol="application/pgp-signature"; boundary="pf9I7BMVVzbSWLtt"
Content-Disposition: inline
In-Reply-To: <CA+KqGkpRmeKyo6TFpe+uUCdJSina+ARraNd0dkHSb2Hpx5dYuw@mail.gmail.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
X-Server-Quench: a229cf86-35a8-11e6-bcde-0015176ca198
X-AuthReport-Spam: If SPAM / abuse - report it at:
	http://www.authsmtp.com/abuse
X-AuthRoute: OCd2Yg0TA1ZNQRgX IjsJECJaVQIpKltL GxAVKBZePFsRUQkR
	aAdMdwsUEkAaAgsB AmAbW1ReUFx7XWQ7 bghPaBtcak9QXgdq
	T0pMXVMcUQARfx1a ZEweVxF1cwIIen13 Y0YsD3JZVRcsfUBg
	QUsFHXAHZDJmdTJM BBVFdwNVdQJNeEwU a1l3GhFYa3VsNCMk
	FAgyOXU9MCtqYAFY WAIJIBoOW0sGBXY1 QRxKGDIyG1EMRiN7 NRUgJVMHdAAA
X-Authentic-SMTP: 61633532353630.1038:706
X-AuthFastPath: 0 (Was 255)
X-AuthSMTP-Origin: 52.5.185.120/25
X-AuthVirus-Status: No virus detected - but ensure you scan with your own
	anti-virus system.
X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00,RCVD_IN_DNSWL_LOW
	autolearn=ham version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	smtp1.linux-foundation.org
Cc: Bitcoin Protocol Discussion <bitcoin-dev@lists.linuxfoundation.org>
Subject: Re: [bitcoin-dev] Merkle trees and mountain ranges
X-BeenThere: bitcoin-dev@lists.linuxfoundation.org
X-Mailman-Version: 2.1.12
Precedence: list
List-Id: Bitcoin Protocol Discussion <bitcoin-dev.lists.linuxfoundation.org>
List-Unsubscribe: <https://lists.linuxfoundation.org/mailman/options/bitcoin-dev>,
	<mailto:bitcoin-dev-request@lists.linuxfoundation.org?subject=unsubscribe>
List-Archive: <http://lists.linuxfoundation.org/pipermail/bitcoin-dev/>
List-Post: <mailto:bitcoin-dev@lists.linuxfoundation.org>
List-Help: <mailto:bitcoin-dev-request@lists.linuxfoundation.org?subject=help>
List-Subscribe: <https://lists.linuxfoundation.org/mailman/listinfo/bitcoin-dev>,
	<mailto:bitcoin-dev-request@lists.linuxfoundation.org?subject=subscribe>
X-List-Received-Date: Sat, 18 Jun 2016 23:01:52 -0000


--pf9I7BMVVzbSWLtt
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Fri, Jun 17, 2016 at 07:43:47PM -0700, Bram Cohen wrote:
> On Thu, Jun 16, 2016 at 9:34 PM, Peter Todd <pete@petertodd.org> wrote:
>=20
> > So above you said that in merbinner trees each node "hash[es] in a reco=
rd
> > of
> > its depth" That's actually incorrect: each node commits to the prefix t=
hat
> > all
> > keys below that level start with, not just the depth.
>=20
>=20
> I considered a similar trick at the implementation rather than the
> definition level: A node doesn't have to store the prefix which is implic=
it
> in its position. That would create a fair number of headaches though,
> because I'm using fixed size stuff in important ways, and it could at most
> save about 10% of memory, so it goes into the 'maybe later' bucket.

Wait, are you saying you think committing to the prefix is a "trick"? It's =
just
a very simple - and possibly not-optimal - way of committing to what data
should be accessible under a given node. An alternative would have been ens=
ure
that in terms of _cryptographic_ tree position.

By "position", are you talking about position within RAM? That may or may n=
ot
be a viable optimization, but it's quite separate from the question of the
cryptographic structure of the data.

> > This means that in merbinner trees, cases where multiple keys share par=
ts
> > of
> > the same prefix are handled efficiently, without introducing extra leve=
ls
> > unnecessarily; there's no need for the ONLY0/1 nodes as the children of=
 an
> > inner node will always be on different sides.
> >
> > When keys are randomly distributed, this isn't a big deal; OTOH against
> > attackers who are choosing keys, e.g. by grinding hashes, merbinner tre=
es
> > always have maximum depths in proportion to log2(n) of the actual numbe=
r of
> > items in the tree. Grinding is particularly annoying to deal with due to
> > the
> > birthday attack: creating a ground prefix 64 bits long only takes 32 bi=
ts
> > worth
> > of work.
> >
>=20
> Yes an attacker can force the tree to be deeper in places, but it's
> mitigated in several ways: (1) The way I'm using memory it won't cause a
> whole new block to be allocated, it will just force log(attack strength) -
> log(n) nodes to be used (2) logarithmic growth being what it is that isn't
> such a big amount (3) With the special casing of TERMBOTH an attacker nee=
ds
> three things with the same prefix to pull off an attack rather than two,
> which is quite a bit harder to pull off.


> That said, it wouldn't be all that hard to change how the hashing function
> works to do a single hash for a whole series of ONLY in a row instead of a
> new one at every level, which would make the attacker only able to force
> extra memory usage instead of extra CPU, but this is a slightly annoying
> thing to write to stop a fairly lame attack, so I'm at least not doing it
> for my initial implementation. I could likely be convinced that it's worth
> doing before an actual release though. There's another implementation tri=
ck
> to do the same thing for memory usage, which is much more in the 'do late=
r'
> category because it doesn't involve changing the format and hence it can =
be
> put off.
>=20
>=20
> > In particular, case #2 handles your leaf node optimizations generically,
> > without special cases and additional complexity. It'd also be a better =
way
> > to
> > do the ONLY0/1 cases, as if the "nothing on this side" symbol is a sing=
le
> > byte,
> > each additional colliding level would simply extend the commitment with=
out
> > hashing. In short, you'd have nearly the same level of optimization even
> > if at
> > the cryptography level your tree consists of only leaves, inner nodes, =
and
> > nil.
> >
>=20
> I'm taking pains to make all the hashing be of fixed-size things, so that=
 a
> non-padding variant of a secure hashing algorithm can be used. The chains
> of ONLY thing above would force a special exception to that, which can be
> done but is annoying. Making things smaller than a single block (64 bytes)
> won't speed up hashing time, and making things a single byte longer than
> that doubles it.

Have you seen how BLAKE2 omits padding when the data to be hashed happens t=
o be
exactly one block in size? It's significantly faster than SHA256, and that'=
s a
standard part of the algorithm already.

> > > > Technically even a patricia trie utxo commitment can have sub-1 cac=
he
> > > > > misses per update if some of the updates in a single block are cl=
ose
> > to
> > > > > each other in memory. I think I can get practical Bitcoin updates
> > down
> > > > to a
> > > > > little bit less than one l2 cache miss per update, but not a lot
> > less.
> > > >
> > > > I'm very confused as to why you think that's possible. When you say
> > > > "practical
> > > > Bitcoin updates", what exactly is the data structure you're proposi=
ng
> > to
> > > > update? How is it indexed?
> >
>=20
> I'll re-answer this because I did a terrible job before. The entire data
> structure consists of nodes which contain a metadata byte (TERM0, ONLY1,
> etc.) followed by fixes size secure hashes, and (in some cases) pointers =
to
> where the children are. The secure hashes in parent nodes are found by
> hashing their children verbatim (or the stored string in the case of a
> TERM). This is very conventional. All of the cleverness is in where in
> memory these nodes are stored so that tracing down the tree causes very f=
ew
> cache misses.
>=20
> (The alternate approach is to have each node store its own hash rather th=
an
> that be stored by the parent. That approach means that when you're
> recalculating you have to look up siblings which doubles the number of
> cache misses. Not such a hot idea.)

Have you benchmarked the cost of a hash operation vs. the cost of a cache m=
iss?
What are the actual numbers?

> At the root there's a branch block. It consists of all nodes up to some
> fixed depth - let's say 12 - with that depth set so that it roughly fits
> within a single memory page. Branch blocks are arranged with the nodes in
> fixed position defined by the prefix they correspond to, and the terminals
> have outpointers to other blocks. Because they're all clustered together,=
 a
> lookup or update will only require a single

A single....?

> Below the root block are other branch blocks. Each of them has a fixed 12
> bit prefix it is responsible for. When doing a lookup a second cache miss
> will be hit for levels 13-24, because those are all clustered in the same
> branch block.

So, is this also how the data structure looks cryptographically, or is the =
way
it's hashed separate from the above description?

> Below the second level of root block (at Bitcoin utxo set scale - this
> varies based on how much is stored) there are leaf blocks. A leaf block
> consists of nodes with outpointers to its own children which must be with=
in
> the same leaf block. All entry points into a leaf block are from the same
> branch block, and the leaf block has no out pointers to other blocks. When
> a leaf block overflows the entry point into it which overflowed is moved
> into the active leaf for that branch, and if that's full a new one is
> allocated. There's some subtlety to exactly how this is done, but I've
> gotten everything down to simple expedient tricks with straightforward
> implementations. The thing which matters for now is that there's only a
> single cache miss for each leaf node, because they also fit in a page.

Page as in 4096 bytes? But L1/L2/L3 cache is arranged in terms of 64 byte c=
ache
lines - where do pages come in here?

At Bitcoin UTXO set scale, how large do you think these data structures are?

> So at Bitcoin scale there will probably only be 3 cache misses for a
> lookup, and that's a worst case scenario. The first one is probably always
> warm, bringing it down to 2, and if you do a bunch in sorted order they'll
> probably hit the same second level branches repeatedly bringing it down to
> 1, and might even average less than that if there are enough that the leaf
> block has multiple things being accessed.

"Sorted order" - what exact type of sorting do you mean here?

> > Anyway hashing is pretty slow. The very fast BLAKE2 is about 3 cycles/b=
yte
> > (SHA256 is about 15 cycles/byte) so hashing that same data would take
> > around
> > 200 cycles, and probably quite a bit more in practice due to overheads
> > from our
> > short message lengths; fetching a cache line from DRAM only takes about
> > 1,000
> > cycles. I'd guess that once other overheads are taken into account, even
> > if you
> > could eliminate L2/L3 cache-misses it wouldn't be much of an improvemen=
t.
> >
>=20
> Those numbers actually back up my claims about performance. If you're doi=
ng
> a single update and recalculating the root afterwards, then the amount of
> rehashing to be done is about 30 levels deep times 64 bytes per thing
> hashed times 15 cycles per byte then it's about 28,800 cycles of hashing.
> If you have a naive radix tree implementation which hits a cache miss at
> every level then that's 30,000 cycles, which is about half the performance
> time, certainly worth optimizing. If instead of sha256 you use blake2
> (Which sounds like a very good idea!) then hashing for an update will be
> about 5760 cycles and performance will be completely dominated by cache
> misses. If a more cache coherent implementation is used, then the cost of
> cache misses will be 3000 cycles, which will be a non-factor with sha256
> and a significant but not dominating one with blake2.

But that's assuming the dataset in question fits in cache; I don't see how =
it
does. Since it doesn't, I'm argung the total % improvement by _any_ cache
optimization on the subset that does fit in cache will be relatively small.

Again, how large a dataset do you think you're working with here?

--=20
https://petertodd.org 'peter'[:-1]@petertodd.org

--pf9I7BMVVzbSWLtt
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: Digital signature

-----BEGIN PGP SIGNATURE-----

iQEcBAEBCAAGBQJXZdLVAAoJEGOZARBE6K+yqBEH/Rxh0oG5TAfqXNFc2HAhtdvC
6jdjefWX4Xgu5E6GhEEHzybPMMoLwfrM1lCkast02ezmW8PyBLnYe1feROX6jk8H
AGH2xA4m3AH5xOpo+E5ZJWKxKkntnD3zxE/V8LoWyavaup7Qq8BOUJ4JjvbDgUH9
yAOfCBDCFRIQXZEeIRlJovMo8kKnMKTVXDfkafiY0HfgSVmjLJUI5m6SD8QJaD12
EW0dCA/RvL05xcxJbKkKNJ9tfBfalEVn7Zyw5bXQCTadW/VGqsBmJthtISeL0cmp
VpjeERJzOLZhI5kEp9qxdYzu2oMwCSL7UdN9BJcG4ChOgIhHQRSIEdT8J1ZaTqI=
=x1wS
-----END PGP SIGNATURE-----

--pf9I7BMVVzbSWLtt--