In-Reply-To: <20160618230143.GA25017@fedora-21-dvm>
From: Bram Cohen <bram@bittorrent.com>
Date: Fri, 15 Jul 2016 16:00:57 -0700
Message-ID: <CA+KqGkoxRdCas2hWdbD92Z-HjnS9PTPVQu=oSwUT261S+4y4cw@mail.gmail.com>
To: Peter Todd <pete@petertodd.org>
Cc: Bitcoin Protocol Discussion <bitcoin-dev@lists.linuxfoundation.org>
Subject: Re: [bitcoin-dev] Merkle trees and mountain ranges

On Sat, Jun 18, 2016 at 4:01 PM, Peter Todd <pete@petertodd.org> wrote:

> Have you seen how BLAKE2 omits padding when the data to be hashed happens
> to be exactly one block in size? It's significantly faster than SHA256,
> and that's a standard part of the algorithm already.

That's very convenient! I didn't know that, but I had 'look up how BLAKE2
does padding' on my list of things to do. I'm leaning heavily towards using
BLAKE2b at this point, at least for internal hashing.
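
For concreteness, here's a minimal sketch of what internal-node hashing with
BLAKE2b could look like (Python, purely illustrative, not taken from any
real implementation; the function name is made up): two 32-byte child hashes
concatenated are only 64 bytes, which fits inside BLAKE2b's 128-byte block,
so hashing a node takes a single compression call with no padding block.

    import hashlib

    def hash_internal_node(left: bytes, right: bytes) -> bytes:
        # 64 bytes of input fits in one 128-byte BLAKE2b block,
        # so no extra padding block is needed.
        # digest_size=32 keeps node hashes the same width as the children.
        return hashlib.blake2b(left + right, digest_size=32).digest()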


> > At the root there's a branch block. It consists of all nodes up to some
> > fixed depth - let's say 12 - with that depth set so that it roughly fits
> > within a single memory page. Branch blocks are arranged with the nodes in
> > fixed position defined by the prefix they correspond to, and the
> > terminals have outpointers to other blocks. Because they're all clustered
> > together, a lookup or update will only require a single
>
> A single....?

Okay, I've figured out the root cause of general confusion here. It's
mostly my fault.

There are a few different media on which data can be stored, with different
properties in terms of how long it takes to retrieve data from them and how
much readahead they typically do. I was misreading the L2 cache size as the
main memory readahead amount, which is... probably wrong? The readahead
properties of main memory aren't well documented and apparently vary a lot.
On SSDs a read typically pulls in about a kilobyte at once, and those units
are called pages, hence my use of that term above. But since my real point
is that my implementation should act as a silver bullet with acceptable
performance even on extremely bad devices, I'll give an analysis of how well
it works when everything is stored on a regular spinning hard drive.

Let's say you're storing 100 million items, which will fit within 10
gigabytes. If you set the branch block depth to about 10 bits, branch blocks
will be about 32K, and if you set the size of leaf blocks to be about the
same, then memory efficiency will be good because the leaf blocks will store
twigs of about 2^7 entries while having room for 2^10, so they'll fit
reasonably. Almost everything will be three blocks from the root, so updates
will generally require two disk seeks (plus one more for a write, but writes
are generally faster because they get batched).
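
To make those numbers concrete, here's the back-of-the-envelope arithmetic
(Python, purely illustrative; the constants are the assumptions above, not
measurements):

    items = 100_000_000
    hash_size = 32                        # assumed bytes per entry hash
    branch_depth_bits = 10                # prefix bits covered per block

    # a block covering 10 bits of prefix has on the order of 2^10 entries
    branch_block_bytes = (2 ** branch_depth_bits) * hash_size
    print(branch_block_bytes)             # 32768 -> "about 32K"

    # Three levels of 10-bit blocks cover 2^30 prefixes, comfortably more
    # than 100 million (~2^26.6) items, so almost every item sits three
    # blocks from the root.
    print(items < 2 ** (3 * branch_depth_bits))   # True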

For latency numbers, I'm going off these:
https://gist.github.com/jboner/2841832

If the blockchain is very full of simple transactions and a disk seek takes
15 milliseconds, then going with the assumptions that a block is about 600
seconds, the blockchain can handle 4 transactions per second, and each
transaction is 3 updates (one utxo spent plus two new ones created), that's
15 * 600 * 4 * 3 * 2 milliseconds per block, or about 200 seconds per block,
so if the utxo roots trail by a few blocks even a ludicrously underpowered
node could keep up.
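
Spelling that arithmetic out (purely illustrative; the 15 ms figure is the
ballpark spinning-disk seek time from the gist above):

    seek_ms = 15              # random seek on a spinning disk
    block_seconds = 600       # ~10 minutes per block
    tx_per_second = 4
    updates_per_tx = 3        # one utxo spent, two created
    seeks_per_update = 2      # two reads; the write is batched separately

    ms_per_block = (seek_ms * block_seconds * tx_per_second
                    * updates_per_tx * seeks_per_update)
    print(ms_per_block / 1000)   # 216.0 seconds, i.e. "about 200 seconds"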

On an SSD keeping up is completely trivial; the problem becomes one of how
quickly you can validate an entire blockchain history. There a read takes
about 0.15 milliseconds and you have to do 5 of them instead of 2 because
the natural memory block size is 4k, so it's about 1 millisecond per update,
or 600 * 4 * 3 milliseconds of total read time for each block, which is
about 7 seconds. That's large enough that making the utxo root trail by two
blocks is still a good idea, but small enough that it isn't a big factor in
the cost of running a node. It is enough that validating a complete block
history might take a while though, and even validating just the last year
would take a few days. This is very conservative, and it assumes that
*everything* is kept on an SSD. If the numbers work out better and a few
layers are kept in regular memory, validating a whole year of history might
only take a few hours.
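
The same style of check for the SSD case (again purely illustrative; 0.15 ms
is the ballpark random-read latency from the gist):

    read_ms = 0.15
    reads_per_update = 5           # 4k pages mean ~5 reads instead of 2
    ms_per_update = read_ms * reads_per_update     # 0.75, call it ~1 ms

    updates_per_block = 600 * 4 * 3                # 7200 updates
    seconds_per_block = updates_per_block * 1.0 / 1000   # ~7 s at ~1 ms each

    blocks_per_year = 365 * 24 * 6                 # ten-minute blocks
    days_for_a_year = blocks_per_year * seconds_per_block / 86400
    print(round(days_for_a_year, 1))               # ~4.4 -> "a few days"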

Hopefully that all makes a fairly good case that a raw merkle tree with the
utxo root trailing by a few blocks is a viable strategy. The data structures
in the MMR proposal are fairly complicated, and the analysis of them talks
in somewhat vague terms about things being fast and slow. A similar analysis
of the MMR proposal, specifying storage media and expected latency numbers,
would clarify the reasoning a lot.

(By the way, sorry for the slow response - I got preempted by a bunch of
other work duties.)
