72/6ee529fdcc5a821035a8b1a0af0a2bada2e305


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349

Return-Path: <blueadept@gmail.com>
Received: from smtp1.linuxfoundation.org (smtp1.linux-foundation.org
	[172.17.192.35])
	by mail.linuxfoundation.org (Postfix) with ESMTPS id B4DCFBC8
	for <bitcoin-dev@lists.linuxfoundation.org>;
	Fri,  2 Jun 2017 17:55:34 +0000 (UTC)
X-Greylist: whitelisted by SQLgrey-1.7.6
Received: from mail-qk0-f177.google.com (mail-qk0-f177.google.com
	[209.85.220.177])
	by smtp1.linuxfoundation.org (Postfix) with ESMTPS id 89E9F163
	for <bitcoin-dev@lists.linuxfoundation.org>;
	Fri,  2 Jun 2017 17:55:33 +0000 (UTC)
Received: by mail-qk0-f177.google.com with SMTP id d14so65299569qkb.1
	for <bitcoin-dev@lists.linuxfoundation.org>;
	Fri, 02 Jun 2017 10:55:33 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
	h=mime-version:sender:in-reply-to:references:from:date:message-id
	:subject:to; bh=IoGFXj7mNtDlAt2lqU1fAXrz5SZCKa8JP4KRs2N7kZI=;
	b=ruNzOoGTlsrTyZBfbfsPMXbYjnR6WwUCagfHC2nMJwE8ZmDj80/GQZ9YMg+ZOVieaN
	St/iRofzaWHAre8xIQb2Z7tXAY1SR28Pb4XTS8KZdwb0vWJCFNRuJpTVDI7U5RbyJ8tw
	fAWDZw1Clf8ezBqCiNuQ6nhxoq8NBrxNaUOjydBiatByfEryZP+lLpFvt22VfJpjZLPL
	mSHVY1ICLkUl61VfsxO1tmlni7u8mq9lV0Ov3h5vRKwznHM7E0LKoAEsuQt78ukEg3u8
	slMUnrWZzXpAs+ENprus/ovEAv5U7mbTew/YAN3uHV3Lo7rpI02dzZ0FKAQBc2PmuB8a
	ZrQg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20161025;
	h=x-gm-message-state:mime-version:sender:in-reply-to:references:from
	:date:message-id:subject:to;
	bh=IoGFXj7mNtDlAt2lqU1fAXrz5SZCKa8JP4KRs2N7kZI=;
	b=gNB7XHFCPOH9yuOkpFsdgZ+AQO2n5lqcveCFruinT6AkW/PHZNv7zxx85UfAZU6nlN
	AudXdvYualhenpHqMOm65OzJOs/4gyRS8fAIUB7TsKePgPZ54dS0XLJj8I53f5iJV2Op
	2BYTc76A8EmmAFRBN5S5eM87txS2JALCxjng9iSzxn4pk4WJP7EPLXYQ2Kdh50NRZpK3
	/iwDqs3eWzbnfE4Zztdm5c36r3eiiWWqg7dGQu0H1N87Kqjpz7Q+un/Z79VPtF3MTTKG
	1HZcEcBNvN2cijNYwxERqgRucJ5cYE7tmViL2Y4odXu0fCqiMm++FRqPpKnXSY3TA60x
	xlIQ==
X-Gm-Message-State: AODbwcBcRyar9onQyUDDDB6FNGN83PC6Vh6E2Dfn1adNH11l9TnpOKM9
	TvhyTFTw5wTD1yvEn1qZCCS3Y5DglVKi
X-Received: by 10.55.144.5 with SMTP id s5mr9118886qkd.39.1496426132270; Fri,
	02 Jun 2017 10:55:32 -0700 (PDT)
MIME-Version: 1.0
Sender: blueadept@gmail.com
Received: by 10.12.144.139 with HTTP; Fri, 2 Jun 2017 10:55:31 -0700 (PDT)
In-Reply-To: <CAE0pnxJxHYQ4+2pt3tt=1WZ0-K0vDxGB4KBXY+R=WfktMmATwA@mail.gmail.com>
References: <CAO3Pvs8ccTkgrecJG6KFbBW+9moHF-FTU+4qNfayeE3hM9uRrg@mail.gmail.com>
	<CALJw2w5gUgbdX7XnxPsK2FZ6PZ5cSTgmCEqiPu7-S4gwXBM-_Q@mail.gmail.com>
	<CAE0pnx+RRAP269VeWAcxKbrcS9qX4LS8_6nY_js8X5NtQ22t_A@mail.gmail.com>
	<CAE0pnxLKYnwHnktTqW949s1AA9uK=6WnVYWmRoau8B1SszzYEg@mail.gmail.com>
	<CAE0pnxJxHYQ4+2pt3tt=1WZ0-K0vDxGB4KBXY+R=WfktMmATwA@mail.gmail.com>
From: Alex Akselrod <alex@akselrod.org>
Date: Fri, 2 Jun 2017 13:55:31 -0400
X-Google-Sender-Auth: q5yXWqCB3djhftYiJ_m7r9SutWg
Message-ID: <CAE0pnxK5r2XfVks=emkK=v66XRN5c-Sz-Lm_dKY+6nO=kPk6Vw@mail.gmail.com>
To: Bitcoin Dev <bitcoin-dev@lists.linuxfoundation.org>
Content-Type: multipart/alternative; boundary="94eb2c08504a0bf8b00550fddc89"
X-Spam-Status: No, score=-1.4 required=5.0 tests=BAYES_00,DKIM_SIGNED,
	DKIM_VALID, FREEMAIL_FROM, HTML_MESSAGE, RCVD_IN_DNSWL_NONE,
	RCVD_IN_SORBS_SPAM autolearn=no version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	smtp1.linux-foundation.org
X-Mailman-Approved-At: Fri, 02 Jun 2017 17:57:23 +0000
Subject: Re: [bitcoin-dev] BIP Proposal: Compact Client Side Filtering for
 Light Clients
X-BeenThere: bitcoin-dev@lists.linuxfoundation.org
X-Mailman-Version: 2.1.12
Precedence: list
List-Id: Bitcoin Protocol Discussion <bitcoin-dev.lists.linuxfoundation.org>
List-Unsubscribe: <https://lists.linuxfoundation.org/mailman/options/bitcoin-dev>,
	<mailto:bitcoin-dev-request@lists.linuxfoundation.org?subject=unsubscribe>
List-Archive: <http://lists.linuxfoundation.org/pipermail/bitcoin-dev/>
List-Post: <mailto:bitcoin-dev@lists.linuxfoundation.org>
List-Help: <mailto:bitcoin-dev-request@lists.linuxfoundation.org?subject=help>
List-Subscribe: <https://lists.linuxfoundation.org/mailman/listinfo/bitcoin-dev>,
	<mailto:bitcoin-dev-request@lists.linuxfoundation.org?subject=subscribe>
X-List-Received-Date: Fri, 02 Jun 2017 17:55:34 -0000

--94eb2c08504a0bf8b00550fddc89
Content-Type: text/plain; charset="UTF-8"

On Jun 2, 2017 8:09 AM, "Karl Johan Alm via bitcoin-dev" <
bitcoin-dev@lists.linuxfoundation.org> wrote:

Hello,

Really wish I'd known you were working on this a few weeks ago, but
such is life. Hopefully I can provide some useful feedback.


Your feedback is greatly appreciated!


On Fri, Jun 2, 2017 at 4:01 AM, Olaoluwa Osuntokun via bitcoin-dev
<bitcoin-dev@lists.linuxfoundation.org> wrote:
> Full-nodes
> maintain an additional index of the chain, and serve this compact filter
> (the index) to light clients which request them. Light clients then fetch
> these filters, query the locally and _maybe_ fetch the block if a relevant
> item matches.

Is it necessary to maintain the index all the way to the beginning of
the chain? When would clients request "really old digests" and why?


Without a soft fork, this is the only way for light clients to verify that
peers aren't lying to them. Clients can request headers (just hashes of the
filters and the previous headers, creating a chain) and look for conflicts
between peers. If a conflict is found at a certain block, the client can
download the block, generate a filter, calculate the header by hashing
together the previous header and the generated filter, and banning any
peers that don't match. A full node could prune old filters if you wanted
and recalculate them as necessary if you just keep the filter header chain
info as really old filters are unlikely to be requested by correctly
written software but you can't guarantee every client will follow best
practices either.


> The calculator can be found here:
> https://aakselrod.github.io/gcs_calc.html. Alex also has an empirical
> script he's been running on actual data, and the results seem to match up
> rather nicely.

I haven't tried the tool yet, and maybe it will answer some of my questions.

On what data were the simulated wallets on actual data based? How did
false positive rates for wallets with lots of items (pubkeys etc) play
out? Is there a maximum number of items for a wallet before it becomes
too bandwidth costly to use digests?


The simulations are based on completely random data within given
parameters. For example, it will generate a wallet of a specified size and
generate blocks of specified size with specified number of transactions of
specified format, all guaranteed to not match the wallet. It then tries to
match the wallet and tracks the filter size and the bandwidth used by block
downloads which are all due to false positives. The maximum wallet size can
be millions or more of addresses and outpoints before the filter isn't
worth it.

I published the simulation code at
https://gist.github.com/aakselrod/0ee665205f7c9538c2339876b0424b26 but the
calculation code gives you the same results (on average but very close with
a big enough sample size) much faster.

I will definitely try to reproduce my experiments with Golomb-Coded
sets and see what I come up with. It seems like you've got a little
less than half the size of my digests for 1-block digests but I
haven't tried making digests for all blocks (and lots of early blocks
are empty).


Filters for empty blocks only take a few bytes and sometimes zero when the
coinbase output is a burn that doesn't push any data (example will be in
the test vectors that I'll have ready shortly).

On the BIP proposal itself:

In Compact Filter Header Chain, you mention that clients should
download filters from nodes if filter_headers is not identical, and
ban offending nodes. What about temporary forks in the chain? What
about longer forks? In general, I am curious how you will deal with
reorgs and temporary non-consensus related chain splits.


The cfheaders messages give you the hash of the final block for which
there's a header in the message. This means you can ignore the message as
necessary rather than ban the peer, or track cfheaders for multiple forks
if desired.


I am also curious if you have considered digests containing multiple
blocks. Retaining a permanent binsearchable record of the entire chain
is obviously too space costly, but keeping the last X blocks as
binsearchable could speed up syncing for clients tremendously, I feel.


We hadn't (or I hadn't) until we read your recent post/paper and are
considering it now.


It may also be space efficient to ONLY store older digests in chunks
of e.g. 8 blocks. A client syncing up finding a match in an 8-block
chunk would have to grab those 8 blocks, but if it's not recent, that
may be acceptable. It may even be possible to make 4-, 2-, 1-block
digests on demand.


This is also something we (or at least I) hadn't considered before your
recent post. We have been working on this for a few months now so didn't
have time to work on trying out and possibly incorporating the idea before
release.

How fast are these to create? Would it make sense to provide digests
on demand in some cases, rather than keeping them around indefinitely?


They're pretty fast and can be pruned if desired, as mentioned above, as
long as the header chain is kept.

--94eb2c08504a0bf8b00550fddc89
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">On Jun 2, 2017 8:09 AM, &quot;Karl Johan Alm via bitcoin-d=
ev&quot; &lt;<a href=3D"mailto:bitcoin-dev@lists.linuxfoundation.org" targe=
t=3D"_blank">bitcoin-dev@lists.linuxfounda<wbr>tion.org</a>&gt; wrote:<br t=
ype=3D"attribution"><div class=3D"gmail_quote"><div dir=3D"auto"><span clas=
s=3D"gmail-"><div><div class=3D"gmail_extra"><div class=3D"gmail_quote"><bl=
ockquote class=3D"gmail-m_-8755837104329081286m_-5530274791290088381quote" =
style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);pa=
dding-left:1ex">Hello,<br>
<br>
Really wish I&#39;d known you were working on this a few weeks ago, but<br>
such is life. Hopefully I can provide some useful feedback.<br></blockquote=
></div></div></div><div dir=3D"auto"><br></div></span><div dir=3D"auto">You=
r feedback is greatly appreciated!</div><span class=3D"gmail-"><div dir=3D"=
auto"><br></div><div dir=3D"auto"><div class=3D"gmail_extra"><div class=3D"=
gmail_quote"><blockquote class=3D"gmail-m_-8755837104329081286m_-5530274791=
290088381quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb=
(204,204,204);padding-left:1ex">
<div class=3D"gmail-m_-8755837104329081286m_-5530274791290088381quoted-text=
"><br>
On Fri, Jun 2, 2017 at 4:01 AM, Olaoluwa Osuntokun via bitcoin-dev<br>
&lt;<a href=3D"mailto:bitcoin-dev@lists.linuxfoundation.org" target=3D"_bla=
nk">bitcoin-dev@lists.linuxfounda<wbr>tion.org</a>&gt; wrote:<br>
&gt; Full-nodes<br>
&gt; maintain an additional index of the chain, and serve this compact filt=
er<br>
&gt; (the index) to light clients which request them. Light clients then fe=
tch<br>
&gt; these filters, query the locally and _maybe_ fetch the block if a rele=
vant<br>
&gt; item matches.<br>
<br>
</div>Is it necessary to maintain the index all the way to the beginning of=
<br>
the chain? When would clients request &quot;really old digests&quot; and wh=
y?<br></blockquote></div></div></div><div dir=3D"auto"><br></div></span><di=
v dir=3D"auto">Without a soft fork, this is the only way for light clients =
to verify that peers aren&#39;t lying to them. Clients can request headers =
(just hashes of the filters and the previous headers, creating a chain) and=
 look for conflicts between peers. If a conflict is found at a certain bloc=
k, the client can download the block, generate a filter, calculate the head=
er by hashing together the previous header and the generated filter, and ba=
nning any peers that don&#39;t match. A full node could prune old filters i=
f you wanted and recalculate them as necessary if you just keep the filter =
header chain info as really old filters are unlikely to be requested by cor=
rectly written software but you can&#39;t guarantee every client will follo=
w best practices either.</div><span class=3D"gmail-"><div dir=3D"auto"><br>=
</div><div dir=3D"auto"><div class=3D"gmail_extra"><div class=3D"gmail_quot=
e"><blockquote class=3D"gmail-m_-8755837104329081286m_-5530274791290088381q=
uote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,2=
04);padding-left:1ex">
<div class=3D"gmail-m_-8755837104329081286m_-5530274791290088381quoted-text=
"><br>&gt; The calculator can be found here:<br>
&gt; <a href=3D"https://aakselrod.github.io/gcs_calc.html" rel=3D"noreferre=
r" target=3D"_blank">https://aakselrod.github.io/gc<wbr>s_calc.html</a>. Al=
ex also has an empirical<br>
&gt; script he&#39;s been running on actual data, and the results seem to m=
atch up<br>
&gt; rather nicely.<br>
<br>
</div>I haven&#39;t tried the tool yet, and maybe it will answer some of my=
 questions.<br>
<br>
On what data were the simulated wallets on actual data based? How did<br>
false positive rates for wallets with lots of items (pubkeys etc) play<br>
out? Is there a maximum number of items for a wallet before it becomes<br>
too bandwidth costly to use digests?<br></blockquote></div></div></div><div=
 dir=3D"auto"><br></div></span><div dir=3D"auto">The simulations are based =
on completely random data within given parameters. For example, it will gen=
erate a wallet of a specified size and generate blocks of specified size wi=
th specified number of transactions of specified format, all guaranteed to =
not match the wallet. It then tries to match the wallet and tracks the filt=
er size and the bandwidth used by block downloads which are all due to fals=
e positives. The maximum wallet size can be millions or more of addresses a=
nd outpoints before the filter isn&#39;t worth it.</div><div dir=3D"auto"><=
br></div><div dir=3D"auto">I published the simulation code at <a href=3D"ht=
tps://gist.github.com/aakselrod/0ee665205f7c9538c2339876b0424b26">https://g=
ist.github.com/aakselrod/0ee665205f7c9538c2339876b0424b26</a> but the calcu=
lation code gives you the same results (on average but very close with a bi=
g enough sample size) much faster.</div><span class=3D"gmail-"><div dir=3D"=
auto"><br></div><div dir=3D"auto"><div class=3D"gmail_extra"><div class=3D"=
gmail_quote"><blockquote class=3D"gmail-m_-8755837104329081286m_-5530274791=
290088381quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb=
(204,204,204);padding-left:1ex"><div class=3D"gmail-m_-8755837104329081286m=
_-5530274791290088381quoted-text">I will definitely try to reproduce my exp=
eriments with Golomb-Coded<br></div>
sets and see what I come up with. It seems like you&#39;ve got a little<br>
less than half the size of my digests for 1-block digests but I<br>
haven&#39;t tried making digests for all blocks (and lots of early blocks<b=
r>
are empty).<br></blockquote></div></div></div><div dir=3D"auto"><br></div><=
/span><div dir=3D"auto">Filters for empty blocks only take a few bytes and =
sometimes zero when the coinbase output is a burn that doesn&#39;t push any=
 data (example will be in the test vectors that I&#39;ll have ready shortly=
).</div><span class=3D"gmail-"><div dir=3D"auto"><br></div><div dir=3D"auto=
"><div class=3D"gmail_extra"><div class=3D"gmail_quote"><blockquote class=
=3D"gmail-m_-8755837104329081286m_-5530274791290088381quote" style=3D"margi=
n:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex=
">On the BIP proposal itself:<br>
<br>
In Compact Filter Header Chain, you mention that clients should<br>
download filters from nodes if filter_headers is not identical, and<br>
ban offending nodes. What about temporary forks in the chain? What<br>
about longer forks? In general, I am curious how you will deal with<br>
reorgs and temporary non-consensus related chain splits.<br></blockquote></=
div></div></div><div dir=3D"auto"><br></div></span><div dir=3D"auto">The cf=
headers messages give you the hash of the final block for which there&#39;s=
 a header in the message. This means you can ignore the message as necessar=
y rather than ban the peer, or track cfheaders for multiple forks if desire=
d.</div><span class=3D"gmail-"><div dir=3D"auto"><br></div><div dir=3D"auto=
"><div class=3D"gmail_extra"><div class=3D"gmail_quote"><blockquote class=
=3D"gmail-m_-8755837104329081286m_-5530274791290088381quote" style=3D"margi=
n:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex=
"><br>
I am also curious if you have considered digests containing multiple<br>
blocks. Retaining a permanent binsearchable record of the entire chain<br>
is obviously too space costly, but keeping the last X blocks as<br>
binsearchable could speed up syncing for clients tremendously, I feel.<br><=
/blockquote></div></div></div><div dir=3D"auto"><br></div></span><div dir=
=3D"auto">We hadn&#39;t (or I hadn&#39;t) until we read your recent post/pa=
per and are considering it now.</div><span class=3D"gmail-"><div dir=3D"aut=
o"><br></div><div dir=3D"auto"><div class=3D"gmail_extra"><div class=3D"gma=
il_quote"><blockquote class=3D"gmail-m_-8755837104329081286m_-5530274791290=
088381quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(20=
4,204,204);padding-left:1ex">
<br>
It may also be space efficient to ONLY store older digests in chunks<br>
of e.g. 8 blocks. A client syncing up finding a match in an 8-block<br>
chunk would have to grab those 8 blocks, but if it&#39;s not recent, that<b=
r>
may be acceptable. It may even be possible to make 4-, 2-, 1-block<br>
digests on demand.<br></blockquote></div></div></div><div dir=3D"auto"><br>=
</div></span><div dir=3D"auto">This is also something we (or at least I) ha=
dn&#39;t considered before your recent post. We have been working on this f=
or a few months now so didn&#39;t have time to work on trying out and possi=
bly incorporating the idea before release.</div><div dir=3D"auto"><br></div=
><div dir=3D"auto"><span class=3D"gmail-"><div class=3D"gmail_extra"><div c=
lass=3D"gmail_quote"><blockquote class=3D"gmail-m_-8755837104329081286m_-55=
30274791290088381quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px s=
olid rgb(204,204,204);padding-left:1ex">How fast are these to create? Would=
 it make sense to provide digests<br>
on demand in some cases, rather than keeping them around indefinitely?</blo=
ckquote></div><br></div></span><div class=3D"gmail_extra" dir=3D"auto">They=
&#39;re pretty fast and can be pruned if desired, as mentioned above, as lo=
ng as the header chain is kept.</div></div></div>
</div><br></div>

--94eb2c08504a0bf8b00550fddc89--