MIME-Version: 1.0
References: <CAO3Pvs8ccTkgrecJG6KFbBW+9moHF-FTU+4qNfayeE3hM9uRrg@mail.gmail.com>
	<CALJw2w5gUgbdX7XnxPsK2FZ6PZ5cSTgmCEqiPu7-S4gwXBM-_Q@mail.gmail.com>
	<CAE0pnx+RRAP269VeWAcxKbrcS9qX4LS8_6nY_js8X5NtQ22t_A@mail.gmail.com>
	<CAE0pnxLKYnwHnktTqW949s1AA9uK=6WnVYWmRoau8B1SszzYEg@mail.gmail.com>
	<CAE0pnxJxHYQ4+2pt3tt=1WZ0-K0vDxGB4KBXY+R=WfktMmATwA@mail.gmail.com>
	<CAE0pnxK5r2XfVks=emkK=v66XRN5c-Sz-Lm_dKY+6nO=kPk6Vw@mail.gmail.com>
	<CALJw2w6Vzq8PO3x607=ERK4XKU2vrHApqKP2rWm-sw2r1ZOJMw@mail.gmail.com>
In-Reply-To: <CALJw2w6Vzq8PO3x607=ERK4XKU2vrHApqKP2rWm-sw2r1ZOJMw@mail.gmail.com>
From: Olaoluwa Osuntokun <laolu32@gmail.com>
Date: Fri, 09 Jun 2017 03:03:51 +0000
Message-ID: <CAO3Pvs-0h=E0ZQmOHcNE9Q+XgJJb7761jz9QxgginMb6+n4ogw@mail.gmail.com>
To: Karl Johan Alm <karljohan-alm@garage.co.jp>,
	Alex Akselrod <alex@akselrod.org>
Content-Type: multipart/alternative; boundary="94eb2c002330b9368f05517e38ac"
Cc: Bitcoin Dev <bitcoin-dev@lists.linuxfoundation.org>
Subject: Re: [bitcoin-dev] BIP Proposal: Compact Client Side Filtering for
 Light Clients
Precedence: list

--94eb2c002330b9368f05517e38ac
Content-Type: text/plain; charset="UTF-8"

Karl wrote:

> I am also curious if you have considered digests containing multiple
> blocks. Retaining a permanent binsearchable record of the entire chain is
> obviously too space costly, but keeping the last X blocks as binsearchable
> could speed up syncing for clients tremendously, I feel.

We hadn't originally considered such an idea. Now that I understand the
concept a bit better, I can see how it could yield considerable bandwidth
savings (for purely negative queries) for clients doing a historical sync,
or catching up to the chain after being inactive for weeks or months.

If we were to pursue tacking this approach onto the current BIP proposal,
we could do it in the following way:

   * The `getcfilter` message gains an additional "Level" field. Using
     this field, the range of blocks to be included in the returned filter
     would be 2^Level. So a level of 0 is just the single filter, 3 is 8
     blocks past the block hash, etc.

   * Similarly, the `getcfheaders` message would gain a field with
     identical semantics. In this case each "level" would have a
     distinct header chain for clients to verify.
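
To make the level arithmetic concrete, here's a rough sketch. The
function names are made up for illustration, and whether the range starts
at the requested block or just past it is an assumption, not something
the proposal pins down yet:

```python
def blocks_covered(level: int) -> int:
    """Number of blocks a filter at the given level would span (2^level)."""
    if level < 0:
        raise ValueError("level must be non-negative")
    return 1 << level


def block_range(start_height: int, level: int) -> range:
    """Heights covered by one multi-block filter.

    Assumption: the range begins at the requested block's height.
    """
    return range(start_height, start_height + blocks_covered(level))


if __name__ == "__main__":
    for level in range(4):
        print(f"level {level}: {blocks_covered(level)} block(s)")
```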

> How fast are these to create? Would it make sense to provide digests on
> demand in some cases, rather than keeping them around indefinitely?

For larger blocks (like the one referenced at the end of this mail), full
construction of the regular filter takes ~10-20ms (most of it spent
extracting the data pushes). With smaller blocks, it quickly dips down to
the nanosecond-to-microsecond range.

Whether to keep _all_ the filters on disk or to dynamically re-generate a
particular range (possibly most of the historical data) is an
implementation detail. Nodes that already do block pruning could discard
very old filters once the header chain is constructed, saving additional
space, as it's unlikely most clients would care about the first 300k or
so blocks.

> Ahh, so you actually make a separate digest chain with prev hashes and
> everything. Once/if committed digests are soft forked in, it seems a bit
> overkill but maybe it's worth it.

Yep, this is only a hold-over until (if and when) a commitment to the
filter is soft-forked in. In that case, there could be some extension
message to fetch the filter hash for a particular block, along with a
merkle proof of the coinbase transaction to the merkle root in the header.
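
For anyone following along, the header chain idea can be sketched roughly
as below. The exact hash function and concatenation order here
(double-SHA256 over the filter hash followed by the previous header) are
assumptions for illustration, as are the all-zero starting header and the
function names:

```python
import hashlib


def double_sha256(data: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def next_filter_header(prev_header: bytes, filter_bytes: bytes) -> bytes:
    # Assumption: header = H(H(filter) || prev_header).
    return double_sha256(double_sha256(filter_bytes) + prev_header)


def build_header_chain(filters, start=b"\x00" * 32):
    """Chain each filter to the previous header, one header per block."""
    headers = []
    prev = start
    for filter_bytes in filters:
        prev = next_filter_header(prev, filter_bytes)
        headers.append(prev)
    return headers


def first_conflict(chain_a, chain_b):
    """Index of the first position where two peers disagree, else None."""
    for i, (a, b) in enumerate(zip(chain_a, chain_b)):
        if a != b:
            return i
    return None
```

A client that finds a conflict at some index would then fetch that block,
build the filter itself, and see which peer's header matches.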

> I created digests for all blocks up until block #469805 and actually ended
> up with 5.8 GB, which is 1.1 GB lower than what you have, but may be worse
> perf-wise on false positive rates and such.

Interesting, are you creating the equivalent of both our "regular" and
"extended" filters? Each filter type consumes about 3.5 GB in isolation,
with the extended filter type consuming more bytes on average because it
also includes sigScript/witness data.

It's worth noting that those numbers include the fixed 4-byte value for
"N" that's prepended to each filter once it's serialized (though that
doesn't add a considerable amount of overhead).  Alex and I were
considering using Bitcoin's var-int encoding for that number instead.
This would use a single byte for filters with fewer than 253 items
(including empty ones), 3 bytes for filters with up to 2^16 - 1 items,
and 5 bytes for the remainder of the cases.
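
For reference, a small sketch of Bitcoin's var-int (CompactSize) encoding,
to show how many bytes "N" would take at various sizes. The function name
is just for illustration:

```python
import struct


def compact_size(n: int) -> bytes:
    """Encode n using Bitcoin's variable-length integer format."""
    if n < 0 or n > 0xFFFFFFFFFFFFFFFF:
        raise ValueError("value out of range for a var-int")
    if n < 0xFD:
        return struct.pack("<B", n)            # 1 byte
    if n <= 0xFFFF:
        return b"\xfd" + struct.pack("<H", n)  # 3 bytes
    if n <= 0xFFFFFFFF:
        return b"\xfe" + struct.pack("<I", n)  # 5 bytes
    return b"\xff" + struct.pack("<Q", n)      # 9 bytes


if __name__ == "__main__":
    for n in (0, 252, 253, 65535, 65536):
        print(n, len(compact_size(n)))
```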

> For comparison, creating the digests above (469805 of them) took
> roughly 30 mins on my end, but using the kstats format so probably
> higher on an actual node (should get around to profiling that...).

Does that include the time required to read the blocks from disk? Or just
the CPU computation of constructing the filters? I haven't yet kicked off
a full re-index of the filters, but for reference this block[1] on testnet
takes ~18ms for the _full_ indexing routine with our current code+spec.
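
To put the two figures side by side, here's some back-of-the-envelope
arithmetic using only the numbers quoted in this thread. The
extrapolation at the end is purely hypothetical, since many early blocks
are empty and far cheaper than the large testnet block:

```python
# Figures quoted in this thread.
total_blocks = 469_805
total_minutes = 30
large_block_ms = 18

# Average cost per block implied by the 30 minute run.
avg_ms_per_block = total_minutes * 60 * 1000 / total_blocks
print(f"average per block: {avg_ms_per_block:.2f} ms")

# Hypothetical upper bound: every block as expensive as the large one.
upper_bound_minutes = total_blocks * large_block_ms / 1000 / 60
print(f"upper bound for full index: {upper_bound_minutes:.0f} min")
```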

[1]: 000000000000052184fbe86eff349e31703e4f109b52c7e6fa105cd1588ab6aa

-- Laolu


On Sun, Jun 4, 2017 at 7:18 PM Karl Johan Alm via bitcoin-dev <
bitcoin-dev@lists.linuxfoundation.org> wrote:

> On Sat, Jun 3, 2017 at 2:55 AM, Alex Akselrod via bitcoin-dev
> <bitcoin-dev@lists.linuxfoundation.org> wrote:
> > Without a soft fork, this is the only way for light clients to verify
> > that peers aren't lying to them. Clients can request headers (just
> > hashes of the filters and the previous headers, creating a chain) and
> > look for conflicts between peers. If a conflict is found at a certain
> > block, the client can download the block, generate a filter, calculate
> > the header by hashing together the previous header and the generated
> > filter, and banning any peers that don't match. A full node could prune
> > old filters if you wanted and recalculate them as necessary if you just
> > keep the filter header chain info as really old filters are unlikely to
> > be requested by correctly written software but you can't guarantee
> > every client will follow best practices either.
>
> Ahh, so you actually make a separate digest chain with prev hashes and
> everything. Once/if committed digests are soft forked in, it seems a
> bit overkill but maybe it's worth it. (I was always assuming committed
> digests in coinbase would come after people started using this, and
> that people could just ask a couple of random peers for the digest
> hash and ensure everyone gave the same answer as the hash of the
> downloaded digest..).
>
> > The simulations are based on completely random data within given
> > parameters.
>
> I noticed an increase in FP hits when using real data sampled from
> real scriptPubKeys and such. Address reuse and other weird stuff. See
> "lies.h" in github repo for experiments and chainsim.c initial part of
> main where wallets get random stuff from the chain.
>
> > I will definitely try to reproduce my experiments with Golomb-Coded
> > sets and see what I come up with. It seems like you've got a little
> > less than half the size of my digests for 1-block digests but I
> > haven't tried making digests for all blocks (and lots of early blocks
> > are empty).
> >
> >
> > Filters for empty blocks only take a few bytes and sometimes zero when
> > the coinbase output is a burn that doesn't push any data (example will
> > be in the test vectors that I'll have ready shortly).
>
> I created digests for all blocks up until block #469805 and actually
> ended up with 5.8 GB, which is 1.1 GB lower than what you have, but
> may be worse perf-wise on false positive rates and such.
>
> > How fast are these to create? Would it make sense to provide digests
> > on demand in some cases, rather than keeping them around indefinitely?
> >
> >
> > They're pretty fast and can be pruned if desired, as mentioned above, as
> > long as the header chain is kept.
>
> For comparison, creating the digests above (469805 of them) took
> roughly 30 mins on my end, but using the kstats format so probably
> higher on an actual node (should get around to profiling that...).


--94eb2c002330b9368f05517e38ac--