From: Olaoluwa Osuntokun
Date: Fri, 09 Jun 2017 03:03:51 +0000
To: Karl Johan Alm, Alex Akselrod
Cc: Bitcoin Dev
Subject: Re: [bitcoin-dev] BIP Proposal: Compact Client Side Filtering for Light Clients

Karl wrote:

> I am also curious if you have considered digests containing multiple
> blocks. Retaining a permanent binsearchable record of the entire chain is
> obviously too space costly, but keeping the last X blocks as binsearchable
> could speed up syncing for clients tremendously, I feel.

Originally we hadn't considered such an idea. Grasping the concept a bit
better, I can see how that may result in considerable bandwidth savings
(for purely negative queries) for clients doing a historical sync, or
catching up to the chain after being inactive for months/weeks.

If we were to pursue tacking this approach onto the current BIP proposal,
we could do it in the following way:

   * The `getcfilter` message gains an additional "Level" field. Using
     this field, the range of blocks to be included in the returned filter
     would be 2^Level. So a level of 0 is just the single filter, 3 is 8
     blocks past the block hash etc.

   * Similarly, the `getcfheaders` message would also gain a similar field
     with identical semantics. In this case each "level" would have a
     distinct header chain for clients to verify.
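
The proposed "Level" semantics can be illustrated with a short sketch. It assumes a level covers 2^Level consecutive blocks starting at the requested block, which matches the examples given (level 0 is one block, level 3 is eight). The function name and the use of heights in place of block hashes are illustrative only and not part of the BIP.

```python
def blocks_covered(start_height: int, level: int) -> range:
    """Return the block heights a filter at `level` would span.

    Assumption: a level covers 2**level consecutive blocks, beginning
    at the block identified in the request.
    """
    if level < 0:
        raise ValueError("level must be non-negative")
    return range(start_height, start_height + (1 << level))


if __name__ == "__main__":
    for level in (0, 1, 3):
        span = blocks_covered(100, level)
        print(f"level {level}: {len(span)} block(s), heights {span[0]}..{span[-1]}")
```

A client catching up after a long absence could, under this assumption, request a few high-level filters and only descend to lower levels for ranges that match.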

> How fast are these to create? Would it make sense to provide digests on
> demand in some cases, rather than keeping them around indefinitely?

For larger blocks (like the one referenced at the end of this mail), full
construction of the regular filter takes ~10-20ms (most of this is spent
extracting the data pushes). With smaller blocks, it quickly dips down to
the nanosecond-to-microsecond range.

Whether to keep _all_ the filters on disk, or to dynamically re-generate a
particular range (possibly most of the historical data) is an
implementation detail. Nodes that already do block pruning could discard
very old filters once the header chain is constructed, allowing them to
save additional space, as it's unlikely most clients would care about the
first 300k or so blocks.

> Ahh, so you actually make a separate digest chain with prev hashes and
> everything. Once/if committed digests are soft forked in, it seems a bit
> overkill but maybe it's worth it.

Yep, this is only a hold-over until when/if a commitment to the filter is
soft-forked in. In that case, there could be some extension message to
fetch the filter hash for a particular block, along with a merkle proof of
the coinbase transaction to the merkle root in the header.
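
To make the merkle-proof half of that idea concrete, here is a minimal sketch of checking that a transaction hash is included under a block header's merkle root. It uses the standard Bitcoin tree construction (double SHA-256, last node duplicated on odd-length levels). The extension message itself and the location of the filter commitment inside the coinbase are not specified anywhere yet, so nothing here models them, and all function names are illustrative.

```python
import hashlib
from typing import List, Sequence


def dsha256(data: bytes) -> bytes:
    """Bitcoin's double SHA-256."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def _next_level(level: List[bytes]) -> List[bytes]:
    # Duplicate the last node when the level has an odd number of entries.
    if len(level) % 2:
        level = level + [level[-1]]
    return [dsha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves: Sequence[bytes]) -> bytes:
    if not leaves:
        raise ValueError("need at least one leaf")
    level = list(leaves)
    while len(level) > 1:
        level = _next_level(level)
    return level[0]


def merkle_branch(leaves: Sequence[bytes], index: int) -> List[bytes]:
    """Collect the sibling hashes needed to prove leaves[index]."""
    branch = []
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        branch.append(level[index ^ 1])
        level = _next_level(level)
        index //= 2
    return branch


def verify_branch(leaf: bytes, index: int, branch: Sequence[bytes], root: bytes) -> bool:
    """Recompute the root from a leaf and its branch; compare to `root`."""
    h = leaf
    for sibling in branch:
        h = dsha256(sibling + h) if index & 1 else dsha256(h + sibling)
        index >>= 1
    return h == root


if __name__ == "__main__":
    txids = [dsha256(bytes([i])) for i in range(5)]  # stand-in transaction hashes
    root = merkle_root(txids)
    proof = merkle_branch(txids, 0)  # the coinbase is always the first transaction
    print(verify_branch(txids[0], 0, proof, root))
```

Because the coinbase sits at index 0, every sibling in its branch is a right-hand node, so a light client needs only the branch and the header it already has.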

> I created digests for all blocks up until block #469805 and actually ended
> up with 5.8 GB, which is 1.1 GB lower than what you have, but may be worse
> perf-wise on false positive rates and such.

Interesting, are you creating the equivalent of both our "regular" and
"extended" filters? Each of the filter types consumes ~3.5GB in
isolation, with the extended filter type on average consuming more bytes
because it includes sigScript/witness data as well.

It's worth noting that those numbers include the fixed 4-byte value for
"N" that's prepended to each filter once it's serialized (though that
doesn't add a considerable amount of overhead). Alex and I were
considering using Bitcoin's var-int encoding for that number instead.
This would result in using a single byte for empty and small filters
(< 253 items), 3 bytes for filters with < 2^16 items, and 5 bytes for
the remainder of the cases.
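
For reference, Bitcoin's var-int (CompactSize) encoding as used in the P2P protocol takes 1, 3, 5 or 9 bytes depending on the magnitude of the value. The sketch below shows how "N" would serialize under that scheme; the function name is illustrative, and whether the filter format ends up using this encoding is still an open question in this thread.

```python
import struct


def encode_compact_size(n: int) -> bytes:
    """Serialize a non-negative integer as a Bitcoin CompactSize var-int."""
    if n < 0:
        raise ValueError("CompactSize cannot encode negative values")
    if n < 0xFD:
        return struct.pack("<B", n)            # 1 byte
    if n <= 0xFFFF:
        return b"\xfd" + struct.pack("<H", n)  # marker + 2 bytes
    if n <= 0xFFFFFFFF:
        return b"\xfe" + struct.pack("<I", n)  # marker + 4 bytes
    if n <= 0xFFFFFFFFFFFFFFFF:
        return b"\xff" + struct.pack("<Q", n)  # marker + 8 bytes
    raise ValueError("value does not fit in 64 bits")


if __name__ == "__main__":
    for n in (0, 252, 253, 65535, 65536):
        print(n, len(encode_compact_size(n)))
```

Compared with a fixed 4-byte prefix, this saves 3 bytes on every filter with fewer than 253 items and 1 byte on filters with fewer than 2^16 items.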

> For comparison, creating the digests above (469805 of them) took
> roughly 30 mins on my end, but using the kstats format so probably
> higher on an actual node (should get around to profiling that...).

Does that include the time required to read the blocks from disk? Or just
the CPU computation of constructing the filters? I haven't yet kicked off
a full re-index of the filters, but for reference this block[1] on testnet
takes ~18ms for the _full_ indexing routine with our current code+spec.

[1]: 000000000000052184fbe86eff349e31703e4f109b52c7e6fa105cd1588ab6aa

-- Laolu


On Sun, Jun 4, 2017 at 7:18 PM Karl Johan Alm via bitcoin-dev <bitcoin-dev@lists.linuxfoundation.org> wrote:

On Sat, Jun 3, 2017 at 2:55 AM, Alex Akselrod via bitcoin-dev wrote:
> Without a soft fork, this is the only way for light clients to verify that
> peers aren't lying to them. Clients can request headers (just hashes of the
> filters and the previous headers, creating a chain) and look for conflicts
> between peers. If a conflict is found at a certain block, the client can
> download the block, generate a filter, calculate the header by hashing
> together the previous header and the generated filter, and banning any peers
> that don't match. A full node could prune old filters if you wanted and
> recalculate them as necessary if you just keep the filter header chain info
> as really old filters are unlikely to be requested by correctly written
> software but you can't guarantee every client will follow best practices
> either.
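
The check described in the quoted paragraph can be sketched as follows. This is a minimal illustration, assuming a filter header is the double SHA-256 of the previous header concatenated with the raw filter bytes, and that the chain starts from 32 zero bytes. The exact hash construction, the starting value, and every function name here are assumptions for illustration, not taken from the BIP.

```python
import hashlib
from typing import Dict, List, Optional, Sequence

ZERO_HEADER = bytes(32)  # assumed predecessor of the first filter header


def dsha256(data: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def next_header(prev_header: bytes, filter_bytes: bytes) -> bytes:
    # Assumed construction: hash the previous header together with the filter.
    return dsha256(prev_header + filter_bytes)


def build_chain(filters: Sequence[bytes]) -> List[bytes]:
    """Full-node side: derive the header chain from the per-block filters."""
    headers, prev = [], ZERO_HEADER
    for f in filters:
        prev = next_header(prev, f)
        headers.append(prev)
    return headers


def first_conflict(chain_a: Sequence[bytes], chain_b: Sequence[bytes]) -> Optional[int]:
    """Light-client side: first height at which two peers disagree, if any."""
    for height, (a, b) in enumerate(zip(chain_a, chain_b)):
        if a != b:
            return height
    return None


def peers_to_ban(prev_header: bytes, local_filter: bytes,
                 claimed: Dict[str, bytes]) -> List[str]:
    """After downloading the block and building the filter locally,
    return the peers whose claimed header does not match."""
    expected = next_header(prev_header, local_filter)
    return [peer for peer, header in claimed.items() if header != expected]


if __name__ == "__main__":
    honest = build_chain([b"f0", b"f1", b"f2"])
    lying = build_chain([b"f0", b"bogus", b"f2"])
    height = first_conflict(honest, lying)
    print("conflict at height", height)
    print(peers_to_ban(honest[height - 1], b"f1",
                       {"peer_a": honest[height], "peer_b": lying[height]}))
```

Since each header commits to its predecessor, a single bad filter changes every later header, which is why a node can prune old filters yet still serve a verifiable chain as long as the headers are kept.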

Ahh, so you actually make a separate digest chain with prev hashes and
everything. Once/if committed digests are soft forked in, it seems a
bit overkill but maybe it's worth it. (I was always assuming committed
digests in coinbase would come after people started using this, and
that people could just ask a couple of random peers for the digest
hash and ensure everyone gave the same answer as the hash of the
downloaded digest..).

> The simulations are based on completely random data within given parameters.

I noticed an increase in FP hits when using real data sampled from
real scriptPubKeys and such. Address reuse and other weird stuff. See
"lies.h" in github repo for experiments and chainsim.c initial part of
main where wallets get random stuff from the chain.

> I will definitely try to reproduce my experiments with Golomb-Coded
> sets and see what I come up with. It seems like you've got a little
> less than half the size of my digests for 1-block digests but I
> haven't tried making digests for all blocks (and lots of early blocks
> are empty).
>
>
> Filters for empty blocks only take a few bytes and sometimes zero when the
> coinbase output is a burn that doesn't push any data (example will be in the
> test vectors that I'll have ready shortly).

I created digests for all blocks up until block #469805 and actually
ended up with 5.8 GB, which is 1.1 GB lower than what you have, but
may be worse perf-wise on false positive rates and such.

> How fast are these to create? Would it make sense to provide digests
> on demand in some cases, rather than keeping them around indefinitely?
>
>
> They're pretty fast and can be pruned if desired, as mentioned above, as
> long as the header chain is kept.

For comparison, creating the digests above (469805 of them) took
roughly 30 mins on my end, but using the kstats format so probably
higher on an actual node (should get around to profiling that...).
_______________________________________________
bitcoin-dev mailing list
bitcoin-dev@lists.linuxfoundation.org
https://lists.linuxfoundation.org/mailman/listinfo/bitcoin-dev