MIME-Version: 1.0
References: <d43c6082-1b2c-c95b-5144-99ad0021ea6c@mattcorallo.com>
	<CAAS2fgRF-MhOvpFY6c_qAPzNMo3GQ28RExdSbOV6Q6Oy2iWn1A@mail.gmail.com>
	<CADabwBCKe6DaiBf_sjz9zyirkw8BdsDZnWSLEiAABEZvVDwj-Q@mail.gmail.com>
In-Reply-To: <CADabwBCKe6DaiBf_sjz9zyirkw8BdsDZnWSLEiAABEZvVDwj-Q@mail.gmail.com>
From: Olaoluwa Osuntokun <laolu32@gmail.com>
Date: Fri, 18 May 2018 20:08:29 -0700
Message-ID: <CAO3Pvs_Ca6FKWw32hDSnuOGWLHAikNrdeopgS6L-FdXT6jn-AA@mail.gmail.com>
To: Riccardo Casatta <riccardo.casatta@gmail.com>, 
	Bitcoin Protocol Discussion <bitcoin-dev@lists.linuxfoundation.org>
Content-Type: multipart/alternative; boundary="000000000000c50ee8056c866263"
Subject: Re: [bitcoin-dev] BIP 158 Flexibility and Filter Size
Precedence: list

--000000000000c50ee8056c866263
Content-Type: text/plain; charset="UTF-8"

Riccardo wrote:
> The BIP recall some go code for how the parameter has been selected which
> I can hardly understand and run

The code you're linking to is for generating test vectors (to allow
implementations to check the correctness of their GCS filters). The name of
the file is 'gentestvectors.go'. It produces CSV files which contain test
vectors for various testnet blocks at various false positive rates.

> it's totally my fault but if possible I would really like more details on
> the process, like charts and explanations

When we published the BIP draft last year (wow, time flies!), we put up code
(as well as an interactive website) showing the process we used to arrive at
the current false positive rate. The aim was to minimize the bandwidth
required to download each filter plus the expected bandwidth from
downloading "large-ish" full segwit blocks. The code simulated a few wallet
types (in terms of number of addrs, etc) focusing on a "mid-sized" wallet.
One could also model the selection as a Bernoulli process where we attempt
to compute the probability that after k queries (let's say you have k
addresses) we have k "successes". A success would mean the queried item
wasn't found in the filter, while a failure is a filter match (false
positive or not). Any single failure in the process requires fetching the
entire block.
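
As a rough sketch of that model (this is not the simulation code referred
to above): assume each of the k queries matches independently with
probability P, and that none of the k addresses is actually in the block,
so every match is a false positive. The function names and the filter and
block sizes in the example are hypothetical, picked only for illustration.

```python
def prob_no_match(fp_rate: float, k: int) -> float:
    """Probability of k "successes": none of k independent queries match."""
    return (1.0 - fp_rate) ** k


def expected_bytes_per_block(filter_bytes: float, block_bytes: float,
                             fp_rate: float, k: int) -> float:
    """Filter download plus the block download weighted by P(any match)."""
    p_fetch = 1.0 - prob_no_match(fp_rate, k)
    return filter_bytes + p_fetch * block_bytes


if __name__ == "__main__":
    # Hypothetical inputs: a wallet with 1000 addresses, a 20 kB filter
    # and a 1.5 MB block. These are not measurements.
    k = 1000
    for exponent in (10, 16, 20):
        p = 2.0 ** -exponent
        cost = expected_bytes_per_block(20_000, 1_500_000, p, k)
        print(f"P=2^-{exponent}: P(no match)={prob_no_match(p, k):.6f}, "
              f"expected bytes={cost:.0f}")
```

Note that the sketch holds the filter size fixed, whereas in practice a
higher false positive rate also shrinks the filter; choosing the rate means
minimizing the sum of both terms over the wallet sizes you care about.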

-- Laolu

On Fri, May 18, 2018 at 5:35 AM Riccardo Casatta via bitcoin-dev <
bitcoin-dev@lists.linuxfoundation.org> wrote:

> Another parameter which heavily affects filter size is the false positive
> rate which is empirically set
> <https://github.com/bitcoin/bips/blob/master/bip-0158.mediawiki#construction>
> to 2^-20
> The BIP recall some go code
> <https://github.com/Roasbeef/bips/blob/83b83c78e189be898573e0bfe936dd0c9b99ecb9/gcs_light_client/gentestvectors.go>
> for how the parameter has been selected which I can hardly understand and
> run, it's totally my fault but if possible I would really like more details
> on the process, like charts and explanations (for example, which is the
> number of elements to search for which the filter has been optimized for?)
>
> Instinctively I feel 2^-20 is super low and choosing a lot higher alpha
> will shrink the total filter size by gigabytes at the cost of having to
> wastefully download just some megabytes of blocks.
>
>
> 2018-05-17 18:36 GMT+02:00 Gregory Maxwell via bitcoin-dev <
> bitcoin-dev@lists.linuxfoundation.org>:
>
>> On Thu, May 17, 2018 at 3:25 PM, Matt Corallo via bitcoin-dev
>> <bitcoin-dev@lists.linuxfoundation.org> wrote:
>> > I believe (1) could be skipped entirely - there is almost no reason why
>> > you'd not be able to filter for, eg, the set of output scripts in a
>> > transaction you know about
>>
>> I think this is convincing for the txids themselves.
>>
>> What about also making input prevouts filter based on the scriptpubkey
>> being _spent_?  Layering wise in the processing it's a bit ugly, but
>> if you validated the block you have the data needed.
>>
>> This would eliminate the multiple data type mixing entirely.
>> _______________________________________________
>> bitcoin-dev mailing list
>> bitcoin-dev@lists.linuxfoundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/bitcoin-dev
>>
>
>
>
> --
> Riccardo Casatta - @RCasatta <https://twitter.com/RCasatta>
>

--000000000000c50ee8056c866263--