Date: Sat, 18 Jun 2016 19:01:43 -0400
From: Peter Todd
To: Bram Cohen
Cc: Bitcoin Protocol Discussion <bitcoin-dev@lists.linuxfoundation.org>
Subject: Re: [bitcoin-dev] Merkle trees and mountain ranges
Message-ID: <20160618230143.GA25017@fedora-21-dvm>

On Fri, Jun 17, 2016 at 07:43:47PM -0700, Bram Cohen wrote:
> On Thu, Jun 16, 2016 at 9:34 PM, Peter Todd wrote:
>
> > So above you said that in merbinner trees each node "hash[es] in a record
> > of its depth". That's actually incorrect: each node commits to the prefix
> > that all keys below that level start with, not just the depth.
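
To make concrete what that prefix commitment means, here's a rough sketch -
purely illustrative Python; the tag byte and field layout are assumptions on
my part, not merbinner-tree's actual serialization:

    import hashlib

    def inner_node_hash(prefix_bits: str, left_hash: bytes, right_hash: bytes) -> bytes:
        """Hash an inner node, committing to the key prefix shared below it.

        prefix_bits is the bit-string (e.g. '0110') that every key under this
        node starts with; committing to the prefix itself - not merely to the
        depth - pins down exactly which keys may appear in the subtree.
        """
        h = hashlib.sha256()
        h.update(b'\x01')                              # assumed tag: inner node
        h.update(len(prefix_bits).to_bytes(2, 'big'))  # prefix length in bits
        h.update(prefix_bits.encode())                 # the prefix itself
        h.update(left_hash)                            # left child's digest
        h.update(right_hash)                           # right child's digest
        return h.digest()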
> I considered a similar trick at the implementation rather than the
> definition level: A node doesn't have to store the prefix which is implicit
> in its position. That would create a fair number of headaches though,
> because I'm using fixed size stuff in important ways, and it could at most
> save about 10% of memory, so it goes into the 'maybe later' bucket.

Wait, are you saying you think committing to the prefix is a "trick"? It's
just a very simple - and possibly not-optimal - way of committing to what
data should be accessible under a given node. An alternative would have been
to ensure that in terms of _cryptographic_ tree position.

By "position", are you talking about position within RAM? That may or may not
be a viable optimization, but it's quite separate from the question of the
cryptographic structure of the data.

> > This means that in merbinner trees, cases where multiple keys share parts
> > of the same prefix are handled efficiently, without introducing extra
> > levels unnecessarily; there's no need for the ONLY0/1 nodes as the
> > children of an inner node will always be on different sides.
> >
> > When keys are randomly distributed, this isn't a big deal; OTOH against
> > attackers who are choosing keys, e.g. by grinding hashes, merbinner trees
> > always have maximum depths in proportion to log2(n) of the actual number
> > of items in the tree. Grinding is particularly annoying to deal with due
> > to the birthday attack: creating a ground prefix 64 bits long only takes
> > 32 bits worth of work.

> Yes an attacker can force the tree to be deeper in places, but it's
> mitigated in several ways: (1) The way I'm using memory it won't cause a
> whole new block to be allocated, it will just force log(attack strength) -
> log(n) nodes to be used (2) logarithmic growth being what it is that isn't
> such a big amount (3) With the special casing of TERMBOTH an attacker needs
> three things with the same prefix to pull off an attack rather than two,
> which is quite a bit harder to pull off.
>
> That said, it wouldn't be all that hard to change how the hashing function
> works to do a single hash for a whole series of ONLY in a row instead of a
> new one at every level, which would make the attacker only able to force
> extra memory usage instead of extra CPU, but this is a slightly annoying
> thing to write to stop a fairly lame attack, so I'm at least not doing it
> for my initial implementation. I could likely be convinced that it's worth
> doing before an actual release though. There's another implementation trick
> to do the same thing for memory usage, which is much more in the 'do later'
> category because it doesn't involve changing the format and hence it can be
> put off.

> > In particular, case #2 handles your leaf node optimizations generically,
> > without special cases and additional complexity. It'd also be a better
> > way to do the ONLY0/1 cases, as if the "nothing on this side" symbol is a
> > single byte, each additional colliding level would simply extend the
> > commitment without hashing. In short, you'd have nearly the same level of
> > optimization even if at the cryptography level your tree consists of only
> > leaves, inner nodes, and nil.

> I'm taking pains to make all the hashing be of fixed-size things, so that a
> non-padding variant of a secure hashing algorithm can be used. The chains
> of ONLY thing above would force a special exception to that, which can be
> done but is annoying. Making things smaller than a single block (64 bytes)
> won't speed up hashing time, and making things a single byte longer than
> that doubles it.
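
For concreteness, the fixed-size case you're describing boils down to
something like this - a minimal sketch, with the 32-byte child-digest size
assumed purely for illustration:

    import hashlib

    def parent_hash(left: bytes, right: bytes) -> bytes:
        # Two fixed-size child digests concatenated are exactly 64 bytes -
        # one compression block for a hash with a 64-byte block. With SHA256
        # the mandatory padding spills into a second block (two compression
        # calls); a non-padding construction could do it in one.
        assert len(left) == 32 and len(right) == 32
        return hashlib.sha256(left + right).digest()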
Have you seen how BLAKE2 omits padding when the data to be hashed happens to
be exactly one block in size? It's significantly faster than SHA256, and
that's a standard part of the algorithm already.

> > > > > Technically even a patricia trie utxo commitment can have sub-1
> > > > > cache misses per update if some of the updates in a single block
> > > > > are close to each other in memory. I think I can get practical
> > > > > Bitcoin updates down to a little bit less than one l2 cache miss
> > > > > per update, but not a lot less.
> > > >
> > > > I'm very confused as to why you think that's possible. When you say
> > > > "practical Bitcoin updates", what exactly is the data structure
> > > > you're proposing to update? How is it indexed?

> I'll re-answer this because I did a terrible job before. The entire data
> structure consists of nodes which contain a metadata byte (TERM0, ONLY1,
> etc.) followed by fixed size secure hashes, and (in some cases) pointers to
> where the children are. The secure hashes in parent nodes are found by
> hashing their children verbatim (or the stored string in the case of a
> TERM). This is very conventional. All of the cleverness is in where in
> memory these nodes are stored so that tracing down the tree causes very few
> cache misses.
>
> (The alternate approach is to have each node store its own hash rather than
> that be stored by the parent. That approach means that when you're
> recalculating you have to look up siblings, which doubles the number of
> cache misses. Not such a hot idea.)

Have you benchmarked the cost of a hash operation vs. the cost of a cache
miss? What are the actual numbers?

> At the root there's a branch block. It consists of all nodes up to some
> fixed depth - let's say 12 - with that depth set so that it roughly fits
> within a single memory page. Branch blocks are arranged with the nodes in
> fixed position defined by the prefix they correspond to, and the terminals
> have outpointers to other blocks. Because they're all clustered together, a
> lookup or update will only require a single

A single....?

> Below the root block are other branch blocks. Each of them has a fixed 12
> bit prefix it is responsible for. When doing a lookup a second cache miss
> will be hit for levels 13-24, because those are all clustered in the same
> branch block.

So, is this also how the data structure looks cryptographically, or is the
way it's hashed separate from the above description?

> Below the second level of root block (at Bitcoin utxo set scale - this
> varies based on how much is stored) there are leaf blocks. A leaf block
> consists of nodes with outpointers to its own children, which must be
> within the same leaf block. All entry points into a leaf block are from the
> same branch block, and the leaf block has no out pointers to other blocks.
> When a leaf block overflows, the entry point into it which overflowed is
> moved into the active leaf for that branch, and if that's full a new one is
> allocated. There's some subtlety to exactly how this is done, but I've
> gotten everything down to simple expedient tricks with straightforward
> implementations. The thing which matters for now is that there's only a
> single cache miss for each leaf node, because they also fit in a page.

Page as in 4096 bytes? But L1/L2/L3 cache is arranged in terms of 64 byte
cache lines - where do pages come in here?

At Bitcoin UTXO set scale, how large do you think these data structures are?
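
For reference, here's how I'm picturing the addressing you describe - a
purely illustrative sketch; the 12-bit fan-out per branch block is your
figure from above, and everything else is an assumption on my part:

    def branch_indices(key: bytes):
        """Slot positions for the first two levels of branch blocks.

        Levels 1-12 of the tree live at a fixed slot in the root branch
        block; levels 13-24 at a fixed slot in one of up to 4096
        second-level branch blocks - hence one cache/page touch per block.
        """
        bits = int.from_bytes(key[:3], 'big')   # first 24 bits of the key
        root_slot = bits >> 12                  # key bits 0-11: root branch block
        second_slot = bits & 0xFFF              # key bits 12-23: child branch block
        return root_slot, second_slot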
> So at Bitcoin scale there will probably only be 3 cache misses for a
> lookup, and that's a worst case scenario. The first one is probably always
> warm, bringing it down to 2, and if you do a bunch in sorted order they'll
> probably hit the same second level branches repeatedly, bringing it down to
> 1, and might even average less than that if there are enough that the leaf
> block has multiple things being accessed.

"Sorted order" - what exact type of sorting do you mean here?

> > Anyway hashing is pretty slow. The very fast BLAKE2 is about 3 cycles/byte
> > (SHA256 is about 15 cycles/byte) so hashing that same data would take
> > around 200 cycles, and probably quite a bit more in practice due to
> > overheads from our short message lengths; fetching a cache line from DRAM
> > only takes about 1,000 cycles. I'd guess that once other overheads are
> > taken into account, even if you could eliminate L2/L3 cache-misses it
> > wouldn't be much of an improvement.

> Those numbers actually back up my claims about performance. If you're doing
> a single update and recalculating the root afterwards, then the amount of
> rehashing to be done is about 30 levels deep times 64 bytes per thing
> hashed times 15 cycles per byte, so it's about 28,800 cycles of hashing. If
> you have a naive radix tree implementation which hits a cache miss at every
> level then that's 30,000 cycles, which is about half the total time -
> certainly worth optimizing. If instead of sha256 you use blake2 (which
> sounds like a very good idea!) then hashing for an update will be about
> 5760 cycles and performance will be completely dominated by cache misses.
> If a more cache coherent implementation is used, then the cost of cache
> misses will be 3000 cycles, which will be a non-factor with sha256 and a
> significant but not dominating one with blake2.

But that's assuming the dataset in question fits in cache; I don't see how it
does. Since it doesn't, I'm arguing the total % improvement from _any_ cache
optimization on the subset that does fit in cache will be relatively small.

Again, how large a dataset do you think you're working with here?

--
https://petertodd.org 'peter'[:-1]@petertodd.org