Date: Sat, 18 Jun 2016 19:01:43 -0400
From: Peter Todd
To: Bram Cohen
Cc: Bitcoin Protocol Discussion <bitcoin-dev@lists.linuxfoundation.org>
Subject: Re: [bitcoin-dev] Merkle trees and mountain ranges
Message-ID: <20160618230143.GA25017@fedora-21-dvm>

On Fri, Jun 17, 2016 at 07:43:47PM -0700, Bram Cohen wrote:
> On Thu, Jun 16, 2016 at 9:34 PM, Peter Todd wrote:
>
> > So above you said that in merbinner trees each node "hash[es] in a record
> > of its depth". That's actually incorrect: each node commits to the prefix
> > that all keys below that level start with, not just the depth.
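
To make concrete what that prefix commitment means, here's a rough sketch -
purely illustrative Python; the tag byte and field layout are assumptions on
my part, not merbinner-tree's actual serialization:

    import hashlib

    def inner_node_hash(prefix_bits: str, left_hash: bytes, right_hash: bytes) -> bytes:
        """Hash an inner node, committing to the key prefix shared below it.

        prefix_bits is the bit-string (e.g. '0110') that every key under this
        node starts with; committing to the prefix itself - not merely to the
        depth - pins down exactly which keys may appear in the subtree.
        """
        h = hashlib.sha256()
        h.update(b'\x01')                              # assumed tag: inner node
        h.update(len(prefix_bits).to_bytes(2, 'big'))  # prefix length in bits
        h.update(prefix_bits.encode())                 # the prefix itself
        h.update(left_hash)                            # left child's digest
        h.update(right_hash)                           # right child's digest
        return h.digest()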
> I considered a similar trick at the implementation rather than the
> definition level: A node doesn't have to store the prefix which is implicit
> in its position. That would create a fair number of headaches though,
> because I'm using fixed size stuff in important ways, and it could at most
> save about 10% of memory, so it goes into the 'maybe later' bucket.

Wait, are you saying you think committing to the prefix is a "trick"? It's
just a very simple - and possibly not-optimal - way of committing to what
data should be accessible under a given node. An alternative would have been
to ensure that in terms of _cryptographic_ tree position.

By "position", are you talking about position within RAM? That may or may not
be a viable optimization, but it's quite separate from the question of the
cryptographic structure of the data.

> > This means that in merbinner trees, cases where multiple keys share parts
> > of the same prefix are handled efficiently, without introducing extra
> > levels unnecessarily; there's no need for the ONLY0/1 nodes as the
> > children of an inner node will always be on different sides.
> >
> > When keys are randomly distributed, this isn't a big deal; OTOH against
> > attackers who are choosing keys, e.g. by grinding hashes, merbinner trees
> > always have maximum depths in proportion to log2(n) of the actual number
> > of items in the tree. Grinding is particularly annoying to deal with due
> > to the birthday attack: creating a ground prefix 64 bits long only takes
> > 32 bits worth of work.

> Yes an attacker can force the tree to be deeper in places, but it's
> mitigated in several ways: (1) The way I'm using memory it won't cause a
> whole new block to be allocated, it will just force log(attack strength) -
> log(n) nodes to be used (2) logarithmic growth being what it is that isn't
> such a big amount (3) With the special casing of TERMBOTH an attacker needs
> three things with the same prefix to pull off an attack rather than two,
> which is quite a bit harder to pull off.
>
> That said, it wouldn't be all that hard to change how the hashing function
> works to do a single hash for a whole series of ONLY in a row instead of a
> new one at every level, which would make the attacker only able to force
> extra memory usage instead of extra CPU, but this is a slightly annoying
> thing to write to stop a fairly lame attack, so I'm at least not doing it
> for my initial implementation. I could likely be convinced that it's worth
> doing before an actual release though. There's another implementation trick
> to do the same thing for memory usage, which is much more in the 'do later'
> category because it doesn't involve changing the format and hence it can be
> put off.

> > In particular, case #2 handles your leaf node optimizations generically,
> > without special cases and additional complexity. It'd also be a better
> > way to do the ONLY0/1 cases, as if the "nothing on this side" symbol is a
> > single byte, each additional colliding level would simply extend the
> > commitment without hashing. In short, you'd have nearly the same level of
> > optimization even if at the cryptography level your tree consists of only
> > leaves, inner nodes, and nil.

> I'm taking pains to make all the hashing be of fixed-size things, so that a
> non-padding variant of a secure hashing algorithm can be used. The chains
> of ONLY thing above would force a special exception to that, which can be
> done but is annoying. Making things smaller than a single block (64 bytes)
> won't speed up hashing time, and making things a single byte longer than
> that doubles it.
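
For concreteness, the fixed-size case you're describing boils down to
something like this - a minimal sketch, with the 32-byte child-digest size
assumed purely for illustration:

    import hashlib

    def parent_hash(left: bytes, right: bytes) -> bytes:
        # Two fixed-size child digests concatenated are exactly 64 bytes -
        # one compression block for a hash with a 64-byte block. With SHA256
        # the mandatory padding spills into a second block (two compression
        # calls); a non-padding construction could do it in one.
        assert len(left) == 32 and len(right) == 32
        return hashlib.sha256(left + right).digest()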
Have you seen how BLAKE2 omits padding when the data to be hashed happens to
be exactly one block in size? It's significantly faster than SHA256, and
that's a standard part of the algorithm already.

> > > > > Technically even a patricia trie utxo commitment can have sub-1
> > > > > cache misses per update if some of the updates in a single block
> > > > > are close to each other in memory. I think I can get practical
> > > > > Bitcoin updates down to a little bit less than one l2 cache miss
> > > > > per update, but not a lot less.
> > > >
> > > > I'm very confused as to why you think that's possible. When you say
> > > > "practical Bitcoin updates", what exactly is the data structure
> > > > you're proposing to update? How is it indexed?

> I'll re-answer this because I did a terrible job before. The entire data
> structure consists of nodes which contain a metadata byte (TERM0, ONLY1,
> etc.) followed by fixed size secure hashes, and (in some cases) pointers to
> where the children are. The secure hashes in parent nodes are found by
> hashing their children verbatim (or the stored string in the case of a
> TERM). This is very conventional. All of the cleverness is in where in
> memory these nodes are stored so that tracing down the tree causes very few
> cache misses.
>
> (The alternate approach is to have each node store its own hash rather than
> that be stored by the parent. That approach means that when you're
> recalculating you have to look up siblings, which doubles the number of
> cache misses. Not such a hot idea.)

Have you benchmarked the cost of a hash operation vs. the cost of a cache
miss? What are the actual numbers?

> At the root there's a branch block. It consists of all nodes up to some
> fixed depth - let's say 12 - with that depth set so that it roughly fits
> within a single memory page. Branch blocks are arranged with the nodes in
> fixed position defined by the prefix they correspond to, and the terminals
> have outpointers to other blocks. Because they're all clustered together, a
> lookup or update will only require a single

A single....?

> Below the root block are other branch blocks. Each of them has a fixed 12
> bit prefix it is responsible for. When doing a lookup a second cache miss
> will be hit for levels 13-24, because those are all clustered in the same
> branch block.

So, is this also how the data structure looks cryptographically, or is the
way it's hashed separate from the above description?

> Below the second level of root block (at Bitcoin utxo set scale - this
> varies based on how much is stored) there are leaf blocks. A leaf block
> consists of nodes with outpointers to its own children, which must be
> within the same leaf block. All entry points into a leaf block are from the
> same branch block, and the leaf block has no out pointers to other blocks.
> When a leaf block overflows, the entry point into it which overflowed is
> moved into the active leaf for that branch, and if that's full a new one is
> allocated. There's some subtlety to exactly how this is done, but I've
> gotten everything down to simple expedient tricks with straightforward
> implementations. The thing which matters for now is that there's only a
> single cache miss for each leaf node, because they also fit in a page.

Page as in 4096 bytes? But L1/L2/L3 cache is arranged in terms of 64 byte
cache lines - where do pages come in here?

At Bitcoin UTXO set scale, how large do you think these data structures are?
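
For reference, here's how I'm picturing the addressing you describe - a
purely illustrative sketch; the 12-bit fan-out per branch block is your
figure from above, and everything else is an assumption on my part:

    def branch_indices(key: bytes):
        """Slot positions for the first two levels of branch blocks.

        Levels 1-12 of the tree live at a fixed slot in the root branch
        block; levels 13-24 at a fixed slot in one of up to 4096
        second-level branch blocks - hence one cache/page touch per block.
        """
        bits = int.from_bytes(key[:3], 'big')   # first 24 bits of the key
        root_slot = bits >> 12                  # key bits 0-11: root branch block
        second_slot = bits & 0xFFF              # key bits 12-23: child branch block
        return root_slot, second_slot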
> So at Bitcoin scale there will probably only be 3 cache misses for a
> lookup, and that's a worst case scenario. The first one is probably always
> warm, bringing it down to 2, and if you do a bunch in sorted order they'll
> probably hit the same second level branches repeatedly, bringing it down to
> 1, and might even average less than that if there are enough that the leaf
> block has multiple things being accessed.

"Sorted order" - what exact type of sorting do you mean here?

> > Anyway hashing is pretty slow. The very fast BLAKE2 is about 3 cycles/byte
> > (SHA256 is about 15 cycles/byte) so hashing that same data would take
> > around 200 cycles, and probably quite a bit more in practice due to
> > overheads from our short message lengths; fetching a cache line from DRAM
> > only takes about 1,000 cycles. I'd guess that once other overheads are
> > taken into account, even if you could eliminate L2/L3 cache-misses it
> > wouldn't be much of an improvement.

> Those numbers actually back up my claims about performance. If you're doing
> a single update and recalculating the root afterwards, then the amount of
> rehashing to be done is about 30 levels deep times 64 bytes per thing
> hashed times 15 cycles per byte, so it's about 28,800 cycles of hashing. If
> you have a naive radix tree implementation which hits a cache miss at every
> level then that's 30,000 cycles, which is about half the total time -
> certainly worth optimizing. If instead of sha256 you use blake2 (which
> sounds like a very good idea!) then hashing for an update will be about
> 5760 cycles and performance will be completely dominated by cache misses.
> If a more cache coherent implementation is used, then the cost of cache
> misses will be 3000 cycles, which will be a non-factor with sha256 and a
> significant but not dominating one with blake2.

But that's assuming the dataset in question fits in cache; I don't see how it
does. Since it doesn't, I'm arguing the total % improvement from _any_ cache
optimization on the subset that does fit in cache will be relatively small.

Again, how large a dataset do you think you're working with here?

--
https://petertodd.org 'peter'[:-1]@petertodd.org