From: Bram Cohen
Date: Fri, 15 Jul 2016 16:00:57 -0700
To: Peter Todd
Cc: Bitcoin Protocol Discussion <bitcoin-dev@lists.linuxfoundation.org>
Subject: Re: [bitcoin-dev] Merkle trees and mountain ranges

On Sat, Jun 18, 2016 at 4:01 PM, Peter Todd wrote:

> Have you seen how BLAKE2 omits padding when the data to be hashed happens
> to be exactly one block in size? It's significantly faster than SHA256,
> and that's a standard part of the algorithm already.

That's very convenient! I didn't know that, but I had "look up how BLAKE2
does padding" on my list of things to do. I'm leaning heavily towards using
blake2b at this point, at least for internal hashing.

> > At the root there's a branch block. It consists of all nodes up to some
> > fixed depth - let's say 12 - with that depth set so that it roughly fits
> > within a single memory page. Branch blocks are arranged with the nodes
> > in fixed positions defined by the prefixes they correspond to, and the
> > terminals have outpointers to other blocks. Because they're all
> > clustered together, a lookup or update will only require a single
>
> A single....?
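(A quick aside on the BLAKE2 point: it's easy to see with Python's hashlib, which exposes blake2b. blake2b's block size is 128 bytes, so hashing two 64-byte child digests is exactly one compression call with no padding block. The node layout below is just an illustration, not the proposed format.)

```python
from hashlib import blake2b

# BLAKE2b processes input in 128-byte blocks and, unlike SHA-256,
# appends no padding block: exactly 128 bytes of input is exactly one
# compression-function call. With 64-byte node hashes, hashing an
# interior Merkle node (left child || right child) is one compression.
def node_hash(left: bytes, right: bytes) -> bytes:
    assert len(left) == len(right) == 64
    return blake2b(left + right, digest_size=64).digest()

leaf_a = blake2b(b"utxo A", digest_size=64).digest()
leaf_b = blake2b(b"utxo B", digest_size=64).digest()
root = node_hash(leaf_a, leaf_b)
print(len(root))  # -> 64
```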
Okay, I've figured out the root cause of the general confusion here. It's
mostly my fault.

There are a few different media on which data can be stored, with different
properties in terms of how long it takes to retrieve data from them and how
much readahead they typically do. I was misreading the L2 cache line size
as the main memory readahead amount, which is... probably wrong? The
readahead properties of memory aren't well documented and apparently vary a
lot. SSDs typically pull down a kilobyte at once, and those units are
called pages, hence my use of that term above. But since my real point is
that my implementation should act as a silver bullet with acceptable
performance even on extremely bad devices, I'll give an analysis of how
well it works when everything is stored on a regular spinning hard drive.

Let's say you're storing 100 million items, which will fit within 10
gigabytes. If you set the branch block depths to about 10 bits they'll be
about 32K, and if you set the size of leaf blocks to be about the same then
memory efficiency will be good, because the leaf blocks will store twigs of
about 2^7 in size while having 2^10 space, so they'll fit reasonably.
Almost everything will be three blocks from the root, so updates will
generally require two disk seeks (plus one more for a write, but writes are
generally faster because they get batched).

For latency numbers I'm going off these:
https://gist.github.com/jboner/2841832

If the blockchain is very full of simple transactions and a disk seek takes
15 milliseconds, then going with the assumptions that a block is about 600
seconds, the blockchain can handle 4 transactions per second, and each
transaction is 3 updates (one utxo spent plus two new ones created), that's
15 * 600 * 4 * 3 * 2 milliseconds per block, or about 200 seconds per
block. So if the utxo roots trail by a few blocks, even a ludicrously
underpowered node could keep up.
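The spinning-disk arithmetic works out like this (a back-of-envelope sketch; all figures are the assumptions stated above):

```python
# Back-of-envelope check of the spinning-disk numbers.
seek_ms = 15            # one disk seek, per https://gist.github.com/jboner/2841832
block_interval_s = 600  # a block is about 600 seconds
tx_per_s = 4            # transactions per second
updates_per_tx = 3      # one utxo spent plus two new ones created
seeks_per_update = 2    # almost everything is three blocks from the root

ms_per_block = seek_ms * block_interval_s * tx_per_s * updates_per_tx * seeks_per_update
print(ms_per_block / 1000)  # -> 216.0, i.e. about 200 seconds of seeking per block
```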
On an SSD keeping up is completely trivial; the problem becomes how quickly
you can validate an entire blockchain history. There a read takes about
0.15 milliseconds, and you have to do 5 of them per update instead of 2
because the natural memory block size is 4K, so it's about 1 millisecond
per update, or 600 * 4 * 3 milliseconds total per block, which is about 7
seconds. That's large enough that making the utxo root trail by two blocks
is still a good idea, but small enough that it isn't a big factor in the
cost of running a node. It's enough that validating a complete block
history might take a while, though, and even validating just the last year
would take a few days. This is very conservative and assumes that
*everything* is kept on an SSD, though. If the numbers work out better and
a few layers are kept in regular memory, validating a whole year of history
might take only a few hours.

Hopefully that all makes a fairly good case that a raw merkle tree utxo
root trailing by a few blocks is a viable strategy. The data structures in
the MMR proposal are fairly complicated, and the analysis of them talks in
somewhat vague terms about things being fast and slow. A similar analysis
of the MMR proposal, specifying storage media and expected latency numbers,
would clarify the reasoning a lot.

(By the way, sorry for the slow response - I got preempted by a bunch of
other work duties.)
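P.S. The SSD arithmetic above, sketched the same way (again using the email's assumed numbers, not measurements):

```python
# Back-of-envelope check of the SSD numbers.
read_ms = 0.15          # one random SSD read, per the latency table
reads_per_update = 5    # 4K natural block size -> 5 reads instead of 2
ms_per_update = read_ms * reads_per_update  # 0.75, rounded up to ~1 ms

updates_per_block = 600 * 4 * 3  # 600 s/block * 4 tx/s * 3 updates/tx
seconds_per_block = updates_per_block * 1 / 1000  # at ~1 ms per update
print(seconds_per_block)  # -> 7.2, i.e. about 7 seconds per block
```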