Return-Path: Received: from smtp1.linuxfoundation.org (smtp1.linux-foundation.org [172.17.192.35]) by mail.linuxfoundation.org (Postfix) with ESMTPS id B84A1E29 for ; Wed, 16 May 2018 16:36:38 +0000 (UTC) X-Greylist: whitelisted by SQLgrey-1.7.6 Received: from mail-oi0-f45.google.com (mail-oi0-f45.google.com [209.85.218.45]) by smtp1.linuxfoundation.org (Postfix) with ESMTPS id 849496BF for ; Wed, 16 May 2018 16:36:37 +0000 (UTC) Received: by mail-oi0-f45.google.com with SMTP id b130-v6so1283832oif.12 for ; Wed, 16 May 2018 09:36:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=coryfields-com.20150623.gappssmtp.com; s=20150623; h=mime-version:reply-to:from:date:message-id:subject:to :content-transfer-encoding; bh=1Mxsd38bAfxwR3syO7nqYThVT9qibohi9M9YWjyK2PA=; b=Oa2EFdVcXS2coYeq/cXmtEIcw+dfd9gxKptyiHXCdsF/4Z73RnMp/QufuJg5CG5gxt iDQdnGmJnQn2mHhhGYeWxLJe1jL0P2MW4tH/+2N7DEhFUsiSb6eXfTDUgILfrWmijewC rROKPIYBsBoSsZ3UlFOWfKxB/uitQ1wzqGhoGluYSUe+GAoG0llH9F8uNoUyEilPhvmt PqAVeiXVMU2LOcozatBPfEBj26xCFU/Ngc1ZFSvGq+kQMqYP0RHiSF7hlXKv7Au/kNA9 rk17T9qJifz9m5a/ezlI+yCIAXSzG4h7T2pcaqkFptJAvDZzRYcqZ71f6Yn9tK8G5R7/ UbRw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:reply-to:from:date:message-id :subject:to:content-transfer-encoding; bh=1Mxsd38bAfxwR3syO7nqYThVT9qibohi9M9YWjyK2PA=; b=ZdWOF8qKJ8FKd3ook0mzHdc5WQ2jCehcY320JOE9+ruFi12jFBiKOOwVJOsOp3Wry6 Z7lFF+gKoYILr6K+Kz4k/ZJ+VUooXIT4eW7oXsaEUUso09IeHr5xVwrUS/xjd86O4cxt aOxedqUsiHhTqrW9PvLtxvLrM4ZP2GCNhKp4OnOcA0Cc3amxEjXqD0lG255RIg9kdVVc DdhVnzP2vI7Q0Rh1qNYP+Uqb5P9wZT0Fk0/UpwnGKkaq6MpU76bgkGgeE52Nq9Nj2/8R FqbiAy+WIGVVU7dtH3kyjeqApBOZpag3VafR58Q1M9Z/ifAJawjSUCIUfg9/iE8BusPu XHdA== X-Gm-Message-State: ALKqPwcJMiVce/5S+1wVWCqa6V4Ve32ZJ/3GuhB/nS3wpS4iPvH5MUkG nGX8jJ3SIqD7+OHGB3OIKH6sUJDnG+nNlAoHwViJfRoPYoo= X-Google-Smtp-Source: AB8JxZp3+xKRfYzCO81PM8NGj3f8kwJkFg5KD79qjK7Hm4SXG7PosUjQmS1ki2ghkVALOT6yIueNTVjCA8HfcRl8Rco= X-Received: by 2002:aca:508f:: with SMTP id e137-v6mr1091014oib.22.1526488596293; Wed, 16 May 2018 09:36:36 -0700 (PDT) MIME-Version: 1.0 Reply-To: lists@coryfields.com Received: by 10.74.191.17 with HTTP; Wed, 16 May 2018 09:36:35 -0700 (PDT) From: Cory Fields Date: Wed, 16 May 2018 12:36:35 -0400 Message-ID: To: Bitcoin Dev Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on smtp1.linux-foundation.org Subject: [bitcoin-dev] UHS: Full-node security without maintaining a full UTXO set X-BeenThere: bitcoin-dev@lists.linuxfoundation.org X-Mailman-Version: 2.1.12 Precedence: list List-Id: Bitcoin Protocol Discussion List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 16 May 2018 16:36:38 -0000 Tl;dr: Rather than storing all unspent outputs, store their hashes. Untrust= ed peers can supply the full outputs when needed, with very little overhead. Any attempt to spoof those outputs would be apparent, as their hashes would= not be present in the hash set. There are many advantages to this, most apparen= tly in disk and memory savings, as well as a validation speedup. The primary disadvantage is a small increase in network traffic. I believe that the advantages outweigh the disadvantages. -- Bitcoin=E2=80=99s unspent transaction output set (usually referred to as = =E2=80=9CThe UTXO set=E2=80=9D) has two primary roles: providing proof that previous outputs = exist to be spent, and providing the actual previous output data for verification when = new transactions attempts to spend them. These roles are not usually discussed independently, but as Bram Cohen's TXO Bitfield [0] idea hints, there are compelling reasons to consider them this way. To see why, consider running a node with the following changes: - For each new output, gather all extra data that will be needed for verification when spending it later as an input: the amount, scriptPubKey= , creation height, coinbaseness, and output type (p2pkh, p2sh, p2wpkh, etc.= ). Call this the Dereferenced Prevout data. - Create a hash from the concatenation of the new outpoint and the derefere= nced prevout data. Call this a Unspent Transaction Output Hash. - Rather than storing the full dereferenced prevout entries in a UTXO set a= s is currently done, instead store their hashes to an Unspent Transaction Outp= ut Hash Set, or UHS. - When relaying a transaction, append the dereferenced prevout for each inp= ut. Now when a transaction is received, it contains everything needed for verification, including the input amount, height, and coinbaseness, which w= ould have otherwise required a lookup the UTXO set. To verify an input's unspentness, again create a hash from the concatenatio= n of the referenced outpoint and the provided dereferenced prevout, and check fo= r its presence in the UHS. The hash will only be present if a hash of the exa= ct same data was previously added to (and not since removed from) the UHS. As such, we are protected from a peer attempting to lie about the dereferenced prevout data. ### Some benefits of the UHS model - Requires no consensus changes, purely a p2p/implementation change. - UHS is substantially smaller than a full UTXO set (just over half for the main chain, see below). In-memory caching can be much more effective as a result. - A block=E2=80=99s transactions can be fully verified before doing a poten= tially expensive database lookup for the previous output data. The UHS can be queried afterwards (or in parallel) to verify previous output inclusion. - Entire blocks could potentially be verified out-of-order because all inpu= t data is provided; only the inclusion checks have to be in-order. Admitted= ly this is likely too complicated to be realistic. - pay-to-pubkey outputs are less burdensome on full nodes, since they use n= o more space on-disk than pay-to-pubkey-hash or pay-to-script-hash. Taproot= and Graftroot outputs may share the same benefits. - The burden of holding UTXO data is technically shifted from the verifiers= to the spender. In reality, full nodes will likely always have a copy as wel= l, but conceptually it's a slight improvement to the incentive model. - Block data from peers can also be used to roll backwards during a reorg. = This potentially enables an even more aggressive pruning mode. - UTXO storage size grows exactly linearly with UTXO count, as opposed to growing linearly with UTXO data size. This may be relevant when consideri= ng new larger output types which would otherwise cause the UTXO Set size to increase more quickly. - The UHS is a simple set, no need for a key-value database. LevelDB could potentially be dropped as a dependency in some distant future. - Potentially integrates nicely with Pieter Wuille's Rolling UTXO set hashe= s [1]. Unspent Transaction Output Hashes would simply be mapped to points o= n a curve before adding them to the set. - With the help of inclusion proofs and rolling hashes, libbitcoinconsensus could potentially safely verify entire blocks. The size of the required proofs would be largely irrelevant as they would be consumed locally. - Others? ### TxIn De-duplication Setting aside the potential benefits, the obvious drawback of using a UHS i= s a significant network traffic increase. Fortunately, some properties of transactions can be exploited to offset most of the difference. For quick reference: p2pkh scriptPubKey: DUP HASH160 [pubkey hash] EQUALVERIFY CHECKSIG p2pkh scriptSig: [signature] [pubkey] p2sh scriptPubKey: HASH160 [script hash] EQUAL p2sh scriptSig: [signature(s)] [script] Notice that if a peer is sending a scriptPubKey and a scriptSig together, a= s they would when using a UHS, there would likely be some redundancy. Using a p2sh output for example, the scriptPubKey contains the script hash, and the scriptSig contains the script itself. Therefore when sending dereferenced prevouts over the wire, any hash which can be computed can be omitted and o= nly the preimages sent. Non-standard output scripts must be sent in-full, though because they accou= nt for only ~1% of all current UTXOs, they are rare enough to be ignored here. ### Intra-block Script De-duplication When transactions are chained together in the same block, dereferenced prev= out data for these inputs would be redundant, as the full output data is alread= y present. For that reason, these dereferenced prevouts can be omitted when sending over the wire. The downside would be a new reconstruction pass requirement prior to validation. ### Data Here's some preliminary testing with a naive POC implementation patched int= o Bitcoin Core. Note that the final sizes will depend on optimization of the serialization format. The format used for these tests is suboptimal for sur= e. Syncing mainnet to block 516346: UTXO Node UHS Node IBD Network Data: 153G 157G Block disk space: 175G 157G UTXO disk space : 2.8G 1.6G Total disk space: 177.8G 158.6G The on-disk block-space reduction comes from the elimination of the Undo da= ta that Bitcoin Core uses to roll back orphaned blocks. For UHS Nodes, this da= ta is merged into to the block data and de-duplicated. Compared to the UXTO model, using a UHS reduces disk space by ~12%, yet onl= y requires ~5% more data over the wire. Experimentation shows intra-block de-duplication to be of little help in practice, as it only reduces overhead by ~0.2% on mainnet. It could become = more useful if, for example, CPFP usage increases substantially in the future. ### Safety Assuming sha256 for the UHS's hash function, I don't believe any fundamenta= l changes to Bitcoin's security model are introduced. Because the unspent transaction output hashes commit to all necessary data, including output ty= pes, it should not be possible to be tricked into validating using mutated or fo= rged inputs. ### Compatibility Transitioning from the current UTXO model would be annoying, but not particularly painful. I'll briefly describe my current preferred approach, = but it makes sense to largely ignore this until there's been some discussion ab= out UHS in general. A new service-bit should be allocated to indicate that a node is willing to serve blocks and transactions with dereferenced prevout data appended. Once connected to a node advertising this feature, nodes would use a new getdata flag, creating MSG_PREVDATA_BLOCK and MSG_PREVDATA_TX. Because current full nodes already have this data readily available in the block-undo files, it is trivial to append on-the-fly. For that reason, it w= ould be easy to backport patches to the current stable version of Bitcoin Core i= n order to enable serving these blocks even before they could be consumed. Th= is would avoid an awkward bootstrapping phase where there may only be a few no= des available to serve to all new nodes. Admittedly I haven't put much thought into the on-disk format, I'd rather l= eave that to a database person. Though it does seem like a reasonable excuse to consider moving away from LevelDB. Wallets would begin holding full prevout data for their unspent outputs, th= ough they could probably back-into the data as-is. ### Serialization I would prefer to delay this discussion until a more high-level discussion = has been had, otherwise this would be a magnet for nits. The format used to gat= her the data above can be seen in the implementation below. It should be noted, though, that the size of a UHS is directly dependent on= the chosen hash function. Smaller hashes could potentially be used, but I belie= ve that to be unwise. ### Drawbacks The primary drawback of this approach is the ~5% network ovhead. Additionally, there is the possibility that a few "bridge nodes" may be nee= ded for some time. In a future with a network of only pruned UHS nodes, an old wallet with no knowledge of its dereferenced prevout data would need to broadcast an old-style transaction, and have a bridge node append the extra data before forwarding it along the network. I won't speculate further there, except to say that I can't imagine a transition problem that wouldn't have a straightforward solution. Migration issues aside, am I missing any obvious drawbacks? ### Implementation This code [2] was a quick hack-job, just enough to gather some initial data= . It builds a UHS in memory and never flushes to disk. Only a single run works, nasty things will happen upon restart. It should only be viewed in order to= get an idea of what changes are needed. Only enough for IBD is implemented, mempool/wallet/rpc are likely all broken. It is definitely not consensus-sa= fe. ### Acknowledgement I consider the UHS concept to be an evolution of Bram Cohen's TXO bitfield idea. Bram also provided invaluable input when initially walking through th= e feasibility of a UHS. Pieter Wuille's work on Rolling UTXO set hashes served as a catalyst for rethinking how the UTXO set may be represented. Additional thanks to those at at Financial Crypto and the CoreDev event afterwards who helped to flesh out the idea: Tadge Dryja Greg Sanders John Newbery Neha Narula Jeremy Rubin Jim Posen ...and everyone else who has chimed in. [0] https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2017-March/0139= 28.html [1] https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2017-May/014337= .html [2] https://github.com/theuni/bitcoin/tree/utxo-set-hash3