To: lists@coryfields.com,
	Bitcoin Protocol Discussion <bitcoin-dev@lists.linuxfoundation.org>
References: <CAApLimjfPKDxmiy_SHjuOKbfm6HumFPjc9EFKvw=3NwZO8JcmQ@mail.gmail.com>
From: Matt Corallo <lf-lists@mattcorallo.com>
Message-ID: <2b585894-53ef-6364-81af-e31e9c7a48f5@mattcorallo.com>
Date: Thu, 17 May 2018 11:28:28 -0400
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
	Thunderbird/52.7.0
MIME-Version: 1.0
In-Reply-To: <CAApLimjfPKDxmiy_SHjuOKbfm6HumFPjc9EFKvw=3NwZO8JcmQ@mail.gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Subject: Re: [bitcoin-dev] UHS: Full-node security without maintaining a
 full UTXO set
Precedence: list

Hey Cory,

I'm generally a fan of having an option to "prove a block is valid when
relaying it" instead of "just relay it", but I am concerned that this
proposal is overfitting the current UTXO set. Specifically, because UTXO
entries are (roughly) 32 bytes per output plus 32 bytes per transaction
on disk today, a material increase in batching and many-output
transactions may significantly reduce the UTXO-set-size gain in this
proposal while adding complexity to block relay as well as increase the
size of block data relayed, which can have adverse effects on
propagation. I'd love to see your tests re-run on simulated transaction
data with more batching of sends.

Matt

On 05/16/18 12:36, Cory Fields via bitcoin-dev wrote:
> Tl;dr: Rather than storing all unspent outputs, store their hashes. Untrusted
> peers can supply the full outputs when needed, with very little overhead.
> Any attempt to spoof those outputs would be apparent, as their hashes would not
> be present in the hash set. There are many advantages to this, most apparently
> in disk and memory savings, as well as a validation speedup. The primary
> disadvantage is a small increase in network traffic. I believe that the
> advantages outweigh the disadvantages.
> 
> --
> 
> Bitcoin’s unspent transaction output set (usually referred to as “The UTXO
> set”) has two primary roles: providing proof that previous outputs exist to be
> spent, and providing the actual previous output data for verification when new
> transactions attempts to spend them. These roles are not usually discussed
> independently, but as Bram Cohen's TXO Bitfield [0] idea hints, there are
> compelling reasons to consider them this way.
> 
> To see why, consider running a node with the following changes:
> 
> - For each new output, gather all extra data that will be needed for
>   verification when spending it later as an input: the amount, scriptPubKey,
>   creation height, coinbaseness, and output type (p2pkh, p2sh, p2wpkh, etc.).
>   Call this the Dereferenced Prevout data.
> - Create a hash from the concatenation of the new outpoint and the dereferenced
>   prevout data. Call this a Unspent Transaction Output Hash.
> - Rather than storing the full dereferenced prevout entries in a UTXO set as is
>   currently done, instead store their hashes to an Unspent Transaction Output
>   Hash Set, or UHS.
> - When relaying a transaction, append the dereferenced prevout for each input.
> 
> Now when a transaction is received, it contains everything needed for
> verification, including the input amount, height, and coinbaseness, which would
> have otherwise required a lookup the UTXO set.
> 
> To verify an input's unspentness, again create a hash from the concatenation of
> the referenced outpoint and the provided dereferenced prevout, and check for
> its presence in the UHS. The hash will only be present if a hash of the exact
> same data was previously added to (and not since removed from) the UHS. As
> such, we are protected from a peer attempting to lie about the dereferenced
> prevout data.
> 
> ### Some benefits of the UHS model
> 
> - Requires no consensus changes, purely a p2p/implementation change.
> 
> - UHS is substantially smaller than a full UTXO set (just over half for the
>   main chain, see below). In-memory caching can be much more effective as a
>   result.
> 
> - A block’s transactions can be fully verified before doing a potentially
>   expensive database lookup for the previous output data. The UHS can be
>   queried afterwards (or in parallel) to verify previous output inclusion.
> 
> - Entire blocks could potentially be verified out-of-order because all input
>   data is provided; only the inclusion checks have to be in-order. Admittedly
>   this is likely too complicated to be realistic.
> 
> - pay-to-pubkey outputs are less burdensome on full nodes, since they use no
>   more space on-disk than pay-to-pubkey-hash or pay-to-script-hash. Taproot and
>   Graftroot outputs may share the same benefits.
> 
> - The burden of holding UTXO data is technically shifted from the verifiers to
>   the spender. In reality, full nodes will likely always have a copy as well,
>   but conceptually it's a slight improvement to the incentive model.
> 
> - Block data from peers can also be used to roll backwards during a reorg. This
>   potentially enables an even more aggressive pruning mode.
> 
> - UTXO storage size grows exactly linearly with UTXO count, as opposed to
>   growing linearly with UTXO data size. This may be relevant when considering
>   new larger output types which would otherwise cause the UTXO Set size to
>   increase more quickly.
> 
> - The UHS is a simple set, no need for a key-value database. LevelDB could
>   potentially be dropped as a dependency in some distant future.
> 
> - Potentially integrates nicely with Pieter Wuille's Rolling UTXO set hashes
>   [1]. Unspent Transaction Output Hashes would simply be mapped to points on a
>   curve before adding them to the set.
> 
> - With the help of inclusion proofs and rolling hashes, libbitcoinconsensus
>   could potentially safely verify entire blocks. The size of the required
>   proofs would be largely irrelevant as they would be consumed locally.
> 
> - Others?
> 
> ### TxIn De-duplication
> 
> Setting aside the potential benefits, the obvious drawback of using a UHS is a
> significant network traffic increase. Fortunately, some properties of
> transactions can be exploited to offset most of the difference.
> 
> For quick reference:
> 
> p2pkh scriptPubKey: DUP HASH160 [pubkey hash] EQUALVERIFY CHECKSIG
> p2pkh scriptSig:    [signature] [pubkey]
> 
> p2sh scriptPubKey:  HASH160 [script hash] EQUAL
> p2sh scriptSig:     [signature(s)] [script]
> 
> Notice that if a peer is sending a scriptPubKey and a scriptSig together, as
> they would when using a UHS, there would likely be some redundancy. Using a
> p2sh output for example, the scriptPubKey contains the script hash, and the
> scriptSig contains the script itself. Therefore when sending dereferenced
> prevouts over the wire, any hash which can be computed can be omitted and only
> the preimages sent.
> 
> Non-standard output scripts must be sent in-full, though because they account
> for only ~1% of all current UTXOs, they are rare enough to be ignored here.
> 
> ### Intra-block Script De-duplication
> 
> When transactions are chained together in the same block, dereferenced prevout
> data for these inputs would be redundant, as the full output data is already
> present. For that reason, these dereferenced prevouts can be omitted when
> sending over the wire.
> 
> The downside would be a new reconstruction pass requirement prior to
> validation.
> 
> ### Data
> 
> Here's some preliminary testing with a naive POC implementation patched into
> Bitcoin Core. Note that the final sizes will depend on optimization of the
> serialization format. The format used for these tests is suboptimal for sure.
> Syncing mainnet to block 516346:
> 
>                       UTXO Node      UHS Node
>   IBD Network Data:   153G           157G
>   Block disk space:   175G           157G
>   UTXO disk space :   2.8G           1.6G
>   Total disk space:   177.8G         158.6G
> 
> The on-disk block-space reduction comes from the elimination of the Undo data
> that Bitcoin Core uses to roll back orphaned blocks. For UHS Nodes, this data
> is merged into to the block data and de-duplicated.
> 
> Compared to the UXTO model, using a UHS reduces disk space by ~12%, yet only
> requires ~5% more data over the wire.
> 
> Experimentation shows intra-block de-duplication to be of little help in
> practice, as it only reduces overhead by ~0.2% on mainnet. It could become more
> useful if, for example, CPFP usage increases substantially in the future.
> 
> ### Safety
> 
> Assuming sha256 for the UHS's hash function, I don't believe any fundamental
> changes to Bitcoin's security model are introduced. Because the unspent
> transaction output hashes commit to all necessary data, including output types,
> it should not be possible to be tricked into validating using mutated or forged
> inputs.
> 
> ### Compatibility
> 
> Transitioning from the current UTXO model would be annoying, but not
> particularly painful. I'll briefly describe my current preferred approach, but
> it makes sense to largely ignore this until there's been some discussion about
> UHS in general.
> 
> A new service-bit should be allocated to indicate that a node is willing to
> serve blocks and transactions with dereferenced prevout data appended. Once
> connected to a node advertising this feature, nodes would use a new getdata
> flag, creating MSG_PREVDATA_BLOCK and MSG_PREVDATA_TX.
> 
> Because current full nodes already have this data readily available in the
> block-undo files, it is trivial to append on-the-fly. For that reason, it would
> be easy to backport patches to the current stable version of Bitcoin Core in
> order to enable serving these blocks even before they could be consumed. This
> would avoid an awkward bootstrapping phase where there may only be a few nodes
> available to serve to all new nodes.
> 
> Admittedly I haven't put much thought into the on-disk format, I'd rather leave
> that to a database person. Though it does seem like a reasonable excuse to
> consider moving away from LevelDB.
> 
> Wallets would begin holding full prevout data for their unspent outputs, though
> they could probably back-into the data as-is.
> 
> ### Serialization
> 
> I would prefer to delay this discussion until a more high-level discussion has
> been had, otherwise this would be a magnet for nits. The format used to gather
> the data above can be seen in the implementation below.
> 
> It should be noted, though, that the size of a UHS is directly dependent on the
> chosen hash function. Smaller hashes could potentially be used, but I believe
> that to be unwise.
> 
> ### Drawbacks
> 
> The primary drawback of this approach is the ~5% network ovhead.
> 
> Additionally, there is the possibility that a few "bridge nodes" may be needed
> for some time. In a future with a network of only pruned UHS nodes, an old
> wallet with no knowledge of its dereferenced prevout data would need to
> broadcast an old-style transaction, and have a bridge node append the extra
> data before forwarding it along the network.
> 
> I won't speculate further there, except to say that I can't imagine a
> transition problem that wouldn't have a straightforward solution.
> 
> Migration issues aside, am I missing any obvious drawbacks?
> 
> ### Implementation
> 
> This code [2] was a quick hack-job, just enough to gather some initial data. It
> builds a UHS in memory and never flushes to disk. Only a single run works,
> nasty things will happen upon restart. It should only be viewed in order to get
> an idea of what changes are needed. Only enough for IBD is implemented,
> mempool/wallet/rpc are likely all broken. It is definitely not consensus-safe.
> 
> ### Acknowledgement
> 
> I consider the UHS concept to be an evolution of Bram Cohen's TXO bitfield
> idea. Bram also provided invaluable input when initially walking through the
> feasibility of a UHS.
> 
> Pieter Wuille's work on Rolling UTXO set hashes served as a catalyst for
> rethinking how the UTXO set may be represented.
> 
> Additional thanks to those at at Financial Crypto and the CoreDev event
> afterwards who helped to flesh out the idea:
> 
> Tadge Dryja
> Greg Sanders
> John Newbery
> Neha Narula
> Jeremy Rubin
> Jim Posen
> ...and everyone else who has chimed in.
> 
> 
> [0] https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2017-March/013928.html
> [1] https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2017-May/014337.html
> [2] https://github.com/theuni/bitcoin/tree/utxo-set-hash3
> _______________________________________________
> bitcoin-dev mailing list
> bitcoin-dev@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/bitcoin-dev
>