Received: from sog-mx-4.v43.ch3.sourceforge.com ([172.29.43.194] helo=mx.sourceforge.net) by sfs-ml-2.v29.ch3.sourceforge.com with esmtp (Exim 4.76) (envelope-from ) id 1VtpM3-0007jq-A5 for bitcoin-development@lists.sourceforge.net; Fri, 20 Dec 2013 01:58:19 +0000 Received: from mail-pb0-f46.google.com ([209.85.160.46]) by sog-mx-4.v43.ch3.sourceforge.com with esmtps (TLSv1:RC4-SHA:128) (Exim 4.76) id 1VtpM0-0003Cp-WC for bitcoin-development@lists.sourceforge.net; Fri, 20 Dec 2013 01:58:19 +0000 Received: by mail-pb0-f46.google.com with SMTP id md12so1930490pbc.19 for ; Thu, 19 Dec 2013 17:58:10 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:message-id:date:from:organization:user-agent :mime-version:to:subject:content-type:content-transfer-encoding; bh=zj2nb2fPFLuFAASGxUTnCHPs8yHMQ9rx2EXnZ8wvLSk=; b=Q1PhifIwFZxMRolAIirhdxznQ6MWJlMyerMit4RGXwFU3XLwlgWOV/3S5VtuJQ7jt3 2GrUmZc27awEBlA6x3hq8WnLeTg3h/Pqg/b16uxeGbZRJnhozYCjdF7E+lDSRI66kKwD hqC5a3eFv/AgQcUSfQPOoPItReq7imIkpTCbWRM9zoFwmqEkAW4UJSfy6DSl3OrgSenA NT59+xspB1X6wRNuROwNLaC/Lp1PiUkDPADc4ZV6HDtlG7LeHIjCOrDf/TRTtXX4CMmO wOgPfXuJmZypbFL9cKqZWJmkusoC+9EptFiFXj1eEjt07LtZgcfCGq6vYXKqFujzO3pk mOpQ== X-Gm-Message-State: ALoCoQkKos/5XsQoKeNgZNXx4ThauCrwG6DxBnItqXlWIdDEbLaQGgOAVZRk19CazcH7ml/T6g7l X-Received: by 10.66.162.195 with SMTP id yc3mr5480685pab.64.1387504690679; Thu, 19 Dec 2013 17:58:10 -0800 (PST) Received: from [192.168.127.177] (50-0-36-217.dsl.dynamic.sonic.net. [50.0.36.217]) by mx.google.com with ESMTPSA id pe3sm10332366pbc.23.2013.12.19.17.58.08 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 19 Dec 2013 17:58:09 -0800 (PST) Message-ID: <52B3A1C8.5000005@monetize.io> Date: Thu, 19 Dec 2013 17:47:52 -0800 From: Mark Friedenbach Organization: Monetize.io Inc. User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Bitcoin Dev X-Enigmail-Version: 1.6 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Spam-Score: 0.0 (/) X-Spam-Report: Spam Filtering performed by mx.sourceforge.net. See http://spamassassin.org/tag/ for more details. 0.0 URIBL_BLOCKED ADMINISTRATOR NOTICE: The query to URIBL was blocked. See http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more information. [URIs: wikipedia.org] X-Headers-End: 1VtpM0-0003Cp-WC Subject: [Bitcoin-development] BIP proposal: Authenticated prefix trees X-BeenThere: bitcoin-development@lists.sourceforge.net X-Mailman-Version: 2.1.9 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 Dec 2013 01:58:19 -0000 -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hello fellow bitcoin developers. Included below is the first draft of a BIP for a new Merkle-compressed data structure. The need for this data structure arose out of the misnamed "Ultimate blockchain compression" project, but it has since been recognized to have many other applications. In addition to this BIP I am preparing three additional BIPs describing the use of this data structure in stateless validation & mining, the UBC address index for "SPV+" operating modes, document timestamping and merged mining. A Python implementation of this data structure is available here: https://github.com/monetizeio/python-bitcoin A C++ implementation is being worked on. As per the BIP-1 procedure, I am submitting this rough draft to the community for discussion. I welcome all comments and criticisms of both form and content. - -Mark ==Abstract== This BIP describes a [http://en.wikipedia.org/wiki/Hash_tree Merkle hash tree] variant of the [http://en.wikipedia.org/wiki/Trie prefix-tree data structure], ideally suited for encoding key-value indices which support memory-efficient proofs. ==Motivation== There are a number of applications which would benefit from having a data structure with the following properties: * '''Arbitrary mapping of keys to values.''' A ''key'' can be any bytestring, and its ''value'' any other bytestring. * '''Duplicate keys disallowed.''' Every key has one, and only one value associated with it. Some applications demand assurance that no key value is reused, and that this constraint can be checked without requiring access to the entire data structure. * '''Efficient look-up by key.''' The data structure should support sub-linear lookup operations with respect to the number of keys in the mapping. Logarithmic time or linear with respect to the length of the key should be achievable and would be sufficient for realistic applications. * '''Merkle compression of mapping structure.''' It should be possible to produce a reduced description of the tree consisting of a single root hash value which is deterministically calculated from the mapping structure. * '''Efficient proofs of inclusion.''' It should be possible to extract a proof of key/value mapping which is limited in size and verification time by the length of the key in the worst case. * '''Computation of updates using local information.''' Given a set of inclusion proofs, it should be possible to calculate adjustments to the local mapping structure (update or deletion of included mappings, or insertion between two included mappings which are adjacent in the global structure). Such applications include committed validation indices which enable stateless mining nodes, committed wallet indices which enable trust-less querying of the unspent transaction output set by scriptPubKey, efficient document time-stamping, and secure & efficient merged mining. This BIP describes an authenticated prefix tree which has the above properties, but leaves the myriad applications to be formalized in future BIPs. ==Data structure== This BIP defines a binary prefix tree. Such a structure provides a mapping of bitstrings (the ''keys'') to bytestrings (the ''values''). It is an acyclic binary tree which implicitly encodes keys within the traversal path -- a "left" branch is a 0, and a "right" branch is a 1. Each node is reachable by only one unique path, and reading off the branches taken (0 for each left, 1 for each right) as one follows the path from root to target yields the node's key. The particular binary prefix tree defined by this BIP is a hybrid PATRICIA / de la Brandais tree structure. [http://en.wikipedia.org/wiki/Radix_tree PATRICIA trees] compress a long sequence of non-branching nodes into a single interior node with a per-branch ''skip prefix''. This achieves significant savings in storage space, root hash calculation, and traversal time. A de la Brandais trie achieves compression by only storing branches actually taken in a node. The space savings are minimal for a binary tree, but place the serialized size of a non-branching interior node under the SHA-256 block size, thereby reducing the number of hash operations required to perform updates and validate proofs. This BIP describes the authenticated prefix tree and its many variations in terms of its serialized representation. Additional BIPs describe the application of authenticated prefix trees to such applications as committed indices, document time-stamping, and merged mining. ==Serialization format== As a hierarchical structure, the serialization of an entire tree is the serialization of its root node. A serialized node is the concatenation of five structures: node := flags || VARCHAR(extra) || value || left || right The flags is a single byte field whose composite values determine the bytes that follow. flags = (left_flags << 0) | (right_flags << 2) | (has_value << 4) | (prune_left << 5) | (prune_right << 6) | (prune_value << 7) The left_flags and right_flags are special 2-bit enumeration fields. A value of 0 indicates that the node does not branch in this direction, and the corresponding left or right branch is missing (replaced with the empty string in the node serialization). A value of 1 indicates a single bit key prefix for this branch, implicitly 0 for left and 1 for right. A 2 indicates up to 7 bits of additional skip prefix (beyond the implicit first bit, making 8 bits total) are stored in a compact single-byte format. A 3 indicates a skip prefix with greater than 7 additional bits, stored length-prefix encoded. The single bit has_value indicates whether the node stores a data bytestring, the value associated with its key prefix. Since keys may be any value or length, including one key being a prefix of another, it is possible for interior nodes in addition to leaf nodes to have values associated with them, and therefore an explicit value-existence bit is required. The remaining three bits are used for proof extraction, and are masked away prior to hash operations. prune_left indicates that the entire left branch has been pruned. prune_right has similar meaning for the right branch. If has_value is set, prune_value may be set to exclude the node's value from encoded proof. This is necessary field for interior nodes, since it is possible that their values may be pruned while their children are not. The value field is only present if the bit flags.has_value is set, in which case it is a VARCHAR bytestring: switch flags.has_value: case 0: value := ε case 1: value := VARCHAR(node.value) The extra field is always present, and takes on a bytestring value defined by the particular application. Use of the extra field is application dependent, and will not be covered in this specification. It can be set to the empty bytestring (serialized as a single zero byte) if the application has no use for the extra field. value := VARCHAR(calculate_extra(node)) The left and right non-terminals are only present if the corresponding flags.left_flags or flags.right_flags are non-zero. The format depends on the value of this flags setting: switch branch_flags: case 0: branch := ε case 1: branch := branch_node_or_hash case 2: prefix = prefix >> 1 branch := int_to_byte(1 << len(prefix) | bits_to_int(prefix)) || branch_node_or_hash case 3: prefix = prefix >> 1 branch := VARINT(len(prefix) - 9) || bits_to_string(prefix) || branch_node_or_hash branch_flags is a stand-in meant to describe either left_flags or right_flags, and likewise everywhere else in the above pseudocode branch can be replaced with either left or right. prefix is the key bits between the current node and the next branching, terminal, and/or leaf node, including the implicit leading bit for the branch (0 for the left branch, 1 for the right branch). In the above code, len(prefix) returns the number of bits in the bitstring, and prefix >> 1 drops the first bit reducing the size of the bitstring by one and renumbering the indices accordingly. The function int_to_byte takes an integer in the range [0, 255] and returns the octet representing that value. This is a NOP in many languages, but present in this pseudocode so as to be explicit about what is going on. The function bits_to_int interprets a sequence of bits as a little-endian integer value. This is analogous to the following pseudocode: def bits_to_int(bits): result = 0 for idx in 1..len(bits): if bits[idx] == 1: result |= 1<bits_to_string serializes a sequence of bits into a binary string. It uses little-endian bit and byte order, as demonstrated by the following pseudocode: def bits_to_string(bits): bytes = [0] * ceil(len(bits) / 8) for idx in 1..len(bits): if bits[idx] == 1: bytes[idx / 8] |= 1 << idx % 8 return map(int_to_byte, bytes) branch_node_or_hash is either the serialized child node or its SHA-256 hash and associated meta-data. Context determines which value to use: during digest calculations, disk/database serialization, and when the branch is pruned the hash value is used and serialized in the same way as other SHA-256 values in the bitcoin protocol (note however that it is single-SHA-256, not the double-SHA-256 more commonly used in bitcoin). The number of terminal (value-containing) nodes and the serialized size in bytes of the fully unpruned branch are suffixed to the branch hash. When serializing a proof or snapshotting tree state and the branch is not pruned, the serialized child node is included directly and the count and size are omitted as they can be derived from the serialization. if branch_pruned or SER_HASH: branch_node_or_hash := SHA-256(branch) || count(branch) || size(branch) else: branch_node_or_hash := serialize(branch) As an example, here is the serialization of a prefix tree mapping the names men and women of science to the year of their greatest publication: >>> dict = AuthTree() >>> dict['Curie'] = VARINT(1898) >>> dict('Einstein') = VARINT(1905) >>> dict['Fleming'] = VARINT(1928) >>> dict['中本'] = VARINT(2009) >>> dict.serialize() # An bytestring, broken out into parts: # . Root node: 0x0e # left_flags: 2, right_flags: 3, has_value: 1 0x00 # extra: ε # .l Inner node: 0b01000 0x11 # 0b01000 0x07 # left_flags: 3, right_flags: 1 0x00 # extra: ε # .l.l Inner node: 0b01000011 0b01110101 0b01110010 0b01101001 # 'C' 'u' 'r' 'i' # 0b01100101 # 'e' 0x1abb3a599a02 # 0b01101110101011100100110100101100101 0x10 # has_value: 1 0x00 # extra: ε 0x03fd6a07 # value: VARINT(1911) # .l.r Inner node: 0b010001 0x0f # left_flags: 3, right_flags: 3 0x00 # extra: ε # .l.r.l Inner node: 0b01000101 0b01101001 0b01101110 0b01110011 # 'E' 'i' 'n' 's' # 0b01110100 0b01100101 0b01101001 0b01101110 # 't' 'e' 'i' 'n' 0x312ded9c5d4c2ded00 # 0b1011010010110111 # 0b0011100110111010 # 0b0011001010110100 # 0b101101110 0x10 # has_value: 1 0x00 # extra: ε 0x03fd7107 # value: VARINT(1905) # .l.r.r Inner node: 0b01000110 0b01101100 0b01100101 0b01101101 # 'F' 'l' 'e' 'm' # 0b01101001 0b01101110 0b01100111 # 'i' 'n' 'g' 0x296c4c6d2dedcc01 # 0b0011011000110010 # 0b1011011010110100 # 0b10110111001100111 0x10 # has_value: 1 0x00 # extra: ε 0x03fd8807 # value: VARINT(1928) # .r Inner node: 0b11100100 0b10111000 0b10101101 # '中' # 0b11100110 0b10011100 0b10101100 # '本' 0x27938edab39c1a # 0b1100100101110001 # 0b0101101111001101 # 0b001110010101100 0x10 # has_value: 1 0x00 # extra: ε 0x03fdd907 # value: VARINT(2009) ==Hashing== There are two variations of the authenticated prefix tree presented in this draft BIP. They differ only in the way in which hash values of a node and its left/right branches are constructed. The variations, discussed below, tradeoff computational resources for the ability to compose operational proofs. Whether the performance hit is significant, and whether or not the added features are worth the tradeoff depends very much on the application. ===Variation 1: Level-compressed hashing=== In this variation the referenced child node's hash is used in construction of an interior node's hash digest. The interior node is serialized just as described (using the child node's digest instead of inline serialization), the resulting bytestring is passed through one round of SHA-256, and the digest that comes out of that is the hash value of the node. This is very efficient to calculate, requiring the absolute minimum number of SHA-256 hash operations, and achieving level-compression of computational resources in addition to reduction of space usage. For example: >>> dict = AuthTree() >>> dict['a'] = 0xff >>> dict.serialize() 0x0200c3100001ff >>> dict.root AuthTreeNode( left_prefix = 0b01100001, left_hash = 0xbafa0e2bba3396c5e9804b6cbe61be82bc442c1121aed81f8d5de36e9b20dc2f, left_count = 1, left_size = 4) >>> dict.hash 0xb4837376022a7c9ddaa7d685ad183bcbd5d16c362b81fa293a7b9e911766cf3c Assuming uniform distribution of key values, level-compressed hashing has time-complexity logarithmic with respect to the number of keys in the prefix tree. The disadvantage is that it is not possible in general to "rebase" an operational proof on top of a sibling, particularly if that sibling deletes branches that result in reorganization and level compression of internal nodes used by the rebased proof. ===Variation 2: Proof-updatable hashing=== In this variation, level-compressed branches are expanded into a series of chained single-branch internal nodes, each including the hash of its direct child. For a brach with a prefix N bits in length, this requires N chained hashes. Thanks to node-compression (excluding empty branches from the serialization), it is possible for each hash operation + padding to fit within a single SHA-256 block. Note that the serialization semantics are unchanged! The variation only changes the procedure for calculating the hash values of interior nodes. The serialization format remains the same (modulo differing hash values standing in for pruned branches). Using the above example, calling dict.hash causes the following internal nodes to be constructed: >>> node1 = AuthTreeNode( right_prefix = 0b1, right_hash = 0xbafa0e2bba3396c5e9804b6cbe61be82bc442c1121aed81f8d5de36e9b20dc2f, right_count = 1, right_size = 4) >>> node2 = AuthTreeNode( left_prefix=0b0, left_hash=node1.hash, left_count=1, left_size=4) >>> node3 = AuthTreeNode( left_prefix=0b0, left_hash=node2.hash, left_count=1, left_size=4) >>> node4 = AuthTreeNode( left_prefix=0b0, left_hash=node3.hash, left_count=1, left_size=4) >>> node5 = AuthTreeNode( left_prefix=0b0, left_hash=node4.hash, left_count=1, left_size=4) >>> node6 = AuthTreeNode(right_prefix=0b1, right_hash=node5.hash, right_count=1, right_size=4) >>> node7 = AuthTreeNode(right_prefix=0b1, right_hash=node6.hash, right_count=1, right_size=4) >>> node8 = AuthTreeNode( left_prefix=0b0, left_hash=node7.hash, left_count=1, left_size=4, value=0xff) >>> dict.hash == node8.hash True >>> dict.hash 0xc3a9328eff06662ed9ff8e82aa9cc094d05f70f0953828ea8c643c4679213895 The advantage of proof-updatable hashing is that any operational proof may be "rebased" onto the tree resulting from a sibling proof, using only the information locally available in the proofs, even in the presence of deletion operations that result in level-compression of the serialized form. The disadvantage is performance: validating an updatable proof requires a number of hash operations lower-bounded by the length of the key in bits. ==Inclusion proofs== An inclusion proof is a prefix tree pruned to contain a subset of its keys. The serialization of an inclusion proof takes the following form: inclusion_proof := variant || root_hash || root_node || checksum Where variant is a single-byte value indicating the presence of level-compression (0 for proof-updatable hashing, 1 for level-compressed hashing). root_hash is the Merkle compression hash of the tree, the 32-byte SHA-256 hash of the root node. tree is the possibly pruned, serialized representation of the tree. And finally, checksum is the first 4 bytes of the SHA-256 checksum of variant, root_hash, and root_node. For ease of transport, the standard envelope for display of an inclusion proof is internet-standard base64 encoding in the following format: - -----BEGIN INCLUSION PROOF----- ATzPZheRnns6KfqBKzZs0dXLOxithdan2p18KgJ2c4O0DgARBwAauzpZmgIQAAP9agcPADEt7Zxd TC3tABAAA/1xBylsTG0t7cwBEAAD/YgHJ5OO2rOcGhAAA/3ZByEg+2g= - -----END INCLUSION PROOF----- Decoded, it looks like this: 0x01 # Level-compressed hashing # Merkle root: 0x3ccf6617919e7b3a29fa812b366cd1d5cb3b18ad85d6a7da9d7c2a02767383b4 # Serialized tree (unpruned): 0x0e001107001abb3a599a02100003fd6a070f00312ded9c5d4c2ded00100003fd 0x7107296c4c6d2dedcc01100003fd880727938edab39c1a100003fdd907 # Checksum: 0x2120fb68 ==Operational proofs== An operational proof is a list of insert/update and delete operations suffixed to an inclusion proof which contains the pathways necessary to perform the specified operations. The inclusion proof must contain the key values to be updated or deleted, and the nearest adjacent key values for each insertion. The serialization of an operational proof takes the following form: operational_proof := variant || root_hash || tree || VARLIST(delete*) || VARLIST(update*) || new_hash || checksum delete := VARCHAR(key) update := VARCHAR(key) || VARCHAR(value) The first three fields, variant, root_hash, and tree are the inclusion proof, and take the same values described in the previous section. deletes is a list of key values to be deleted; each key value in this list must exist in the inclusion proof. updates is a list of key, value mappings which are to be inserted into the tree, possibly replacing any mapping for the key which already exists; either the key itself if it exists (update), or the two lexicographically closest keys on either side if it does not (insert) must be present in the insertion proof. new_hash is the resulting Merkle root after the insertion, updates, and deletes are performed, and checksum is the initial 4 bytes of the SHA-256 hash of the preceding fields. Just like inclusion proofs, an operational proof is encoded in base64 for display and transport. Here's the same - -----BEGIN OPERATIONAL PROOF----- ATzPZheRnns6KfqBKzZs0dXLOxithdan2p18KgJ2c4O0LgARaIsVaQi/GdhOPOgA8p4Pu4PiEfEg lcmy3j7bOc7hXw0DLSeTjtqznBoQAAP92QcBMOS4reacrACzuZJbyP7fqIOf5VEk4iarG4+uPoZC oun8BztQMQBy0LHVeSY= - -----END OPERATIONAL PROOF----- Decoded and broken into its constituent fields: 0x01 # Level-compressed hashing # Original Merkle root: 0x3ccf6617919e7b3a29fa812b366cd1d5cb3b18ad85d6a7da9d7c2a02767383b4 # Serialized tree (included keys: '中本'): 0x2e0011688b156908bf19d84e3ce800f29e0fbb83e211f12095c9b2de3edb39ce 0xe15f0d032d27938edab39c1a100003fdd907 # Deletion list ['中本']: 0x01 0x30e4b8ade69cac # Insertion list []: 0x00 # New Merkle root: 0xb3b9925bc8fedfa8839fe55124e226ab1b8fae3e8642a2e9fc073b50310072d0 # Checksum: 0xb1d57926 ~End of File~ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.14 (GNU/Linux) Comment: GPGTools - http://gpgtools.org Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQIcBAEBAgAGBQJSs6HIAAoJEAdzVfsmodw4gooQAJm7XNsZjgdeTSpKIvUIU38f tQx2FD08hQdLl48me5mDUbHJgGlYINsKAgoZ8Mqwi/kHEEYhuIlLIX1p6Ovigidb 21BiVoOLdG1egGOwxp17DuwYaDPTppFTlN9TBjZzW6WKc7+4aNvyc1KtrbHIhtj/ 04ekFyAn4U5UH0ht7CI79j0u3Kp85p5D4PyYZB2m82mzti6OxpSM4tXlMkDW7ihg QJwiZSjzejqTd7WF0zr0SLeGVRSN2A0dzUCoVsI98eIa3hkw2N4ae6dRkibyStOT V8VEDvHArEDlvu8jiryajhsom5mvtOOclNDkVXWAf/Te4gj05iYdTIvNvDEJtqsP XDbmw6GgV1kBLlLo0mp//t/+wr+nIvy+sVAP+eqtM/0vjaVXBkXxkUMqqNkrtVpB f3whq7nFahssUMSoWE93jgob1ayAax2XUALVMAXYsJl7b2MqBGlhiTZ8FQZ+TW4A tIpKeUprPmDvA18rO3SCbmLMQryZqYiH0sRyvUc5kvn3qCRHrISZNkEuK591eS+x BO1eOluPzVqeXPPSK1jvGeY0FNJtwzbov4nI1mzOvzQHLCvkHn5PhUFCK5tL5tAe b0Z5qwDV+SvVs7W1R7ejYBzEj77U1zuzZ9AtikOuvy+bNGrkIlpI49EyXHijm7C3 Q6JacTuI0PelYji2gaBJ =BbDs -----END PGP SIGNATURE-----