From chandhok at cis.ohio-state.edu Sat Jun 1 11:04:02 2002
From: chandhok at cis.ohio-state.edu (Nikhil Chandhok)
Date: Sat Dec 9 22:11:45 2006
Subject: [p2p-hackers] question about SHA-1
Message-ID: <043901c20996$ad1a1fb0$ae3d6ba4@dell1lnxd01>

If I take the message digest of all the IP addresses (all of which are 32 bits long and cover the whole 2^32 space), will the resulting m-bit message digests produced by SHA-1 be uniformly distributed in the 2^160 space? The SHA-1 documentation does not make any such claims. If the message digests produced are not uniformly distributed, what can we say about their distribution?

~nikhil

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://zgp.org/pipermail/p2p-hackers/attachments/20020601/14923013/attachment.htm

From coderman at mindspring.com Sat Jun 1 12:36:02 2002
From: coderman at mindspring.com (coderman)
Date: Sat Dec 9 22:11:45 2006
Subject: [p2p-hackers] question about SHA-1
References: <043901c20996$ad1a1fb0$ae3d6ba4@dell1lnxd01>
Message-ID: <3CF92223.85EE7AAB@mindspring.com>

They will be uniformly distributed. This is an implicit feature of a good cryptographic hash digest.

> Nikhil Chandhok wrote:
>
> If I take the message digest of all the IP addresses (all of which are
> 32 bits long and cover the whole 2^32 space), will the resulting m-bit
> message digests produced by SHA-1 be uniformly distributed in the
> 2^160 space? The SHA-1 documentation does not make any such claims. If
> the message digests produced are not uniformly distributed, what can
> we say about their distribution?
>
> ~nikhil
>

From levine at vinecorp.com Sat Jun 1 15:20:01 2002
From: levine at vinecorp.com (James D.
Levine)
Date: Sat Dec 9 22:11:45 2006
Subject: [p2p-hackers] (SF Bay Area) South Bay PeerPunks meeting
Message-ID:

Since the Silicon Valley contingent of p2p hackerdom can't always make it up to SF to meet with the rest, I'm organizing the South Bay PeerPunks meeting the evening of Tuesday, June 11th in Mountain View.

All p2p enthusiasts, hackers, well-wishers, etc. are invited. Come and participate! I will be easy to find there, just look for the guy in the CodeCon t-shirt.

Where:
Dana Street Roasting Company
744 W Dana St, Mountain View, CA 94041
Phone: (650) 390-9638
(map URL below)

When: 7:00 sharp

Map: http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=744+W+Dana+St&city=Mountain+View&state=CA&csz=Mountain+View,+CA+94041-1304&slt=37.392353&sln=-122.078945&name=&zip=94041-1304&country=us&&BFKey=&BFCat=&BFClient=&mag=9&desc=&cs=9&newmag=8&poititle=&poi=&ds=n

Why PeerPunks and not p2p-hackers? Well, why not? I also want to encourage non-hackers to show up in addition to the usual suspects.

See you there!

From mfreed at MIT.EDU Sat Jun 1 18:00:02 2002
From: mfreed at MIT.EDU (Michael J Freedman)
Date: Sat Dec 9 22:11:45 2006
Subject: [p2p-hackers] question about SHA-1
In-Reply-To: Your message of "Sat, 01 Jun 2002 14:03:47 EDT." <043901c20996$ad1a1fb0$ae3d6ba4@dell1lnxd01>
Message-ID: <200206020059.UAA19784@buzzword-bingo.mit.edu>

Nikhil,

It is believed that the output of a cryptographic hash function such as SHA-1 will be uniformly distributed at random over its range (2^160). However, this has not been proven. Still, you will find that many cryptographic proofs of security are based on this assumption -- termed the random oracle assumption. That is, there exists some black box which, upon request, will return a random number uniformly distributed over a range. If this is true, then we use this property to prove some security properties of our algorithm (for example, RSA with OAEP is secure against IND-CCA2).
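The uniformity assumption is easy to probe empirically. Here is a quick Python sketch (an illustration only, not a proof; the sampling stride and the choice to bucket digests by their first byte are arbitrary):

```python
import hashlib
from collections import Counter

# Hash a sample of 32-bit "IP addresses" and bucket the digests by
# their first byte. A roughly flat histogram is consistent with (but
# does not prove) uniform distribution over the 2^160 output space.
buckets = Counter()
for ip in range(0, 2**32, 2**20):          # 4096 sampled addresses
    digest = hashlib.sha1(ip.to_bytes(4, "big")).digest()
    buckets[digest[0]] += 1

counts = buckets.values()
# If the first byte is uniform, counts cluster around 4096/256 = 16.
print(min(counts), max(counts))
```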
Cryptographic hash functions are widely modeled as such random oracles for such proofs. These are considered weaker proofs than those under general assumptions of computational hardness -- such as the existence of trapdoor permutations or of one-way functions. However, they may allow one to prove stronger properties.

From bram at gawth.com Wed Jun 5 15:15:01 2002
From: bram at gawth.com (Bram Cohen)
Date: Sat Dec 9 22:11:45 2006
Subject: [p2p-hackers] New release of BitTorrent out, and documentation
Message-ID:

Anyone who's interested in how BitTorrent works can read a pretty thorough explanation of the protocol and the reasoning behind it here -

http://bitconjurer.org/BitTorrent/protocol.html

There's also a new release up, which should scale comfortably to thousands of downloaders -

http://bitconjurer.org/BitTorrent/download.html

-Bram Cohen

"Markets can remain irrational longer than you can remain solvent" -- John Maynard Keynes

From tboyle at rosehill.net Wed Jun 5 20:54:02 2002
From: tboyle at rosehill.net (Todd Boyle)
Date: Sat Dec 9 22:11:45 2006
Subject: [p2p-hackers] New release of BitTorrent out, and documentation
In-Reply-To:
Message-ID: <5.1.0.14.0.20020605192005.03d0d160@popmail.cortland.com>

At 03:14 PM 6/5/02, you wrote:
>Anyone who's interested in how BitTorrent works can read a pretty thorough
>explanation of the protocol and the reasoning behind it here -
>
>http://bitconjurer.org/BitTorrent/protocol.html
>
>There's also a new release up, which should scale comfortably to thousands
>of downloaders -
>
>http://bitconjurer.org/BitTorrent/download.html
>
>-Bram Cohen

SeattleWireless dudes are doing a series of field days with legacy 802.11b. Sunday they ran a connection from an omni antenna from a laptop in Alki Point to Magnolia, and sent streaming video 3 miles over it.

http://www.seattlewireless.net/archive/ezmlm.cgi?5:mmp:293

After a few go-rounds I understand that multicasting is not trivial :-( So, clueless end-users would not be able to do it.
The universe does not seem to want IP multicasting to be implemented too widely... Freenet is #15 in all-time downloads on Sourceforge with 1.3million downloads http://sourceforge.net/top/toplist.php?type=downloads They appear to be still grinding away on the code. What if there were a device one could put on the roof, with automatic mesh networking, Freenet storage and bitTorrent or another swarmcast technology? That would be, ehmm. disruptive. Todd From sam at neurogrid.com Fri Jun 7 01:44:01 2002 From: sam at neurogrid.com (Sam Joseph) Date: Sat Dec 9 22:11:45 2006 Subject: [p2p-hackers] Simulator Documentation Message-ID: <3D00736F.2020907@neurogrid.com> Hi all, I probably posted previously here about the NeuroGrid simulator which is a java-based p2p simulation tool that currently supports Gnutella, NeuroGrid and Freenet. I've started to get activity on the neurogrid-simulation mailing list with people asking for things. Gasp, asking for *documentation* no less. Okay, improbable as it seems, I have finally created a bit of documentation on how to extend the NeuroGrid simulator code. It's just a set of power points, but it should get across the basic code structure and how to implement the abstract methods required to create new p2p functionality in the simulator. http://www.neurogrid.net/CodeCamp_EnglishVersion.htm Let me know what you think. I hope to follow this up with some html documentation that goes through the same process in more detail. Hopefully some feedback on these power points will push me in the right direction :-) CHEERS> SAM From bram at gawth.com Sun Jun 9 09:15:02 2002 From: bram at gawth.com (Bram Cohen) Date: Sat Dec 9 22:11:45 2006 Subject: [p2p-hackers] p2p-hackers meeting in a week, mozilla party Message-ID: So, I think it's about time for a new p2p-hackers meeting. When: Next Sunday, the 16th, 3pm Where: SONY Metreon, in the food court area That was easy. Who's going to the Mozilla party? 
It's free in San Francisco - http://mozilla.org/party/2002/flyer.html I'll be there, hopefully will even be able to track down who's responsible for the windows registry stuff in mozilla. -Bram Cohen "Markets can remain irrational longer than you can remain solvent" -- John Maynard Keynes From levine at vinecorp.com Sun Jun 9 23:15:02 2002 From: levine at vinecorp.com (James D. Levine) Date: Sat Dec 9 22:11:45 2006 Subject: [p2p-hackers] Reminder: (SF Bay Area) South Bay PeerPunks meeting this Tuesday In-Reply-To: Message-ID: A friendly reminder for those of us lacking a Palm Pilot. Tuesday at 7pm at Dana St. Roasting Co. in Mountain View. Come for the company, stay for the free 802.11b... -James On Sat, 1 Jun 2002, James D. Levine wrote: > > Since the Silicon Valley contingent of p2p hackerdom > can't always make it up to SF to meet with the rest, > I'm organizing the South Bay PeerPunks meeting the > evening of Tuesday, June 11th in Mountain View. > > > All p2p enthusiasts, hackers, well-wishers, etc. are > invited. Come and participate! > > I will be easy to find there, just look for the > guy in the CodeCon t-shirt. > > > Where: > > Dana Street Roasting Company > 744 W Dana St,Mountain View,CA 94041 > Phone: (650) 390-9638 > (map URL below) > > > When: 7:00 sharp > > > Map: > > http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=744+W+Dana+St&city=Mountain+View&state=CA&csz=Mountain+View,+CA+94041-1304&slt=37.392353&sln=-122.078945&name=&zip=94041-1304&country=us&&BFKey=&BFCat=&BFClient=&mag=9&desc=&cs=9&newmag=8&poititle=&poi=&ds=n > > > Why PeerPunks and not p2p-hackers? Well, why not? I also > want to encourage non-hackers to show up in addition to the > usual suspects. > > > See you there! 
> > > -- From justin at chapweske.com Mon Jun 10 20:51:02 2002 From: justin at chapweske.com (Justin Chapweske) Date: Sat Dec 9 22:11:45 2006 Subject: [p2p-hackers] Tree Hash EXchange format (THEX) Message-ID: <3D057368.9040306@chapweske.com> The Tree Hash EXchange format (THEX) specification, written by myself and Gordon Mohr of Bitzi, is now available at: This specification describes a serialization format for exchanging Merkle Hash Trees. These hash trees can be used to perform fine-grained integrity checking of content within distributed content delivery networks. This specification should be useful for many other systems besides the OCN, so we'd love to hear any feedback you have. Thanks, -- Justin Chapweske, Onion Networks http://onionnetworks.com/ -------------- next part -------------- J. Chapweske Onion Networks, Inc. G. Mohr Bitzi, Inc. June 10, 2002 Tree Hash EXchange format (THEX) Abstract This memo presents the Tree Hash Exchange (THEX) format, for exchanging Merkle Hash Trees built up from the subrange hashes of discrete digital files. Such tree hash data structures assist in file integrity verification, allowing arbitrary subranges of bytes to be verified before the entire file has been received. Chapweske & Mohr [Page 1] Tree Hash EXchange format (THEX) June 2002 Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 2. Merkle Hash Trees . . . . . . . . . . . . . . . . . . . . . 4 2.1 Unbalanced Trees . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Choice Of Segment Size . . . . . . . . . . . . . . . . . . . 6 3. Serialization Format . . . . . . . . . . . . . . . . . . . . 8 3.1 DIME Encapsulation . . . . . . . . . . . . . . . . . . . . . 8 3.2 XML Tree Description . . . . . . . . . . . . . . . . . . . . 8 3.2.1 File Size . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.2.2 File Segment Size . . . . . . . . . . . . . . . . . . . . . 9 3.2.3 Digest Algorithm . . . . . . . . . . . . . . . . . . . . . . 
9
3.2.4 Digest Output Size . . . . . . . . . . . . . . . . . . . . . 10
3.2.5 Serialized Tree Depth . . . . . . . . . . . . . . . . . . .  10
3.2.6 Serialized Tree Type . . . . . . . . . . . . . . . . . . . . 10
3.2.7 Serialized Tree URI . . . . . . . . . . . . . . . . . . . .  10
3.3 Breadth-First Serialization . . . . . . . . . . . . . . . . .  10
3.3.1 Serialization Type URI . . . . . . . . . . . . . . . . . . . 11
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 11

Chapweske & Mohr                                                [Page 2]

Tree Hash EXchange format (THEX)                               June 2002

1. Introduction

The Merkle Hash Tree, invented by Ralph Merkle, is a hash construct that has very nice properties for verifying the integrity of files and file subranges in an incremental or out-of-order fashion. This document describes a binary serialization format for hash trees that is compact and optimized for both sequential and random access.

This memo has two goals:

1. To describe Merkle Hash Trees and how they are used for file integrity verification.

2. To describe a serialization format for storage and transmission of hash trees.

Chapweske & Mohr                                                [Page 3]

Tree Hash EXchange format (THEX)                               June 2002

2. Merkle Hash Trees

It is common practice in distributed systems to use secure hash algorithms to verify the integrity of content. The employment of secure hash algorithms enables systems to retrieve content from completely untrusted hosts with only a small amount of trusted metadata. Typically, algorithms such as SHA-1 and MD5 have been used to check the content integrity after retrieving the entire file. These full file hash techniques work fine in an environment where the content is received from a single host and there are no streaming requirements. However, there are an increasing number of systems that retrieve a single piece of content from multiple untrusted hosts, and require content verification well in advance of retrieving the entire file.
Many modern peer-to-peer content delivery systems employ fixed size "block hashes" to provide a finer level of granularity in their integrity checking. This approach is still limited in the verification resolution it can attain. Additionally, all of the hash information must be retrieved from a trusted host, which can limit the scalability and reliability of the system.

Another way to verify content is to use the hash tree approach. This approach has the desired characteristics missing from the full file hash approach and works well for very large files. The idea is to break the file up into a number of small pieces, hash those pieces, and then iteratively combine and rehash the resulting hashes in a tree-like fashion until a single "root hash" is created.

The root hash by itself behaves exactly the same way that full file hashes do. If the root hash is retrieved from a trusted source, it can be used to verify the integrity of the entire content. More importantly, the root hash can be combined with a small number of other hashes to verify the integrity of any of the file segments.

For example, consider a file made up of four segments, S1, S2, S3, and S4. Let H() be the hash function, and '+' indicate concatenation. You could take the traditional hash value:

VALUE=H(S1+S2+S3+S4)

Chapweske & Mohr                                                [Page 4]

Tree Hash EXchange format (THEX)                               June 2002

Or, you could employ a tree approach:

              ROOT=H(E+F)
               /       \
              /         \
       E=H(A+B)         F=H(C+D)
        /     \          /     \
       /       \        /       \
  A=H(S1)  B=H(S2)  C=H(S3)  D=H(S4)

Now, assuming that the ROOT is retrieved from a trusted source, the integrity of a file segment coming from an untrusted source can be checked with a small amount of hash data. For instance, if S1 is received from an untrusted host, the integrity of S1 can be verified with just B and F. With these, it can be verified that, yes, S1 can be combined up to equal the ROOT hash, even without seeing the other segments.
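The four-segment example can be sketched in Python, following the diagram literally: plain SHA-1 over concatenations (an illustration only; deployed systems may domain-separate leaf and internal hashes, which this sketch does not):

```python
import hashlib

def H(data: bytes) -> bytes:
    # SHA-1 stands in for the generic H() of the example.
    return hashlib.sha1(data).digest()

S1, S2, S3, S4 = b"seg1", b"seg2", b"seg3", b"seg4"

# Build the tree bottom-up, exactly as in the diagram.
A, B, C, D = H(S1), H(S2), H(S3), H(S4)
E, F = H(A + B), H(C + D)
ROOT = H(E + F)

# Verify S1 from an untrusted host using only the proof values B and F.
assert H(H(H(S1) + B) + F) == ROOT
```

Only B and F travel alongside S1; the verifier recomputes the path from the leaf up and compares against the trusted ROOT.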
(It is just as impractical to create falsified values of B and F as it is to manipulate any good hash function to give desired results -- so B and F can come from untrusted sources as well.)

Similarly, if some other untrusted source provides segments S3 and S4, their integrity can be easily checked when combined with hash E. From segments S3 and S4, the values of C and D and then F can be calculated. With these, you can verify that S3 and S4 can combine up to create the ROOT -- even if other sources are providing bogus S1 and S2 segments. Bad info can be immediately recognized and discarded, and good info retained, even in situations where you could not even begin to calculate a traditional full-file hash.

Another interesting property of the tree approach is that it can be used to verify (tree-aligned) subranges whose size is any multiple of the base segment size. Consider for example an initial segment size of 1,024 bytes, and a file of 32GB. You could verify a single 1,024-byte block, with about 25 proof-assist values, or a block of size 16GB, with a single proof-assist value -- or anything in between.

2.1 Unbalanced Trees

For trees that are unbalanced -- that is, they have a number of leaves which is not a power of 2 -- interim hash values which do not have a sibling value to which they may be concatenated are promoted, unchanged, up the tree until a sibling is found.

Chapweske & Mohr                                                [Page 5]

Tree Hash EXchange format (THEX)                               June 2002

For example, consider a file made up of 5 segments, S1, S2, S3, S4, and S5.

              ROOT=H(H+E)
               /       \
              /         \
        H=H(F+G)         E
         /     \          \
        /       \          \
  F=H(A+B)  G=H(C+D)        E
   /    \     /   \          \
  /      \   /     \          \
A=H(S1) B=H(S2) C=H(S3) D=H(S4) E=H(S5)

In the above example, E does not have any immediate siblings with which to be combined to calculate the next generation. So, E is promoted up the tree, without being rehashed, until it can be paired with value H. The values H and E are then concatenated, and hashed, to produce the ROOT hash.
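The promotion rule generalizes to a simple bottom-up root calculation (an illustrative Python helper, not part of the specification):

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def tree_root(segments):
    """Merkle root; a node without a sibling is promoted unchanged."""
    row = [H(s) for s in segments]
    while len(row) > 1:
        nxt = [H(row[i] + row[i + 1]) for i in range(0, len(row) - 1, 2)]
        if len(row) % 2:
            nxt.append(row[-1])     # no sibling: promote, don't rehash
        row = nxt
    return row[0]

# Reproduce the 5-segment example by hand and check it matches.
segs = [b"S1", b"S2", b"S3", b"S4", b"S5"]
A, B, C, D, E = (H(s) for s in segs)
F, G = H(A + B), H(C + D)
ROOT = H(H(F + G) + E)              # E is promoted two generations
assert tree_root(segs) == ROOT
```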
2.2 Choice Of Segment Size

Any segment size is possible, but the choice of base segment size establishes the smallest possible unit of verification. If the segment size is equal to or larger than the file to be hashed, the tree hash value is simply the single segment's hash value, which is the same as the underlying hash algorithm value for the whole file.

A segment size equal to the digest algorithm output size would more than double the total amount of data to be hashed, and thus more than double the time required to calculate the tree hash structure, as compared to a simple full-file hash. However, once the segment size reaches several multiples of the digest size, calculating the tree adds only a small fractional time overhead beyond what a traditional full-file hash would cost.

Otherwise, smaller segments are better. Smaller segments allow, but do not require, the retention and use of fine-grained verification info. (A stack-based tree calculation procedure need never retain more than one pending internal node value per generation before it can be combined with a sibling, and all interim values below a certain generation size of interest can be discarded.) Further, it is beneficial for multiple application domains and even files of wildly different sizes to share the same base segment size, so that tree structures can be shared and used to discover correlated subranges.

Chapweske & Mohr                                                [Page 6]

Tree Hash EXchange format (THEX)                               June 2002

Thus the authors recommend a segment size of 1,024 bytes for most applications, as a sort of "smallest common denominator", even for applications involving multi-gigabyte or terabyte files. This segment size is 40-50 times larger than common secure hash digest lengths (20-24 bytes), and thus adds no more than 5-10% in running time as compared to the "infinite segment" size case -- the traditional full-file hash.
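These overhead figures can be sanity-checked with a back-of-the-envelope model (an approximation for illustration, not from the specification): the leaves hash the whole file once, and the internal nodes together hash roughly 2*d/s of the file size again, where d is the digest size and s is the segment size.

```python
# Rough extra-hashing overhead of a tree versus a flat full-file hash,
# assuming each internal node hashes two digests and there are about
# as many internal nodes as leaves (a simplifying approximation).
def tree_overhead(segment_size: int, digest_size: int = 20) -> float:
    return 2 * digest_size / segment_size

print(f"{tree_overhead(1024):.1%}")   # ~3.9% extra for 1,024-byte segments
print(f"{tree_overhead(20):.1%}")     # ~200% extra when segment == digest size
```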
Considering a 1 terabyte file, the maximum dynamic state required during the calculation of the tree root value is 29 interim node values -- less than 1KB assuming a 20-byte digest algorithm like SHA-1. Only interim values in generations of interest for range verification need to be remembered for tree exchange, so if only 8GB ranges ever need to be verified, all but the top 8 generations of internal values (255 hashes) can be discarded.

Chapweske & Mohr                                                [Page 7]

Tree Hash EXchange format (THEX)                               June 2002

3. Serialization Format

This section presents a serialization format for Merkle Hash Trees that utilizes the Direct Internet Message Encapsulation (DIME) format. DIME is a generic message format that allows for multiple payloads, either text or binary. The Merkle Hash Tree serialization format consists of two different payloads. The first is XML-encoded metadata about the hash tree, and the second is a binary serialization of the tree itself. The binary serialization is required for two important reasons:

1. Compactness of Representation - A key virtue of the hash tree approach is that it provides considerable integrity checking power with a relatively small amount of data. A typical hash tree consists of a large number of small hashes. Thus a text encoding, such as XML, could easily double the storage and transmission requirements of the hash tree, negating one of its key benefits.

2. Random Access - In order to take full advantage of the hash tree construct, it is often necessary to read the elements of the hash tree in a random access fashion. A common usage of this serialization format will be to access hash data over the HTTP protocol using "Range Requests". This will allow implementors to retrieve small bits of hash information on-demand, even requesting different parts of the tree from different hosts on the network.

3.1 DIME Encapsulation

It is RECOMMENDED that DIME be used to encapsulate the payloads described in this specification.
The current version of DIME is "draft-nielsen-dime-01" at (http://gotdotnet.com/team/xml_wsspecs/dime/default.aspx).

It is RECOMMENDED that the first payload in the DIME Message be the XML Tree Description. The XML Tree Description payload MUST be before the binary serialized tree. It is RECOMMENDED that the binary serialized tree be stored in a single payload rather than using chunked payloads. This will allow implementations to read the tree hash data in a random access fashion within the payload.

3.2 XML Tree Description

The XML Tree Description contains metadata about the hash tree and file that is necessary to interpret the binary serialized tree.

Chapweske & Mohr                                                [Page 8]

Tree Hash EXchange format (THEX)                               June 2002

An important consideration in the design of THEX is the intention for it to be received from untrusted sources within a distributed network. The only information that needs to be obtained from a trusted source is the root hash and the segment size. The root hash by itself can be used to verify the integrity of the serialized tree and of the file itself. It is RECOMMENDED that implementers assume that the serialized file was obtained from an untrusted source; thus the use of this format to store non-verifiable information, such as general file metadata, is highly discouraged. For instance, a malicious party could easily forge metadata, such as the author or file name.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE hashtree SYSTEM "http://open-content.net/spec/thex/thex.dtd">
<hashtree>
  <file size='1146045066' segmentsize='1024'/>
  <digest algorithm='http://www.w3.org/2000/09/xmldsig#sha1'
          outputsize='20'/>
  <serializedtree depth='22'
      type='http://open-content.net/spec/thex/breadthfirst'
      uri='uuid:09233523-345b-4351-b623-5dsf35sgs5d6'/>
</hashtree>

3.2.1 File Size

The file size attribute refers to the size, in bytes, of the file that the hash tree was generated from.
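A consumer of the XML Tree Description might read it as follows (a hypothetical sketch using Python's standard library; the DOCTYPE is omitted for brevity, and the attribute names follow the example above):

```python
import xml.etree.ElementTree as ET

# Parse a Tree Description payload and pull out the fields needed to
# interpret the binary serialized tree.
xml = """<hashtree>
  <file size='1146045066' segmentsize='1024'/>
  <digest algorithm='http://www.w3.org/2000/09/xmldsig#sha1'
          outputsize='20'/>
  <serializedtree depth='22'
      type='http://open-content.net/spec/thex/breadthfirst'
      uri='uuid:09233523-345b-4351-b623-5dsf35sgs5d6'/>
</hashtree>"""

tree = ET.fromstring(xml)
segment_size = int(tree.find("file").get("segmentsize"))
digest_size = int(tree.find("digest").get("outputsize"))
print(segment_size, digest_size)    # 1024 20
```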
3.2.2 File Segment Size

The file segment size identifies the size, in bytes, of the file segments that were used to create the hash tree. As noted in Section 2.2, it is recommended that applications use a small, common segment size such as 1,024 bytes in order to retain maximum flexibility and interoperability.

3.2.3 Digest Algorithm

This attribute provides the identifier URI for the digest algorithm. A URI is used here as an identifier instead of a regular string to avoid the overhead of IANA-style registration. By using URIs, new types can be created without having to consult any other entity. The URIs are only to be used for type identification purposes, but it is RECOMMENDED that the URIs point to information about the given digest function. This convention is inspired by RFC 3275, the XML Signature Specification.

Chapweske & Mohr                                                [Page 9]

Tree Hash EXchange format (THEX)                               June 2002

For instance, the SHA-1 algorithm is identified by "http://www.w3.org/2000/09/xmldsig#sha1"

3.2.4 Digest Output Size

This attribute specifies the size of the output of the hash function, in bytes.

3.2.5 Serialized Tree Depth

This attribute specifies the number of levels of the tree that have been serialized. This value allows control over the amount of storage space required by the serialized tree. In general, each row added to the tree will double the storage requirements while also doubling the verification resolution.

3.2.6 Serialized Tree Type

This attribute provides the identifier URI for the serialization type. Just as with the Digest Algorithm, new serialization types can be added and described without going through a formal IANA-style process. One serialization type is defined for "Breadth-First Serialization" later in this document.

3.2.7 Serialized Tree URI

This attribute provides the URI of the binary serialized tree payload.
If used within a DIME payload, it is recommended that this URI be location independent, such as the "uuid:" URIs used in the SOAP in DIME specification or SHA-1 URNs.

3.3 Breadth-First Serialization

Normal breadth-first serialization is the recommended manner in which to serialize the hash tree. This format includes the root hash first, and then each "row" of hashes is serialized until the tree has been serialized to the lowest level as specified by the "Serialized Tree Depth" field.

Chapweske & Mohr                                               [Page 10]

Tree Hash EXchange format (THEX)                               June 2002

For example, consider a file made up of 5 segments, S1, S2, S3, S4, and S5.

              ROOT=H(H+E)
               /       \
              /         \
        H=H(F+G)         E
         /     \          \
        /       \          \
  F=H(A+B)  G=H(C+D)        E
   /    \     /   \          \
  /      \   /     \          \
A=H(S1) B=H(S2) C=H(S3) D=H(S4) E=H(S5)

The hashes would be serialized in the following order: ROOT, H, E, F, G, E, A, B, C, D, E. Notice that E is serialized as a part of each row. This is due to its promotion, as there are no available siblings in the lower rows. If we choose to serialize the entire tree, the serialized tree depth would be 4, and for a 20-byte digest output, the entire tree payload would occupy 11*20 = 220 bytes.

3.3.1 Serialization Type URI

The serialization type URI for a Merkle Hash Tree serialized in normal breadth-first form is "http://open-content.net/spec/thex/breadthfirst".

Authors' Addresses

Justin Chapweske
Onion Networks, Inc.
1668 Rosehill Circle
Lauderdale, MN  55108
US

EMail: justin@onionnetworks.com
URI: http://onionnetworks.com/

Gordon Mohr
Bitzi, Inc.
EMail: gojomo@bitzi.com URI: http://bitzi.com/ Chapweske & Mohr [Page 11] From justin at onionnetworks.com Tue Jun 11 02:21:02 2002 From: justin at onionnetworks.com (Justin Chapweske) Date: Sat Dec 9 22:11:45 2006 Subject: [p2p-hackers] Tree Hash EXchange format (THEX) Message-ID: <3D056FE5.9090503@onionnetworks.com> The Tree Hash EXchange format (THEX) specification, written by myself and Gordon Mohr of Bitzi, is now available at: This specification describes a serialization format for exchanging Merkle Hash Trees. These hash trees can be used to perform fine-grained integrity checking of content within distributed content delivery networks. This specification should be useful for many other systems besides the OCN, so we'd love to hear any feedback you have. Thanks, -- Justin Chapweske, Onion Networks http://onionnetworks.com/ -------------- next part -------------- J. Chapweske Onion Networks, Inc. G. Mohr Bitzi, Inc. June 10, 2002 Tree Hash EXchange format (THEX) Abstract This memo presents the Tree Hash Exchange (THEX) format, for exchanging Merkle Hash Trees built up from the subrange hashes of discrete digital files. Such tree hash data structures assist in file integrity verification, allowing arbitrary subranges of bytes to be verified before the entire file has been received. Chapweske & Mohr [Page 1] Tree Hash EXchange format (THEX) June 2002 Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 2. Merkle Hash Trees . . . . . . . . . . . . . . . . . . . . . 4 2.1 Unbalanced Trees . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Choice Of Segment Size . . . . . . . . . . . . . . . . . . . 6 3. Serialization Format . . . . . . . . . . . . . . . . . . . . 8 3.1 DIME Encapsulation . . . . . . . . . . . . . . . . . . . . . 8 3.2 XML Tree Description . . . . . . . . . . . . . . . . . . . . 8 3.2.1 File Size . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.2.2 File Segment Size . . . . . . . . . . . . . . . . . . . . . 
9 3.2.3 Digest Algorithm . . . . . . . . . . . . . . . . . . . . . . 9 3.2.4 Digest Output Size . . . . . . . . . . . . . . . . . . . . . 10 3.2.5 Serialized Tree Depth . . . . . . . . . . . . . . . . . . . 10 3.2.6 Serialized Tree Type . . . . . . . . . . . . . . . . . . . . 10 3.2.7 Serialized Tree URI . . . . . . . . . . . . . . . . . . . . 10 3.3 Breadth-First Serialization . . . . . . . . . . . . . . . . 10 3.3.1 Serialization Type URI . . . . . . . . . . . . . . . . . . . 11 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . 11 Chapweske & Mohr [Page 2] Tree Hash EXchange format (THEX) June 2002 1. Introduction The Merkle Hash Tree, invented by Ralph Merkle, is a hash construct that has very nice properties for verifying the integrity of files and file subranges in an incremental or out-of-order fashion. This document describes a binary serialization format for hash trees that is compact and optimized for both sequential and random access. This memo has two goals: 1. To describe Merkle Hash Trees and how they are used for file integrity verification. 2. To describe a serialization format for storage and transmission of hash trees. Chapweske & Mohr [Page 3] Tree Hash EXchange format (THEX) June 2002 2. Merkle Hash Trees It is common practice in distributed systems to use secure hash algorithms to verify the integrity of content. The employment of secure hash algorithms enables systems to retreive content from completely untrusted hosts with only a small amount of trusted metadata. Typically, algorithms such as SHA-1 and MD5 have been used to check the content integrity after retrieving the entire file. These full file hash techniques work fine in an environment where the content is received from a single host and there are no streaming requirements. However, there are an increasing number of systems that retrieve a single piece of content from multiple untrusted hosts, and require content verification well in advance of retrieving the entire file. 
Many modern peer-to-peer content delivery systems employ fixed size "block hashes" to provide a finer level of granularity in their integrity checking. This approach is still limited in the verification resolution it can attain. Additionally, all of the hash information must be retrieved from a trusted host, which can limit the scalability and reliability of the system. Another way to verify content is to use the hash tree approach. This approach has the desired characteristics missing from the full file hash approach and works well for very large files. The idea is to break the file up into a number of small pieces, hash those pieces, and then iteratively combine and rehash the resulting hashes in a tree-like fashion until a single "root hash" is created. The root hash by itself behaves exactly the same way that full file hashes do. If the root hash is retrieved from a trusted source, it can be used to verify the integrity of the entire content. More importantly, the root hash can be combined with a small number of other hashes to verify the integrity of any of the file segments. For example, consider a file made up of four segments, S1, S2, S3, and S4. Let H() be the hash function, and '+' indicate concatenation. You could take the traditional hash value: VALUE=H(S1+S2+S3+S4) Chapweske & Mohr [Page 4] Tree Hash EXchange format (THEX) June 2002 Or, you could employ a tree approach: ROOT=H(E+F) / \ / \ E=H(A+B) F=H(C+D) / \ / \ / \ / \ A=H(S1) B=H(S2) C=H(S3) D=H(S4) Now, assuming that the ROOT is retrieved from a trusted source, the integrity of a file segment coming from an untrusted source can be checked with a small amount of hash data. For instance, if S1 is received from an untrusted host, the integrity of S1 can be verified with just B and F. With these, it can be verified that, yes: S1 can be combined up to equal the ROOT hash, even without seeing the other segments. 
(It is just as impractical to create falsified values of B and F as it is to manipulate any good hash function to give desired results -- so B and F can come from untrusted sources as well.)

Similarly, if some other untrusted source provides segments S3 and S4, their integrity can be easily checked when combined with hash E. From segments S3 and S4, the values of C and D, and then F, can be calculated. With these, you can verify that S3 and S4 combine up to create the ROOT -- even if other sources are providing bogus S1 and S2 segments. Bad info can be immediately recognized and discarded, and good info retained, even in situations where you could not even begin to calculate a traditional full-file hash.

Another interesting property of the tree approach is that it can be used to verify (tree-aligned) subranges whose size is any multiple of the base segment size. Consider, for example, an initial segment size of 1,024 bytes and a file of 32GB. You could verify a single 1,024-byte block with about 25 proof-assist values, or a block of size 16GB with a single proof-assist value -- or anything in between.

2.1 Unbalanced Trees

For trees that are unbalanced -- that is, they have a number of leaves which is not a power of 2 -- interim hash values which do not have a sibling value to which they may be concatenated are promoted, unchanged, up the tree until a sibling is found.

For example, consider a file made up of 5 segments: S1, S2, S3, S4, and S5.

                    ROOT=H(H+E)
                     /      \
                    /        \
             H=H(F+G)         E
              /    \           \
             /      \           \
        F=H(A+B)  G=H(C+D)       E
         /   \      /   \         \
        /     \    /     \         \
    A=H(S1) B=H(S2) C=H(S3) D=H(S4) E=H(S5)

In the above example, E does not have any immediate sibling with which to be combined to calculate the next generation. So, E is promoted up the tree, without being rehashed, until it can be paired with value H. The values H and E are then concatenated and hashed to produce the ROOT hash.
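The promotion rule can be sketched as a level-by-level loop. This is an illustrative Python sketch, not part of the specification; the segment contents are placeholders, and the final assertion reproduces the 5-segment diagram above (ROOT = H(H(F+G) + E)).

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def merkle_root(segments):
    """Combine leaf hashes level by level, promoting any unpaired
    node unchanged to the next generation, per Section 2.1."""
    row = [H(s) for s in segments]
    while len(row) > 1:
        nxt = [H(row[i] + row[i + 1]) for i in range(0, len(row) - 1, 2)]
        if len(row) % 2:
            nxt.append(row[-1])   # odd node out: promote without rehashing
        row = nxt
    return row[0]

# The 5-segment example above
segs = [b"S1", b"S2", b"S3", b"S4", b"S5"]
A, B, C, D, E = (H(s) for s in segs)
expected = H(H(H(A + B) + H(C + D)) + E)   # ROOT = H(H(F+G) + E)
assert merkle_root(segs) == expected
```

For a single-segment file the loop never runs, so the tree hash collapses to the plain hash of that segment, as Section 2.2 below notes.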
2.2 Choice Of Segment Size

Any segment size is possible, but the choice of base segment size establishes the smallest possible unit of verification. If the segment size is equal to or larger than the file to be hashed, the tree hash value is simply the single segment's hash value, which is the same as the underlying hash algorithm's value for the whole file.

A segment size equal to the digest algorithm output size would more than double the total amount of data to be hashed, and thus more than double the time required to calculate the tree hash structure, as compared to a simple full-file hash. However, once the segment size reaches several multiples of the digest size, calculating the tree adds only a small fractional time overhead beyond what a traditional full-file hash would cost. Beyond that, smaller segments are better. Smaller segments allow, but do not require, the retention and use of fine-grained verification info. (A stack-based tree calculation procedure need never retain more than one pending internal node value per generation before it can be combined with a sibling, and all interim values below a certain generation size of interest can be discarded.) Further, it is beneficial for multiple application domains, and even files of wildly different sizes, to share the same base segment size, so that tree structures can be shared and used to discover correlated subranges.

Thus the authors recommend a segment size of 1,024 bytes for most applications, as a sort of "smallest common denominator", even for applications involving multi-gigabyte or terabyte files. This segment size is 40-50 times larger than common secure hash digest lengths (20-24 bytes), and thus adds no more than 5-10% in running time as compared to the "infinite segment size" case -- the traditional full-file hash.
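The overhead figures above can be sanity-checked with simple arithmetic. A file of n bytes cut into n/s segments has roughly n/s internal nodes in total, and each internal hash covers two concatenated digests (2d bytes), so the extra data hashed relative to a plain full-file hash is approximately 2d/s. A tiny sketch of this back-of-the-envelope model (it assumes hashing time is proportional to bytes hashed, which is only an approximation):

```python
def extra_hashing_ratio(segment_size: int, digest_size: int = 20) -> float:
    """Approximate extra bytes hashed by the tree, as a fraction of the
    file size: ~one internal node per leaf, each hashing two digests."""
    return 2 * digest_size / segment_size

# segment size == digest size: the "more than double" case
assert extra_hashing_ratio(20) == 2.0

# the recommended 1,024-byte segments: only a few percent of extra hashing
assert extra_hashing_ratio(1024) < 0.05
```

With 1,024-byte segments and a 20-byte digest the model gives about 4%, consistent with the "no more than 5-10%" figure quoted above once per-invocation overheads are included.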
Considering a 1 terabyte file, the maximum dynamic state required during the calculation of the tree root value is 29 interim node values -- less than 1KB, assuming a 20-byte digest algorithm like SHA-1. Only interim values in generations of interest for range verification need to be remembered for tree exchange, so if only 8GB ranges ever need to be verified, all but the top 8 generations of internal values (255 hashes) can be discarded.

3. Serialization Format

This section presents a serialization format for Merkle Hash Trees that utilizes the Direct Internet Message Encapsulation (DIME) format. DIME is a generic message format that allows for multiple payloads, either text or binary. The Merkle Hash Tree serialization format consists of two different payloads. The first is XML-encoded metadata about the hash tree, and the second is a binary serialization of the tree itself. The binary serialization is required for two important reasons:

1. Compactness of Representation - A key virtue of the hash tree approach is that it provides considerable integrity checking power with a relatively small amount of data. A typical hash tree consists of a large number of small hashes. Thus a text encoding, such as XML, could easily double the storage and transmission requirements of the hash tree, negating one of its key benefits.

2. Random Access - In order to take full advantage of the hash tree construct, it is often necessary to read the elements of the hash tree in a random access fashion. A common usage of this serialization format will be to access hash data over the HTTP protocol using "Range Requests". This will allow implementors to retrieve small bits of hash information on demand, even requesting different parts of the tree from different hosts on the network.

3.1 DIME Encapsulation

It is RECOMMENDED that DIME be used to encapsulate the payloads described in this specification.
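The bounded-state claim above (at most one pending interim node value per generation) can be illustrated with a stack-based root calculation. This is an illustrative sketch, assuming this draft's plain H(left+right) construction; for a terabyte file with 1,024-byte segments there are 2^30 leaves, so the stack never holds more than about 30 values, matching the bound quoted above.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def streaming_root(segments):
    """One pass over the segments, keeping at most one pending interim
    node value per generation (~log2(n) state for n segments)."""
    stack = []  # (generation, digest) pairs, generations strictly decreasing
    for seg in segments:
        gen, node = 0, H(seg)
        # Two nodes of the same generation are siblings: combine immediately
        while stack and stack[-1][0] == gen:
            _, left = stack.pop()
            gen, node = gen + 1, H(left + node)
        stack.append((gen, node))
    # End of input: leftover nodes are promoted unchanged and folded upward
    _, root = stack.pop()
    while stack:
        _, left = stack.pop()
        root = H(left + root)
    return root
```

The final fold implements the promotion rule of Section 2.1: an unpaired node is concatenated, unchanged, with the next pending node up the stack.
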
The current version of DIME is "draft-nielsen-dime-01" at (http://gotdotnet.com/team/xml_wsspecs/dime/default.aspx). It is RECOMMENDED that the first payload in the DIME Message be the XML Tree Description. The XML Tree Description payload MUST come before the binary serialized tree. It is RECOMMENDED that the binary serialized tree be stored in a single payload rather than using chunked payloads. This will allow implementations to read the tree hash data in a random access fashion within the payload.

3.2 XML Tree Description

The XML Tree Description contains metadata about the hash tree and file that is necessary to interpret the binary serialized tree. An important consideration in the design of THEX is the intention for it to be received from untrusted sources within a distributed network. The only information that needs to be obtained from a trusted source is the root hash and the segment size. The root hash by itself can be used to verify the integrity of the serialized tree and of the file itself. It is RECOMMENDED that implementers assume that the serialized file was obtained from an untrusted source; thus the use of this format to store non-verifiable information, such as general file metadata, is highly discouraged. For instance, a malicious party could easily forge metadata, such as the author or file name.

   <?xml version="1.0" encoding="UTF-8"?>
   <!DOCTYPE hashtree SYSTEM "http://open-content.net/spec/thex/thex.dtd">
   <hashtree>
     <file size='1146045066' segmentsize='1024'/>
     <digest algorithm='http://www.w3.org/2000/09/xmldsig#sha1'
             outputsize='20'/>
     <serializedtree depth='22'
             type='http://open-content.net/spec/thex/breadthfirst'
             uri='uuid:09233523-345b-4351-b623-5dsf35sgs5d6'/>
   </hashtree>

3.2.1 File Size

The file size attribute refers to the size, in bytes, of the file that the hash tree was generated from.
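As an illustration, the metadata in the example above can be read with any XML parser. This Python sketch (the DOCTYPE is omitted for brevity; the element and attribute names are taken directly from the example) pulls out the fields an implementation needs in order to interpret the binary tree payload:

```python
import xml.etree.ElementTree as ET

# The XML Tree Description from the example above (DOCTYPE omitted)
THEX_METADATA = b"""<?xml version="1.0" encoding="UTF-8"?>
<hashtree>
  <file size='1146045066' segmentsize='1024'/>
  <digest algorithm='http://www.w3.org/2000/09/xmldsig#sha1' outputsize='20'/>
  <serializedtree depth='22'
      type='http://open-content.net/spec/thex/breadthfirst'
      uri='uuid:09233523-345b-4351-b623-5dsf35sgs5d6'/>
</hashtree>"""

doc = ET.fromstring(THEX_METADATA)
file_size = int(doc.find('file').get('size'))
segment_size = int(doc.find('file').get('segmentsize'))
digest_algorithm = doc.find('digest').get('algorithm')
digest_output_size = int(doc.find('digest').get('outputsize'))
tree_depth = int(doc.find('serializedtree').get('depth'))

# Leaf count implied by the (trusted) file size and segment size
leaf_count = -(-file_size // segment_size)  # ceiling division
```

Only the root hash and segment size need to come from a trusted source; everything else here can be cross-checked against the serialized tree itself.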
3.2.2 File Segment Size

The file segment size identifies the size, in bytes, of the file segments that were used to create the hash tree. As noted in Section 2.2, it is recommended that applications use a small, common segment size such as 1,024 bytes in order to retain maximum flexibility and interoperability.

3.2.3 Digest Algorithm

This attribute provides the identifier URI for the digest algorithm. A URI is used here as an identifier instead of a regular string to avoid the overhead of IANA-style registration. By using URIs, new types can be created without having to consult any other entity. The URIs are only to be used for type identification purposes, but it is RECOMMENDED that the URIs point to information about the given digest function. This convention is inspired by RFC 3275, the XML Signature Specification. For instance, the SHA-1 algorithm is identified by "http://www.w3.org/2000/09/xmldsig#sha1".

3.2.4 Digest Output Size

This attribute specifies the size of the output of the hash function, in bytes.

3.2.5 Serialized Tree Depth

This attribute specifies the number of levels of the tree that have been serialized. This value allows control over the amount of storage space required by the serialized tree. In general, each row added to the tree will double the storage requirements while also doubling the verification resolution.

3.2.6 Serialized Tree Type

This attribute provides the identifier URI for the serialization type. Just as with the Digest Algorithm, new serialization types can be added and described without going through a formal IANA-style process. One serialization type, "Breadth-First Serialization", is defined later in this document.

3.2.7 Serialized Tree URI

This attribute provides the URI of the binary serialized tree payload.
If used within a DIME payload, it is recommended that this URI be location independent, such as the "uuid:" URIs used in the SOAP in DIME specification, or SHA-1 URNs.

3.3 Breadth-First Serialization

Normal breadth-first serialization is the recommended manner in which to serialize the hash tree. This format includes the root hash first, and then each "row" of hashes is serialized until the tree has been serialized to the lowest level as specified by the "Serialized Tree Depth" field.

For example, consider a file made up of 5 segments: S1, S2, S3, S4, and S5.

                    ROOT=H(H+E)
                     /      \
                    /        \
             H=H(F+G)         E
              /    \           \
             /      \           \
        F=H(A+B)  G=H(C+D)       E
         /   \      /   \         \
        /     \    /     \         \
    A=H(S1) B=H(S2) C=H(S3) D=H(S4) E=H(S5)

The hashes would be serialized in the following order: ROOT, H, E, F, G, E, A, B, C, D, E. Notice that E is serialized as part of each row, due to its promotion in the absence of available siblings in the lower rows. If we choose to serialize the entire tree, the serialized tree depth would be 4, and for a 20-byte digest output, the entire tree payload would occupy 11*20 = 220 bytes.

3.3.1 Serialization Type URI

The serialization type URI for a Merkle Hash Tree serialized in normal breadth-first form is "http://open-content.net/spec/thex/breadthfirst".

Authors' Addresses

   Justin Chapweske
   Onion Networks, Inc.
   1668 Rosehill Circle
   Lauderdale, MN 55108
   US
   EMail: justin@onionnetworks.com
   URI: http://onionnetworks.com/

   Gordon Mohr
   Bitzi, Inc.
   EMail: gojomo@bitzi.com
   URI: http://bitzi.com/

From levine at vinecorp.com Tue Jun 11 13:06:03 2002
From: levine at vinecorp.com (James D. Levine)
Date: Sat Dec 9 22:11:45 2006
Subject: [p2p-hackers] TONIGHT: (SF Bay Area) South Bay PeerPunks meeting
In-Reply-To: <20020610102908.93BCF3FC9B@capsicum.zgp.org>
Message-ID:

If you're reading this, you may be too late! Last call - PeerPunks TONIGHT, 7pm at DSRC in Mountain View. See you there!
-------

Since the Silicon Valley contingent of p2p hackerdom can't always make it up to SF to meet with the rest, I'm organizing the South Bay PeerPunks meeting the evening of Tuesday, June 11th in Mountain View. All p2p enthusiasts, hackers, well-wishers, etc. are invited. Come and participate! I will be easy to find there, just look for the guy in the CodeCon t-shirt.

Where: Dana Street Roasting Company
       744 W Dana St, Mountain View, CA 94041
       Phone: (650) 390-9638 (map URL below)

When: 7:00 sharp

Map: http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=744+W+Dana+St&city=Mountain+View&state=CA&csz=Mountain+View,+CA+94041-1304&slt=37.392353&sln=-122.078945&name=&zip=94041-1304&country=us&&BFKey=&BFCat=&BFClient=&mag=9&desc=&cs=9&newmag=8&poititle=&poi=&ds=n

Why PeerPunks and not p2p-hackers? Well, why not? I also want to encourage non-hackers to show up in addition to the usual suspects.

See you there!

From bram at gawth.com Wed Jun 12 13:41:02 2002
From: bram at gawth.com (Bram Cohen)
Date: Sat Dec 9 22:11:45 2006
Subject: [p2p-hackers] Reminder: mozilla party tonight, p2p-hackers meeting sunday
Message-ID:

For those of you who have forgotten, or somehow managed to not find out in the first place, there's a no-cover-charge release party for mozilla happening tonight at DNA in san francisco - http://mozilla.org/party/2002/flyer.html

Also, remember that there's a p2p-hackers meeting this sunday, at 3pm, at the metreon in san francisco.
And finally, not terribly p2p-related, but there's a book reading by Rudy Rucker happening on saturday at 2pm at this address -

Borderlands Books
866 Valencia Street (between 19th and 20th Streets)
San Francisco
415 824-8203

-Bram Cohen

"Markets can remain irrational longer than you can remain solvent" -- John Maynard Keynes

From sam at neurogrid.com Sun Jun 16 19:55:02 2002
From: sam at neurogrid.com (Sam Joseph)
Date: Sat Dec 9 22:11:45 2006
Subject: [p2p-hackers] Distributed Meta-Data Strategies
Message-ID: <3D0D5093.1020906@neurogrid.com>

Hi All,

So I've scraped together a document on distributed meta-data strategies. It's still rough and not ready, and I haven't been able to add all the systems I want, but release early and often, right?

http://www.neurogrid.net/Decentralized_Meta-Data_Strategies-neat.html

I'm hopeful that I can get some feedback on this before subsequently releasing it to a wider audience, e.g. the decentralization list, haha :-) It also needs some link and reference fixing - but then you guessed that already. Well, I'm still trying to work out whether to categorise "semantic-routing" and "hash-based-routing" differently or not ...

Anyways ... All feedback gratefully received.

CHEERS> SAM

From gojomo at bitzi.com Tue Jun 18 05:59:02 2002
From: gojomo at bitzi.com (Gordon Mohr)
Date: Sat Dec 9 22:11:45 2006
Subject: [p2p-hackers] Comments sought: MAGNET-URI scheme
Message-ID: <013001c21649$acfadaa0$640a000a@golden>

I've cooked up a proposal for a generic, "open standard" way for websites to display links that hand off operations to client-side applications -- like file-management tools and P2P content-delivery networks or file-sharing applications. It involves a new URI scheme called "magnet:". (URIs, or Uniform/Universal Resource Identifiers, are the more general class of identifier-like things of which Uniform Resource Names (URNs) are just one kind.)
In a way, "magnet:" URIs could be thought of as project- and vendor-neutral versions of the P2P-system-specific URIs that have been proliferating. (Examples include "ed2k", "freenet", "mnet", "sig2dat", and probably others.) However, "magnet" URIs are more general and fuzzy in meaning -- rather than giving an exact action to a single program that monopolizes the "magnet:" type, a "magnet:" URI will bring up a list of locally available options. These URIs could be very useful for activating Gnutella servents, but would not be Gnutella-only.

For more info, please check out the details and examples at:

http://magnet-uri.sourceforge.net

I'll be tracking comments here and in the "magnet-uri" YahooGroup. Ideas for how this could be made more robust, general, & useful especially appreciated!

- Gojomo

____________________
Gordon Mohr Bitzi CTO describe and discover files of every kind. _ http://bitzi.com _ Bitzi knows bits -- because you teach it!

From burton at openprivacy.org Sun Jun 23 15:50:02 2002
From: burton at openprivacy.org (Kevin A. Burton)
Date: Sat Dec 9 22:11:45 2006
Subject: [p2p-hackers] Comments sought: MAGNET-URI scheme
In-Reply-To: <013001c21649$acfadaa0$640a000a@golden>
References: <013001c21649$acfadaa0$640a000a@golden>
Message-ID: <87bsa162b5.fsf@openprivacy.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

"Gordon Mohr" writes:

> I've cooked up a proposal for a generic, "open standard" way for websites to
> display links that handoff operations to client-side applications -- like
> file-management tools and P2P content-delivery networks or file-sharing
> applications.

Gordon... just some thoughts I had while reviewing this on the plane:

- - should be done with protozilla

- - /magnet10 URL could also be implemented under Tomcat fairly easily with the ROOT webapp and URL handling.

- how would one implement things like options.js with params.

- - Have you thought about having magnet10 URL implemented as a SOAP/XML-RPC service?
This seems like it would be a LOT easier to implement and I don't think it needs to be a REST service.

- - This would be GREAT for M3U and RSS files.

- --
Kevin A. Burton ( burton@apache.org, burton@openprivacy.org, burton@peerfear.org )
Location - San Francisco, CA, Cell - 415.595.9965
Jabber - burtonator@jabber.org, Web - http://www.peerfear.org/
IRC - openprojects.net #infoanarchy | #p2p-hackers | #reptile

Stuff to read: What's Wrong with Copy Protection, by John Gilmore

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Get my public key at: http://relativity.yi.org/pgpkey.txt

iD8DBQE9Fk/OAwM6xb2dfE0RAqS2AJwIYU8vTFGBSFAl2Ackk6hXNT0PWQCgnkio
VgniaRr7rj4Cfa5TVBscX0Q=
=0Fp0
-----END PGP SIGNATURE-----

From gojomo at usa.net Mon Jun 24 01:38:01 2002
From: gojomo at usa.net (Gordon Mohr)
Date: Sat Dec 9 22:11:45 2006
Subject: [p2p-hackers] Comments sought: MAGNET-URI scheme
References: <013001c21649$acfadaa0$640a000a@golden> <87bsa162b5.fsf@openprivacy.org>
Message-ID: <004a01c21b5a$56950ad0$640a000a@golden>

Kevin Burton writes:
> Gordon... just some thoughts I had while reviewing this on the plane:
>
> - - should be done with protozilla

Yes, for Mozilla, that could be the way to go. However, Mozilla on Windows picks up registry-entered handlers as well -- so once you take the step of requiring a local/platform/browser-specific install, that's probably the right way to go on Windows -- it catches both IE and Mozilla... and maybe Opera etc too.

> - how would one implement things like options.js with params.

What do you mean?

> - - Have you thought about having magnet10 URL implemented as a SOAP/XML-RPC
> service? This seems like it would be a LOT easier to implement and I don't
> think it needs to be a REST service.

The idea has come up from several sources that there should be a cleaner way to probe for services. The problem is it's tough to do anything other than "