The following memo presents the Tree Hash Exchange (THEX) format, for exchanging Merkle Hash Trees built up from the subrange digests of discrete digital files. Such tree hash data structures assist in file integrity verification, allowing arbitrary subranges of bytes to be verified before the entire file has been received.
To get the latest complete version of the THEX specification, visit:
http://open-content.net/specs/draft-jchapweske-thex-01.html
THEX can be used to exchange Merkle Hash Trees computed with various message digest algorithms and various digest sizes (including "CRC32", "MD5", "SHA1" or "Tiger", with all their variants).
Although THEX trees built with CRC32 are very fast to compute and can detect most transmission errors, they offer no security against deliberate tampering with file contents. In addition, a CRC32 value tends to be too small for the large files where THEX is typically needed. Stronger digest algorithms with longer outputs are therefore highly preferable.
Most THEX applications will therefore use the 160-bit "SHA1" message digest algorithm, or the faster and stronger 192-bit "Tiger" message digest algorithm, as no practical way to invert them is currently known.
It is possible to use reduced versions of these two message digests to minimize the storage space used by serialized THEX tree data, but a message digest should produce at least 128 bits, each bit contributing approximately equally to its strength.
The standard THEX tree data exchange format uses XML in its encapsulation layer according to the Direct Internet Message Encapsulation (DIME) specification used for XML Web Services, initially developed by Henrik Frystyk Nielsen, and developed as an IETF draft by Microsoft/IBM for SOAP. Visit:
http://msdn.microsoft.com/webservices/understanding/gxa/default.asp?pull=/library/en-us/dnglobspec/html/dimeindex.asp
The THEX serialized tree data transported in a DIME encapsulation should be accessible in a location-independent way, for example through the secure "urn:sha1:" URN or the very secure "urn:bitprint:" URN (both require precomputing the digests of the fully serialized tree), or through a simpler "uuid:" URI defined in SOAP and referenced in DIME (this UUID can be generated independently of the serialized tree data content, and may reduce the time needed to generate the DIME encapsulation as it does not require an additional hash, but this makes THEX serialized trees less secure).
For some distributed applications (peer-to-peer file exchange protocols or distributed file systems), the THEX encapsulation in XML with DIME may be unnecessary if all user agents agree on the message digest algorithm to use and on its tree data serialization format. In that case, only the tree data URN may be needed; it can be transported, for example, in connection handshake headers (if using HTTP-like protocols that allow transporting such extensions before the actual file content data). Note however that DIME allows further extension to stronger or faster alternate algorithms if they become necessary.
A typical application of Merkle Hash Trees is "TigerTree", another file digest that can complement "SHA1" file digests.
A "TigerTree" digest differs from a full "Tiger" digest because it is NOT computed by digesting the full file in one pass, but by computing "Tiger" digests on individual 1KB blocks and combining them in a Merkle Hash Tree. The "TigerTree" digest of the file is the root hash of the Merkle Hash Tree computed with the standard "Tiger" digest.
Bitzi's "bitprint:" URN scheme uses the "TigerTree" file digest, NOT the "Tiger" file digest. The two will most often differ for any file larger than 1024-9=1015 bytes, and will always be identical ONLY for small files of up to 1015 bytes.
"bitprint:" URNs can be computed without generating and serializing the full Merkle Hash Tree. But for applications in Gnutella with swarmed downloads, it's best to keep a storage for intermediate hash values, that complies to the THEX binary serialization format.
Note: Bitzi's "bitprint:" URNs use the following format:
"urn:bitprint:SHA1.TigerTree"
where:
- "urn:" is NOT case significant (according to URI specification) and designates the protocol format for Universal Resource Names that MUST be location-independant (lowercase is strongly recommanded as the canonical format);
- "bitprint:" is NOT case significant (according to URN specification) and designates the URN encoding scheme that SHOULD be registered (lowercase is strongly recommanded as the canonical format, as this URN scheme is not formally registered);
- the rest of the string normally depends of the encoding and may differentiate lowercase and uppercase letters, so a "canonical" representation is needed to conform to the URN standard:
- "SHA1" is the Canonical Base32 encoding of the 160-bit (20 bytes) "SHA1" digest of the full file, as a 32-characters ASCII string;
- "TigerTree" is the Canonical Base32 encoding of the 192-bit (24 bytes) "TigerTree" digest of the full file, as a 39-characters ASCII string;
- (A Canonical Base32 string uses only uppercase ASCII letters 'A' to 'Z' to encode base 32 digit values 0 to 25, and ASCII digits '2' to '7' to encode base 32 digit values 26 to 31. So servents MAY accept lowercase letters as equivalent, but they SHOULD only generate uppercase letters.)
- The total length of a "urn:bitprint:" URI is exactly 3+1+8+1+32+1+39=85 characters
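The format above can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the "=" padding that Python's `base64.b32encode` appends is stripped (which is what the stated 39-character length for a 24-byte digest implies), and the TigerTree digest in the usage example is a placeholder of 24 zero bytes, not a real digest.

```python
import base64
import hashlib


def base32_canonical(raw):
    """Uppercase Base32 ('A'-'Z', '2'-'7') with the '=' padding removed."""
    return base64.b32encode(raw).decode("ascii").rstrip("=")


def bitprint_urn(sha1_digest, tigertree_digest):
    """Assemble "urn:bitprint:SHA1.TigerTree" from the two raw digests."""
    if len(sha1_digest) != 20:
        raise ValueError("SHA1 digest must be 20 bytes")
    if len(tigertree_digest) != 24:
        raise ValueError("TigerTree digest must be 24 bytes")
    return ("urn:bitprint:" + base32_canonical(sha1_digest)
            + "." + base32_canonical(tigertree_digest))


# Usage: real SHA-1 of some content, placeholder (all-zero) TigerTree digest.
urn = bitprint_urn(hashlib.sha1(b"example content").digest(), bytes(24))
print(urn, len(urn))
```

A 20-byte digest is 160 bits, which is exactly 32 Base32 digits of 5 bits each; a 24-byte digest is 192 bits, which needs 39 digits, the last one only partly filled.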
Note: The shorter (but less secure) "sha1:" URN for any file content can simply be inferred from an existing "bitprint:" URN for the same file content by replacing the URN encoding scheme and stripping the "." and the TigerTree part. Transporting both the "sha1:" URN and the "bitprint:" URN is therefore not needed, as the latter will suffice in most cases.
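The inference described in this note amounts to simple string handling, sketched below. The function name is hypothetical, and the choices to accept any letter case on input and to emit uppercase Base32 on output are assumptions based on the canonical-format recommendations above.

```python
def sha1_urn_from_bitprint(urn):
    """Derive the "urn:sha1:" URN from a "urn:bitprint:" URN."""
    parts = urn.split(":", 2)
    if len(parts) != 3:
        raise ValueError("not a URN")
    scheme, encoding, rest = parts
    # "urn" and "bitprint" are not case significant.
    if scheme.lower() != "urn" or encoding.lower() != "bitprint":
        raise ValueError("not a bitprint URN")
    sha1_part, dot, tigertree_part = rest.partition(".")
    if not dot or len(sha1_part) != 32 or len(tigertree_part) != 39:
        raise ValueError("malformed bitprint URN")
    # Replace the encoding scheme and drop the "." and the TigerTree part.
    return "urn:sha1:" + sha1_part.upper()


# Usage with placeholder digests (not real hash values).
example = "urn:bitprint:" + "A" * 32 + "." + "B" * 39
print(sha1_urn_from_bitprint(example))
```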
To get the latest complete specification of the "bitprint:" URN scheme, visit:
http://bitzi.com/developer/bitprint
To get reference documentation about the standard "Tiger" message digest, and a sample C implementation, visit:
http://www.cs.technion.ac.il/~biham/Reports/Tiger/
To get reference documentation about the standard "Base32" encoding, visit:
http://www.ietf.org/internet-drafts/draft-josefsson-base-encoding-03.txt
To get a sample Public Domain implementation in Java of the "Tiger" digest and of simple Base16, Base32 and Base64 encoders/decoders, visit:
http://groups.yahoo.com/group/the_gdf/files/Proposals/HUGE/com.bitzi.util/
THEX works best with peer-to-peer file exchanges and distributed filesystems.
The "swarmed downloads" feature on Gnutella will best benefit from THEX as it allows verifying the integrity of files downloaded by fragments from multiple sources, as discovered with the "HUGE" protocol extension proposal for Gnutella that is largely approved by most Gnutella servent vendors
For a complete specification of the HUGE protocol extension (Hash/URN Generic Extension) for Gnutella servents, visit:
http://groups.yahoo.com/group/the_gdf/files/Proposals/HUGE/
For a complete specification of the PFSP protocol extension (Partial File Sharing Protocol) for Gnutella servents, visit:
http://groups.yahoo.com/group/the_gdf/files/Proposals/PFSP/
For a complete specification of the standard Gnutella protocol, visit:
http://groups.yahoo.com/group/the_gdf/files/Development/
For technical discussions about the evolution of the Gnutella protocol (developers only), visit:
http://groups.yahoo.com/group/the_gdf/ (may require user registration on the Yahoo! service).
- 1. Introduction
- 2. Merkle Hash Trees
- 2.1 Unbalanced Trees
- 2.2 Choice Of Segment Size
- 3. Serialization Format
- 3.1 DIME Encapsulation
- 3.2 XML Tree Description
- 3.2.1 File Size
- 3.2.2 File Segment Size
- 3.2.3 Digest Algorithm
- 3.2.4 Digest Output Size
- 3.2.5 Serialized Tree Depth
- 3.2.6 Serialized Tree Type
- 3.2.7 Serialized Tree URI
- 3.3 Breadth-First Serialization
- 3.3.1 Serialization Type URI
- § Authors' Addresses