From bram at gawth.com  Fri Jan 4 18:02:01 2002
From: bram at gawth.com (Bram Cohen)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] BitTorrent trial run going on right this minute!
Message-ID: 

Everyone who's online now and would like to participate in a test of
BitTorrent, hop on irc at irc.openprojects.net and go to #bittorrent

-Bram Cohen

"Markets can remain irrational longer than you can remain solvent"
  -- John Maynard Keynes

From bram at gawth.com  Sat Jan 5 17:38:01 2002
From: bram at gawth.com (Bram Cohen)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] New release of BitTorrent out
Message-ID: 

A new release of BitTorrent is out, along with a new page -

http://bitconjurer.org/BitTorrent/

New in this release -

  New file in demo, in case you already got the old one :-)
  A complete rewrite of all the upload/download logic
  Tit-for-tat upload preferences
  TCP buffering awareness
  Better install process under UNIX

And tons of other small improvements. We did a test run of the new
release yesterday, and it handled six simultaneous downloaders without
breaking a sweat; my guess is the current version handles twenty with
no problem.

This release marks the confluence of two events -

a) BitTorrent getting mature enough to be used commercially
b) Bram has spent enough time unemployed and needs to find a job

If you're one of the many companies who could benefit from reduced
bandwidth costs, now would be a good time to hire me to ensure
BitTorrent reaches full maturity.

-Bram Cohen

"Markets can remain irrational longer than you can remain solvent"
  -- John Maynard Keynes

From bram at gawth.com  Wed Jan 16 05:01:01 2002
From: bram at gawth.com (Bram Cohen)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] CodeCon presentations announced and registration open
Message-ID: 

CodeCon is the premier event in 2002 for the P2P, cypherpunk, and
network/security application developer community.
It is a workshop for developers of real-world applications that support
individual liberties.

CodeCon registration is $50; a $10 discount is available if you
register prior to February 1st. It will be held February 15-17,
noon-5pm, at DNA Lounge in San Francisco.

http://codecon.org/

Presentations will include -

* Peek-A-Booty - a distributed anti-censorship application
* Invisible IRC Project - secure, anonymous client/server networks
* Idel - lightweight mobile code for p2p cpu sharing
* Reptile - a distributed but uniform content exchange mechanism
* MNet - a universal shared filestore
* Alpine - a social discovery mechanism which can handle high churn
  rates, malicious peers, and limited bandwidth
* Eikon - an image search engine
* CryptoMail - encrypted email for all
* libfreenet - a case study in horrors incomprehensible to the mind of
  man, and other secure protocol design mistakes
* BitTorrent - hosting large, popular files cheaply

From brandon at blanu.net  Fri Jan 18 05:28:01 2002
From: brandon at blanu.net (brandon@blanu.net)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Re: [decentralization] Re: Costs of Decentralization
In-Reply-To: ; from osokin@osokin.com on Fri, Jan 18, 2002 at 05:15:28AM -0000
References: <20020117132814.A11278@blanu.net>
Message-ID: <20020118041031.A13946@blanu.net>

> Exactly. Since log N tends to grow with the growth of N, it
> means that sooner or later (depending on the average bandwidth,
> average stream of requests and so on) the Chord network will be
> unable to function.

Ah, I see what you mean now. I guess this discussion of infinite
scalability is rather irrelevant anyway since there are 255^4 IPs
available.

So I guess the discussion here resolves to a difference in strategy
between Gnutella and Chord. Both attempt to route among a large number
of nodes with reasonable resource usage per node. Gnutella solves this
by routing to an arbitrary subset of the nodes and returning an
arbitrary subset of results.
Chord solves this by arranging the network so that every node is
reachable from every other node with a reasonable number of hops and a
reasonable number of connections per node, given the current maximum
number of nodes of 255^4.

It is interesting (but not of practical relevance) to see what happens
to these networks when you give them a very (very) large number of
nodes. Gnutella will give you a very small (and random) percentage of
the total results. Chord will increase the number of connections per
node until it is no longer reasonable.

Both networks have a maximum size to which they can scale before their
performance starts to degrade. For Gnutella this is the size at which
all of the nodes in the network fit inside the horizon, but no more
nodes can be added inside the horizon. This is < m^p, where m = max
connections and p = max TTL. Just how much less N is than m^p depends
on the topology of the network. For instance, a network with a single
supernode with m-1 connections to slave nodes and p=1 would contain
exactly m^p nodes. Every other topology contains fewer than that
because of duplicate connections. For every node added after that, the
ratio of search results to the total degrades. I would be interested in
finding out from a Gnutella person what common values for m and p are,
as from there you could compute the approximate size of the searchable
subset in the public Gnutella network. p is published somewhere, but m
would depend on the servents.

In Chord, the maximum size is 2^x, where x is the number of nodes that
you think any node can reasonably keep track of. All this really means
is handling update traffic as the network membership churns. The amount
of other traffic to be handled (searches, file transfers, and the like)
is unrelated to the number of edges per node. So if your network has a
churn rate of c nodes per second, then the number of updates per second
for each node is cx/N.
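[Editorial sketch: the cx/N estimate above can be poked at numerically. The churn and capacity figures below are invented for illustration; only the formula itself comes from the discussion.]

```python
def per_node_updates(churn, x):
    """Updates/sec each node sees, per the cx/N estimate above,
    in a full Chord network of N = 2^x nodes."""
    return churn * x / 2 ** x

churn = 1000.0     # assumed: nodes joining/leaving per second
capacity = 50.0    # assumed: updates/sec a single node can absorb

# Find the smallest table size x whose full network stays within budget.
for x in range(1, 33):
    if per_node_updates(churn, x) <= capacity:
        print("sustainable from x=%d on (N=2^%d nodes)" % (x, x))
        break
```

Note that with a fixed churn rate, the per-node load cx/N actually shrinks as the network grows, which is consistent with the remark below that a stable network can have more nodes with no adverse consequences.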
So if the maximum number of updates per second that a node can handle
is z, you get z = cx/N and N = 2^x. I am far too sleepy to attempt to
solve this, and in fact too sleepy to even be able to tell if it's
correct, but the interesting part is just that the maximum size of the
network is based on the amount of update traffic you can handle and the
amount of update traffic generated by the network. So a stable network
can have more nodes with no adverse consequences.

This is all of only theoretical interest since Chord implementations
fix x at 160 anyway. It might be useful to determine from the fixed x
what churn rates are acceptable, but we'd have to make up z first.

From clay at clifford.webservepro.com  Fri Jan 18 05:28:03 2002
From: clay at clifford.webservepro.com (Clay Shirky)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Re: [decentralization] Re: Costs of Decentralization
In-Reply-To: <20020118041031.A13946@blanu.net> from "brandon@blanu.net" at Jan 18, 2002 04:10:31 AM
Message-ID: <200201181221.g0ICLPlQ020168@clifford.webservepro.com>

> > means that sooner or later (depending on the average bandwidth,
> > average stream of requests and so on) the Chord network will be
> > unable to function.

The Chord paper seems to indicate that they get around this by setting
what they call 'r', the 'successors list', i.e. the number of nodes any
other node knows about. By making r a function of the number of nodes
N, so that r is 2(log2(N)), it increases the length-covering of the
first hop. (Every hop in Chord is provably closer to its target.)

So as I read this, it avoids a bandwidth flood, at the expense of
making the number of connections per node the break point as the
network grows large.

> Ah, I see what you mean now. I guess this discussion of infinite
> scalability is rather irrelevant anyway since there are 255^4 IPs
> available.
No, it's quite relevant, since namespaces become crunchy long before
they are even moderately populated -- viz the dynamic IP address hack.
Infinite scalability is always relevant, not because you'll ever get
there, but because you won't be remembered for being the guy who said
it would be a good idea to store the year in two bytes, or that 640K
ought to be enough for anybody.

> In Chord, the maximum size is 2^x,

I think it's 2(log2(X)). (The relevant bits are on page 22 of Dabek's
thesis paper.)

-clay

From zooko at zooko.com  Fri Jan 18 05:54:01 2002
From: zooko at zooko.com (Zooko)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Re: [decentralization] Re: Costs of Decentralization
In-Reply-To: Message from Clay Shirky of "Fri, 18 Jan 2002 07:21:25 EST." <200201181221.g0ICLPlQ020168@clifford.webservepro.com>
References: <200201181221.g0ICLPlQ020168@clifford.webservepro.com>
Message-ID: 

[note that this discussion is being cross-posted to decentralization
and p2p-hackers]

Clay Shirky wrote the lines prepended with "> ".

> So as I read this, it avoids a bandwidth flood, at the expense of
> making the number of connections per node the break point as the
> network grows large.

Each node in Chord must handle approximately log2(N) other nodes for a
network with N total nodes. For example, if there were 2^70 total nodes
(or approximately one Chord node per hydrogen atom in the entire
universe), then each node would have to handle 70 other nodes.

So running out of space in the Chord tables is never going to be a
problem unless the devices running the nodes are surprisingly
constrained in terms of memory. I estimate memory requirements of at
most 20 KB for all of your "tracking other nodes" needs even for one of
those universe-spanning Chord networks with trillions upon trillions
upon trillions of nodes. No problem for desktops, handhelds, or most
embeddeds.
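[Editorial sketch: the estimate above is easy to check. The 256 bytes per routing-table entry is an assumed figure, not from the discussion.]

```python
import math

def chord_state_bytes(n_nodes, bytes_per_entry=256):
    """Rough routing-table footprint for one Chord node in an N-node
    network: about log2(N) entries, each holding an ID, an address,
    and some bookkeeping (the 256-byte entry size is assumed)."""
    entries = math.ceil(math.log2(n_nodes))
    return entries, entries * bytes_per_entry

entries, size = chord_state_bytes(2 ** 70)
print(entries, size)   # 70 entries at ~17.5 KB, consistent with the
                       # "at most 20 KB" back-of-envelope figure above
```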
Possibly an issue for sub-embeddeds, nano-computers, or a quantum
computer running inside one of the aforementioned hydrogen atoms...

> No, it's quite relevant, since namespaces become crunchy long before
> they are even moderately populated -- viz the dynamic IP address
> hack.

I think this is a consequence of IP (and DNS) being (ultimately)
centrally managed and allocated. Chord should have no such problem.

Brandon's point against Chord's scalability is a telling one, however.
The scalability problem isn't that your Chord node runs out of space to
remember its log2(N) peers; the problem is that too many of those peers
are unavailable, or churning in and out of the network, or returning
buggy results, or trying to DoS your node, etc...

Regards,

Zooko

--- zooko.com Security and Distributed Systems Engineering ---

From gojomo at bitzi.com  Fri Jan 18 07:38:01 2002
From: gojomo at bitzi.com (Gordon Mohr)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Re: [decentralization] Re: Costs of Decentralization
References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net>
Message-ID: <001001c1a034$bbccf6c0$1fc77940@golden>

Brandon Wiley writes:
> So I guess the discussion here resolves to a difference in strategy
> between Gnutella and Chord. Both attempt to route among a large
> number of nodes with reasonable resource usage per node. Gnutella
> solves this by routing to an arbitrary subset of the nodes and
> returning an arbitrary subset of results. Chord solves this by
> arranging the network so that every node is reachable from every
> other node with a reasonable number of hops and a reasonable number
> of connections per node, given the current maximum number of nodes
> of 255^4.

Rather than contrasting the two, why not think about a hybrid?

Imagine that some Gnutella servents started to...
(1) Assign themselves a "home" position on the hash unit circle

(2) Prefer connections with other servents in accordance with the
    "finger" positions dictated by Chord

Just those steps could bias the Gnutella network into a less arbitrary,
more characterizable, and perhaps more efficient topology. (For
example, broadcast cycles could be easily avoided.)

If Gnutella servents further began to...

(3) Upon connection stabilization, announce their available files to
    the servents around the unit circle "responsible" for indexing
    those files

(4) Narrowcast Queries Chord-style for files that are likely to have
    been so indexed

...then Gnutella would gain scaling behavior, for those Queries, a lot
like Chord.

- Gordon
____________________
Gordon Mohr, gojomo@ bitzi.com, Bitzi CTO _ http://bitzi.com _

From blanu at bozonics.com  Fri Jan 18 12:00:01 2002
From: blanu at bozonics.com (Brandon Wiley)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Re: [decentralization] Re: Costs of Decentralization
In-Reply-To: 
Message-ID: 

> Brandon's point against Chord's scalability is a telling one,
> however. The scalability problem isn't that your Chord node runs out
> of space to remember its log2(N) peers; the problem is that too many
> of those peers are unavailable, or churning in and out of the
> network, or returning buggy results, or trying to DoS your node,
> etc...

Upon further thought, it seems like there are two related issues in
Chord here. First, the network takes time to adjust to churning (how
long depends on your update strategy), so it becomes unreliable if the
churn rate is too high. Second, if the churn rate is too high then the
amount of update traffic per node can become too high and nodes will
start either lagging behind or dropping updates, causing the network to
degrade even faster. What's worse, opting for a more aggressive update
strategy to avoid the first problem will push the second problem closer
to realization, and vice versa.
The problem of a high churn rate making the network suck exists in
every network I can think of. It's just significant in Chord because in
other networks the limitation is hit before you get to the churn rate
problem.

From jbone at jump.net  Sat Jan 19 12:37:02 2002
From: jbone at jump.net (Jeff Bone)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization
References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden>
Message-ID: <3C49D6C0.D020AB22@jump.net>

Gordon Mohr wrote:
> Brandon Wiley writes:
> > So I guess the discussion here resolves to a difference in strategy
> > between Gnutella and Chord.

Actually, the discussion misses a much larger, more fundamental point:
comparing these things is an apples-to-oranges endeavor. It misses the
point that semantically meaningful names are fundamentally different
from identifiers, and that searching is fundamentally different from
locating. Different strategies are *fundamentally* required; many
techniques for efficient location of objects given identifiers are
simply not applicable to the task of searching given semantic names.

Semantically meaningful names --- aka queries --- identify a set (more
properly a multiset) of objects which match the name in question. All
kinds of things can be semantic names: pathnames in a file namespace,
query terms like "brittany spears oops did it", etc. The key point here
is that names resolve to a set of objects (possibly of order 1) rather
than a single object, and some additional layer or mechanism has to be
used to actually locate and interact with an object of interest from
the response set.

Identifiers are specific and probabilistically unique tokens that map
to exactly one object of interest. (There may be multiple replicas of
this object; any is as good as any other, and identifiers themselves
have no embedded location semantics.)
There are various kinds of identifiers: hashes, device/inode pairs,
etc. The objects identified have a persistent identity over time and
are in some sense immutable. Hashes of bitstrings, for instance,
uniquely and persistently identify the bitstring in question. In a
filesharing context, that means that different bitrate recordings of
the same song are identified as different objects. (Pedantic point: all
identifiers are semantic names, but very few semantic names are
identifiers.)

Identifiers can be arranged across different communication topologies
in ways that make location of the object associated with a given
identifier very cheap. Chord and Oceanstore are interesting examples of
this, and employ similar strategies. In both, there is a cheap,
probabilistic location / lookup service, and a slower, canonically
"correct" lookup mechanism as a fallback --- and the location strategy
is "complete" in an important sense described below. Both rely on the
mathematical properties of the network topologies they employ in order
to constrain the amount of lookup that must be done to find an object
for a given identifier. In the case of Oceanstore, the fundamental
abstractions are hashes of immutable objects, a Plaxton mesh topology,
and attenuated Bloom filters for lookup; in Chord, it's DHashes/block
IDs, a mechanism for distributed consistent hashing, and successor /
finger lists.

There is a profound consequence of this style of location service that
*cannot* be achieved for general distributed search: in Oceanstore and
(not clear that this is true, but it's been claimed) Chord, a suitably
constrained walk of the network (max O(log n) messages / hops) can
verifiably guarantee that an object for a given identifier is or is not
present anywhere in the network. In Oceanstore, if you hit the root of
the per-object spanning tree for the identifier you are looking for
without finding it, *it does not exist anywhere in the network.*
(Otherwise, you've found the object.)
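[Editorial sketch: this found-or-provably-absent property can be illustrated with a toy consistent-hashing ring. This is illustrative only -- not Oceanstore's Plaxton mesh nor Chord's actual successor/finger protocol -- and the node names and 16-bit ID space are invented.]

```python
import hashlib
from bisect import bisect_left

def toy_id(name, bits=16):
    """Toy identifier: truncated SHA-1, a stand-in for 160-bit IDs."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** bits)

class ToyRing:
    """A miniature 'complete' location service: each key is owned by
    its successor node on the ring, so a single lookup answers
    presence or absence definitively -- no flooding, no partial
    results."""
    def __init__(self, node_names):
        self.nodes = sorted(toy_id(n) for n in node_names)
        self.store = {nid: {} for nid in self.nodes}

    def successor(self, key):
        i = bisect_left(self.nodes, key)
        return self.nodes[i % len(self.nodes)]

    def put(self, data):
        key = toy_id(data)           # content-derived identifier
        self.store[self.successor(key)][key] = data
        return key

    def get(self, key):
        # Either the owning node has it, or the object is provably
        # absent from the whole network.
        return self.store[self.successor(key)].get(key)

ring = ToyRing(["alice", "bob", "carol", "dave"])
key = ring.put("some-immutable-bitstring")
assert ring.get(key) == "some-immutable-bitstring"
assert ring.get((key + 1) % 2 ** 16) is None   # definitive miss
```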
Such a location service is "complete." (Pedantic point: there is a
vanishingly small possibility of a false negative in, e.g., Oceanstore:
if an object matching an identifier is inserted into the network after
a location operation has started but before it finishes, then it might
not be found in that operation. This is an unavoidable consequence of
any such distributed system unless you require several kinds of costly
global consistency.)

*NO* distributed search mechanism --- mechanism for resolving semantic
names into sets of objects that match the name --- can do the same
thing without (a) a full walk of all nodes in the network, or (b)
complete, dynamically-consistent replication of the index to all nodes.
For a comprehensive, "correct" distributed search, the amount of
network traffic is unavoidably proportional to the size of the network
in a way that isn't true for lookup of identifiers. Looked at another
way, the accuracy of the search results (in terms of finding / not
finding the object of interest) is proportional to the amount of
network traffic generated.

This has some strategic implications for building distributed
filesystems and file sharing mechanisms.

First, both mechanisms --- search and location --- are probably
necessary for any useful system, and certainly for any system of ad-hoc
sharing of media files. Gordon's "hybrid" suggestion is right on the
money in at least this respect: each mechanism has a distinct,
"appropriate" architecture that should be employed. Services like Bitzi
map semantic names to identifiers, and such mechanisms are
appropriately "more centralized" (not to say centralized, but limited,
complete replication is a more appropriate strategy than partitioning
of indices) than an underlying location service that resolves
identifiers into content.

Second implication: while incomplete search results may be suitable for
some applications (such as finding a brittany spears song), the same
cannot be said for all applications of location.
For a distributed filesystem supporting arbitrary applications, being
unable to retrieve (via its identifier) a given object that exists
somewhere in the network is both avoidable and undesirable. We
shouldn't accept such limitations. IMO, we shouldn't build distinct
storage substrates for different applications; doing so increases
balkanization of data and limits the growth of value of the Internet as
a whole via Metcalfe's law. A hybrid mechanism that supports arbitrary
applications to the limits of what is technologically feasible isn't
just a good idea, it's a requirement in the long term.

The point of this message isn't to endorse or slam any particular
technology or architectural strategy, but rather to point out the fine
but important differences between two very different mechanisms in
distributed storage systems. It's my hope that consideration of the
differences between these abstractions and their associated features
will stimulate some positive design discussion and thought.

$0.02,

jb

From jbone at jump.net  Sat Jan 19 13:35:02 2002
From: jbone at jump.net (Jeff Bone)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Followup: The Politics of Searching and Locating
References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net>
Message-ID: <3C49E5D5.14869660@jump.net>

Factoring these two things out has potentially interesting legal /
political implications for distributed content sharing networks. Search
indices such as Bitzi (even a potentially
distributed-through-replication Bitzi) are divorced from any underlying
location service --- they merely serve to map names into identifiers.
AFAICT, there is no *reasonable* legal argument for contributory
copyright infringement applied to such things --- they are, after all,
merely card catalogs. (Caveat: CDDB, etc.)
Distributed location services which then map identifiers into content
are still subject to this kind of thing, but there are various
strategies (distributing stored fragments over many nodes, etc.) that
can minimize this risk to the original developer(s) of the code in
question.

Just another random thought,

jb

From gojomo at usa.net  Sun Jan 20 23:44:01 2002
From: gojomo at usa.net (Gordon Mohr)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization
References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net>
Message-ID: <008401c1a24f$649ab0a0$1fc77940@golden>

Jeff Bone writes:
> Gordon Mohr wrote:
> > Brandon Wiley writes:
> > > So I guess the discussion here resolves to a difference in
> > > strategy between Gnutella and Chord.
>
> Actually, the discussion misses a much larger, more fundamental
> point: comparing these things is an apples-to-oranges endeavor. It
> misses the point that semantically meaningful names are fundamentally
> different from identifiers, and that searching is fundamentally
> different from locating. Different strategies are *fundamentally*
> required; many techniques for efficient location of objects given
> identifiers are simply not applicable to the task of searching given
> semantic names.

Whew, good thing you only used the word "fundamental" four times in
five lines. Once more would have set off my bombast filter and I would
have missed the rest of your message. :)

> There is a profound consequence of this style of location service
> that *cannot* be achieved for general distributed search: in
> Oceanstore and (not clear that this is true, but it's been claimed)
> Chord, a suitably constrained walk of the network (max O(log n)
> messages / hops) can verifiably guarantee that an object for a given
> identifier is or is not present anywhere in the network.
> In Oceanstore, if you hit the root of the per-object spanning tree
> for the identifier you are looking for without finding it, *it does
> not exist anywhere in the network.* (Otherwise, you've found the
> object.) Such a location service is "complete." (Pedantic point:
> there is a vanishingly small possibility of a false negative in,
> e.g., Oceanstore: if an object matching an identifier is inserted
> into the network after a location operation has started but before
> it finishes, then it might not be found in that operation. This is
> an unavoidable consequence of any such distributed system unless you
> require several kinds of costly global consistency.)
>
> *NO* distributed search mechanism --- mechanism for resolving
> semantic names into sets of objects that match the name --- can do
> the same thing without (a) a full walk of all nodes in the network,
> or (b) complete, dynamically-consistent replication of the index to
> all nodes. For a comprehensive, "correct" distributed search, the
> amount of network traffic is unavoidably proportional to the size of
> the network in a way that isn't true for lookup of identifiers.

This does not follow; consider the common example of "semantic names"
that are strings of keywords. To insert an object, insert a pointer to
it at every index-node responsible for any one of the keywords. Yes,
this takes extra insert steps (by a constant factor: avg keyword count
per object), but then a query for any string of keywords will give a
definitive answer about the presence or absence of matching object(s)
in O(log n) hops, just as with the "identifier" case.

I suspect that Google is using some mega-advanced refinement of this
sort of strategy to achieve their amazing responsiveness.

- Gordon
____________________
Gordon Mohr, gojomo@ bitzi.com, Bitzi CTO _ http://bitzi.com _

From jbone at jump.net  Mon Jan 21 08:01:01 2002
From: jbone at jump.net (Jeff Bone)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Names vs.
 Identifiers, Search vs. Location was re: Costs of Decentralization
References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden>
Message-ID: <3C4C3A98.529AB052@jump.net>

Gordon Mohr wrote:
> > *NO* distributed search mechanism --- mechanism for resolving
> > semantic names into sets of objects that match the name --- can do
> > the same thing without (a) a full walk of all nodes in the network,
> > or (b) complete, dynamically-consistent replication of the index to
> > all nodes. For a comprehensive, "correct" distributed search, the
> > amount of network traffic is unavoidably proportional to the size
> > of the network in a way that isn't true for lookup of identifiers.
>
> This does not follow; consider the common example of "semantic names"
> that are strings of keywords.

Which thing are you saying doesn't follow? (I did realize after the
fact that I wasn't particularly accurate in the above paragraph, so the
claim looks larger than intended.)

> To insert an object, insert a pointer to it at every index-node
> responsible for any one of the keywords. Yes, this takes extra insert
> steps (by a constant factor: avg keyword count per object), but then
> a query for any string of keywords will give a definitive answer
> about the presence or absence of matching object(s) in O(log n) hops,
> just as with the "identifier" case.

This will work --- but it's still not practical for the most general
case.
In this scheme, the number of messages for a particular query is
proportional to the specificity of the query in total number of
keywords given; each message (a fragment of the query with a single
keyword) might even travel only O(1) hops (assuming the partial index
node for a keyword can be directly computed and the node directly
contacted), but you have the undesirable consequence that you've got to
do more of them the more information you're giving about the object you
desire. Message count rises with the "fineness" of the partitioning
scheme and the specificity of the query.

Further, each partial result specifies a set of objects matching a
single keyword; the actual answer is found by intersecting the partial
result sets. Each of those partial result sets is likely to be *very
large* (size described by O(f*N), where f is the frequency of the
keyword, i.e., the proportion of all documents "containing" the keyword
to total documents) in a general system where you're not just indexing
a small amount of metadata but rather the entire document.

Hence, while it's true that there are other strategies (besides full
walk / total replication of index), it remains the case that for a
comprehensive, "correct" distributed search, the amount of network
traffic unavoidably grows in a way that isn't true for lookup of
identifiers. The factors which impact this are the size of the network,
size of data / metadata, partitioning scheme for distributing the
indices, the "language of discourse" used for keywords, etc. You can
make tradeoffs and optimize costs, but for general keyword systems /
search, distribution inevitably increases the amount of network traffic
--- and for truly general systems of large size, fine distribution
quickly becomes impractical.

$0.02,

jb

From oskar at freenetproject.org  Mon Jan 21 09:56:01 2002
From: oskar at freenetproject.org (Oskar Sandberg)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Names vs. Identifiers, Search vs.
 Location was re: Costs of Decentralization
In-Reply-To: <3C4C3A98.529AB052@jump.net>; from jbone@jump.net on Mon, Jan 21, 2002 at 09:58:16AM -0600
References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net>
Message-ID: <20020121185455.P771@sandbergs.org>

On Mon, Jan 21, 2002 at 09:58:16AM -0600, Jeff Bone wrote:
<>
> > To insert an object, insert a pointer to it at every index-node
> > responsible for any one of the keywords. Yes, this takes extra
> > insert steps (by a constant factor: avg keyword count per object),
> > but then a query for any string of keywords will give a definitive
> > answer about the presence or absence of matching object(s) in
> > O(log n) hops, just as with the "identifier" case.
>
> This will work --- but it's still not practical for the most general
> case. In this scheme, the number of messages for a particular query
> is proportional to the specificity of the query in total number of
> keywords given; each message (a fragment of the query with a single
> keyword) might even travel only O(1) hops (assuming the partial index
> node for a keyword can be directly computed and the node directly
> contacted), but you have the undesirable consequence that you've got
> to do more of them the more information you're giving about the
> object you desire. Message count rises with the "fineness" of the
> partitioning scheme and the specificity of the query.
>
> Further, each partial result specifies a set of objects matching a
> single keyword; the actual answer is found by intersecting the
Each of those partial result sets is likely to be *very large* (size described by O(f*N) where f is the frequency > of the keyword, i.e., the proportion of all documents "containing" the keyword to total documents) in a general system where you're not > just indexing a small amount of metadata but rather the entire document. To get around this, one could simply include the total list of keywords for every index entry - then to do an intersection search one would only need to search for any one of the words. This makes the index entries quite large if every word is indexed, but most writing uses only a couple of thousand words so it is not impossible (and in pratice one can exclude most of /usr/dict/words without much loss to usability). > Hence while it's true that there are other strategies (besides > full walk / total replication of index) it remains the case that for a comprehensive, "correct" distributed search, amount of network > traffic unavoidably grows in a way that isn't true for lookup of identifiers. The factors which impact this are size of the network, > size of data / metadata, partitioning scheme for distributing the indices, the "language of discourse" used for keywords, etc.. You can > do tradeoffs and optimize costs, but for general keyword systems / search distribution inevitably increases the amount of network > traffic --- and for truly general systems of large size, fine distribution quickly becomes impractical. I won't venture to guess whether or not this is so, but I must say you have wandered quite from any mathematical argument for this position, which I feel is necessary if one is to claim that something holds fundamentally. What can be said mathematically is that if there is no way to sort the identifiers, then the utility of the search will depend only on the number of index entries reached, that is the steps taken times the average number of indexes per peer. 
This is trivially proved since the lack of effective sorting means that the probability that an entry is relevant is independent of the relevance of all other entries passed. Saying that there can be no effective sorting of "semantic identifiers", however, seems to be either a hasty conclusion or based on an, in practice, unnecessarily broad definition of such identifiers. -- Oskar Sandberg oskar@freenetproject.org From alk at pobox.com Mon Jan 21 10:30:02 2002 From: alk at pobox.com (Tony Kimball) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <20020121185455.P771@sandbergs.org> Message-ID: <15436.45529.442686.597489@gargle.gargle.HOWL> Quoth Oskar Sandberg on Monday, 21 January: : Saying that there can be no effective sorting of "semantic : identifiers", however, seems to be either a hasty conclusion or based on : an, in pratice, unecessarily broad defenition of such identifiers. For any given natural language, it's also (arguably) falsifiable, by counter-example. Consider Roget's Thesaurus. From jbone at jump.net Mon Jan 21 10:36:01 2002 From: jbone at jump.net (Jeff Bone) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Names vs. Identifiers, Search vs.
Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <20020121185455.P771@sandbergs.org> Message-ID: <3C4C5EF9.946A1E05@jump.net> Oskar Sandberg wrote: > To get around this, one could simply include the total list of keywords > for every index entry - then to do an intersection search one would only > need to search for any one of the words. This makes the index entries > quite large if every word is indexed, but most writing uses only a > couple of thousand words so it is not impossible (and in pratice one can > exclude most of /usr/dict/words without much loss to usability). Hmmmm... first problem: a single query for a single keyword could then return a list of all objects for which that keyword applied, along with all the other keywords that describe each of those objects. But this is likely to result in a very large result set in the general case, and the requestor is then responsible for performing the full intersection. Okay, that's easily solved: simply have the requestor supply the full set of keywords to the index node for a given keyword, and have that node do the intersection before returning the result set. What we've now described does in fact begin to resemble the Google architecture. [1] A couple of things to point out: we're no longer directly exploiting the network topology, and this doesn't look like "routing" really. We're no longer passing the query around widely according to some mapping of keywords onto nodes and feeding back results as we go; rather, a single query goes to a single server for its results. (The main point of the original message was that "searching" in a routed fashion doesn't necessarily and in current practice does not cost-effectively exploit topology, while "locating" does. 
It's a mathematical implication of the fact that semantic names result in sets of partial or complete results, and those sets grow rapidly. I think. :-) There are still potential problems with the amount of insert traffic, but this is probably manageable for read-mostly applications. Key questions to resolve: how to maintain a reliable, complete set of keyword-serving nodes? How to maintain consistency among the nodes serving a particular keyword, or is that even necessary? It's possible that topology and routing-like behavior on insert might be useful in answering those questions. > I won't venture to guess whether or not this is so, but I must say you > have wandered quite from any mathematical argument for this position, > which I feel is necessary if one is to claim that something holds > fundamentally. Absolutely true. :-) I'm mostly just thinking out loud and looking for feedback; I haven't quite fully formed my mental model of the problem yet, though I've got a whiteboard full of currently half-ass equations as placeholders. ;-) I may push this forward some to try to formalize the argument... if so, more later. jb [1] http://www-db.stanford.edu/~backrub/google.html From gojomo at bitzi.com Mon Jan 21 10:58:02 2002 From: gojomo at bitzi.com (Gordon Mohr) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Names vs. Identifiers, Search vs.
Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> Message-ID: <00a001c1a2ac$ad4b6a80$1fc77940@golden> Jeff Bone writes: > Gordon Mohr wrote: > > > > *NO* distributed search mechanism --- mechanism for resolving semantic names > > > into sets of objects that match the name --- can do the same thing without > > > (a) a full walk of all nodes in the network, or (b) complete, > > > dynamically-consistent replication of the index to all nodes. For a > > > comprehensive, "correct" distributed search, amount of network traffic is > > > unavoidably proportional to the size of the network in a way that isn't true > > > for lookup of identifiers. > > > > This does not follow; consider the common example of "semantic names" that are strings of keywords. > > Which thing are you saying doesn't follow? (I did realize after the fact that I wasn't particularly accurate in the above paragraph, so > the claim looks larger than intended.) A distributed search mechanism for resolving "semantic names" to objects CAN give a definitive answer as to whether matching objects exist in the system, in O(log n) steps. Neither a full walk nor complete index replication is necessary. > > To insert an object, insert a > > pointer to it at every index-node responsible for any one of the keywords. Yes, this takes extra insert steps (by a constant factor: > > avg keyword count per object), but then a query for any string of keywords will give a definitive answer about the presence or > > absence of matching object(s) in O(log n) hops, just as with the "identifier" case. > > This will work --- but it's still not practical for the most general case.
In this scheme, the number of messages for a particular > query is proportional to the specificity of the query in total number of keywords given; each message (fragment of the query with a > single keyword) might even only travel O(1) hops (assuming the partial index node for a keyword can be directly computed and the node > directly contacted) but you have the undesirable consequence that you've got to do more of them the more information you're giving about > the object you desire. Message count rises with the "fineness" of the partitioning scheme and the specificity of the query. So what? That's more traffic by a constant factor -- the number of keywords. The traffic as a function of network size is still O(log n). > Further, the each partial result specifies a set of objects matching a single keyword; the actual answer is found by intersecting the > partial result sets. As Oskar points out (on p2p-hackers followup), if you put the object's full keyword list at each node that tracks any one of the keywords, you don't need to compute the intersection at the querying client. (Actually, if you assume an ordering of the keywords, you need only put the "subsequent" keywords at each node. That is, with keywords X Y Z, you put X Y Z at node(X), Y Z at node(Y), and Z at node(Z). I also suspect that if the ordering can be reverse-frequency, the average size of the per-object keyword lists at index nodes can be way less than half the average keyword list size.) > Hence while it's true that there are other strategies (besides > full walk / total replication of index) it remains the case that for a comprehensive, "correct" distributed search, amount of network > traffic unavoidably grows in a way that isn't true for lookup of identifiers. The factors which impact this are size of the network, No, this one factor is your still-incorrect claim. Both the "identifier" and "keyword" cases grow with network size in the exact same way, bounded by O(log n).
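[Gordon's "subsequent keywords" refinement in the parenthetical above can be sketched as follows. One detail is my inference rather than something spelled out in the message: under this layout a query is served by the node of its earliest keyword in the global ordering, since that node's stored suffixes still contain every later query keyword. Plain `sorted()` stands in for the (ideally reverse-frequency) global ordering; all names are illustrative.]

```python
# Sketch of the suffix-list refinement above: with a fixed global ordering
# of keywords, node(k) stores, for each object, only the keywords at or
# after k in that ordering.  With keywords X Y Z: node(X) gets [X, Y, Z],
# node(Y) gets [Y, Z], node(Z) gets [Z].

from collections import defaultdict

index_nodes = defaultdict(dict)   # keyword -> {object_id: suffix of keyword list}

def insert(object_id, keywords):
    ordered = sorted(set(keywords))              # stand-in for the global ordering
    for i, kw in enumerate(ordered):
        index_nodes[kw][object_id] = ordered[i:]  # only "subsequent" keywords

def search(query_keywords):
    ordered = sorted(set(query_keywords))
    node = index_nodes[ordered[0]]               # node of the earliest keyword
    return {oid for oid, suffix in node.items()
            if set(ordered) <= set(suffix)}

insert("doc1", ["x", "y", "z"])
insert("doc2", ["y", "z"])
print(search(["y", "z"]))    # matches both doc1 and doc2
print(search(["x", "z"]))    # -> {'doc1'}
```

[Total index storage per object drops from k entries of size k to k + (k-1) + ... + 1, i.e. roughly half, as Gordon estimates.]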
That the (insert) traffic and (index) memory usage in the "keyword" case are both larger by constant factors does not indicate an extra traffic growth factor. > size of data / metadata, partitioning scheme for distributing the indices, the "language of discourse" used for keywords, etc.. Yes, those are all independent factors affecting the level of traffic. > You can > do tradeoffs and optimize costs, but for general keyword systems / search distribution inevitably increases the amount of network > traffic --- and for truly general systems of large size, fine distribution quickly becomes impractical. I would again offer Google as a counterexample: a general indexing system of large size capable of very fine distinction offering practical -- indeed impressive -- performance. - Gordon From jbone at jump.net Mon Jan 21 11:00:01 2002 From: jbone at jump.net (Jeff Bone) Date: Sat Dec 9 22:11:44 2006 Subject: [decentralization] Re: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <20020121185455.P771@sandbergs.org> <15436.45529.442686.597489@gargle.gargle.HOWL> Message-ID: <3C4C6461.4CCDA78@jump.net> Tony Kimball wrote: > Quoth Oskar Sandberg on Monday, 21 January: > : Saying that there can be no effective sorting of "semantic > : identifiers", however, seems to be either a hasty conclusion or based on > : an, in pratice, unecessarily broad defenition of such identifiers. > > For any given natural language, it's also (arguably) falsifiable, by > counter-example. Consider Roget's Thesaurus. Oskar's point is valid: the claim that there is no effective sorting of i.e. keywords is trivially false. A thesaurus doesn't, however, prove that this can be reasonably exploited in the domain in question.
A thesaurus only maps keywords onto other objects, specifically other keywords; it does not map keywords into lists of compound names that can be formed with such keywords or other useful objects. Using Gordon's / Oskar's evolved scheme, we can write an equation that describes the amount of insert traffic it generates. The scheme for insert is:

(1) For an object (document), generate its keyword list
(2) Map each keyword to the node(s) that index that keyword
(3) Insert a record into each of these nodes with the id and keyword list

The equation for this would be:

    amount of insert traffic ~= r * k^2

where
    r = the degree of replication for index nodes, the # of nodes indexing a given keyword
    k = the average size of keyword lists for objects indexed
        (k < size of object, and "compressible" (wordIDs instead of words))

This insert traffic might or might not be acceptable for a given domain of objects. Pushing on, jb From zooko at zooko.com Mon Jan 21 11:11:01 2002 From: zooko at zooko.com (Zooko) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Re: Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization In-Reply-To: Message from "Gordon Mohr" of "Mon, 21 Jan 2002 10:51:42 PST." <00a001c1a2ac$ad4b6a80$1fc77940@golden> References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <00a001c1a2ac$ad4b6a80$1fc77940@golden> Message-ID: Folks: It sounds to me like some of the ideas about remote lookup of "semantic names" are implicitly assuming that all the nodes are "honest" cooperators. This is fine as long as only good people use your system, or as long as the system is under a single shared authority scope, and that authority employs plenty of administrators to keep everyone in line.
If you include the possibility of participants that are malicious, selfish, or arbitrarily badly confused, then this calls into question the mere possibility, not to mention the resource expenditure, of remote lookup of "semantic names". On the other hand it has no effect on the possibility, and hardly any effect on the expense, of remote lookup of "self-authenticating names", which is why I try to use the latter kind wherever possible. I'm aware that there are lots of smart people working on the problem of remote lookup of semantic names in the presence of potentially malicious participants, but the only solutions that have been proven to work are, well, "Byzantine" in their complexity, and are only reliable when a certain number of nodes are known to be honest. Regards, Zooko --- zooko.com Security and Distributed Systems Engineering --- From jbone at jump.net Mon Jan 21 11:14:02 2002 From: jbone at jump.net (Jeff Bone) Date: Sat Dec 9 22:11:44 2006 Subject: [decentralization] Re: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <00a001c1a2ac$ad4b6a80$1fc77940@golden> Message-ID: <3C4C67BB.C97242F2@jump.net> Gordon Mohr wrote: > I would again offer Google as a counterexample: a general indexing > system of large size capable of very fine distinction offering > practical -- indeed impressive -- performance. Take a look at any of the various write-ups of Google's architecture. We're now arguing semantics, but to call their distribution / partitioning scheme for indices "fine" is a bit of a mistake --- they certainly look more centralized than e.g. Gnutella. It's not clear to me that increased distribution and partitioning of their indices results in better performance or constant network traffic. 
jb From jbone at jump.net Mon Jan 21 11:18:02 2002 From: jbone at jump.net (Jeff Bone) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <00a001c1a2ac$ad4b6a80$1fc77940@golden> Message-ID: <3C4C68C9.9D4BA0EE@jump.net> Gordon Mohr wrote: > That the (insert) traffic and (index) > memory usage in the "keyword" case are both larger by constant > factors does not indicate an extra traffic growth factor. Nitty point: we should be careful about how we're measuring traffic, here. There are two measures of interest: number of messages generated for either an insert or query operation, and the size of the messages for either of those. Insert traffic by the second measure --- size of the insert --- does indeed grow with fine distribution in the keyword case. jb From gojomo at usa.net Mon Jan 21 11:37:01 2002 From: gojomo at usa.net (Gordon Mohr) Date: Sat Dec 9 22:11:44 2006 Subject: [decentralization] Re: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <00a001c1a2ac$ad4b6a80$1fc77940@golden> <3C4C68C9.9D4BA0EE@jump.net> Message-ID: <010201c1a2b2$e76e2bc0$1fc77940@golden> Jeff Bone writes: > Gordon Mohr wrote: > > > That the (insert) traffic and (index) > > memory usage in the "keyword" case are both larger by constant > > factors does not indicate an extra traffic growth factor. > > Nitty point: we should be careful about how we're measuring traffic, > here. 
There are two measures of interest: number of messages generated > for either an insert or query operation, and the size of the messages for > either of those. Insert traffic by the second measure --- size of the > insert --- does indeed grow with fine distribution in the keyword case. Yes, traffic is greater. But once more: as a function of network size, the identifier and keyword cases face the exact same sort of logarithmic traffic growth. You previously claimed the contrary, in your initial message with this subject line, and your first followup. - Gordon From oskar at freenetproject.org Mon Jan 21 12:14:01 2002 From: oskar at freenetproject.org (Oskar Sandberg) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization In-Reply-To: <3C4C5EF9.946A1E05@jump.net>; from jbone@jump.net on Mon, Jan 21, 2002 at 12:33:29PM -0600 References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <20020121185455.P771@sandbergs.org> <3C4C5EF9.946A1E05@jump.net> Message-ID: <20020121211349.Q771@sandbergs.org> On Mon, Jan 21, 2002 at 12:33:29PM -0600, Jeff Bone wrote: <> > What we've now described does in fact begin to resemble the Google architecture. [1] A couple of things to point out: we're no longer > directly exploiting the network topology, and this doesn't look like "routing" really. We're no longer passing the query around widely > according to some mapping of keywords onto nodes and feeding back results as we go; rather, a single query goes to a single server for its > results. In essence, what this scheme does is reduce the problem of searching to that of locating. By building identifiers based on a keyword from each inclusive term in the query, we can, through any locating procedure, identify the nodes that can provide the search results.
It is no different from the mapping of identifier to data, except that the resource at the located node is not bandwidth and storage but the capacity to process our query. So from one perspective, one might say that we are not actually using the distributed nature for the search, but we do have a scheme for searching that is in fact distributed, functional, scalable, and deterministic (given that the locating procedure used has those properties). It is not what I would describe as an elegant scheme, and it is questionable whether the constant multiplier given by the number of keywords (as you note) is not great enough to offset the scalability advantage over semi-centralized schemes like supernode or Power-law distribution networks for realistic sizes and traffic levels (the world, of course, is not always asymptotic), but it does establish that such a system is possible. Of course, the search procedure is only (the lesser) half of the problem with semantic lookups. As Zooko noted, the lack of good ways to verify results is a huge problem in any distributed search system (and indeed in centralized searches as well, though Google's policy of considering links endorsements seems to work quite well). > (The main point of the original message was that "searching" in a routed fashion doesn't necessarily and in current practice does > not cost-effectively exploit topology, while "locating" does. It's a mathematical implication of the fact that semantic names result in sets > of partial or complete results, and those sets grow rapidly. I think. :-) There are still potential problems with the amount of insert > traffic, but this is probably manageable for read-mostly applications. Key questions to resolve: how to maintain a reliable, complete set of > keyword-serving nodes? How to maintain consistency among the nodes serving a particular keyword, or is that even necessary? 
It's possible > that topology and routing-like behavior on insert might be useful in answering those questions. In fact though, the problem of maintaining reliability and consistency among the peers is equally a problem in identifier lookups. It is my standing criticism of schemes like Chord and Tapestry (Oceanstore) that they turn from very elegant beauties on paper into hideous beasts in practice when attempting to force deterministic and consistent properties on an inherently inconsistent and unreliable set of peers. <> -- Oskar Sandberg oskar@freenetproject.org From jbone at jump.net Mon Jan 21 12:24:02 2002 From: jbone at jump.net (Jeff Bone) Date: Sat Dec 9 22:11:44 2006 Subject: [decentralization] Re: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <00a001c1a2ac$ad4b6a80$1fc77940@golden> <3C4C68C9.9D4BA0EE@jump.net> <010201c1a2b2$e76e2bc0$1fc77940@golden> Message-ID: <3C4C786F.DB75EC09@jump.net> Gordon Mohr wrote: > But once more: as a function of network size, the identifier and > keyword cases face the same exact same sort of logarithmic traffic > growth. I agree. > You previously claimed the contrary, in your initial message with > this subject line, and your first followup. So what's your point with this? This is an evolving conversation, not an effort at textbook-writing. Anything offered is offered hypothetically. jb From oskar at freenetproject.org Mon Jan 21 12:30:01 2002 From: oskar at freenetproject.org (Oskar Sandberg) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Names vs. Identifiers, Search vs.
Location was re: Costs of Decentralization In-Reply-To: <15436.45529.442686.597489@gargle.gargle.HOWL>; from alk@pobox.com on Mon, Jan 21, 2002 at 06:27:05PM -0600 References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <20020121185455.P771@sandbergs.org> <15436.45529.442686.597489@gargle.gargle.HOWL> Message-ID: <20020121212922.R771@sandbergs.org> On Mon, Jan 21, 2002 at 06:27:05PM -0600, Tony Kimball wrote: > Quoth Oskar Sandberg on Monday, 21 January: > : Saying that there can be no effective sorting of "semantic > : identifiers", however, seems to be either a hasty conclusion or based on > : an, in pratice, unecessarily broad defenition of such identifiers. > > For any given natural language, it's also (arguably) falsifiable, by > counter-example. Consider Roget's Thesaurus. I'm not a linguist, but off hand I would consider a thesaurus an example of sorting on natural vocabulary - natural language would have to include concepts described by sentences and phrases, which seem far from clearly sortable to me. Of course, the combination of keywords with boolean operations is in fact an attempt to simplify "semantic identifiers" from natural language to vocabulary and a few simple rules. -- Oskar Sandberg oskar@freenetproject.org From gojomo at usa.net Mon Jan 21 12:42:01 2002 From: gojomo at usa.net (Gordon Mohr) Date: Sat Dec 9 22:11:44 2006 Subject: [decentralization] Re: [p2p-hackers] Names vs. Identifiers, Search vs. 
Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <00a001c1a2ac$ad4b6a80$1fc77940@golden> <3C4C68C9.9D4BA0EE@jump.net> <010201c1a2b2$e76e2bc0$1fc77940@golden> <3C4C786F.DB75EC09@jump.net> Message-ID: <001201c1a2bb$f301f3a0$1fc77940@golden> Jeff Bone writes: > Gordon Mohr wrote: > > > But once more: as a function of network size, the identifier and > > keyword cases face the same exact same sort of logarithmic traffic > > growth. > > I agree. Great! You had danced around this matter so much that I wasn't sure we had reached agreement. > > You previously claimed the contrary, in your initial message with > > this subject line, and your first followup. > > So what's your point with this? This is an evolving conversation, not > an effort at textbook-writing. Anything offered is offered > hypothetically. Actually, your initial message had more the tone of a lecture than thinking-out-loud. And as a lecture, it had a major error. I just wanted to make sure we had all reached the same page about this central point. - Gordon From jbone at jump.net Mon Jan 21 12:53:02 2002 From: jbone at jump.net (Jeff Bone) Date: Sat Dec 9 22:11:44 2006 Subject: [decentralization] Re: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <00a001c1a2ac$ad4b6a80$1fc77940@golden> <3C4C68C9.9D4BA0EE@jump.net> <010201c1a2b2$e76e2bc0$1fc77940@golden> <3C4C786F.DB75EC09@jump.net> <001201c1a2bb$f301f3a0$1fc77940@golden> Message-ID: <3C4C7F3F.52785384@jump.net> Gordon Mohr wrote: > Great! 
You had danced around this matter so much that I wasn't sure > we had reached agreement. I'm still playing with the model to determine if there are other factors that, combined with network size, don't interact to put a lower-bound on the amount of traffic generated in various different measures by various schemes. BTW, no dancing intended. > Actually, your initial message had more the tone of a lecture than > thinking-out-loud. And as a lecture, it had a major error. I just > wanted to make sure we had all reached the same page about this > central point. Thank God you exist to keep me in line, Gordon. There might be hordes of naive fools out there laboring under mission-critical misperceptions if not for your diligent editorial efforts. ;-) Who knows how many existing and future participants of both of these lists might have unskeptically read that article and falsely believed that --- in the *complete absence* of any mathematical support --- it actually asserted some universal law? (Indeed, there may be a universal law lurking in there somewhere, or there may not. I offered my note as a strawman in an attempt to find out.) $0.02, jb From cyb at azrael.dyn.cheapnet.net Mon Jan 21 12:54:01 2002 From: cyb at azrael.dyn.cheapnet.net (Brandon Wiley) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Re: Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization In-Reply-To: Message-ID: > I'm aware that there are lots of smart people working on the problem of remote > lookup of semantic names in the presence of potentially malicious participants, > but the only solutions that have been proven to work are, well, "Byzantine" in > their complexity, and are only reliable when a certain number of nodes are known > to be honest. As far as I'm concerned, name lookup is a finished problem. Either a name is self-authenticating or else it is vouched for by some authority (singular or a group) or it is vulnerable to attack.
Since every scheme I've ever heard of falls into one of these 3 categories, the problem of semantic name lookup is reduced to choosing which of the three you want your names to be in. From jbone at jump.net Mon Jan 21 15:14:01 2002 From: jbone at jump.net (Jeff Bone) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Searching and Locating Strawman Restated References: Message-ID: <3C4CA025.97F4A507@jump.net> Michael Bauer wrote: > I'm not sure past, present, or future readers of this list have any idea > what this thread is about anymore. Good point! Actually, I'm not sure that the author or any of the participants have had any idea of what the thread was about at any given point in its evolution. ;-) > If there were a summary "same page" URL I for one would quite grateful. In order to avoid another gojocascade, I'll hold back on any specific assertions about this until it's ready for prime time. The summary strawman: (1) names are different from identifiers (how?) (2) "search" in name space is different from "locating" in identifier space (how?) and (3) the various quantifiable costs (number of messages, message size, latency, memory costs, compute cycles, etc.) and tradeoffs that need to be paid / made are different. Any insight into the validity of this strawman continues to be appreciated. jb From burton at openprivacy.org Tue Jan 22 13:23:02 2002 From: burton at openprivacy.org (Kevin A. Burton) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] I might be able to host someone for codecon. Message-ID: <878zaqjd20.fsf@universe.yi.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 OK. Does anyone need a place to stay during codecon? The Hotels can be kind of expensive in SF and I already live here. If you want a place to stay so that you can save $ let me know. I am fairly sure I can convince my roommates. :) Kevin P.S. No Microsoft employees or FBI Agents :) - -- Kevin A.
Burton ( burton@apache.org, burton@openprivacy.org, burtonator@acm.org ) Location - San Francisco, CA, Cell - 415.595.9965 Jabber - burtonator@jabber.org, Web - http://relativity.yi.org/ How are you gentleman? -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: Get my public key at: http://relativity.yi.org/pgpkey.txt iD8DBQE8TdgnAwM6xb2dfE0RAq16AKCtg8yXDKqwTD1J5vJBh37a48GSOgCfbxSt vMKNo+WNuQ0rrkNdEWJ3slc= =RiLB -----END PGP SIGNATURE----- From jbone at jump.net Sun Jan 27 16:24:01 2002 From: jbone at jump.net (Jeff Bone) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Centralized Versus Decentralized Indexing Message-ID: <3C549976.DC380E62@jump.net> Found a few interesting resources over the last week or so re: quantifying the many and different costs of decentralized indexes for non-specific (semantic name) searches. Additional pointers welcome. http://citeseer.nj.nec.com/basu97performance.html http://citeseer.nj.nec.com/40188.html http://dblp.uni-trier.de/db/journals/tkde/LiebeherrOA93.html http://www.soi.city.ac.uk/~andym jb From greg at electricrain.com Wed Jan 30 12:32:02 2002 From: greg at electricrain.com (Gregory P. Smith) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Fwd: Stanford Networking Seminar, Thu 1/31, Pei Cao Message-ID: <20020130123108.A8854@zot.electricrain.com> Stanford Networking Seminar When: 12:45PM, Thursday, January 31st, 2002 Where: Room 104, Gates Computer Science Building URL: http://netseminar.stanford.edu/sessions/2002-01-31.html ----------------------------------------------------------------- Title: Search and Replication in Unstructured Peer-to-Peer Networks Speaker: Pei Cao Cisco Systems, Inc. Abstract: File sharing in Peer-to-Peer (P2P) networks is a popular application on the Internet today. In this talk, we briefly survey architectures of existing P2P systems, then focus on decentralized and unstructured networks such as Gnutella.
We study two aspects of the system: file search efficiency and replication strategies. We show that the simple flooding-based search algorithm scales poorly, especially for power-law random graphs. A multiple-walker random-walk search algorithm can improve upon the simple flooding by two orders of magnitude. We also show that, in unstructured networks, an object's replication ratio should be proportional to the square root of its popularity in order to minimize overall network search traffic. With simulations, we show that the path replication strategy, such as the one used in FreeNet, leads to close-to-optimal replication ratios. This project is joint work with Christine Lv, Edith Cohen, Kai Li and Scott Shenker. Bio: Pei Cao is currently a system architect at Cisco Systems. Prior to that she was an assistant professor in the CS Dept. at Univ. Wisconsin-Madison. Her research interests include operating systems, Web caching and content delivery, and storage systems. Notes: Lunch will be available at 12:15. A vegetarian selection will be available. No drinks will be provided. The talk itself will begin at 12:45 +----------------------------------------------------------------------------+ | This message was sent via the Stanford Computer Science Department | | colloquium mailing list. To be added to this list send an arbitrary | | message to colloq-subscribe@cs.stanford.edu. To be removed from this list,| | send a message to colloq-unsubscribe@cs.stanford.edu. For more information,| | send an arbitrary message to colloq-request@cs.stanford.edu. For directions| | to Stanford, check out http://www-forum.stanford.edu | +----------------------------------------------------------------------------+ -- Gregory P. Smith
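[The square-root replication rule in Pei Cao's abstract above can be illustrated with a small calculation. This is only a sketch of the stated result, not code from the paper; the query rates and the replication budget below are made-up numbers.]

```python
# Illustrative sketch of the square-root replication rule from the seminar
# abstract above: to minimize expected search traffic in an unstructured
# network, replicate object i in proportion to sqrt(q_i), where q_i is its
# query rate.  All figures below are hypothetical.
import math

query_rates = {"popular": 100.0, "middling": 25.0, "rare": 1.0}
total_copies = 90   # total replication budget across all objects

sqrt_weights = {k: math.sqrt(q) for k, q in query_rates.items()}
norm = sum(sqrt_weights.values())
copies = {k: total_copies * w / norm for k, w in sqrt_weights.items()}

# Note the sub-linear allocation: an object queried 100x more often than
# another gets only 10x the copies (sqrt(100) / sqrt(1) = 10).
for name, c in sorted(copies.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {c:.1f} copies")
```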