From bram at gawth.com  Fri Jan 4 18:02:01 2002
From: bram at gawth.com (Bram Cohen)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] BitTorrent trial run going on right this minute!
Message-ID: 

Everyone who's online now and would like to participate in a test of
BitTorrent, hop on irc at irc.openprojects.net and go to #bittorrent

-Bram Cohen

"Markets can remain irrational longer than you can remain solvent"
  -- John Maynard Keynes

From bram at gawth.com  Sat Jan 5 17:38:01 2002
From: bram at gawth.com (Bram Cohen)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] New release of BitTorrent out
Message-ID: 

A new release of BitTorrent is out, along with a new page -

http://bitconjurer.org/BitTorrent/

New in this release -

  New file in demo, in case you already got the old one :-)
  A complete rewrite of all the upload/download logic
  Tit-for-tat upload preferences
  TCP buffering awareness
  Better install process under UNIX

And tons of other small improvements. We did a test run of the new
release yesterday, and it handled six simultaneous downloaders without
breaking a sweat; my guess is the current version handles twenty with
no problem.

This release marks the confluence of two events -

a) BitTorrent getting mature enough to be used commercially
b) Bram has spent enough time unemployed and needs to find a job

If you're one of the many companies who could benefit from reduced
bandwidth costs, now would be a good time to hire me to ensure
BitTorrent reaches full maturity.

-Bram Cohen

"Markets can remain irrational longer than you can remain solvent"
  -- John Maynard Keynes

From bram at gawth.com  Wed Jan 16 05:01:01 2002
From: bram at gawth.com (Bram Cohen)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] CodeCon presentations announced and registration open
Message-ID: 

CodeCon is the premier event in 2002 for the P2P, cypherpunk, and
network/security application developer community.
It is a workshop for developers of real-world applications that support
individual liberties.

CodeCon registration is $50; a $10 discount is available if you
register prior to February 1st. It will be held February 15-17,
noon-5pm, at DNA Lounge in San Francisco.

http://codecon.org/

Presentations will include -

* Peek-A-Booty - a distributed anti-censorship application
* Invisible IRC Project - secure, anonymous client/server networks
* Idel - lightweight mobile code for p2p cpu sharing
* Reptile - a distributed but uniform content exchange mechanism
* MNet - a universal shared filestore
* Alpine - a social discovery mechanism which can handle high churn
  rates, malicious peers, and limited bandwidth
* Eikon - an image search engine
* CryptoMail - encrypted email for all
* libfreenet - a case study in horrors incomprehensible to the mind of
  man, and other secure protocol design mistakes
* BitTorrent - hosting large, popular files cheaply

From brandon at blanu.net  Fri Jan 18 05:28:01 2002
From: brandon at blanu.net (brandon@blanu.net)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Re: [decentralization] Re: Costs of Decentralization
In-Reply-To: ; from osokin@osokin.com on Fri, Jan 18, 2002 at 05:15:28AM -0000
References: <20020117132814.A11278@blanu.net>
Message-ID: <20020118041031.A13946@blanu.net>

> Exactly. Since log N tends to grow with the growth of N, it
> means that sooner or later (depending on the average bandwidth,
> average stream of requests and so on) the Chord network will be
> unable to function.

Ah, I see what you mean now. I guess this discussion of infinite
scalability is rather irrelevant anyway since there are 255^4 IPs
available.

So I guess the discussion here resolves to a difference in strategy
between Gnutella and Chord. Both attempt to route among a large number
of nodes with reasonable resource usage per node. Gnutella solves this
by routing to an arbitrary subset of the nodes and returning an
arbitrary subset of results.
Chord solves this by arranging the network so that every node is
reachable from every other node with a reasonable number of hops and a
reasonable number of connections per node, given the current maximum
number of nodes of 255^4.

It is interesting (but not of practical relevance) to see what happens
to these networks when you give them a very (very) large number of
nodes. Gnutella will give you a very small (and random) percentage of
the total results. Chord will increase the number of connections per
node until it is no longer reasonable.

Both networks have a maximum size to which they can scale before their
performance starts to degrade. For Gnutella this is the size at which
all of the nodes in the network fit inside the horizon, but no more
nodes can be added inside the horizon. This is < m^p, where m = max
connections and p = max TTL. Just how much less N is than m^p depends
on the topology of the network. For instance, a network with a single
supernode with m-1 connections to slave nodes and p=1 would contain
exactly m^p nodes. Every other topology contains fewer than that
because of duplicate connections. For every node added after that, the
ratio of search results to the total degrades. I would be interested in
finding out from a Gnutella person what common values for m and p are,
as from there you could compute the approximate size of the searchable
subset in the public Gnutella network. p is published somewhere, but m
would depend on the servents.

In Chord, the maximum size is 2^x, where x is the number of nodes that
you think any node can reasonably keep track of. All this really means
is handling update traffic as the network membership churns. The amount
of other traffic to be handled (searches, file transfers, and the like)
is unrelated to the number of edges per node. So if your network has a
churn rate of c nodes per second, then the number of updates per second
for each node is cx/N.
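[Editorial sketch: the cx/N estimate above can be poked at numerically. The churn and capacity figures below are invented for illustration; only the formula itself comes from the discussion.]

```python
def per_node_updates(churn, x):
    """Updates/sec each node sees, per the cx/N estimate above,
    in a full Chord network of N = 2^x nodes."""
    return churn * x / 2 ** x

churn = 1000.0     # assumed: nodes joining/leaving per second
capacity = 50.0    # assumed: updates/sec a single node can absorb

# Find the smallest table size x whose full network stays within budget.
for x in range(1, 33):
    if per_node_updates(churn, x) <= capacity:
        print("sustainable from x=%d on (N=2^%d nodes)" % (x, x))
        break
```

Note that with a fixed churn rate, the per-node load cx/N actually shrinks as the network grows, which is consistent with the remark below that a stable network can have more nodes with no adverse consequences.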
So if the maximum number of updates per second that a node can handle
is z, you get z = cx/N and N = 2^x. I am far too sleepy to attempt to
solve this, and in fact too sleepy to even be able to tell if it's
correct, but the interesting part is just that the maximum size of the
network is based on the amount of update traffic you can handle and the
amount of update traffic generated by the network. So a stable network
can have more nodes with no adverse consequences.

This is all of only theoretical interest since Chord implementations
fix x at 160 anyway. It might be useful to determine from the fixed x
what churn rates are acceptable, but we'd have to make up z first.

From clay at clifford.webservepro.com  Fri Jan 18 05:28:03 2002
From: clay at clifford.webservepro.com (Clay Shirky)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Re: [decentralization] Re: Costs of Decentralization
In-Reply-To: <20020118041031.A13946@blanu.net> from "brandon@blanu.net" at Jan 18, 2002 04:10:31 AM
Message-ID: <200201181221.g0ICLPlQ020168@clifford.webservepro.com>

> > means that sooner or later (depending on the average bandwidth,
> > average stream of requests and so on) the Chord network will be
> > unable to function.

The Chord paper seems to indicate that they get around this by setting
what they call 'r', the 'successors list', i.e. the number of nodes any
other node knows about. By making r a function of the number of nodes
N, so that r is 2(log2(N)), it increases the length-covering of the
first hop. (Every hop in Chord is provably closer to its target.)

So as I read this, it avoids a bandwidth flood, at the expense of
making the number of connections per node the break point as the
network grows large.

> Ah, I see what you mean now. I guess this discussion of infinite
> scalability is rather irrelevant anyway since there are 255^4 IPs
> available.
No, it's quite relevant, since namespaces become crunchy long before
they are even moderately populated -- viz the dynamic IP address hack.
Infinite scalability is always relevant, not because you'll ever get
there, but because you won't be remembered for being the guy who said
it would be a good idea to store the year in two bytes, or that 640K
ought to be enough for anybody.

> In Chord, the maximum size is 2^x,

I think it's 2(log2(X)). (The relevant bits are on page 22 of Dabek's
thesis paper.)

-clay

From zooko at zooko.com  Fri Jan 18 05:54:01 2002
From: zooko at zooko.com (Zooko)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Re: [decentralization] Re: Costs of Decentralization
In-Reply-To: Message from Clay Shirky of "Fri, 18 Jan 2002 07:21:25 EST." <200201181221.g0ICLPlQ020168@clifford.webservepro.com>
References: <200201181221.g0ICLPlQ020168@clifford.webservepro.com>
Message-ID: 

[note that this discussion is being cross-posted to decentralization
and p2p-hackers]

Clay Shirky wrote the lines prepended with "> ".

> So as I read this, it avoids a bandwidth flood, at the expense of
> making the number of connections per node the break point as the
> network grows large.

Each node in Chord must handle approximately log2(N) other nodes for a
network with N total nodes. For example, if there were 2^70 total nodes
(or approximately one Chord node per hydrogen atom in the entire
universe), then each node would have to handle 70 other nodes.

So running out of space in the Chord tables is never going to be a
problem unless the devices running the nodes are surprisingly
constrained in terms of memory. I estimate memory requirements of at
most 20 KB for all of your "tracking other nodes" needs even for one of
those universe-spanning Chord networks with trillions upon trillions
upon trillions of nodes. No problem for desktops, handhelds, or most
embeddeds.
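[Editorial sketch: the estimate above is easy to check. The 256 bytes per routing-table entry is an assumed figure, not from the discussion.]

```python
import math

def chord_state_bytes(n_nodes, bytes_per_entry=256):
    """Rough routing-table footprint for one Chord node in an N-node
    network: about log2(N) entries, each holding an ID, an address,
    and some bookkeeping (the 256-byte entry size is assumed)."""
    entries = math.ceil(math.log2(n_nodes))
    return entries, entries * bytes_per_entry

entries, size = chord_state_bytes(2 ** 70)
print(entries, size)   # 70 entries at ~17.5 KB, consistent with the
                       # "at most 20 KB" back-of-envelope figure above
```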
Possibly an issue for sub-embeddeds, nano-computers, or a quantum
computer running inside one of the aforementioned hydrogen atoms...

> No, it's quite relevant, since namespaces become crunchy long before
> they are even moderately populated -- viz the dynamic IP address
> hack.

I think this is a consequence of IP (and DNS) being (ultimately)
centrally managed and allocated. Chord should have no such problem.

Brandon's point against Chord's scalability is a telling one, however.
The scalability problem isn't that your Chord node runs out of space to
remember its log2(N) peers; the problem is that too many of those peers
are unavailable, or churning in and out of the network, or returning
buggy results, or trying to DoS your node, etc...

Regards,

Zooko

--- zooko.com Security and Distributed Systems Engineering ---

From gojomo at bitzi.com  Fri Jan 18 07:38:01 2002
From: gojomo at bitzi.com (Gordon Mohr)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Re: [decentralization] Re: Costs of Decentralization
References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net>
Message-ID: <001001c1a034$bbccf6c0$1fc77940@golden>

Brandon Wiley writes:
> So I guess the discussion here resolves to a difference in strategy
> between Gnutella and Chord. Both attempt to route among a large
> number of nodes with reasonable resource usage per node. Gnutella
> solves this by routing to an arbitrary subset of the nodes and
> returning an arbitrary subset of results. Chord solves this by
> arranging the network so that every node is reachable from every
> other node with a reasonable number of hops and a reasonable number
> of connections per node, given the current maximum number of nodes
> of 255^4.

Rather than contrasting the two, why not think about a hybrid?

Imagine that some Gnutella servents started to...
(1) Assign themselves a "home" position on the hash unit circle

(2) Prefer connections with other servents in accordance with the
    "finger" positions dictated by Chord

Just those steps could bias the Gnutella network into a less arbitrary,
more characterizable, and perhaps more efficient topology. (For
example, broadcast cycles could be easily avoided.)

If Gnutella servents further began to...

(3) Upon connection stabilization, announce their available files to
    the servents around the unit circle "responsible" for indexing
    those files

(4) Narrowcast Queries Chord-style for files that are likely to have
    been so indexed

...then Gnutella would gain scaling behavior, for those Queries, a lot
like Chord.

- Gordon
____________________
Gordon Mohr, gojomo@ bitzi.com, Bitzi CTO _ http://bitzi.com _

From blanu at bozonics.com  Fri Jan 18 12:00:01 2002
From: blanu at bozonics.com (Brandon Wiley)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Re: [decentralization] Re: Costs of Decentralization
In-Reply-To: 
Message-ID: 

> Brandon's point against Chord's scalability is a telling one,
> however. The scalability problem isn't that your Chord node runs out
> of space to remember its log2(N) peers; the problem is that too many
> of those peers are unavailable, or churning in and out of the
> network, or returning buggy results, or trying to DoS your node,
> etc...

Upon further thought, it seems like there are two related issues in
Chord here. First, the network takes time to adjust to churning (how
long depends on your update strategy), so it becomes unreliable if the
churn rate is too high. Second, if the churn rate is too high then the
amount of update traffic per node can become too high and nodes will
start either lagging behind or dropping updates, causing the network to
degrade even faster. What's worse, opting for a more aggressive update
strategy to avoid the first problem will push the second problem closer
to realization, and vice versa.
The problem of a high churn rate making the network suck exists in
every network I can think of. It's just significant in Chord because in
other networks the limitation is hit before you get to the churn rate
problem.

From jbone at jump.net  Sat Jan 19 12:37:02 2002
From: jbone at jump.net (Jeff Bone)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization
References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden>
Message-ID: <3C49D6C0.D020AB22@jump.net>

Gordon Mohr wrote:
> Brandon Wiley writes:
> > So I guess the discussion here resolves to a difference in strategy
> > between Gnutella and Chord.

Actually, the discussion misses a much larger, more fundamental point:
comparing these things is an apples-to-oranges endeavor. It misses the
point that semantically meaningful names are fundamentally different
from identifiers, and that searching is fundamentally different from
locating. Different strategies are *fundamentally* required; many
techniques for efficient location of objects given identifiers are
simply not applicable to the task of searching given semantic names.

Semantically meaningful names --- aka queries --- identify a set (more
properly a multiset) of objects which match the name in question. All
kinds of things can be semantic names: pathnames in a file namespace,
query terms like "brittany spears oops did it", etc. The key point here
is that names resolve to a set of objects (possibly of order 1) rather
than a single object, and some additional layer or mechanism has to be
used to actually locate and interact with an object of interest from
the response set.

Identifiers are specific and probabilistically unique tokens that map
to exactly one object of interest. (There may be multiple replicas of
this object; any is as good as any other, and identifiers themselves
have no embedded location semantics.)
There are various kinds of identifiers: hashes, device/inode pairs,
etc. The objects identified have a persistent identity over time and
are in some sense immutable. Hashes of bitstrings, for instance,
uniquely and persistently identify the bitstring in question. In a
filesharing context, that means that different bitrate recordings of
the same song are identified as different objects. (Pedantic point: all
identifiers are semantic names, but very few semantic names are
identifiers.)

Identifiers can be arranged across different communication topologies
in ways that make location of the object associated with a given
identifier very cheap. Chord and Oceanstore are interesting examples of
this, and employ similar strategies. In both, there is a cheap,
probabilistic location / lookup service, and a slower, canonically
"correct" lookup mechanism as a fallback --- and the location strategy
is "complete" in an important sense described below. Both rely on the
mathematical properties of the network topologies they employ in order
to constrain the amount of lookup that must be done to find an object
for a given identifier. In the case of Oceanstore, the fundamental
abstractions are hashes of immutable objects, a Plaxton mesh topology,
and attenuated Bloom filters for lookup; in Chord, it's DHashes/block
IDs, a mechanism for distributed consistent hashing, and successor /
finger lists.

There is a profound consequence of this style of location service that
*cannot* be achieved for general distributed search: in Oceanstore and
(not clear that this is true, but it's been claimed) Chord, a suitably
constrained walk of the network (max O(log n) messages / hops) can
verifiably guarantee that an object for a given identifier is or is not
present anywhere in the network. In Oceanstore, if you hit the root of
the per-object spanning tree for the identifier you are looking for
without finding it, *it does not exist anywhere in the network.*
(Otherwise, you've found the object.)
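[Editorial sketch: this found-or-provably-absent property can be illustrated with a toy consistent-hashing ring. This is illustrative only -- not Oceanstore's Plaxton mesh nor Chord's actual successor/finger protocol -- and the node names and 16-bit ID space are invented.]

```python
import hashlib
from bisect import bisect_left

def toy_id(name, bits=16):
    """Toy identifier: truncated SHA-1, a stand-in for 160-bit IDs."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** bits)

class ToyRing:
    """A miniature 'complete' location service: each key is owned by
    its successor node on the ring, so a single lookup answers
    presence or absence definitively -- no flooding, no partial
    results."""
    def __init__(self, node_names):
        self.nodes = sorted(toy_id(n) for n in node_names)
        self.store = {nid: {} for nid in self.nodes}

    def successor(self, key):
        i = bisect_left(self.nodes, key)
        return self.nodes[i % len(self.nodes)]

    def put(self, data):
        key = toy_id(data)           # content-derived identifier
        self.store[self.successor(key)][key] = data
        return key

    def get(self, key):
        # Either the owning node has it, or the object is provably
        # absent from the whole network.
        return self.store[self.successor(key)].get(key)

ring = ToyRing(["alice", "bob", "carol", "dave"])
key = ring.put("some-immutable-bitstring")
assert ring.get(key) == "some-immutable-bitstring"
assert ring.get((key + 1) % 2 ** 16) is None   # definitive miss
```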
Such a location service is "complete." (Pedantic point: there is a
vanishingly small possibility of a false negative in, e.g., Oceanstore:
if an object matching an identifier is inserted into the network after
a location operation has started but before it finishes, then it might
not be found in that operation. This is an unavoidable consequence of
any such distributed system unless you require several kinds of costly
global consistency.)

*NO* distributed search mechanism --- mechanism for resolving semantic
names into sets of objects that match the name --- can do the same
thing without (a) a full walk of all nodes in the network, or (b)
complete, dynamically-consistent replication of the index to all nodes.
For a comprehensive, "correct" distributed search, the amount of
network traffic is unavoidably proportional to the size of the network
in a way that isn't true for lookup of identifiers. Looked at another
way, the accuracy of the search results (in terms of finding / not
finding the object of interest) is proportional to the amount of
network traffic generated.

This has some strategic implications for building distributed
filesystems and file sharing mechanisms.

First, both mechanisms --- search and location --- are probably
necessary for any useful system, and certainly for any system of ad-hoc
sharing of media files. Gordon's "hybrid" suggestion is right on the
money in at least this respect: each mechanism has a distinct,
"appropriate" architecture that should be employed. Services like Bitzi
map semantic names to identifiers, and such mechanisms are
appropriately "more centralized" (not to say centralized, but limited,
complete replication is a more appropriate strategy than partitioning
of indices) than an underlying location service that resolves
identifiers into content.

Second implication: while incomplete search results may be suitable for
some applications (such as finding a brittany spears song), the same
cannot be said for all applications of location.
For a distributed filesystem supporting arbitrary applications, being
unable to retrieve (via its identifier) a given object that exists
somewhere in the network is both avoidable and undesirable. We
shouldn't accept such limitations. IMO, we shouldn't build distinct
storage substrates for different applications; doing so increases
balkanization of data and limits the growth of value of the Internet as
a whole via Metcalfe's law. A hybrid mechanism that supports arbitrary
applications to the limits of what is technologically feasible isn't
just a good idea, it's a requirement in the long term.

The point of this message isn't to endorse or slam any particular
technology or architectural strategy, but rather to point out the fine
but important differences between two very different mechanisms in
distributed storage systems. It's my hope that consideration of the
differences between these abstractions and their associated features
will stimulate some positive design discussion and thought.

$0.02,

jb

From jbone at jump.net  Sat Jan 19 13:35:02 2002
From: jbone at jump.net (Jeff Bone)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Followup: The Politics of Searching and Locating
References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net>
Message-ID: <3C49E5D5.14869660@jump.net>

Factoring these two things out has potentially interesting legal /
political implications for distributed content sharing networks. Search
indices such as Bitzi (even a potentially
distributed-through-replication Bitzi) are divorced from any underlying
location service --- they merely serve to map names into identifiers.
AFAICT, there is no *reasonable* legal argument for contributory
copyright infringement applied to such things --- they are, after all,
merely card catalogs. (Caveat: CDDB, etc.)
Distributed location services which then map identifiers into content
are still subject to this kind of thing, but there are various
strategies (distributing stored fragments over many nodes, etc.) that
can minimize this risk to the original developer(s) of the code in
question.

Just another random thought,

jb

From gojomo at usa.net  Sun Jan 20 23:44:01 2002
From: gojomo at usa.net (Gordon Mohr)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization
References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net>
Message-ID: <008401c1a24f$649ab0a0$1fc77940@golden>

Jeff Bone writes:
> Gordon Mohr wrote:
> > Brandon Wiley writes:
> > > So I guess the discussion here resolves to a difference in
> > > strategy between Gnutella and Chord.
>
> Actually, the discussion misses a much larger, more fundamental
> point: comparing these things is an apples-to-oranges endeavor. It
> misses the point that semantically meaningful names are fundamentally
> different from identifiers, and that searching is fundamentally
> different from locating. Different strategies are *fundamentally*
> required; many techniques for efficient location of objects given
> identifiers are simply not applicable to the task of searching given
> semantic names.

Whew, good thing you only used the word "fundamental" four times in
five lines. Once more would have set off my bombast filter and I would
have missed the rest of your message. :)

> There is a profound consequence of this style of location service
> that *cannot* be achieved for general distributed search: in
> Oceanstore and (not clear that this is true, but it's been claimed)
> Chord, a suitably constrained walk of the network (max O(log n)
> messages / hops) can verifiably guarantee that an object for a given
> identifier is or is not present anywhere in the network.
> In Oceanstore, if you hit the root of the per-object spanning tree
> for the identifier you are looking for without finding it, *it does
> not exist anywhere in the network.* (Otherwise, you've found the
> object.) Such a location service is "complete." (Pedantic point:
> there is a vanishingly small possibility of a false negative in,
> e.g., Oceanstore: if an object matching an identifier is inserted
> into the network after a location operation has started but before
> it finishes, then it might not be found in that operation. This is
> an unavoidable consequence of any such distributed system unless you
> require several kinds of costly global consistency.)
>
> *NO* distributed search mechanism --- mechanism for resolving
> semantic names into sets of objects that match the name --- can do
> the same thing without (a) a full walk of all nodes in the network,
> or (b) complete, dynamically-consistent replication of the index to
> all nodes. For a comprehensive, "correct" distributed search, the
> amount of network traffic is unavoidably proportional to the size of
> the network in a way that isn't true for lookup of identifiers.

This does not follow; consider the common example of "semantic names"
that are strings of keywords. To insert an object, insert a pointer to
it at every index-node responsible for any one of the keywords. Yes,
this takes extra insert steps (by a constant factor: avg keyword count
per object), but then a query for any string of keywords will give a
definitive answer about the presence or absence of matching object(s)
in O(log n) hops, just as with the "identifier" case.

I suspect that Google is using some mega-advanced refinement of this
sort of strategy to achieve their amazing responsiveness.

- Gordon
____________________
Gordon Mohr, gojomo@ bitzi.com, Bitzi CTO _ http://bitzi.com _

From jbone at jump.net  Mon Jan 21 08:01:01 2002
From: jbone at jump.net (Jeff Bone)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Names vs.
 Identifiers, Search vs. Location was re: Costs of Decentralization
References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden>
Message-ID: <3C4C3A98.529AB052@jump.net>

Gordon Mohr wrote:
> > *NO* distributed search mechanism --- mechanism for resolving
> > semantic names into sets of objects that match the name --- can do
> > the same thing without (a) a full walk of all nodes in the network,
> > or (b) complete, dynamically-consistent replication of the index to
> > all nodes. For a comprehensive, "correct" distributed search, the
> > amount of network traffic is unavoidably proportional to the size
> > of the network in a way that isn't true for lookup of identifiers.
>
> This does not follow; consider the common example of "semantic names"
> that are strings of keywords.

Which thing are you saying doesn't follow? (I did realize after the
fact that I wasn't particularly accurate in the above paragraph, so the
claim looks larger than intended.)

> To insert an object, insert a pointer to it at every index-node
> responsible for any one of the keywords. Yes, this takes extra insert
> steps (by a constant factor: avg keyword count per object), but then
> a query for any string of keywords will give a definitive answer
> about the presence or absence of matching object(s) in O(log n) hops,
> just as with the "identifier" case.

This will work --- but it's still not practical for the most general
case.
In this scheme, the number of messages for a particular query is
proportional to the specificity of the query in total number of
keywords given; each message (a fragment of the query with a single
keyword) might even travel only O(1) hops (assuming the partial index
node for a keyword can be directly computed and the node directly
contacted), but you have the undesirable consequence that you've got to
do more of them the more information you're giving about the object you
desire. Message count rises with the "fineness" of the partitioning
scheme and the specificity of the query.

Further, each partial result specifies a set of objects matching a
single keyword; the actual answer is found by intersecting the partial
result sets. Each of those partial result sets is likely to be *very
large* (size described by O(f*N), where f is the frequency of the
keyword, i.e., the proportion of all documents "containing" the keyword
to total documents) in a general system where you're not just indexing
a small amount of metadata but rather the entire document.

Hence, while it's true that there are other strategies (besides full
walk / total replication of index), it remains the case that for a
comprehensive, "correct" distributed search, the amount of network
traffic unavoidably grows in a way that isn't true for lookup of
identifiers. The factors which impact this are the size of the network,
size of data / metadata, partitioning scheme for distributing the
indices, the "language of discourse" used for keywords, etc. You can
make tradeoffs and optimize costs, but for general keyword systems /
search, distribution inevitably increases the amount of network traffic
--- and for truly general systems of large size, fine distribution
quickly becomes impractical.

$0.02,

jb

From oskar at freenetproject.org  Mon Jan 21 09:56:01 2002
From: oskar at freenetproject.org (Oskar Sandberg)
Date: Sat Dec 9 22:11:44 2006
Subject: [p2p-hackers] Names vs. Identifiers, Search vs.
 Location was re: Costs of Decentralization
In-Reply-To: <3C4C3A98.529AB052@jump.net>; from jbone@jump.net on Mon, Jan 21, 2002 at 09:58:16AM -0600
References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net>
Message-ID: <20020121185455.P771@sandbergs.org>

On Mon, Jan 21, 2002 at 09:58:16AM -0600, Jeff Bone wrote:
<>
> > To insert an object, insert a pointer to it at every index-node
> > responsible for any one of the keywords. Yes, this takes extra
> > insert steps (by a constant factor: avg keyword count per object),
> > but then a query for any string of keywords will give a definitive
> > answer about the presence or absence of matching object(s) in
> > O(log n) hops, just as with the "identifier" case.
>
> This will work --- but it's still not practical for the most general
> case. In this scheme, the number of messages for a particular query
> is proportional to the specificity of the query in total number of
> keywords given; each message (a fragment of the query with a single
> keyword) might even travel only O(1) hops (assuming the partial index
> node for a keyword can be directly computed and the node directly
> contacted), but you have the undesirable consequence that you've got
> to do more of them the more information you're giving about the
> object you desire. Message count rises with the "fineness" of the
> partitioning scheme and the specificity of the query.
>
> Further, each partial result specifies a set of objects matching a
> single keyword; the actual answer is found by intersecting the
Each of those partial result sets is likely to be *very large* (size described by O(f*N) where f is the frequency > of the keyword, i.e., the proportion of all documents "containing" the keyword to total documents) in a general system where you're not > just indexing a small amount of metadata but rather the entire document. To get around this, one could simply include the total list of keywords for every index entry - then to do an intersection search one would only need to search for any one of the words. This makes the index entries quite large if every word is indexed, but most writing uses only a couple of thousand words so it is not impossible (and in pratice one can exclude most of /usr/dict/words without much loss to usability). > Hence while it's true that there are other strategies (besides > full walk / total replication of index) it remains the case that for a comprehensive, "correct" distributed search, amount of network > traffic unavoidably grows in a way that isn't true for lookup of identifiers. The factors which impact this are size of the network, > size of data / metadata, partitioning scheme for distributing the indices, the "language of discourse" used for keywords, etc.. You can > do tradeoffs and optimize costs, but for general keyword systems / search distribution inevitably increases the amount of network > traffic --- and for truly general systems of large size, fine distribution quickly becomes impractical. I won't venture to guess whether or not this is so, but I must say you have wandered quite from any mathematical argument for this position, which I feel is necessary if one is to claim that something holds fundamentally. What can be said mathematically is that if there is no way to sort the identifiers, then the utility of the search will depend only on the number of index entries reached, that is the steps taken times the average number of indexes per peer. 
This is trivially proved since the lack of effective sorting means that the probability that an entry is relevant is independent of the relevance of all other entries passed. Saying that there can be no effective sorting of "semantic identifiers", however, seems to be either a hasty conclusion or based on an, in practice, unnecessarily broad definition of such identifiers. -- Oskar Sandberg oskar@freenetproject.org From alk at pobox.com Mon Jan 21 10:30:02 2002 From: alk at pobox.com (Tony Kimball) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <20020121185455.P771@sandbergs.org> Message-ID: <15436.45529.442686.597489@gargle.gargle.HOWL> Quoth Oskar Sandberg on Monday, 21 January: : Saying that there can be no effective sorting of "semantic : identifiers", however, seems to be either a hasty conclusion or based on : an, in pratice, unecessarily broad defenition of such identifiers. For any given natural language, it's also (arguably) falsifiable, by counter-example. Consider Roget's Thesaurus. From jbone at jump.net Mon Jan 21 10:36:01 2002 From: jbone at jump.net (Jeff Bone) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Names vs. Identifiers, Search vs.
Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <20020121185455.P771@sandbergs.org> Message-ID: <3C4C5EF9.946A1E05@jump.net> Oskar Sandberg wrote: > To get around this, one could simply include the total list of keywords > for every index entry - then to do an intersection search one would only > need to search for any one of the words. This makes the index entries > quite large if every word is indexed, but most writing uses only a > couple of thousand words so it is not impossible (and in pratice one can > exclude most of /usr/dict/words without much loss to usability). Hmmmm... first problem: a single query for a single keyword could then return a list of all objects for which that keyword applied, along with all the other keywords that describe each of those objects. But this is likely to result in a very large result set in the general case, and the requestor is then responsible for performing the full intersection. Okay, that's easily solved: simply have the requestor supply the full set of keywords to the index node for a given keyword, and have that node do the intersection before returning the result set. What we've now described does in fact begin to resemble the Google architecture. [1] A couple of things to point out: we're no longer directly exploiting the network topology, and this doesn't look like "routing" really. We're no longer passing the query around widely according to some mapping of keywords onto nodes and feeding back results as we go; rather, a single query goes to a single server for its results. (The main point of the original message was that "searching" in a routed fashion doesn't necessarily and in current practice does not cost-effectively exploit topology, while "locating" does. 
It's a mathematical implication of the fact that semantic names result in sets of partial or complete results, and those sets grow rapidly. I think. :-) There are still potential problems with the amount of insert traffic, but this is probably manageable for read-mostly applications. Key questions to resolve: how to maintain a reliable, complete set of keyword-serving nodes? How to maintain consistency among the nodes serving a particular keyword, or is that even necessary? It's possible that topology and routing-like behavior on insert might be useful in answering those questions. > I won't venture to guess whether or not this is so, but I must say you > have wandered quite from any mathematical argument for this position, > which I feel is necessary if one is to claim that something holds > fundamentally. Absolutely true. :-) I'm mostly just thinking out loud and looking for feedback; I haven't quite fully formed my mental model of the problem yet, though I've got a whiteboard full of currently half-ass equations as placeholders. ;-) I may push this forward some to try to formalize the argument... if so, more later. jb [1] http://www-db.stanford.edu/~backrub/google.html From gojomo at bitzi.com Mon Jan 21 10:58:02 2002 From: gojomo at bitzi.com (Gordon Mohr) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Names vs. Identifiers, Search vs.
Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> Message-ID: <00a001c1a2ac$ad4b6a80$1fc77940@golden> Jeff Bone writes: > Gordon Mohr wrote: > > > > *NO* distributed search mechanism --- mechanism for resolving semantic names > > > into sets of objects that match the name --- can do the same thing without > > > (a) a full walk of all nodes in the network, or (b) complete, > > > dynamically-consistent replication of the index to all nodes. For a > > > comprehensive, "correct" distributed search, amount of network traffic is > > > unavoidably proportional to the size of the network in a way that isn't true > > > for lookup of identifiers. > > > > This does not follow; consider the common example of "semantic names" that are strings of keywords. > > Which thing are you saying doesn't follow? (I did realize after the fact that I wasn't particularly accurate in the above paragraph, so > the claim looks larger than intended.) A distributed search mechanism for resolving "semantic names" to objects CAN give a definitive answer as to whether matching objects exist in the system, in O(log n) steps. Neither a full walk nor complete index replication is necessary. > > To insert an object, insert a > > pointer to it at every index-node responsible for any one of the keywords. Yes, this takes extra insert steps (by a constant factor: > > avg keyword count per object), but then a query for any string of keywords will give a definitive answer about the presence or > > absence of matching object(s) in O(log n) hops, just as with the "identifier" case. > > This will work --- but it's still not practical for the most general case.
In this scheme, the number of messages for a particular > query is proportional to the specificity of the query in total number of keywords given; each message (fragment of the query with a > single keyword) might even only travel O(1) hops (assuming the partial index node for a keyword can be directly computed and the node > directly contacted) but you have the undesirable consequence that you've got to do more of them the more information you're giving about > the object you desire. Message count rises with the "fineness" of the partitioning scheme and the specificity of the query. So what? That's more traffic by a constant factor -- the number of keywords. The traffic as a function of network size is still O(log n). > Further, the each partial result specifies a set of objects matching a single keyword; the actual answer is found by intersecting the > partial result sets. As Oskar points out (on p2p-hackers followup), if you put the object's full keyword list at each node that tracks any one of the keywords, you don't need to compute the intersection at the querying client. (Actually, if you assume an ordering of the keywords, you need only put the "subsequent" keywords at each node. That is, with keywords X Y Z, you put X Y Z at node(X), Y Z at node(Y), and Z at node(Z). I also suspect that if the ordering can be reverse-frequency, the average size of the per-object keyword lists at index nodes can be way less than half the average keyword list size.) > Hence while it's true that there are other strategies (besides > full walk / total replication of index) it remains the case that for a comprehensive, "correct" distributed search, amount of network > traffic unavoidably grows in a way that isn't true for lookup of identifiers. The factors which impact this are size of the network, No, this one factor is your still-incorrect claim. Both the "identifier" and "keyword" cases grow with network size in the exact same way, bounded by O(log n).
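[Gordon's "subsequent keywords" refinement in the parenthetical above can be sketched as follows. One detail is my inference rather than something spelled out in the message: under this layout a query is served by the node of its earliest keyword in the global ordering, since that node's stored suffixes still contain every later query keyword. Plain `sorted()` stands in for the (ideally reverse-frequency) global ordering; all names are illustrative.]

```python
# Sketch of the suffix-list refinement above: with a fixed global ordering
# of keywords, node(k) stores, for each object, only the keywords at or
# after k in that ordering.  With keywords X Y Z: node(X) gets [X, Y, Z],
# node(Y) gets [Y, Z], node(Z) gets [Z].

from collections import defaultdict

index_nodes = defaultdict(dict)   # keyword -> {object_id: suffix of keyword list}

def insert(object_id, keywords):
    ordered = sorted(set(keywords))              # stand-in for the global ordering
    for i, kw in enumerate(ordered):
        index_nodes[kw][object_id] = ordered[i:]  # only "subsequent" keywords

def search(query_keywords):
    ordered = sorted(set(query_keywords))
    node = index_nodes[ordered[0]]               # node of the earliest keyword
    return {oid for oid, suffix in node.items()
            if set(ordered) <= set(suffix)}

insert("doc1", ["x", "y", "z"])
insert("doc2", ["y", "z"])
print(search(["y", "z"]))    # matches both doc1 and doc2
print(search(["x", "z"]))    # -> {'doc1'}
```

[Total index storage per object drops from k entries of size k to k + (k-1) + ... + 1, i.e. roughly half, as Gordon estimates.]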
That the (insert) traffic and (index) memory usage in the "keyword" case are both larger by constant factors does not indicate an extra traffic growth factor. > size of data / metadata, partitioning scheme for distributing the indices, the "language of discourse" used for keywords, etc.. Yes, those are all independent factors affecting the level of traffic. > You can > do tradeoffs and optimize costs, but for general keyword systems / search distribution inevitably increases the amount of network > traffic --- and for truly general systems of large size, fine distribution quickly becomes impractical. I would again offer Google as a counterexample: a general indexing system of large size capable of very fine distinction offering practical -- indeed impressive -- performance. - Gordon From jbone at jump.net Mon Jan 21 11:00:01 2002 From: jbone at jump.net (Jeff Bone) Date: Sat Dec 9 22:11:44 2006 Subject: [decentralization] Re: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <20020121185455.P771@sandbergs.org> <15436.45529.442686.597489@gargle.gargle.HOWL> Message-ID: <3C4C6461.4CCDA78@jump.net> Tony Kimball wrote: > Quoth Oskar Sandberg on Monday, 21 January: > : Saying that there can be no effective sorting of "semantic > : identifiers", however, seems to be either a hasty conclusion or based on > : an, in pratice, unecessarily broad defenition of such identifiers. > > For any given natural language, it's also (arguably) falsifiable, by > counter-example. Consider Roget's Thesaurus. Oskar's point is valid: the claim that there is no effective sorting of i.e. keywords is trivially false. A thesaurus doesn't, however, prove that this can be reasonably exploited in the domain in question.
A thesaurus only maps keywords onto other objects, specifically other keywords; it does not map keywords into lists of compound names that can be formed with such keywords or other useful objects. Using Gordon's / Oskar's evolved scheme, we can write an equation that describes the amount of insert traffic it generates. The scheme for insert is:

(1) For an object (document), generate its keyword list
(2) Map each keyword to the node(s) that index that keyword
(3) Insert a record into each of these nodes with the id and keyword list

The equation for this would be:

    amount of insert traffic ~= r * k^2

where
    r = the degree of replication for index nodes, the # of nodes indexing a given keyword
    k = the average size of keyword lists for objects indexed
        (k < size of object, and "compressible" (wordIDs instead of words))

This insert traffic might or might not be acceptable for a given domain of objects. Pushing on, jb From zooko at zooko.com Mon Jan 21 11:11:01 2002 From: zooko at zooko.com (Zooko) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Re: Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization In-Reply-To: Message from "Gordon Mohr" of "Mon, 21 Jan 2002 10:51:42 PST." <00a001c1a2ac$ad4b6a80$1fc77940@golden> References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <00a001c1a2ac$ad4b6a80$1fc77940@golden> Message-ID: Folks: It sounds to me like some of the ideas about remote lookup of "semantic names" are implicitly assuming that all the nodes are "honest" cooperators. This is fine as long as only good people use your system, or as long as the system is under a single shared authority scope, and that authority employs plenty of administrators to keep everyone in line.
If you include the possibility of participants that are malicious, selfish, or arbitrarily badly confused, then this calls into question the mere possibility, not to mention the resource expenditure, of remote lookup of "semantic names". On the other hand it has no effect on the possibility, and hardly any effect on the expense, of remote lookup of "self-authenticating names", which is why I try to use the latter kind wherever possible. I'm aware that there are lots of smart people working on the problem of remote lookup of semantic names in the presence of potentially malicious participants, but the only solutions that have been proven to work are, well, "Byzantine" in their complexity, and are only reliable when a certain number of nodes are known to be honest. Regards, Zooko --- zooko.com Security and Distributed Systems Engineering --- From jbone at jump.net Mon Jan 21 11:14:02 2002 From: jbone at jump.net (Jeff Bone) Date: Sat Dec 9 22:11:44 2006 Subject: [decentralization] Re: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <00a001c1a2ac$ad4b6a80$1fc77940@golden> Message-ID: <3C4C67BB.C97242F2@jump.net> Gordon Mohr wrote: > I would again offer Google as a counterexample: a general indexing > system of large size capable of very fine distinction offering > practical -- indeed impressive -- performance. Take a look at any of the various write-ups of Google's architecture. We're now arguing semantics, but to call their distribution / partitioning scheme for indices "fine" is a bit of a mistake --- they certainly look more centralized than e.g. Gnutella. It's not clear to me that increased distribution and partitioning of their indices results in better performance or constant network traffic. 
jb From jbone at jump.net Mon Jan 21 11:18:02 2002 From: jbone at jump.net (Jeff Bone) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <00a001c1a2ac$ad4b6a80$1fc77940@golden> Message-ID: <3C4C68C9.9D4BA0EE@jump.net> Gordon Mohr wrote: > That the (insert) traffic and (index) > memory usage in the "keyword" case are both larger by constant > factors does not indicate an extra traffic growth factor. Nitty point: we should be careful about how we're measuring traffic, here. There are two measures of interest: number of messages generated for either an insert or query operation, and the size of the messages for either of those. Insert traffic by the second measure --- size of the insert --- does indeed grow with fine distribution in the keyword case. jb From gojomo at usa.net Mon Jan 21 11:37:01 2002 From: gojomo at usa.net (Gordon Mohr) Date: Sat Dec 9 22:11:44 2006 Subject: [decentralization] Re: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <00a001c1a2ac$ad4b6a80$1fc77940@golden> <3C4C68C9.9D4BA0EE@jump.net> Message-ID: <010201c1a2b2$e76e2bc0$1fc77940@golden> Jeff Bone writes: > Gordon Mohr wrote: > > > That the (insert) traffic and (index) > > memory usage in the "keyword" case are both larger by constant > > factors does not indicate an extra traffic growth factor. > > Nitty point: we should be careful about how we're measuring traffic, > here. 
There are two measures of interest: number of messages generated > for either an insert or query operation, and the size of the messages for > either of those. Insert traffic by the second measure --- size of the > insert --- does indeed grow with fine distribution in the keyword case. Yes, traffic is greater. But once more: as a function of network size, the identifier and keyword cases face the exact same sort of logarithmic traffic growth. You previously claimed the contrary, in your initial message with this subject line, and your first followup. - Gordon From oskar at freenetproject.org Mon Jan 21 12:14:01 2002 From: oskar at freenetproject.org (Oskar Sandberg) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization In-Reply-To: <3C4C5EF9.946A1E05@jump.net>; from jbone@jump.net on Mon, Jan 21, 2002 at 12:33:29PM -0600 References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <20020121185455.P771@sandbergs.org> <3C4C5EF9.946A1E05@jump.net> Message-ID: <20020121211349.Q771@sandbergs.org> On Mon, Jan 21, 2002 at 12:33:29PM -0600, Jeff Bone wrote: <> > What we've now described does in fact begin to resemble the Google architecture. [1] A couple of things to point out: we're no longer > directly exploiting the network topology, and this doesn't look like "routing" really. We're no longer passing the query around widely > according to some mapping of keywords onto nodes and feeding back results as we go; rather, a single query goes to a single server for its > results. In essence, what this scheme does is reduce the problem of searching to that of locating. By building identifiers based on a keyword from each inclusive term in the query, we can, through any locating procedure, identify the nodes that can provide the search results.
It is no different from the mapping of identifier to data, except that the resource at the located node is not bandwidth and storage but the capacity to process our query. So from one perspective, one might say that we are not actually using the distributed nature for the search, but we do have a scheme for searching that is in fact distributed, functional, scalable, and deterministic (given that the locating procedure used has those properties). It is not what I would describe as an elegant scheme, and it is questionable whether the constant multiplier given by the number of keywords (as you note) is not great enough to offset the scalability advantage over semi-centralized schemes like supernode or Power-law distribution networks for realistic sizes and traffic levels (the world, of course, is not always asymptotic), but it does establish that such a system is possible. Of course, the search procedure is only (the lesser) half of the problem with semantic lookups. As Zooko noted, the lack of good ways to verify results is a huge problem in any distributed search system (and indeed in centralized searches as well, though Google's policy of considering links endorsements seems to work quite well). > (The main point of the original message was that "searching" in a routed fashion doesn't necessarily and in current practice does > not cost-effectively exploit topology, while "locating" does. It's a mathematical implication of the fact that semantic names result in sets > of partial or complete results, and those sets grow rapidly. I think. :-) There are still potential problems with the amount of insert > traffic, but this is probably manageable for read-mostly applications. Key questions to resolve: how to maintain a reliable, complete set of > keyword-serving nodes? How to maintain consistency among the nodes serving a particular keyword, or is that even necessary? 
It's possible > that topology and routing-like behavior on insert might be useful in answering those questions. In fact though, the problem of maintaining reliability and consistency among the peers is equally a problem in identifier lookups. It is my standing criticism of schemes like Chord and Tapestry (Oceanstore) that they turn from very elegant beauties on paper into hideous beasts in practice when attempting to force deterministic and consistent properties on an inherently inconsistent and unreliable set of peers. <> -- Oskar Sandberg oskar@freenetproject.org From jbone at jump.net Mon Jan 21 12:24:02 2002 From: jbone at jump.net (Jeff Bone) Date: Sat Dec 9 22:11:44 2006 Subject: [decentralization] Re: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <00a001c1a2ac$ad4b6a80$1fc77940@golden> <3C4C68C9.9D4BA0EE@jump.net> <010201c1a2b2$e76e2bc0$1fc77940@golden> Message-ID: <3C4C786F.DB75EC09@jump.net> Gordon Mohr wrote: > But once more: as a function of network size, the identifier and > keyword cases face the same exact same sort of logarithmic traffic > growth. I agree. > You previously claimed the contrary, in your initial message with > this subject line, and your first followup. So what's your point with this? This is an evolving conversation, not an effort at textbook-writing. Anything offered is offered hypothetically. jb From oskar at freenetproject.org Mon Jan 21 12:30:01 2002 From: oskar at freenetproject.org (Oskar Sandberg) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Names vs. Identifiers, Search vs.
Location was re: Costs of Decentralization In-Reply-To: <15436.45529.442686.597489@gargle.gargle.HOWL>; from alk@pobox.com on Mon, Jan 21, 2002 at 06:27:05PM -0600 References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <20020121185455.P771@sandbergs.org> <15436.45529.442686.597489@gargle.gargle.HOWL> Message-ID: <20020121212922.R771@sandbergs.org> On Mon, Jan 21, 2002 at 06:27:05PM -0600, Tony Kimball wrote: > Quoth Oskar Sandberg on Monday, 21 January: > : Saying that there can be no effective sorting of "semantic > : identifiers", however, seems to be either a hasty conclusion or based on > : an, in pratice, unecessarily broad defenition of such identifiers. > > For any given natural language, it's also (arguably) falsifiable, by > counter-example. Consider Roget's Thesaurus. I'm not a linguist, but off hand I would consider a thesaurus an example of sorting on natural vocabulary - natural language would have to include concepts described by sentences and phrases, which seem far from clearly sortable to me. Of course, the combination of keywords with boolean operations is in fact an attempt to simplify "semantic identifiers" from natural language to vocabulary and a few simple rules. -- Oskar Sandberg oskar@freenetproject.org From gojomo at usa.net Mon Jan 21 12:42:01 2002 From: gojomo at usa.net (Gordon Mohr) Date: Sat Dec 9 22:11:44 2006 Subject: [decentralization] Re: [p2p-hackers] Names vs. Identifiers, Search vs. 
Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <00a001c1a2ac$ad4b6a80$1fc77940@golden> <3C4C68C9.9D4BA0EE@jump.net> <010201c1a2b2$e76e2bc0$1fc77940@golden> <3C4C786F.DB75EC09@jump.net> Message-ID: <001201c1a2bb$f301f3a0$1fc77940@golden> Jeff Bone writes: > Gordon Mohr wrote: > > > But once more: as a function of network size, the identifier and > > keyword cases face the same exact same sort of logarithmic traffic > > growth. > > I agree. Great! You had danced around this matter so much that I wasn't sure we had reached agreement. > > You previously claimed the contrary, in your initial message with > > this subject line, and your first followup. > > So what's your point with this? This is an evolving conversation, not > an effort at textbook-writing. Anything offered is offered > hypothetically. Actually, your initial message had more the tone of a lecture than thinking-out-loud. And as a lecture, it had a major error. I just wanted to make sure we had all reached the same page about this central point. - Gordon From jbone at jump.net Mon Jan 21 12:53:02 2002 From: jbone at jump.net (Jeff Bone) Date: Sat Dec 9 22:11:44 2006 Subject: [decentralization] Re: [p2p-hackers] Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization References: <20020117132814.A11278@blanu.net> <20020118041031.A13946@blanu.net> <001001c1a034$bbccf6c0$1fc77940@golden> <3C49D6C0.D020AB22@jump.net> <008401c1a24f$649ab0a0$1fc77940@golden> <3C4C3A98.529AB052@jump.net> <00a001c1a2ac$ad4b6a80$1fc77940@golden> <3C4C68C9.9D4BA0EE@jump.net> <010201c1a2b2$e76e2bc0$1fc77940@golden> <3C4C786F.DB75EC09@jump.net> <001201c1a2bb$f301f3a0$1fc77940@golden> Message-ID: <3C4C7F3F.52785384@jump.net> Gordon Mohr wrote: > Great! 
You had danced around this matter so much that I wasn't sure > we had reached agreement. I'm still playing with the model to determine if there are other factors that, combined with network size, don't interact to put a lower-bound on the amount of traffic generated in various different measures by various schemes. BTW, no dancing intended. > Actually, your initial message had more the tone of a lecture than > thinking-out-loud. And as a lecture, it had a major error. I just > wanted to make sure we had all reached the same page about this > central point. Thank God you exist to keep me in line, Gordon. There might be hordes of naive fools out there laboring under mission-critical misperceptions if not for your diligent editorial efforts. ;-) Who knows how many existing and future participants of both of these lists might have unskeptically read that article and falsely believed that --- in the *complete absence* of any mathematical support --- it actually asserted some universal law? (Indeed, there may be a universal law lurking in there somewhere, or there may not. I offered my note as a strawman in an attempt to find out.) $0.02, jb From cyb at azrael.dyn.cheapnet.net Mon Jan 21 12:54:01 2002 From: cyb at azrael.dyn.cheapnet.net (Brandon Wiley) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Re: Names vs. Identifiers, Search vs. Location was re: Costs of Decentralization In-Reply-To: Message-ID: > I'm aware that there are lots of smart people working on the problem of remote > lookup of semantic names in the presence of potentially malicious participants, > but the only solutions that have been proven to work are, well, "Byzantine" in > their complexity, and are only reliable when a certain number of nodes are known > to be honest. As far as I'm concerned, name lookup is a finished problem. Either a name is self-authenticating or else it is vouched for by some authority (singular or a group) or it is vulnerable to attack.
Since every scheme I've ever heard of falls into one of these 3 categories, the problem of semantic name lookup is reduced to choosing which of the three you want your names to be in. From jbone at jump.net Mon Jan 21 15:14:01 2002 From: jbone at jump.net (Jeff Bone) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Searching and Locating Strawman Restated References: Message-ID: <3C4CA025.97F4A507@jump.net> Michael Bauer wrote: > I'm not sure past, present, or future readers of this list have any idea > what this thread is about anymore. Good point! Actually, I'm not sure that the author or any of the participants have had any idea of what the thread was about at any given point in its evolution. ;-) > If there were a summary "same page" URL I for one would quite grateful. In order to avoid another gojocascade, I'll hold back on any specific assertions about this until it's ready for prime time. The summary strawman: (1) names are different from identifiers (how?) (2) "search" in name space is different from "locating" in identifier space (how?) and (3) the various quantifiable costs (number of messages, message size, latency, memory costs, compute cycles, etc.) and tradeoffs that need to be paid / made are different. Any insight into the validity of this strawman continues to be appreciated. jb From burton at openprivacy.org Tue Jan 22 13:23:02 2002 From: burton at openprivacy.org (Kevin A. Burton) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] I might be able to host someone for codecon. Message-ID: <878zaqjd20.fsf@universe.yi.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 OK. Does anyone need a place to stay during codecon? The Hotels can be kind of expensive in SF and I already live here. If you want a place to stay so that you can save $ let me know. I am fairly sure I can convince my roommates. :) Kevin P.S. No Microsoft employees or FBI Agents :) - -- Kevin A.
Burton ( burton@apache.org, burton@openprivacy.org, burtonator@acm.org ) Location - San Francisco, CA, Cell - 415.595.9965 Jabber - burtonator@jabber.org, Web - http://relativity.yi.org/ How are you gentleman? -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: Get my public key at: http://relativity.yi.org/pgpkey.txt iD8DBQE8TdgnAwM6xb2dfE0RAq16AKCtg8yXDKqwTD1J5vJBh37a48GSOgCfbxSt vMKNo+WNuQ0rrkNdEWJ3slc= =RiLB -----END PGP SIGNATURE----- From jbone at jump.net Sun Jan 27 16:24:01 2002 From: jbone at jump.net (Jeff Bone) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Centralized Versus Decentralized Indexing Message-ID: <3C549976.DC380E62@jump.net> Found a few interesting resources over the last week or so re: quantifying the many and different costs of decentralized indexes for non-specific (semantic name) searches. Additional pointers welcome. http://citeseer.nj.nec.com/basu97performance.html http://citeseer.nj.nec.com/40188.html http://dblp.uni-trier.de/db/journals/tkde/LiebeherrOA93.html http://www.soi.city.ac.uk/~andym jb From greg at electricrain.com Wed Jan 30 12:32:02 2002 From: greg at electricrain.com (Gregory P. Smith) Date: Sat Dec 9 22:11:44 2006 Subject: [p2p-hackers] Fwd: Stanford Networking Seminar, Thu 1/31, Pei Cao Message-ID: <20020130123108.A8854@zot.electricrain.com> Stanford Networking Seminar When: 12:45PM, Thursday, January 31st, 2002 Where: Room 104, Gates Computer Science Building URL: http://netseminar.stanford.edu/sessions/2002-01-31.html ----------------------------------------------------------------- Title: Search and Replication in Unstructured Peer-to-Peer Networks Speaker: Pei Cao Cisco Systems, Inc. Abstract: File sharing in Peer-to-Peer (P2P) networks is a popular application on the Internet today. In this talk, we briefly survey architectures of existing P2P systems, then focus on decentralized and unstructured networks such as Gnutella.
We study two aspects of the system: file search efficiency and replication strategies. We show that the simple flooding-based search algorithm scales poorly, especially for power-law random graphs. A multiple-walker random-walk search algorithm can improve upon the simple flooding by two orders of magnitude. We also show that, in unstructured networks, an object's replication ratio should be proportional to the square root of its popularity in order to minimize overall network search traffic. With simulations, we show that the path replication strategy, such as the one used in FreeNet, leads to close-to-optimal replication ratios. This project is joint work with Christine Lv, Edith Cohen, Kai Li and Scott Shenker. Bio: Pei Cao is currently a system architect at Cisco Systems. Prior to that she was an assistant professor in the CS Dept. at Univ. Wisconsin-Madison. Her research interests include operating systems, Web caching and content delivery, and storage systems. Notes: Lunch will be available at 12:15. A vegetarian selection will be available. No drinks will be provided. The talk itself will begin at 12:45 +----------------------------------------------------------------------------+ | This message was sent via the Stanford Computer Science Department | | colloquium mailing list. To be added to this list send an arbitrary | | message to colloq-subscribe@cs.stanford.edu. To be removed from this list,| | send a message to colloq-unsubscribe@cs.stanford.edu. For more information,| | send an arbitrary message to colloq-request@cs.stanford.edu. For directions| | to Stanford, check out http://www-forum.stanford.edu | +----------------------------------------------------------------------------+ -- Gregory P. Smith
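[The square-root replication rule in Pei Cao's abstract above can be illustrated with a small calculation. This is only a sketch of the stated result, not code from the paper; the query rates and the replication budget below are made-up numbers.]

```python
# Illustrative sketch of the square-root replication rule from the seminar
# abstract above: to minimize expected search traffic in an unstructured
# network, replicate object i in proportion to sqrt(q_i), where q_i is its
# query rate.  All figures below are hypothetical.
import math

query_rates = {"popular": 100.0, "middling": 25.0, "rare": 1.0}
total_copies = 90   # total replication budget across all objects

sqrt_weights = {k: math.sqrt(q) for k, q in query_rates.items()}
norm = sum(sqrt_weights.values())
copies = {k: total_copies * w / norm for k, w in sqrt_weights.items()}

# Note the sub-linear allocation: an object queried 100x more often than
# another gets only 10x the copies (sqrt(100) / sqrt(1) = 10).
for name, c in sorted(copies.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {c:.1f} copies")
```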