From: Eugene Leitl (eugene.leitl@lrz.uni-muenchen.de)
Date: Tue Dec 21 1999 - 14:35:45 MST
Trouble is:
1) web growth outpaces the spiders' scan rate
2) most of the web is database-driven, and is not a document tree
To keep the web current and well indexed, each web site should create a
site index, keep it up to date, and notify the spider to pick it up when
the accumulated changes are significant. This is also possible with
database-driven sites, which can mirror part or all of their
database into a virtual document tree. This eliminates periodic spider
scans, reduces the system load (Euroferret crashed my box once) and
vastly reduces the network bandwidth used (full-text indices are damn
small). This can be expanded into a fully distributed database scheme,
with nodes a few hops apart mutually exchanging their indexes and
forming a huge distributed search engine.
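
To make the idea concrete, here is a rough sketch of the site-side part:
a full-text index rebuilt on each document submission, a counter that
says when the changes are "significant" enough to notify the spider, and
a merge step for nodes exchanging indexes. Everything named here
(SiteIndexer, notify_threshold, merge_indexes) is made up for
illustration, and the change counter is only one possible reading of
"significant"; a real system would likely update the index incrementally
rather than rebuild it.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the sorted list of paths that contain it."""
    index = defaultdict(set)
    for path, text in documents.items():
        for term in tokenize(text):
            index[term].add(path)
    return {term: sorted(paths) for term, paths in index.items()}


class SiteIndexer:
    """Keeps a site index current and signals when the spider should refetch."""

    def __init__(self, notify_threshold=3):
        self.documents = {}
        self.index = {}
        self.pending_changes = 0
        self.notify_threshold = notify_threshold  # hypothetical tuning knob

    def submit(self, path, text):
        """Add or update a document; return True if the spider should be notified."""
        if self.documents.get(path) == text:
            return False  # unchanged content does not count as a change
        self.documents[path] = text
        self.index = build_index(self.documents)  # naive full rebuild
        self.pending_changes += 1
        if self.pending_changes >= self.notify_threshold:
            self.pending_changes = 0
            return True
        return False


def merge_indexes(local_site, local_index, peer_indexes):
    """Combine this site's index with peers' into term -> [(site, path), ...]."""
    merged = defaultdict(list)
    for site, index in [(local_site, local_index)] + sorted(peer_indexes.items()):
        for term, paths in index.items():
            merged[term].extend((site, path) for path in paths)
    return dict(merged)
```

A node would call submit() from its document submission hook and ping
the spider (or its neighbour nodes) whenever it returns True.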
To get this started, someone should build webglimpse or something
similar into Apache, including a document submission mechanism that
triggers reindexing. As a side effect this gives every site a native
search capability. Establish a naming convention for the index file,
something like index.txt, and a notification mechanism, and once that
particular Apache derivative is widespread enough, get Google, Altavista
et al. to make use of that functionality.
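
As an illustration of what such an index.txt might hold, here is a
sketch of one possible layout: one line per term, a tab, then the
space-separated paths containing it. The layout is my own assumption
(nothing like it is standardized, and it is not webglimpse's format),
and it presumes paths contain no whitespace. The search() helper shows
how a spider could answer queries from the collected files alone,
without crawling the sites.

```python
def dump_index(index):
    """Serialize term -> [paths] as one 'term<TAB>path path ...' line per term."""
    lines = []
    for term in sorted(index):
        lines.append(term + "\t" + " ".join(index[term]))
    return "\n".join(lines) + "\n"


def load_index(text):
    """Parse the format written by dump_index back into a dict."""
    index = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        term, _, paths = line.partition("\t")
        index[term] = paths.split()
    return index


def search(site_indexes, term):
    """Return full URLs matching a term across all collected site indexes."""
    hits = []
    for site in sorted(site_indexes):
        for path in site_indexes[site].get(term.lower(), []):
            hits.append("http://" + site + path)
    return hits
```

The spider side then reduces to fetching index.txt after a notification
and calling load_index() on it.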
Bryan Moss writes:
> Brian Atkins wrote:
>
> > eBay is attempting to ward off centralized auction search
> > sites by blocking those sites from accessing its servers.
> > It is quite easy for a petfood.com or whatever to block
> > your search bot and not allow it to index its content. So
> > in effect they can try to make people still come directly
> > to their site or, as eBay is trying to do, force the search
> > bot operators to license the content.
>
> I think good search bots will replace the URL field as the
> standard way of navigating the web. In which case dot-coms
> that deter search bots would only succeed in losing
> customers. eBay's got nothing you can't do with XML and a
> good search engine anyway.
>
> BM
>
>