[7426] in www-talk@info.cern.ch


Re: Redundancy in links, Davenport Proposal [long]

daemon@ATHENA.MIT.EDU (Jim Davis)
Mon Jan 30 11:54:07 1995

Date: Mon, 30 Jan 1995 17:07:23 +0100
Errors-To: listmaster@www0.cern.ch
Reply-To: davis@DRI.cornell.edu
From: Jim Davis <davis@DRI.cornell.edu>
To: Multiple recipients of list <www-talk@www0.cern.ch>

   Date: Sat, 28 Jan 1995 01:03:10 +0100
   From: "Daniel W. Connolly" <connolly@hal.com>


   ...From the evidence that I have studied, the way to make links more
   reliable is not to deploy some new centralized namespace (ala URNs
   with publisher id's), but to put more redundant info in links.

   Rather than looking at the web as documents addressed by an
   identifier, I think we should look at it as a great big
   content-addressable-memory.  "Give me the document written by Fred in
   1992 whose title is 'authentication in distributed systems'."

   I think the same sort of thing that makes for a high-quality citation
   in written materials will make for a reliable link in a distributed
   hypermedia system. A robust _link_ should look like a BibTeX entry
   (MARC record, etc.)


   strategy to increase the quality of service in information retrieval.


   Theory of Operation
   ===================

   The body of information offered by these vendors can be regarded as a
   sort of distributed relational database, the rows being individual
   documents (retrievable entities, to be precise), and the columns being
   attributes of those documents, such as content, publisher, author,
   title, date of publication, etc.

   The pattern of access on this database is much like many databases:
   some columns are searched, and then the relevant row is selected. This
   motivates keeping a certain portion of this data, sometimes referred
   to as "meta-data," or indexing information, highly available.
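
   A minimal sketch of this access pattern, assuming a small in-memory
   index: some metadata columns are searched, then the matching row is
   selected. The Record type, the sample rows, and the example.org
   locations are hypothetical and are not part of the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One row: metadata columns plus a locator for the content."""
    author: str
    title: str
    year: int
    location: str  # hypothetical URL of the retrievable entity


# Hypothetical sample rows, for illustration only.
INDEX = [
    Record("Fred", "Authentication in Distributed Systems", 1992,
           "http://example.org/docs/auth.ps"),
    Record("Fred", "Notes on Caching", 1993,
           "http://example.org/docs/cache.ps"),
]


def search(index, **criteria):
    """Return the rows whose columns match every criterion.

    String columns are compared case-insensitively; other columns
    are compared for equality.
    """
    def matches(rec):
        for column, wanted in criteria.items():
            actual = getattr(rec, column)
            if isinstance(actual, str):
                if actual.lower() != str(wanted).lower():
                    return False
            elif actual != wanted:
                return False
        return True

    return [rec for rec in index if matches(rec)]


# Search the author and year columns, then select the matching row.
hits = search(INDEX, author="fred", year=1992)
print(hits[0].location)
```

   Only the small metadata columns need to be highly available for the
   search step; the content is fetched afterwards from the selected
   row's location.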

   The harvest system is a natural match. Each vendor or publisher would
   operate a gatherer, which culls the indexing information from the rows
   of the database that it maintains. A harvest broker would collect the
   indexing information into an aggregate index. This gatherer/broker
   collection interaction is very efficient, and the load on a
   publisher's server would be minimal. The broker can be replicated to
   provide sufficiently high availability.
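
   The gatherer/broker split might be sketched as follows. This is an
   illustration of the division of labour only: the Gatherer and Broker
   classes and their methods are hypothetical, and the sketch does not
   reproduce Harvest's actual record formats or protocols.

```python
class Gatherer:
    """Runs at one publisher and extracts indexing records, not content."""

    def __init__(self, publisher, documents):
        self.publisher = publisher
        # Each document is a dict with 'title', 'url' and 'content' keys.
        self.documents = documents

    def summarize(self):
        # Cull only the indexing columns; the content itself stays on
        # the publisher's server.
        return [
            {"publisher": self.publisher, "title": d["title"], "url": d["url"]}
            for d in self.documents
        ]


class Broker:
    """Collects summaries from many gatherers into one aggregate index."""

    def __init__(self):
        self.index = []

    def collect(self, gatherer):
        self.index.extend(gatherer.summarize())

    def query(self, **criteria):
        # A row matches when every requested column has the given value.
        return [
            row for row in self.index
            if all(row.get(key) == value for key, value in criteria.items())
        ]


# Hypothetical publisher and document, for illustration only.
press = Gatherer("Example Press", [
    {"title": "Users Guide", "url": "http://example.org/guide",
     "content": "(full text, never sent to the broker)"},
])
broker = Broker()
broker.collect(press)
print(broker.query(title="Users Guide"))
```

   Because the broker holds only summaries, replicating it does not
   require copying the publishers' content.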

   Typically, a harvest broker exports a forms-based HTTP searching
   interface. But locating documents in the davenport database is a
   non-interactive process in this system. Ultimately, smart browsers
   can be deployed to conduct the search of the nearest broker and
   select the appropriate document automatically. But the system should
   interoperate with existing web clients.

   Hence the typical HTTP/harvest proxy will have to be modified to not
   only search the index, but also select the appropriate document and
   retrieve it. To decrease latency, a harvest cache should be collocated
   with each such proxy.

   Ideally, links would be represented in the harvest query syntax, or a
   simple s-expression syntax. (Wow! In surfing around for references, I
   just found an example of how these links could be implemented. See the
   PRDM project[2].) But since the only information passed from
   contemporary browsers to proxy servers is a URL, the query syntax will
   have to be embedded in the URL syntax.

   I'll leave the details aside for now, but for example, the query:

	   (Publisher-ISBN: 1232) AND (Title: "Microsoft Windows User Guide")
		   AND (Edition: Second)

   might be encoded as:

	   harvest://davenport?publisher-isbn=1232;title=Microsoft%20Windows%20User%20Guide;edition=Second
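
   One way such an encoding could work is sketched here. The proposal
   leaves the details open, so the rules below are assumptions:
   attribute names are lower-cased, values are percent-encoded, pairs
   are joined with ";", and the index name follows "harvest://". The
   functions encode_link and decode_link are hypothetical names.

```python
from urllib.parse import quote, unquote


def encode_link(index_name, fields):
    """Encode (attribute, value) pairs as harvest://<index>?k=v;k=v."""
    parts = [
        f"{quote(name.lower(), safe='')}={quote(value, safe='')}"
        for name, value in fields
    ]
    return f"harvest://{index_name}?" + ";".join(parts)


def decode_link(url):
    """Split a harvest: URL back into (index_name, [(attribute, value)])."""
    scheme, _, rest = url.partition(":")
    if scheme != "harvest":
        raise ValueError(f"not a harvest: URL: {url!r}")
    # Accept one or two leading slashes before the index name.
    index_name, _, query = rest.lstrip("/").partition("?")
    fields = []
    for part in query.split(";"):
        if part:
            name, _, value = part.partition("=")
            fields.append((unquote(name), unquote(value)))
    return index_name, fields


link = encode_link("davenport", [
    ("Publisher-ISBN", "1232"),
    ("Title", "Microsoft Windows User Guide"),
    ("Edition", "Second"),
])
print(link)
print(decode_link(link))
```

   Percent-encoding every reserved character in a value means that a
   title containing ";" or "=" still decodes unambiguously.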

   Each client browser is configured with the host and port of the
   nearest davenport broker/HTTP proxy. The reason for the "//davenport"
   in the above URL is that such a proxy could serve other application
   indices as well. Ultimately, browsers might implement the harvest:
   semantics natively, and the browser could use the Harvest Server
   Registry to resolve the "davenport" keyword to the address of a
   suitable broker.

   To resolve the above link, the browser client contacts the proxy and
   sends the full URL. The proxy contacts a nearby davenport broker,
   which processes the query and returns results. The proxy then selects
   any match from those results and retrieves it.

   Through careful administration of the links and the index, all the
   matches should identify replicas of the same entity, possibly on
   different ftp/http/gopher servers. An alternative to manually
   replicating the data on these various servers would be to let the
   harvest cache collocated with the broker provide high availability of
   the document content.
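
   The resolution step can be sketched as a function that asks the
   broker for matches and retrieves the first replica that answers.
   Falling back to the next replica on failure is an assumption added
   for illustration; the text says only that any match is selected. The
   names resolve, query_broker and fetch are hypothetical, and both
   callables are passed in so that no real network access is implied.

```python
def resolve(fields, query_broker, fetch):
    """Resolve a link to (url, content).

    fields       -- dict of attribute/value pairs taken from the link
    query_broker -- callable returning a list of candidate URLs
    fetch        -- callable returning the content of a URL as bytes,
                    or raising OSError if the server is unavailable
    """
    candidates = query_broker(fields)
    if not candidates:
        raise LookupError("no document matches the link")
    for url in candidates:
        try:
            # All matches are assumed to be replicas of the same
            # entity, so the first one that answers is good enough.
            return url, fetch(url)
        except OSError:
            continue
    raise LookupError(f"all {len(candidates)} replicas were unavailable")


# Hypothetical stand-ins for a broker and for retrieval.
def example_broker(fields):
    return ["ftp://a.example.org/guide", "http://b.example.org/guide"]


def example_fetch(url):
    if url.startswith("ftp://"):
        raise OSError("server unavailable")
    return b"document body"


print(resolve({"title": "User Guide"}, example_broker, example_fetch))
```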


   Security Considerations
   =======================

   The main considerations are authenticity and access control for the
   distributed database.

   Securely-obtained links (from a CD-ROM, for example) could include the
   MD5 checksum of the target document. If the target document changes, a
   digital signature providing a secure override to the MD5 could be
   transmitted in the HTTP header. Assuming the publishers' public keys
   are made available to the cache/proxies in a secure fashion, this
   would allow the cache/proxy to detect a forgery. But the link from the
   cache/proxy to the client is insecure until clients are enhanced to
   implement more of this functionality natively. At that point, the
   problem of key distribution becomes more complex.
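
   The checksum part of this scheme could look like the sketch below,
   which assumes the link carries the MD5 digest as a hexadecimal
   string. MD5 is used because the text names it. The signature-based
   override is not shown, since the proposal does not specify a
   signature format. The function names are hypothetical.

```python
import hashlib
import hmac


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


def verify_document(data: bytes, expected_md5: str) -> bool:
    """Check retrieved content against the digest carried in the link."""
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(md5_hex(data), expected_md5.lower())


# A cache/proxy would run this check before serving the document.
content = b"target document"
link_digest = md5_hex(content)  # stands in for the digest from the link
print(verify_document(content, link_digest))
print(verify_document(b"tampered document", link_digest))
```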

   This proposal does not address access control. As long as all
   information distributed over the web is public, this solution is
   complete. But over time, the publishers will expect to be able
   to control access to their information.

   If the publishers were willing to trust the cache/proxy servers to
   implement access control, I expect an access control mechanism could
   be added to this system. If the publishers are willing to allow the
   indexing information to remain public, I believe that performance
   would not suffer tremendously. The primary difficulty would be
   distributing a copy of the access control database among the proxies
   in a secure fashion.


   Conclusions
   ===========

   I believe this solution scales well in many ways. It allows the
   publishers to be responsible for the quality of the index and the
   links, while delegating the responsibility for high availability to
   broker and cache/proxy servers. The publishers could reach agreements
   with network providers to distribute those brokers among the client
   population (much like the GNN is available through various sites).

   It allows those cache/proxy servers to provide high availability to
   other applications as well as the davenport community. (The Linux
   community and the Computer Science Technical Reports community already
   operate harvest brokers.)

   The impact on clients is minimal -- a one-time configuration of the
   address of the nearest proxy. I believe that the benefits to the
   respective parties outweigh the cost of deployment, and that this
   solution is very feasible.



   [1] http://www.acl.lanl.gov/URI/archive/uri-95q1.messages/0080.html
   Sun, 22 Jan 1995 12:41:10 PST 

   [2] PRDM
   http://www-pcd.stanford.edu/ANNOT_DOC/annotations.html

   [3] http://www.research.digital.com/SRC/larch/larch-home.html

   [4] http://www.cs.utexas.edu/~qr/algernon.html

