[5248] in www-talk@info.cern.ch

Re: Caching Servers Considered Harmful (was: Re: Finger URL)

daemon@ATHENA.MIT.EDU (Chris Lilley, Computer Graphics Un)
Mon Aug 22 17:24:32 1994

Date: Mon, 22 Aug 1994 23:19:19 +0200
Errors-To: listmaster@www0.cern.ch
Reply-To: lilley@v5.cgu.mcc.ac.uk
From: lilley@v5.cgu.mcc.ac.uk (Chris Lilley, Computer Graphics Unit)
To: Multiple recipients of list <www-talk@www0.cern.ch>

In message <199408221645.JAA01900@rock> John Labovitz said:

> One solution would be for caching servers to generate
> a summary of hits on URLs `belonging' to particular 
> servers, and to email that summary to a standard 
> email address at those servers.  

Well, great minds think alike! (Or as my mum would say, weak ones seldom 
differ.) In my response to Rob's statement, I proposed something similar.

> So even though we 
> at GNN may not receive the level of detail that we 
> get from our own logs (timestamp, hostnames, URLs), 
> we could at least receive from the caching servers 
> an approximation which we could integrate into our 
> reports back to our advertisers.

My suggestion would give you precisely the same level of detail, i.e. Common 
Log Format entries, just with the updates lagging a little.
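
As a rough sketch of what I mean (not an existing feature of any proxy): the 
cache could sort its own access log by origin host and ship each server its 
batch. This assumes the proxy log is in Common Log Format with the full URL in 
the request field; the function name and the example hosts are made up.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_hits_by_origin(proxy_log_lines):
    """Group proxy log lines by the origin host named in the request URL.

    Assumes Common Log Format lines whose quoted request field looks like
    'GET http://host/path HTTP/1.0' (a hypothetical proxy log layout).
    """
    batches = defaultdict(list)
    for line in proxy_log_lines:
        try:
            request = line.split('"')[1]   # the quoted request field
            url = request.split()[1]       # method, URL, protocol
        except IndexError:
            continue                       # skip lines we cannot parse
        host = urlsplit(url).hostname
        if host:                           # relative URLs carry no host
            batches[host].append(line)
    return dict(batches)
```

Each batch could then be mailed or posted to its origin server at whatever 
interval suits, which is where the lag comes from.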

> In Neil Smith's paper `What can Archives offer the World 
> Wide Web,' there's a table (fig. 7) that lists `the most 
> popular remote sites accessed via the UNIX HENSA cache.'

I refuse to use the HENSA cache for precisely that reason. Well, that and the 
need to edit URLs to meet their syntax. CERN proxy caching, on the other hand, 
is a different matter, as well as being transparent to the user.

> Here's
> a real-life ramification of caching: for those using 
> the HENSA server, our daily Dilbert comic strip is 
> available only once every two weeks.)

I agree that this is pathetic. But it is a consequence of using the HENSA cache 
(or indeed the Lagoon Cache) rather than caching in general.

There is nothing stopping you putting an Expires date on each Dilbert GIF so it 
lasts 24 hours. Proxy caches could (should) query your server, if reachable, to 
see if current.gif or whatever has changed since the cached copy was fetched.

Actually, I just tried this, using our nearest GNN server at Ireland Online. 
Doing a HEAD on /gnn/arcade/comix/dilbert.html told me that the enclosing HTML 
file last changed on Saturday 09-Jul-94 - fair enough, it's just trimmings, 
warning notices and icons that don't change. Doing a conditional GET on 
/gnn/arcade/comix/graphics/Dilbert.gif with the current date and time gave me a 
304 Not Modified with a Last-Modified of Sunday, 21-Aug-94 03:59:00 GMT (but no 
Expires: header).

So all/most of the mechanisms are in place for a proxy cache to negotiate to 
serve the current Dilbert while keeping the GNN server lightly loaded.
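
To make the negotiation concrete, here is a minimal sketch of the decision a 
proxy might take for each request. The cache-entry layout (a dict with 
'expires' and 'last_modified' keys) and the function names are my own 
assumptions, not how the CERN proxy actually does it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def plan_request(cached, now):
    """Decide how to satisfy a request given a cache entry (or None).

    cached is a hypothetical dict with optional 'expires' and
    'last_modified' values, both timezone-aware UTC datetimes.
    Returns (action, extra_request_headers).
    """
    if cached is None:
        return ("GET", {})                 # nothing cached: plain fetch
    expires = cached.get("expires")
    if expires is not None and now < expires:
        return ("SERVE_FROM_CACHE", {})    # still fresh: no network traffic
    last_modified = cached.get("last_modified")
    if last_modified is not None:
        # Stale or no Expires: ask the origin only for changes.
        stamp = format_datetime(last_modified, usegmt=True)
        return ("GET", {"If-Modified-Since": stamp})
    return ("GET", {})


def handle_response(status, cached_body, new_body):
    """A 304 means the cached copy is current; otherwise take the new body."""
    return cached_body if status == 304 else new_body
```

With no Expires: header, as on the Dilbert GIF above, every hit costs one small 
conditional request but the image itself only travels when it has changed.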

Incidentally, how do ORA handle cache coherency among their various servers? 
What is the latency in propagating new files - i.e., when the main server gets 
an updated file, how long (in minutes, hours etc.) before the servers in Russia 
and Japan are serving the same information from the same relative path?

Thinking about what I just said (it happens sometimes ;-) ) _cache_coherency_. 
Of course. There is a whole existing literature on cache coherency from the 
whizzy computer world. (And me sitting just yards from a KSR64, which is 
essentially 64 processors with coherent caches pretending to be one global 
address space). So using that model, we have a multi-cache, multi-processor 
setup. Our caches are fully associative - no sets or lines to worry about. And 
in that setup, to guarantee that processor B sees the correct things from a 
particular memory location, it does not go round asking all the other processors 
if they are working there and have a more up to date value in their caches or in 
the registers. No, it is up to the originating processor to send out cache 
invalidation signals to the others.

Chew on this. Suppose proxy caches identify themselves to servers (there has to 
be a better way than user agent fields). When the originating server updates a 
document it sends out invalidating signals to the proxy caches that grabbed that 
document during its non-expired time. (Erm ... so if the document has an expiry 
of one week, the server need not tell a proxy that grabbed the document with 
that name two years ago).

The proxy caches invalidate the cache entry, and either proactively get the new 
version or, more likely, wait for the next hit to come in, depending on how 
they are set up, whether they are clever enough to have analysed their own 
usage patterns, and so on.

Cascaded caches send out invalidation signals to the caches that accessed them, 
and so on.
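
A toy model of the scheme, to show the bookkeeping involved. Everything here is 
hypothetical: the Node class, the in-memory stores, the integer clock, and the 
idea that the requester passes in a ttl (in practice it would come from the 
Expires header). A real version would need some wire protocol for the 
invalidation signal, which does not exist yet.

```python
class Node:
    """An origin server or a proxy cache (illustrative names throughout)."""

    def __init__(self, name, upstream=None):
        self.name = name
        self.upstream = upstream   # None for the origin server
        self.store = {}            # path -> document body
        self.subscribers = {}      # path -> {downstream cache: expiry time}

    def fetch(self, path, requester, now, ttl):
        """Return the document; remember which cache took it and until when."""
        if path not in self.store and self.upstream is not None:
            self.store[path] = self.upstream.fetch(path, self, now, ttl)
        if requester is not None:  # end users (None) are not tracked
            self.subscribers.setdefault(path, {})[requester] = now + ttl
        return self.store[path]

    def update(self, path, body, now):
        """Origin side: replace the document, then signal interested caches."""
        self.store[path] = body
        self._notify(path, now)

    def invalidate(self, path, now):
        """Cache side: drop our copy and cascade to caches that used us."""
        self.store.pop(path, None)
        self._notify(path, now)

    def _notify(self, path, now):
        for cache, expiry in self.subscribers.pop(path, {}).items():
            if expiry > now:       # copies already expired need no signal
                cache.invalidate(path, now)
```

This version takes the lazy option: an invalidated cache simply refetches on 
the next hit. Note the cost to the origin is a subscriber table per document, 
bounded by the expiry time.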

Comments?

--
Chris
