C. Olston, J. Widom, "Best-Effort Cache Synchronization with Source Cooperation", SIGMOD 2002
Group Comments
1. general/generic approach
2. mathematical soundness
3. adaptivity and positive feeback mechanism
4. source cooperation (better prioritization and better source-side
resource utilization)
1. macro vs. micro (but, latency is not their problem)
2. single cache (central cache is not practical, hierarchy of caches?)
3. assumptions about the change rate
4. cost of maintaining priority queues
5. how does number of sources impact the result?
6. restricted usage, not realistic
1. single cache -> multiple cache
2. t*d*w - \integral(t*d*w) formula vs. t*d*w formula (Ying brings
up a discussion of why
Ti are constant)
3. How t_now is determined affects the decisions...
Jeong's Comments
The Speculations that I got while reading the paper include:
1. While the goal of the paper is to minimize the (weighted) time-averaged divergence between the sources and cached copies, the MACRO GOAL in data caching (replication) is to reduce the latency to access data from remote sources. How to seamlessly bridge these two metrics which look isolated in the paper? I agree that the paper didn't explicitly mention the access delay and actually there was no notion of it. However, if we know empirically or by some analysis the tolerance level of the object-cache divergence (e.g, as for human vision, upto 10% of image diverence is tolerable) , we will be able to identify untolerable stale copies. Moreover, in some cases, we may have to ignore the meaningless stale copies and wait for the next refresh messages flowing from the sources. Here comes the magic of access latency as the clients would not be serviced during the waiting times! In fact, I sense that keeping such stale copies also results in space overhead. Basically, I am not arguing that my idea is correct or productive. I just want to hear your opinions if any.
2. The plausibility of the refresh priority function is solely dependent on the validity of the idea which regards the time-averaged divergence as the most important analysis metric (Note that the refresh priority function is gracefully derived from the time-averaged divergence metric using Lagrange Multiplier. Personally, it looked very beautiful!). However, is the averaged divergence a valid metric? What do you think?
3. The entire structure of the model introduced in the paper assumes a single cache pool which stores all the copies from all the sources. I.e., a source object has its own single cached copy, so update propagation is easy. However, for many reasons (performance, load balancing, robustness (single point of failure), physical and geographical differences), a source object may have to have many corresponding cached copies. If we assume multile cached copies, we need to have a sort of highly-scalable multicast communication mechanism and at a time, a source may have different types (negative or positive) of feedbacks from different copies, which looks very confusing - if we favored only the postive feedbacks, the caches issued negative feedbacks will have seious problems. However, if we regard only the negative messages, the idle caches will become more idle. .........
4. Practically, can the cache pool really avoid the situation where at first it sends positive feedbacks excessively as it thinks it has surplus bandwidth and then the network is congested by the resulting update propagation messages from the sources. Note that each source doesn't have a global view, only the cache pool has. I am not sure if it's a big deal though :).
5. 1) how to manage bandwidth usage when objects have non-uniform refresh cost, 2) partial refreshing, 3) group refreshing, etc. as mentioned in the paper.
%% If there's a guy you want to know something about the Lagrange Multiplier technique, refer to the web site http://mathworld.wolfram.com/LagrangeMultiplier.html, or may contact me.
Nesime's Comments:
1. The approach is not specific to a divergence
metric, but
can employ any metric with D=0 when the copies
are consistent
and monotonically increasing in time after that.
2. It can also employ factors other than importance
and priority into the
weight function if needed.
3. There is adaptivity against fluctuations in
bandwidth and refresh rates using
a feedback mechanism between cache and source
4. With source cooperation, more effective decisions
can be made on which
refreshes are more important to make and bandwidth
is better utilized also
at the source side.
5. It approximates a global priority scheme by
local adjustments of thresholds
at sources coordinated by the central cache -
avoids the expensive calculation
of the global optimal threshold and extra communication
needed for that
Experimentation Methodology:
1. Setting the parameters
2. Show that the approach is effective within
itself using synthetic data
by comparing against ideal scenario
under different metrics
3. Show that it is effective also with real-life
data by repeating 2 with real-data
4. Show that source cooperation approach is effective
by
comparing against an approach without
source cooperation