Caching Humans repeat stuff. Caching helps.
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
tl;dr: Caching improves performance under repetitive workloads. Traditional LRU policies don’t fit data science well. We propose a new caching policy.
Humans Repeat Stuff
Consider the dataset that you’ve worked on most recently. How many times have you loaded it from disk into memory? How many times have you repeated almost the same computations on that data?
Exploratory data science workloads involve repetition of very similar computations. These computations share structure. By caching frequently used results (or parts of results) we may be able to considerably speed up exploratory data analysis.
Caching in other contexts
The web community loves caching. Database backed web applications almost
always guard their database lookups with a system like memcached
which
devotes some amount of memory to caching frequent and recent queries.
Because humans visit mostly the same pages over and over again this can reduce
database load by an order of magnitude. Even if humans visit different pages
with different inputs these pages often share many elements.
Limited caching resources
Given infinite memory we would cache every result that we’ve ever computed. This would give us for instant recall on anything that wasn’t novel. Sadly our memory resources are finite and so we evict cached results that don’t seem to be worth keeping around.
Traditionally we use a policy like Least Recently Used (LRU). This policy evicts results that have not been requested for a long time. This is cheap and works well for web and systems applications.
LRU doesn’t fit analytic workloads
Unfortunately LRU doesn’t fit analytic workloads well. Analytic workloads have a large spread computation times and of storage costs. While most web application database queries take roughly the same amount of time (100ms-1000ms) and take up roughly the same amount of space to store (1-10kb), the computation and storage costs of analytic computations can easily vary by many orders of magnitude (spreads in the millions or billions are common.)
Consider the following two common computations of a large NumPy array:
x.std() # costly to recompute, cheap to store
x.T # cheap to recompute, costly to store
In the first case, x.std()
, this might take a second on a large array
(somewhat expensive) but takes only a few bytes to store. This result is so
cheap to store that we’re happy to keep it in our cache for a long time, even
if its infrequently requested.
In the second case, x.T
this is cheap to compute (just a metadata change in
the array) and executes in microseconds. However the result might take
gigabytes of memory to store. We don’t want to keep this in our cache, even if
it’s very frequently requested; it takes up all of the space for other
potentially more useful (and smaller) results and we can recompute it trivially
anyway.
So we want to keep cached results that have the following properties:
- Costly to recompute (in seconds)
- Cheap to store (in bytes)
- Frequently used
- Recently used
Proposed Caching Policy
Here is an alternative to LRU that respects the objectives stated above.
Every time someone accesses an entry in our cache, we increment the score associated to the entry with the following value
\[\textrm{score} = \textrm{score} + \frac{\textrm{compute time}}{\textrm{nbytes}} (1 + \epsilon)^{t}\]Where compute time
is the time it took to compute the result in the first
place, nbytes
is the number of bytes that it takes to store the result,
epsilon
is a small number that determines the halflife of what “recently”
means, and t
is an auto-incrementing tick time increased at every access.
This has units of inverse bandwidth (s/byte), gives more importance to new results with a slowly growing exponential growth, and amplifies the score of frequently requested results in a roughly linear fashion.
We maintain these scores in a heap, keep track of the total number of bytes,
and cull the cache as necessary to keep storage costs beneath a fixed budget.
Updates cost O(log(k))
for k
the number of elements in the cache.
Cachey
I wrote this up into a tiny library called cachey
. This is experimental
code and subject to wild API changes (including renaming.)
The central object is a Cache
that includes asks for the following:
- Number of available bytes to devote to the cache
- Halflife on importance (the number of access that occur to reduce the importance of a cached result by half) (default 1000)
- A lower limit on costs to consider entry to the cache (default 0)
Example
So here is the tiny example
>>> from cachey import Cache
>>> c = Cache(available_bytes=1e9) # 1 GB
>>> c.put('Hello', 'world!', cost=3)
>>> c.get('Hello')
'world!'
More interesting example
The cache includes a memoize
decorator. Lets memoize pd.read_csv
.
In [1]: import pandas as pd
In [2]: from cachey import Cache
In [3]: c = Cache(1e9)
In [4]: read_csv = c.memoize(pd.read_csv)
In [5]: %time df = read_csv('accounts.csv')
CPU times: user 262 ms, sys: 27.7 ms, total: 290 ms
Wall time: 303 ms
In [6]: %time df = read_csv('accounts.csv') # second read is free
CPU times: user 77 µs, sys: 16 µs, total: 93 µs
Wall time: 93 µs
In [7]: c.total_bytes / c.available_bytes
Out[7]: 0.096
So we create a new function read_csv
that operates exactly like
pandas.read_csv
except that it holds on to recent results in c
, a cache.
This particular CSV file created a dataframe that filled a tenth of our
cache space. The more often we request this CSV file the more its score will
grow and the more likely it is to remain in the cache into the future. If
other memoized functions using this same cache produce more valuable results
(costly to compute, cheap to store) and we run out of space then this result
will be evicted and we’ll have to recompute our result if we ask for
read_csv('accounts.csv')
again.
Memoize everything
Just memoizing read_csv
isn’t very interesting. The pd.read_csv
function operates at a constant data bandwidth of around 100 MB/s. The caching
policies around cachey
really shine when they get to see all of our
computations. For example it could be that we don’t want to hold on to the
results of read_csv
because these take up a lot of space. If we find
ourselves doing the same groupby
computations then we might prefer to use our
gigabyte of caching space to store these both because
- groupby computations take a long time to compute
- groupby results are often very compact in memory
Cachey and Dask
I’m slowly working on integrating cachey into dask’s shared memory scheduler.
Dask is in a good position to apply cachey
to many computations. It can look
at every task in a task graph and consider the result for inclusion into the
cache. We don’t need to explicitly memoize
every function we want to use,
dask
can do this for us on the fly.
Additionally dask has a nice view of our computation as a collection of sub-tasks. Similar computations (like mean and variance) often share sub-tasks.
Future Work
Cachey is new and untested but potentially useful now, particularly through the
memoize
method shown above.
It’s a small and simple codebase.
I would love to hear if people find this kind of caching policy valuable.
I plan to think about the following in the near future:
- How do we build a hierarchy of caches to share between memory and disk?
- How do we cleanly integrate cachey into dask (see PR #502) and how do we make dask amenable to caching (see PR #510)?
blog comments powered by Disqus