This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: Caching improves performance under repetitive workloads. Traditional LRU policies don’t fit data science well. We propose a new caching policy.

Humans Repeat Stuff

Consider the dataset that you’ve worked on most recently. How many times have you loaded it from disk into memory? How many times have you repeated almost the same computations on that data?

Exploratory data science workloads involve repetition of very similar computations. These computations share structure. By caching frequently used results (or parts of results) we may be able to considerably speed up exploratory data analysis.

Caching in other contexts

The web community loves caching. Database backed web applications almost always guard their database lookups with a system like memcached which devotes some amount of memory to caching frequent and recent queries. Because humans visit mostly the same pages over and over again this can reduce database load by an order of magnitude. Even if humans visit different pages with different inputs these pages often share many elements.

Limited caching resources

Given infinite memory we would cache every result that we’ve ever computed. This would give us for instant recall on anything that wasn’t novel. Sadly our memory resources are finite and so we evict cached results that don’t seem to be worth keeping around.

Traditionally we use a policy like Least Recently Used (LRU). This policy evicts results that have not been requested for a long time. This is cheap and works well for web and systems applications.

LRU doesn’t fit analytic workloads

Unfortunately LRU doesn’t fit analytic workloads well. Analytic workloads have a large spread computation times and of storage costs. While most web application database queries take roughly the same amount of time (100ms-1000ms) and take up roughly the same amount of space to store (1-10kb), the computation and storage costs of analytic computations can easily vary by many orders of magnitude (spreads in the millions or billions are common.)

Consider the following two common computations of a large NumPy array:

  1. x.std() # costly to recompute, cheap to store
  2. x.T # cheap to recompute, costly to store

In the first case, x.std(), this might take a second on a large array (somewhat expensive) but takes only a few bytes to store. This result is so cheap to store that we’re happy to keep it in our cache for a long time, even if its infrequently requested.

In the second case, x.T this is cheap to compute (just a metadata change in the array) and executes in microseconds. However the result might take gigabytes of memory to store. We don’t want to keep this in our cache, even if it’s very frequently requested; it takes up all of the space for other potentially more useful (and smaller) results and we can recompute it trivially anyway.

So we want to keep cached results that have the following properties:

  1. Costly to recompute (in seconds)
  2. Cheap to store (in bytes)
  3. Frequently used
  4. Recently used

Proposed Caching Policy

Here is an alternative to LRU that respects the objectives stated above.

Every time someone accesses an entry in our cache, we increment the score associated to the entry with the following value

\[\textrm{score} = \textrm{score} + \frac{\textrm{compute time}}{\textrm{nbytes}} (1 + \epsilon)^{t}\]

Where compute time is the time it took to compute the result in the first place, nbytes is the number of bytes that it takes to store the result, epsilon is a small number that determines the halflife of what “recently” means, and t is an auto-incrementing tick time increased at every access.

This has units of inverse bandwidth (s/byte), gives more importance to new results with a slowly growing exponential growth, and amplifies the score of frequently requested results in a roughly linear fashion.

We maintain these scores in a heap, keep track of the total number of bytes, and cull the cache as necessary to keep storage costs beneath a fixed budget. Updates cost O(log(k)) for k the number of elements in the cache.


I wrote this up into a tiny library called cachey. This is experimental code and subject to wild API changes (including renaming.)

The central object is a Cache that includes asks for the following:

  1. Number of available bytes to devote to the cache
  2. Halflife on importance (the number of access that occur to reduce the importance of a cached result by half) (default 1000)
  3. A lower limit on costs to consider entry to the cache (default 0)


So here is the tiny example

>>> from cachey import Cache
>>> c = Cache(available_bytes=1e9)  # 1 GB

>>> c.put('Hello', 'world!', cost=3)

>>> c.get('Hello')

More interesting example

The cache includes a memoize decorator. Lets memoize pd.read_csv.

In [1]: import pandas as pd
In [2]: from cachey import Cache

In [3]: c = Cache(1e9)
In [4]: read_csv = c.memoize(pd.read_csv)

In [5]: %time df = read_csv('accounts.csv')
CPU times: user 262 ms, sys: 27.7 ms, total: 290 ms
Wall time: 303 ms

In [6]: %time df = read_csv('accounts.csv')  # second read is free
CPU times: user 77 µs, sys: 16 µs, total: 93 µs
Wall time: 93 µs

In [7]: c.total_bytes / c.available_bytes
Out[7]: 0.096

So we create a new function read_csv that operates exactly like pandas.read_csv except that it holds on to recent results in c, a cache. This particular CSV file created a dataframe that filled a tenth of our cache space. The more often we request this CSV file the more its score will grow and the more likely it is to remain in the cache into the future. If other memoized functions using this same cache produce more valuable results (costly to compute, cheap to store) and we run out of space then this result will be evicted and we’ll have to recompute our result if we ask for read_csv('accounts.csv') again.

Memoize everything

Just memoizing read_csv isn’t very interesting. The pd.read_csv function operates at a constant data bandwidth of around 100 MB/s. The caching policies around cachey really shine when they get to see all of our computations. For example it could be that we don’t want to hold on to the results of read_csv because these take up a lot of space. If we find ourselves doing the same groupby computations then we might prefer to use our gigabyte of caching space to store these both because

  1. groupby computations take a long time to compute
  2. groupby results are often very compact in memory

Cachey and Dask

I’m slowly working on integrating cachey into dask’s shared memory scheduler.

Dask is in a good position to apply cachey to many computations. It can look at every task in a task graph and consider the result for inclusion into the cache. We don’t need to explicitly memoize every function we want to use, dask can do this for us on the fly.

Additionally dask has a nice view of our computation as a collection of sub-tasks. Similar computations (like mean and variance) often share sub-tasks.

Future Work

Cachey is new and untested but potentially useful now, particularly through the memoize method shown above. It’s a small and simple codebase. I would love to hear if people find this kind of caching policy valuable.

I plan to think about the following in the near future:

  1. How do we build a hierarchy of caches to share between memory and disk?
  2. How do we cleanly integrate cachey into dask (see PR #502) and how do we make dask amenable to caching (see PR #510)?

blog comments powered by Disqus