Composing Dask Array with Numba Stencils
By Matthew Rocklin
In this post we explore four array computing technologies, and how they work together to achieve powerful results.
 Numba’s stencil decorator to craft localized compute kernels
 Numba’s JustInTime (JIT) compiler for array computing in Python
 Dask Array for parallelizing array computations across many chunks
 NumPy’s Generalized Universal Functions (gufuncs) to tie everything together
In the end we’ll show how a novice developer can write a small amount of Python to efficiently compute localized computation on large amounts of data. In particular we’ll write a simple function to smooth images and apply that in parallel across a large stack of images.
Here is the full code, we’ll dive into it piece by piece below.
import numba
@numba.stencil
def _smooth(x):
return (x[1, 1] + x[1, 0] + x[1, 1] +
x[ 0, 1] + x[ 0, 0] + x[ 0, 1] +
x[ 1, 1] + x[ 1, 0] + x[ 1, 1]) // 9
@numba.guvectorize(
[(numba.int8[:, :], numba.int8[:, :])],
'(n, m) > (n, m)'
)
def smooth(x, out):
out[:] = _smooth(x)
# If you want fake data
import dask.array as da
x = da.ones((1000000, 1000, 1000), chunks=('auto', 1, 1), dtype='int8')
# If you have actual data
import dask_image
x = dask_image.imread.imread('/path/to/*.png')
y = smooth(x)
# dask.array<transpose, shape=(1000000, 1000, 1000), dtype=int8, chunksize=(125, 1000, 1000)>
Note: the smooth
function above is more commonly referred to as the 2D mean filter in the image processing community.
Now, lets break this down a bit
Numba Stencils
Docs:: https://numba.pydata.org/numbadoc/dev/user/stencil.html
Many array computing functions operate only on a local region of the array. This is common in image processing, signals processing, simulation, the solution of differential equations, anomaly detection, time series analysis, and more. Typically we write code that looks like the following:
def _smooth(x):
out = np.empty_like(x)
for i in range(1, x.shape[0]  1):
for j in range(1, x.shape[1]  1):
out[i, j] = x[i + 1, j + 1] + x[i + 1, j + 0] + x[i + 1, j + 1] +
x[i + 0, j + 1] + x[i + 0, j + 0] + x[i + 0, j + 1] +
x[i + 1, j + 1] + x[i + 1, j + 0] + x[i + 1, j + 1]) // 9
return out
Or something similar. The numba.stencil
decorator makes this a bit easier to
write down. You just write down what happens on every element, and Numba
handles the rest.
@numba.stencil
def _smooth(x):
return (x[1, 1] + x[1, 0] + x[1, 1] +
x[ 0, 1] + x[ 0, 0] + x[ 0, 1] +
x[ 1, 1] + x[ 1, 0] + x[ 1, 1]) // 9
Numba JIT
Docs: http://numba.pydata.org/
When we run this function on a NumPy array, we find that it is slow, operating at Python speeds.
x = np.ones((100, 100))
timeit _smooth(x)
527 ms ± 44.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
But if we JIT compile this function with Numba, then it runs more quickly.
@numba.njit
def smooth(x):
return _smooth(x)
%timeit smooth(x)
70.8 µs ± 6.38 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
For those counting, that’s over 1000x faster!
Note: this function already exists as scipy.ndimage.uniform_filter
, which
operates at the same speed.
Dask Array
Docs: https://docs.dask.org/en/latest/array.html
In these applications people often have many such arrays and they want to apply this function over all of them. In principle they could do this with a for loop.
from glob import glob
import skimage.io
for fn in glob('/path/to/*.png'):
img = skimage.io.imread(fn)
out = smooth(img)
skimage.io.imsave(fn.replace('.png', '.out.png'), out)
If they wanted to then do this in parallel they would maybe use the multiprocessing or concurrent.futures modules. If they wanted to do this across a cluster then they could rewrite their code with PySpark or some other system.
Or, they could use Dask array, which will handle both the pipelining and the parallelism (single machine or on a cluster) all while still looking mostly like a NumPy array.
import dask_image
x = dask_image.imread('/path/to/*.png') # a large lazy array of all of our images
y = x.map_blocks(smooth, dtype='int8')
And then because each of the chunks of a Dask array are just NumPy arrays, we
can use the map_blocks
function to apply this function across all of our
images, and then save them out.
This is fine, but lets go a bit further, and discuss generalized universal functions from NumPy.
Generalized Universal Functions
Numba Docs: https://numba.pydata.org/numbadoc/dev/user/vectorize.html
NumPy Docs: https://docs.scipy.org/doc/numpy1.16.0/reference/capi.generalizedufuncs.html
A generalized universal function (gufunc) is a normal function that has been
annotated with typing and dimension information. For example we can redefine
our smooth
function as a gufunc as follows:
@numba.guvectorize(
[(numba.int8[:, :], numba.int8[:, :])],
'(n, m) > (n, m)'
)
def smooth(x, out):
out[:] = _smooth(x)
This function knows that it consumes a 2d array of int8’s and produces a 2d array of int8’s of the same dimensions.
This sort of annotation is a small change, but it gives other systems like Dask
enough information to use it intelligently. Rather than call functions like
map_blocks
, we can just use the function directly, as if our Dask Array was
just a very large NumPy array.
# Before gufuncs
y = x.map_blocks(smooth, dtype='int8')
# After gufuncs
y = smooth(x)
This is nice. If you write library code with gufunc semantics then that code just works with systems like Dask, without you having to build in explicit support for parallel computing. This makes the lives of users much easier.
Finished result
Lets see the full example one more time.
import numba
import dask.array as da
@numba.stencil
def _smooth(x):
return (x[1, 1] + x[1, 0] + x[1, 1] +
x[ 0, 1] + x[ 0, 0] + x[ 0, 1] +
x[ 1, 1] + x[ 1, 0] + x[ 1, 1]) // 9
@numba.guvectorize(
[(numba.int8[:, :], numba.int8[:, :])],
'(n, m) > (n, m)'
)
def smooth(x, out):
out[:] = _smooth(x)
x = da.ones((1000000, 1000, 1000), chunks=('auto', 1, 1), dtype='int8')
smooth(x)
This code is decently approachable by novice users. They may not understand the internal details of gufuncs or Dask arrays or JIT compilation, but they can probably copypasteandmodify the example above to suit their needs.
The parts that they do want to change are easy to change, like the stencil computation, and creating an array of their own data.
This workflow is efficient and scalable, using lowlevel compiled code and potentially clusters of thousands of computers.
What could be better
There are a few things that could make this workflow nicer.

It would be nice not to have to specify dtypes in
guvectorize
, but instead specialize to types as they arrive. numba/numba #2979 
Support GPU accelerators for the stencil computations using numba.cuda.jit. Stencil computations are obvious candidates for GPU acceleration, and this is a good accessible point where novice users can specify what they want in a way that is sufficiently constrained for automated systems to rewrite it as CUDA somewhat easily. numba/numba 3915

It would have been nicer to be able to apply the
@guvectorize
decorator directly on top of the stencil function like this.@numba.guvectorize(...) @numba.stencil def average(x): ...
Rather than have two functions. numba/numba #3914

You may have noticed that our guvectorize function had to assign its result into an out parameter.
@numba.guvectorize( [(numba.int8[:, :], numba.int8[:, :])], '(n, m) > (n, m)' ) def smooth(x, out): out[:] = _smooth(x)
It would have been nicer, perhaps, to just return the output
def smooth(x): return _smooth(x)

The daskimage library could use a
imsave
function
Aspirational Result
With all of these, we might then be able to write the code above as follows
# This is aspirational
import numba
import dask_image
@numba.guvectorize(
[(numba.int8[:, :], numba.int8[:, :])],
signature='(n, m) > (n, m)',
target='gpu'
)
@numba.stencil
def smooth(x):
return (x[1, 1] + x[1, 0] + x[1, 1] +
x[ 0, 1] + x[ 0, 0] + x[ 0, 1] +
x[ 1, 1] + x[ 1, 0] + x[ 1, 1]) // 9
x = dask_image.io.imread('/path/to/*.png')
y = smooth(x)
dask_image.io.imsave(y, '/path/to/out/*.png')
Update: Now with GPUs!
After writing this blogpost I did a small update where I used numba.cuda.jit to implement the same smooth function on a GPU to achieve a 200x speedup with only a modest increase to code complexity.
blog comments powered by Disqus