# Towards Out-of-core ND-Arrays -- Benchmark MatMul

*This work is supported by Continuum Analytics
and the XDATA Program
as part of the Blaze Project*

**tl;dr** We benchmark dask on an out-of-core dot product. We also compare and
motivate the use of an optimized BLAS.

**Here are the results**

Performance (GFLOPS) | In-Memory | On-Disk |
---|---|---|

Reference BLAS | 6 | 18 |

OpenBLAS one thread | 11 | 23 |

OpenBLAS four threads | 22 | 11 |

*Disclaimer: This post is on experimental buggy code. This is not ready for public use.*

## Introduction

This is the fourth in a sequence of posts constructing an out-of-core nd-array using NumPy, Blaze, and dask. You can view these posts here:

We now give performance numbers on out-of-core matrix-matrix multiplication.

## Matrix-Matrix Multiplication

Dense matrix-matrix multiplication is compute-bound, not I/O bound.
We spend most of our time doing arithmetic and relatively little time shuffling
data around. As a result we may be able to read *large* data from disk without
performance loss.

When multiplying two $n\times n$ matrices we read $n^2$ bytes but perform $n^3$ computations. There are $n$ computations to do per byte so, relatively speaking, I/O is cheap.

We normally measure speed for single CPUs in Giga Floating Point Operations Per Second (GFLOPS). Lets look at how my laptop does on single-threaded in-memory matrix-matrix multiplication using NumPy.

```
>>> import numpy as np
>>> x = np.ones((1000, 1000), dtype='f8')
>>> %timeit x.dot(x) # Matrix-matrix multiplication
10 loops, best of 3: 176 ms per loop
>>> 1000 ** 3 / 0.176 / 1e9 # n cubed computations / seconds / billion
>>> 5.681818181818182
```

OK, so NumPy’s matrix-matrix multiplication clocks in at 6 GFLOPS more or
less. The `np.dot`

function ties in to the `GEMM`

operation in the `BLAS`

library on my machine. Currently my `numpy`

just uses reference BLAS. (you can
check this with `np.show_config()`

.)

## Matrix-Matrix Multiply From Disk

For matrices too large to fit in memory we compute the solution one part at a time, loading blocks from disk when necessary. We parallelize this with multiple threads. Our last post demonstrates how NumPy+Blaze+Dask automates this for us.

We perform a simple numerical experiment, using HDF5 as our on-disk store.

We install stuff

```
conda install -c blaze blaze
pip install git+https://github.com/ContinuumIO/dask
```

We set up a fake dataset on disk

```
>>> import h5py
>>> f = h5py.File('myfile.hdf5')
>>> A = f.create_dataset(name='A', shape=(200000, 4000), dtype='f8',
... chunks=(250, 250), fillvalue=1.0)
>>> B = f.create_dataset(name='B', shape=(4000, 4000), dtype='f8',
... chunks=(250, 250), fillvalue=1.0)
```

We tell Dask+Blaze how to interact with that dataset

```
>>> from dask.obj import Array
>>> from blaze import Data, into
>>> a = into(Array, 'myfile.hdf5::/A', blockshape=(1000, 1000)) # dask things
>>> b = into(Array, 'myfile.hdf5::/B', blockshape=(1000, 1000))
>>> A = Data(a) # Blaze things
>>> B = Data(b)
```

We compute our desired result, storing back onto disk

```
>>> %time into('myfile.hdf5::/result', A.dot(B))
2min 49s
>>> 200000 * 4000 * 4000 / 169 / 1e9
18.934911242
```

18.9 GFLOPS, roughly 3 times faster than the in-memory solution. At first glance this is confusing - shouldn’t we be slower coming from disk? Our speedup is due to our use of four cores in parallel. This is good, we don’t experience much slowdown coming from disk.

It’s as if all of our hard drive just became memory.

## OpenBLAS

Reference BLAS is slow; it was written long ago. OpenBLAS is a modern
implementation. I installed OpenBLAS with my system installer (`apt-get`

) and
then reconfigured and rebuilt numpy. OpenBLAS supports many cores. We’ll show
timings with one and with four threads.

```
export OPENBLAS_NUM_THREADS=1
ipython
```

```
>>> import numpy as np
>>> x = np.ones((1000, 1000), dtype='f8')
>>> %timeit x.dot(x)
10 loops, best of 3: 89.8 ms per loop
>>> 1000 ** 3 / 0.0898 / 1e9 # compute GFLOPS
11.135857461024498
```

```
export OPENBLAS_NUM_THREADS=4
ipython
```

```
>>> %timeit x.dot(x)
10 loops, best of 3: 46.3 ms per loop
>>> 1000 ** 3 / 0.0463 / 1e9 # compute GFLOPS
21.598272138228943
```

This is about four times faster than reference. If you’re not already
parallelizing in some other way (like with `dask`

) then you should use a modern
BLAS like OpenBLAS or MKL.

## OpenBLAS + dask

Finally we run on-disk our experiment again, now with OpenBLAS. We do this both with OpenBLAS running with one thread and with many threads.

We’ll skip the code (it’s identical to what’s above) and give a comprehensive table of results below.

Sadly the out-of-core solution doesn’t improve much by using OpenBLAS.
Acutally when both OpenBLAS and dask try to parallelize we *lose* performance.

## Results

Performance (GFLOPS) | In-Memory | On-Disk |
---|---|---|

Reference BLAS | 6 | 18 |

OpenBLAS one thread | 11 | 23 |

OpenBLAS four threads | 22 | 11 |

**tl:dr** When doing compute intensive work, don’t worry about using disk, just
don’t use two mechisms of parallelism at the same time.

## Main Take-Aways

- We don’t lose much by operating from disk in compute-intensive tasks
- Actually we can improve performance when an optimized BLAS isn’t avaiable.
- Dask doesn’t benefit much from an optimized BLAS. This is sad and surprising. I expected performance to scale with single-core in-memory performance. Perhaps this is indicative of some other limiting factor
- One shouldn’t extrapolate too far with these numbers. They’re only relevant for highly compute-bound operations like matrix-matrix multiply

Also, thanks to Wesley Emeneker for finding where we were leaking memory, making results like these possible.

blog comments powered by Disqus