Towards Out-of-core ND-Arrays -- Frontend
tl;dr Blaze adds usability to our last post on out-of-core ND-Arrays
Disclaimer: This post is about experimental, buggy code. It is not ready for public use.
This follows my last post designing a simple task scheduler for use with out-of-core (or distributed) nd-arrays. We encoded tasks-with-data-dependencies as simple dictionaries. We then built functions to create dictionaries that describe blocked array operations. We found that this was an effective-but-unfriendly way to solve some important-but-cumbersome problems.
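As a reminder of what "tasks-with-data-dependencies as simple dictionaries" means, here is a minimal sketch. Keys map either to data or to a tuple of the form `(function, arg, ...)`, where an argument that names another key is a dependency. The `get` and `_is_key` helpers below are toy stand-ins written for this illustration, not the scheduler from the previous post.

```python
from operator import add, mul


def _is_key(dsk, x):
    """True if x names an entry in the task dictionary (hypothetical helper)."""
    try:
        return x in dsk
    except TypeError:  # unhashable literals such as lists are never keys
        return False


def get(dsk, key):
    """Compute `key`, recursively computing the keys it depends on first."""
    task = dsk[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, args = task[0], task[1:]
        # Arguments that name other keys are dependencies; the rest are literals.
        resolved = [get(dsk, a) if _is_key(dsk, a) else a for a in args]
        return func(*resolved)
    return task  # plain data


dsk = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),   # z = x + y
    "w": (mul, "z", 10),    # w = z * 10
}

result = get(dsk, "w")
```

The dictionary only describes the work. Nothing runs until some `get`-like function walks it, which is what makes the format easy to generate and easy to hand to different schedulers.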
This post sugars the programming experience with Blaze and into to give a NumPy-like experience out-of-core.
Old low-level code
Here is the code we wrote for an out-of-core transpose/dot-product (actually a symmetric rank-k update).
Create random array on disk
A.T * A
New pleasant-feeling code with Blaze
The last section, “Define computation”, is written in a style that is great for library writers and automated systems but challenging for users accustomed to Matlab/NumPy or R/Pandas style.
We wrap this process with Blaze, an extensible front-end for analytic computations.
A.T * A with Blaze
Under the hood
Under the hood, Blaze creates the same dask dicts we created by hand last time. I’ve doctored the result rendered here to include suggestive names.
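As a rough illustration of the kind of dictionary involved (this is not Blaze's actual output), the sketch below builds a task dict for a blocked `A.T.dot(A)` on a tiny in-memory matrix. Nested Python lists stand in for on-disk blocks, and the key names `("A", i, j)`, `("At", i, j)` and `("AtA", i, j)` as well as every helper function are assumptions made for this example.

```python
from itertools import product


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def add_blocks(a, b):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(a, b)]


def sum_of_products(*blocks):
    """Multiply blocks pairwise (0*1, 2*3, ...) and add up the products."""
    total = matmul(blocks[0], blocks[1])
    for left, right in zip(blocks[2::2], blocks[3::2]):
        total = add_blocks(total, matmul(left, right))
    return total


def get(dsk, key):
    """Toy recursive scheduler: tuple arguments found in dsk are dependencies."""
    task = dsk[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        args = [get(dsk, a) if isinstance(a, tuple) and a in dsk else a
                for a in task[1:]]
        return task[0](*args)
    return task


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
n, bs = 2, 2  # blocks per dimension, rows/columns per block

dsk = {}
for i, j in product(range(n), repeat=2):
    # Data block (i, j) of A.
    dsk[("A", i, j)] = [row[j * bs:(j + 1) * bs] for row in A[i * bs:(i + 1) * bs]]
    # Block (i, j) of A.T is the transpose of block (j, i) of A.
    dsk[("At", i, j)] = (transpose, ("A", j, i))
    # Block (i, j) of A.T.dot(A) sums At[i, k].dot(A[k, j]) over k.
    pairs = []
    for k in range(n):
        pairs += [("At", i, k), ("A", k, j)]
    dsk[("AtA", i, j)] = (sum_of_products, *pairs)


def assemble(dsk, name, n):
    """Compute every block of `name` and stitch the blocks into one matrix."""
    rows = []
    for i in range(n):
        blocks = [get(dsk, (name, i, j)) for j in range(n)]
        for parts in zip(*blocks):
            rows.append([x for part in parts for x in part])
    return rows


result = assemble(dsk, "AtA", n)
```

Each output block depends only on a handful of input blocks, so a scheduler never needs the whole array in memory at once.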
We then compute this sequentially on a single core. However, we could have passed this on to a distributed system. This result contains all necessary information to go from on-disk arrays to computed result in whatever manner you choose.
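To make "in whatever manner you choose" concrete, here is a small sketch in which the same task dictionary is executed by two different toy schedulers: one sequential, and one that runs independent tasks in a thread pool. The thread pool is only a local stand-in for a distributed system, and `get_sequential`, `get_threaded` and the other helpers are hypothetical names for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from operator import add, mul


def is_task(task):
    return isinstance(task, tuple) and bool(task) and callable(task[0])


def dependencies(dsk, key):
    """Keys that must be computed before `key` (string keys only in this sketch)."""
    task = dsk[key]
    if is_task(task):
        return [a for a in task[1:] if isinstance(a, str) and a in dsk]
    return []


def run_task(dsk, key, cache):
    """Run one task, reading already-computed dependencies from `cache`."""
    task = dsk[key]
    if is_task(task):
        args = [cache[a] if isinstance(a, str) and a in dsk else a
                for a in task[1:]]
        return task[0](*args)
    return task


def get_sequential(dsk, key, cache=None):
    cache = {} if cache is None else cache
    if key not in cache:
        for dep in dependencies(dsk, key):
            get_sequential(dsk, dep, cache)
        cache[key] = run_task(dsk, key, cache)
    return cache[key]


def get_threaded(dsk, key, max_workers=4):
    cache, remaining = {}, set(dsk)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while key not in cache:
            # Every task whose dependencies are done can run in this wave.
            ready = [k for k in remaining
                     if all(d in cache for d in dependencies(dsk, k))]
            if not ready:
                raise ValueError("cycle in task graph")
            results = list(pool.map(lambda k: run_task(dsk, k, cache), ready))
            cache.update(zip(ready, results))
            remaining.difference_update(ready)
    return cache[key]


dsk = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),
    "w": (mul, "z", 10),
    "v": (add, "z", "w"),
}

sequential_result = get_sequential(dsk, "v")
threaded_result = get_threaded(dsk, "v")
```

Both schedulers consume the dictionary unchanged. The description of the computation does not care how or where it is executed.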
Separating Backend from Frontend
Recall that Blaze is an extensible front-end to data analytics technologies. It lets us wrap messy computational APIs with a pleasant and familiar user-centric API. Extending Blaze to dask dicts was the straightforward work of an afternoon. This separation allows us to continue to build out dask-oriented solutions without worrying about the user interface. By separating backend work from frontend work we allow both sides to be cleaner and to progress more swiftly.
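The frontend/backend split can be sketched in a few lines. In the toy example below, a user writes ordinary-looking arithmetic on symbolic objects (the frontend), and a separate `lower` function translates the resulting expression tree into a task dictionary (the backend). None of these classes or functions are Blaze's real API; `Expr`, `Symbol`, `BinOp`, `lower` and `get` are all hypothetical names used to show the shape of the idea.

```python
from operator import add, mul


# --- Frontend: a friendly way to build expressions -------------------------
class Expr:
    def __add__(self, other):
        return BinOp(add, self, other)

    def __mul__(self, other):
        return BinOp(mul, self, other)


class Symbol(Expr):
    def __init__(self, name):
        self.name = name


class BinOp(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right


# --- Backend: turn an expression tree into a task dictionary ---------------
def lower(expr, dsk):
    """Add tasks for `expr` to `dsk` and return the key (or literal) for it."""
    if isinstance(expr, Symbol):
        return expr.name          # symbols refer to data already in dsk
    if not isinstance(expr, Expr):
        return expr               # plain literal such as 10
    left = lower(expr.left, dsk)
    right = lower(expr.right, dsk)
    key = "_tmp%d" % len(dsk)     # fresh key; assumes no data key uses this prefix
    dsk[key] = (expr.op, left, right)
    return key


def get(dsk, key):
    """Toy recursive scheduler for string-keyed task dictionaries."""
    task = dsk[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        args = [get(dsk, a) if isinstance(a, str) and a in dsk else a
                for a in task[1:]]
        return task[0](*args)
    return task


x, y = Symbol("x"), Symbol("y")
expr = (x + y) * 10               # what the user writes

dsk = {"x": 1, "y": 2}            # the data
key = lower(expr, dsk)            # what the backend generates
result = get(dsk, key)
```

The user-facing classes know nothing about scheduling, and `get` knows nothing about operator overloading. Either side can be replaced without touching the other.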
I’m on vacation right now. Work for recent posts has been done in evenings while watching TV with the family. It isn’t particularly robust. Still, it’s exciting how effective this approach has been with relatively little effort.
Perhaps now would be a good time to mention that Continuum has ample grant funding. We’re looking for people who want to create usable large-scale data analytics tools. For what it’s worth, I quit my academic postdoc to work on this and couldn’t be happier with the switch.
This code is experimental and buggy. I don’t expect it to stay around forever in its current form (it’ll improve). Still, if you’re reading this when it comes out then you might want to check out the following: