Dask Development Log

This work is supported by Anaconda Inc

To increase transparency I’m trying to blog more often about the current work going on around Dask and related projects. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Since the last update in the 0.19.0 release blogpost two weeks ago we’ve seen activity in the following areas:

Update Dask examples to use JupyterLab on Binder
Render Dask examples into static HTML pages for easier viewing
Consolidate and unify disparate documentation
Retire the hdfs3 library in favor of the solution in Apache Arrow
Continue work on hyper-parameter selection for incrementally trained models
Publish two small bugfix releases
Blogpost from the Pangeo community about combining Binder with Dask
Skein/Yarn Update

1: Update Dask Examples to use JupyterLab extension

The new dask-labextension embeds Dask’s dashboard plots into a JupyterLab session so that you can get easy access to information about your computations from Jupyter directly. It was announced a few weeks ago in the previous release post.

Since then, however, we’ve hooked this up to our live examples system, which lets users try out Dask on a small cloud instance using mybinder.org. If you want to try out Dask and JupyterLab together then head here:

Thanks to Ian Rose for managing this.

2: Render Dask Examples as static documentation

Using the nbsphinx Sphinx extension to automatically run and render Jupyter Notebooks we’ve turned our live examples repository into static documentation for easy viewing.
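For readers who have not used nbsphinx, the setup described above can be sketched as a few lines in a Sphinx conf.py. This is a minimal illustration, not the actual configuration of the dask-examples repository; the timeout value and the exclude patterns are assumptions chosen for the example.

```python
# conf.py (sketch): ask Sphinx to run and render Jupyter notebooks via nbsphinx.
# The values below are illustrative assumptions, not the dask-examples settings.

# Enable the nbsphinx extension so that .ipynb files are treated as source pages.
extensions = ["nbsphinx"]

# Always execute notebooks at build time, so rendered output matches the code.
nbsphinx_execute = "always"

# Fail the build if a notebook cell raises, rather than publishing a traceback.
nbsphinx_allow_errors = False

# Per-cell execution timeout in seconds (assumed value).
nbsphinx_timeout = 600

# Keep build output and Jupyter checkpoint copies out of the rendered site.
exclude_patterns = ["_build", "**.ipynb_checkpoints"]
```

With settings along these lines, a regular Sphinx HTML build executes each notebook and writes it out as a static page.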

These examples are currently available at https://dask.org/dask-examples/ but will soon be available at examples.dask.org and from the navbar at all dask pages.

Thanks to Tom Augspurger for putting this together.

3: Consolidate documentation under a single org and style

Dask documentation is currently spread out across many small hosted sites, each associated with a particular subpackage like dask-ml, dask-kubernetes, dask-distributed, etc. This eases development (developers are encouraged to modify documentation as they modify code) but results in a fragmented experience because users don’t know how to discover and efficiently explore our full documentation.

To resolve this we’re doing two things:

Moving all sites under the dask.org domain

Anaconda Inc, the company that employs several of the Dask developers (myself included), recently donated the domain dask.org to NumFOCUS. We’ve been slowly moving over all of our independent sites to use that location for our documentation.

Developing a uniform Sphinx theme, dask-sphinx-theme

This has both uniform styling and also includes a navbar that gets automatically shared between the projects. The navbar makes it easy to discover and explore content and is something that we can keep up-to-date in a single repository.

You can see how this works by going to any of the Dask sites, like docs.dask.org.
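As a rough sketch of what adopting the shared theme looks like in a subproject, the conf.py change is small. The theme name string below is an assumption based on the package name, and the project name is hypothetical; check the dask-sphinx-theme README for the exact settings.

```python
# conf.py (sketch): switch a subproject's documentation to the shared theme.
# Assumes the dask-sphinx-theme package is installed in the build environment
# and registers itself under the name used below (an assumption).

project = "dask-example-subproject"  # hypothetical project name

# Select the shared theme; styling and the common navbar come from the theme
# package, so they are updated in one repository rather than in every project.
html_theme = "dask_sphinx_theme"
```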

Thanks to Tom Augspurger for managing this work and Andy Terrel for patiently handling things on the NumFOCUS side and domain name side.

4: Retire the hdfs3 library

For years the Dask community has maintained the hdfs3 library, which allows for native access to the Hadoop file system from Python. It used Pivotal’s libhdfs3 library, written in C++, and was for a long while the only mature, performant way to work with HDFS from Python.

Since then, though, PyArrow has developed efficient bindings to the standard libhdfs library and exposed them through its Pythonic file system interface, which is fortunately Dask-compatible.

We’ve been telling people to use the Arrow solution for a while now and thought we’d now do so officially (see dask/hdfs3 #170). As of the last bugfix release Dask will use Arrow by default and, while the hdfs3 library is still available, Dask maintainers probably won’t spend much time on it in the future.
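From the user's side, little changes: Dask's readers take an hdfs:// URL and the file system driver is chosen underneath. Here is a minimal sketch, assuming a reachable HDFS namenode and a Dask installation with PyArrow available. The host name, port, path, and both helper functions are hypothetical names introduced for illustration.

```python
def hdfs_url(host, port, path):
    """Build an hdfs:// URL of the kind Dask's readers accept."""
    return f"hdfs://{host}:{port}/{path.lstrip('/')}"


def read_csv_from_hdfs(url_pattern):
    """Lazily read CSV files from HDFS into a Dask dataframe."""
    # Imported here so the sketch can be loaded without Dask installed.
    import dask.dataframe as dd

    # Dask resolves the hdfs:// protocol to its configured HDFS driver,
    # which per the text above is the Arrow-based one by default.
    return dd.read_csv(url_pattern)


# Hypothetical cluster address and data layout.
url = hdfs_url("namenode", 8020, "/data/2018/*.csv")
print(url)
```

Calling read_csv_from_hdfs(url) against a real cluster would return a lazy dataframe; nothing is read until a computation is requested.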

Thanks to Martin Durant for building and maintaining hdfs3 over all this time.

5: Hyper-parameter selection for incrementally trained models

In Dask-ML we continue to work on hyper-parameter selection for models that implement the partial_fit API. We’ve built algorithms and infrastructure to handle this well, and are currently fine-tuning the API, parameter names, etc.

If you have any interest in this process, come on over to dask/dask-ml #356.
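To give a feel for why partial_fit matters here: a model that can be trained a little at a time lets a search train many candidates briefly, drop the weak ones early, and spend further effort only on the promising ones. The sketch below uses successive halving on a toy pure-Python model. It is an illustration of the general idea, not the algorithm or API under discussion in Dask-ML; the ToyRegressor class and the successive_halving function are hypothetical.

```python
class ToyRegressor:
    """Fits y = w * x by SGD; `lr` is the hyper-parameter being tuned."""

    def __init__(self, lr):
        self.lr = lr
        self.w = 0.0

    def partial_fit(self, xs, ys):
        # One pass of stochastic gradient descent over a single batch.
        for x, y in zip(xs, ys):
            self.w -= self.lr * 2.0 * (self.w * x - y) * x
        return self

    def score(self, xs, ys):
        # Negative mean squared error, so that higher is better.
        return -sum((self.w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def successive_halving(models, xs, ys):
    """Train every candidate a little, keep the better half, and repeat."""
    while len(models) > 1:
        for model in models:
            model.partial_fit(xs, ys)
        # For brevity this scores on the training batch; a real search
        # would score on held-out data.
        models = sorted(models, key=lambda m: m.score(xs, ys), reverse=True)
        models = models[: max(1, len(models) // 2)]
    return models[0]


# Toy data generated from y = 3 * x.
xs = [0.1 * i for i in range(1, 10)]
ys = [3.0 * x for x in xs]

candidates = [ToyRegressor(lr) for lr in (0.001, 0.01, 0.1, 1.0)]
best = successive_halving(candidates, xs, ys)
print(best.lr, best.w)
```

Because weak candidates are dropped after one cheap pass, the total training cost stays well below that of training every candidate to completion.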

Thanks to Tom Augspurger and Scott Sievert for this work.

6: Two small bugfix releases

We’ve been trying to increase the frequency of bugfix releases while things are stable. Since our last writing there have been two minor bugfix releases. You can read more about them here:

7: Binder + Dask

The Pangeo community has done work to integrate Binder with Dask and has written about the process here: Pangeo meets Binder

Thanks to Joe Hamman for this work and the blogpost.

8: Skein/Yarn Update

The Dask-Yarn project, which deploys Dask on Hadoop clusters, uses a library called Skein to manage Yarn jobs from Python.

Skein has seen a lot of activity over the last few weeks, including the following:

A Web UI for the project. See jcrist/skein #68
A TensorFlow on Yarn project from Criteo that uses Skein. See github.com/criteo/tf-yarn

This work is mostly managed by Jim Crist and other Skein contributors.