Dask Development Log
To increase transparency I’m trying to blog more often about the current work going on around Dask and related projects. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.
Current development in Dask and Dask-related projects includes the following efforts:
- A possible change to our community communication model
- A rewrite of the distributed scheduler
- Kubernetes and Helm Charts for Dask
- Adaptive deployment fixes
- Continued support for NumPy and Pandas’ growth
- Spectral Clustering in Dask-ML
Community communication

Dask community communication generally happens in GitHub issues for bug and feature tracking, the Stack Overflow #dask tag for user questions, and an infrequently used Gitter chat.
Separately, Dask developers who work for Anaconda Inc (there are about five of us part-time) use an internal company chat and a closed weekly video meeting. We’re now trying to migrate away from closed systems when possible.
Details about future directions are in dask/dask #2945. Thoughts and comments on that issue would be welcome.
Rewrite of the distributed scheduler

When you start building clusters with 1000 workers the distributed scheduler can become a bottleneck on some workloads. After working with the PyPy and Cython development teams we’ve decided to rewrite parts of the scheduler to make it more amenable to acceleration by those technologies. Note that no actual acceleration has occurred yet, just a refactor of internal state.
Previously the distributed scheduler was built around a large set of Python dictionaries, sets, and lists that indexed into each other heavily. This was done both to keep the code low-tech and because Python’s core data structures are fast. However, compiler technologies like PyPy and Cython can optimize attribute access on Python objects down to C speeds, so we’re experimenting with moving this state from interlinked Python data structures onto Python objects to see how much that helps.
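As a rough illustration of the pattern (the names here are hypothetical, not the scheduler’s actual classes): instead of several parallel dictionaries keyed by task ID, each task becomes one object whose attribute accesses a compiler like Cython can turn into C-level field lookups.

```python
# Hypothetical sketch of the refactor, not the scheduler's real data model.

# Before: parallel dictionaries all indexed by task key
task_state = {"x": "memory", "y": "waiting"}

# After: one object per task; slotted attribute access is easy for
# Cython/PyPy to optimize down to C-level field lookups
class TaskState:
    __slots__ = ("key", "state", "dependencies")

    def __init__(self, key, state="waiting"):
        self.key = key
        self.state = state
        self.dependencies = set()  # other TaskState objects this task needs

tasks = {key: TaskState(key, state) for key, state in task_state.items()}
tasks["y"].dependencies.add(tasks["x"])
```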
This change will be invisible operationally (the full test suite remains virtually unchanged), but will be a significant change to the scheduler’s internal state. We’re keeping around a compatibility layer, but people who were building their own diagnostics around the internal state should review the new changes.
Ongoing work by Antoine Pitrou in dask/distributed #1594.
Kubernetes and Helm Charts for Dask
In service of the Pangeo project to enable scalable data analysis of atmospheric and oceanographic data we’ve been improving the tooling around launching Dask on Cloud infrastructure, particularly leveraging Kubernetes.
To that end we’re making some flexible Docker containers and Helm Charts for Dask, and hope to combine them with JupyterHub in the coming weeks.
Work done by myself in the following repositories. Feedback would be very welcome. I am learning on the job with Helm here.
If you use Helm on Kubernetes then you might want to try the following:
```
helm repo add dask https://dask.github.io/helm-chart
helm repo update
helm install dask/dask
```
This installs a full Dask cluster and a Jupyter server. The Docker containers contain entry points that allow their environments to be updated with custom packages easily.
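For example, the chart’s values can be overridden to install extra packages when the containers start. The keys and the environment variable name below are assumptions based on common Dask image conventions; check the chart’s own values.yaml for the real ones.

```yaml
# values.yaml override (hypothetical keys -- consult the chart's values.yaml)
jupyter:
  env:
    - name: EXTRA_PIP_PACKAGES   # read by the container entry point at startup
      value: xarray zarr
worker:
  env:
    - name: EXTRA_PIP_PACKAGES
      value: xarray zarr
```

Then install with `helm install dask/dask -f values.yaml`.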
This work extends prior work on the previous package, dask-kubernetes, but is slightly more modular for use alongside other systems.
Adaptive deployment fixes
Adaptive deployments, where a cluster manager scales a Dask cluster up or down based on current workload, recently got a makeover, including a number of bug fixes for odd or infrequent behavior.
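The core idea can be sketched as a function from current load to a target worker count, which the cluster manager then moves toward. This is an illustration of the concept, not the actual Adaptive implementation:

```python
import math

def target_workers(queued_tasks, tasks_per_worker, minimum=1, maximum=10):
    """Hypothetical sketch of an adaptive policy: request enough workers
    for the queued work, clamped to fixed lower and upper bounds."""
    if queued_tasks <= 0:
        return minimum  # idle cluster scales down to the floor
    desired = math.ceil(queued_tasks / tasks_per_worker)
    return max(minimum, min(maximum, desired))
```

A cluster manager would call this periodically and add or retire workers to close the gap with the returned target.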
Work done by Russ Bubley.
Keeping up with NumPy and Pandas
NumPy 1.14 is due for release soon. Dask.array had to update how it handled structured dtypes in dask/dask #2694 (work by Tom Augspurger).
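Structured dtypes give each array element named, typed fields, which is the behavior the dask.array update had to keep handling correctly. The arrays below are just an illustration:

```python
import numpy as np

# A structured dtype: every element carries named, typed fields
dt = np.dtype([("time", "f8"), ("id", "i4")])
records = np.array([(0.0, 1), (0.5, 2), (1.0, 3)], dtype=dt)

# Indexing by field name returns a plain array of that field's dtype
ids = records["id"]
times = records["time"]
```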
Dask.dataframe is gaining the ability to merge/join simultaneously on columns and indices, following a similar feature released in Pandas 0.22. Work done by Jon Mease in dask/dask #2960.
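In pandas terms, the feature lets a merge key list mix index level names with ordinary column names; dask.dataframe mirrors the pandas API here. A small pandas illustration (assuming pandas is available):

```python
import pandas as pd

# "key1" is an index level, "key2" an ordinary column
left = pd.DataFrame(
    {"key2": ["A", "B", "A"], "lval": [1, 2, 3]},
    index=pd.Index(["K0", "K0", "K1"], name="key1"),
)
right = pd.DataFrame(
    {"key2": ["A", "B", "B"], "rval": [10, 20, 30]},
    index=pd.Index(["K0", "K1", "K2"], name="key1"),
)

# Merge simultaneously on the index level and the column:
# only (K0, "A") appears in both frames
merged = left.merge(right, on=["key1", "key2"])
```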
Spectral Clustering in Dask-ML