Last Year in Review

We recently enjoyed the 2020 SciPy conference from the comfort of our own homes this year. The 19th annual Scientific Computing with Python conference was a virtual conference this year due to the global pandemic. The annual SciPy Conference brought together over 1500 participants from industry, academia, and government to showcase their latest projects, learn from skilled users and developers, and collaborate on code development.

As part of the maintainers track we presented an update on Dask.

Video

You can find the video on the SciPy YouTube channel. The Dask update runs from 0:00-19:30.

Slides

Talk Summary

Here’s a summary of the main topics covered in the talk. You can also check out the original thread on Twitter.

Community overview

We’ve been trying to gauge the size of our community lately. The best proxy we have right now is the number of weekly visitors to the Dask documentation. Which currently stands at around 10,000.

Dask documentation analytics showing growth to 10,000 weekly users over the last four years

Dask also came up in the Jetbrains Python developer survey. We were excited to see 5% of all the Python developers who filled out the survey said they use Dask. Which shows health in the PyData community as well as Dask.

Jetbrains survey results showing Dask used by 5% of Python users, beaten only by the Spark/hadoop ecosystem

We are running our own survey at the moment. If you are a Dask user please take a few minutes to fill it out. We would really appreciate it.

Link to the Dask survey

Community events

In February we had an in-person Dask Summit where a mixture of OSS maintainers and institutional users met. We had talks and workshops to help figure out our challenges and set our direction.

A room of attendees at the Dask summit

The Dask community also has a monthly meeting! It is held on the first Thursday of the month at 10:00 US Central Time. If you’re a Dask user you are welcome to come to hear updates from maintainers and share what you’re working on.

Community projects

There are many projects built on Dask. Looking at the preliminary results from the 2020 Dask survey shows some that are especially popular.

Graph showing the most popular projects built on Dask; Xarray, RAPIDS, XGBoost, Prefect and Iris

Let’s take a look at each of those.

Xarray

Xarray allows you to work on multi-dimensional datasets that have supporting metadata arrays in a Pandas-like way.

Slide showing xarray code example

RAPIDS

RAPIDS is an open-source suite of GPU accelerated Python libraries. Using these tools you can execute end-to-end data science and analytics pipelines entirely on GPUs. All using familiar PyData APIs.

Slide showing RAPIDS dataframe code example

BlazingSQL

BlazingSQL builds on RAPIDS and Dask to provide an open-source distributed, GPU accelerated SQL engine.

Slide showing BlazingSQL code example

XGBoost

While XGBoost has been around for a long time you can now prepare your data on your Dask cluster and then bootstrap your XGBoost cluster on top of Dask and hand the distributed dataframes straight over.

Slide showing XGBoost code example

Prefect

Prefect is a workflow manager which is built on top of Dask’s scheduling engine. “Users organize Tasks into Flows, and Prefect takes care of the rest.”

Slide showing Prefect code example

Iris

Iris, part of the SciTools suite of tools, uses the CF data model giving you a format-agnostic interface for working with your data. It excels when working with multi-dimensional Earth Science data, where tabular representations become unwieldy and inefficient.

Slide showing Iris code example

More tools

These are the tools our community have told us they like so far. But if you use something which didn’t make the list then head to our survey and let us know! According to PyPI there are many more out there.

Screenshot of PyPI showing 239 packages with Dask in their name

User groups

There are many user groups who use Dask. Everything from life sciences, geophysical sciences and beamline facilities to finance, retail and logistics. Check out the great “Who uses Dask?” talk from Matthew Rocklin for more info.

Screenshot 'Who uses Dask?' YouTube video

For profit companies

There has been an increase in for-profit companies building tools with Dask. Including Coiled Computing, Prefect and Saturn Cloud.

Slide describing the for-profit companies Coiled, Prefect and Saturn Cloud

We’ve also seen large companies like Microsoft’s Azure ML team contributing a cluster manager to Dask Cloudprovider. This helps folks get up and running with Dask on AzureML quicker and easier.

Recent improvements

Communications

Moving on to recent improvements there has been a lot of work to get Open UCX supported as a protocol in Dask. Which allows worker-worker communication to be accelerated vastly with hardware that supports Infiniband or NVLink.

Slide showing worker communication comparison between UCX/Infiniband and TCP with UCX being much faster

There have also been some recent announcements around NVIDIA blowing away the TPCx-BB benchmark by outperforming the current leader by 20x. This is a huge success for all the open-source projects that were involved, including Dask.

Slide showing TPCx-BB benchmark results

Dask Gateway

We’ve seen increased adoption of Dask Gateway. Many institutions are using it as a way to provide their staff with on-demand Dask clusters.

Slide showing Dask Gateway overview

Cluster map plot (aka ‘pew pew pew’)

The update that got the most 👏 feedback from the SciPy 2020 attendees was the Cluster Map Plot (known to maintainers as the “pew pew pew” plot). This plot shows a high-level overview of your Dask cluster scheduler and workers and the communication between them.

Next steps

High-level graph optimization

To wrap up with what Dask is going to be doing next we are going to be continuing to work on high-level graph optimization.

Slide showing High Level Graph documentation page

Scheduler performance

With feedback from our community we are also going to be focussing on making the Dask scheduler more performant. There are a few things happening including a Rust implementation of the scheduler, dynamic task creation and ongoing benchmarking.

Scheduler performance tasks including a Rust implementation, benchmarking, dynamic tasks and Cython, PyPy and C experiments

Chan Zuckerberg Foundation maintainer post

Lastly I’m excited to share that with funding from the Chan Zuckerberg Foundation, Dask will be hiring a maintainer who will focus on growing usage in the biological sciences field. If that is of interest to you keep an eye on our twitter account for more announcements.

Last Year in Review From the SciPy 2020 Maintainers Track