2020 Dask User Survey

This post presents the results of the 2020 Dask User Survey, which ran earlier this summer. Thanks to everyone who took the time to fill out the survey! These results help us better understand the Dask community and will guide future development efforts.

The raw data, as well as the start of an analysis, can be found in this binder:

Let us know if you find anything interesting in the data.

Highlights

We had 240 responses to the survey (slightly fewer than last year, which had about 260).
Overall, results look mostly similar to last year’s.
Our documentation has probably improved relative to last year.
Respondents care more about performance than they did last year.

New Questions

Most of the questions are the same as in 2019. We added a couple of questions about deployment and dashboard usage. Let’s look at those first.

Among respondents who use a Dask package to deploy a cluster (about 53% of respondents), there’s a wide spread of methods.

Most people access the dashboard through a web browser. Those not using the dashboard are likely (hopefully) just using Dask on a single machine with the threaded scheduler (though the dashboard works fine on a single machine as well).

Learning Resources

Respondents’ learning material usage is fairly similar to last year. The most notable differences come from our survey form providing more options (our YouTube channel and “Gitter chat”). Other than that, examples.dask.org might be relatively more popular.

Just like last year, we’ll look at resource usage grouped by how often respondents use Dask.

A few observations:

GitHub issues are becoming relatively less popular, which perhaps reflects better documentation or stability (assuming people go to the issue tracker when they can’t find the answer in the docs or they hit a bug).
https://examples.dask.org is now notably more popular among occasional users.
In response to last year’s survey, we invested time in making https://tutorial.dask.org better, which we previously felt was lacking. Its usage is still about the same as last year’s (pretty popular), so it’s unclear whether we should dedicate additional focus there.

How do you use Dask?

API usage remains about the same as last year (recall that about 20 fewer people took the survey and people can select multiple options, so relative differences are most interesting). We added new choices for RAPIDS, Prefect, and XGBoost, each of which is somewhat popular (in the neighborhood of dask.Bag).
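To make the point about relative differences concrete, here is a minimal sketch that converts a raw selection count into a share of respondents. The respondent totals are the approximate figures quoted in this post; the count of 100 is a made-up placeholder, not a survey result, and `response_share` is a hypothetical helper name.

```python
# Approximate respondent totals quoted in this post.
N_RESPONDENTS_2019 = 260
N_RESPONDENTS_2020 = 240


def response_share(count, n_respondents):
    """Fraction of respondents who selected an option.

    Respondents can select several options, so shares across
    options do not need to sum to 1.
    """
    return count / n_respondents


# Hypothetical count, used only to illustrate the effect of the
# different survey sizes. It is NOT taken from the survey data.
hypothetical_count = 100

share_2019 = response_share(hypothetical_count, N_RESPONDENTS_2019)
share_2020 = response_share(hypothetical_count, N_RESPONDENTS_2020)

print(f"2019 share: {share_2019:.1%}")
print(f"2020 share: {share_2020:.1%}")
```

The same raw count corresponds to a larger share in the smaller 2020 survey, which is why comparing shares is more informative than comparing counts.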

About 65% of our users are using Dask on a cluster at least some of the time, which is similar to last year.

How can Dask improve?

Respondents continue to say that more documentation and examples would be the most valuable improvements to the project.

One interesting change comes from looking at “Which would help you most right now?” split by API group (dask.dataframe, dask.array, etc.). Last year showed that “More examples in my field” was the most important for all API groups (first table below). But in 2020 there are some differences (second table below).

Responses to “Which would help you most right now?” in 2019, normalized by row. Darker means that a higher proportion of users of that API prefer that priority.
Dask API	Bug fixes	More documentation	More examples in my field	New features	Performance improvements
Array	10	24	62	15	25
Bag	3	11	16	10	7
DataFrame	16	32	71	39	26
Delayed	16	22	55	26	27
Futures	12	9	25	20	17
ML	5	11	23	11	7
Xarray	8	11	34	7	9

Responses to “Which would help you most right now?” in 2020, normalized by row. Darker means that a higher proportion of users of that API prefer that priority.
Dask API	Bug fixes	More documentation	More examples in my field	New features	Performance improvements
Array	12	16	56	15	23
Bag	7	5	24	7	16
DataFrame	24	21	67	22	41
Delayed	15	19	46	17	34
Futures	9	10	21	13	24
ML	6	4	21	9	12
Xarray	3	4	25	9	13
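The row normalization mentioned in the table captions can be sketched from the 2020 counts above. This is a minimal pandas example, not the code from the survey notebook; the variable names are illustrative assumptions.

```python
import pandas as pd

priorities = [
    "Bug fixes",
    "More documentation",
    "More examples in my field",
    "New features",
    "Performance improvements",
]

# Counts copied from the 2020 table in this post.
counts_2020 = pd.DataFrame(
    [
        [12, 16, 56, 15, 23],
        [7, 5, 24, 7, 16],
        [24, 21, 67, 22, 41],
        [15, 19, 46, 17, 34],
        [9, 10, 21, 13, 24],
        [6, 4, 21, 9, 12],
        [3, 4, 25, 9, 13],
    ],
    index=["Array", "Bag", "DataFrame", "Delayed", "Futures", "ML", "Xarray"],
    columns=priorities,
)

# Divide every row by its own total so that each row sums to 1.
# This makes APIs with different numbers of users comparable.
shares_2020 = counts_2020.div(counts_2020.sum(axis=1), axis=0)

print(shares_2020.round(2))
```

Normalizing by row answers "among users of this API, what fraction picked each priority?", which is the comparison the shading is meant to convey.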

Examples are again the most important (for all API groups except Futures). But “Performance improvements” is now the second-most important improvement (except for Futures, where it’s the most important). How should we interpret this? A charitable interpretation is that Dask’s users are scaling to larger problems and are running into new scaling challenges. A less charitable interpretation is that our users’ workflows are the same but Dask is getting slower!
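The ranking described above can be checked directly against the two tables. The following is a rough sketch using pandas with the counts copied from this post; `top_two` and `priority_share` are hypothetical helper names, not part of the survey analysis notebook.

```python
import pandas as pd

priorities = [
    "Bug fixes",
    "More documentation",
    "More examples in my field",
    "New features",
    "Performance improvements",
]
apis = ["Array", "Bag", "DataFrame", "Delayed", "Futures", "ML", "Xarray"]

# Counts copied from the 2019 and 2020 tables in this post.
counts_2019 = pd.DataFrame(
    [
        [10, 24, 62, 15, 25],
        [3, 11, 16, 10, 7],
        [16, 32, 71, 39, 26],
        [16, 22, 55, 26, 27],
        [12, 9, 25, 20, 17],
        [5, 11, 23, 11, 7],
        [8, 11, 34, 7, 9],
    ],
    index=apis,
    columns=priorities,
)
counts_2020 = pd.DataFrame(
    [
        [12, 16, 56, 15, 23],
        [7, 5, 24, 7, 16],
        [24, 21, 67, 22, 41],
        [15, 19, 46, 17, 34],
        [9, 10, 21, 13, 24],
        [6, 4, 21, 9, 12],
        [3, 4, 25, 9, 13],
    ],
    index=apis,
    columns=priorities,
)


def top_two(counts):
    """Most and second-most selected priority for each API."""
    rows = {api: row.nlargest(2).index.tolist() for api, row in counts.iterrows()}
    return pd.DataFrame.from_dict(rows, orient="index", columns=["first", "second"])


def priority_share(counts, priority):
    """Share of each API's responses that went to one priority."""
    return counts[priority] / counts.sum(axis=1)


top_2019 = top_two(counts_2019)
top_2020 = top_two(counts_2020)

# Positive values mean performance gained share between 2019 and 2020.
perf_change = priority_share(counts_2020, "Performance improvements") - priority_share(
    counts_2019, "Performance improvements"
)

print(top_2020)
print(perf_change.round(3))
```

Comparing shares rather than raw counts keeps the year-over-year comparison fair, since the two surveys had different numbers of respondents.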

What other systems do you use?

SSH continues to be the most popular “cluster resource manager”. This was the big surprise last year, so we put in some work to make it nicer. Aside from that, not much has changed.

And Dask users are about as happy with its stability as last year.

Takeaways

Overall, most things are similar to last year.
Documentation, especially domain-specific examples, continues to be important. That said, our documentation is probably better than it was last year.
More users are pushing Dask further. Investing in performance is likely to be valuable.

Thanks again to all the respondents!