Setup

  • DGX-1 Workstation
  • Host Memory: 512 GB
  • GPUs: Tesla V100 x 8
  • cudf 0.6
  • cuml 0.6
  • dask 1.1.4
  • Jupyter notebook

TL;DR: Hyper-parameter optimization is functional but slow with cuML.

cuML and Dask Hyper-parameter Optimization

cuML is an open source, GPU-accelerated machine learning library, primarily developed at NVIDIA, which mirrors the Scikit-Learn API. The current suite of algorithms includes GLMs, Kalman filtering, clustering, and dimensionality reduction. Many of these machine learning algorithms use hyper-parameters: parameters that control the training process but are not themselves “learned” during training. Often these parameters are coefficients or penalty thresholds, and finding the “best” hyper-parameter can be computationally costly. In the PyData community, we often reach for Scikit-Learn’s GridSearchCV or RandomizedSearchCV for easy definition of the search space for hyper-parameters; this process is called hyper-parameter optimization. Within the Dask community, Dask-ML has incrementally improved the efficiency of hyper-parameter optimization by leveraging both Scikit-Learn and Dask to use multi-core and distributed schedulers: Grid and RandomizedSearch with DaskML.
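
As a point of reference, one common pattern here is to run Scikit-Learn’s own GridSearchCV on a Dask cluster through the joblib “dask” backend. Below is a minimal sketch of that route; the local Client and the toy parameter grid are illustrative and not the benchmark configuration used later in this post.

import numpy as np
from dask.distributed import Client
from sklearn import datasets, linear_model
from sklearn.externals.joblib import parallel_backend  # joblib as vendored by scikit-learn
from sklearn.model_selection import GridSearchCV

client = Client()  # with a Client created, joblib's 'dask' backend can ship work to the cluster

diabetes = datasets.load_diabetes()
param_grid = {'alpha': np.logspace(-3, -1, 10)}

search = GridSearchCV(linear_model.Ridge(), param_grid, scoring='r2', cv=5)

# the cross-validated fits run as tasks on the Dask cluster instead of local threads
with parallel_backend('dask'):
    search.fit(diabetes.data, diabetes.target)

print(search.best_params_)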

With cuML, the newly created drop-in replacement for Scikit-Learn, we experimented with Dask-ML’s GridSearchCV. In the upcoming 0.6 release of cuML, the estimators are serializable and functional within the Scikit-Learn/Dask-ML framework, but slow compared with Scikit-Learn estimators. And while speeds are slow now, we know how to boost performance, have filed several issues, and hope to show performance gains in future releases.

All code and timing measurements can be found in this Jupyter notebook.

Fast Fitting!

cuML is fast! But finding that speed requires developing a bit of GPU knowledge and some intuition. For example, there is a non-zero cost to moving data from the host to the GPU and, when the data is “small”, there is little to no performance gain. “Small”, currently, might mean less than 100 MB.
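
To build that intuition, it helps to time the host-to-device copy on its own. The sketch below does only that; the array sizes are arbitrary, and the column-by-column cuDF construction simply mirrors the pattern used later in this post.

import time
import numpy as np
import cudf

for n_rows in (10_000, 1_000_000, 10_000_000):  # roughly 0.8 MB, 80 MB, 800 MB of float64
    data = np.random.random((n_rows, 10))
    start = time.time()
    # copy each column from host memory into a cuDF (GPU) DataFrame
    record = (('fea%d' % i, data[:, i]) for i in range(data.shape[1]))
    gdf = cudf.DataFrame(record)
    elapsed = time.time() - start
    print('%.1f MB -> GPU in %.3f s' % (data.nbytes / 1e6, elapsed))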

In the following example we use the diabetes dataset provided by sklearn and linearly fit the data with ridge regression:

\[\min\limits_w ||y - Xw||^2_2 + \alpha ||w||^2_2\]

alpha is the hyper-parameter, which we initially set to 1.
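
As a quick sanity check on the objective above, ridge regression without an intercept has the closed-form solution \(w = (X^TX + \alpha I)^{-1}X^Ty\), which a few lines of NumPy can reproduce. This is only meant to illustrate the role of alpha, not how Scikit-Learn or cuML solve the problem internally.

import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target
alpha = 1.0

# closed-form ridge solution (no intercept): w = (X^T X + alpha * I)^-1 X^T y
w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

skl = linear_model.Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
print(np.allclose(w, skl.coef_))  # should print True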

import numpy as np
from cuml import Ridge as cumlRidge
import dask_ml.model_selection as dcv
from sklearn import datasets, linear_model
from sklearn.externals.joblib import parallel_backend
from sklearn.model_selection import train_test_split, GridSearchCV

# load the diabetes dataset and hold out 20% for testing
diabetes = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.2)

fit_intercept = True
normalize = False
alpha = np.array([1.0])

ridge = linear_model.Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')
cu_ridge = cumlRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver="eig")

ridge.fit(X_train, y_train)
cu_ridge.fit(X_train, y_train)

The above ran with a single timing measurement of:

  • Scikit-Learn Ridge: 28 ms
  • cuML Ridge: 1.12 s

But the data is quite small, ~28 KB. Increasing the size to ~2.8 GB and re-running, we see significant gains:

import cudf

# duplicate the ~28 KB training set roughly 1e5 times to reach ~2.8 GB
dup_data = np.array(np.vstack([X_train] * int(1e5)))
dup_train = np.array(np.hstack([y_train] * int(1e5)))

dup_ridge = linear_model.Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')
dup_cu_ridge = cumlRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver="eig")

# move data from host to device
record_data = (('fea%d' % i, dup_data[:, i]) for i in range(dup_data.shape[1]))
gdf_data = cudf.DataFrame(record_data)
gdf_train = cudf.DataFrame(dict(train=dup_train))

# sklearn
dup_ridge.fit(dup_data, dup_train)

# cuml
dup_cu_ridge.fit(gdf_data, gdf_train.train)

With new timing measurements of:

  • Scikit-Learn Ridge: 4.82 s ± 694 ms
  • cuML Ridge: 450 ms ± 47.6 ms

With more data we clearly see faster fitting times, but the time to move the data to the GPU (through cuDF) was 19.7 s. This cost of data movement is one of the reasons why RAPIDS/cuDF was developed: keep data on the GPU and avoid having to move it back and forth.
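
That conversion cost amortizes once the data lives on the device. For example, reusing the same gdf_data for several candidate alphas pays the transfer cost only once; a small sketch continuing with the variables defined above:

# gdf_data / gdf_train already live on the GPU from the snippet above,
# so each additional fit reuses them without another host-to-device copy
for a in (0.001, 0.01, 0.1, 1.0):
    model = cumlRidge(alpha=a, fit_intercept=fit_intercept,
                      normalize=normalize, solver="eig")
    model.fit(gdf_data, gdf_train.train)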

Hyper-Parameter Optimization Experiments

So moving data to the GPU can be costly, but once it is there, with larger data sizes, we see significant performance gains. Naively, we thought, “well, we have GPU machine learning, we have distributed hyper-parameter optimization… we should have distributed, GPU-accelerated, hyper-parameter optimization!”

Scikit-Learn assumes a specific, well-defined API for the estimators over which it performs hyper-parameter optimization. Most estimators/classifiers in Scikit-Learn look like the following:

class DummyEstimator(BaseEstimator):
    def __init__(self, params=...):
        ...

    def fit(self, X, y=None):
        ...

    def predict(self, X):
        ...

    def score(self, X, y=None):
        ...

    def get_params(self, deep=True):
        ...

    def set_params(self, **params):
        ...
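
That surface is all GridSearchCV ever touches: for each parameter combination and CV fold it essentially clones the estimator, sets the candidate parameters, fits, and scores. The stripped-down sketch below (not Scikit-Learn’s actual implementation) shows why get_params/set_params/fit/score are the whole contract.

from sklearn.base import clone
from sklearn.model_selection import KFold, ParameterGrid

def toy_grid_search(estimator, param_grid, X, y, cv=5):
    """Roughly what GridSearchCV does with the estimator API."""
    results = []
    for params in ParameterGrid(param_grid):
        for train_idx, test_idx in KFold(n_splits=cv).split(X):
            est = clone(estimator)      # relies on get_params()
            est.set_params(**params)    # relies on set_params()
            est.fit(X[train_idx], y[train_idx])
            results.append((params, est.score(X[test_idx], y[test_idx])))
    return results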

When we started experimenting with hyper-parameter optimization, we found a few holes in the API. These have since been resolved, mostly by matching argument structures and adding various getters/setters:

  • get_params and set_params (#271)
  • fix/clf-solver (#318)
  • map fit_transform to sklearn implementation (#330)
  • Fea get params small changes (#322)

With the holes plugged, we tested again. Using the same diabetes data set, we now perform hyper-parameter optimization, searching over many alpha values for the best-scoring alpha.

params = {'alpha': np.logspace(-3, -1, 10)}
clf = linear_model.Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')
cu_clf = cumlRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver="eig")

grid = GridSearchCV(clf, params, scoring='r2')
grid.fit(X_train, y_train)

cu_grid = GridSearchCV(cu_clf, params, scoring='r2')
cu_grid.fit(X_train, y_train)
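
In both cases the fitted search object exposes the usual Scikit-Learn result attributes, so inspecting the winning alpha looks the same regardless of which estimator ran underneath:

import pandas as pd

print(cu_grid.best_params_, cu_grid.best_score_)

# the full cross-validation table, one row per candidate alpha
print(pd.DataFrame(cu_grid.cv_results_)[['param_alpha', 'mean_test_score', 'mean_fit_time']])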

Again, reminding ourselves that the data is small (~28 KB), we don’t expect cuML to perform faster than Scikit-Learn here; instead, we want to demonstrate functionality. We also tried swapping out Scikit-Learn’s GridSearchCV for Dask-ML’s implementation (which adheres to the same API) in order to use all of the GPUs we have available in parallel.

params = {'alpha': np.logspace(-3, -1, 10)}
clf = linear_model.Ridge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver='cholesky')
cu_clf = cumlRidge(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, solver="eig")

grid = dcv.GridSearchCV(clf, params, scoring='r2')
grid.fit(X_train, y_train)

cu_grid = dcv.GridSearchCV(cu_clf, params, scoring='r2')
cu_grid.fit(X_train, y_train)
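
dcv.GridSearchCV runs its fits as tasks on whatever Dask scheduler is active, so using all eight GPUs comes down to standing up one worker per GPU and connecting a Client before calling fit. One possible wiring is sketched below, assuming dask-cuda’s LocalCUDACluster is installed; this is not necessarily the exact setup used for the timings that follow.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster  # assumption: dask-cuda is available

cluster = LocalCUDACluster()   # spins up one Dask worker per visible GPU on the DGX-1
client = Client(cluster)       # dask_ml picks up this distributed scheduler automatically

cu_grid = dcv.GridSearchCV(cu_clf, params, scoring='r2')
cu_grid.fit(X_train, y_train)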

Timing Measurements:

GridSearchCV implementation | sklearn Ridge     | cuML Ridge
Scikit-Learn                | 88.4 ms ± 6.11 ms | 6.51 s ± 132 ms
Dask-ML                     | 873 ms ± 347 ms   | 740 ms ± 142 ms

Unsurprisingly, we see that GridSearchCV and Ridge regression from Scikit-Learn are the fastest in this context. There is a cost to distributing work and data, and, as we previously mentioned, to moving data from host to device.

How does performance scale as we scale data?

# duplicate the diabetes data 100x (~2.8 MB) and 1000x (~28 MB)
two_dup_data = np.array(np.vstack([X_train] * int(1e2)))
two_dup_train = np.array(np.hstack([y_train] * int(1e2)))
three_dup_data = np.array(np.vstack([X_train] * int(1e3)))
three_dup_train = np.array(np.hstack([y_train] * int(1e3)))

# cuML + Dask-ML on the 100x and 1000x copies
cu_grid = dcv.GridSearchCV(cu_clf, params, scoring='r2')
cu_grid.fit(two_dup_data, two_dup_train)

cu_grid = dcv.GridSearchCV(cu_clf, params, scoring='r2')
cu_grid.fit(three_dup_data, three_dup_train)

# Scikit-Learn + Dask-ML on the 1000x copy
grid = dcv.GridSearchCV(clf, params, scoring='r2')
grid.fit(three_dup_data, three_dup_train)

Timing Measurements:

Data size | cuML + Dask-ML | sklearn + Dask-ML
2.8 MB    | 13.8 s         |
28 MB     | 1 min 17 s     | 4.87 s

cuML + Dask-ML (distributed GridSearchCV) does significantly worse as the data size increases! Why? Primarily for two reasons:

  1. Non-optimized movement of data between host and device, compounded by the number of devices and the size of the parameter space
  2. Scoring methods are not implemented within cuML

Below is the Dask graph for the GridSearchCV fit:

There are 50 instances (cv=5 folds times 10 alpha parameters) of chunking up our test data set and scoring performance. That means we are moving data back and forth between host and device 50 times for fitting and another 50 times for scoring. That’s not great, but it’s also very solvable: build scoring functions for GPUs!
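
To sketch what scoring on the GPU could look like: r² is just a handful of reductions, so if predictions and targets are already device arrays the score never needs to leave the GPU. A hedged example using CuPy follows; cuML’s eventual built-in scorers (tracked in the issues below) may well look different.

import cupy as cp

def gpu_r2_score(y_true, y_pred):
    """Compute r^2 with CuPy reductions so nothing is copied back to the host."""
    y_true = cp.asarray(y_true)
    y_pred = cp.asarray(y_pred)
    ss_res = cp.sum((y_true - y_pred) ** 2)
    ss_tot = cp.sum((y_true - cp.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

Scoring this way only helps if the predictions stay on the device as well, which is part of why the built-in scorers issue below matters.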

Immediate Future Work

We know the problems, GitHub issues have been filed, and we are working on them. Come help!

  • Built In Scorers (#242)
  • DeviceNDArray as input data (#369)
  • Communication with UCX (#2344)
