Executive Summary

We look at how to create a basic image segmentation pipeline, using the dask-image library.


The content of this blog post originally appeared as a conference talk in 2020.

Just show me the code

If you want to run this yourself, you’ll need to download the example data from the Broad Bioimage Benchmark Collection: https://bbbc.broadinstitute.org/BBBC039

And install these requirements:

pip install dask-image>=0.4.0 tifffile

Here’s our full pipeline:

import numpy as np
from dask_image.imread import imread
from dask_image import ndfilters, ndmorph, ndmeasure

images = imread('data/BBBC039/images/*.tif')
smoothed = ndfilters.gaussian_filter(images, sigma=[0, 1, 1])
thresh = ndfilters.threshold_local(smoothed, blocksize=images.chunksize)
threshold_images = smoothed > thresh
structuring_element = np.array([[[0, 0, 0], [0, 0, 0], [0, 0, 0]], [[0, 1, 0], [1, 1, 1], [0, 1, 0]], [[0, 0, 0], [0, 0, 0], [0, 0, 0]]])
binary_images = ndmorph.binary_closing(threshold_image, structure=structuring_element)
label_images, num_features = ndmeasure.label(binary_image)
index = np.arange(num_features)
area = ndmeasure.area(images, label_images, index)
mean_intensity = ndmeasure.mean(images, label_images, index)

You can keep reading for a step by step walkthrough of this image segmentation pipeline, or you can skip ahead to the sections on custom functions, scaling up computation, or GPU acceleration.

Image segmentation pipeline

Our basic image segmentation pipeline has five steps:

  1. Reading in data
  2. Filtering images
  3. Segmenting objects
  4. Morphological operations
  5. Measuring objects

Set up your python environment

Before we begin, we’ll need to set up our python virtual environment.

At a minimum, you’ll need:

pip install dask-image>=0.4.0 tifffile matplotlib

Optionally, you can also install the napari image viewer to visualize the image segmentation.

pip install "napari[all]"

To use napari from IPython or jupyter, run the %gui qt magic in a cell before calling napari. See the napari documentation for more details.

Download the example data

We’ll use the publically available image dataset BBBC039 Caicedo et al. 2018, available from the Broad Bioimage Benchmark Collection Ljosa et al., Nature Methods, 2012. You can download the dataset here: https://bbbc.broadinstitute.org/BBBC039

Example image from the BBBC039 dataset, Broad Bioimage Benchmark Collection

These are fluorescence microscopy images, where we see the nuclei in individual cells.

Step 1: Reading in data

Step one in our image segmentation pipeline is to read in the image data. We can do that with the dask-image imread function.

We pass the path to the folder full of *.tif images from our example data.

from dask_image.imread import imread

images = imread('data/BBBC039/images/*.tif')

HTML reprsentation of a Dask array

By default, each individual .tif file on disk has become one chunk in our Dask array.

Step 2: Filtering images

Denoising images with a small amount of blur can improve segmentation later on. This is a common first step in a lot of image segmentation pipelines. We can do this with the dask-image gaussian_filter function.

from dask_image import ndfilters

smoothed = ndfilters.gaussian_filter(images, sigma=[0, 1, 1])

Step 3: Segmenting objects

Next, we want to separate the objects in our images from the background. There are lots of different ways we could do this. Because we have fluorescent microscopy images, we’ll use a thresholding method.

Absolute threshold

We could set an absolute threshold value, where we’d consider pixels with intensity values below this threshold to be part of the background.

absolute_threshold = smoothed > 160

Let’s have a look at these images with the napari image viewer. First we’ll need to use the %gui qt magic:

%gui qt

And now we can look a the images with napari:

import napari

viewer = napari.Viewer()
viewer.add_image(images, contrast_limits=[0, 2000])

Absolute threshold napari screenshot

But there’s a problem here.

When we look at the results for different image frames, it becomes clear that there is no “one size fits all” we can use for an absolute threshold value. Some images in the dataset have quite bright backgrounds, others have fluorescent nuclei with low brightness. We’ll have to try a different kind of thresholding method.

Local threshold

We can improve the segmentation using a local thresholding method.

If we calculate a threshold value independently for each image frame then we can avoid the problem caused by fluctuating background intensity between frames.

thresh = ndfilters.threshold_local(smoothed, images.chunksize)
threshold_images = smoothed > thresh
# Let's take a look at the images with napari

Local threshold napari screenshot

The results here look much better, this is a much cleaner separation of nuclei from the background and it looks good for all the image frames.

Step 4: Morphological operations

Now that we have a binary mask from our threshold, we can clean it up a bit with some morphological operations.

Morphological operations are changes we make to the shape of structures a binary image. We’ll briefly describe some of the basic concepts here, but for a more detailed reference you can look at this excellent page of the OpenCV documentation.

Erosion is an operation where the edges of structures in a binary image are eaten away, or eroded.

Example: Erosion of a binary image

Image credit: OpenCV documentation

Dilation is the opposite of an erosion. With dilation, the edges of structures in a binary image are expanded.

Example: Dilation of a binary image

Image credit: OpenCV documentation

We can combine morphological operations in different ways to get useful effects.

A morphological opening operation is an erosion, followed by a dilation.

Example: Morphological opening of a binary image

Image credit: OpenCV documentation

In the example image above, we can see the left hand side has a noisy, speckled background. If the structuring element used for the morphological operations is larger than the size of the noisy speckles, they will disappear completely in the first erosion step. Then when it is time to do the second dilation step, there’s nothing left of the noise in the background to dilate. So we have removed the noise in the background, while the major structures we are interested in (in this example, the J shape) are restored almost perfectly.

Let’s use this morphological opening trick to clean up the binary images in our segmentation pipeline.

from dask_image import ndmorph
import numpy as np

structuring_element = np.array([
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 1, 0], [1, 1, 1], [0, 1, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]]])
binary_images = ndmorph.binary_opening(threshold_images, structure=structuring_element)

You’ll notice here that we need to be a little bit careful about the structuring element. All our image frames are combined in a single Dask array, but we only want to apply the morphological operation independently to each frame. To do this, we sandwich the default 2D structuring element between two layers of zeros. This means the neighbouring image frames have no effect on the result.

# Default 2D structuring element

[[0, 1, 0],
 [1, 1, 1],
 [0, 1, 0]]

Step 5: Measuring objects

The last step in any image processing pipeline is to make some kind of measurement. We’ll turn our binary mask into a label image, and then measure the intensity and size of those objects.

For the sake of keeping the computation time in this tutorial nice and quick, we’ll measure only a small subset of the data. Let’s measure all the objects in the first three image frames (roughly 300 nuclei).

from dask_image import ndmeasure

# Create labelled mask
label_images, num_features = ndmeasure.label(binary_images[:3], structuring_element)
index = np.arange(num_features - 1) + 1  # [1, 2, 3, ...num_features]

Here’s a screenshot of the label image generated from our mask.

Label image napari screenshot

>>> print("Number of nuclei:", num_features.compute())

Number of nuclei: 271

Measure objects in images

The dask-image ndmeasure subpackage includes a number of different measurement functions. In this example, we’ll choose to measure:

  1. The area in pixels of each object, and
  2. The average intensity of each object.
area = ndmeasure.area(images[:3], label_images, index)
mean_intensity = ndmeasure.mean(images[:3], label_images, index)

Run computation and plot results

import matplotlib.pyplot as plt

plt.scatter(area, mean_intensity, alpha=0.5)
plt.gca().update(dict(title="Area vs mean intensity", xlabel='Area (pixels)', ylabel='Mean intensity'))

Matplotlib graph of dask-image measurement results:

Custom functions

What if you want to do something that isn’t included in the dask-image API? There are several options we can use to write custom functions.

Dask map_overlap and map_blocks

The Dask array map_overlap and map_blocks are what is used to build most of the functions in dask-image. You can use them yourself too. They will apply a function to each chunk in a Dask array.

import dask.array as da

def my_custom_function(args):
    # ... does something really cool

result = da.map_overlap(my_custom_function, my_dask_array, args)

You can read more about overlapping computations here.

Dask delayed

If you want more flexibility and fine-grained control over your computation, then you can use Dask delayed. You can get started with the Dask delayed tutorial here.

scikit-image apply_parallel function

If you’re a person who does a lot of image processing in python, one tool you’re likely to already be using is scikit-image. I’d like to draw your attention to the apply_parallel function available in scikit-image. It uses map-overlap, and can be very helpful.

It’s useful not only when when you have big data, but also in cases where your data fits into memory but the computation you want to apply to the data is memory intensive. This might cause you to exceed the available RAM, and apply_parallel is great for these situations too.

Scaling up computation

When you want to scale up from a laptop onto a supercomputing cluster, you can use dask-distributed to handle that.

from dask.distributed import Client

# Setup a local cluster
# By default this sets up 1 worker per core
client = Client()

See the documentation here to get set up for your system.

Bonus content: using arrays on GPU

We’ve recently been adding GPU support to dask-image.

We’re able to add GPU support using a library called CuPy. CuPy is an array library with a numpy-like API, accelerated with NVIDIA CUDA. Instead of having Dask arrays which contain numpy chunks, we can have Dask arrays containing cupy chunks instead. This blogpost explains the benefits of GPU acceleration and gives some benchmarks for computations on CPU, a single GPU, and multiple GPUs.

GPU support available in dask-image

From dask-image version 0.6.0, there is GPU array support for four of the six subpackages:

  • imread
  • ndfilters
  • ndinterp
  • ndmorph

Subpackages of dask-image that do not yet have GPU support are.

  • ndfourier
  • ndmeasure

We hope to add GPU support to these in the future.

An example

Here’s an example of an image convolution with Dask on the CPU:

# CPU example
import numpy as np
import dask.array as da
from dask_image.ndfilters import convolve

s = (10, 10)
a = da.from_array(np.arange(int(np.prod(s))).reshape(s), chunks=5)
w = np.ones(a.ndim * (3,), dtype=np.float32)
result = convolve(a, w)

And here’s the same example of an image convolution with Dask on the GPU. The only thing necessary to change is the type of input arrays.

# Same example moved to the GPU
import cupy  # <- import cupy instead of numpy (version >=7.7.0)
import dask.array as da
from dask_image.ndfilters import convolve

s = (10, 10)
a = da.from_array(cupy.arange(int(cupy.prod(cupy.array(s)))).reshape(s), chunks=5)  # <- cupy dask array
w = cupy.ones(a.ndim * (3,))  # <- cupy array
result = convolve(a, w)

You can’t mix arrays on the CPU and arrays on the GPU in the same computation. This is why the weights w must be a cupy array in the second example above.

Additionally, you can transfer data between the CPU and GPU. So in situations where the GPU speedup is larger than than cost associated with transferring data, this may be useful to do.

Reading in images onto the GPU

Generally, we want to start our image processing by reading in data from images stored on disk. We can use the imread function with the arraytype=cupy keyword argument to do this.

from dask_image.imread import imread

images = imread('data/BBBC039/images/*.tif')
images_on_gpu = imread('data/BBBC039/images/*.tif', arraytype="cupy")

How you can get involved

Create and share your own segmentation or image processing workflows with Dask (join the current discussion on segmentation or propose a new blogpost topic here)

Contribute to adding GPU support to dask-image: https://github.com/dask/dask-image/issues/133

blog comments powered by Disqus