1. What is a Dask cluster?

Dask is a flexible library for parallel computing in Python. It works at the scale of a single laptop, enabling parallel processing and extending the size of datasets you can conveniently work with from "fits in memory" to "fits on disk".

However, it can also work across a cluster of multiple machines. This is typically how you will use it on DataLabs, taking advantage of the platform's compute power.

A Dask cluster consists of:

  • A scheduler: this is responsible for deciding how to perform your calculation. It subdivides the work into chunks and co-ordinates how those chunks are performed across a number of workers.
  • A number of workers: these perform the chunks of calculation that they have been allocated.
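As a minimal sketch of this scheduler-plus-workers structure, the snippet below uses Dask's `LocalCluster` to start both components on a single machine (assuming `dask.distributed` is installed). On DataLabs the scheduler and workers would instead run on separate cluster nodes, but the pieces are the same.

```python
from dask.distributed import LocalCluster

# Start a scheduler and two workers on this machine.
# processes=False keeps the workers as threads so the sketch is portable;
# a real deployment would use separate worker processes or machines.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)

scheduler_address = cluster.scheduler_address  # where clients connect
n_workers = len(cluster.workers)               # workers awaiting chunks

print(scheduler_address)
print(n_workers)

cluster.close()
```

The scheduler address printed here is what a client connects to in order to submit calculations.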

In your lab notebook, you will start:

  • A Dask client: this is what lets your notebook talk to the scheduler of the Dask cluster, telling the scheduler what calculation you want to perform.
  • A Dask dashboard: this will give you an indication of how your work is being chunked up, and how your cluster is being utilised.
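The notebook side can be sketched as follows (again assuming `dask.distributed` is installed; the local cluster here stands in for the one DataLabs provides). The client connects to the scheduler, exposes the dashboard URL, and submits work on your behalf.

```python
from dask.distributed import Client, LocalCluster

# Stand-in for the cluster DataLabs would provide.
cluster = LocalCluster(n_workers=2, processes=False)

# The client is your notebook's connection to the scheduler.
client = Client(cluster)

# URL of the Dask dashboard - open it in a browser to watch how the
# work is chunked up and how the workers are being utilised.
# (Returns an empty string if the dashboard dependencies are absent.)
print(cluster.dashboard_link)

# The client sends this calculation to the scheduler, which allocates
# it to a worker and hands back a future for the result.
future = client.submit(sum, range(10))
result = future.result()
print(result)  # 45

client.close()
cluster.close()
```

In a notebook you would normally leave the client open for the session rather than closing it after one calculation.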

Further reading: