Poster Abstract

P10.38 Eduardo Gonzalez Solares (Institute of Astronomy)

Theme: Data processing pipelines

Distributed and parallel Python pipelines using Docker and Dask

We present a simple and extensible architecture for running distributed and parallel pipelines on a compute cluster.

Owl is a framework to execute jobs written in Python and C/C++ on a compute cluster backed by Docker Swarm and Dask. In essence, Owl serves as a simple scheduler that accepts job descriptions, queues them and submits them for execution, keeping track of their progress.
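The sketch below illustrates this queue-and-track scheduling pattern in plain Python. It is not Owl's actual code; the class and method names are hypothetical and stand in for the job description, submission and progress-tracking steps mentioned above.

```python
# Illustrative sketch of the scheduling pattern described above (queue job
# descriptions, submit them for execution, track progress). Not Owl's code;
# all names are hypothetical.
import queue


class SimpleScheduler:
    def __init__(self):
        self.jobs = queue.Queue()  # pending job descriptions
        self.status = {}           # job id -> "queued" | "running" | "done"

    def submit(self, job_id, job_description):
        self.status[job_id] = "queued"
        self.jobs.put((job_id, job_description))

    def run(self, execute):
        # `execute` is whatever actually runs a job, e.g. launching containers
        while not self.jobs.empty():
            job_id, description = self.jobs.get()
            self.status[job_id] = "running"
            execute(description)
            self.status[job_id] = "done"
```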

Docker provides reproducible environments to run the code, and Swarm mode allows us to distribute the containers across any number of computers. The software stack asks the Swarm manager for the necessary resources, which are allocated as containers spread across the different cluster nodes. With Dask running in the containers, we can use resources beyond a single machine and spread computations in parallel across many nodes of the cluster.
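A minimal sketch of how work could be fanned out across Dask workers running in those containers is shown below, using the standard dask.distributed client API. The scheduler address and the `process` function are placeholders, not part of Owl.

```python
# Connect to the Dask scheduler running inside the cluster and spread a
# computation across the workers in the containers.
from dask.distributed import Client

client = Client("tcp://dask-scheduler:8786")  # placeholder scheduler address


def process(item):
    # Stand-in for a real pipeline step.
    return item ** 2


futures = client.map(process, range(100))  # fan out across the workers
results = client.gather(futures)           # collect the results back
```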

Pipelines can be developed and tested on a laptop and scaled up to hundreds of nodes. They are distributed as standard Python packages and installed with pip in the Docker containers spawned by Owl workers.

Users write pipeline definition files that contain the name of the pip package, the parameters needed to run the pipeline, and the resources to allocate in the cluster. Owl takes care of creating and monitoring the resources needed to run each pipeline.
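As a hypothetical illustration of the kind of information such a definition file carries (package name, pipeline parameters, cluster resources), the snippet below parses an example definition; the field names and layout are illustrative only and do not reflect Owl's actual schema.

```python
# Parse a hypothetical pipeline definition. The YAML content and field names
# are illustrative, not Owl's actual format.
import yaml

definition = yaml.safe_load("""
name: example-pipeline
package: example-pipeline-package   # installed with pip in the worker containers
parameters:
  input_path: /data/raw
  output_path: /data/processed
resources:
  workers: 10
  memory: 4GB
""")

print(definition["package"], definition["resources"]["workers"])
```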

We will show examples of analysis of the resulting data carried out in Jupyter notebooks, also running in Docker containers.

The complete architecture is running as a pilot in a medical project while being transitioned to astronomy workflows.