Scaling Notebooks as Re-configurable Cloud Workflows

Open Access
Authors
Publication date 2022
Journal Data Intelligence
Volume | Issue number 4 | 2
Pages (from-to) 409-425
Number of pages 17
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
  • Faculty of Science (FNWI) - Institute for Biodiversity and Ecosystem Dynamics (IBED)
Abstract

Literate computing environments, such as the Jupyter (i.e., Jupyter Notebooks, JupyterLab, and JupyterHub), have been widely used in scientific studies; they allow users to interactively develop scientific code, test algorithms, and describe the scientific narratives of the experiments in an integrated document. To scale up scientific analyses, many implemented Jupyter environment architectures encapsulate the whole Jupyter notebooks as reproducible units and autoscale them on dedicated remote infrastructures (e.g., high-performance computing and cloud computing environments). The existing solutions are still limited in many ways, e.g., 1) the workflow (or pipeline) is implicit in a notebook, and some steps can be generically used by different code and executed in parallel, but because of the tight cell structure, all steps in the Jupyter notebook have to be executed sequentially and lack of the flexibility of reusing the core code fragments, and 2) there are performance bottlenecks that need to improve the parallelism and scalability when handling extensive input data and complex computation. In this work, we focus on how to manage the workflow in a notebook seamlessly. We 1) encapsulate the reusable cells as RESTful services and containerize them as portal components, 2) provide a composition tool for describing workflow logic of those reusable components, and 3) automate the execution on remote cloud infrastructure. Empirically, we validate the solution’s usability via a use case from the Ecology and Earth Science domain, illustrating the processing of massive Light Detection and Ranging (LiDAR) data. The demonstration and analysis show that our method is feasible, but that it needs further improvement, especially on integrating distributed workflow scheduling, automatic deployment, and execution to develop as a mature approach.

Document type Article
Language English
Published at https://doi.org/10.1162/dint_a_00140
Other links https://www.scopus.com/pages/publications/85129621867
Downloads
dint_a_00140 (Final published version)
Permalink to this page
Back