Scaling Notebooks as Re-configurable Cloud Workflows

Y. Wang; S. Koulouzis; R. Bianchi; N. Li; Y. Shi; J. Timmermans; W.D. Kissling; Z. Zhao

doi:https://doi.org/10.1162/dint_a_00140

Authors	Y. Wang; S. Koulouzis; R. Bianchi; N. Li; Y. Shi; J. Timmermans; W.D. Kissling; Z. Zhao
Publication date	2022
Journal	Data Intelligence
Volume | Issue number	4 | 2
Pages (from-to)	409-425
Number of pages	17
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI); Faculty of Science (FNWI) - Institute for Biodiversity and Ecosystem Dynamics (IBED)
Abstract	Literate computing environments, such as Jupyter (i.e., Jupyter Notebooks, JupyterLab, and JupyterHub), have been widely used in scientific studies; they allow users to interactively develop scientific code, test algorithms, and describe the scientific narratives of the experiments in an integrated document. To scale up scientific analyses, many implemented Jupyter environment architectures encapsulate whole Jupyter notebooks as reproducible units and autoscale them on dedicated remote infrastructures (e.g., high-performance computing and cloud computing environments). The existing solutions are still limited in many ways, e.g., 1) the workflow (or pipeline) is implicit in a notebook, and some steps could be reused generically by different code and executed in parallel, but because of the tight cell structure, all steps in the Jupyter notebook have to be executed sequentially and lack the flexibility to reuse the core code fragments, and 2) there are performance bottlenecks, so parallelism and scalability need to improve when handling extensive input data and complex computation. In this work, we focus on how to manage the workflow in a notebook seamlessly. We 1) encapsulate the reusable cells as RESTful services and containerize them as portable components, 2) provide a composition tool for describing the workflow logic of those reusable components, and 3) automate the execution on remote cloud infrastructure. Empirically, we validate the solution’s usability via a use case from the Ecology and Earth Science domain, illustrating the processing of massive Light Detection and Ranging (LiDAR) data. The demonstration and analysis show that our method is feasible but needs further improvement, especially in integrating distributed workflow scheduling, automatic deployment, and execution, before it can develop into a mature approach.
Document type	Article
Language	English
Published at	https://doi.org/10.1162/dint_a_00140
Other links	https://www.scopus.com/pages/publications/85129621867
Downloads	dint_a_00140 (Final published version)
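
The abstract's idea of turning notebook cells into reusable components and composing them into a workflow with parallel steps can be illustrated with a minimal sketch. This is not the authors' implementation: the step names, the tile-splitting logic, the `WORKFLOW` dictionary format, and the `run_workflow` helper are all hypothetical, and plain Python functions stand in for the containerized RESTful services described in the paper.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for notebook cells. In the approach described in
# the abstract, each would be a containerized RESTful service; here each
# is a plain function that receives the results of its upstream steps.
def split_tiles(inputs):
    # Assumption: one large LiDAR point cloud is split into three tiles.
    return ["tile-0", "tile-1", "tile-2"]

def extract_features(inputs):
    return [f"{tile}:features" for tile in inputs["split_tiles"]]

def compute_stats(inputs):
    return len(inputs["split_tiles"])

def merge(inputs):
    return {
        "features": inputs["extract_features"],
        "tile_count": inputs["compute_stats"],
    }

# Workflow description: step name -> (callable, names of upstream steps).
# This format is an assumption made for illustration only.
WORKFLOW = {
    "split_tiles": (split_tiles, []),
    "extract_features": (extract_features, ["split_tiles"]),
    "compute_stats": (compute_stats, ["split_tiles"]),
    "merge": (merge, ["extract_features", "compute_stats"]),
}

def run_workflow(workflow):
    """Run steps in dependency order; independent steps run in parallel."""
    results = {}
    pending = dict(workflow)
    with ThreadPoolExecutor() as pool:
        while pending:
            # A step is ready once all of its upstream results exist.
            ready = [
                name
                for name, (_, deps) in pending.items()
                if all(dep in results for dep in deps)
            ]
            if not ready:
                raise ValueError("workflow has a cycle or a missing dependency")
            futures = {}
            for name in ready:
                func, deps = pending.pop(name)
                upstream = {dep: results[dep] for dep in deps}
                futures[name] = pool.submit(func, upstream)
            # Wait for the whole batch before looking for the next ready set.
            for name, future in futures.items():
                results[name] = future.result()
    return results
```

Calling `run_workflow(WORKFLOW)` runs `extract_features` and `compute_stats` concurrently once `split_tiles` has finished, which is the kind of parallelism a strictly sequential notebook cannot express. In a real deployment, each function call would presumably be replaced by an HTTP request to the corresponding service.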

UvA-DARE

Digital Academic Repository