Declarative machine learning pipeline management via logical query plans

Open Access
Award date 17-09-2025
ISBN
  • 9789493431867
Number of pages 227
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Machine learning (ML) systems are increasingly used to automate impactful decisions. However, the resulting ML pipelines suffer from many unsolved data management challenges, including ensuring correctness, reliability, and compliance with legal regulations. We argue that this is because current ML pipeline libraries and ML cloud services lack fundamental data-centric abstractions, similar to logical query plans in databases.
In this thesis, we propose a new approach for managing ML pipelines by extracting "logical query plans" from ML pipeline code and automatically inferring pipeline semantics. Based on this declarative pipeline abstraction, we show how to enhance ML applications and tooling with provenance tracking and automatic rewriting capabilities. This enables us to manage ML pipelines and their data artifacts in novel ways.
We present five contributions, algorithmic and methodological, each embodied in a library, and organise them into three main parts. The first part focuses on the extraction of logical query plans from ML pipelines. mlinspect enables lightweight inspection and efficient instrumentation of pipelines. Lester automatically rewrites messy imperative code into clean declarative pipelines before deployment, enabling high automation for production use cases such as compliance with the right to be forgotten. The second part addresses automatic rewriting of ML pipelines. mlwhatif enables data-centric what-if analysis and optimises what-if workloads via multi-query optimisation. mlidea assists with interactively improving ML data preparation code via automatically generated "shadow pipelines" and incremental view maintenance. The final part covers provenance tracking and reasoning about the input and output data of ML pipelines: ArgusEyes provides provenance-based screening in continuous integration workflows.
Document type PhD thesis
Language English