The ramifications of data handling for computational models

Open Access
Award date 04-12-2024
ISBN
  • 9789493391598
Series SIKS Dissertation series, 2024-37
Number of pages 196
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Many computational models rely on real-world data, and their successful application depends on access to accurate and representative datasets. As models and data grow more sophisticated, the steps required to move from data collection to model output are becoming more complex. The effects of data handling steps such as cleaning and integration on the modelling and simulation process have generally not been addressed in the literature. This thesis investigates these effects and introduces frameworks for reasoning about them.
The first part of the thesis focuses on network diffusion models, which simulate spreading processes (such as the spread of disease or information) over networks. The outputs of such models are highly sensitive to the topology of the network on which they are run. From both theoretical and practical perspectives, we show how sensitive model outputs can be to data handling and suggest how results can be reported to support transparent, holistic conclusions.
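To make the sensitivity to topology concrete, the following is a minimal sketch and not the thesis's own model or data. It assumes an independent cascade process as the diffusion model, a six-node toy network, and a hypothetical cleaning step that drops a single edge; all function names, the transmission probability, and the number of runs are illustrative choices.

```python
import random


def build_adjacency(edges):
    """Turn an undirected edge list into a dict mapping node -> set of neighbours."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def independent_cascade(adjacency, seed_node, p, rng):
    """Run one independent-cascade outbreak and return the set of infected nodes."""
    infected = {seed_node}
    frontier = [seed_node]
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbour in adjacency.get(node, ()):
                # A newly infected node gets one chance to infect each
                # still-susceptible neighbour, succeeding with probability p.
                if neighbour not in infected and rng.random() < p:
                    infected.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return infected


def mean_outbreak_size(edges, seed_node, p, runs, rng_seed):
    """Average final outbreak size over repeated simulations with a fixed RNG seed."""
    rng = random.Random(rng_seed)
    adjacency = build_adjacency(edges)
    total = sum(
        len(independent_cascade(adjacency, seed_node, p, rng)) for _ in range(runs)
    )
    return total / runs


# Toy network (assumed): two triangles joined by a single bridge edge (2, 3).
raw_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
# Hypothetical cleaning step: the bridge is dropped, e.g. as a low-confidence record.
cleaned_edges = [edge for edge in raw_edges if edge != (2, 3)]

raw_size = mean_outbreak_size(raw_edges, seed_node=0, p=0.5, runs=2000, rng_seed=1)
cleaned_size = mean_outbreak_size(cleaned_edges, seed_node=0, p=0.5, runs=2000, rng_seed=1)

print(f"mean outbreak size, raw network:     {raw_size:.2f}")
print(f"mean outbreak size, cleaned network: {cleaned_size:.2f}")
```

In this toy setting, removing one edge caps the outbreak at the three nodes of the seed's triangle, so a single cleaning decision changes what the simulation can report. Reporting results for both variants of the network is one simple way to make that dependence visible.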
In the second part, we extend the analysis to other data handling problems and model types. First, we illustrate how data preprocessing decisions can change the structure of word co-occurrence networks. Such networks are frequently used in the social sciences, where the decisions behind network construction are often not justified. Second, we show how mismatched cleaning pipelines for training and test data can affect the performance and selection of regression models. Such mismatches can have surprising consequences, with strong implications for practice.
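The effect of preprocessing on co-occurrence networks can be sketched with a toy corpus. This is an illustrative example under stated assumptions, not the construction used in the thesis: co-occurrence is defined here as two distinct words appearing in the same document, and the three documents, the one-word stopword list, and the function names are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(documents, stopwords=frozenset(), lowercase=True):
    """Count how often two distinct words appear in the same document."""
    edges = Counter()
    for text in documents:
        tokens = text.lower().split() if lowercase else text.split()
        vocabulary = sorted({token for token in tokens if token not in stopwords})
        # Every unordered pair of distinct words in a document adds one
        # to the weight of the edge between them.
        for pair in combinations(vocabulary, 2):
            edges[pair] += 1
    return edges


def degree(edges):
    """Number of distinct neighbours of each node in the co-occurrence network."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


# Toy corpus (assumed); note the capitalised "The" in the last document.
documents = [
    "the model uses the data",
    "the data shapes the network",
    "The network drives the model",
]
# Hypothetical stopword list, for illustration only.
stopwords = frozenset({"the"})

raw = cooccurrence_edges(documents, lowercase=False)
cleaned = cooccurrence_edges(documents, stopwords=stopwords, lowercase=True)

print("edges without preprocessing:", len(raw))
print("edges with lowercasing and stopword removal:", len(cleaned))
print("degree of 'the' without preprocessing:", degree(raw)["the"])
print("nodes after preprocessing:", sorted(degree(cleaned)))
```

Without preprocessing, the stopword "the" is the best-connected node in this corpus and "The" appears as a separate node; after lowercasing and stopword removal both vanish and the network has fewer than half as many edges. Stating which of these choices were made, and why, is what allows a reader to interpret the resulting structure.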
Document type PhD thesis
Language English