Automated Data Cleaning Can Hurt Fairness in Machine Learning-based Decision Making

S. Guha; F.A. Khan; J. Stoyanovich; S. Schelter

doi:https://doi.org/10.1109/ICDE55515.2023.00303

Authors	S. Guha, F.A. Khan, J. Stoyanovich, S. Schelter
Publication date	2023
Book title	2023 IEEE 39th International Conference on Data Engineering
Book subtitle	ICDE 2023 : proceedings : 3-7 April 2023, Anaheim, California
ISBN	9798350322286
ISBN (electronic)	9798350322279
Event	IEEE 39th International Conference on Data Engineering
Pages (from-to)	3747-3754
Number of pages	8
Publisher	Los Alamitos, California: IEEE Computer Society
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract	In this paper, we interrogate whether data quality issues track demographic characteristics such as sex, race and age, and whether automated data cleaning, of the kind commonly used in production ML systems, impacts the fairness of predictions made by these systems. To the best of our knowledge, the impact of data cleaning on fairness in downstream tasks has not been investigated in the literature. We first analyze the tuples flagged by common error detection strategies in five research datasets. We find that, while specific data quality issues, such as higher rates of missing values, are associated with membership in historically disadvantaged groups, poor data quality does not generally track demographic group membership. As a follow-up, we conduct a large-scale empirical study on the impact of automated data cleaning on fairness, involving more than 26,000 model evaluations on five datasets. We observe that, while automated data cleaning has an insignificant impact on both accuracy and fairness in the majority of cases, it is more likely to worsen fairness than to improve it, especially when the cleaning techniques are not carefully chosen. This finding is both significant and worrying, given that it potentially implicates many production ML systems. We make our code and experimental results publicly available. The analysis we conducted in this paper is difficult, primarily because it requires that we think holistically about disparities in data quality, disparities in the effectiveness of data cleaning methods, and impacts of such disparities on ML model performance for different demographic groups. Such holistic analysis can and should be supported with the help of data engineering research. Towards this goal, we envision the development of fairness-aware data cleaning methods, and their integration into complex pipelines for ML-based decision making.
Document type	Conference contribution
Language	English
Published at	https://doi.org/10.1109/ICDE55515.2023.00303 (Final published version)
Published at	https://api.semanticscholar.org/CorpusID:259327473 (Final published version)
Other links	https://www.proceedings.com/69836.html
Downloads	demodq (Accepted author manuscript); Automated_Data_Cleaning_Can_Hurt_Fairness_in_Machine_Learning-based_Decision_Making (Final published version)