Privacy-Preserving Record Linkage with Spark

Authors
Publication date 2019
Book title Proceedings 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
Book subtitle CCGrid 2019, Cyprus
ISBN
  • 9781728109138
ISBN (electronic)
  • 9781728109121
Event 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2019
Pages (from-to) 440-448
Number of pages 9
Publisher IEEE Computer Society
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Privacy considerations obligate careful and secure processing of personal data. This is especially true when personal data is linked against databases from other organizations. During such endeavors, privacy-preserving record linkage (PPRL) can be utilized to prevent needless exposure of sensitive information to other organizations. With the increase of personal data that is being gathered and analyzed, scalable PPRL capable of handling massive databases is much desired. In this work, we evaluate Apache Spark as an option to scale PPRL. Not only is it valuable to have a scalable PPRL implementation, but one based on Spark would also be commonly deployable and could take advantage of further development of the ecosystem. Our results show that a PPRL solution based on Spark outperforms alternatives when it comes to handling multiple millions of records, can scale to dozens of nodes, and is on par with regular record linkage implementations in terms of achieved results.
Document type Conference contribution
Language English
Published at https://doi.org/10.1109/CCGRID.2019.00058
Other links https://www.scopus.com/pages/publications/85069434168