Table Representation Learning

Open Access
Authors
Supervisors
Cosupervisors
  • C. Demiralp
Award date 23-02-2024
ISBN
  • 9789464837438
Number of pages 153
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
The increasing amount of data being collected, stored, and analyzed creates a need for efficient, scalable, and robust methods to handle it. A large fraction of this data is stored in structured formats such as relational tables and spreadsheets. To automate data management and analysis tasks for such data, this thesis investigates how the success of representation learning for data modalities like text and images can be extended to tabular data, an area we refer to as Table Representation Learning (TRL). First, we present the results of our exploration of neural embedding methods for automatic table comprehension. We contribute Sherlock, a deep learning model for detecting the semantic types of table columns in a scalable, robust, and accurate manner. We also present AdaTyper, a system that effectively and efficiently adapts such semantic type detection models to unseen data distributions and semantic types. As existing TRL models need to be pre-trained on large-scale representative datasets, we introduce GitTables, a large corpus of relational tables extracted from CSV files stored on GitHub. The tables in GitTables better resemble typical database tables and are enriched with column semantics. Finally, we present Observatory, a framework and tool for analyzing what learned table embeddings capture with regard to the structural and content characteristics of relational tables. With Observatory, we identify strengths and weaknesses of existing TRL models and the table embeddings they generate. The thesis concludes with a summary of our findings and a discussion of open challenges and future opportunities for Table Representation Learning.
Document type PhD thesis
Language English