Filtered Corpus Training (FiCT) Shows that Language Models Can Generalize from Indirect Evidence

Open Access
Publication date 2024
Journal Transactions of the Association for Computational Linguistics
Volume 12
Pages (from-to) 1597-1615
Number of pages 19
Organisations
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract
This paper introduces Filtered Corpus Training, a method that trains language models (LMs) on corpora with certain linguistic constructions filtered out from the training data, and uses it to measure the ability of LMs to perform linguistic generalization on the basis of indirect evidence. We apply the method to both LSTM and Transformer LMs (of roughly comparable size), developing filtered corpora that target a wide range of linguistic phenomena. Our results show that while transformers are better qua LMs (as measured by perplexity), both models perform equally and surprisingly well on linguistic generalization measures, suggesting that they are capable of generalizing from indirect evidence.
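The core idea of the abstract can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes a target construction can be approximated by a surface pattern (here a crude regex proxy for relative clauses, a hypothetical choice) and removes matching sentences before training.

```python
import re

# Hypothetical proxy for a target construction (relative clauses).
# A real filter would use a parser or annotated corpus, not a regex.
TARGET_PATTERN = re.compile(r"\b(that|which|who)\b")

def filter_corpus(sentences):
    """Return only sentences that do NOT contain the target construction."""
    return [s for s in sentences if not TARGET_PATTERN.search(s)]

corpus = [
    "The cat sat on the mat.",
    "The book that I read was long.",
    "Dogs bark loudly.",
]

filtered = filter_corpus(corpus)
print(filtered)
```

A model trained on `filtered` never sees the target construction directly; testing it on held-out examples of that construction then measures generalization from indirect evidence.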
Document type Article
Language English
Published at https://doi.org/10.1162/tacl_a_00720