The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives

Open Access
Authors
Publication date 2019
Host editors
  • K. Inui
  • J. Jiang
  • V. Ng
  • X. Wan
Book title 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing
Book subtitle EMNLP-IJCNLP 2019 : proceedings of the conference : November 3-7, 2019, Hong Kong, China
ISBN (electronic)
  • 9781950737901
Event 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing
Pages (from-to) 4396-4406
Publisher Stroudsburg, PA: The Association for Computational Linguistics
Organisations
  • Faculty of Science (FNWI)
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract
We seek to understand how the representations of individual tokens and the structure of the learned feature space evolve between layers in deep neural networks under different learning objectives. We chose the Transformer for our analysis as it has been shown to be effective on various tasks, including machine translation (MT), standard left-to-right language modeling (LM) and masked language modeling (MLM). Previous work used black-box probing tasks to show that the representations learned by the Transformer differ significantly depending on the objective. In this work, we use canonical correlation analysis and mutual information estimators to study how information flows across Transformer layers and observe that the choice of the objective determines this process. For example, as we move from the bottom to the top layers, information about the past in left-to-right language models vanishes and predictions about the future are formed. In contrast, for MLM, representations initially acquire information about the context around the token, partially forgetting the token identity and producing a more generalized token representation. The token identity then gets recreated at the top MLM layers.
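The layer-wise comparison described in the abstract can be illustrated with a plain CCA similarity between two activation matrices. This is a simplified sketch, not the paper's exact method (which uses projection-weighted variants and mutual information estimators); the function name and the random data below are illustrative assumptions:

```python
import numpy as np

def mean_cca_similarity(X, Y, eps=1e-10):
    """Mean canonical correlation between two sets of layer activations.

    X, Y: (n_tokens, dim) activation matrices for the same tokens,
    e.g. taken from two different Transformer layers.
    Returns a scalar in [0, 1]; higher means more similar subspaces.
    """
    # Center each view.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)

    # Whiten each view via SVD, dropping near-zero directions.
    def whiten(A):
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        return U[:, s > eps * s.max()]

    Ux, Uy = whiten(X), whiten(Y)
    # Singular values of Ux^T Uy are the canonical correlations.
    rho = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return float(np.mean(np.clip(rho, 0.0, 1.0)))

rng = np.random.default_rng(0)
H = rng.normal(size=(512, 64))  # stand-in for one layer's activations
same = mean_cca_similarity(H, H)  # identical views give similarity 1
noise = mean_cca_similarity(H, rng.normal(size=(512, 64)))
print(same, noise)
```

Comparing each layer's activations against the embedding layer (or against the final layer) with such a score is one way to trace how much token-identity information survives as representations evolve upward.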
Document type Conference contribution
Note With attachment
Language English
Published at https://doi.org/10.18653/v1/D19-1448