Representation collapse and where to find it in transformer-based neural machine translation
| Authors | |
|---|---|
| Supervisors | |
| Cosupervisors | |
| Award date | 27-05-2026 |
| ISBN |
|
| Number of pages | 157 |
| Organisations |
|
| Abstract |
In recent years, Transformer-based models have become the standard in both language and image processing. Despite their strong performance on downstream tasks, prior studies have shown that these models often suffer from representation collapse, i.e., a phenomenon where learned representations lack diversity and fail to preserve meaningful distances between features. This issue is especially pronounced when applying distance-based retrieval methods, which rely on well-dispersed representations to distinguish between semantically different inputs. In this work, we investigate representation collapse in the context of two machine translation paradigms that depend on distance-based methods: 𝑘-Nearest Neighbors Machine Translation, which augments translation models with nearest-neighbor examples, and continuous-output NMT, which relaxes the discrete vocabulary bottleneck by generating outputs in continuous space, but requires similarity-based rounding during the decoding step. Chapter 3 examines representation collapse in continuous-output machine translation. By comparing collapsed embeddings with uniformly spread, or dispersed, randomly generated embeddings on the sphere, we show that dispersion positively impacts translation quality, especially for rare tokens. Chapter 4 deepens this analysis by studying angular dispersion on hyperspheres and proposing optimization strategies for obtaining well-spread representations for both image and text data. Chapter 5 applies dispersion to datastore key representations in kNN-MT and demonstrates that it improves both lookup efficiency and translation accuracy. Finally, Chapter 6 generalizes the study of dispersion in Transformer-based models, providing a deeper analysis of its origins and proposing strategies to mitigate its effects.
Overall, our findings highlight the fundamental role of geometric properties in shaping learned representations for text generation, and they offer practical solutions for advancing retrieval-augmented and continuous-output NMT systems. |
| Document type | PhD thesis |
| Language | English |