- Quantitative comparative linguistics based on tiny corpora: N-gram language identification of wordlists of known and unknown languages from Amazonia and beyond
- Journal of Quantitative Linguistics
- Volume | Issue number
- 22 | 3
- Pages (from-to)
- Document type
- Faculty of Humanities (FGw)
- Amsterdam Center for Language and Communication (ACLC)
Can an unknown Amazonian language be identified by statistical procedures based on n-gram frequencies if only a short list of words is available and at the same time, the available data of the potential candidate languages are also limited to relatively short wordlists? In this paper we show that n-gram frequencies (specifically 1-grams and 2-grams) allow us to identify languages reliably based on as few as 20 words, as long as these are transcribed consistently, and as long as characteristic monogram and bigram frequencies for these languages have previously been established based on consistently transcribed data. If no such consistently transcribed data are available, as is the case of our Amazonian case study, such procedures clearly fail for wordlists with 50 or fewer words. Our study thus contributes to exploring the limits of such automated detection procedures, both in terms of corpus size and transcription quality.
- go to publisher's site
If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library, or send a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible.