Separating the wheat from the chaff: A topic and keyword-based procedure for identifying research-relevant text

Authors
  • A. Eads
  • A. Schofield
  • F. Mahootian
  • D. Mimno
Publication date 06-2021
Journal Poetics
Article number 101527
Volume | Issue number 86
Number of pages 19
Organisations
  • Faculty of Social and Behavioural Sciences (FMG) - Amsterdam Institute for Social Science Research (AISSR)
Abstract
Social scientists are using computational tools to expand their content research beyond what is humanly readable. This often requires filtering corpora for complex research concepts. The commonly used off-the-shelf filtering techniques are untested at this task. Dictionaries may not recognize language outside of investigators’ expectations, and thresholding on topic proportions from topic models may fail to identify brief references to concepts. We develop a typology of texts as they relate to a research concept and use this to structure a filtering procedure. We compare our procedure's performance with dictionary-only and topic-proportion-only approaches on two corpora (government speeches and academic articles) and two research concepts (housing crisis and inequality). Our procedure outperforms both alternatives overall and on each type of relevant text in the typology. An open-source software package is available for implementing the procedure. This provides researchers with a more structured and tested approach for filtering text data. Additionally, the types-of-text typology analysis provides a unique examination of what constitutes a filtered dataset, allowing researchers to consider how conclusions may be affected.
Document type Article
Language English
Published at https://doi.org/10.1016/j.poetic.2020.101527