A Dynamically Expandable, Weakly Supervised, Audio-Visual Database of Stuttered Speech

Authors
Publication date 2020
Book title MuCAI '20
Book subtitle proceedings of the 1st International Workshop on Multimodal Conversational AI : October 16, 2020, Virtual Event, USA
ISBN (electronic)
  • 9781450381567
Event 1st International Workshop on Multimodal Conversational AI
Pages (from-to) 9-13
Publisher New York, NY: Association for Computing Machinery
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Stuttering affects at least 1% of the world population. It is characterized by irregular disruptions in speech production. These interruptions occur in various forms and frequencies; repetitions of words or parts of words, prolongations, and blocks in getting the words out are the most common.
Accurate detection and classification of stuttering is important for assessing severity in speech therapy. Furthermore, real-time detection could create many new possibilities for reconstruction into fluent speech. Such an interface could help people utilize voice-based assistants like Apple Siri and Google Assistant, or make (video) phone calls more fluent through delayed delivery.
In this paper we present the first expandable audio-visual database of stuttered speech. We explore an end-to-end, real-time, multi-modal model for the detection and classification of stuttered blocks in unbounded speech. We also make use of video signals, since acoustic signals may not be produced immediately. Using multiple modalities, acoustic signals together with the secondary characteristics exhibited in visual signals, permits increased detection accuracy.
Document type Conference contribution
Note Title in ACM library: A Dynamic, Self Supervised, Large Scale AudioVisual Dataset for Stuttered Speech.
Language English
Published at https://doi.org/10.1145/3423325.3423733