Rapid Collection of Spontaneous Speech Corpora Using Telephonic Community Forums

Agha Ali Raza, Awais Athar, Shan Randhawa, Zain Tariq, Muhammad Bilal Saleem, Haris Bin Zia, Umar Saif, Roni Rosenfeld

We present a novel technique for rapid collection of spontaneous speech data over the mobile phone channel using telephonic community forums. Our public forum allows users to post audio messages, listen to messages posted by others, post votes and audio comments, and share content with friends through subsidized phone calls. The entertainment aspects and sharing features of the forum led to its viral spread in Pakistan. Within 8 months, it reached 11,017 users and gathered 1,207 hours of speech data comprising 57,454 audio-posts and 130,685 audio-comments, spanning Urdu and 9 regional languages. We trained an ASR system on just 9.5 hours of the corpus and obtained a word error rate (WER) of 24.19%. Community forums automatically overcome common challenges of spontaneous speech data collection, such as speaker recruitment, natural speech elicitation, content diversity, informed consent, sampling of real-world ambient noise, and reach (for geographically remote linguistic communities). This technique is especially useful for gathering speech corpora for under-resourced languages, thereby enabling the development of speech recognition, keyword spotting, speaker ID, and noise classification systems (among others) for such languages. It also allows rapid, automatic preservation of spoken languages and oral aspects of culture, and can be extended to collect speech data for endangered languages, oral cultures, and linguistic minorities.

DOI: 10.21437/Interspeech.2018-1139

Cite as: Raza, A.A., Athar, A., Randhawa, S., Tariq, Z., Saleem, M.B., Bin Zia, H., Saif, U., Rosenfeld, R. (2018) Rapid Collection of Spontaneous Speech Corpora Using Telephonic Community Forums. Proc. Interspeech 2018, 1021-1025, DOI: 10.21437/Interspeech.2018-1139.

  author={Agha Ali Raza and Awais Athar and Shan Randhawa and Zain Tariq and Muhammad Bilal Saleem and Haris {Bin Zia} and Umar Saif and Roni Rosenfeld},
  title={Rapid Collection of Spontaneous Speech Corpora Using Telephonic Community Forums},
  booktitle={Proc. Interspeech 2018},