12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

Efficient Harvesting of Internet Audio for Resource-Scarce ASR

Marelie H. Davel (1), Charl van Heerden (2), Neil Kleynhans (2), Etienne Barnard (1)

(1) North-West University, South Africa
(2) CSIR, South Africa

Spoken recordings that have been transcribed for human reading (e.g. as captions for audiovisual material, or to provide alternative modes of access to recordings) are widely available in many languages. Such recordings and transcriptions have proven to be a valuable source of ASR data in well-resourced languages, but have not been exploited to a significant extent in under-resourced languages or dialects. Techniques used to harvest such data typically assume the availability of a fairly accurate ASR system, which is generally not available when working with resource-scarce languages. In this work, we define a process whereby an ASR corpus is bootstrapped using unmatched ASR models in conjunction with speech and approximate transcriptions sourced from the Internet. We introduce a new segmentation technique based on the use of a phone-internal garbage model, and demonstrate how this technique (combined with limited filtering) can be used to develop a large, high-quality corpus in an under-resourced dialect with minimal effort.

Bibliographic reference. Davel, Marelie H. / Heerden, Charl van / Kleynhans, Neil / Barnard, Etienne (2011): "Efficient harvesting of internet audio for resource-scarce ASR", in INTERSPEECH-2011, 3153-3156.