First International Conference on Spoken Language Processing (ICSLP 90)
Two methods for transcribing spoken utterances into phonemes are described and compared. Both operate on symbolic input obtained by using Kohonen's Self-Organizing Feature Maps to vector-quantize speech frames into code sequences. The idea is to transform code sequences that have been degraded by coarticulation effects and by a non-ideal acoustic processor so that they are closer to ideal sequences. This makes the conversion into phonemes almost trivial, and it is accomplished without any statistical models. One method, called Dynamically Focusing Context (DFC), is based on context-sensitive transformation rules; the other is based on a feed-forward network trained using error back-propagation. The network uses multiresolution input derived from the code sequence. The two methods perform almost equally well: the DFC reduces the error rate by about 30 per cent, and the feed-forward network by about 25 per cent. The training time required by the feed-forward network is, however, orders of magnitude longer.
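As a rough illustration of the symbolic pipeline the abstract describes, the following minimal sketch quantizes frames into a code sequence and then converts the sequence into phoneme labels. It is not the authors' DFC method or their network: a plain nearest-neighbour lookup stands in for a trained Self-Organizing Feature Map codebook, and the "almost trivial" conversion is assumed here to be a merge of repeated labels with a minimum run length. The names `quantize`, `collapse`, and `min_run`, and the toy codebook, are hypothetical.

```python
from itertools import groupby


def quantize(frames, codebook):
    """Map each frame to the label of its nearest codebook vector.

    Assumption: `codebook` is a dict of label -> reference vector,
    standing in for a trained feature map.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(codebook, key=lambda label: sqdist(frame, codebook[label]))
        for frame in frames
    ]


def collapse(codes, min_run=2):
    """Merge runs of identical labels into single phoneme labels.

    Assumption: runs shorter than `min_run` are treated as noise and
    dropped; neighbouring survivors with the same label are merged.
    """
    phonemes = []
    for label, run in groupby(codes):
        long_enough = len(list(run)) >= min_run
        if long_enough and (not phonemes or phonemes[-1] != label):
            phonemes.append(label)
    return phonemes


# Toy data: two-dimensional "frames" and a two-entry codebook.
codebook = {"a": (0.0, 0.0), "s": (1.0, 1.0)}
frames = [(0.1, 0.0), (0.0, 0.2), (0.9, 1.0), (0.1, 0.1), (1.0, 0.8), (1.1, 1.0)]
codes = quantize(frames, codebook)
phonemes = collapse(codes)
```

In this toy run the isolated single-frame labels in the middle of the sequence are discarded by the run-length threshold, which hints at why cleaning up the code sequence first makes the final conversion simple.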
Bibliographic reference. Torkkola, Kari / Kokkonen, Mikko (1990): "A comparison of two methods to transcribe speech into phonemes: a rule-based method vs. back-propagation", In ICSLP-1990, 673-676.