Second International Conference on Spoken Language Processing (ICSLP'92)

Banff, Alberta, Canada
October 13-16, 1992

Optimal Speech Recognition Using Phone Recognition and Lexical Access

Andrej Ljolje, Michael D. Riley

Linguistics Research Department, AT&T Bell Laboratories, Murray Hill, NJ, USA

We present an optimal speech recognition system based on phone recognition and lexical access. The complete recognition process consists of five consecutive steps: 1) prediction of phone boundary locations; 2) phone likelihood calculation (context-independent phone lattice generation); 3) word lattice generation; 4) A* search for the n best sentences; 5) rescoring using context-dependent models and across-word-boundary co-articulation. Optimality, in this context, means that the correct answer is never discarded by any stage of the recognition process.
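
The n-best search in step 4 can be illustrated with a minimal sketch. The abstract does not describe the authors' implementation, so everything here is an assumption: the lattice is a list of `(src, dst, label, cost)` edges with integer nodes numbered in time order (`src < dst`), costs are negative log likelihoods, and the A* heuristic is the exact best remaining cost obtained from a backward pass. The function name `n_best_paths` and the toy lattice are hypothetical.

```python
import heapq
import math


def n_best_paths(edges, start, goal, n):
    """Return up to n lowest-cost label sequences from start to goal.

    Assumes integer node ids with src < dst on every edge (a time-ordered
    lattice), so a single backward sweep yields exact remaining costs.
    """
    out = {}
    for src, dst, label, cost in edges:
        out.setdefault(src, []).append((dst, label, cost))

    # Backward pass: h[node] is the best achievable cost from node to goal.
    nodes = {start, goal} | {e[0] for e in edges} | {e[1] for e in edges}
    h = {goal: 0.0}
    for node in sorted(nodes, reverse=True):
        if node == goal:
            continue
        h[node] = min(
            (cost + h.get(dst, math.inf) for dst, _, cost in out.get(node, [])),
            default=math.inf,
        )

    # Forward A*: with an exact heuristic, complete paths are popped in
    # order of increasing total cost, so the first n goal pops are the n best.
    heap = [(h[start], 0.0, start, ())]
    results = []
    while heap and len(results) < n:
        _, g, node, labels = heapq.heappop(heap)
        if node == goal:
            results.append((g, list(labels)))
            continue
        for dst, label, cost in out.get(node, []):
            if h[dst] < math.inf:  # skip edges that cannot reach the goal
                g_new = g + cost
                heapq.heappush(heap, (g_new + h[dst], g_new, dst, labels + (label,)))
    return results


# Toy word lattice (made-up words and costs, for illustration only).
toy_lattice = [
    (0, 1, "show", 1.0), (0, 1, "so", 1.5),
    (1, 2, "all", 0.5), (1, 2, "tall", 2.0),
    (2, 3, "ships", 1.0), (2, 3, "chips", 1.25),
]
top3 = n_best_paths(toy_lattice, start=0, goal=3, n=3)
```

Because the heuristic never overestimates the remaining cost, no path that belongs among the n best is pruned, which is one way to read the abstract's notion of a stage that never discards the correct answer as long as it is in the lattice.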

Optimal search is achieved by providing phone lattices which are rich enough to preserve the correct phone sequence and yet small enough to allow efficient lexical access routines. Thus, separating the search for the most likely word sequence into consecutive partial searches with a simple interface does not degrade performance, since we ensure that the correct answer is always passed on to the next level of processing. This also facilitates explicit modeling of alternative pronunciations within the scope of the DARPA Resource Management Task, which we use for testing the recognition system. The alternative pronunciations are explicitly used in aligning the training utterances when generating the acoustic models, and in the search for the best path through the phone lattice in lexical access. The whole system, although intimately connected to phonetic transcriptions of speech, has been implemented using automatic techniques and never requires human transcription of either training or testing utterances.
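
As a rough illustration of lexical access with alternative pronunciations, the sketch below matches every pronunciation of every word against a phone lattice and emits word-lattice edges. It is not the authors' routine: the edge format `(src, dst, phone, cost)`, the function name `word_lattice`, the phone symbols, and the tiny lexicon are all hypothetical, and costs are assumed to be additive negative log likelihoods.

```python
import math


def word_lattice(phone_edges, lexicon):
    """Build word edges (src, dst, word, cost) from a phone lattice.

    lexicon maps each word to a list of alternative pronunciations, each a
    non-empty tuple of phone symbols. For every (src, dst, word) triple the
    lowest cost over all pronunciations and lattice paths is kept.
    """
    out = {}
    for src, dst, phone, cost in phone_edges:
        out.setdefault(src, []).append((dst, phone, cost))

    best = {}  # (src, dst, word) -> lowest matching cost
    for word, pronunciations in lexicon.items():
        for pron in pronunciations:
            for start in out:
                # frontier maps a lattice node to the best cost of matching
                # the phones consumed so far, beginning at `start`.
                frontier = {start: 0.0}
                for phone in pron:
                    advanced = {}
                    for node, g in frontier.items():
                        for dst, p, cost in out.get(node, []):
                            if p == phone and g + cost < advanced.get(dst, math.inf):
                                advanced[dst] = g + cost
                    frontier = advanced
                for end, g in frontier.items():
                    key = (start, end, word)
                    if g < best.get(key, math.inf):
                        best[key] = g
    return sorted((s, e, w, c) for (s, e, w), c in best.items())


# Made-up phone lattice in which two vowels compete between nodes 1 and 2.
toy_phones = [
    (0, 1, "dh", 0.5), (1, 2, "ah", 1.0), (1, 2, "iy", 0.25),
    (2, 3, "k", 0.5), (3, 4, "ae", 0.5), (4, 5, "t", 0.5),
]
toy_lexicon = {
    "the": [("dh", "ah"), ("dh", "iy")],  # two alternative pronunciations
    "cat": [("k", "ae", "t")],
}
toy_words = word_lattice(toy_phones, toy_lexicon)
```

In this toy case both pronunciations of "the" span the same nodes, and the cheaper one determines the cost of the resulting word edge. The exhaustive loop over start nodes is chosen for clarity, not efficiency.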

The best phone recognition performance achieved on the Feb 89 test set is 84.5% accuracy and 87.7% correct. The best word recognition accuracy achieved using the word-pair grammar and context-independent lattices, but without across-word co-articulation, is 95.1%. After rescoring the top 10 sentence candidates using context-dependent acoustic models and alternative phonetic realizations obtained with knowledge of the whole sentence, the accuracy improves to 96.6%.
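
The abstract does not define "correct" and "accurate". A common scoring convention, assumed here, counts substitutions (S), deletions (D) and insertions (I) against N reference tokens: percent correct is 100 (N - S - D) / N, and accuracy additionally subtracts insertions. The sketch below uses made-up counts that are not taken from the paper.

```python
def percent_correct(n_ref, subs, dels):
    """Percent correct: insertions are not penalized (assumed convention)."""
    return 100.0 * (n_ref - subs - dels) / n_ref


def percent_accuracy(n_ref, subs, dels, ins):
    """Accuracy: percent correct with insertions also subtracted."""
    return 100.0 * (n_ref - subs - dels - ins) / n_ref


# Hypothetical counts for illustration only.
n_ref, subs, dels, ins = 200, 20, 10, 6
correct = percent_correct(n_ref, subs, dels)          # 85.0
accuracy = percent_accuracy(n_ref, subs, dels, ins)   # 82.0
```

If the paper follows this convention, the gap between the two reported phone figures (87.7% correct versus 84.5% accuracy) would reflect insertions amounting to about 3.2% of the reference phones.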

Bibliographic reference. Ljolje, Andrej / Riley, Michael D. (1992): "Optimal speech recognition using phone recognition and lexical access", in ICSLP-1992, 313-316.