EUROSPEECH 2001 Scandinavia
7th European Conference on Speech Communication and Technology

Aalborg, Denmark
September 3-7, 2001


Large Vocabulary Statistical Language Modeling for Continuous Speech Recognition in Finnish

Vesa Siivola, Mikko Kurimo, Krista Lagus

Helsinki University of Technology, Finland

Statistical language modeling (SLM) is an essential part of any large-vocabulary continuous speech recognition (LVCSR) system. The development of the standard SLM methods has been strongly affected by the goals of LVCSR in English. The structure of Finnish is substantially different from that of English, so if the standard SLMs are applied directly, success is by no means guaranteed. In this paper we describe our first attempts at building an LVCSR system for Finnish and the new SLMs that we have tried. One of our objectives has been the indexing and recognition of broadcast news, so issues of special interest to us are topic detection, word stemming, and modeling words that are poorly covered in the training data. Our new methods are based on neural computing using the self-organizing map (SOM), which has recently been shown to successfully extract and approximate latent semantic structures from massive text collections.


Bibliographic reference. Siivola, Vesa / Kurimo, Mikko / Lagus, Krista (2001): "Large vocabulary statistical language modeling for continuous speech recognition in Finnish", in EUROSPEECH-2001, 737-740.