Deep Neural Networks for Speech Technology

The best-performing systems at the time of writing (2024) are based on Deep Neural Networks (DNNs). These YouTube videos from MIT provide an introduction to DNNs:

  • Basics of Deep Learning covers the basics of neural nets and how they are trained: gradient descent and backpropagation (a minimal sketch follows this list).
  • RNNs and Transformers covers sequence modelling, recurrent neural nets, Long Short-Term Memory, Transformers and attention.
  • CNNs covers Convolutional Neural Networks; it is illustrated using vision, but the same principles apply to feature detection from speech audio.
  • Generative Modeling (e.g. GANs) covers generative models: the basis of ChatGPT and similar systems.
  • Diffusion Models covers diffusion models in the second part of the lecture (from 26:40 onwards), as well as some history and challenges of neural networks.
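As a concrete illustration of the first lecture's topics, here is a minimal sketch of gradient descent and backpropagation for a one-hidden-layer network, written in plain NumPy. The toy task, layer sizes and learning rate are all illustrative assumptions, not taken from the lectures.

    import numpy as np

    # Toy regression task: learn y = sin(x) on a small grid.
    rng = np.random.default_rng(0)
    X = np.linspace(-3, 3, 64).reshape(-1, 1)
    y = np.sin(X)

    # One hidden layer (tanh) with randomly initialised weights.
    W1 = rng.normal(scale=0.5, size=(1, 16)); b1 = np.zeros(16)
    W2 = rng.normal(scale=0.5, size=(16, 1)); b2 = np.zeros(1)
    lr = 0.05  # learning rate: an arbitrary choice for this toy problem

    for step in range(2000):
        # Forward pass.
        h = np.tanh(X @ W1 + b1)          # hidden activations
        y_hat = h @ W2 + b2               # network output
        loss = np.mean((y_hat - y) ** 2)  # mean squared error

        # Backward pass: the chain rule applied layer by layer.
        d_out = 2 * (y_hat - y) / len(X)     # dL/dy_hat
        dW2 = h.T @ d_out; db2 = d_out.sum(0)
        d_h = (d_out @ W2.T) * (1 - h ** 2)  # tanh'(z) = 1 - tanh(z)^2
        dW1 = X.T @ d_h; db1 = d_h.sum(0)

        # Gradient descent: step each parameter against its gradient.
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    print(f"final loss: {loss:.4f}")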

DNNs for Speech Synthesis

Simon King's Speech Synthesis course has some videos covering DNNs in Text-to-Speech; see the video section of the course here.

This Interspeech 2022 tutorial on Neural Speech Synthesis, given by Xu Tan (Microsoft) and Hung-yi Lee (National Taiwan University), is based on their review paper.

DNNs for ASR

DNN Feature Extraction: Wav2vec and wav2vec 2.0 are schemes for ‘Learning the Structure of Speech from Raw Audio’. A convolutional network is first pre-trained with a self-supervised objective: predicting future or masked portions of the audio from the surrounding context. This pre-training on unlabelled audio is followed by supervised fine-tuning on a much smaller labelled dataset. The resulting feature extractor outperforms conventional feature sets as a module within an ASR system. Wav2vec 2.0 was developed by Meta.
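As a hypothetical usage example, the sketch below loads a published wav2vec 2.0 checkpoint through the Hugging Face transformers library (not mentioned above) and transcribes one second of 16 kHz audio with greedy decoding. The checkpoint name and the silent placeholder waveform are illustrative choices, not part of the original text.

    import numpy as np
    import torch
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

    # A released checkpoint: self-supervised pre-training on unlabelled
    # audio, then supervised fine-tuning for ASR on LibriSpeech.
    name = "facebook/wav2vec2-base-960h"
    processor = Wav2Vec2Processor.from_pretrained(name)
    model = Wav2Vec2ForCTC.from_pretrained(name)

    # `waveform` should be 1-D mono audio sampled at 16 kHz;
    # one second of silence stands in for a real recording here.
    waveform = np.zeros(16000, dtype=np.float32)
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

    with torch.no_grad():
        logits = model(inputs.input_values).logits  # (batch, frames, vocab)

    # Greedy decoding: best token per frame; the processor's decoder
    # collapses repeats and removes blanks.
    ids = torch.argmax(logits, dim=-1)
    print(processor.batch_decode(ids)[0])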

Sequence Modelling: ASR is a sequence-to-sequence task. Such tasks can be handled by Recurrent Neural Networks (RNNs), where at each step the network has access to the current frame of input data and to feedback from its own state at the previous step.
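A minimal recurrent acoustic model in PyTorch might look like the sketch below. The class name, feature dimension (e.g. 40 log-mel features per frame) and label count are illustrative assumptions.

    import torch
    import torch.nn as nn

    class RNNAcousticModel(nn.Module):
        """At each time step the LSTM sees the current feature frame plus
        its own hidden state carried over from the previous step (the
        feedback loop); a linear layer maps the state to label scores."""
        def __init__(self, n_features=40, n_hidden=128, n_labels=30):
            super().__init__()
            self.rnn = nn.LSTM(n_features, n_hidden, batch_first=True)
            self.out = nn.Linear(n_hidden, n_labels)

        def forward(self, frames):         # frames: (batch, time, features)
            states, _ = self.rnn(frames)   # states: (batch, time, hidden)
            return self.out(states)        # scores: (batch, time, labels)

    # Example: one utterance of 100 frames of 40-dimensional features.
    model = RNNAcousticModel()
    print(model(torch.randn(1, 100, 40)).shape)  # torch.Size([1, 100, 30])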

Recurrent Neural Nets have some serious weaknesses: gradients vanish or explode over long sequences, and their step-by-step computation is hard to parallelise. In recent years they have been superseded by Transformer architectures built on attention, introduced by Google in the famous paper ‘Attention Is All You Need’. Transformer models designed for speech recognition usually keep the encoder-decoder layout of earlier seq2seq models, but replace recurrence with self-attention, in which every position in the sequence attends directly to every other position.
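The core operation is easy to state in code. Below is a minimal single-head scaled dot-product self-attention sketch in NumPy; the sequence length and projection sizes are arbitrary illustrative values.

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        """X: (time, d_model); Wq/Wk/Wv: (d_model, d_k) projections.
        Every output position attends to every input position at once,
        so there is no step-by-step recurrence as in an RNN."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])          # (time, time)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)                    # softmax over keys
        return w @ V                                     # (time, d_k)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 8))                    # 6 frames, d_model = 8
    Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)     # (6, 4)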

In end-to-end ASR the entire transformation from audio to words is accomplished by DNNs alone, with no explicit use of phonetic or linguistic knowledge.
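One widely used way to train such end-to-end models (not named above) is Connectionist Temporal Classification (CTC), where the network emits a label or a special blank symbol for every frame. The sketch below shows the standard greedy decoding rule, using hypothetical label ids.

    def ctc_greedy_decode(frame_ids, blank=0):
        """Collapse a per-frame best path into a label sequence:
        merge consecutive repeats, then drop the blank symbol."""
        out, prev = [], None
        for i in frame_ids:
            if i != prev and i != blank:
                out.append(i)
            prev = i
        return out

    # A blank between two identical labels keeps them distinct:
    print(ctc_greedy_decode([0, 7, 7, 0, 4, 4, 11, 11, 0, 11, 14]))
    # -> [7, 4, 11, 11, 14]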

Links:

DNN Toolkits