Deep Neural Networks for Speech Technology
The best-performing systems at the time of writing (2024) are based on Deep Neural Networks (DNNs). These YouTube videos from MIT provide an introduction to DNNs:
DNNs for Speech Synthesis
Simon King's Speech Synthesis course includes videos covering DNNs in Text-To-Speech; see the video section of the course here:
This Interspeech 2022 Tutorial on Neural Speech Synthesis is by Xu Tan (Microsoft) and Hung-yi Lee (National Taiwan University) and is based on their review paper.
DNNs for ASR
DNN Feature Extraction: Wav2vec and Wav2vec2 are schemes for ‘Learning the Structure of Speech from Raw Audio’. A convolutional network is trained with a self-supervised, contrastive objective (roughly, predicting representations of upcoming audio from the preceding context). This unsupervised pre-training is then followed by supervised fine-tuning on a much smaller labelled dataset. The resulting feature extractor outperforms conventional feature sets as a module within an ASR system. Both Wav2vec and Wav2vec2 were developed by Meta (formerly Facebook AI).
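As an illustration, here is a minimal sketch of using a pretrained Wav2vec2 model as a feature extractor, assuming the Hugging Face `transformers` and `torch` packages and the public checkpoint `facebook/wav2vec2-base-960h`; the audio here is a random placeholder, not a real recording.

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2Model

# Load a pretrained wav2vec 2.0 checkpoint (assumed public model name).
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

# One second of 16 kHz audio; a placeholder standing in for real speech.
waveform = np.random.randn(16000).astype(np.float32)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # last_hidden_state: (batch, frames, hidden_size) learned features,
    # used in place of conventional MFCC/filterbank features in an ASR system.
    features = model(**inputs).last_hidden_state

print(features.shape)  # e.g. torch.Size([1, 49, 768])
```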
Sequence Modelling: ASR is a sequence-to-sequence task. Such tasks can be handled by Recurrent Neural Networks, where at each step the network sees the current frame of input data together with feedback (its hidden state) from the previous step.
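A minimal sketch of this idea in PyTorch, assuming illustrative dimensions (80-dim filterbank frames, 256 hidden units, a 30-symbol output vocabulary):

```python
import torch
import torch.nn as nn

# An LSTM processes the input one frame at a time; its hidden state
# carries a summary of everything seen so far forward to the next step.
rnn = nn.LSTM(input_size=80, hidden_size=256, batch_first=True)
classifier = nn.Linear(256, 30)

frames = torch.randn(1, 100, 80)   # (batch, time, features): 100 frames
outputs, _ = rnn(frames)           # one hidden vector per time step
logits = classifier(outputs)       # one symbol prediction per input frame
print(logits.shape)                # torch.Size([1, 100, 30])
```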
Recurrent Neural Nets have some serious problems, notably vanishing gradients over long sequences and strictly sequential computation, and in recent years they have been superseded by Transformer architectures using attention, introduced by Google in the famous paper ‘Attention Is All You Need’. Transformer models designed for speech recognition usually follow the same encoder-decoder architecture as seq2seq models, but replace the recurrence adopted by RNNs with the self-attention mechanism, in which every frame can attend directly to every other frame.
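To make the self-attention computation concrete, here is a minimal single-head sketch (the projection matrices are random stand-ins; a real model learns them):

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape
    (time, d_model). Every frame attends to every other frame at once,
    rather than passing information step by step as an RNN does."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / math.sqrt(k.shape[-1])  # pairwise frame similarities
    weights = torch.softmax(scores, dim=-1)      # attention distribution
    return weights @ v                           # weighted sum of values

d_model = 64
x = torch.randn(100, d_model)                    # 100 frames of features
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)    # torch.Size([100, 64])
```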
In end-to-end ASR the transformation from audio to words is accomplished by DNNs alone, with no explicit pronunciation lexicon or hand-crafted phonetic and linguistic knowledge.
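As a sketch of how short this pipeline can be, the following uses a CTC-fine-tuned Wav2vec2 checkpoint from Hugging Face `transformers` to go straight from audio samples to text; the checkpoint name is the same assumed model as above, and the input is again a placeholder (real speech would decode to real words):

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform = np.random.randn(16000).astype(np.float32)  # placeholder audio
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits     # per-frame character scores

ids = torch.argmax(logits, dim=-1)      # greedy CTC decoding
print(processor.batch_decode(ids))      # text, straight from the waveform
```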
Links:
DNN Toolkits