Deep Neural Networks for Speech Technology
The best-performing systems at the time of writing (2024) are based on Deep Neural Networks (DNNs). These YouTube videos from MIT provide an introduction to DNNs:
DNNs for Speech Synthesis
Simon King's Speech Synthesis course includes videos covering DNNs in Text-To-Speech; see the video section of the course here:
This Interspeech 2022 Tutorial on Neural Speech Synthesis is by Xu Tan (Microsoft) and Hung-yi Lee (National Taiwan University) and is based on their review paper.
DNNs for ASR
DNN Feature Extraction: Wav2vec and Wav2vec2 are schemes for ‘Learning the Structure of Speech from Raw Audio’. A convolutional network is trained with a self-supervised, contrastive objective (roughly, predicting representations of upcoming audio from the preceding context). This unsupervised pre-training is then followed by supervised fine-tuning on a much smaller labelled dataset. The resulting feature extractor outperforms conventional feature sets as a module within an ASR system. Both Wav2vec and Wav2vec2 were developed by Meta (formerly Facebook AI).
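As an illustration, here is a minimal sketch of using a pretrained Wav2vec2 model as a feature extractor, assuming the Hugging Face `transformers` and `torch` packages and the public checkpoint `facebook/wav2vec2-base-960h`; the audio here is a random placeholder, not a real recording.

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2Model

# Load a pretrained wav2vec 2.0 checkpoint (assumed public model name).
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

# One second of 16 kHz audio; a placeholder standing in for real speech.
waveform = np.random.randn(16000).astype(np.float32)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # last_hidden_state: (batch, frames, hidden_size) learned features,
    # used in place of conventional MFCC/filterbank features in an ASR system.
    features = model(**inputs).last_hidden_state

print(features.shape)  # e.g. torch.Size([1, 49, 768])
```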
Sequence Modelling: ASR is a sequence-to-sequence task. Such tasks can be handled by Recurrent Neural Networks, where at each step the network sees the current frame of input data together with feedback (its hidden state) from the previous step.
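A minimal sketch of this idea in PyTorch, assuming illustrative dimensions (80-dim filterbank frames, 256 hidden units, a 30-symbol output vocabulary):

```python
import torch
import torch.nn as nn

# An LSTM processes the input one frame at a time; its hidden state
# carries a summary of everything seen so far forward to the next step.
rnn = nn.LSTM(input_size=80, hidden_size=256, batch_first=True)
classifier = nn.Linear(256, 30)

frames = torch.randn(1, 100, 80)   # (batch, time, features): 100 frames
outputs, _ = rnn(frames)           # one hidden vector per time step
logits = classifier(outputs)       # one symbol prediction per input frame
print(logits.shape)                # torch.Size([1, 100, 30])
```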
Recurrent Neural Nets have some serious problems, notably vanishing gradients over long sequences and strictly sequential computation, and in recent years they have been superseded by Transformer architectures using attention, introduced by Google in the famous paper ‘Attention Is All You Need’. Transformer models designed for speech recognition usually follow the same encoder-decoder architecture as seq2seq models, but replace the recurrence adopted by RNNs with the self-attention mechanism, in which every frame can attend directly to every other frame.
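To make the self-attention computation concrete, here is a minimal single-head sketch (the projection matrices are random stand-ins; a real model learns them):

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape
    (time, d_model). Every frame attends to every other frame at once,
    rather than passing information step by step as an RNN does."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / math.sqrt(k.shape[-1])  # pairwise frame similarities
    weights = torch.softmax(scores, dim=-1)      # attention distribution
    return weights @ v                           # weighted sum of values

d_model = 64
x = torch.randn(100, d_model)                    # 100 frames of features
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)    # torch.Size([100, 64])
```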
In end-to-end ASR the transformation from audio to words is accomplished by DNNs alone, with no explicit pronunciation lexicon or hand-crafted phonetic and linguistic knowledge.
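As a sketch of how short this pipeline can be, the following uses a CTC-fine-tuned Wav2vec2 checkpoint from Hugging Face `transformers` to go straight from audio samples to text; the checkpoint name is the same assumed model as above, and the input is again a placeholder (real speech would decode to real words):

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform = np.random.randn(16000).astype(np.float32)  # placeholder audio
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits     # per-frame character scores

ids = torch.argmax(logits, dim=-1)      # greedy CTC decoding
print(processor.batch_decode(ids))      # text, straight from the waveform
```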
Links:
DNN Toolkits