End-to-End Speech Intelligibility Prediction Using Time-Domain Fully Convolutional Neural Networks

Mathias B. Pedersen, Morten Kolbæk, Asger H. Andersen, Søren H. Jensen, Jesper Jensen


Data-driven speech intelligibility prediction has been slow to take off. Because datasets of measured speech intelligibility are scarce, current data-driven models are relatively small and rely on hand-picked features, and classical predictors based on psychoacoustic models and heuristics remain the state of the art. This work proposes NSIP, a U-Net-inspired fully convolutional neural network architecture, trained and tested on ten datasets to predict the intelligibility of time-domain speech. The architecture is compared with a frequency-domain data-driven predictor and with the classical state-of-the-art predictors STOI, ESTOI, HASPI and SIIB. NSIP is found to perform better on datasets seen in the training phase; on unseen datasets, it reaches performance comparable to the classical predictors.
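To make the idea of a fully convolutional time-domain predictor concrete, the following is a minimal toy sketch, not the NSIP architecture itself. The abstract gives no layer counts, kernel sizes, or training details, so everything here is an assumption made for illustration: the function names (`conv1d`, `upsample`, `toy_predictor`), the single encoder/decoder stage, the additive skip connection (U-Net-style networks usually concatenate instead), and the random, untrained weights. What it shows is only the structural property that such a network contains no fixed-size layers, so it accepts waveforms of any length.

```python
import math
import random


def conv1d(x, kernel, stride=1):
    """Zero-padded 1-D convolution (cross-correlation) for odd-length kernels."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(x), stride)
    ]


def relu(x):
    return [max(0.0, v) for v in x]


def upsample(x, factor, length):
    """Nearest-neighbour upsampling, cropped to `length` samples."""
    return [x[min(i // factor, len(x) - 1)] for i in range(length)]


def toy_predictor(wave, w_enc, w_mid, w_dec):
    """Hypothetical one-stage encoder-decoder; weights are illustrative only."""
    # Encoder: convolve, then halve the time resolution with a strided convolution.
    enc = relu(conv1d(wave, w_enc))
    mid = relu(conv1d(enc, w_mid, stride=2))
    # Decoder: upsample back to the encoder resolution and add the skip connection.
    up = upsample(mid, 2, len(enc))
    merged = [a + b for a, b in zip(up, enc)]
    logits = conv1d(merged, w_dec)
    # A sigmoid maps each sample to a score between 0 and 1; the mean is one summary value.
    scores = [1.0 / (1.0 + math.exp(-v)) for v in logits]
    return scores, sum(scores) / len(scores)


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(3)]
    for n in (160, 401):  # two different input lengths, same weights
        wave = [math.sin(0.05 * t) for t in range(n)]
        scores, mean_score = toy_predictor(wave, *weights)
        print(n, len(scores), round(mean_score, 3))
```

Because the weights are random, the printed scores carry no meaning; the point is that the same weights process inputs of different lengths and return one score per input sample. A trained model of this kind would learn its kernels from measured intelligibility data.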


DOI: 10.21437/Interspeech.2020-1740

Cite as: Pedersen, M.B., Kolbæk, M., Andersen, A.H., Jensen, S.H., Jensen, J. (2020) End-to-End Speech Intelligibility Prediction Using Time-Domain Fully Convolutional Neural Networks. Proc. Interspeech 2020, 1151-1155, DOI: 10.21437/Interspeech.2020-1740.


@inproceedings{Pedersen2020,
  author={Mathias B. Pedersen and Morten Kolbæk and Asger H. Andersen and Søren H. Jensen and Jesper Jensen},
  title={{End-to-End Speech Intelligibility Prediction Using Time-Domain Fully Convolutional Neural Networks}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1151--1155},
  doi={10.21437/Interspeech.2020-1740},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1740}
}