Fully-Convolutional Network for Pitch Estimation of Speech Signals

Luc Ardaillon, Axel Roebel

Estimating the fundamental frequency (F0) from audio is a necessary step in many speech processing tasks, such as speech synthesis, which requires accurate analysis of large datasets, or real-time voice transformation, which requires low computation times. Neural-network approaches recently proposed for F0 estimation outperform previous approaches in terms of accuracy. The work presented here aims to further improve on these CNN-based state-of-the-art approaches, especially for speech data. More specifically, we first propose to use the recent PaN speech synthesis engine to generate a high-quality speech database with a reliable ground-truth F0 annotation. We then propose three variants of a new fully-convolutional network (FCN) architecture that are shown to perform better than other similar data-driven methods, with a significantly reduced computational load that makes them more suitable for real-time purposes.

DOI: 10.21437/Interspeech.2019-2815

Cite as: Ardaillon, L., Roebel, A. (2019) Fully-Convolutional Network for Pitch Estimation of Speech Signals. Proc. Interspeech 2019, 2005-2009, DOI: 10.21437/Interspeech.2019-2815.

  author={Luc Ardaillon and Axel Roebel},
  title={{Fully-Convolutional Network for Pitch Estimation of Speech Signals}},
  booktitle={Proc. Interspeech 2019},