Effectiveness of Dynamic Features in INCA and Temporal Context-INCA

Nirmesh Shah, Hemant Patil

Non-parallel Voice Conversion (VC) has gained significant attention over the last decade. A key step in a standalone non-parallel VC task is obtaining corresponding speech frames from the source and target speakers before learning the mapping function. Obtaining such corresponding pairs is challenging because the two speakers may have spoken different utterances, in the same language or in different languages. The Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment (INCA) algorithm and its variant, Temporal Context (TC)-INCA, are popular unsupervised alignment algorithms. Both iteratively learn the mapping function after obtaining Nearest Neighbor (NN) aligned pairs from the intermediate converted and the target spectral features. In this paper, we propose using dynamic features along with static features to compute the NN aligned pairs in both INCA and TC-INCA, since dynamic features are known to play a key role in differentiating major phonetic categories. On average, we obtained relative improvements of 13.75% and 5.39% with the proposed Dynamic INCA and Dynamic TC-INCA, respectively. This improvement is also reflected in the quality of the converted voices.
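To make the alignment idea concrete, the following is a minimal sketch of an NN search over static plus dynamic features, not the authors' implementation. It assumes a simple symmetric-difference delta, a Euclidean distance, and a brute-force search; the function names `add_deltas` and `nn_align` and the `width` parameter are hypothetical, and the abstract does not state the actual delta window or distance measure used.

```python
import numpy as np


def add_deltas(static, width=1):
    """Append first-order delta (dynamic) features to a (frames, dims) array.

    Assumption: delta[t] = (x[t + width] - x[t - width]) / (2 * width),
    with the first and last frames repeated at the edges.
    """
    padded = np.pad(static, ((width, width), (0, 0)), mode="edge")
    delta = (padded[2 * width:] - padded[:-2 * width]) / (2.0 * width)
    return np.hstack([static, delta])


def nn_align(src, tgt):
    """Return, for each source frame, the index of the nearest target frame.

    Assumption: squared Euclidean distance, computed by brute force.
    """
    d2 = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    converted = rng.normal(size=(50, 4))  # stand-in for intermediate converted features
    target = rng.normal(size=(60, 4))     # stand-in for target spectral features

    # NN pairs computed on static + dynamic features instead of static only.
    pairs = nn_align(add_deltas(converted), add_deltas(target))
    print(pairs.shape)
```

In an INCA-style loop, the indices returned by `nn_align` would select the target frames used to re-estimate the mapping function, and the converted features would then be updated before the next search. That conversion step is omitted here because the abstract does not specify it.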

DOI: 10.21437/Interspeech.2018-1538

Cite as: Shah, N., Patil, H. (2018) Effectiveness of Dynamic Features in INCA and Temporal Context-INCA. Proc. Interspeech 2018, 711-715, DOI: 10.21437/Interspeech.2018-1538.

  author={Nirmesh Shah and Hemant Patil},
  title={Effectiveness of Dynamic Features in INCA and Temporal Context-INCA},
  booktitle={Proc. Interspeech 2018},