Attention-Based Speaker Embeddings for One-Shot Voice Conversion

Tatsuma Ishihara, Daisuke Saito


This paper proposes a novel approach to embedding speaker information into feature vectors at the frame level using an attention mechanism, and its application to one-shot voice conversion. A one-shot voice conversion system is a voice conversion system in which only one utterance from the target speaker is available for conversion. In many one-shot voice conversion systems, a speaker encoder compresses an utterance of the target speaker into a fixed-size vector to propagate speaker information. However, this representation loses temporal information related to speaker identity, which can degrade conversion quality. To alleviate this problem, we propose a novel way to embed speaker information using an attention mechanism. Instead of compressing the utterance into a fixed-size vector, our proposed speaker encoder outputs a sequence of speaker embedding vectors. This sequence is selectively combined with the input frames of the source speaker by an attention mechanism. Finally, the obtained time-varying speaker information is used by a decoder to generate the converted features. Objective evaluation showed that our method reduced the average mel-cepstral distortion from 5.34 dB for the baseline system to 5.23 dB. A subjective preference test showed that our proposed system outperformed the baseline.
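The abstract describes a pipeline in which source frames attend over a sequence of target-speaker embeddings rather than a single averaged vector. Below is a minimal PyTorch sketch of that idea; all module names, layer sizes, and the choice of scaled dot-product attention are illustrative assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn


class SpeakerEncoder(nn.Module):
    """Maps a target utterance to a *sequence* of speaker embeddings
    (one per frame), instead of a single fixed-size vector."""
    def __init__(self, feat_dim=80, emb_dim=128):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, emb_dim, batch_first=True)

    def forward(self, target_feats):          # (B, T_tgt, feat_dim)
        emb_seq, _ = self.rnn(target_feats)   # (B, T_tgt, emb_dim)
        return emb_seq


class AttentiveConverter(nn.Module):
    """For each source frame, attends over the target speaker-embedding
    sequence to obtain time-varying speaker information, then decodes."""
    def __init__(self, feat_dim=80, emb_dim=128):
        super().__init__()
        self.query_proj = nn.Linear(feat_dim, emb_dim)
        self.decoder = nn.Linear(feat_dim + emb_dim, feat_dim)

    def forward(self, source_feats, spk_emb_seq):
        # Scaled dot-product attention (an assumed choice): source frames
        # act as queries, speaker embeddings as keys and values.
        q = self.query_proj(source_feats)                   # (B, T_src, emb_dim)
        scores = torch.bmm(q, spk_emb_seq.transpose(1, 2))  # (B, T_src, T_tgt)
        weights = torch.softmax(scores / q.size(-1) ** 0.5, dim=-1)
        spk_info = torch.bmm(weights, spk_emb_seq)          # (B, T_src, emb_dim)
        # Condition the decoder on per-frame (time-varying) speaker information.
        return self.decoder(torch.cat([source_feats, spk_info], dim=-1))


# Usage: a single target utterance ("one-shot") conditions the conversion.
enc, conv = SpeakerEncoder(), AttentiveConverter()
src = torch.randn(1, 200, 80)    # source-speaker mel features
tgt = torch.randn(1, 150, 80)    # one utterance from the target speaker
converted = conv(src, enc(tgt))  # (1, 200, 80)

Because the attention weights are recomputed per source frame, the speaker conditioning varies over time, which is the property the abstract contrasts with fixed-size speaker vectors.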


DOI: 10.21437/Interspeech.2020-2512

Cite as: Ishihara, T., Saito, D. (2020) Attention-Based Speaker Embeddings for One-Shot Voice Conversion. Proc. Interspeech 2020, 806-810, DOI: 10.21437/Interspeech.2020-2512.


@inproceedings{Ishihara2020,
  author={Tatsuma Ishihara and Daisuke Saito},
  title={{Attention-Based Speaker Embeddings for One-Shot Voice Conversion}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={806--810},
  doi={10.21437/Interspeech.2020-2512},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2512}
}