Learning Voice Representation Using Knowledge Distillation for Automatic Voice Casting

Adrien Gresse, Mathias Quillot, Richard Dufour, Jean-François Bonastre


The search for professional voice actors for audiovisual productions is a sensitive task, performed by artistic directors (ADs). ADs have a strong appetite for new talents and voices but cannot audition candidates on a large scale. Automatic tools able to suggest the most suitable voices are therefore of great interest to the audiovisual industry. In previous work, we showed the existence of acoustic information that makes it possible to mimic the ADs’ choices. However, the only available supervision is the ADs’ choices in already-dubbed multimedia productions. In this paper, we propose a representation-learning strategy to build a character/role representation, called the p-vector. In addition, the large variability between audiovisual productions makes it difficult to build homogeneous training datasets. We overcome this difficulty by using knowledge distillation to take advantage of external datasets. Experiments are conducted on video-game voice excerpts. Results show a significant improvement with the p-vector over the speaker-based x-vector representation.
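To make the distillation idea concrete, below is a minimal sketch of a standard Hinton-style knowledge-distillation objective applied to learning a character-level embedding from a speaker-level input (e.g. an x-vector). The network sizes, the loss weighting, and the existence of a pre-trained teacher are assumptions for illustration only, not the paper's exact architecture or training recipe.

```python
# Hypothetical sketch: a student network producing a p-vector-style embedding,
# trained with a blend of hard-label cross-entropy and a soft-target KL term
# distilled from a teacher model trained on an external dataset.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PVectorStudent(nn.Module):
    """Maps an input voice embedding (e.g. an x-vector) to a
    character/role embedding plus class logits used during training."""
    def __init__(self, in_dim=512, emb_dim=128, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(emb_dim, n_classes)

    def forward(self, x):
        p_vector = self.encoder(x)          # character/role representation
        logits = self.classifier(p_vector)  # only needed for the training loss
        return p_vector, logits

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Cross-entropy on hard labels combined with KL divergence that pulls the
    student's temperature-softened outputs toward the teacher's."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

In this sketch the teacher supplies soft targets computed on the same inputs; the temperature T and mixing weight alpha are the usual distillation hyperparameters and would need to be tuned for the task.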


DOI: 10.21437/Interspeech.2020-2236

Cite as: Gresse, A., Quillot, M., Dufour, R., Bonastre, J.-F. (2020) Learning Voice Representation Using Knowledge Distillation for Automatic Voice Casting. Proc. Interspeech 2020, 160-164, DOI: 10.21437/Interspeech.2020-2236.


@inproceedings{Gresse2020,
  author={Adrien Gresse and Mathias Quillot and Richard Dufour and Jean-Fran\c{c}ois Bonastre},
  title={{Learning Voice Representation Using Knowledge Distillation for Automatic Voice Casting}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={160--164},
  doi={10.21437/Interspeech.2020-2236},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2236}
}