Competency Evaluation in Voice Mimicking Using Acoustic Cues

Abhijith G., Adharsh S., Akshay P. L., Rajeev Rajan


In this paper, the fusion of i-vectors with prosodic features is used to identify the most competent voice imitator through a deep neural network (DNN) framework. The experiment is conducted by analyzing spectral and prosodic characteristics during voice imitation. Spectral features include mel-frequency cepstral coefficients (MFCC) and modified group delay features (MODGDF). Prosodic features, computed by Legendre polynomial approximation, are used as complementary information to the i-vector model. The proposed system evaluates the competence of artists in voice mimicking and ranks them according to classifier scores, with the ground truth given by the mean opinion score (MOS). If the artist with the highest MOS is identified as rank-1 by the proposed system, a hit occurs. The DNN-based classifier makes its decision based on the probability values at the output-layer nodes. Performance is evaluated using the top-X hit criterion on a mimicry dataset. A top-2 hit rate of 81.81% is obtained for the fusion experiment. The experiments demonstrate the potential of the i-vector framework and its fusion in competency evaluation of voice mimicking.
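The top-X hit criterion described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: for each trial, artists are ranked by classifier probability, and a hit is counted when the artist with the highest MOS falls within the top X ranks. All function and variable names here are hypothetical.

```python
def top_x_hit_rate(trials, x):
    """Illustrative top-X hit rate.

    trials: list of (probs, mos) pairs, where probs and mos are
    dicts mapping artist name -> classifier probability / MOS.
    """
    hits = 0
    for probs, mos in trials:
        # Ground truth: the artist with the highest mean opinion score
        best_artist = max(mos, key=mos.get)
        # System ranking: artists sorted by classifier probability
        ranked = sorted(probs, key=probs.get, reverse=True)
        if best_artist in ranked[:x]:  # hit if ground truth is in top X
            hits += 1
    return hits / len(trials)

# Toy example with three artists and two trials (made-up numbers)
trials = [
    ({"A": 0.6, "B": 0.3, "C": 0.1}, {"A": 4.2, "B": 3.1, "C": 2.5}),
    ({"A": 0.5, "B": 0.4, "C": 0.1}, {"B": 4.5, "A": 3.0, "C": 2.0}),
]
print(top_x_hit_rate(trials, 1))  # 0.5 (only the first trial is a rank-1 hit)
print(top_x_hit_rate(trials, 2))  # 1.0 (both ground-truth artists are in the top 2)
```

In the second trial the classifier ranks A first while B has the highest MOS, so it misses at top-1 but hits at top-2, mirroring how a top-2 rate can exceed the top-1 rate.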


DOI: 10.21437/Interspeech.2020-1790

Cite as: G., A., S., A., L., A.P., Rajan, R. (2020) Competency Evaluation in Voice Mimicking Using Acoustic Cues. Proc. Interspeech 2020, 1096-1100, DOI: 10.21437/Interspeech.2020-1790.


@inproceedings{G.2020,
  author={Abhijith G. and Adharsh S. and Akshay P. L. and Rajeev Rajan},
  title={{Competency Evaluation in Voice Mimicking Using Acoustic Cues}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1096--1100},
  doi={10.21437/Interspeech.2020-1790},
  url={http://dx.doi.org/10.21437/Interspeech.2020-1790}
}