Competency Evaluation in Voice Mimicking Using Acoustic Cues

Abhijith G., Adharsh S., Akshay P. L., Rajeev Rajan

In this paper, the fusion of i-vectors with prosodic features is used to identify the most competent voice imitator through a deep neural network (DNN) framework. The experiment analyzes the spectral and prosodic characteristics of speech during voice imitation. The spectral features are mel-frequency cepstral coefficients (MFCC) and modified group delay features (MODGDF). Prosodic features, computed by Legendre polynomial approximation, serve as complementary information to the i-vector model. The proposed system evaluates the competence of artists in voice mimicking and ranks them according to classifier scores, with the ground truth given by a mean opinion score (MOS). If the artist with the highest MOS is identified as rank-1 by the proposed system, a hit occurs. The DNN-based classifier makes its decision based on the probability values at the output-layer nodes. Performance is evaluated using the top-X hit criterion on a mimicry dataset. A top-2 hit rate of 81.81% is obtained for the fusion experiment. The experiments demonstrate the potential of the i-vector framework and its fusion in the competency evaluation of voice mimicking.
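As a rough illustration of the top-X hit criterion described above (function and artist names are hypothetical, not from the paper), a ranking produced by the classifier counts as a hit when the artist with the highest MOS falls within its top X positions:

```python
def top_x_hit(ranked_artists, best_mos_artist, x):
    """Return True if the artist with the highest MOS appears
    within the top-x positions of the classifier's ranking."""
    return best_mos_artist in ranked_artists[:x]

# Hypothetical example: artists ranked by descending output probability.
ranking = ["artist_B", "artist_A", "artist_C"]
print(top_x_hit(ranking, "artist_A", 1))  # False: rank-1 miss
print(top_x_hit(ranking, "artist_A", 2))  # True: top-2 hit
```

The reported top-2 hit rate would then be the fraction of test cases for which such a top-2 hit occurs.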

 DOI: 10.21437/Interspeech.2020-1790

Cite as: G., A., S., A., L., A.P., Rajan, R. (2020) Competency Evaluation in Voice Mimicking Using Acoustic Cues. Proc. Interspeech 2020, 1096-1100, DOI: 10.21437/Interspeech.2020-1790.

@inproceedings{abhijith20_interspeech,
  author={Abhijith G. and Adharsh S. and Akshay P. L. and Rajeev Rajan},
  title={{Competency Evaluation in Voice Mimicking Using Acoustic Cues}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1096--1100},
  doi={10.21437/Interspeech.2020-1790}
}