Speech Prosody 2010

Chicago, IL, USA
May 10-14, 2010

A Modulation-Demodulation Model of Speech Communication

Nobuaki Minematsu

Graduate School of Information Science and Technology, The University of Tokyo, Japan

Perceptual invariance against the large acoustic variability of speech has long been discussed in speech science and engineering [1], and it remains an open question [2, 3]. Recently, we proposed a candidate answer based on mathematically guaranteed relational invariance [4, 5]. In that approach, completely transform-invariant features, f-divergences, are extracted from the speech dynamics of an utterance and used to represent that utterance. In this paper, this representation is interpreted from the viewpoints of telecommunications and evolutionary anthropology. Speech production is often regarded as a process of modulating the baseline timbre of a speaker's voice by manipulating the vocal organs, i.e., spectrum modulation. Extraction of the linguistic content from an utterance can then be viewed as a process of spectrum demodulation. This modulation-demodulation model of speech communication links well to known morphological and cognitive differences between humans and apes. The model also claims that linguistic content is transmitted mainly by supra-segmental features.

Index Terms: speech recognition, invariant features, spectrum demodulation, evolutionary anthropology, language acquisition
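
The transform invariance of f-divergences mentioned in the abstract can be checked numerically. The following is a minimal sketch, not the procedure used in the paper: it assumes that two speech events are each modeled as a multivariate Gaussian, takes the Bhattacharyya distance as one example of an f-divergence-based measure, and applies an arbitrary invertible affine map as a stand-in for a speaker or channel difference. All names (bhattacharyya_distance, affine_transform, random_spd, random_invertible) and the dimensionality are hypothetical choices made for illustration.

```python
import numpy as np


def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    # Mean term: (1/8) * diff^T * cov^{-1} * diff
    term_mean = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Covariance term: (1/2) * ln( det(cov) / sqrt(det(cov1) * det(cov2)) )
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term_cov = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return term_mean + term_cov


def affine_transform(mu, cov, A, b):
    """Gaussian parameters after mapping x -> A x + b."""
    return A @ mu + b, A @ cov @ A.T


def random_spd(rng, dim):
    """Random symmetric positive-definite matrix (illustrative covariance)."""
    M = rng.normal(size=(dim, dim))
    return M @ M.T + dim * np.eye(dim)


def random_invertible(rng, dim):
    """Random invertible matrix built as Q1 * diag(s) * Q2 with s > 0."""
    Q1, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    Q2, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    scales = np.exp(rng.normal(size=dim))  # strictly positive, so A is invertible
    return Q1 @ np.diag(scales) @ Q2


rng = np.random.default_rng(0)
dim = 3  # assumed feature dimensionality, for illustration only

# Two hypothetical speech events, each modeled as a Gaussian.
mu1, mu2 = rng.normal(size=dim), rng.normal(size=dim)
cov1, cov2 = random_spd(rng, dim), random_spd(rng, dim)

# A hypothetical invertible transform applied to both events alike.
A = random_invertible(rng, dim)
b = rng.normal(size=dim)

d_before = bhattacharyya_distance(mu1, cov1, mu2, cov2)
d_after = bhattacharyya_distance(
    *affine_transform(mu1, cov1, A, b),
    *affine_transform(mu2, cov2, A, b),
)

print("distance before transform:", d_before)
print("distance after transform: ", d_after)
print("invariant:", np.isclose(d_before, d_after))
```

In this sketch the absolute positions of the two distributions change under the transform, while the distance between them stays the same up to floating-point error. That is the sense in which a representation built only from such pairwise contrasts is relational.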


  1. J. S. Perkell and D. H. Klatt, Invariance and variability in speech processes, Lawrence Erlbaum Associates, Inc., 1986.
  2. R. Newman, “The level of detail in infants' lexical representations and its implications for computational models,” Keynote speech in Workshop on Acquisition of Communication and Recognition Skills (ACORNS), 2009.
  3. S. Furui, “Generalization problem in ASR acoustic model training and adaptation,” Keynote speech in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2009.
  4. N. Minematsu, “Mathematical evidence of the acoustic universal structure in speech,” Proc. ICASSP, 889–892, 2005.
  5. Y. Qiao et al., “A study on invariance of f-divergence and its application to speech recognition,” IEEE Transactions on Signal Processing, 58, 2010.

Bibliographic reference.  Minematsu, Nobuaki (2010): "A modulation-demodulation model of speech communication", In SP-2010, paper 913.