Speech Prosody 2010
Chicago, IL, USA
Perceptual invariance against the large acoustic variability of speech has long been discussed in speech science and engineering, and it remains an open question [2, 3]. Recently, we proposed a candidate answer based on mathematically guaranteed relational invariance [4, 5]. In that approach, completely transform-invariant features, f-divergences, are extracted from the speech dynamics of an utterance and used to represent that utterance. In this paper, this representation is interpreted from the viewpoints of telecommunications and evolutionary anthropology. Speech production is often regarded as a process of modulating the baseline timbre of a speaker's voice by manipulating the vocal organs, i.e., spectrum modulation. Extraction of the linguistic content from an utterance can then be viewed as a process of spectrum demodulation. This modulation-demodulation model of speech communication links well to known morphological and cognitive differences between humans and apes. The model also claims that linguistic content is transmitted mainly by supra-segmental features.
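The transform invariance of f-divergences mentioned in the abstract can be illustrated numerically. The following is a minimal sketch, not the paper's implementation: it assumes that two speech events are modeled as Gaussian distributions, that speaker variation acts as an invertible affine map on the feature space, and that the Bhattacharyya distance serves as the f-divergence-based measure. All function names and the random test data are hypothetical.

```python
import numpy as np


def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    # Mean term: (1/8) * diff^T cov^{-1} diff
    term_mean = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Covariance term: (1/2) * ln(det(cov) / sqrt(det(cov1) * det(cov2)))
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term_cov = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return term_mean + term_cov


def affine_transform(mu, cov, A, b):
    """Push a Gaussian through x -> A x + b (assumed model of speaker change)."""
    return A @ mu + b, A @ cov @ A.T


def random_spd(rng, dim):
    """Random symmetric positive-definite matrix (illustrative data only)."""
    M = rng.standard_normal((dim, dim))
    return M @ M.T + dim * np.eye(dim)


def demo(seed=0, dim=3):
    rng = np.random.default_rng(seed)
    mu1, mu2 = rng.standard_normal(dim), rng.standard_normal(dim)
    cov1, cov2 = random_spd(rng, dim), random_spd(rng, dim)
    # Invertible map: orthogonal factor times a positive-definite factor.
    Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    A = Q @ random_spd(rng, dim)
    b = rng.standard_normal(dim)
    before = bhattacharyya_distance(mu1, cov1, mu2, cov2)
    after = bhattacharyya_distance(
        *affine_transform(mu1, cov1, A, b),
        *affine_transform(mu2, cov2, A, b),
    )
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print(f"distance before transform: {before:.6f}")
    print(f"distance after transform:  {after:.6f}")
```

The two printed distances should agree up to floating-point error: the individual means and covariances change under the map, but the contrast between the two distributions does not, which is the sense in which the representation is relational.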
Index Terms: speech recognition, invariant features, spectrum demodulation, evolutionary anthropology, language acquisition
Bibliographic reference. Minematsu, Nobuaki (2010): "A modulation-demodulation model of speech communication", In SP-2010, paper 913.