Second International Conference on Spoken Language Processing (ICSLP'92)
Banff, Alberta, Canada
As demonstrated by Hirayama, Vatikiotis-Bateson, Kawato, & Honda, we are attempting to model speech production by means of neural networks. At one stage, a 3-layer perceptron learns the dynamics relating muscle activity to articulator motion; at a later stage, another perceptron learns the PARCOR parameters that relate articulator motion to vocal tract shape and the speech acoustics. After learning, motor commands to the musculo-skeletal system are generated through time by a cascade neural network that performs trajectory formation and inverse dynamics, parametrized by the via-point and smoothness constraints imposed by the phoneme input string and by global performance factors such as speaking rate and speaking style. Articulator trajectories are then generated and serve as input to the PARCOR synthesizer, which produces the speech acoustics. Although this effort is still in its infancy, it is proving to be a fairly successful piece of engineering, and we think a discussion of the attendant speech science and motor control issues is warranted. In this paper, therefore, we discuss our modeling effort in terms of well-known problems common to both computational modeling and speech motor control, such as excess degrees of freedom in the mapping between different levels, coordinate transformation between articulator and task-space variables, and extrinsic versus intrinsic timing. Problems of less common concern, such as biological and cognitive plausibility, are also considered. For the most part, we focus on issues leading up to the generation of articulator motion and leave discussion of the articulatory-to-acoustic transform to a later date.
Bibliographic reference. Vatikiotis-Bateson, Eric / Hirayama, Makoto / Honda, Kiyoshi / Kawato, Mitsuo (1992): "The articulatory dynamics of running speech: gestures from phonemes?", In ICSLP-1992, 887-890.