Third ESCA/COCOSDA Workshop on Speech Synthesis

November 26-29, 1998
Jenolan Caves House, Blue Mountains, NSW, Australia

A Biphone Constrained Concatenation Method for Diphone Synthesis

H. Timothy Bunnell, Steven R. Hoskins, Debra M. Yarrington

Speech Research Laboratory, duPont Hospital for Children and University of Delaware, Wilmington, DE, USA

Diphone concatenation [1] has the advantages of simplicity and a relatively small speech database compared with other concatenative synthesis methods (e.g., [2]). However, it faces two notable problems. The first is coarticulation, which extends beyond the scope of a single diphone and so produces some degree of contextual mismatch for virtually any diphone in at least some concatenation contexts. The second problem, which stems from the first, is computational: selecting, from a specific speech corpus, the instance of each diphone that yields the least temporal and spectral distortion across the broadest set of concatenation contexts (e.g., [3]).

We present a variant of diphone synthesis that addresses both problems by (a) allowing multiple tokens of a diphone where needed to accommodate the effects of coarticulation, and (b) postponing diphone selection until synthesis, when optimization can be constrained by known contextual factors. This method, termed Biphone Constrained Concatenation (BCC), has been implemented for use in the ModelTalker TtS system [4]. Comparisons of speech synthesized using BCC with speech synthesized using pure diphone concatenation indicate clear improvements in naturalness for the BCC method. However, our listening experiments also showed some increase in consonant confusions for the BCC method, which we attribute to uncontrolled durational factors.
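The general idea of points (a) and (b) can be illustrated with a minimal sketch. It is not the BCC algorithm itself, whose cost function and search are not specified in this abstract. It assumes a hypothetical inventory that stores several tokens per diphone, each tagged with the phonemes that surrounded it in the recording, and it defers the choice of token until the target phoneme string is known. The names `DiphoneToken`, `context_cost`, and `select_tokens`, the mismatch-count cost, and the `#` boundary symbol are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiphoneToken:
    """One recorded instance of a diphone (hypothetical record layout)."""
    diphone: tuple        # (first phoneme, second phoneme)
    left_context: str     # phoneme preceding the diphone in the recording
    right_context: str    # phoneme following the diphone in the recording
    token_id: str


def context_cost(token, left, right):
    """Count mismatches between recorded and target context (0, 1, or 2)."""
    return int(token.left_context != left) + int(token.right_context != right)


def select_tokens(phonemes, inventory):
    """For each diphone in the target string, pick the lowest-cost token."""
    chosen = []
    for i in range(len(phonemes) - 1):
        diphone = (phonemes[i], phonemes[i + 1])
        # "#" marks an utterance boundary (assumed convention).
        left = phonemes[i - 1] if i > 0 else "#"
        right = phonemes[i + 2] if i + 2 < len(phonemes) else "#"
        candidates = inventory[diphone]
        chosen.append(min(candidates, key=lambda t: context_cost(t, left, right)))
    return chosen


# Toy inventory with two tokens per diphone, recorded in different contexts.
inventory = {
    ("k", "ae"): [
        DiphoneToken(("k", "ae"), "s", "n", "kae_1"),
        DiphoneToken(("k", "ae"), "#", "t", "kae_2"),
    ],
    ("ae", "t"): [
        DiphoneToken(("ae", "t"), "k", "#", "aet_1"),
        DiphoneToken(("ae", "t"), "b", "s", "aet_2"),
    ],
}

selected = select_tokens(["k", "ae", "t"], inventory)
print([t.token_id for t in selected])
```

In this toy case the tokens recorded in matching contexts are preferred over those recorded in mismatched ones. A practical system would presumably also weigh spectral and temporal distortion at the joins, as the problem statement above suggests.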

References

  1. Peterson, G., Wang, W., and Sivertsen, E. (1958). Segmentation techniques in speech synthesis. Journal of the Acoustical Society of America, 30, 739-742.
  2. Takeda, K., Abe, K., and Sagisaka, Y. (1992). On the basic scheme and algorithms in non-uniform unit speech synthesis. In G. Bailly, C. Benoit and T.R. Sawallis (eds.), Talking Machines: Theories, Models, and Designs. Amsterdam: Elsevier, 93-105.
  3. Conkie, A. D. and Isard, S. (1997). Optimal coupling of diphones. In J.P.H. van Santen, R.W. Sproat, J.P. Olive and J. Hirschberg (eds.), Progress in Speech Synthesis. New York: Springer, 293-304.
  4. Bunnell, H.T., Hoskins, S. R., and Yarrington, D. M. (1998). The ModelTalker project: Software for diphone speech synthesis and automatic diphone extraction. University of Delaware, Computer and Information Sciences Technical Report, 98-13.


Full Paper (with 2 sound examples linked from within the paper)

Bibliographic reference.  Bunnell, H. Timothy / Hoskins, Steven R. / Yarrington, Debra M. (1998): "A Biphone Constrained Concatenation Method for Diphone Synthesis", In SSW3-1998, 171-176.