10th ISCA Speech Synthesis Workshop

20-22 September 2019, Vienna, Austria

Chair: Michael Pucher

ISSN: 2312-2846
DOI: 10.21437/SSW.2019

keynote 1: Deep learning for speech synthesis - Aäron van den Oord


Deep learning for speech synthesis
Aäron van den Oord


oral 1: Neural vocoder


Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis
Xin Wang, Junichi Yamagishi

A Comparison of Recent Neural Vocoders for Speech Signal Reconstruction
Prachi Govalkar, Johannes Fischer, Frank Zalkow, Christian Dittmar

Deep neural network based real-time speech vocoder with periodic and aperiodic inputs
Keiichiro Oura, Kazuhiro Nakamura, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda

Generative Adversarial Network based Speaker Adaptation for High Fidelity WaveNet Vocoder
Qiao Tian, Xucheng Wan, Shan Liu


oral 2: Adaptation


Neural Text-to-Speech Adaptation from Low Quality Public Recordings
Qiong Hu, Erik Marchi, David Winarsky, Yannis Stylianou, Devang Naik, Sachin Kajarekar

Neural VTLN for Speaker Adaptation in TTS
Bastian Schnell, Philip N. Garner

Problem-Agnostic Speech Embeddings for Multi-Speaker Text-to-Speech with SampleRNN
David Álvarez, Santiago Pascual, Antonio Bonafonte


poster 1: Voice conversion and multi-speaker TTS


Multi-Speaker Modeling for DNN-based Speech Synthesis Incorporating Generative Adversarial Networks
Hiroki Kanagawa, Yusuke Ijima

Speaker Adaptation of Acoustic Model using a Few Utterances in DNN-based Speech Synthesis Systems
Ivan Himawan, Sandesh Aryal, Iris Ouyang, Shukhan Ng, Pierre Lanchantin

DNN-based Speaker Embedding Using Subjective Inter-speaker Similarity for Multi-speaker Modeling in Speech Synthesis
Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari

Generalization of Spectrum Differential based Direct Waveform Modification for Voice Conversion
Wen-Chin Huang, Yi-Chiao Wu, Kazuhiro Kobayashi, Yu-Huai Peng, Hsin-Te Hwang, Patrick Lumban Tobing, Yu Tsao, Hsin-Min Wang, Tomoki Toda

Statistical Voice Conversion with Quasi-periodic WaveNet Vocoder
Yi-Chiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda

Voice Conversion without Explicit Separation of Source and Filter Components Based on Non-negative Matrix Factorization
Hitoshi Suda, Daisuke Saito, Nobuaki Minematsu

Voice conversion based on full-covariance mixture density networks for time-variant linear transformations
Gaku Kotani, Daisuke Saito

Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion
Tobias Gburrek, Thomas Glarner, Janek Ebbers, Reinhold Haeb-Umbach, Petra Wagner

Novel Inception-GAN for Whispered-to-Normal Speech Conversion
Maitreya Patel, Mihir Parmar, Savan Doshi, Nirmesh Shah, Hemant Patil

Implementation of DNN-based real-time voice conversion and its improvements by audio data augmentation and mask-shaped device
Riku Arakawa, Shinnosuke Takamichi, Hiroshi Saruwatari


keynote 2: Synthesizing animal vocalizations and modelling animal speech - Tecumseh Fitch and Bart de Boer


Synthesizing animal vocalizations and modelling animal speech
Tecumseh Fitch, Bart de Boer


oral 3: Evaluation and performance


Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs
Rob Clark, Hanna Silen, Tom Kenter, Ralph Leith

Speech Synthesis Evaluation — State-of-the-Art Assessment and Suggestion for a Novel Research Program
Petra Wagner, Jonas Beskow, Simon Betz, Jens Edlund, Joakim Gustafson, Gustav Eje Henter, Sébastien Le Maguer, Zofia Malisz, Éva Székely, Christina Tånnander, Jana Voße

Rakugo speech synthesis using segment-to-segment neural transduction and style tokens — toward speech synthesis for entertaining audiences
Shuhei Kato, Yusuke Yasuda, Xin Wang, Erica Cooper, Shinji Takaki, Junichi Yamagishi

Voice Puppetry: Exploring Dramatic Performance to Develop Speech Synthesis
Matthew Aylett, David Braude, Christopher Pidcock, Blaise Potard


oral 4: Speech science


Measuring the contribution to cognitive load of each predicted vocoder speech parameter in DNN-based speech synthesis
Avashna Govender, Cassia Valentini-Botinhao, Simon King

Statistical parametric synthesis of budgerigar songs
Lorenz Gutscher, Michael Pucher, Carina Lozo, Marisa Hoeschele, Daniel C. Mann

GlottDNN-based spectral tilt analysis of tense voice emotional styles for the expressive 3D numerical synthesis of vowel [a]
Marc Freixes, Marc Arnela, Francesc Alías, Joan Claudi Socoró


poster 2: Applications and practical issues


Preliminary guidelines for the efficient management of OOV words for spoken text
Christina Tånnander, Jens Edlund

Loss Function Considering Temporal Sequence for Feed-Forward Neural Network–Fundamental Frequency Case
Noriyuki Matsunaga, Yamato Ohtani, Tatsuya Hirahara

Sparse Approximation of Gram Matrices for GMMN-based Speech Synthesis
Tomoki Koriyama, Shinnosuke Takamichi, Takao Kobayashi

Speaker Anonymization Using X-vector and Neural Waveform Models
Fuming Fang, Xin Wang, Junichi Yamagishi, Isao Echizen, Massimiliano Todisco, Nicholas Evans, Jean-François Bonastre

V2S attack: building DNN-based voice conversion from automatic speaker verification
Taiki Nakamura, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, Hiroshi Saruwatari

Impacts of input linguistic feature representation on Japanese end-to-end speech synthesis
Takato Fujimoto, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda

Evaluation of Block-Wise Parameter Generation for Statistical Parametric Speech Synthesis
Nobuyuki Nishizawa, Tomohiro Obara, Gen Hattori

Low computational cost speech synthesis based on deep neural networks using hidden semi-Markov model structures
Motoki Shimada, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda

Neural iTTS: Toward Synthesizing Speech in Real-time with End-to-end Neural Text-to-Speech Framework
Tomoya Yanagita, Sakriani Sakti, Satoshi Nakamura


keynote 3: Natural Language Generation: Creating Text - Claire Gardent


Natural Language Generation: Creating Text
Claire Gardent


oral 5: Language and dialect varieties


Enhancing Myanmar Speech Synthesis with Linguistic Information and LSTM-RNN
Aye Mya Hlaing, Win Pa Pa, Ye Kyaw Thu

Building Multilingual End-to-End Speech Synthesisers for Indian Languages
Anusha Prakash, Anju Leela Thomas, S. Umesh, Hema A. Murthy

Diphthong interpolation, phone mapping, and prosody transfer for speech synthesis of similar dialect pairs
Michael Pucher, Carina Lozo, Philip Vergeiner, Dominik Wallner

Subset Selection, Adaptation, Gemination and Prosody Prediction for Amharic Text-to-Speech Synthesis
Elshadai Tesfaye Biru, Yishak Tofik Mohammed, David Tofu, Erica Cooper, Julia Hirschberg


oral 6: Sequence to sequence model


Initial investigation of encoder-decoder end-to-end TTS using marginalization of monotonic hard alignments
Yusuke Yasuda, Xin Wang, Junichi Yamagishi

Where do the improvements come from in sequence-to-sequence neural TTS?
Oliver Watts, Gustav Eje Henter, Jason Fong, Cassia Valentini-Botinhao

A Comparison of Letters and Phones as Input to Sequence-to-Sequence Models for Speech Synthesis
Jason Fong, Jason Taylor, Korin Richmond, Simon King


poster 3: Prosody


Generative Modeling of F0 Contours Leveraged by Phrase Structure and Its Application to Statistical Focus Control
Yuma Shirahata, Daisuke Saito, Nobuaki Minematsu

Subword tokenization based on DNN-based acoustic model for end-to-end prosody generation
Masashi Aso, Shinnosuke Takamichi, Norihiro Takamune, Hiroshi Saruwatari

Using generative modelling to produce varied intonation for speech synthesis
Zack Hodari, Oliver Watts, Simon King

How to train your fillers: uh and um in spontaneous speech synthesis
Éva Székely, Gustav Eje Henter, Jonas Beskow, Joakim Gustafson

An Investigation of Features for Fundamental Frequency Pattern Prediction in Electrolaryngeal Speech Enhancement
Mohammad Eshghi, Kou Tanaka, Kazuhiro Kobayashi, Hirokazu Kameoka, Tomoki Toda

PROMIS: a statistical-parametric speech synthesis system with prominence control via a prominence network
Zofia Malisz, Harald Berthelsen, Jonas Beskow, Joakim Gustafson

Deep Mixture-of-Experts Models for Synthetic Prosodic-Contour Generation
Raul Fernandez

Prosody Prediction from Syntactic, Lexical, and Word Embedding Features
Rose Sloan, Syed Sarfaraz Akhtar, Bryan Li, Ritvik Shrivastava, Agustin Gravano, Julia Hirschberg

Sequence to Sequence Neural Speech Synthesis with Prosody Modification Capabilities
Slava Shechtman, Alex Sorin