Stacked Long-Term TDNN for Spoken Language Recognition

Daniel Garcia-Romero, Alan McCree

This paper introduces a stacked architecture that uses a time delay neural network (TDNN) to model long-term patterns for spoken language identification. The first component of the architecture is a feed-forward neural network with a bottleneck layer that is trained to classify context-dependent phone states (senones). The second component is a TDNN that takes the output of the bottleneck, concatenated over a long time span, and produces a posterior probability over the set of languages. The use of a TDNN architecture provides an efficient model to capture discriminative patterns over a wide temporal context. Experimental results are presented using the audio data from the language i-vector challenge (IVC) recently organized by NIST. The proposed system outperforms a state-of-the-art shifted delta cepstra i-vector system and provides complementary information to fuse with the new generation of bottleneck-based i-vector systems that model short-term dependencies.
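The two-stage data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The layer sizes, splicing offsets, random untrained weights, and the mean pooling of frame-level scores are all assumptions chosen for illustration, and the senone output layer used to train the first network is omitted.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def bottleneck_features(frames, w_hid, b_hid, w_bn, b_bn):
    """First stage: feed-forward network truncated at a linear bottleneck layer."""
    hidden = relu(frames @ w_hid + b_hid)
    return hidden @ w_bn + b_bn


def tdnn_layer(x, w, b, offsets):
    """One time-delay layer: splice the frames at the given offsets, then project.

    x has shape (T, D_in); w has shape (len(offsets) * D_in, D_out).
    Only time steps where every offset falls inside the sequence are kept.
    """
    lo, hi = min(offsets), max(offsets)
    centers = range(max(0, -lo), x.shape[0] - max(0, hi))
    spliced = np.stack(
        [np.concatenate([x[t + o] for o in offsets]) for t in centers]
    )
    return relu(spliced @ w + b)


def language_posteriors(bn, layers, w_out, b_out):
    """Second stage: TDNN over bottleneck features, then a language softmax."""
    h = bn
    for w, b, offsets in layers:
        h = tdnn_layer(h, w, b, offsets)
    frame_logits = h @ w_out + b_out
    # Assumption: average frame-level scores over time before the softmax.
    pooled = frame_logits.mean(axis=0)
    exp = np.exp(pooled - pooled.max())
    return frame_logits, exp / exp.sum()


# Hypothetical sizes: frames, acoustic dim, hidden dim, bottleneck dim, languages.
T, FEAT, HID, BN, N_LANG = 100, 20, 32, 8, 5
rng = np.random.default_rng(0)


def init(n_in, n_out):
    """Random untrained weights and zero biases, for shape checking only."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)


frames = rng.normal(size=(T, FEAT))  # stand-in for acoustic features
w_hid, b_hid = init(FEAT, HID)
w_bn, b_bn = init(HID, BN)
bn = bottleneck_features(frames, w_hid, b_hid, w_bn, b_bn)

# Hypothetical splicing offsets; wider offsets in upper layers grow the
# temporal context to 11 frames on each side of the centre frame.
contexts = [(-2, -1, 0, 1, 2), (-3, 0, 3), (-6, 0, 6)]
layers, d_in = [], BN
for offsets in contexts:
    w, b = init(len(offsets) * d_in, HID)
    layers.append((w, b, offsets))
    d_in = HID
w_out, b_out = init(HID, N_LANG)

frame_logits, posteriors = language_posteriors(bn, layers, w_out, b_out)
print(frame_logits.shape)  # (78, 5): 100 frames minus 22 lost to context
```

Because each layer only splices a few offsets, the number of parameters grows with the number of offsets rather than with the full width of the temporal window, which is the efficiency argument the abstract makes for using a TDNN.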

DOI: 10.21437/Interspeech.2016-1334

Cite as

Garcia-Romero, D., McCree, A. (2016) Stacked Long-Term TDNN for Spoken Language Recognition. Proc. Interspeech 2016, 3226-3230.

author={Daniel Garcia-Romero and Alan McCree},
title={Stacked Long-Term TDNN for Spoken Language Recognition},
booktitle={Interspeech 2016},