Two Tiered Distributed Training Algorithm for Acoustic Modeling

Pranav Ladkat, Oleg Rybakov, Radhika Arava, Sree Hari Krishnan Parthasarathi, I-Fan Chen, Nikko Strom

We present a hybrid approach for scaling distributed training of neural networks that combines two algorithms: Gradient Threshold Compression (GTC), a variant of stochastic gradient descent (SGD) that compresses gradients with thresholding and quantization, and Blockwise Model Update Filtering (BMUF), a variant of model averaging (MA). In the proposed method, we divide the workers hierarchically into smaller subgroups and limit the frequency of communication across subgroups. Within a subgroup, the local model is updated using GTC; across subgroups, the global model is updated using BMUF. We evaluate this approach on an Automatic Speech Recognition (ASR) task by training deep long short-term memory (LSTM) acoustic models on 2000 hours of speech. Experiments show that, across a wide range of GPU counts used for distributed training, the proposed approach achieves a better trade-off between accuracy and scalability than either GTC or BMUF alone.
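The two tiers described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the commonly described form of GTC (each worker accumulates a gradient residual and transmits only entries whose magnitude reaches a threshold, quantized to plus or minus that threshold) and the classical block-momentum form of BMUF. All function names and the parameters `tau`, `lr`, `block_momentum`, and `block_lr` are hypothetical, and models are plain Python lists to keep the example self-contained.

```python
def gtc_compress(residual, grad, tau):
    """Accumulate grad into residual; emit (index, +/-tau) for entries at or
    over the threshold and keep the remainder in the residual (assumed GTC form)."""
    message = []
    for i, g in enumerate(grad):
        residual[i] += g
        if residual[i] >= tau:
            message.append((i, tau))
            residual[i] -= tau
        elif residual[i] <= -tau:
            message.append((i, -tau))
            residual[i] += tau
    return message


def gtc_subgroup_step(model, residuals, grads, tau, lr):
    """Tier 1: one step inside a subgroup. Every worker's sparse, quantized
    message is applied to the subgroup's shared local model."""
    for residual, grad in zip(residuals, grads):
        for i, q in gtc_compress(residual, grad, tau):
            model[i] -= lr * q


def bmuf_update(global_model, delta, subgroup_models, block_momentum, block_lr):
    """Tier 2: after a block of local steps, average the subgroup models and
    filter the resulting update with block momentum (assumed classical form)."""
    n = len(subgroup_models)
    for i in range(len(global_model)):
        avg = sum(m[i] for m in subgroup_models) / n
        block_grad = avg - global_model[i]
        delta[i] = block_momentum * delta[i] + block_lr * block_grad
        global_model[i] += delta[i]
```

In a full training loop under these assumptions, each subgroup would call `gtc_subgroup_step` many times per block, and `bmuf_update` would run once per block, after which every subgroup restarts from the updated global model. This is where the reduced cross-subgroup communication comes from.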

 DOI: 10.21437/Interspeech.2019-1859

Cite as: Ladkat, P., Rybakov, O., Arava, R., Parthasarathi, S.H.K., Chen, I., Strom, N. (2019) Two Tiered Distributed Training Algorithm for Acoustic Modeling. Proc. Interspeech 2019, 1626-1630, DOI: 10.21437/Interspeech.2019-1859.

  author={Pranav Ladkat and Oleg Rybakov and Radhika Arava and Sree Hari Krishnan Parthasarathi and I-Fan Chen and Nikko Strom},
  title={{Two Tiered Distributed Training Algorithm for Acoustic Modeling}},
  booktitle={Proc. Interspeech 2019},