Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu


1-Bit Stochastic Gradient Descent for Fast Training of DNNs for Skype Translator

In the upcoming Skype Translator project, we set ourselves an ambitious goal: to enable successful open-domain conversations between Skype users in different parts of the world who speak different languages. A key enabler was the recent progress in speech recognition achieved by CD-DNN-HMM acoustic models, which reduced word errors by up to 42% relative on the Switchboard benchmark.

This talk addresses a key challenge with these DNNs: the long time it takes to train them with SGD (stochastic gradient descent), which is hard to parallelize in a straightforward way because of its high data-bandwidth requirements. We will show that gradients can be aggressively quantized, down to one bit per value, provided the quantization error is carried forward and compensated across minibatches. Combined with a number of other techniques, this enables a data-parallel SGD algorithm that speeds up training on production-size corpora of thousands of hours by 7 times, from 11+ days to under 48 hours. We believe this technique may also be applicable to other deep models.
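The idea of quantizing gradients while carrying the quantization error forward can be illustrated with a minimal Python sketch. This is not the implementation presented in the talk. The function names (quantize_one_bit, dequantize), the zero threshold, and the choice of the group means as reconstruction values are all assumptions made for illustration.

```python
def quantize_one_bit(gradient, residual):
    """Quantize (gradient + carried-over residual) to one bit per value.

    Returns (bits, (low, high), new_residual). Hypothetical helper,
    written only to illustrate error feedback.
    """
    # Add the error left over from the previous minibatch before quantizing.
    compensated = [g + r for g, r in zip(gradient, residual)]

    # One bit per value: True if non-negative (threshold 0 is an assumption).
    bits = [c >= 0.0 for c in compensated]

    # Reconstruction values: the mean of each group (also an assumption).
    pos = [c for c, b in zip(compensated, bits) if b]
    neg = [c for c, b in zip(compensated, bits) if not b]
    high = sum(pos) / len(pos) if pos else 0.0
    low = sum(neg) / len(neg) if neg else 0.0

    # Whatever the receiver cannot reconstruct is carried to the next minibatch.
    decoded = [high if b else low for b in bits]
    new_residual = [c - d for c, d in zip(compensated, decoded)]
    return bits, (low, high), new_residual


def dequantize(bits, levels):
    """Rebuild an approximate gradient from the bits and the two levels."""
    low, high = levels
    return [high if b else low for b in bits]


# Usage: three toy minibatch gradients for a three-parameter model.
grads = [[0.5, -0.25, 0.1], [0.2, 0.3, -0.4], [-0.1, 0.05, 0.6]]
residual = [0.0, 0.0, 0.0]
total_sent = [0.0, 0.0, 0.0]
for g in grads:
    bits, levels, residual = quantize_one_bit(g, residual)
    decoded = dequantize(bits, levels)
    total_sent = [t + d for t, d in zip(total_sent, decoded)]
```

In this sketch, the sum of the transmitted (dequantized) gradients plus the final residual equals the sum of the true gradients up to floating-point rounding, so no gradient information is discarded; it is only delayed to a later minibatch.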

Frank Seide, a native of Hamburg, Germany, is a Principal Researcher at Microsoft Research. His current research focus is on deep neural networks for conversational speech recognition; together with co-author Dong Yu, he was the first to show the effectiveness of CD-DNN-HMMs for recognition of conversational speech. Since graduating in 1993, Frank has worked on various speech topics, including spoken-dialogue systems, Mandarin speech recognition, audio search, and speech-to-speech translation, first at Philips Research in Aachen and Taipei, then at Microsoft Research Asia (Beijing), and now at Microsoft Research (Redmond).
