1-bit Stochastic Gradient Descent (1-bit SGD) is a technique from Microsoft Research aimed at increasing the data parallelism available when training deep neural networks. The technique is described in the paper *1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs*.
The authors accelerate training with stochastic gradient descent by:
- splitting up the computation for each minibatch across many nodes in a distributed system.
- reducing the bandwidth requirements for communication between nodes by exchanging gradients (instead of model parameters) and quantizing those gradients all the way to just 1 bit.
- carrying the quantization error forward: the error introduced by quantizing each minibatch's gradient is added to the next minibatch's gradient before that gradient is quantized.
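The quantize-with-error-feedback step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name is invented, and the reconstruction values here are simple per-bucket means, whereas the paper derives values that minimize the quantization error (and applies them per column of each weight matrix).

```python
import numpy as np

def one_bit_quantize(gradient, error):
    """Quantize a gradient to 1 bit per value, with error feedback.

    `gradient` is the raw minibatch gradient; `error` is the residual
    left over from quantizing the previous minibatch. Returns the
    reconstructed (dequantized) gradient and the new residual.
    Illustrative sketch only -- names and reconstruction rule are
    simplified relative to the paper.
    """
    # Fold the previous minibatch's quantization error into this gradient.
    adjusted = gradient + error

    # 1 bit per element: each value falls into a positive or negative bucket.
    pos_mask = adjusted >= 0
    pos = adjusted[pos_mask]
    neg = adjusted[~pos_mask]

    # Reconstruction value per bucket: the bucket mean (a simple choice).
    pos_val = pos.mean() if pos.size else 0.0
    neg_val = neg.mean() if neg.size else 0.0

    # Dequantized gradient: every element is one of two values,
    # so only 1 bit per element needs to be communicated.
    quantized = np.where(pos_mask, pos_val, neg_val)

    # Residual carried into the next minibatch (error feedback).
    new_error = adjusted - quantized
    return quantized, new_error

# Error feedback across two minibatches: the residual from the first
# quantization is folded into the second gradient before quantizing.
error = np.zeros(4)
g1 = np.array([0.5, -0.2, 0.3, -0.7])
q1, error = one_bit_quantize(g1, error)

g2 = np.array([0.1, 0.4, -0.6, 0.2])
q2, error = one_bit_quantize(g2, error)
```

Because each quantized gradient takes only two distinct values, a node only needs to send one bit per parameter plus the two reconstruction values, and the error-feedback residual ensures that, over many minibatches, no gradient information is permanently lost.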