By Jun-Kun Wang, Chi-Heng Lin, and Jacob Abernethy
SGD with stochastic momentum (see Figure 1 below) has been the de facto training algorithm in nonconvex optimization and deep learning. It has been widely adopted for training neural nets in various applications. Modern techniques in computer vision (e.g.[1,2]), speech recognition (e.g. ), natural language processing (e.g. ), and reinforcement learning (e.g.  ) use SGD with stochastic momentum to train models. To be precise, during the training, the algorithm is applied to solve , where the underlying loss function measures the predicted performance of a neural net with weights won a mini-batch of samples , which is typically non-convex.
Despite the wide use of stochastic momentum in practice, a theoretical justification for its superiority has remained elusive, as has any mathematical guidelines for actually setting the momentum parameter —it has been observed that large values (e.g. =0.9) work well in practice. We note that PyTorch, one of the most popular deep learning toolkits, recommends a high value of momentum parameter =0.9 for training neural nets in its tutorial page.
At the same time, a widely-observed empirical phenomenon is that in training deep networks stochastic momentum appears to significantly improve convergence time, variants of it have flourished in the development of other popular update methods, e.g. ADAM (), AMSGrad (), etc. Yet theoretical justification for the use of stochastic momentum has remained a significant open question.
Our work proposes an answer: stochastic momentum improves deep network training because it modifies SGD to escape saddle points faster and, consequently, leads to faster convergence. For practitioners, our result provides guidance for setting the value of the momentum parameter, which is important for speeding up deep learning. For theorists, our work sheds light on understanding the interaction of optimization and deep learning.
Let us provide some high-level intuition about the benefit of stochastic momentum with respect to escaping saddle points. In an iterative update scheme, at some time t the iterate can enter a saddle point region (see Figure 3), that is a place where Hessian 2 has a non-trivial negative eigenvalue, say and the gradient is small in norm, say . The challenge here is that gradient updates may drift only very slowly away from the saddle point, and may not escape this region. On the other hand, if the iterates were to move in one particular direction, namely along the direction of the smallest eigenvector of , then a fast escape is guaranteed under certain constraints on the step size 𝞰. While the negative eigenvector could be computed directly, this 2nd-order method is prohibitively expensive due to millions of parameters in state-of-the-art neural nets and hence we typically aim to rely on first-order methods (e.g. SGD with stochastic momentum).
Now let us conduct a thought experiment. Denote the escape direction (i.e. the negative curvature direction) . Assume that at some iterate , we have momentum which possesses a significant correlation with the negative curvature direction , then on successive rounds is quite close to is quite close to , and so forth; see Figure 4 for an example. This provides an intuitive perspective on how momentum might help accelerate the escape process. Yet one might ask: does this procedure provably contribute to the escape process? And, if so, what is the aggregate performance improvement of the momentum? We answer the first question in the affirmative, and we answer the second question with the following result.
Let us consider a toy example to demonstrate that SGD with stochastic momentum escapes saddle points faster with higher values of . We follow the works of  for the experiment, which considers solving with an embedded saddle given by the matrix A:= diag([1, -0.1]) and stochastic gaussian perturbations given by . Note that the small variance in the second component provides lower projection of the gradient in the escape direction. The experiment considers setting the initial iterate w0 at the saddle point (i.e. the origin), and applying SGD with momentum to solve the optimization problem with different values of . Figure 5 clearly demonstrates that the higher the momentum parameter , the faster the escape process.
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Conference on Computer Vision and Pattern Recognition (CVPR), 2016
 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. NIPS, 2012
 Ashish Vaswani, Noam Shazeer, Niki Parmar, and et al. Attention is all you need. NIPS, 2017.
 David Silver, Julian Schrittwieser, Karen Simonyan, and et al. Mastering the game of go without human knowledge. Nature, 2017.
 Geoffrey Hinton, Li Deng, Dong Yu, and et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine 2012
 Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
 Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. ICLR,2018.
 Sashank Reddi, Manzil Zaheer, Suvrit Sra, Barnabas Poczos, Francis Bach, Ruslan Salakhutdinov, and Alex Smola. A generic approach for escaping saddle points. AISTATS, 2018.