Session A7 - Stochastic Computation
July 12, 17:30 ~ 17:55 - Room B2
The stability of stochastic gradient descent
University of California, Berkeley, USA - email@example.com
The most widely used optimization method in machine learning practice is the Stochastic Gradient Method (SGM). This method has been used since the 1950s to build statistical estimators, iteratively improving models by correcting errors observed on single data points. SGM is not only scalable, robust, and simple to implement, but also achieves state-of-the-art performance in many different domains. Today, SGM powers enterprise analytics systems and is the workhorse tool for training complex pattern-recognition systems in speech and vision.
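As a minimal sketch of the per-example correction described above, the following runs SGM on a one-dimensional least-squares problem. The function name `sgm_least_squares`, the toy data, the step size, and the iteration count are all illustrative assumptions, not taken from the talk.

```python
import random


def sgm_least_squares(data, steps, step_size, seed=0):
    """Fit y ~ w * x by SGM on the squared loss (illustrative sketch)."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        x, y = rng.choice(data)       # draw a single data point at random
        error = w * x - y             # error observed on that one point
        w -= step_size * error * x    # step along the negative gradient of 0.5 * error**2
    return w


# Hypothetical noiseless data generated by y = 3 * x.
data = [(i / 10.0, 3.0 * i / 10.0) for i in range(1, 11)]
w = sgm_least_squares(data, steps=2000, step_size=0.5)
print(w)  # should be close to 3 for this toy problem
```

Each iteration touches only one example, which is what makes the method cheap per step and easy to scale to large data sets.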
In this talk, I will explore why SGM has had such staying power, focusing on the notion of generalization. I will show that any model trained with a few SGM iterations has vanishing generalization error and performs as well on unseen data as on the training data. The analysis employs only elementary tools from convex and continuous optimization. Applying the results to the convex case provides new explanations for why multiple epochs of stochastic gradient descent generalize well in practice, and gives new insights into minibatch sizes in SGM. In the nonconvex case, I will describe a new interpretation of common practices in neural networks, and provide a formal rationale for stability-promoting mechanisms in training large, deep models.
Joint work with Moritz Hardt (Google Brain/UC Berkeley) and Yoram Singer (Google Brain).