Stochastic Gradient Descent - Iterative Method

Iterative Method

In stochastic (or "on-line") gradient descent, the true gradient of the objective Q(w) = (1/n) Σ_i Q_i(w) is approximated by the gradient at a single example:

    w := w − η ∇Q_i(w)

As the algorithm sweeps through the training set, it performs the above update for each training example. Several passes over the training set are made until the algorithm converges. Typical implementations may also randomly shuffle training examples at each pass and use an adaptive learning rate.

In pseudocode, stochastic gradient descent with shuffling of the training set at each pass can be presented as follows:

  • Choose an initial vector of parameters w and learning rate η.
  • Repeat until an approximate minimum is obtained:
    • Randomly shuffle examples in the training set.
    • For i = 1, 2, ..., n, do:
      • w := w − η ∇Q_i(w).
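The procedure above can be sketched in Python. The linear least-squares loss Q_i(w) = ((x_i · w) − y_i)² / 2, the synthetic data, and the parameter values are illustrative assumptions, not part of the original text:

```python
import numpy as np

# Minimal sketch of stochastic gradient descent for linear least squares,
# assuming the single-example loss Q_i(w) = ((x_i . w) - y_i)^2 / 2.
def sgd(X, y, eta=0.01, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])           # initial parameter vector
    for _ in range(n_epochs):          # repeat until an approximate minimum
        idx = rng.permutation(len(y))  # shuffle examples at each pass
        for i in idx:
            grad_i = (X[i] @ w - y[i]) * X[i]  # gradient of Q_i at w
            w -= eta * grad_i                  # single-example update
    return w

# Usage: recover the parameters of a noiseless synthetic linear model.
X = np.random.default_rng(1).normal(size=(200, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w
w = sgd(X, y)
```

Because the data here are noiseless, every single-example gradient vanishes at the true solution, so a fixed learning rate suffices; with noisy data one would typically decrease η over time.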

There is a compromise between the two extremes, computing the true (full-batch) gradient and the gradient at a single example, which is often called "mini-batches": the true gradient is approximated by a sum over a small number of training examples.
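A hedged sketch of the mini-batch variant, again assuming a linear least-squares loss; the batch size and step size are arbitrary illustrative choices:

```python
import numpy as np

# Mini-batch gradient descent: average the gradient over a small batch
# of examples rather than using a single example per update.
def minibatch_sgd(X, y, eta=0.05, batch_size=16, n_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        idx = rng.permutation(n)       # shuffle examples at each pass
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            # averaged mini-batch gradient approximates the true gradient
            grad = (X[b] @ w - y[b]) @ X[b] / len(b)
            w -= eta * grad
    return w

# Usage on noiseless synthetic data.
X = np.random.default_rng(2).normal(size=(256, 3))
true_w = np.array([1.0, 0.5, -2.0])
y = X @ true_w
w = minibatch_sgd(X, y)
```

Averaging over a batch reduces the variance of the gradient estimate, which typically allows a larger step size than single-example updates.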

The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates decrease at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum. This is in fact a consequence of the Robbins–Siegmund theorem.
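The "appropriate rate" condition on the learning rates is classically the Robbins–Monro condition from stochastic approximation, which can be stated as:

```latex
% Step sizes \eta_t must decrease, but not too fast:
\sum_{t=1}^{\infty} \eta_t = \infty,
\qquad
\sum_{t=1}^{\infty} \eta_t^2 < \infty.
% A common schedule satisfying both conditions: \eta_t = \eta_0 / t.
```

The first condition ensures the iterates can travel arbitrarily far from the initialization; the second ensures the noise in the single-example gradients is eventually averaged out.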


