Stochastic Gradient Descent - Background

Background

Main article: M-estimation
See also: Estimating equation

Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum:

    Q(w) = \frac{1}{n} \sum_{i=1}^{n} Q_i(w),

where the parameter w is to be estimated and where typically each summand function Q_i is associated with the i-th observation in the data set (used for training).
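For instance (an illustrative case, not drawn from the passage above), in linear least squares with observations (x_i, y_i) one may take

    Q_i(w) = (x_i^\top w - y_i)^2,

so that Q(w) is the mean squared error over the training set; the sketches later in this section assume this loss.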

In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation (for independent observations). The general class of estimators that arise as minimizers of sums are called M-estimators. However, in statistics, it has long been recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation, as shown, for example, by Thomas Ferguson. Therefore, contemporary statistical theorists often consider stationary points of the likelihood function (or zeros of its derivative, the score function, and other estimating equations).

The sum-minimization problem also arises for empirical risk minimization. In this case, Q_i(w) is the value of the loss function at the i-th example, and Q(w) is the empirical risk.
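The following Python sketch computes this empirical risk under the least-squares loss assumed above; the helper name empirical_risk and the array layout (X of shape (n, d), y of shape (n,)) are illustrative assumptions, not notation from the original text.

```python
import numpy as np

def empirical_risk(w, X, y):
    """Q(w) = (1/n) * sum_i Q_i(w), here with Q_i(w) = (x_i^T w - y_i)^2."""
    per_example_loss = (X @ w - y) ** 2   # Q_i(w) for every i at once
    return per_example_loss.mean()        # average over the n observations
```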

When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations:

    w := w - \eta \, \nabla Q(w) = w - \frac{\eta}{n} \sum_{i=1}^{n} \nabla Q_i(w),

where η is a step size (sometimes called the learning rate in machine learning).
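As a rough sketch of one such batch iteration, again under the assumed least-squares loss, every update evaluates all n summand gradients; the data, the step size eta, and the gradient formula below follow from that assumption rather than from the text.

```python
import numpy as np

def batch_gradient_step(w, X, y, eta):
    """One iteration w := w - eta * grad Q(w) for the assumed least-squares loss."""
    n = X.shape[0]
    residuals = X @ w - y                  # x_i^T w - y_i for every i
    grad = (2.0 / n) * (X.T @ residuals)   # average of all n summand gradients
    return w - eta * grad
```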

In many cases, the summand functions have a simple form that enables inexpensive evaluation of the sum function and of its gradient. For example, in statistics, one-parameter exponential families allow economical function and gradient evaluations.

In other cases, however, evaluating the sum gradient requires evaluating the gradients of all the summand functions, which becomes very expensive when the training set is enormous and no simple closed-form expressions exist. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at each step. This is very effective for large-scale machine learning problems.
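A minimal sketch of that stochastic step, under the same assumed least-squares loss: each update samples a single index i and uses only the gradient of Q_i, so the per-iteration cost no longer grows with n. Sampling a small mini-batch instead of a single example is a common middle ground between the two extremes.

```python
import numpy as np

def sgd_step(w, X, y, eta, rng):
    """One update using the gradient of a single, randomly chosen summand Q_i."""
    i = rng.integers(X.shape[0])               # sample one observation
    grad_i = 2.0 * (X[i] @ w - y[i]) * X[i]    # gradient of Q_i(w) alone
    return w - eta * grad_i

# Usage sketch (hypothetical data shapes):
# rng = np.random.default_rng(0)
# w = sgd_step(w, X, y, 0.01, rng)
```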
