Gradient Descent
Gradient Descent Algorithm
Given a loss function $\ell: \mathbb{R}^d \to \mathbb{R}$:
- Initialize $\theta^0$ with $t = 0$
- For each $t$, calculate the gradient at $\theta^t$: $\nabla \ell(\theta^t)$
- Take the gradient step with step size $\eta > 0$: $\theta^{t+1} = \theta^t - \eta \nabla \ell(\theta^t)$
- Loop until convergence
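As a minimal sketch of the loop above (the test function, step size, and tolerance are illustrative choices, not part of the algorithm itself):

```python
import numpy as np

def gradient_descent(grad, theta0, eta, tol=1e-8, max_iter=10_000):
    """Plain gradient descent: repeat theta <- theta - eta * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:      # stop at a (near-)critical point
            break
        theta = theta - eta * g          # gradient step with step size eta
    return theta

# Example: minimize ell(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta_star = gradient_descent(lambda th: 2 * (th - 3.0), theta0=[0.0], eta=0.1)
```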
Convex Quadratics
Let us first consider the case of quadratic problems, with objective of the type
$$\ell(\theta) = \frac{1}{2}\theta^\top Q \theta + b^\top \theta,$$
where $Q \in \mathbb{R}^{d \times d}$ is symmetric and positive semi-definite.
With diagonalization $Q = U^\top \Lambda U$, where $U$ is orthogonal and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_d)$ contains the eigenvalues of $Q$.
Taking derivatives yields the gradient $\nabla \ell(\theta) = Q\theta + b$.
Behavior of GD on Convex Quadratics
In order to further simplify the analysis, we shift $\theta$ such that the minimizer lies at the origin, i.e. we may assume $b = 0$ and $\ell(0) = 0$.
Taking a gradient step:
$$\theta^{t+1} = \theta^t - \eta Q \theta^t = (I - \eta Q)\,\theta^t.$$
For the coordinates $z = U\theta$ in the eigenbasis of $Q$, this decouples into
$$z_i^{t+1} = (1 - \eta \lambda_i)\, z_i^t, \qquad i = 1, \dots, d.$$
With appropriately chosen step size $\eta < 2/\lambda_{\max}$, every factor satisfies $|1 - \eta\lambda_i| < 1$ (for $\lambda_i > 0$), and gradient descent converges linearly to the minimizer.
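The decoupled dynamics can be checked numerically; a sketch with an illustrative diagonal $Q$, so that the eigenbasis is the standard basis:

```python
import numpy as np

lam = np.array([1.0, 10.0])    # eigenvalues lambda_i of a diagonal Q
eta = 0.15                     # step size; convergence needs eta < 2/lambda_max = 0.2
theta = np.array([1.0, 1.0])

for _ in range(100):
    theta = theta - eta * lam * theta   # gradient step, grad = Q theta

# Each coordinate contracts by the factor (1 - eta * lambda_i) per step.
predicted = (1.0 - eta * lam) ** 100
```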
Optimal Convergence Rate
Considering all eigenvalues $\lambda_i \in [\lambda_{\min}, \lambda_{\max}]$, the best worst-case contraction factor is
$$\rho = \min_\eta \max_i \, |1 - \eta\lambda_i|,$$
which is attained at the optimal step size
$$\eta^* = \frac{2}{\lambda_{\min} + \lambda_{\max}}.$$
This results in the weakest rate of convergence
$$\rho^* = \frac{\kappa - 1}{\kappa + 1},$$
where $\kappa = \lambda_{\max}/\lambda_{\min}$ is the condition number of $Q$.
With a large condition number, convergence will be slow in one direction but fast in another, making the optimization trajectory oscillate.
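A quick numeric check (eigenvalues illustrative): with the optimal step size, both extreme eigendirections contract at exactly the rate $(\kappa-1)/(\kappa+1)$.

```python
lam_min, lam_max = 1.0, 9.0
kappa = lam_max / lam_min                  # condition number
eta_opt = 2.0 / (lam_min + lam_max)        # optimal step size eta*
rate = (kappa - 1.0) / (kappa + 1.0)       # worst-case contraction per step

# Contraction factors |1 - eta* lambda| at both ends of the spectrum:
slow = abs(1.0 - eta_opt * lam_min)
fast = abs(1.0 - eta_opt * lam_max)
```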
Smoothness
Gradient descent can only work if the gradient does not change too much relative to the step size.
Smooth Functions
Namely, the difference of gradients at two points in the parameter space is bounded by the distance between the two points:
$$\|\nabla \ell(\theta_1) - \nabla \ell(\theta_2)\| \leq L \,\|\theta_1 - \theta_2\| \qquad \forall\, \theta_1, \theta_2.$$
A function $\ell$ satisfying this condition is called $L$-smooth.
Implication of Smoothness
If $\ell$ is $L$-smooth, then for all $\theta_1, \theta_2$:
$$\ell(\theta_2) \leq \ell(\theta_1) + \nabla\ell(\theta_1)^\top (\theta_2 - \theta_1) + \frac{L}{2}\|\theta_2 - \theta_1\|^2.$$
In gradient descent, let's say $\theta_2 = \theta_1 - \eta \nabla\ell(\theta_1)$. Plugging this in gives
$$\ell(\theta_2) \leq \ell(\theta_1) - \eta\left(1 - \frac{L\eta}{2}\right)\|\nabla\ell(\theta_1)\|^2.$$
By selecting $\eta = 1/L$, we obtain
$$\ell(\theta_2) \leq \ell(\theta_1) - \frac{1}{2L}\|\nabla\ell(\theta_1)\|^2.$$
One can see that:
- With a proper choice of step size, the loss function is guaranteed to decrease, since the correction term on the right-hand side is non-positive.
- If the function is smooth, i.e. $L$ is small, we can choose a large step size to make larger progress.
- The progress we make towards convergence also depends on the gradient norm.
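The descent guarantee with $\eta = 1/L$ can be verified on a quadratic, where $L = \lambda_{\max}(Q)$ (the matrix and starting point are illustrative):

```python
import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])     # symmetric positive definite
L = np.linalg.eigvalsh(Q).max()            # smoothness constant of 1/2 th^T Q th

def ell(th):  return 0.5 * th @ Q @ th
def grad(th): return Q @ th

theta1 = np.array([1.0, -2.0])
g = grad(theta1)
theta2 = theta1 - (1.0 / L) * g            # gradient step with eta = 1/L

# Guaranteed decrease: ell(theta2) <= ell(theta1) - ||g||^2 / (2L)
bound = ell(theta1) - (g @ g) / (2.0 * L)
```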
Gradient Norm
With a small gradient norm, convergence becomes prohibitively slow. It is thus reasonable to settle for an $\epsilon$-critical point, i.e. a point $\theta$ with $\|\nabla\ell(\theta)\| \leq \epsilon$.

Gradient descent on an $L$-smooth function finds an $\epsilon$-critical point in $\mathcal{O}(1/\epsilon^2)$ steps.

Let $\eta = 1/L$ and sum the per-step decrease over $T$ iterations:
$$\ell(\theta^0) - \ell(\theta^T) \geq \frac{1}{2L} \sum_{t=0}^{T-1} \|\nabla\ell(\theta^t)\|^2.$$
This means in at least one of the iterations we have
$$\|\nabla\ell(\theta^t)\|^2 \leq \frac{2L\left(\ell(\theta^0) - \ell(\theta^T)\right)}{T}.$$
If we have not reached an $\epsilon$-critical point after $T$ iterations, then $\|\nabla\ell(\theta^t)\| > \epsilon$ for all $t < T$, so
$$T < \frac{2L\left(\ell(\theta^0) - \inf_\theta \ell(\theta)\right)}{\epsilon^2}.$$
Therefore $T \in \mathcal{O}(1/\epsilon^2)$ iterations suffice.
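The bound can be checked by tracking the smallest gradient norm along the trajectory (the quadratic objective is an illustrative choice):

```python
import numpy as np

Q = np.diag([1.0, 4.0])
L = 4.0                                    # lambda_max(Q), the smoothness constant
theta = np.array([10.0, 10.0])
ell0 = 0.5 * theta @ Q @ theta

T = 50
min_grad_sq = np.inf
for _ in range(T):
    g = Q @ theta
    min_grad_sq = min(min_grad_sq, g @ g)  # best gradient norm seen so far
    theta = theta - (1.0 / L) * g
ellT = 0.5 * theta @ Q @ theta

# Guarantee: min_t ||grad||^2 <= 2L (ell(theta^0) - ell(theta^T)) / T
bound = 2.0 * L * (ell0 - ellT) / T
```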
Strong Convexity and the PL-condition
Polyak-Łojasiewicz Condition
A differentiable function $\ell$ with minimum value $\ell^* = \inf_\theta \ell(\theta)$ satisfies the Polyak-Łojasiewicz (PL) condition with parameter $\mu > 0$ if
$$\frac{1}{2}\|\nabla\ell(\theta)\|^2 \geq \mu\left(\ell(\theta) - \ell^*\right) \qquad \forall\, \theta.$$
Let $\ell$ be $L$-smooth and satisfy the PL condition with parameter $\mu$. Then gradient descent with $\eta = 1/L$ satisfies
$$\ell(\theta^t) - \ell^* \leq \left(1 - \frac{\mu}{L}\right)^t \left(\ell(\theta^0) - \ell^*\right).$$
The PL condition is a fundamental property that directly implies geometric convergence to the minimum.
Strongly Convex Functions
A differentiable function $\ell$ is $\mu$-strongly convex with $\mu \geq 0$ if for all $\theta_1, \theta_2$:
$$\ell(\theta_2) \geq \ell(\theta_1) + \nabla\ell(\theta_1)^\top(\theta_2 - \theta_1) + \frac{\mu}{2}\|\theta_2 - \theta_1\|^2.$$
- $\mu = 0$ reduces to the special case of convex functions.
- A positive definite quadratic function is strongly convex with $\mu = \lambda_{\min}(Q)$.
- For twice differentiable functions which are $\mu$-strongly convex and $L$-smooth, we have $\mu I \preceq \nabla^2\ell(\theta) \preceq L I$.
The PL condition is implied by strong convexity:
Let $\theta_1$ be fixed. Minimizing both sides of the strong convexity condition over $\theta_2$: the left-hand side attains $\ell^*$, while the right-hand side is minimized at $\theta_2 = \theta_1 - \frac{1}{\mu}\nabla\ell(\theta_1)$, giving
$$\ell^* \geq \ell(\theta_1) - \frac{1}{2\mu}\|\nabla\ell(\theta_1)\|^2.$$
Therefore,
$$\frac{1}{2}\|\nabla\ell(\theta_1)\|^2 \geq \mu\left(\ell(\theta_1) - \ell^*\right),$$
which gives the PL condition.
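Geometric convergence can be observed on a strongly convex quadratic, where the PL condition holds with $\mu = \lambda_{\min}(Q)$ (values illustrative):

```python
import numpy as np

Q = np.diag([0.5, 2.0])
mu, L = 0.5, 2.0                           # lambda_min and lambda_max of Q
theta = np.array([4.0, -3.0])

gap0 = 0.5 * theta @ Q @ theta             # ell(theta^0) - ell*, with ell* = 0
t = 20
for _ in range(t):
    theta = theta - (1.0 / L) * (Q @ theta)   # GD step with eta = 1/L
gap = 0.5 * theta @ Q @ theta

# PL guarantee: gap <= (1 - mu/L)^t * gap0
bound = (1.0 - mu / L) ** t * gap0
```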
In DNNs, the PL condition will typically not hold globally, but possibly over a domain around a local minimum. It then ensures fast local convergence to this critical point without making claims about its sub-optimality.
Momentum and Acceleration
Saddle Points
The training objective of a DNN is usually non-convex. It thus may contain saddle points, which can slow down GD in their neighborhood. Therefore, it is useful to modify the GD algorithm, e.g. by introducing some noise or by accumulating past gradients, so that small gradients near a saddle point do not stall progress.
Heavy Ball Method
In the heavy ball method, we add a momentum term to the gradient step:
$$\theta^{t+1} = \theta^t - \eta\nabla\ell(\theta^t) + \beta\left(\theta^t - \theta^{t-1}\right), \qquad \beta \in [0, 1).$$
With a constant gradient $\nabla\ell \equiv g$, the displacements form a geometric series and the update approaches
$$\theta^{t+1} = \theta^t - \frac{\eta}{1-\beta}\, g.$$
Therefore, by using large momentum, i.e. $\beta$ close to $1$, the effective step size is amplified by a factor of $\frac{1}{1-\beta}$.
Practically, as the gradient is not a constant, a too large $\beta$ can cause oscillations and instability.
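The geometric-series argument can be checked directly: under a constant gradient, the per-step displacement converges to $-\frac{\eta}{1-\beta}\, g$ (values illustrative):

```python
eta, beta, g = 0.1, 0.9, 1.0               # step size, momentum, constant gradient
theta_prev, theta = 0.0, 0.0
for _ in range(500):
    theta_next = theta - eta * g + beta * (theta - theta_prev)
    theta_prev, theta = theta, theta_next

displacement = theta - theta_prev          # tends to -eta * g / (1 - beta)
effective_step = eta / (1.0 - beta)        # 10x amplification for beta = 0.9
```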
Nesterov Acceleration
Nesterov acceleration pursues the same idea as the heavy ball method, but evaluates the gradient at the extrapolated point:
$$\theta^{t+1} = \theta^t + \beta\left(\theta^t - \theta^{t-1}\right) - \eta\nabla\ell\!\left(\theta^t + \beta\left(\theta^t - \theta^{t-1}\right)\right).$$
- Known to be optimal in the convex case and accelerated in the strongly-convex case
- The heavy ball method, however, seems to robustly result in benefits for non-convex functions
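A minimal sketch of the update (the quadratic test problem and the choice $\beta = 0.8$ are illustrative, not prescribed above):

```python
import numpy as np

def nesterov(grad, theta0, eta, beta, iters):
    """Nesterov acceleration: gradient evaluated at the extrapolated point."""
    theta_prev = theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        lookahead = theta + beta * (theta - theta_prev)     # extrapolation
        theta_prev, theta = theta, lookahead - eta * grad(lookahead)
    return theta

# Ill-conditioned quadratic 1/2 theta^T Q theta with kappa = 100.
Q = np.diag([1.0, 100.0])
theta = nesterov(lambda th: Q @ th, [1.0, 1.0], eta=1.0 / 100.0, beta=0.8, iters=300)
```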
AdaGrad
AdaGrad uses an adaptive learning rate for each single dimension. It uses the history of gradients at previous iterations to influence the effective step size. Defining
$$\gamma_i^t = \sum_{s=1}^{t} \left(\partial_i \ell(\theta^s)\right)^2,$$
which is the sum of the squares of the $i$-th partial derivatives over all previous iterations, the update rule is
$$\theta_i^{t+1} = \theta_i^t - \frac{\eta}{\sqrt{\gamma_i^t} + \delta}\, \partial_i \ell(\theta^t),$$
where $\delta > 0$ is a small constant for numerical stability.
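A minimal AdaGrad sketch (objective and learning rate are illustrative). Note how the per-coordinate rescaling makes the two coordinates below follow essentially the same trajectory, even though their gradients differ by a factor of 100:

```python
import numpy as np

def adagrad_step(theta, g, gamma, eta=0.1, delta=1e-8):
    """One AdaGrad update with accumulated squared partial derivatives gamma."""
    gamma = gamma + g ** 2                              # gamma_i^t accumulator
    theta = theta - eta * g / (np.sqrt(gamma) + delta)  # per-coordinate step
    return theta, gamma

theta = np.array([1.0, 1.0])
gamma = np.zeros(2)
for _ in range(200):
    # Gradient of ell(theta) = 1/2 theta_1^2 + 50 theta_2^2
    g = np.array([theta[0], 100.0 * theta[1]])
    theta, gamma = adagrad_step(theta, g, gamma)
```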
Adam and RMSprop
Adaptive moment estimation, also known as Adam, is a state-of-the-art learning algorithm. It combines the benefits of momentum and AdaGrad, using an exponentially weighted average to estimate the mean and variance of each partial derivative:
$$g_i^t = \beta_1 g_i^{t-1} + (1 - \beta_1)\,\partial_i \ell(\theta^t), \qquad h_i^t = \beta_2 h_i^{t-1} + (1 - \beta_2)\left(\partial_i \ell(\theta^t)\right)^2.$$
The update rule becomes
$$\theta_i^{t+1} = \theta_i^t - \frac{\eta}{\sqrt{h_i^t} + \delta}\, g_i^t.$$
Adam without the use of momentum is called RMSprop.
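A compact Adam sketch. The bias-correction factors $1 - \beta_k^t$ follow the original Adam paper and are not part of the update described above; the test function is illustrative:

```python
import numpy as np

def adam(grad, theta0, eta=0.1, beta1=0.9, beta2=0.999, delta=1e-8, iters=300):
    """Adam: exponentially weighted estimates of the mean (g_bar) and the
    uncentered variance (h_bar) of each partial derivative."""
    theta = np.asarray(theta0, dtype=float)
    g_bar = np.zeros_like(theta)
    h_bar = np.zeros_like(theta)
    for t in range(1, iters + 1):
        g = grad(theta)
        g_bar = beta1 * g_bar + (1 - beta1) * g
        h_bar = beta2 * h_bar + (1 - beta2) * g ** 2
        m_hat = g_bar / (1 - beta1 ** t)           # bias-corrected mean
        v_hat = h_bar / (1 - beta2 ** t)           # bias-corrected variance
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + delta)
    return theta

# The very first Adam step has magnitude ~eta in every coordinate,
# regardless of the gradient's scale.
one_step = adam(lambda th: 2 * (th - 1.0), [5.0, -5.0], iters=1)
```

Setting $\beta_1 = 0$, i.e. dropping the mean estimate, recovers the RMSprop update.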