Gradient Descent

Gradient Descent Algorithm

Given a loss function $\ell(\theta)$ that we want to minimize, the gradient descent (GD) algorithm works as follows:

  1. Initialize $\theta$ with $\theta^0$.
  2. At the current iterate $\theta^k$, calculate the gradient $\nabla\ell(\theta^k)$.
  3. Take the gradient step with step size $\eta$: $\theta^{k+1} \leftarrow \theta^k - \eta\,\nabla\ell(\theta^k)$.
  4. Loop until convergence.
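
As a minimal sketch (not part of the original notes), the loop can be written in a few lines of Python; `grad_loss` stands in for any callable returning $\nabla\ell(\theta)$, and the stopping criterion is an assumed gradient-norm threshold:

```python
import numpy as np

def gradient_descent(grad_loss, theta0, eta=0.1, tol=1e-8, max_iter=10_000):
    """Minimal GD sketch: iterate theta <- theta - eta * grad until the
    gradient norm falls below tol (a simple convergence criterion)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad_loss(theta)
        if np.linalg.norm(g) < tol:
            break
        theta = theta - eta * g
    return theta
```

For instance, `gradient_descent(lambda th: 2 * th, np.array([3.0]))` minimizes $\ell(\theta) = \theta^2$.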

Convex Quadratics

Let us first consider the case of quadratic problems, with an objective of the type

$$\ell(\theta) = \frac{1}{2}\theta^T Q\,\theta - q^T\theta,$$

where $Q$ is positive definite.

Diagonalizing $Q = U \Lambda U^T$ with orthogonal $U$, and substituting $\theta \leftarrow U^T\theta$ and $q \leftarrow U^T q$, the objective becomes

$$\ell(\theta) = \sum_{i=1}^{d} \ell_i(\theta_i), \qquad \ell_i(\vartheta) = \frac{\lambda_i}{2}\vartheta^2 - q_i\vartheta,$$

where $\lambda_i > 0$.

Taking derivatives $\ell_i'(\vartheta) = \lambda_i\vartheta - q_i$, we obtain the optima

$$\theta_i^* = \frac{q_i}{\lambda_i}, \qquad \ell_i^* = \ell_i(\theta_i^*) = -\frac{q_i^2}{2\lambda_i}.$$

Behavior of GD on Convex Quadratics

In order to further simplify the analysis, we shift $\ell_i$ by a constant such that $\min_\vartheta \ell_i(\vartheta) = 0$:

$$\ell_i(\vartheta) = \frac{1}{2\lambda_i}\left(\lambda_i\vartheta - q_i\right)^2.$$

Taking a gradient step:

$$\ell_i\big(\vartheta - \eta(\lambda_i\vartheta - q_i)\big) = (1 - \lambda_i\eta)^2\,\ell_i(\vartheta).$$

For $\eta > 0$, the condition $(1 - \lambda_i\eta)^2 < 1$ holds exactly when $\eta < \frac{2}{\lambda_i}$; requiring this for all $i$ gives

$$\eta < \frac{2}{\lambda_{\max}}, \qquad \lambda_{\max} = \max_i\{\lambda_i\}.$$

With an appropriately chosen step size $\eta < \frac{2}{\lambda_{\max}}$, gradient descent converges exponentially fast to the minimum of a convex quadratic.
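
To make this concrete, here is a small numerical check (illustrative only; the eigenvalues, offsets, and step sizes below are arbitrary choices): per coordinate, the shifted loss contracts by exactly $(1 - \eta\lambda_i)^2$ per step, and GD diverges once $\eta \geq 2/\lambda_{\max}$:

```python
import numpy as np

lam = np.array([1.0, 10.0])    # eigenvalues of a diagonal Q
q = np.array([2.0, 5.0])
loss = lambda th: 0.5 * (lam * th - q) ** 2 / lam   # shifted per-coordinate loss

theta0 = np.array([5.0, -3.0])
for eta in [0.15, 0.21]:       # 0.15 < 2/lam_max = 0.2, while 0.21 > 0.2
    th = theta0.copy()
    for _ in range(100):
        th = th - eta * (lam * th - q)   # gradient step
    print(eta, loss(th))       # converges for 0.15; second coordinate blows up for 0.21
```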

Optimal Convergence Rate

Considering all $i$, to optimize the rate of convergence we need to minimize the largest $(1 - \eta\lambda_i)^2$ term, which is where convergence is slowest:

$$\eta^* = \arg\min_\eta \max_i\,(1 - \eta\lambda_i)^2 = \arg\min_\eta \max\{\eta\lambda_{\max} - 1,\; 1 - \eta\lambda_{\min}\},$$

which is attained at $\eta\lambda_{\max} - 1 = 1 - \eta\lambda_{\min}$, i.e. when

$$\eta^* = \frac{2}{\lambda_{\max} + \lambda_{\min}} < \frac{2}{\lambda_{\max}}.$$

This results in the worst-case (slowest) rate of convergence

$$\rho = (1 - \lambda_{\max}\eta^*)^2 = \left(\frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}}\right)^2 = \left(\frac{\kappa - 1}{\kappa + 1}\right)^2,$$

where $\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}$ is the condition number of $Q$.
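
As an illustrative calculation (not from the notes): with $\kappa = 100$, $\rho = (99/101)^2 \approx 0.96$, so each step removes only about $4\%$ of the remaining loss and roughly $115$ iterations are needed to reduce it by a factor of $100$ (since $0.96^{115} \approx 0.01$); with $\kappa = 2$, $\rho = (1/3)^2 \approx 0.11$ and a handful of steps suffice.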

With a bad condition number, convergence will be slow in one direction but fast in another, causing the optimization to oscillate.

Smoothness

Gradient descent can only work if the gradient does not change too much relative to the step size.

Smooth Functions

$\ell : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth for some $L > 0$ if

$$\|\nabla\ell(\theta) - \nabla\ell(\theta')\| \leq L\,\|\theta - \theta'\|, \qquad \forall\,\theta, \theta'.$$

Namely, the difference of the gradients at two points in parameter space is bounded by $L$ times the distance between the points; in other words, $\nabla\ell$ is a Lipschitz continuous function.

Lipschitz Continuous Functions

A function $f$ is Lipschitz continuous with constant $L$ if for every two points $x$ and $x'$, the slope between them is bounded:

$$\left|\frac{f(x) - f(x')}{x - x'}\right| \leq L.$$
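
As a sanity check (an ad hoc sketch; the quadratic and the random sampling are arbitrary choices), one can estimate the Lipschitz constant of $\nabla\ell$ empirically; for a quadratic $\ell(\theta) = \frac{1}{2}\theta^T Q\theta - q^T\theta$ the observed ratios should approach $\lambda_{\max}(Q)$:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite
grad = lambda th: Q @ th - np.array([1.0, 1.0])

# Ratio ||grad(a) - grad(b)|| / ||a - b|| over random point pairs.
ratios = []
for _ in range(10_000):
    a, b = rng.normal(size=2), rng.normal(size=2)
    ratios.append(np.linalg.norm(grad(a) - grad(b)) / np.linalg.norm(a - b))

print(max(ratios), np.linalg.eigvalsh(Q).max())  # both close to lambda_max
```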

Implication of Smoothness

If $\ell$ is twice differentiable, taking the Taylor expansion of $\ell(\theta')$ around $\theta$ (with $\tilde\theta$ some point on the segment between $\theta$ and $\theta'$):

$$\ell(\theta') - \ell(\theta) = (\theta' - \theta)^T\nabla\ell(\theta) + \frac{1}{2}(\theta' - \theta)^T H_\ell(\tilde\theta)\,(\theta' - \theta) \leq (\theta' - \theta)^T\nabla\ell(\theta) + \frac{L}{2}\|\theta' - \theta\|^2.$$

In gradient descent we set $\theta' = \theta - \eta\,\nabla\ell(\theta)$, giving

$$\ell(\theta') - \ell(\theta) \leq -\eta\left(1 - \frac{L\eta}{2}\right)\|\nabla\ell(\theta)\|^2.$$

By selecting $\eta = \frac{1}{L}$:

$$\ell(\theta') - \ell(\theta) \leq -\frac{1}{2L}\|\nabla\ell(\theta)\|^2.$$

One can see that with this step size a gradient step never increases the objective, and the guaranteed decrease is proportional to the squared norm of the gradient.
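
A small numerical check of this descent guarantee (illustrative; the quadratic below is an arbitrary $L$-smooth example with $L = \lambda_{\max}(Q)$):

```python
import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])
q = np.array([1.0, -1.0])
loss = lambda th: 0.5 * th @ Q @ th - q @ th
grad = lambda th: Q @ th - q
L = np.linalg.eigvalsh(Q).max()          # smoothness constant of a quadratic

theta = np.array([4.0, -2.0])
g = grad(theta)
new = theta - g / L                      # gradient step with eta = 1/L
print(loss(new) - loss(theta))           # actual decrease
print(-np.linalg.norm(g) ** 2 / (2 * L)) # guaranteed upper bound on it
# The first number is <= the second, as the descent inequality predicts.
```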

Gradient Norm

With a small gradient norm, convergence becomes prohibitively slow. It is thus reasonable to look for points $\theta$ where the gradient norm is small enough. Let $\ell$ be differentiable at $\theta$; then $\theta$ is an $\epsilon$-critical point if

$$\|\nabla\ell(\theta)\| \leq \epsilon.$$
theorem

Gradient descent on an $L$-smooth, differentiable function finds an $\epsilon$-critical point in at most $k = \frac{2L\left(\ell(\theta^0) - \ell^*\right)}{\epsilon^2}$ steps, where $\ell^* = \min_\theta \ell(\theta)$. Namely, smoothness is sufficient to find $\epsilon$-critical points with $O(\epsilon^{-2})$ steps of gradient descent.

proof

Let $C = \ell(\theta^0) - \ell^*$. Summing the per-step decrease $\ell(\theta^{r+1}) \leq \ell(\theta^r) - \frac{1}{2L}\|\nabla\ell(\theta^r)\|^2$ over the first $k$ steps:

$$C \geq \ell(\theta^0) - \ell(\theta^k) \geq \frac{1}{2L}\sum_{r=0}^{k-1}\|\nabla\ell(\theta^r)\|^2 \quad\Longrightarrow\quad \frac{1}{k}\sum_{r=0}^{k-1}\|\nabla\ell(\theta^r)\|^2 \leq \frac{2LC}{k}.$$

This means that in at least one of the iterations, with iterate $\vartheta$, we have

$$\|\nabla\ell(\vartheta)\|^2 \leq \frac{2LC}{k}.$$

To guarantee that $\vartheta$ is an $\epsilon$-critical point, it suffices that

$$\|\nabla\ell(\vartheta)\|^2 \leq \frac{2LC}{k} \leq \epsilon^2.$$

Therefore, running

$$k \geq \frac{2LC}{\epsilon^2}$$

steps guarantees an $\epsilon$-critical iterate, which matches the claimed bound.
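
As an illustrative experiment (function and constants chosen ad hoc): $\ell(\theta) = \sin(\theta)$ is $1$-smooth, so with $\eta = 1/L = 1$ the theorem bounds the number of steps before some iterate is $\epsilon$-critical. In practice GD stops far earlier than the bound:

```python
import numpy as np

grad = np.cos                  # derivative of sin
L, eps = 1.0, 1e-2
theta = 1.0                    # starting point
C = np.sin(theta) - (-1.0)     # loss gap C = loss(theta0) - min loss

k = 0
while abs(grad(theta)) > eps:  # stop at the first eps-critical iterate
    theta = theta - grad(theta) / L
    k += 1
print(k, 2 * L * C / eps**2)   # steps actually used vs. theoretical bound
```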

Strong Convexity and the PL-condition

Polyak-Łojasiewicz Condition

A differentiable function $\ell$ obeys the Polyak-Łojasiewicz condition (PL condition) with parameter $\mu > 0$ if and only if

$$\frac{1}{2}\|\nabla\ell(\theta)\|^2 \geq \mu\left(\ell(\theta) - \ell^*\right), \qquad \forall\,\theta.$$
theorem

Let $\ell$ be differentiable, $L$-smooth and $\mu$-PL. Then gradient descent with step size $\eta = \frac{1}{L}$ converges at a geometric rate:

$$\ell(\theta^k) - \ell^* \leq \left(1 - \frac{\mu}{L}\right)^k\left(\ell(\theta^0) - \ell^*\right).$$

The PL condition is a fundamental property that directly implies geometric convergence to the minimum.
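
The proof is a one-line combination of the two ingredients above (a sketch, using the notation of the preceding sections): the descent lemma bounds the per-step decrease, and the PL condition converts the gradient norm back into a loss gap,

$$\ell(\theta^{k+1}) - \ell^* \leq \ell(\theta^k) - \ell^* - \frac{1}{2L}\|\nabla\ell(\theta^k)\|^2 \leq \left(1 - \frac{\mu}{L}\right)\left(\ell(\theta^k) - \ell^*\right),$$

and iterating this contraction $k$ times gives the stated rate.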

Strongly Convex Functions

A differentiable function $\ell$ is $\mu$-strongly convex for some $\mu > 0$ if

$$\ell(\theta') \geq \ell(\theta) + (\theta' - \theta)^T\nabla\ell(\theta) + \frac{\mu}{2}\|\theta' - \theta\|^2, \qquad \forall\,\theta, \theta'.$$

If $\ell$ is additionally twice differentiable and $L$-smooth, this corresponds to the Hessian bound $0 \prec \mu I \preceq H_\ell(\theta) \preceq L I$.

The PL condition is implied by strong convexity:

theorem

Let $\ell$ be $\mu$-strongly convex; then it fulfills the PL condition with the same $\mu$.

proof

Minimizing both sides of the strong convexity condition over $\theta'$:

$$\ell^* - \ell(\theta) = \min_{\theta'}\,\ell(\theta') - \ell(\theta) \geq \min_{\theta'}\left\{(\theta' - \theta)^T\nabla\ell(\theta) + \frac{\mu}{2}\|\theta' - \theta\|^2\right\} = -\frac{1}{2\mu}\|\nabla\ell(\theta)\|^2,$$

where the minimum on the right is attained at $\theta' = \theta - \frac{1}{\mu}\nabla\ell(\theta)$.

Therefore,

$$\frac{1}{2}\|\nabla\ell(\theta)\|^2 \geq \mu\left(\ell(\theta) - \ell^*\right),$$

which gives the PL condition.

In DNNs, the PL condition will typically not hold globally, but possibly over a domain around a local minimum. There it ensures fast local convergence to this critical point, without making claims about its sub-optimality.

Momentum and Acceleration

Saddle Points

The training objective of a DNN is usually non-convex. It may thus contain saddle points, which can slow GD down in their neighborhood. It is therefore useful to modify the GD update, e.g. by adding momentum, so that small gradients near saddle points do not stall progress.

Heavy Ball Method

In the heavy ball method, we add a $\beta$-weighted term that includes the change made in the previous update:

$$\theta^{k+1} \leftarrow \theta^k - \eta\,\nabla\ell(\theta^k) + \beta\left(\theta^k - \theta^{k-1}\right), \qquad \beta \in (0,1).$$

With a constant gradient $\nabla\ell$, one can show that

$$\lim_{k\to\infty}\left(\theta^k - \theta^{k-1}\right) = -\eta\,\nabla\ell\,\sum_{i=0}^{\infty}\beta^i = -\frac{\eta}{1-\beta}\,\nabla\ell.$$

Therefore, by using large momentum, i.e. $\beta \to 1$, one can boost the effective step size by an arbitrarily large factor; e.g., $\beta = 0.9$ yields a $10\times$ boost.

Practically, as the gradient is not constant, too large a $\beta$ will create oscillations and instabilities. Therefore, $\beta$ is usually selected in the range $[0.9, 0.95]$.

Nesterov Acceleration

Nesterov acceleration pursues the same idea as the heavy ball method, but evaluates the gradient at the extrapolated point:

$$\vartheta^{k+1} = \theta^k + \beta\left(\theta^k - \theta^{k-1}\right), \qquad \theta^{k+1} = \vartheta^{k+1} - \eta\,\nabla\ell(\vartheta^{k+1}).$$
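
A side-by-side sketch of the two updates (illustrative Python; `grad_loss` is an assumed callable as before, and the hyperparameters are arbitrary):

```python
import numpy as np

def heavy_ball(grad_loss, theta0, eta=0.01, beta=0.9, steps=1000):
    """Heavy ball: gradient at the current point plus a momentum term."""
    theta = prev = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta, prev = theta - eta * grad_loss(theta) + beta * (theta - prev), theta
    return theta

def nesterov(grad_loss, theta0, eta=0.01, beta=0.9, steps=1000):
    """Nesterov: same momentum, but the gradient is evaluated at the
    extrapolated point theta + beta * (theta - prev)."""
    theta = prev = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        look = theta + beta * (theta - prev)   # extrapolated point
        theta, prev = look - eta * grad_loss(look), theta
    return theta
```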

AdaGrad

AdaGrad uses an adaptive learning rate for each dimension. It uses the history of gradients from previous iterations to influence the effective step size. Defining

$$\gamma_i^k \leftarrow \gamma_i^{k-1} + \left[\nabla_i\,\ell(\theta^k)\right]^2, \qquad \nabla_i \equiv \frac{\partial}{\partial\theta_i},$$

which is the sum of squares of the $i$-th parameter's partial derivatives, we can use these estimates to adapt the step size of each parameter:

$$\theta_i^{k+1} \leftarrow \theta_i^k - \eta_i^k\,\nabla_i\,\ell(\theta^k), \qquad \eta_i^k \equiv \frac{\eta}{\sqrt{\gamma_i^k} + \delta},$$

where $\delta$ is a small positive constant for numerical stability. Parameters with historically smaller magnitudes of their partial derivatives are updated with an effectively larger step size.
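
A compact sketch of the per-coordinate update (illustrative; `grad_loss` and the constants are assumptions, not part of the notes):

```python
import numpy as np

def adagrad(grad_loss, theta0, eta=0.1, delta=1e-8, steps=1000):
    """AdaGrad: accumulate squared partial derivatives in gamma and
    scale each coordinate's step by eta / (sqrt(gamma) + delta)."""
    theta = np.asarray(theta0, dtype=float)
    gamma = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_loss(theta)
        gamma += g ** 2                       # per-coordinate gradient history
        theta = theta - eta / (np.sqrt(gamma) + delta) * g
    return theta
```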

Adam and RMSprop

Adaptive moment estimation, also known as Adam, is a state-of-the-art learning algorithm. It combines the benefits of momentum and AdaGrad, using exponentially weighted moving averages to estimate the mean and the (uncentered) second moment of each partial derivative:

$$g_i^k = \beta\,g_i^{k-1} + (1-\beta)\,\nabla_i\,\ell(\theta^k), \qquad g_i^0 \leftarrow \nabla_i\,\ell(\theta^0),$$

$$h_i^k = \alpha\,h_i^{k-1} + (1-\alpha)\left[\nabla_i\,\ell(\theta^k)\right]^2, \qquad h_i^0 \leftarrow \left[\nabla_i\,\ell(\theta^0)\right]^2.$$

The update rule becomes

$$\theta_i^{k+1} = \theta_i^k - \eta_i^k\,g_i^k, \qquad \eta_i^k \equiv \frac{\eta}{\sqrt{h_i^k} + \delta}.$$

Adam without the momentum term, i.e. using the raw gradient $\nabla_i\,\ell(\theta^k)$ in place of $g_i^k$, is called RMSprop.
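
A minimal sketch following the update equations above (note: this mirrors the variant described here, which initializes the averages with the first gradient rather than using the zero initialization and bias correction of the original Adam paper; `grad_loss` is again an assumed callable):

```python
import numpy as np

def adam(grad_loss, theta0, eta=1e-3, beta=0.9, alpha=0.999,
         delta=1e-8, steps=1000):
    """Adam as described above: exponentially weighted mean (g) and
    second moment (h) of the gradient, with a per-coordinate step size."""
    theta = np.asarray(theta0, dtype=float)
    g0 = grad_loss(theta)
    g, h = g0.copy(), g0 ** 2                  # initialize with first gradient
    for _ in range(steps):
        grad = grad_loss(theta)
        g = beta * g + (1 - beta) * grad               # mean estimate
        h = alpha * h + (1 - alpha) * grad ** 2        # second-moment estimate
        theta = theta - eta / (np.sqrt(h) + delta) * g
    return theta
```

Setting $\beta = 0$, so that $g$ is just the current gradient, recovers RMSprop.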