Singular Value Decomposition

SVD Theorem

For each matrix $A \in \mathbb{R}^{n \times m}$, there exist a padded diagonal matrix $\Sigma \in \mathbb{R}^{n \times m}$ and orthogonal matrices $U \in \mathbb{R}^{n \times n}$ and $V \in \mathbb{R}^{m \times m}$ such that $A$ can be expressed as

$$A = U \Sigma V^T,$$

where

$$\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_{\min\{n,m\}}).$$

One can often prune the columns of $U$ or $V$ that correspond to zero rows or columns of $\Sigma$. For example, if $n > m$, then we can write
$$A = U \Sigma V^T = U_{:,1:m} \Sigma_{1:m} V^T,$$
where $U_{:,1:m}$ is the first $m$ columns of $U$, and $\Sigma_{1:m}$ is the first $m$ rows (the only possibly non-zero rows) of $\Sigma$.
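
A minimal sketch of the full and reduced factorizations, assuming NumPy (`numpy.linalg.svd`); the matrix shape is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3                      # n > m, so the last n - m columns of U can be pruned
A = rng.standard_normal((n, m))

# Full SVD: U is n x n, Vt is m x m, s holds the min{n, m} singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
Sigma = np.zeros((n, m))
Sigma[:m, :m] = np.diag(s)       # padded diagonal matrix
assert np.allclose(A, U @ Sigma @ Vt)

# Reduced SVD: keep only the first m columns of U and the first m rows of Sigma.
U_r, s_r, Vt_r = np.linalg.svd(A, full_matrices=False)   # U_r is n x m
assert np.allclose(A, U_r @ np.diag(s_r) @ Vt_r)
```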

SVD and Eigendecomposition

The SVD of $A$ is intimately related to the eigendecompositions of the products of $A$ with its transpose:

$$A A^T = U \Sigma V^T V \Sigma^T U^T = U \operatorname{diag}(\sigma_1^2, \dots, \sigma_n^2) U^T,$$
$$A^T A = V \Sigma^T U^T U \Sigma V^T = V \operatorname{diag}(\sigma_1^2, \dots, \sigma_m^2) V^T,$$

where $\sigma_r = 0$ if $r > \min\{n,m\}$.

One can also note that in PCA, we perform an eigendecomposition of the covariance matrix $X X^T$. Therefore, we can equivalently apply SVD to the data matrix $X$ to identify the principal eigenvectors.
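
A short numerical check of this relationship, assuming NumPy; the centered data matrix `X` is a made-up example:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 100))          # rows are features, columns are samples
X = X - X.mean(axis=1, keepdims=True)      # center so that X @ X.T is the (unnormalized) covariance

U, s, Vt = np.linalg.svd(X, full_matrices=False)
eigvals, eigvecs = np.linalg.eigh(X @ X.T) # eigendecomposition of the covariance matrix

# Squared singular values of X match the eigenvalues of X X^T;
# the columns of U are the principal eigenvectors (up to sign and ordering).
assert np.allclose(np.sort(s ** 2)[::-1], np.sort(eigvals)[::-1])
```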

SVD and Frobenius Norm

The squared Frobenius norm of a matrix $A \in \mathbb{R}^{n \times m}$ is the sum of its squared singular values:

$$\|A\|_F^2 = \sum_{i=1}^{\min\{n,m\}} \sigma_i^2.$$
Proof

Using the cyclic property of the trace,
$$\|A\|_F^2 = \operatorname{tr}(A^T A) = \operatorname{tr}\big(V \operatorname{diag}(\sigma_1^2, \dots, \sigma_m^2) V^T\big) = \operatorname{tr}\big(\operatorname{diag}(\sigma_1^2, \dots, \sigma_m^2) \underbrace{V^T V}_{I}\big) = \operatorname{tr}\big(\operatorname{diag}(\sigma_1^2, \dots, \sigma_m^2)\big) = \sum_{i=1}^{\min\{n,m\}} \sigma_i^2.$$
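
A quick numerical check of this identity (a NumPy sketch; the matrix is an arbitrary example):

```python
import numpy as np

A = np.random.default_rng(2).standard_normal((6, 4))
s = np.linalg.svd(A, compute_uv=False)     # singular values only

# ||A||_F^2 equals the sum of the squared singular values.
assert np.isclose(np.linalg.norm(A, "fro") ** 2, np.sum(s ** 2))
```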

SVD and Spectral Norm

The spectral norm of a matrix $A$ equals the largest singular value $\sigma_1$:

$$\|A\|_2 := \sup\{\|Ax\| : \|x\| = 1\} = \sigma_1.$$
Proof

$$\sup\{\|Ax\| : \|x\| = 1\} = \sup\{\sqrt{x^T A^T A x} : \|x\| = 1\} = \sup\{\sqrt{x^T V \Sigma^T \Sigma V^T x} : \|x\| = 1\} = \sup\{\|\Sigma V^T x\| : \|x\| = 1\} = \sup\{\|\Sigma z\| : \|z\| = 1\} = \|\Sigma\|_2 = \sigma_1,$$
where the substitution $z = V^T x$ ranges over all unit vectors because $V$ is orthogonal.
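
An analogous check for the spectral norm, assuming NumPy (whose matrix norm with `ord=2` is the largest singular value):

```python
import numpy as np

A = np.random.default_rng(3).standard_normal((6, 4))
s = np.linalg.svd(A, compute_uv=False)

# ||A||_2 equals sigma_1, the largest singular value.
assert np.isclose(np.linalg.norm(A, 2), s[0])
```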

SVD for Low-Rank Approximation

Eckart-Young Theorem

Let $A \in \mathbb{R}^{n \times m}$ have SVD $A = U \Sigma V^T$. Then for all $1 \le k \le \min\{n,m\}$:
$$A_k := \operatorname*{argmin}_{B} \{\|A - B\|_F : \operatorname{rank}(B) \le k\} = U \underbrace{\operatorname{diag}(\sigma_1, \dots, \sigma_k)}_{\Sigma_k} V^T,$$
where $\Sigma_k$ is rectangular and padded with zeros according to the dimensions of $U$ and $V$.

The theorem also gives the following corollary:

Corollary

The squared error of the rank-$k$ approximation can be expressed as
$$\|A - A_k\|_F^2 = \sum_{i=k+1}^{\operatorname{rank}(A)} \sigma_i^2.$$

In addition, $A_k$ also minimizes the approximation error measured in the spectral norm:

$$A_k = \operatorname*{argmin}_{B} \{\|A - B\|_2 : \operatorname{rank}(B) \le k\}.$$

The truncated SVD can be seen as an additive superposition of $k$ rank-1 matrices:

$$A \approx A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T,$$
where $u_i$ and $v_i$ denote the $i$-th columns of $U$ and $V$.
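
A sketch tying these pieces together, assuming NumPy: build the truncated SVD $A_k$, verify the error formula from the corollary, and form the same approximation as a sum of $k$ rank-1 terms (the matrix and $k$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2

# Best rank-k approximation: keep the k largest singular values.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Corollary: squared Frobenius error is the sum of the discarded squared singular values.
assert np.isclose(np.linalg.norm(A - A_k, "fro") ** 2, np.sum(s[k:] ** 2))

# Equivalent view: additive superposition of k rank-1 matrices sigma_i * u_i v_i^T.
A_k_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))
assert np.allclose(A_k, A_k_sum)
```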

SVD for Matrix Completion

For fully observed matrices, the SVD is computable with $O(\min\{nm^2, mn^2\})$ complexity, which makes it an important example of a tractable non-convex problem. However, in the case of incomplete observations, SVD is in general not directly applicable to compute low-rank approximations. Low-rank matrix approximation with a weighted Frobenius norm is defined as follows:

$$\hat{A}_k = \operatorname*{argmin}_{B} \sum_{i,j} w_{ij} (a_{ij} - b_{ij})^2, \quad \text{s.t. } \operatorname{rank}(B) = k,$$

where the weights $w_{ij} = \omega_{ij} \in \{0,1\}$ indicate observed entries in the collaborative filtering setting, or they can be arbitrary positive values. In both cases, the problem has been shown to be NP-hard, even for $k = 1$.
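
Since the weighted problem has no closed-form SVD solution, practice resorts to heuristics or relaxations. Below is a minimal sketch of one common heuristic, not a method prescribed by this text: alternately impute the missing entries and project onto rank-$k$ matrices with a truncated SVD. The function name `svd_impute`, the toy data, the mask, and the iteration count are all made-up choices for illustration.

```python
import numpy as np

def svd_impute(A, mask, k, n_iters=50):
    """Heuristic rank-k completion: alternate between filling missing entries
    and projecting onto rank-k matrices via truncated SVD."""
    B = np.where(mask, A, 0.0)                        # initialize missing entries with zeros
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        B_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k projection
        B = np.where(mask, A, B_k)                    # keep observed entries, impute the rest
    return B_k

# Toy example: a rank-2 matrix with roughly 30% of its entries hidden.
rng = np.random.default_rng(5)
A = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 15))
mask = rng.random(A.shape) > 0.3                      # True where w_ij = 1 (observed)
A_hat = svd_impute(A, mask, k=2)
print(np.linalg.norm((A - A_hat)[~mask]))             # error on the unobserved entries
```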