Variational Autoencoders
Autoencoders
- An autoencoder consists of an encoder $f$ and a decoder $g$.
- The encoder maps the original input $x$ from the input space to the latent space: $z = f(x)$.
- The decoder reconstructs the sample from the latent space back to the input space: $\hat{x} = g(z)$.
- Ideally, $g(f(x)) \approx x$.
- A good latent space should represent the data using meaningful degrees of freedom.
- It should have continuity for interpolation (e.g. a vector to control the degree of smile).
Formulation
- Given two parameterized mappings $f_\theta$ (encoder) and $g_\phi$ (decoder), training consists of minimizing the reconstruction loss (a minimal code sketch follows below)
  $$\mathcal{L}(\theta, \phi) = \sum_i \left\| x_i - g_\phi\!\left(f_\theta(x_i)\right) \right\|^2$$
- In the linear case, PCA is the optimal autoencoder that minimizes the reconstruction loss.
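As a concrete illustration, here is a minimal sketch of this formulation, assuming a PyTorch setup with a flattened input of size `input_dim` and a latent code of size `latent_dim` (the architecture and names are illustrative, not taken from the notes):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal autoencoder: encoder f maps x -> z, decoder g maps z -> x_hat."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x):
        z = self.encoder(x)       # f(x): input space -> latent space
        return self.decoder(z)    # g(z): latent space -> input space

# Reconstruction loss || x - g(f(x)) ||^2, averaged over a dummy batch
model = Autoencoder()
x = torch.randn(16, 784)
loss = ((x - model(x)) ** 2).sum(dim=1).mean()
loss.backward()
```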
Dimensionality of Hidden Layer
- If $\dim(z) < \dim(x)$, the hidden layer is called undercomplete.
  - The hidden layer compresses the input.
  - It will compress well for training samples.
  - Hidden units will provide good features for training samples, but are bad for out-of-distribution samples.
- If $\dim(z) \geq \dim(x)$, the hidden layer is called overcomplete.
  - No compression.
  - But it should be robust to noise: such models are also called denoising autoencoders.
  - Idea for training (a sketch follows this list):
    - Add some noise to the input $x$ to get a corrupted input $\tilde{x}$.
    - The reconstruction $\hat{x}$ is computed from $\tilde{x}$.
    - The loss compares $\hat{x}$ with the original $x$.
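A rough sketch of this denoising training step, reusing the hypothetical `Autoencoder` above and assuming additive Gaussian corruption (the noise model and its scale are illustrative choices, not specified in the notes):

```python
def denoising_step(model, x, optimizer, noise_std=0.3):
    """One denoising-autoencoder update: corrupt x, reconstruct, compare to the clean x."""
    x_tilde = x + noise_std * torch.randn_like(x)    # corrupted input
    x_hat = model(x_tilde)                           # reconstruction computed from x_tilde
    loss = ((x - x_hat) ** 2).sum(dim=1).mean()      # loss compares x_hat with the original x
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```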
Limitations of Autoencoders
- The decoder is not able to generate good quality samples.
- The latent space is not well-structured: there are some places with no training samples.
Variational Autoencoders (VAE)
- Instead of predicting a single point in the latent space, we predict a Gaussian distribution $\mathcal{N}(\mu_{z \mid x}, \sigma_{z \mid x}^2)$ over the latent code.
- We can generate new samples from the distribution.
- There are no "holes" in the latent space.
Modeling
- Assume the training data $\{x_i\}$ is generated from an underlying latent variable $z$:
  - We first sample $z$ from the prior distribution $p(z)$.
  - Then we sample $x$ from the conditional distribution $p_\theta(x \mid z)$.
- To best represent the prior and the conditional:
  - We need the prior to be a simple distribution (e.g. Gaussian), so we can sample from it.
  - We need the conditional to be complex enough to generate outputs like images: we can represent it with a deep neural network.
Training
- In training, we would like to learn model parameters $\theta$ that maximize the likelihood of the training data
  $$p_\theta(x) = \int p_\theta(z)\, p_\theta(x \mid z)\, dz$$
- However, the integral is intractable: we cannot integrate over every $z$.
- The posterior density $p_\theta(z \mid x) = \dfrac{p_\theta(x \mid z)\, p_\theta(z)}{p_\theta(x)}$ is also intractable.
- The solution is to define an additional encoder network $q_\phi(z \mid x)$ that approximates $p_\theta(z \mid x)$ and is tractable (e.g. it can be a Gaussian).
Data Log-Likelihood
Since $p_\theta(x)$ does not depend on $z$,
$$\log p_\theta(x) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x)\right]$$
Using Bayes' rule, $p_\theta(x) = \dfrac{p_\theta(x \mid z)\, p_\theta(z)}{p_\theta(z \mid x)}$, and multiplying inside the logarithm by $\dfrac{q_\phi(z \mid x)}{q_\phi(z \mid x)}$,
$$\log p_\theta(x) = \mathbb{E}_z\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p_\theta(z)\right) + D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\right)$$
- The expectation term is estimated through the decoder network.
- The first KL divergence term is between Gaussians (encoder and prior), which has a closed-form solution.
- The last KL divergence term is intractable, but it is always $\geq 0$.
Therefore
$$\mathcal{L}(x; \theta, \phi) = \mathbb{E}_z\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p_\theta(z)\right)$$
is called the evidence lower bound (ELBO); it is the tractable lower bound we can maximize, with $\log p_\theta(x) \geq \mathcal{L}(x; \theta, \phi)$.
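For reference, when the encoder predicts a diagonal Gaussian $q_\phi(z \mid x) = \mathcal{N}\!\left(\mu, \operatorname{diag}(\sigma^2)\right)$ and the prior is $p(z) = \mathcal{N}(0, I)$, the closed-form KL term referred to above is the standard identity
$$D_{\mathrm{KL}}\!\left(\mathcal{N}\!\left(\mu, \operatorname{diag}(\sigma^2)\right)\,\|\,\mathcal{N}(0, I)\right) = \frac{1}{2} \sum_{j=1}^{d} \left(\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2\right)$$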
Forward Pass
- Given the input $x$, the encoder network gives us a distribution $q_\phi(z \mid x) = \mathcal{N}(\mu_{z \mid x}, \Sigma_{z \mid x})$ to sample $z$ from.
- Given the latent variable $z$, the decoder network gives us a distribution $p_\theta(x \mid z) = \mathcal{N}(\mu_{x \mid z}, \Sigma_{x \mid z})$ to sample $\hat{x}$ from.
- In practice, the decoder only predicts the mean value $\mu_{x \mid z}$ (see the sketch below).
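A minimal sketch of these two networks in PyTorch, assuming the encoder outputs the mean and log-variance of a diagonal Gaussian and the decoder predicts only the mean (all module and dimension names are illustrative):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.enc_mu = nn.Linear(hidden_dim, latent_dim)        # mu_{z|x}
        self.enc_logvar = nn.Linear(hidden_dim, latent_dim)    # log sigma^2_{z|x}
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, input_dim))  # predicts mu_{x|z}

    def encode(self, x):
        h = self.enc(x)
        return self.enc_mu(h), self.enc_logvar(h)   # parameters of q_phi(z|x)

    def decode(self, z):
        return self.dec(z)                          # mean of p_theta(x|z)
```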
Reparameterization Trick
We need to compute gradients despite the randomness of sampling. When sampling directly from the approximate posterior $z \sim \mathcal{N}(\mu_{z \mid x}, \sigma_{z \mid x}^2)$, the sampling step is not differentiable with respect to $\mu_{z \mid x}$ and $\sigma_{z \mid x}$.
Given a standard normal sample $\epsilon \sim \mathcal{N}(0, I)$, we can instead construct
$$z = \mu_{z \mid x} + \sigma_{z \mid x} \odot \epsilon$$
Here the randomness is no longer a function of the encoder outputs, so gradients can flow through $\mu_{z \mid x}$ and $\sigma_{z \mid x}$.
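A sketch of the trick using the hypothetical encoder outputs above (`mu` and `logvar` standing for $\mu_{z \mid x}$ and $\log \sigma_{z \mid x}^2$):

```python
def reparameterize(mu, logvar):
    """Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives entirely in eps, so gradients flow through mu and sigma.
    """
    std = torch.exp(0.5 * logvar)    # sigma
    eps = torch.randn_like(std)      # eps ~ N(0, I), independent of the network
    return mu + std * eps
```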
VAE Loss
How do we calculate the reconstruction loss? The PDF of a multivariate Gaussian $\mathcal{N}(\mu, \Sigma)$ is
$$p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$
Therefore, the log-likelihood is
$$\log p(x) = -\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu) - \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma|$$
In our decoder, the conditional distribution is $p_\theta(x \mid z) = \mathcal{N}(\mu_{x \mid z}, \sigma^2 I)$. Therefore, maximizing the log-likelihood is equivalent to minimizing the squared error $\| x - \mu_{x \mid z} \|^2$.
The VAE loss is derived from the ELBO (a code sketch follows this list):
$$\mathcal{L} = \| x - \mu_{x \mid z} \|^2 + D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)$$
- The first term is the reconstruction loss:
  - Similar samples are close to each other.
- The second term is the latent code loss:
  - It ensures compactness and smooth interpolation.
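Putting the two terms together, a sketch of this loss using the hypothetical `VAE` and `reparameterize` defined above, with the closed-form KL for a diagonal Gaussian against the standard normal prior:

```python
def vae_loss(model, x):
    mu, logvar = model.encode(x)
    z = reparameterize(mu, logvar)
    x_hat = model.decode(z)                                    # predicted mean mu_{x|z}

    # Reconstruction loss: squared error between x and the predicted mean
    recon = ((x - x_hat) ** 2).sum(dim=1)

    # KL( N(mu, sigma^2) || N(0, I) ), closed form, summed over latent dimensions
    kl = 0.5 * (logvar.exp() + mu ** 2 - 1.0 - logvar).sum(dim=1)

    return (recon + kl).mean()
```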
Generating Data
- Only use the decoder (a minimal sketch follows this list).
- Sample $z$ from the prior $p(z) = \mathcal{N}(0, I)$.
- Then sample $\hat{x}$ from $p_\theta(x \mid z)$.
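A minimal generation sketch with the hypothetical `VAE` above, taking the predicted decoder mean as the generated sample:

```python
@torch.no_grad()
def generate(model, num_samples=16, latent_dim=32):
    z = torch.randn(num_samples, latent_dim)    # z ~ p(z) = N(0, I)
    return model.decode(z)                      # decoder mean used as the generated sample
```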
Learning Useful Representations
- Goal: learn features that capture similarities and dissimilarities.
- Requirement: an objective that defines a notion of utility.
- For autoencoders, we usually get an entangled representation, in which individual dimensions of the latent code encode some unknown combination of features in the data.
- We would like to learn features that correspond to distinct factors of variation (e.g. digit identity and style).
- Notion of utility: statistical independence can be used.
- Learning a disentangled representation can be achieved with semi-supervised learning.
  - For example, we only learn a style variable $z$ given the digit label $y$: $p_\theta(x \mid z, y)$ (a sketch follows this list).
  - Here the style variable $z$ is conditionally independent of the digit $y$.
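One common way to realize $p_\theta(x \mid z, y)$ is to condition the decoder on the label, e.g. by concatenating $z$ with a one-hot encoding of $y$; this concatenation scheme is an illustrative assumption, not something the notes prescribe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalDecoder(nn.Module):
    """Decoder for p_theta(x | z, y): z carries style, the label y carries digit identity."""
    def __init__(self, latent_dim=32, num_classes=10, input_dim=784):
        super().__init__()
        self.num_classes = num_classes
        self.net = nn.Sequential(nn.Linear(latent_dim + num_classes, 256), nn.ReLU(),
                                 nn.Linear(256, input_dim))

    def forward(self, z, y):
        y_onehot = F.one_hot(y, num_classes=self.num_classes).float()
        return self.net(torch.cat([z, y_onehot], dim=1))
```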
$\beta$-VAE
- Goal: learn a disentangled representation without supervision.
- Approach: modify the loss of the VAE by introducing an adjustable hyperparameter $\beta$ that balances latent channel capacity and independence constraints against reconstruction accuracy.
- Intuition: the KL loss pushes the approximate posterior toward a standard Gaussian, which is independent across dimensions; therefore, increasing the weight of the KL loss can make the representation more disentangled.
We can also formulate this as a constrained optimization problem. Our goal is to maximize the reconstruction likelihood while keeping the KL divergence below a threshold $\delta$:
$$\max_{\theta, \phi}\; \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] \quad \text{s.t.} \quad D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right) < \delta$$
Using a Lagrange multiplier $\beta$, we have the objective
$$\mathcal{F}(\theta, \phi, \beta) = \mathbb{E}_z\left[\log p_\theta(x \mid z)\right] - \beta\left(D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right) - \delta\right)$$
Therefore we have the weighted loss
$$\mathcal{L} = \| x - \mu_{x \mid z} \|^2 + \beta\, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)$$
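In code, this only changes the weighting of the KL term in the hypothetical `vae_loss` sketched earlier; the value of $\beta$ below is just an illustrative choice:

```python
def beta_vae_loss(model, x, beta=4.0):
    mu, logvar = model.encode(x)
    z = reparameterize(mu, logvar)
    x_hat = model.decode(z)
    recon = ((x - x_hat) ** 2).sum(dim=1)
    kl = 0.5 * (logvar.exp() + mu ** 2 - 1.0 - logvar).sum(dim=1)
    return (recon + beta * kl).mean()   # beta > 1 puts more weight on the independence pressure
```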
Summary
- Intuition of VAE
  - Criteria of a good latent space:
    - Meaningful degrees of freedom (disentangled representations).
    - Continuity for interpolation.
  - Autoencoders map each input to a single point in the latent space, leaving the latent space full of holes and lacking continuity.
  - Instead, VAEs map each data point to a smooth Gaussian distribution in the latent space.
- Theoretical background
- Intractability of VAE
- Optimizing the VAE by maximizing the ELBO: the derivation
- Encoder: given input $x$, predict the distribution of the latent variable $z$.
- Reparameterization: sampling $z$ given the predicted distribution can be done as follows:
  - Sample $\epsilon$ from the standard Gaussian $\mathcal{N}(0, I)$.
  - Construct $z$ as $z = \mu_{z \mid x} + \sigma_{z \mid x} \odot \epsilon$.
- Decoder: given the latent variable $z$, predict the distribution of the input $x$.
- Loss: the loss consists of a reconstruction loss and a KL loss.
  - Reconstruction loss: MSE loss between the input $x$ and the predicted mean of the decoder.
  - KL loss: the KL divergence between $q_\phi(z \mid x)$ and the prior $p(z)$.
- Generation:
  - Sample $z$ from the prior $p(z)$.
  - Use the decoder to sample $\hat{x}$ from $p_\theta(x \mid z)$.
- $\beta$-VAE: multiply the KL loss by a hyperparameter $\beta$ to make the representation more disentangled.