Variational Autoencoders

Autoencoders

What is a good latent space?
  • A good latent space should represent the data using meaningful degrees of freedom.
  • It should have continuity for interpolation (e.g. a vector to control the degree of smile).

Formulation

![[Pasted image 20240503001351.png|500]]

$$
\hat{\theta}_f, \hat{\theta}_g = \arg\min_{\theta_f, \theta_g} \sum_{n=1}^{N} \underbrace{\big\| x_n - \underbrace{g(f(x_n))}_{\hat{x}_n} \big\|_2^2}_{\mathcal{L}(x_n, \hat{x}_n)}.
$$
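A minimal PyTorch sketch of this objective (layer widths and the 784-dimensional input are illustrative assumptions, not from the notes):

```python
import torch
import torch.nn as nn

# Minimal autoencoder sketch: f is the encoder, g is the decoder.
class Autoencoder(nn.Module):
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                               nn.Linear(256, d_latent))   # encoder f
        self.g = nn.Sequential(nn.Linear(d_latent, 256), nn.ReLU(),
                               nn.Linear(256, d_in))       # decoder g

    def forward(self, x):
        return self.g(self.f(x))                           # x_hat = g(f(x))

model = Autoencoder()
x = torch.rand(16, 784)                        # a dummy batch
x_hat = model(x)
loss = ((x - x_hat) ** 2).sum(dim=1).mean()    # squared l2 reconstruction loss
loss.backward()
```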

Dimensionality of Hidden Layer

example: inpainting autoencoders

![[Pasted image 20240503002410.png|500]]

Limitations of Autoencoders

Variational Autoencoders (VAE)

$$
f(x) \sim \mathcal{N}(\hat{\mu}, \hat{\Sigma}).
$$

Modeling

Training

$$
p_\theta(x) = \int_z \underbrace{p_\theta(x \mid z)}_{\text{DNN}} \, \underbrace{p_\theta(z)}_{\text{Gaussian}} \, dz, \qquad p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p_\theta(z)}{p_\theta(x)}.
$$
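A minimal sketch of this factorization in PyTorch (the decoder here is a stub linear layer only so the snippet runs; in practice it is the trained DNN $\mu_\theta$):

```python
import torch

# Hypothetical decoder standing in for the DNN that parameterizes p_theta(x | z).
d_latent, d_x = 32, 784
decoder = torch.nn.Linear(d_latent, d_x)

z = torch.randn(8, d_latent)              # z ~ p_theta(z) = N(0, I), the Gaussian prior
x_mean = decoder(z)                       # mean of p_theta(x | z), predicted by the DNN
x = x_mean + torch.randn_like(x_mean)     # x ~ N(x_mean, I); often only the mean is kept
```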

Data Log-Likelihood

Since $p_\theta(x^{(i)})$ does not depend on $z$, we can write the log-likelihood as an expectation over $z \sim q_\phi(z \mid x^{(i)})$:

$$
\log p_\theta(x^{(i)}) = \mathbb{E}_{z \sim q_\phi(z \mid x^{(i)})}\!\left[\log p_\theta(x^{(i)})\right].
$$

Using Bayes' rule

$$
\begin{aligned}
\mathbb{E}_{z \sim q_\phi(z \mid x^{(i)})}\!\left[\log p_\theta(x^{(i)})\right]
&= \mathbb{E}_z\!\left[\log \frac{p_\theta(x^{(i)} \mid z)\, p_\theta(z)}{p_\theta(z \mid x^{(i)})}\right] \\
&= \mathbb{E}_z\!\left[\log \frac{p_\theta(x^{(i)} \mid z)\, p_\theta(z)}{p_\theta(z \mid x^{(i)})} \cdot \frac{q_\phi(z \mid x^{(i)})}{q_\phi(z \mid x^{(i)})}\right] \\
&= \mathbb{E}_z\!\left[\log p_\theta(x^{(i)} \mid z)\right] + \mathbb{E}_z\!\left[\log \frac{q_\phi(z \mid x^{(i)})}{p_\theta(z \mid x^{(i)})}\right] - \mathbb{E}_z\!\left[\log \frac{q_\phi(z \mid x^{(i)})}{p_\theta(z)}\right] \\
&= \mathbb{E}_z\!\left[\log p_\theta(x^{(i)} \mid z)\right] + \underbrace{D_{KL}\!\left(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z \mid x^{(i)})\right)}_{\ge 0} - D_{KL}\!\left(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)\right) \\
&\ge \underbrace{\mathbb{E}_z\!\left[\log p_\theta(x^{(i)} \mid z)\right] - D_{KL}\!\left(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)\right)}_{\mathcal{L}(x^{(i)},\, \theta,\, \phi)},
\end{aligned}
$$

$$
\theta^*, \phi^* = \arg\max_{\theta, \phi} \sum_{i=1}^{N} \mathcal{L}(x^{(i)}, \theta, \phi).
$$

Forward Pass

$$
z \sim \mathcal{N}\!\left(\mu_\phi(x), \sigma_\phi^2(x)\, I\right), \qquad x \sim \mathcal{N}\!\left(\mu_\theta(z), I\right).
$$
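A hedged PyTorch sketch of this forward pass (network widths and input size are illustrative assumptions):

```python
import torch
import torch.nn as nn

d_in, d_latent = 784, 32   # illustrative sizes, not specified in the notes

# Encoder for q_phi(z | x): predicts mu_phi(x) and log sigma_phi^2(x)
encoder = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                        nn.Linear(256, 2 * d_latent))
# Decoder for p_theta(x | z): predicts mu_theta(z); covariance is fixed to I
decoder = nn.Sequential(nn.Linear(d_latent, 256), nn.ReLU(),
                        nn.Linear(256, d_in))

x = torch.rand(16, d_in)
mu, log_var = encoder(x).chunk(2, dim=1)   # parameters of N(mu_phi(x), sigma_phi^2(x) I)
std = torch.exp(0.5 * log_var)
z = mu + std * torch.randn_like(std)       # sampled z (reparameterized; see next section)
x_hat = decoder(z)                         # mean of N(mu_theta(z), I)
```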

Reparameterization Trick

The reparameterization trick lets us compute gradients despite the randomness of sampling. When sampling from the posterior distribution $q_\phi(z \mid x)$, we first sample a random variable $\epsilon \sim \mathcal{N}(0, I)$, then set

$$
z = \mu_\phi(x) + \sigma_\phi(x) \cdot \epsilon.
$$

This is equivalent to sampling from a normal distribution with mean $\mu$ and standard deviation $\sigma$,

$$
z \sim \mathcal{N}(\mu, \sigma^2),
$$

but here the randomness is no longer a function of $\mu$ or $\sigma$, so we can take derivatives with respect to $\mu$ and $\sigma$.
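A small sketch of why this works in an autodiff framework: because $\epsilon$ is sampled separately, $z$ stays a differentiable function of $\mu$ and $\sigma$ (the specific values below are arbitrary):

```python
import torch

mu = torch.tensor(0.5, requires_grad=True)
sigma = torch.tensor(2.0, requires_grad=True)

eps = torch.randn(())        # epsilon ~ N(0, 1): all the randomness lives here
z = mu + sigma * eps         # z ~ N(mu, sigma^2), but differentiable in mu and sigma

z.backward()
print(mu.grad)               # dz/dmu = 1
print(sigma.grad)            # dz/dsigma = eps
```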

VAE Loss

How do we calculate this reconstruction loss?

The PDF of a multivariate Gaussian is
$$
p(x) = (2\pi)^{-k/2} \det(\Sigma)^{-1/2} \exp\!\left(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right).
$$
Therefore, the log-likelihood is
$$
\log p(x) = -\tfrac{1}{2}\log\det(\Sigma) - \tfrac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu) + C.
$$
In our decoder, the conditional distribution is $\mathcal{N}(\mu_\theta(z), I)$, where the mean is directly predicted by the model, so
$$
\log p(x \mid z) = -\tfrac{1}{2}\big\| x - \mu_\theta(z)\big\|_2^2 + C = -\tfrac{1}{2}\big\| x - \hat{x}\big\|_2^2 + C.
$$
Therefore, maximizing the log-likelihood is equivalent to minimizing the $\ell_2$ distance between the input and the reconstruction.

The VAE loss is the negative of the ELBO derived above:

$$
\mathcal{L}_{\text{VAE}} = -\,\mathbb{E}_z\!\left[\log p_\theta(x^{(i)} \mid z)\right] + D_{KL}\!\left(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)\right).
$$
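A sketch of this loss for a diagonal-Gaussian $q_\phi$ and a standard-normal prior, using the standard closed-form KL term; the tensors `x`, `x_hat`, `mu`, `log_var` are assumed to match the forward-pass sketch above:

```python
import torch

def vae_loss(x, x_hat, mu, log_var):
    # -E_z[log p_theta(x | z)] reduces (up to a constant) to 0.5 * ||x - x_hat||_2^2
    # when p_theta(x | z) = N(mu_theta(z), I).
    recon = 0.5 * ((x - x_hat) ** 2).sum(dim=1)
    # Closed-form KL(N(mu, sigma^2 I) || N(0, I)) = 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)
    kl = 0.5 * (mu ** 2 + log_var.exp() - log_var - 1).sum(dim=1)
    return (recon + kl).mean()
```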

Generating Data

Learning Useful Representations

β-VAE

We can also formulate the β-VAE as a constrained optimization problem: our goal is to maximize the reconstruction likelihood while keeping the KL divergence below a threshold $\delta$:

$$
\max_{\theta, \phi}\; \mathbb{E}_{x \sim D}\!\left[\mathbb{E}_{z \sim q_\phi(z \mid x)} \log p_\theta(x \mid z)\right], \quad \text{subject to } D_{KL}\!\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right) < \delta.
$$

Using a Lagrange multiplier $\beta \ge 0$, we obtain the objective $\mathcal{F}(\theta, \phi, \beta)$ to maximize:

$$
\begin{aligned}
\mathcal{F}(\theta, \phi, \beta)
&= \mathbb{E}_{z \sim q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \beta\left(D_{KL}\!\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right) - \delta\right) \\
&= \mathbb{E}_{z \sim q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \beta\, D_{KL}\!\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right) + \beta\delta \\
&\ge \mathbb{E}_{z \sim q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \beta\, D_{KL}\!\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right).
\end{aligned}
$$

Therefore we have the weighted loss

$$
\mathcal{L}_{\beta}(\theta, \phi, \beta) = -\,\mathbb{E}_{z \sim q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] + \beta\, D_{KL}\!\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right).
$$
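In code, this only changes the weighting of the KL term relative to the VAE loss sketch above (the default `beta=4.0` is purely illustrative; $\beta = 1$ recovers the standard VAE):

```python
import torch

def beta_vae_loss(x, x_hat, mu, log_var, beta=4.0):
    # Same terms as the VAE loss above, with the KL divergence weighted by beta.
    recon = 0.5 * ((x - x_hat) ** 2).sum(dim=1)
    kl = 0.5 * (mu ** 2 + log_var.exp() - log_var - 1).sum(dim=1)
    return (recon + beta * kl).mean()
```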

Summary