Expectation-Maximization Algorithm

The General EM Algorithm

We denote the set of all observed data by $X$ and the set of all latent variables by $Z$. By the law of total probability, the log-likelihood of the observed data is given by

$$\log p(X; \theta) = \log \int_Z p(X, Z; \theta)\, \mathrm{d}Z.$$

Note that this log-likelihood involves the logarithm of an integral (or sum) over the latent variables, which is usually intractable to optimize directly.
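For example (anticipating the mixture models below), when the latent variable attached to each observation is a discrete component label, the marginal log-likelihood is a sum of logarithms of sums,

$$\log p(X; \theta) = \sum_{n=1}^N \log \sum_{k=1}^K \pi_k\, p(x_n \mid z_{nk} = 1; \theta),$$

and the logarithm cannot be moved inside the inner sum, so the component parameters do not decouple.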

Suppose that for each observation in $X$ we were told the value of the corresponding latent variable in $Z$. We call $\{X, Z\}$ the complete dataset, and we assume that maximizing the complete-data log-likelihood is straightforward. However, our knowledge of $Z$ comes only from the posterior distribution $p(Z \mid X; \theta)$. Therefore, we instead consider the expectation of the complete-data log-likelihood:

$$\mathbb{E}_{Z \mid X; \theta}\left[\log p(X, Z; \theta)\right] = \int_Z p(Z \mid X; \theta) \log p(X, Z; \theta)\, \mathrm{d}Z,$$

where we approximate the posterior using the current estimate of the parameters:

$$p(Z \mid X; \theta) \approx p(Z \mid X; \theta^{\text{old}}).$$

The expected log-likelihood becomes

$$Q(\theta, \theta^{\text{old}}) = \int_Z p(Z \mid X; \theta^{\text{old}}) \log p(X, Z; \theta)\, \mathrm{d}Z.$$

We can also derive the above using the ELBO:

$$
\begin{aligned}
\log p(X; \theta) &= \log \int_Z p(X, Z; \theta)\, \mathrm{d}Z = \log \int_Z p(Z \mid X; \theta^{\text{old}})\, \frac{p(X, Z; \theta)}{p(Z \mid X; \theta^{\text{old}})}\, \mathrm{d}Z \\
&\geq \int_Z p(Z \mid X; \theta^{\text{old}}) \log p(X, Z; \theta)\, \mathrm{d}Z - \int_Z p(Z \mid X; \theta^{\text{old}}) \log p(Z \mid X; \theta^{\text{old}})\, \mathrm{d}Z \\
&= Q(\theta, \theta^{\text{old}}) + H(Z \mid X; \theta^{\text{old}}).
\end{aligned}
$$

The inequality is Jensen's inequality, and the right-hand side is the evidence lower bound (ELBO). Since the entropy term $H(Z \mid X; \theta^{\text{old}})$ does not depend on $\theta$, maximizing the ELBO over $\theta$ is equivalent to maximizing $Q(\theta, \theta^{\text{old}})$.

Summary

The complete algorithm is:

  1. Choose an initial setting for the parameters $\theta^{(0)}$;
  2. The E step: evaluate the posterior $p(Z \mid X; \theta^{(t-1)})$;
  3. The M step: evaluate $\theta^{(t)}$ given by $\theta^{(t)} = \arg\max_\theta Q(\theta, \theta^{(t-1)})$;
  4. Check for convergence of the log-likelihood or the parameters; if the criterion is not met, return to step 2.

In practice (see the following examples), we usually do not compute the full posterior distribution explicitly. For the models considered here, the complete-data likelihood $p(X, Z; \theta) = \prod_{i=1}^N p(x_i \mid z_i; \theta)\, p(z_i; \theta)$ gives a log-likelihood that is linear in the latent indicators $z_i$, so in $Q$ each $z_i$ can simply be replaced by its posterior expectation $\mathbb{E}[z_i \mid x_i; \theta^{\text{old}}]$.
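As a concrete reference point, here is a minimal Python sketch of the generic EM loop; the callables `e_step` and `m_step` are placeholders introduced here (not part of the text above) and must be supplied for the specific model, as in the examples below.

```python
import numpy as np

def em(X, theta0, e_step, m_step, max_iter=100, tol=1e-8):
    """Generic EM loop (sketch).

    e_step(X, theta) -> (responsibilities, observed-data log-likelihood)
    m_step(X, responsibilities) -> new theta maximizing Q(theta, theta_old)
    """
    theta, prev_ll = theta0, -np.inf
    for _ in range(max_iter):
        resp, ll = e_step(X, theta)   # E step: evaluate p(Z | X; theta_old)
        theta = m_step(X, resp)       # M step: argmax_theta Q(theta, theta_old)
        if ll - prev_ll < tol:        # stop once the log-likelihood plateaus
            break
        prev_ll = ll
    return theta
```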

EM for Mixture of Bernoulli Distributions

Problem settings
  • Latent: one-hot vector $z \sim \mathrm{Cat}(\pi_1, \dots, \pi_K)$
  • Observed: $x \mid z_k = 1 \sim \mathrm{Bern}(\mu_k)$
  • Parameters: $\pi_1, \dots, \pi_K, \mu_1, \dots, \mu_K$

The complete-data log-likelihood is

$$
\begin{aligned}
\log p(X, Z; \pi, \mu) &= \sum_{n=1}^N \left[\log p(x_n \mid z_n; \pi, \mu) + \log p(z_n; \pi, \mu)\right] \\
&= \sum_{n=1}^N \sum_{k=1}^K z_{nk} \left[\log \pi_k + x_n \log \mu_k + (1 - x_n) \log (1 - \mu_k)\right].
\end{aligned}
$$

The expectation with respect to the posterior distribution is

$$\mathbb{E}_{Z}\left[\log p(X, Z; \pi, \mu) \mid X; \pi^{\text{old}}, \mu^{\text{old}}\right] = \sum_{n=1}^N \sum_{k=1}^K \mathbb{E}_{Z}\left[z_{nk} \mid X; \pi^{\text{old}}, \mu^{\text{old}}\right] \left[\log \pi_k + x_n \log \mu_k + (1 - x_n) \log(1 - \mu_k)\right],$$

where

$$\mathbb{E}_{Z}\left[z_{nk} \mid X; \pi, \mu\right] = \mathbb{E}\left[z_{nk} \mid x_n; \pi, \mu\right] = \Pr\left[z_{nk} = 1 \mid x_n; \pi, \mu\right] = \frac{\pi_k\, \mu_k^{x_n} (1 - \mu_k)^{1 - x_n}}{\sum_{j=1}^K \pi_j\, \mu_j^{x_n} (1 - \mu_j)^{1 - x_n}}.$$

Let the expectation of $z_{nk}$ evaluated with the old parameters be denoted by $\gamma(z_{nk}, \pi^{\text{old}}, \mu^{\text{old}})$, so that

$$Q = \sum_{n=1}^N \sum_{k=1}^K \gamma(z_{nk}, \pi^{\text{old}}, \mu^{\text{old}}) \left[\log \pi_k + x_n \log \mu_k + (1 - x_n) \log(1 - \mu_k)\right],$$

with the constraint

$$\sum_{k} \pi_k = 1.$$

Introducing a Lagrange multiplier $\lambda$ and taking the partial derivative with respect to $\pi_k$, we have

$$\frac{\partial}{\partial \pi_k}\left[Q + \lambda\Big(1 - \sum_{k} \pi_k\Big)\right] = \sum_{n=1}^N \frac{\gamma(z_{nk}, \pi^{\text{old}}, \mu^{\text{old}})}{\pi_k} - \lambda = 0.$$

Multiplying by $\pi_k$, summing over $k$, and using the constraint gives $\lambda = N$, so the solution is

$$\pi_k = \frac{1}{N} \sum_{n=1}^N \gamma(z_{nk}, \pi^{\text{old}}, \mu^{\text{old}}).$$

Similarly,

$$\frac{\partial Q}{\partial \mu_k} = \sum_{n=1}^N \gamma(z_{nk}, \pi^{\text{old}}, \mu^{\text{old}})\, \frac{x_n - \mu_k}{\mu_k (1 - \mu_k)} = 0.$$

The solution is

$$\mu_k = \frac{\sum_{n=1}^N \gamma(z_{nk}, \pi^{\text{old}}, \mu^{\text{old}})\, x_n}{\sum_{n=1}^N \gamma(z_{nk}, \pi^{\text{old}}, \mu^{\text{old}})}.$$
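The updates above translate directly into code. Below is a minimal NumPy sketch for scalar binary observations; the function name, the random initialization, and the fixed iteration count are my own choices rather than part of the derivation.

```python
import numpy as np

def em_bernoulli_mixture(x, K, n_iter=100, seed=0):
    """EM for a mixture of K Bernoulli distributions; x is a binary array of shape (N,)."""
    rng = np.random.default_rng(seed)
    N = x.shape[0]
    pi = np.full(K, 1.0 / K)               # mixing proportions pi_k
    mu = rng.uniform(0.25, 0.75, size=K)   # Bernoulli means mu_k

    for _ in range(n_iter):
        # E step: gamma[n, k] = E[z_nk | x_n; old parameters]
        lik = mu[None, :] ** x[:, None] * (1 - mu[None, :]) ** (1 - x[:, None])
        gamma = pi[None, :] * lik
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M step: the closed-form updates derived above
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = gamma.T @ x / Nk
    return pi, mu
```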
Two-coin problem

Suppose we have two coins A and B. We choose coin A with probability $\pi$ and coin B with probability $1 - \pi$. The probability of getting a head is $p$ for A and $q$ for B. We want to estimate $p$, $q$, and $\pi$ from the observed results $x_1, \dots, x_N$ using the EM algorithm.

  1. Initialize $\pi, p, q$ with $\pi^{(0)}, p^{(0)}, q^{(0)}$;
  2. E step: at iteration $t$, calculate
    $$\gamma_n^{(t)} = \frac{\pi^{(t-1)} \left(p^{(t-1)}\right)^{x_n} \left(1 - p^{(t-1)}\right)^{1 - x_n}}{\pi^{(t-1)} \left(p^{(t-1)}\right)^{x_n} \left(1 - p^{(t-1)}\right)^{1 - x_n} + \left(1 - \pi^{(t-1)}\right) \left(q^{(t-1)}\right)^{x_n} \left(1 - q^{(t-1)}\right)^{1 - x_n}};$$
  3. M step: maximize the expected log-likelihood, which gives (see the sketch after these steps)
    $$\pi^{(t)} = \frac{1}{N} \sum_{n=1}^N \gamma_n^{(t)}, \qquad p^{(t)} = \frac{\sum_{n=1}^N \gamma_n^{(t)} x_n}{\sum_{n=1}^N \gamma_n^{(t)}}, \qquad q^{(t)} = \frac{\sum_{n=1}^N \left(1 - \gamma_n^{(t)}\right) x_n}{\sum_{n=1}^N \left(1 - \gamma_n^{(t)}\right)}.$$
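A self-contained Python sketch of this recursion; the default initial values are arbitrary choices of mine.

```python
import numpy as np

def em_two_coins(x, pi0=0.5, p0=0.6, q0=0.4, n_iter=50):
    """EM for the two-coin model; x[n] = 1 for heads, 0 for tails."""
    pi, p, q = pi0, p0, q0
    for _ in range(n_iter):
        # E step: posterior probability that toss n came from coin A
        a = pi * p ** x * (1 - p) ** (1 - x)
        b = (1 - pi) * q ** x * (1 - q) ** (1 - x)
        gamma = a / (a + b)

        # M step: the closed-form updates above
        pi = gamma.mean()
        p = (gamma * x).sum() / gamma.sum()
        q = ((1 - gamma) * x).sum() / (1 - gamma).sum()
    return pi, p, q

# Example usage with made-up tosses:
# x = np.array([1, 1, 0, 1, 0, 0, 1, 1, 1, 0])
# pi, p, q = em_two_coins(x)
```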

Summary

$$\gamma_{nk} = \mathbb{E}\left[z_{nk} \mid x_n; \theta^{\text{old}}\right], \qquad \pi_k^{\text{new}} = \frac{1}{N} \sum_{n=1}^N \gamma_{nk}, \qquad \mu_k^{\text{new}} = \frac{\sum_{n=1}^N \gamma_{nk} x_n}{\sum_{n=1}^N \gamma_{nk}}.$$

EM for Mixture of Gaussians

Problem settings
  • Latent: one-hot vector $z \sim \mathrm{Cat}(\pi_1, \dots, \pi_K)$
  • Observed: $x \mid z_k = 1 \sim \mathcal{N}(\mu_k, \Sigma_k)$
  • Parameters: $\theta = \{\pi_1, \dots, \pi_K, \mu_1, \dots, \mu_K, \Sigma_1, \dots, \Sigma_K\}$

As in the Bernoulli mixture case, the complete-data log-likelihood takes the form

$$\log p(X, Z; \theta) = \sum_{n=1}^N \sum_{k=1}^K z_{nk} \left[\log \pi_k + \log \mathcal{N}(x_n; \mu_k, \Sigma_k)\right].$$

The expected $z_{nk}$ with respect to the posterior distribution is

$$\mathbb{E}\left[z_{nk} \mid x_n; \theta\right] = \frac{\pi_k\, \mathcal{N}(x_n; \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j\, \mathcal{N}(x_n; \mu_j, \Sigma_j)} = \gamma(z_{nk}, \theta).$$

Plugging in $\gamma(z_{nk}, \theta^{\text{old}})$ and taking partial derivatives subject to the constraint $\sum_k \pi_k = 1$, as in the Bernoulli case, we obtain

$$\pi_k = \frac{1}{N} \sum_{n=1}^N \gamma(z_{nk}, \theta^{\text{old}}), \qquad \mu_k = \frac{\sum_{n=1}^N \gamma(z_{nk}, \theta^{\text{old}})\, x_n}{\sum_{n=1}^N \gamma(z_{nk}, \theta^{\text{old}})}, \qquad \Sigma_k = \frac{\sum_{n=1}^N \gamma(z_{nk}, \theta^{\text{old}})\, (x_n - \mu_k)(x_n - \mu_k)^{\mathsf{T}}}{\sum_{n=1}^N \gamma(z_{nk}, \theta^{\text{old}})}.$$
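These updates can be implemented in a few lines of NumPy/SciPy. In the sketch below, the initialization scheme and the small diagonal ridge added to the covariances for numerical stability are my own assumptions, not part of the derivation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a mixture of K Gaussians; X has shape (N, D)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]                  # means at random data points
    Sigma = np.stack([np.cov(X, rowvar=False) + 1e-6 * np.eye(D)] * K)

    for _ in range(n_iter):
        # E step: responsibilities gamma[n, k] = E[z_nk | x_n; old parameters]
        gamma = np.stack(
            [pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)],
            axis=1,
        )
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M step: the closed-form updates derived above
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = gamma.T @ X / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma
```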