We denote the set of all observed data by $\mathbf{X}$ and the set of all latent variables by $\mathbf{Z}$. By the law of total probability, the log-likelihood of the observed data is given by

$$\ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \ln \left\{ \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \right\}$$
Note that the log-likelihood involves the logarithm of a sum, which is usually intractable.
Suppose for each observation in $\mathbf{X}$ we were told the corresponding latent variable in $\mathbf{Z}$. We call $\{\mathbf{X}, \mathbf{Z}\}$ the complete dataset, and we assume maximizing the complete-data log-likelihood $\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$ is straightforward. However, our knowledge of $\mathbf{Z}$ comes only from the posterior distribution $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})$. Therefore, we maximize the expectation of the complete-data log-likelihood instead:

$$\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}) \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$$
where we estimate the posterior using the current estimate $\boldsymbol{\theta}^{\text{old}}$ of the parameters:

$$p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}) = \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}^{\text{old}})}{\sum_{\mathbf{Z}'} p(\mathbf{X}, \mathbf{Z}' \mid \boldsymbol{\theta}^{\text{old}})}$$
In practice (see the following examples), we usually don't calculate the posterior distribution directly. The complete-data log-likelihood usually contains the latent indicator $z_{nk}$ linearly, so we can substitute it by its posterior expectation:

$$\mathbb{E}[z_{nk}] = p(z_{nk} = 1 \mid \mathbf{x}_n, \boldsymbol{\theta}^{\text{old}})$$
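Before the concrete derivations, here is a minimal sketch of the generic EM loop, assuming model-specific helpers `e_step` and `m_step` (hypothetical names; concrete versions are derived below):

```python
def em(X, theta, e_step, m_step, n_iters=100):
    """Generic EM loop: alternate posterior expectation and maximization."""
    for _ in range(n_iters):
        # E step: expectations of the latent indicators under
        # p(Z | X, theta_old), i.e. the per-sample responsibilities.
        gamma = e_step(X, theta)
        # M step: maximize the expected complete-data log-likelihood.
        theta = m_step(X, gamma)
    return theta
```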
EM for Mixture of Bernoulli Distributions
problem settings
Latent: one-hot vector $\mathbf{z} = (z_1, \ldots, z_K)^T$, with $z_k \in \{0, 1\}$ and $\sum_k z_k = 1$
Observed: binary vector $\mathbf{x} = (x_1, \ldots, x_D)^T \in \{0, 1\}^D$
Parameters: mixing coefficients $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)$ with $p(z_k = 1) = \pi_k$, and Bernoulli means $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K$ with $p(x_i = 1 \mid z_k = 1) = \mu_{ki}$
The complete-data log-likelihood is

$$\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\pi}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \sum_{i=1}^{D} \left[ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln (1 - \mu_{ki}) \right] \right\}$$
The expectation with respect to the posterior distribution is

$$\mathbb{E}_{\mathbf{Z}}\!\left[\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\pi})\right] = \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{E}[z_{nk}] \left\{ \ln \pi_k + \sum_{i=1}^{D} \left[ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln (1 - \mu_{ki}) \right] \right\}$$

where

$$\mathbb{E}[z_{nk}] = \frac{\pi_k p(\mathbf{x}_n \mid \boldsymbol{\mu}_k)}{\sum_{j=1}^{K} \pi_j p(\mathbf{x}_n \mid \boldsymbol{\mu}_j)}, \qquad p(\mathbf{x}_n \mid \boldsymbol{\mu}_k) = \prod_{i=1}^{D} \mu_{ki}^{x_{ni}} (1 - \mu_{ki})^{1 - x_{ni}}$$
Let the expectation of $z_{nk}$ evaluated with the old parameters be $\gamma_{nk}$, so that we maximize

$$\mathcal{Q} = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \left\{ \ln \pi_k + \sum_{i=1}^{D} \left[ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln (1 - \mu_{ki}) \right] \right\}$$

with constraint $\sum_{k=1}^{K} \pi_k = 1$.
Taking the partial derivative with respect to $\pi_k$, with a Lagrange multiplier $\lambda$ for the constraint, we have

$$\frac{\partial}{\partial \pi_k} \left[ \mathcal{Q} + \lambda \left( \sum_{j=1}^{K} \pi_j - 1 \right) \right] = \sum_{n=1}^{N} \frac{\gamma_{nk}}{\pi_k} + \lambda = 0$$

Multiplying by $\pi_k$ and summing over $k$ gives $\lambda = -N$. The solution is

$$\pi_k = \frac{N_k}{N}, \qquad N_k = \sum_{n=1}^{N} \gamma_{nk}$$
Similarly, taking the partial derivative with respect to $\mu_{ki}$,

$$\frac{\partial \mathcal{Q}}{\partial \mu_{ki}} = \sum_{n=1}^{N} \gamma_{nk} \left( \frac{x_{ni}}{\mu_{ki}} - \frac{1 - x_{ni}}{1 - \mu_{ki}} \right) = 0$$

The solution is

$$\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \mathbf{x}_n$$
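These updates translate directly into code. Below is a minimal NumPy sketch under the notation above; the log-space E step and the clipping of $\boldsymbol{\mu}_k$ are my own choices for numerical stability, not part of the derivation:

```python
import numpy as np

def em_bernoulli_mixture(X, K, n_iters=100, seed=0):
    """EM for a mixture of Bernoullis. X: (N, D) binary array."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                   # mixing coefficients
    mu = rng.uniform(0.25, 0.75, size=(K, D))  # Bernoulli means

    for _ in range(n_iters):
        # E step: gamma[n, k] = E[z_nk], computed in log space to
        # avoid underflow when D is large.
        log_px = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T  # (N, K)
        log_w = np.log(pi) + log_px
        log_w -= log_w.max(axis=1, keepdims=True)
        gamma = np.exp(log_w)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M step: closed-form updates derived above.
        Nk = gamma.sum(axis=0)                 # effective counts N_k
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        mu = np.clip(mu, 1e-10, 1 - 1e-10)     # keep the logs finite
    return pi, mu
```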
two-coin problem
Suppose we have two coins A and B. We choose coin A with probability $\pi$ and B with probability $1 - \pi$. The probability of getting a head is $p$ and $q$ for A and B, respectively. Evaluate $\pi$, $p$, and $q$ given results $x_1, \ldots, x_N$ ($x_i = 1$ for heads, $0$ for tails) using the EM algorithm.
Initialize with $\pi^{(0)}$, $p^{(0)}$, $q^{(0)}$.
E step: for each step $t$, calculate the posterior probability that toss $i$ came from coin A,

$$\mu_i^{(t)} = \frac{\pi^{(t)} \left(p^{(t)}\right)^{x_i} \left(1 - p^{(t)}\right)^{1 - x_i}}{\pi^{(t)} \left(p^{(t)}\right)^{x_i} \left(1 - p^{(t)}\right)^{1 - x_i} + \left(1 - \pi^{(t)}\right) \left(q^{(t)}\right)^{x_i} \left(1 - q^{(t)}\right)^{1 - x_i}}$$
M step: maximize the expected log-likelihood

$$\pi^{(t+1)} = \frac{1}{N} \sum_{i=1}^{N} \mu_i^{(t)}, \qquad p^{(t+1)} = \frac{\sum_{i=1}^{N} \mu_i^{(t)} x_i}{\sum_{i=1}^{N} \mu_i^{(t)}}, \qquad q^{(t+1)} = \frac{\sum_{i=1}^{N} \left(1 - \mu_i^{(t)}\right) x_i}{\sum_{i=1}^{N} \left(1 - \mu_i^{(t)}\right)}$$
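A short sketch of these two steps on simulated tosses (the true parameters and the initialization are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulate: pick coin A w.p. 0.6; heads w.p. 0.8 (A) or 0.3 (B).
N = 10_000
z = rng.random(N) < 0.6
x = np.where(z, rng.random(N) < 0.8, rng.random(N) < 0.3).astype(float)

pi, p, q = 0.5, 0.6, 0.4  # initialization
for _ in range(200):
    # E step: posterior probability each toss came from coin A.
    wa = pi * p**x * (1 - p)**(1 - x)
    wb = (1 - pi) * q**x * (1 - q)**(1 - x)
    mu = wa / (wa + wb)
    # M step: closed-form updates from above.
    pi = mu.mean()
    p = (mu * x).sum() / mu.sum()
    q = ((1 - mu) * x).sum() / (1 - mu).sum()

# With a single toss per trial the likelihood depends on (pi, p, q)
# only through the marginal head probability, so the fitted
# pi*p + (1-pi)*q should match x.mean() even though the individual
# parameters are not uniquely identified.
print(pi * p + (1 - pi) * q, x.mean())
```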
Summary
Per-sample weights $\gamma_{nk}$ are calculated by the expectation of $z_{nk}$ with respect to the posterior distribution.
The new estimate of the categorical distribution of the latent variable is the average of $\gamma_{nk}$ across samples:

$$\pi_k = \frac{1}{N} \sum_{n=1}^{N} \gamma_{nk}$$

The new estimate of the Bernoulli distribution of the observed variable is the weighted average of $\mathbf{x}_n$ across samples:

$$\boldsymbol{\mu}_k = \frac{\sum_{n=1}^{N} \gamma_{nk} \mathbf{x}_n}{\sum_{n=1}^{N} \gamma_{nk}}$$
EM for Mixture of Gaussians
problem settings
Latent: one-hot vector $\mathbf{z} = (z_1, \ldots, z_K)^T$ with $p(z_k = 1) = \pi_k$
Observed: $\mathbf{x} \in \mathbb{R}^D$
Parameters: mixing coefficients $\boldsymbol{\pi}$, means $\boldsymbol{\mu}_k$, and covariances $\boldsymbol{\Sigma}_k$, with $p(\mathbf{x} \mid z_k = 1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
Similar to Bernoulli mixtures, the complete-data log-likelihood takes the form

$$\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$$
The expectation with respect to the posterior distribution is

$$\mathbb{E}_{\mathbf{Z}}\!\left[\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})\right] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$$
Plugging in $\gamma_{nk}$ and taking partial derivatives with the constraint $\sum_k \pi_k = 1$, we have

$$\pi_k = \frac{N_k}{N}, \qquad \boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \mathbf{x}_n, \qquad \boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^T$$

where $N_k = \sum_{n=1}^{N} \gamma_{nk}$.
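A minimal NumPy/SciPy sketch of these updates, assuming full covariance matrices; the small ridge added to each covariance is my own choice to keep the matrices invertible:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_mixture(X, K, n_iters=100, seed=0):
    """EM for a mixture of Gaussians. X: (N, D) real-valued array."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]          # means from data
    sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)

    for _ in range(n_iters):
        # E step: responsibilities gamma[n, k].
        dens = np.stack([multivariate_normal.pdf(X, mu[k], sigma[k])
                         for k in range(K)], axis=1)      # (N, K)
        gamma = pi * dens
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M step: closed-form updates derived above.
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
            sigma[k] += 1e-6 * np.eye(D)                  # keep invertible
    return pi, mu, sigma
```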