Expectation-Maximization Algorithm

The General EM Algorithm

We denote the set of all observed data by $X$ and the set of all latent variables by $Z$. By the law of total probability, the log-likelihood of the observed data is given by

$$\log p(X; \theta) = \log \int_Z p(X, Z; \theta)\, \mathrm{d}Z.$$

Note that this log-likelihood involves the logarithm of an integral (or sum) over the latent variables, which is usually intractable to optimize directly.
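For example (anticipating the mixture models below), when the latent variable attached to each observation is a discrete component label, the marginal log-likelihood is a sum of logarithms of sums,

$$\log p(X; \theta) = \sum_{n=1}^N \log \sum_{k=1}^K \pi_k\, p(x_n \mid z_{nk} = 1; \theta),$$

and the logarithm cannot be moved inside the inner sum, so the component parameters do not decouple.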

Suppose that for each observation in $X$ we were told the value of the corresponding latent variable in $Z$. We call $\{X, Z\}$ the complete dataset, and we assume that maximizing the complete-data log-likelihood is straightforward. However, our knowledge of $Z$ comes only from the posterior distribution $p(Z \mid X; \theta)$. Therefore, we instead consider the expectation of the complete-data log-likelihood:

$$\mathbb{E}_{Z \mid X; \theta}\left[\log p(X, Z; \theta)\right] = \int_Z p(Z \mid X; \theta) \log p(X, Z; \theta)\, \mathrm{d}Z,$$

where we approximate the posterior using the current estimate of the parameters:

$$p(Z \mid X; \theta) \approx p(Z \mid X; \theta^{\text{old}}).$$

The expected log-likelihood becomes

$$Q(\theta, \theta^{\text{old}}) = \int_Z p(Z \mid X; \theta^{\text{old}}) \log p(X, Z; \theta)\, \mathrm{d}Z.$$

We can also derive the above using the ELBO:

$$
\begin{aligned}
\log p(X; \theta) &= \log \int_Z p(X, Z; \theta)\, \mathrm{d}Z = \log \int_Z p(Z \mid X; \theta^{\text{old}})\, \frac{p(X, Z; \theta)}{p(Z \mid X; \theta^{\text{old}})}\, \mathrm{d}Z \\
&\geq \int_Z p(Z \mid X; \theta^{\text{old}}) \log p(X, Z; \theta)\, \mathrm{d}Z - \int_Z p(Z \mid X; \theta^{\text{old}}) \log p(Z \mid X; \theta^{\text{old}})\, \mathrm{d}Z \\
&= Q(\theta, \theta^{\text{old}}) + H(Z \mid X; \theta^{\text{old}}).
\end{aligned}
$$

The inequality is Jensen's inequality, and the right-hand side is the evidence lower bound (ELBO). Since the entropy term $H(Z \mid X; \theta^{\text{old}})$ does not depend on $\theta$, maximizing the ELBO over $\theta$ is equivalent to maximizing $Q(\theta, \theta^{\text{old}})$.

Summary

The complete algorithm is:

  1. Choose an initial setting for the parameters $\theta^{(0)}$;
  2. The E step: evaluate the posterior $p(Z \mid X; \theta^{(t-1)})$;
  3. The M step: evaluate $\theta^{(t)}$ given by $\theta^{(t)} = \arg\max_\theta Q(\theta, \theta^{(t-1)})$;
  4. Check for convergence of the log-likelihood or the parameters; if the criterion is not met, return to step 2.

In practice (see the following examples), we usually do not compute the full posterior distribution explicitly. For the models considered here, the complete-data likelihood $p(X, Z; \theta) = \prod_{i=1}^N p(x_i \mid z_i; \theta)\, p(z_i; \theta)$ gives a log-likelihood that is linear in the latent indicators $z_i$, so in $Q$ each $z_i$ can simply be replaced by its posterior expectation $\mathbb{E}[z_i \mid x_i; \theta^{\text{old}}]$.
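As a concrete reference point, here is a minimal Python sketch of the generic EM loop; the callables `e_step` and `m_step` are placeholders introduced here (not part of the text above) and must be supplied for the specific model, as in the examples below.

```python
import numpy as np

def em(X, theta0, e_step, m_step, max_iter=100, tol=1e-8):
    """Generic EM loop (sketch).

    e_step(X, theta) -> (responsibilities, observed-data log-likelihood)
    m_step(X, responsibilities) -> new theta maximizing Q(theta, theta_old)
    """
    theta, prev_ll = theta0, -np.inf
    for _ in range(max_iter):
        resp, ll = e_step(X, theta)   # E step: evaluate p(Z | X; theta_old)
        theta = m_step(X, resp)       # M step: argmax_theta Q(theta, theta_old)
        if ll - prev_ll < tol:        # stop once the log-likelihood plateaus
            break
        prev_ll = ll
    return theta
```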

EM for Mixture of Bernoulli Distributions

Problem settings
  • Latent: one-hot vector $z \sim \mathrm{Cat}(\pi_1, \dots, \pi_K)$
  • Observed: $x \mid z_k = 1 \sim \mathrm{Bern}(\mu_k)$
  • Parameters: $\pi_1, \dots, \pi_K, \mu_1, \dots, \mu_K$

The complete-data log-likelihood is

$$
\begin{aligned}
\log p(X, Z; \pi, \mu) &= \sum_{n=1}^N \left[\log p(x_n \mid z_n; \pi, \mu) + \log p(z_n; \pi, \mu)\right] \\
&= \sum_{n=1}^N \sum_{k=1}^K z_{nk} \left[\log \pi_k + x_n \log \mu_k + (1 - x_n) \log (1 - \mu_k)\right].
\end{aligned}
$$

The expectation with respect to the posterior distribution is

$$\mathbb{E}_{Z}\left[\log p(X, Z; \pi, \mu) \mid X; \pi^{\text{old}}, \mu^{\text{old}}\right] = \sum_{n=1}^N \sum_{k=1}^K \mathbb{E}_{Z}\left[z_{nk} \mid X; \pi^{\text{old}}, \mu^{\text{old}}\right] \left[\log \pi_k + x_n \log \mu_k + (1 - x_n) \log(1 - \mu_k)\right],$$

where

$$\mathbb{E}_{Z}\left[z_{nk} \mid X; \pi, \mu\right] = \mathbb{E}\left[z_{nk} \mid x_n; \pi, \mu\right] = \Pr\left[z_{nk} = 1 \mid x_n; \pi, \mu\right] = \frac{\pi_k\, \mu_k^{x_n} (1 - \mu_k)^{1 - x_n}}{\sum_{j=1}^K \pi_j\, \mu_j^{x_n} (1 - \mu_j)^{1 - x_n}}.$$

Let the expectation of $z_{nk}$ evaluated with the old parameters be denoted by $\gamma(z_{nk}, \pi^{\text{old}}, \mu^{\text{old}})$, so that

$$Q = \sum_{n=1}^N \sum_{k=1}^K \gamma(z_{nk}, \pi^{\text{old}}, \mu^{\text{old}}) \left[\log \pi_k + x_n \log \mu_k + (1 - x_n) \log(1 - \mu_k)\right],$$

with the constraint

$$\sum_{k} \pi_k = 1.$$

Introducing a Lagrange multiplier $\lambda$ and taking the partial derivative with respect to $\pi_k$, we have

$$\frac{\partial}{\partial \pi_k}\left[Q + \lambda\Big(1 - \sum_{k} \pi_k\Big)\right] = \sum_{n=1}^N \frac{\gamma(z_{nk}, \pi^{\text{old}}, \mu^{\text{old}})}{\pi_k} - \lambda = 0.$$

Multiplying by $\pi_k$, summing over $k$, and using the constraint gives $\lambda = N$, so the solution is

$$\pi_k = \frac{1}{N} \sum_{n=1}^N \gamma(z_{nk}, \pi^{\text{old}}, \mu^{\text{old}}).$$

Similarly,

$$\frac{\partial Q}{\partial \mu_k} = \sum_{n=1}^N \gamma(z_{nk}, \pi^{\text{old}}, \mu^{\text{old}})\, \frac{x_n - \mu_k}{\mu_k (1 - \mu_k)} = 0.$$

The solution is

$$\mu_k = \frac{\sum_{n=1}^N \gamma(z_{nk}, \pi^{\text{old}}, \mu^{\text{old}})\, x_n}{\sum_{n=1}^N \gamma(z_{nk}, \pi^{\text{old}}, \mu^{\text{old}})}.$$
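The updates above translate directly into code. Below is a minimal NumPy sketch for scalar binary observations; the function name, the random initialization, and the fixed iteration count are my own choices rather than part of the derivation.

```python
import numpy as np

def em_bernoulli_mixture(x, K, n_iter=100, seed=0):
    """EM for a mixture of K Bernoulli distributions; x is a binary array of shape (N,)."""
    rng = np.random.default_rng(seed)
    N = x.shape[0]
    pi = np.full(K, 1.0 / K)               # mixing proportions pi_k
    mu = rng.uniform(0.25, 0.75, size=K)   # Bernoulli means mu_k

    for _ in range(n_iter):
        # E step: gamma[n, k] = E[z_nk | x_n; old parameters]
        lik = mu[None, :] ** x[:, None] * (1 - mu[None, :]) ** (1 - x[:, None])
        gamma = pi[None, :] * lik
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M step: the closed-form updates derived above
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = gamma.T @ x / Nk
    return pi, mu
```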
Two-coin problem

Suppose we have two coins A and B. We choose coin A with probability $\pi$ and coin B with probability $1 - \pi$. The probability of getting a head is $p$ for A and $q$ for B. We want to estimate $p$, $q$, and $\pi$ from the observed results $x_1, \dots, x_N$ using the EM algorithm.

  1. Initialize $\pi, p, q$ with $\pi^{(0)}, p^{(0)}, q^{(0)}$;
  2. E step: at iteration $t$, calculate
    $$\gamma_n^{(t)} = \frac{\pi^{(t-1)} \left(p^{(t-1)}\right)^{x_n} \left(1 - p^{(t-1)}\right)^{1 - x_n}}{\pi^{(t-1)} \left(p^{(t-1)}\right)^{x_n} \left(1 - p^{(t-1)}\right)^{1 - x_n} + \left(1 - \pi^{(t-1)}\right) \left(q^{(t-1)}\right)^{x_n} \left(1 - q^{(t-1)}\right)^{1 - x_n}};$$
  3. M step: maximize the expected log-likelihood, which gives (see the sketch after these steps)
    $$\pi^{(t)} = \frac{1}{N} \sum_{n=1}^N \gamma_n^{(t)}, \qquad p^{(t)} = \frac{\sum_{n=1}^N \gamma_n^{(t)} x_n}{\sum_{n=1}^N \gamma_n^{(t)}}, \qquad q^{(t)} = \frac{\sum_{n=1}^N \left(1 - \gamma_n^{(t)}\right) x_n}{\sum_{n=1}^N \left(1 - \gamma_n^{(t)}\right)}.$$
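A self-contained Python sketch of this recursion; the default initial values are arbitrary choices of mine.

```python
import numpy as np

def em_two_coins(x, pi0=0.5, p0=0.6, q0=0.4, n_iter=50):
    """EM for the two-coin model; x[n] = 1 for heads, 0 for tails."""
    pi, p, q = pi0, p0, q0
    for _ in range(n_iter):
        # E step: posterior probability that toss n came from coin A
        a = pi * p ** x * (1 - p) ** (1 - x)
        b = (1 - pi) * q ** x * (1 - q) ** (1 - x)
        gamma = a / (a + b)

        # M step: the closed-form updates above
        pi = gamma.mean()
        p = (gamma * x).sum() / gamma.sum()
        q = ((1 - gamma) * x).sum() / (1 - gamma).sum()
    return pi, p, q

# Example usage with made-up tosses:
# x = np.array([1, 1, 0, 1, 0, 0, 1, 1, 1, 0])
# pi, p, q = em_two_coins(x)
```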

Summary

$$\gamma_{nk} = \mathbb{E}\left[z_{nk} \mid x_n; \theta^{\text{old}}\right], \qquad \pi_k^{\text{new}} = \frac{1}{N} \sum_{n=1}^N \gamma_{nk}, \qquad \mu_k^{\text{new}} = \frac{\sum_{n=1}^N \gamma_{nk} x_n}{\sum_{n=1}^N \gamma_{nk}}.$$

EM for Mixture of Gaussians

Problem settings
  • Latent: one-hot vector $z \sim \mathrm{Cat}(\pi_1, \dots, \pi_K)$
  • Observed: $x \mid z_k = 1 \sim \mathcal{N}(\mu_k, \Sigma_k)$
  • Parameters: $\theta = \{\pi_1, \dots, \pi_K, \mu_1, \dots, \mu_K, \Sigma_1, \dots, \Sigma_K\}$

As in the Bernoulli mixture case, the complete-data log-likelihood takes the form

$$\log p(X, Z; \theta) = \sum_{n=1}^N \sum_{k=1}^K z_{nk} \left[\log \pi_k + \log \mathcal{N}(x_n; \mu_k, \Sigma_k)\right].$$

The expected $z_{nk}$ with respect to the posterior distribution is

$$\mathbb{E}\left[z_{nk} \mid x_n; \theta\right] = \frac{\pi_k\, \mathcal{N}(x_n; \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j\, \mathcal{N}(x_n; \mu_j, \Sigma_j)} = \gamma(z_{nk}, \theta).$$

Plugging in $\gamma(z_{nk}, \theta^{\text{old}})$ and taking partial derivatives subject to the constraint $\sum_k \pi_k = 1$, as in the Bernoulli case, we obtain

$$\pi_k = \frac{1}{N} \sum_{n=1}^N \gamma(z_{nk}, \theta^{\text{old}}), \qquad \mu_k = \frac{\sum_{n=1}^N \gamma(z_{nk}, \theta^{\text{old}})\, x_n}{\sum_{n=1}^N \gamma(z_{nk}, \theta^{\text{old}})}, \qquad \Sigma_k = \frac{\sum_{n=1}^N \gamma(z_{nk}, \theta^{\text{old}})\, (x_n - \mu_k)(x_n - \mu_k)^{\mathsf{T}}}{\sum_{n=1}^N \gamma(z_{nk}, \theta^{\text{old}})}.$$
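These updates can be implemented in a few lines of NumPy/SciPy. In the sketch below, the initialization scheme and the small diagonal ridge added to the covariances for numerical stability are my own assumptions, not part of the derivation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a mixture of K Gaussians; X has shape (N, D)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]                  # means at random data points
    Sigma = np.stack([np.cov(X, rowvar=False) + 1e-6 * np.eye(D)] * K)

    for _ in range(n_iter):
        # E step: responsibilities gamma[n, k] = E[z_nk | x_n; old parameters]
        gamma = np.stack(
            [pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)],
            axis=1,
        )
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M step: the closed-form updates derived above
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = gamma.T @ X / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma
```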