Topic Models
Problem Settings
- Documents: indexed by $i = 1, \dots, n$.
- Length of document $i$: $l_i$.
- Word positions: $t = 1, \dots, l_i$.
- Words: the word at position $t$ of document $i$ is represented by a random variable $x_{it}$.
- Vocabulary: the set of all words $\mathcal{W} = \{w_1, \dots, w_m\}$, with $x_{it} \in \mathcal{W}$.
- Topic: describes the aboutness of a document. We assume the topic depends only on the words a document uses, with invariance to the observation order (a bag-of-words assumption).
- Word occurrence of word $w_j$ in document $i$: $n_{ij} = \sum_{t=1}^{l_i} \mathbb{I}[x_{it} = w_j]$.
- Occurrence matrix: $N = (n_{ij}) \in \mathbb{N}^{n \times m}$ (a small construction sketch follows this list).
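As a concrete illustration of this bag-of-words setup, here is a minimal sketch (assuming documents are already tokenized into lists of words; the function name is hypothetical) that builds the occurrence matrix $N$:

```python
import numpy as np

def occurrence_matrix(documents):
    """Build the document-word occurrence matrix N, where N[i, j] counts
    how often vocabulary word j appears in document i."""
    # Vocabulary: the set of all words across documents, in a fixed order.
    vocab = sorted({word for doc in documents for word in doc})
    index = {word: j for j, word in enumerate(vocab)}
    N = np.zeros((len(documents), len(vocab)), dtype=int)
    for i, doc in enumerate(documents):
        for word in doc:
            N[i, index[word]] += 1
    return N, vocab

docs = [["topic", "model", "topic"], ["model", "word", "word", "word"]]
N, vocab = occurrence_matrix(docs)
print(vocab)  # ['model', 'topic', 'word']
print(N)      # [[1 2 0], [1 0 3]]
```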
Probabilistic Latent Semantic Analysis (pLSA)
- Topics are represented by a latent variable $z \in \{z_1, \dots, z_K\}$, with a distribution conditioned on documents, $p(z_k \mid i)$.
- Words are sampled from the distribution conditioned on topics, $p(w_j \mid z)$.
- Conversion between occurrence distribution and word distribution: $$ p(x_{it} \mid z) = \sum_{j=1}^m p(w_{j} \mid z) \mathbb{I}[x_{it} = w_{j}] .$$
The log-likelihood of the occurrence matrix with respect to the unknown parameters $\{p(w_j \mid z_k),\, p(z_k \mid i)\}$ is
$$ \ell = \sum_{i=1}^{n} \sum_{j=1}^{m} n_{ij} \log \sum_{k=1}^{K} p(w_j \mid z_k)\, p(z_k \mid i) .$$
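As a sanity check on this objective, a minimal sketch (assuming the occurrence matrix `N` from the problem setting and randomly initialized, properly normalized parameters; all names are illustrative) evaluates the log-likelihood directly:

```python
import numpy as np

def plsa_log_likelihood(N, p_w_given_z, p_z_given_d):
    """N: (n_docs, n_words) counts; p_w_given_z: (K, n_words), rows sum to 1;
    p_z_given_d: (n_docs, K), rows sum to 1."""
    # p(w_j | i) = sum_k p(w_j | z_k) p(z_k | i), shape (n_docs, n_words)
    p_w_given_d = p_z_given_d @ p_w_given_z
    return np.sum(N * np.log(p_w_given_d + 1e-12))

rng = np.random.default_rng(0)
n_docs, n_words, K = 2, 3, 2
N = np.array([[1, 2, 0], [1, 0, 3]])
p_w_given_z = rng.dirichlet(np.ones(n_words), size=K)   # (K, n_words)
p_z_given_d = rng.dirichlet(np.ones(K), size=n_docs)    # (n_docs, K)
print(plsa_log_likelihood(N, p_w_given_z, p_z_given_d))
```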
EM for Topic Models
We can apply the standard EM algorithm to solve for the parameters $p(w_j \mid z_k)$ and $p(z_k \mid i)$. Introducing a variational distribution $q(z_k \mid i, j)$ over the topic of each document-word pair, we can easily get the ELBO:
$$ \mathcal{L}(q, \theta) = \sum_{i=1}^{n} \sum_{j=1}^{m} n_{ij} \sum_{k=1}^{K} q(z_k \mid i, j) \log \frac{p(w_j \mid z_k)\, p(z_k \mid i)}{q(z_k \mid i, j)} \le \ell .$$
By maximizing the ELBO with the constraints $\sum_{j=1}^{m} p(w_j \mid z_k) = 1$ and $\sum_{k=1}^{K} p(z_k \mid i) = 1$, we get the updates
$$ \text{E-step:} \quad q(z_k \mid i, j) = \frac{p(w_j \mid z_k)\, p(z_k \mid i)}{\sum_{k'=1}^{K} p(w_j \mid z_{k'})\, p(z_{k'} \mid i)} ,$$
$$ \text{M-step:} \quad p(w_j \mid z_k) \propto \sum_{i=1}^{n} n_{ij}\, q(z_k \mid i, j), \qquad p(z_k \mid i) \propto \sum_{j=1}^{m} n_{ij}\, q(z_k \mid i, j) .$$
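A minimal numpy sketch of these EM updates (assuming the occurrence matrix `N` from the problem setting; the function name and initialization are illustrative, not part of the original notes):

```python
import numpy as np

def plsa_em(N, K, n_iters=100, seed=0):
    """EM for pLSA. N: (n_docs, n_words) occurrence matrix.
    Returns p(w|z) of shape (K, n_words) and p(z|d) of shape (n_docs, K)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = N.shape
    p_w_given_z = rng.dirichlet(np.ones(n_words), size=K)      # (K, n_words)
    p_z_given_d = rng.dirichlet(np.ones(K), size=n_docs)       # (n_docs, K)
    for _ in range(n_iters):
        # E-step: q(z_k | i, j) proportional to p(w_j | z_k) p(z_k | i)
        q = p_z_given_d[:, :, None] * p_w_given_z[None, :, :]  # (n_docs, K, n_words)
        q /= q.sum(axis=1, keepdims=True) + 1e-12
        # M-step: reweight the posteriors by the counts n_ij and renormalize
        weighted = q * N[:, None, :]                            # (n_docs, K, n_words)
        p_w_given_z = weighted.sum(axis=0)                      # (K, n_words)
        p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
        p_z_given_d = weighted.sum(axis=2)                      # (n_docs, K)
        p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)
    return p_w_given_z, p_z_given_d

N = np.array([[5, 1, 0, 0], [0, 0, 3, 4], [4, 2, 1, 0]])
p_w_given_z, p_z_given_d = plsa_em(N, K=2)
print(np.round(p_z_given_d, 2))  # per-document topic proportions
```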
Latent Dirichlet Allocation (LDA)
The parameters can be written as the per-document topic proportions $\pi_{ik} = p(z_k \mid i)$ and the per-topic word probabilities $b_{kj} = p(w_j \mid z_k)$. We can further define the topic-proportion vector $\pi_i = (\pi_{i1}, \dots, \pi_{iK})$, which lies on the probability simplex.
For a new, unseen document, we don't know the exact topic proportions $\pi$. We wish the prior distribution over $\pi$ to be a Dirichlet distribution,
$$ p(\pi \mid \alpha) = \mathrm{Dir}(\pi \mid \alpha) = \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1} ,$$
where $\alpha = (\alpha_1, \dots, \alpha_K)$ with $\alpha_k > 0$ is the concentration parameter.
Latent Dirichlet allocation augments topic models with a Dirichlet prior. For a fixed-length document with length $l$, the marginal likelihood of its words is
$$ p(x_1, \dots, x_l \mid \alpha) = \int \mathrm{Dir}(\pi \mid \alpha) \prod_{t=1}^{l} \sum_{k=1}^{K} \pi_k\, p(x_t \mid z_k)\, d\pi ,$$
where $p(x_t \mid z_k)$ is given by the conversion between the occurrence distribution and the word distribution above.
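A minimal forward-sampling sketch of this generative process (illustrative names only; this samples documents from LDA, it is not an inference algorithm):

```python
import numpy as np

def sample_lda_document(alpha, p_w_given_z, length, rng):
    """Sample one fixed-length document from the LDA generative process.
    alpha: (K,) Dirichlet concentration; p_w_given_z: (K, n_words)."""
    pi = rng.dirichlet(alpha)                 # document-specific topic proportions
    words = []
    for _ in range(length):
        z = rng.choice(len(alpha), p=pi)      # draw a topic for this position
        w = rng.choice(p_w_given_z.shape[1], p=p_w_given_z[z])  # draw a word from that topic
        words.append(w)
    return words

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5])
p_w_given_z = np.array([[0.7, 0.2, 0.1, 0.0],
                        [0.0, 0.1, 0.2, 0.7]])
print(sample_lda_document(alpha, p_w_given_z, length=8, rng=rng))
```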
Probabilistic Matrix Decomposition
The log-likelihood in #Probabilistic Latent Semantic Analysis (pLSA) can be written as
$$ \ell = \sum_{i=1}^{n} \sum_{j=1}^{m} n_{ij} \log \left( U V^{\top} \right)_{ij} ,$$
where $U \in [0, 1]^{n \times K}$ with $U_{ik} = p(z_k \mid i)$ and $V \in [0, 1]^{m \times K}$ with $V_{jk} = p(w_j \mid z_k)$.
Therefore, we can see the topic model as a non-negative matrix decomposition with a principled log-likelihood objective, which satisfies the following constraints:
$$ U \ge 0, \qquad V \ge 0, \qquad \sum_{k=1}^{K} U_{ik} = 1 \ \ \forall i, \qquad \sum_{j=1}^{m} V_{jk} = 1 \ \ \forall k .$$
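To make the matrix view concrete, a small sketch (with randomly chosen, properly normalized factors; all names are illustrative) verifies that any factorization satisfying these constraints yields valid per-document word distributions and a well-defined log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, K = 3, 4, 2
N = np.array([[5, 1, 0, 0], [0, 0, 3, 4], [4, 2, 1, 0]])

# U_ik = p(z_k | i): rows sum to 1.  V_jk = p(w_j | z_k): columns sum to 1.
U = rng.dirichlet(np.ones(K), size=n_docs)        # (n_docs, K)
V = rng.dirichlet(np.ones(n_words), size=K).T     # (n_words, K)

P = U @ V.T                                       # (n_docs, n_words)
print(P.sum(axis=1))                              # each row sums to 1: a valid word distribution
print(np.sum(N * np.log(P)))                      # pLSA log-likelihood of this factorization
```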