Topic Models
Problem Settings
- Documents: indexed by $i = 1, \dots, n$.
- Length of document $i$: $l_i$.
- Word positions: $t = 1, \dots, l_i$.
- Words: the word at position $t$ of document $i$ is represented by a random variable $x_{it}$.
- Vocabulary: the set of all words $\mathcal{W} = \{w_1, \dots, w_m\}$, with $x_{it} \in \mathcal{W}$.
- Topic: describes the aboutness of a document. We assume the topic depends only on the words a document uses, with invariance to the observation order (a bag-of-words assumption).
- Word occurrence of word $w_j$ in document $i$: $n_{ij} = \sum_{t=1}^{l_i} \mathbb{I}[x_{it} = w_j]$.
- Occurrence matrix: $N = (n_{ij}) \in \mathbb{N}^{n \times m}$ (a small construction sketch follows this list).
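As a concrete illustration of this bag-of-words setup, here is a minimal sketch (assuming documents are already tokenized into lists of words; the function name is hypothetical) that builds the occurrence matrix $N$:

```python
import numpy as np

def occurrence_matrix(documents):
    """Build the document-word occurrence matrix N, where N[i, j] counts
    how often vocabulary word j appears in document i."""
    # Vocabulary: the set of all words across documents, in a fixed order.
    vocab = sorted({word for doc in documents for word in doc})
    index = {word: j for j, word in enumerate(vocab)}
    N = np.zeros((len(documents), len(vocab)), dtype=int)
    for i, doc in enumerate(documents):
        for word in doc:
            N[i, index[word]] += 1
    return N, vocab

docs = [["topic", "model", "topic"], ["model", "word", "word", "word"]]
N, vocab = occurrence_matrix(docs)
print(vocab)  # ['model', 'topic', 'word']
print(N)      # [[1 2 0], [1 0 3]]
```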
Probabilistic Latent Semantic Analysis (pLSA)
- Topics are represented by a latent variable $z \in \{z_1, \dots, z_K\}$, with a distribution conditioned on documents, $p(z_k \mid i)$.
- Words are sampled from the distribution conditioned on topics, $p(w_j \mid z)$.
- Conversion between occurrence distribution and word distribution: $$ p(x_{it} \mid z) = \sum_{j=1}^m p(w_{j} \mid z) \mathbb{I}[x_{it} = w_{j}] .$$
The log-likelihood of the occurrence matrix with respect to the unknown parameters $\{p(w_j \mid z_k),\, p(z_k \mid i)\}$ is
$$ \ell = \sum_{i=1}^{n} \sum_{j=1}^{m} n_{ij} \log \sum_{k=1}^{K} p(w_j \mid z_k)\, p(z_k \mid i) .$$
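As a sanity check on this objective, a minimal sketch (assuming the occurrence matrix `N` from the problem setting and randomly initialized, properly normalized parameters; all names are illustrative) evaluates the log-likelihood directly:

```python
import numpy as np

def plsa_log_likelihood(N, p_w_given_z, p_z_given_d):
    """N: (n_docs, n_words) counts; p_w_given_z: (K, n_words), rows sum to 1;
    p_z_given_d: (n_docs, K), rows sum to 1."""
    # p(w_j | i) = sum_k p(w_j | z_k) p(z_k | i), shape (n_docs, n_words)
    p_w_given_d = p_z_given_d @ p_w_given_z
    return np.sum(N * np.log(p_w_given_d + 1e-12))

rng = np.random.default_rng(0)
n_docs, n_words, K = 2, 3, 2
N = np.array([[1, 2, 0], [1, 0, 3]])
p_w_given_z = rng.dirichlet(np.ones(n_words), size=K)   # (K, n_words)
p_z_given_d = rng.dirichlet(np.ones(K), size=n_docs)    # (n_docs, K)
print(plsa_log_likelihood(N, p_w_given_z, p_z_given_d))
```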
EM for Topic Models
We can apply the standard EM algorithm to solve for the parameters $p(w_j \mid z_k)$ and $p(z_k \mid i)$. Introducing a variational distribution $q(z_k \mid i, j)$ over the topic of each document-word pair, we can easily get the ELBO:
$$ \mathcal{L}(q, \theta) = \sum_{i=1}^{n} \sum_{j=1}^{m} n_{ij} \sum_{k=1}^{K} q(z_k \mid i, j) \log \frac{p(w_j \mid z_k)\, p(z_k \mid i)}{q(z_k \mid i, j)} \le \ell .$$
By maximizing the ELBO with the constraints $\sum_{j=1}^{m} p(w_j \mid z_k) = 1$ and $\sum_{k=1}^{K} p(z_k \mid i) = 1$, we get the updates
$$ \text{E-step:} \quad q(z_k \mid i, j) = \frac{p(w_j \mid z_k)\, p(z_k \mid i)}{\sum_{k'=1}^{K} p(w_j \mid z_{k'})\, p(z_{k'} \mid i)} ,$$
$$ \text{M-step:} \quad p(w_j \mid z_k) \propto \sum_{i=1}^{n} n_{ij}\, q(z_k \mid i, j), \qquad p(z_k \mid i) \propto \sum_{j=1}^{m} n_{ij}\, q(z_k \mid i, j) .$$
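A minimal numpy sketch of these EM updates (assuming the occurrence matrix `N` from the problem setting; the function name and initialization are illustrative, not part of the original notes):

```python
import numpy as np

def plsa_em(N, K, n_iters=100, seed=0):
    """EM for pLSA. N: (n_docs, n_words) occurrence matrix.
    Returns p(w|z) of shape (K, n_words) and p(z|d) of shape (n_docs, K)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = N.shape
    p_w_given_z = rng.dirichlet(np.ones(n_words), size=K)      # (K, n_words)
    p_z_given_d = rng.dirichlet(np.ones(K), size=n_docs)       # (n_docs, K)
    for _ in range(n_iters):
        # E-step: q(z_k | i, j) proportional to p(w_j | z_k) p(z_k | i)
        q = p_z_given_d[:, :, None] * p_w_given_z[None, :, :]  # (n_docs, K, n_words)
        q /= q.sum(axis=1, keepdims=True) + 1e-12
        # M-step: reweight the posteriors by the counts n_ij and renormalize
        weighted = q * N[:, None, :]                            # (n_docs, K, n_words)
        p_w_given_z = weighted.sum(axis=0)                      # (K, n_words)
        p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
        p_z_given_d = weighted.sum(axis=2)                      # (n_docs, K)
        p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)
    return p_w_given_z, p_z_given_d

N = np.array([[5, 1, 0, 0], [0, 0, 3, 4], [4, 2, 1, 0]])
p_w_given_z, p_z_given_d = plsa_em(N, K=2)
print(np.round(p_z_given_d, 2))  # per-document topic proportions
```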
Latent Dirichlet Allocation (LDA)
The parameters can be written as the per-document topic proportions $\pi_{ik} = p(z_k \mid i)$ and the per-topic word probabilities $b_{kj} = p(w_j \mid z_k)$. We can further define the topic-proportion vector $\pi_i = (\pi_{i1}, \dots, \pi_{iK})$, which lies on the probability simplex.
For a new, unseen document, we don't know the exact topic proportions $\pi$. We wish the prior distribution over $\pi$ to be a Dirichlet distribution,
$$ p(\pi \mid \alpha) = \mathrm{Dir}(\pi \mid \alpha) = \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1} ,$$
where $\alpha = (\alpha_1, \dots, \alpha_K)$ with $\alpha_k > 0$ is the concentration parameter.
Latent Dirichlet allocation augments topic models with a Dirichlet prior. For a fixed-length document with length $l$, the marginal likelihood of its words is
$$ p(x_1, \dots, x_l \mid \alpha) = \int \mathrm{Dir}(\pi \mid \alpha) \prod_{t=1}^{l} \sum_{k=1}^{K} \pi_k\, p(x_t \mid z_k)\, d\pi ,$$
where $p(x_t \mid z_k)$ is given by the conversion between the occurrence distribution and the word distribution above.
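A minimal forward-sampling sketch of this generative process (illustrative names only; this samples documents from LDA, it is not an inference algorithm):

```python
import numpy as np

def sample_lda_document(alpha, p_w_given_z, length, rng):
    """Sample one fixed-length document from the LDA generative process.
    alpha: (K,) Dirichlet concentration; p_w_given_z: (K, n_words)."""
    pi = rng.dirichlet(alpha)                 # document-specific topic proportions
    words = []
    for _ in range(length):
        z = rng.choice(len(alpha), p=pi)      # draw a topic for this position
        w = rng.choice(p_w_given_z.shape[1], p=p_w_given_z[z])  # draw a word from that topic
        words.append(w)
    return words

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5])
p_w_given_z = np.array([[0.7, 0.2, 0.1, 0.0],
                        [0.0, 0.1, 0.2, 0.7]])
print(sample_lda_document(alpha, p_w_given_z, length=8, rng=rng))
```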
Probabilistic Matrix Decomposition
The log-likelihood in #Probabilistic Latent Semantic Analysis (pLSA) can be written as
$$ \ell = \sum_{i=1}^{n} \sum_{j=1}^{m} n_{ij} \log \left( U V^{\top} \right)_{ij} ,$$
where $U \in [0, 1]^{n \times K}$ with $U_{ik} = p(z_k \mid i)$ and $V \in [0, 1]^{m \times K}$ with $V_{jk} = p(w_j \mid z_k)$.
Therefore, we can see the topic model as a non-negative matrix decomposition with a principled log-likelihood objective, which satisfies the following constraints:
$$ U \ge 0, \qquad V \ge 0, \qquad \sum_{k=1}^{K} U_{ik} = 1 \ \ \forall i, \qquad \sum_{j=1}^{m} V_{jk} = 1 \ \ \forall k .$$
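To make the matrix view concrete, a small sketch (with randomly chosen, properly normalized factors; all names are illustrative) verifies that any factorization satisfying these constraints yields valid per-document word distributions and a well-defined log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, K = 3, 4, 2
N = np.array([[5, 1, 0, 0], [0, 0, 3, 4], [4, 2, 1, 0]])

# U_ik = p(z_k | i): rows sum to 1.  V_jk = p(w_j | z_k): columns sum to 1.
U = rng.dirichlet(np.ones(K), size=n_docs)        # (n_docs, K)
V = rng.dirichlet(np.ones(n_words), size=K).T     # (n_words, K)

P = U @ V.T                                       # (n_docs, n_words)
print(P.sum(axis=1))                              # each row sums to 1: a valid word distribution
print(np.sum(N * np.log(P)))                      # pLSA log-likelihood of this factorization
```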