Each word $w$ in the vocabulary $\mathcal{V}$ can be represented by an embedding vector $u_w \in \mathbb{R}^d$. We treat the embedding of each word as a latent variable that predicts co-occurring context words.
Skip-Gram Model
We only consider the co-occurrence of context words within a set of positional displacements, e.g. a symmetric window of radius $k$:

$$\mathcal{D} = \{-k, \dots, -1, 1, \dots, k\}.$$

Given a sequence of words $w_1, \dots, w_T$, its log-likelihood is

$$\ell = \sum_{t=1}^{T} \sum_{\delta \in \mathcal{D}} \log p(w_{t+\delta} \mid w_t).$$

The log-likelihood of a pair of words $w$ and $w'$ is modeled through the inner products of the corresponding embedding vectors $u_{w'}$ and $u_w$:

$$\log p(w' \mid w) = \langle u_{w'}, u_w \rangle - \log \sum_{v \in \mathcal{V}} \exp\bigl(\langle u_v, u_w \rangle\bigr).$$

Therefore,

$$\ell = \sum_{t=1}^{T} \sum_{\delta \in \mathcal{D}} \Bigl[ \langle u_{w_{t+\delta}}, u_{w_t} \rangle - \log \sum_{v \in \mathcal{V}} \exp\bigl(\langle u_v, u_{w_t} \rangle\bigr) \Bigr].$$
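As a minimal illustration of the model above, the NumPy sketch below extracts co-occurring pairs within the displacement window and evaluates the softmax log-likelihood. The toy corpus, window radius `k`, and dimension `d` are made-up values, and `U` plays the role of the embedding vectors $u_w$.

```python
import numpy as np

# Toy corpus and vocabulary (made-up example data).
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}
ids = np.array([word_to_id[w] for w in corpus])

k = 2          # window radius: displacements {-2, -1, 1, 2}
d = 8          # embedding dimension
rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((len(vocab), d))  # one embedding u_w per word

def log_p(context_id, center_id, U):
    """log p(w' | w) = <u_w', u_w> - log sum_v exp(<u_v, u_w>)."""
    scores = U @ U[center_id]                   # inner products with every vocabulary word
    return scores[context_id] - np.log(np.exp(scores).sum())

# Log-likelihood: sum over positions t and displacements delta in {-k..-1, 1..k}.
ll = 0.0
for t in range(len(ids)):
    for delta in range(-k, k + 1):
        if delta == 0 or not (0 <= t + delta < len(ids)):
            continue
        ll += log_p(ids[t + delta], ids[t], U)
print("skip-gram log-likelihood:", ll)
```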
Refined Model
In a refined model, we introduce explicit biases, and we use different embeddings for the conditioned word and for the predicted word:

$$p(w' \mid w) = \frac{\exp\bigl(\langle u_{w'}, v_w \rangle + b_{w'} + c_w\bigr)}{\sum_{v' \in \mathcal{V}} \exp\bigl(\langle u_{v'}, v_w \rangle + b_{v'} + c_w\bigr)},$$

where the parameters for a word $w$ are the two embedding vectors $u_w, v_w \in \mathbb{R}^d$ and the two biases $b_w, c_w \in \mathbb{R}$.
By defining the number of occurrences of a pair within the displacement window,

$$n(w, w') = \sum_{t=1}^{T} \sum_{\delta \in \mathcal{D}} \mathbb{1}\{w_t = w,\ w_{t+\delta} = w'\},$$

the final log-likelihood becomes

$$\ell = \sum_{w, w'} n(w, w') \log p(w' \mid w).$$

One can optimize this objective with generic first-order methods. However, computing the normalization constant is expensive, since it requires a sum over the entire vocabulary for every conditioned word.
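To make the bottleneck concrete, here is a sketch of a single refined log-probability evaluation, with illustrative sizes and the hypothetical names `V_in`, `U_out`, `b`, `c` for the two embedding tables and the two bias vectors; the log-partition term alone touches every row of `U_out`.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, d = 10_000, 100          # illustrative sizes
V_in  = 0.1 * rng.standard_normal((vocab_size, d))   # v_w: embeddings of the conditioned word
U_out = 0.1 * rng.standard_normal((vocab_size, d))   # u_w': embeddings of the predicted word
b = np.zeros(vocab_size)                              # bias of the predicted word
c = np.zeros(vocab_size)                              # bias of the conditioned word

def refined_log_p(ctx, center):
    """log p(w' | w) = <u_w', v_w> + b_w' + c_w - log Z_w."""
    scores = U_out @ V_in[center] + b + c[center]     # one score per vocabulary word
    log_Z = np.log(np.exp(scores - scores.max()).sum()) + scores.max()  # stable log-sum-exp
    return scores[ctx] - log_Z

# Every single evaluation touches all `vocab_size` rows of U_out -- the bottleneck.
print(refined_log_p(42, 7))
```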
Negative Sampling
The complexity bottleneck can be overcome by reducing the problem to binary classification. For an observed pair of words $(w, w')$, we consider the binary classification of whether the pair actually co-occurred, using logistic regression:

$$p(\text{true} \mid w, w') = \sigma\bigl(\langle u_{w'}, v_w \rangle\bigr),$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the logistic function.

The total log-likelihood is the sum of the log-likelihoods of the positive and the negative samples. The positive samples are the actually co-occurring pairs:

$$\mathcal{P} = \bigl\{ (w_t, w_{t+\delta}) : 1 \le t \le T,\ \delta \in \mathcal{D} \bigr\}.$$

For negative samples, one can sample word pairs at random, pairing a conditioned word drawn from the corpus with a predicted word drawn from a noise distribution $q$:

$$\mathcal{N} = \bigl\{ (w, \tilde w) : w \sim p,\ \tilde w \sim q \bigr\}.$$

The logistic log-likelihood is given by

$$\ell = \sum_{(w, w') \in \mathcal{P}} \log \sigma\bigl(\langle u_{w'}, v_w \rangle\bigr) + \sum_{(w, \tilde w) \in \mathcal{N}} \log \bigl(1 - \sigma(\langle u_{\tilde w}, v_w \rangle)\bigr).$$

In general, the noise distribution can be chosen as

$$q(w) \propto p(w)^{\alpha},$$

where $p(w)$ is the relative frequency of $w$ in the corpus, and typically $\alpha = 3/4$.
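The sketch below is one way to write this objective under the assumptions above: each observed pair contributes $\log \sigma(\langle u_{w'}, v_w \rangle)$, each sampled pair contributes $\log(1 - \sigma(\cdot))$, and the predicted word of a negative pair is drawn from $q(w) \propto p(w)^{3/4}$. The counts, pairs, and names (`positive_pairs`, `num_negatives`) are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
vocab_size, d = 1000, 50
V_in  = 0.1 * rng.standard_normal((vocab_size, d))   # v_w
U_out = 0.1 * rng.standard_normal((vocab_size, d))   # u_w'

# Noise distribution q(w) proportional to p(w)^(3/4), from made-up unigram counts.
counts = rng.integers(1, 1000, size=vocab_size).astype(float)
q = counts ** 0.75
q /= q.sum()

def neg_sampling_ll(positive_pairs, num_negatives=5):
    """Logistic log-likelihood: positives scored as true, sampled pairs as false."""
    ll = 0.0
    for center, ctx in positive_pairs:
        ll += np.log(sigmoid(U_out[ctx] @ V_in[center]))              # positive pair
        negatives = rng.choice(vocab_size, size=num_negatives, p=q)   # sampled predicted words
        for neg in negatives:
            ll += np.log(sigmoid(-(U_out[neg] @ V_in[center])))       # 1 - sigma(x) = sigma(-x)
    return ll

print(neg_sampling_ll([(3, 17), (3, 99), (250, 4)]))
```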
Pointwise Mutual Information
Let $p(w, w')$ be the true distribution of word co-occurrences and $q(w, w')$ be the distribution used for negative sampling. With $k$ negative samples per positive sample, the optimal Bayesian classifier that could be achieved by the embedding model is

$$p(\text{true} \mid w, w') = \frac{p(w, w')}{p(w, w') + k\, q(w, w')}.$$

Its pre-image under the logistic function is

$$\langle u_{w'}, v_w \rangle = \log \frac{p(w, w')}{q(w, w')} - \log k.$$

When the sampling is balanced ($k = 1$) and $\alpha = 1$, so that $q(w, w') = p(w)\, p(w')$, this becomes the pointwise mutual information:

$$\operatorname{PMI}(w, w') = \log \frac{p(w, w')}{p(w)\, p(w')}.$$

The word2vec approach with balanced negative sampling and $\alpha = 1$ can therefore be interpreted as a method that pushes the embedding inner products toward the pointwise mutual information of word co-occurrences.
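For intuition, the PMI matrix can be computed directly from co-occurrence counts, which is the quantity the inner products approximate under this interpretation; the count matrix below is a made-up example.

```python
import numpy as np

# Made-up symmetric co-occurrence counts n(w, w') for a 4-word vocabulary.
n = np.array([[0., 8., 2., 1.],
              [8., 0., 1., 3.],
              [2., 1., 0., 5.],
              [1., 3., 5., 0.]])

total = n.sum()
p_joint = n / total                       # p(w, w')
p_word = p_joint.sum(axis=1)              # marginal p(w)

with np.errstate(divide="ignore"):        # log(0) -> -inf for never co-occurring pairs
    pmi = np.log(p_joint) - np.log(np.outer(p_word, p_word))

print(np.round(pmi, 2))
```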
Global Word Vectors (GloVe)
The global word vector (GloVe) objective minimizes the following weighted least-squares loss:

$$J = \sum_{w, w'} f\bigl(n(w, w')\bigr)\, \bigl( \langle u_{w'}, v_w \rangle + b_{w'} + c_w - \log n(w, w') \bigr)^2,$$

where the sum runs over pairs with $n(w, w') > 0$, $n(w, w')$ is the co-occurrence count within the displacement window defined above, and $f$ is the weighting function defined by

$$f(x) = \min\bigl\{ 1,\ (x / x_{\max})^{\alpha} \bigr\},$$

where typically $x_{\max} = 100$ and $\alpha = 3/4$.
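Here is a sketch of this loss under the notation above, with the typical constants $x_{\max} = 100$ and $\alpha = 3/4$; the count matrix and the names `V_in`, `U_out`, `b`, `c` are illustrative.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x) = min(1, (x / x_max)^alpha)."""
    return np.minimum(1.0, (x / x_max) ** alpha)

def glove_loss(n, V_in, U_out, b, c):
    """Sum of f(n) * (<u_w', v_w> + b_w' + c_w - log n)^2 over pairs with n > 0."""
    loss = 0.0
    rows, cols = np.nonzero(n)                      # only observed co-occurrences contribute
    for w, wp in zip(rows, cols):
        err = U_out[wp] @ V_in[w] + b[wp] + c[w] - np.log(n[w, wp])
        loss += glove_weight(n[w, wp]) * err ** 2
    return loss

# Tiny made-up example.
rng = np.random.default_rng(3)
vocab_size, d = 4, 5
n = np.array([[0., 8., 2., 1.],
              [8., 0., 1., 3.],
              [2., 1., 0., 5.],
              [1., 3., 5., 0.]])
V_in  = 0.1 * rng.standard_normal((vocab_size, d))
U_out = 0.1 * rng.standard_normal((vocab_size, d))
b = np.zeros(vocab_size); c = np.zeros(vocab_size)
print(glove_loss(n, V_in, U_out, b, c))
```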
Unnormalized Models
The GloVe objective does not require the modeled distribution to be normalized. One can choose the unnormalized model

$$\log \tilde p(w' \mid w) = \langle u_{w'}, v_w \rangle + b_{w'} + c_w,$$

with no normalization term.
Therefore, GloVe does not require expensive computation of normalizing factors.
GloVe as Matrix Factorization
Defining $M_{w, w'} = \log n(w, w')$ and the residual $e(w, w') = \langle u_{w'}, v_w \rangle + b_{w'} + c_w - M_{w, w'}$, then

$$J = \sum_{w, w'} f\bigl(n(w, w')\bigr)\, e(w, w')^2.$$

Therefore, we want to find embeddings $u_{w'}, v_w$ and biases $b_{w'}, c_w$ such that

$$\langle u_{w'}, v_w \rangle + b_{w'} + c_w \approx \log n(w, w'),$$

i.e., an approximate low-rank factorization of the matrix $M$, which can be optimized using SGD, i.e., sampling a pair $(w, w')$ at random and performing updates with step size $\eta$:

$$
\begin{aligned}
v_w &\leftarrow v_w - \eta\, f\bigl(n(w, w')\bigr)\, e(w, w')\, u_{w'}, &\qquad
u_{w'} &\leftarrow u_{w'} - \eta\, f\bigl(n(w, w')\bigr)\, e(w, w')\, v_w, \\
c_w &\leftarrow c_w - \eta\, f\bigl(n(w, w')\bigr)\, e(w, w'), &\qquad
b_{w'} &\leftarrow b_{w'} - \eta\, f\bigl(n(w, w')\bigr)\, e(w, w').
\end{aligned}
$$
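A sketch of these stochastic updates, reusing the hypothetical `glove_weight` from the previous snippet; the step size, dimension, number of steps, and count matrix are arbitrary toy choices.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    return np.minimum(1.0, (x / x_max) ** alpha)

def sgd_glove(n, d=5, steps=20_000, eta=0.05, seed=4):
    """Fit <u_w', v_w> + b_w' + c_w to log n(w, w') by sampling one observed pair per step."""
    rng = np.random.default_rng(seed)
    vocab_size = n.shape[0]
    V_in  = 0.1 * rng.standard_normal((vocab_size, d))
    U_out = 0.1 * rng.standard_normal((vocab_size, d))
    b = np.zeros(vocab_size); c = np.zeros(vocab_size)
    rows, cols = np.nonzero(n)
    for _ in range(steps):
        i = rng.integers(len(rows))
        w, wp = rows[i], cols[i]
        err = U_out[wp] @ V_in[w] + b[wp] + c[w] - np.log(n[w, wp])
        g = glove_weight(n[w, wp]) * err
        # Simultaneous update of the two embeddings, then the two biases.
        V_in[w], U_out[wp] = V_in[w] - eta * g * U_out[wp], U_out[wp] - eta * g * V_in[w]
        b[wp] -= eta * g
        c[w]  -= eta * g
    return V_in, U_out, b, c

n = np.array([[0., 8., 2.],
              [8., 0., 5.],
              [2., 5., 0.]])
V_in, U_out, b, c = sgd_glove(n)
# After training, the left value should approach the right one.
print(U_out[1] @ V_in[0] + b[1] + c[0], np.log(n[0, 1]))
```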
Word Analogy Problems
One can use GloVe embeddings to solve word analogy problems. For instance, "king is to man as what is to woman?" can be solved by finding the word whose embedding is closest, in cosine similarity, to $u_{\text{king}} - u_{\text{man}} + u_{\text{woman}}$:

$$w^{*} = \operatorname*{arg\,max}_{w \notin \{\text{king},\, \text{man},\, \text{woman}\}} \cos\bigl( u_w,\ u_{\text{king}} - u_{\text{man}} + u_{\text{woman}} \bigr).$$
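A sketch of this nearest-neighbour search; the embedding matrix below is random so the toy call returns a meaningless word, whereas with real trained vectors the expected answer is "queen". The names `embeddings`, `word_to_id`, and `id_to_word` are placeholders.

```python
import numpy as np

def analogy(a, b, c, embeddings, word_to_id, id_to_word):
    """Solve 'a is to b as ? is to c' by the nearest neighbour (cosine) of u_a - u_b + u_c."""
    target = (embeddings[word_to_id[a]]
              - embeddings[word_to_id[b]]
              + embeddings[word_to_id[c]])
    sims = embeddings @ target / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(target))
    for w in (a, b, c):                       # never return one of the query words
        sims[word_to_id[w]] = -np.inf
    return id_to_word[int(np.argmax(sims))]

# Runnable toy call with random vectors; with trained embeddings this should return "queen".
vocab = ["king", "man", "woman", "queen", "apple"]
word_to_id = {w: i for i, w in enumerate(vocab)}
id_to_word = dict(enumerate(vocab))
E = np.random.default_rng(5).standard_normal((len(vocab), 16))
print(analogy("king", "man", "woman", E, word_to_id, id_to_word))
```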