Word Embeddings

Word2Vec

Each word $w$ in the vocabulary $\Sigma$ can be represented by an embedding vector $z_w \in \mathbb{R}^m$. We treat the embedding of each word as a latent variable that predicts co-occurring context words.

Skip-Gram Model

We only consider the co-occurrence of context words in a set of positional displacements:

$$\mathcal{I} = \{-R, \dots, -1, 1, \dots, R\}.$$

Given a sequence of words $x = [x_1, \dots, x_T]$, its log-likelihood is

$$\log p(x; \theta) = \sum_{t=1}^{T} \sum_{\Delta \in \mathcal{I}} \log p(x_{t+\Delta} \mid x_t; \theta).$$

The log-likelihood of a pair of words $v$ and $w$ can be represented by the inner product of the corresponding embedding vectors $z_v$ and $z_w$:

$$\log p(v \mid w) = \langle z_w, z_v \rangle + \text{const}.$$

Therefore,

$$p(v \mid w) = \frac{\exp[\langle z_w, z_v \rangle]}{\sum_{u} \exp[\langle z_w, z_u \rangle]}.$$
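
As a small illustration, the following sketch (in NumPy, with a hypothetical embedding matrix `Z` holding one row $z_w$ per vocabulary word) evaluates this softmax from inner products of embedding vectors:

```python
import numpy as np

def skipgram_probs(Z, w_idx):
    """Softmax distribution p(. | w) over the vocabulary,
    where row u of Z is the embedding z_u."""
    scores = Z @ Z[w_idx]          # <z_w, z_u> for every word u
    scores -= scores.max()         # shift for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# Example: a vocabulary of 5 words embedded in R^3 (random toy values)
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 3))
print(skipgram_probs(Z, w_idx=2))  # non-negative, sums to 1
```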

Refined Model

In a refined model, we introduce explicit biases, and we use different embeddings for the conditioned word and the predicted word:

$$p(v \mid w; \theta) = \frac{\exp[\zeta_w^\top z_v + b_v]}{\sum_{u} \exp[\zeta_w^\top z_u + b_u]},$$

where the parameters associated with each word $w$ are

$$\theta_w = (z_w, \zeta_w, b_w) \in \mathbb{R}^{2m+1}.$$

By defining the pair co-occurrence count within the displacement window:

$$N_{vw} := \big|\{\,t : x_t = w,\ x_{t+\Delta} = v,\ \Delta \in \mathcal{I}\,\}\big|,$$

the final log-likelihood becomes

$$\log p(x; \theta) = \sum_{v,w} N_{vw} \Bigg( \langle \zeta_w, z_v \rangle + b_v - \underbrace{\log \sum_{u \in \Sigma} \exp[\zeta_w^\top z_u + b_u]}_{\text{normalization constant}} \Bigg).$$
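
A minimal sketch of how the counts $N_{vw}$ could be accumulated from a token sequence, assuming the tokens are already mapped to integer ids; the function name and the window radius `R` are illustrative:

```python
import numpy as np

def cooccurrence_counts(tokens, vocab_size, R=2):
    """N[v, w] = number of positions t with x_t = w and x_{t+Delta} = v
    for some Delta in {-R, ..., -1, 1, ..., R}."""
    N = np.zeros((vocab_size, vocab_size), dtype=np.int64)
    T = len(tokens)
    for t, w in enumerate(tokens):
        for delta in range(-R, R + 1):
            if delta == 0 or not (0 <= t + delta < T):
                continue
            v = tokens[t + delta]
            N[v, w] += 1
    return N

# Example: a toy sequence over a vocabulary of 4 word ids
print(cooccurrence_counts([0, 1, 2, 1, 0, 3], vocab_size=4, R=1))
```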

One can optimize this objective with generic first-order methods. However, computing the normalization constant is expensive, since it requires a sum over the entire vocabulary for every conditioning word.
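
To make the bottleneck explicit, the following sketch evaluates the count-weighted log-likelihood of the refined model with hypothetical parameter arrays `Z`, `Zeta`, `b` (rows $z_v$, rows $\zeta_w$, and biases $b_v$); note the log-sum-exp over the whole vocabulary for every conditioning word:

```python
import numpy as np
from scipy.special import logsumexp

def refined_loglik(N, Z, Zeta, b):
    """Count-weighted log-likelihood
    sum_{v,w} N_vw * (<zeta_w, z_v> + b_v - log sum_u exp(<zeta_w, z_u> + b_u))."""
    scores = Zeta @ Z.T + b                              # scores[w, v] = <zeta_w, z_v> + b_v
    log_norm = logsumexp(scores, axis=1, keepdims=True)  # one sum over the whole vocabulary per w
    return float(np.sum(N.T * (scores - log_norm)))      # N[v, w] -> transpose to index [w, v]
```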

Negative Sampling

The complexity bottleneck can be overcome by reducing the problem to binary classification. For an observed pair of words $(v, w)$, we model the probability that the pair is a true co-occurrence with logistic regression:

$$p(v \mid w) = \sigma(\langle \zeta_w, z_v \rangle + b_v),$$

where

$$\sigma(z) = \frac{1}{1 + \exp(-z)}.$$

The total log-likelihood is the sum of the log-likelihoods of positive and negative samples. The positive samples are the set of actually co-occurring pairs:

$$\mathcal{S}^+ = \{(x_t, x_{t+\Delta}) : t = 1, \dots, T,\ \Delta \in \mathcal{I}\}.$$

For negative samples, one pairs each observed word with $r$ words drawn at random from a noise distribution $q$:

$$\mathcal{S}^- = \{(x_t, v_{tj}) : t = 1, \dots, T,\ v_{tj} \stackrel{\text{iid}}{\sim} q,\ j = 1, \dots, r\}.$$

The logistic log-likelihood is given by

$$\ell(x; \theta) = \sum_{(w,v) \in \mathcal{S}^+} \log \sigma(\langle \zeta_w, z_v \rangle + b_v) + \sum_{(w,u) \in \mathcal{S}^-} \log\big(1 - \sigma(\langle \zeta_w, z_u \rangle + b_u)\big).$$
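
A sketch of this objective for a single positive pair and its $r$ sampled negatives, again with hypothetical parameter arrays `Z`, `Zeta`, `b`:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_logistic_loglik(w, v, neg_ids, Z, Zeta, b):
    """log sigma(<zeta_w, z_v> + b_v) + sum_j log(1 - sigma(<zeta_w, z_uj> + b_uj))."""
    pos = np.log(sigmoid(Zeta[w] @ Z[v] + b[v]))
    neg_scores = Z[neg_ids] @ Zeta[w] + b[neg_ids]   # one score per negative word u_j
    neg = np.sum(np.log(1.0 - sigmoid(neg_scores)))
    return pos + neg
```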

In general, the distribution q can be chosen as

$$q(w) \propto p(w)^{\alpha},$$

where $p(w)$ is the relative frequency of $w$ in the corpus, and typically $\alpha = \tfrac{3}{4}$.
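
A sketch of how the noise distribution and the negative draws could be implemented, assuming `counts` holds raw unigram counts:

```python
import numpy as np

def noise_distribution(counts, alpha=0.75):
    """q(w) proportional to p(w)^alpha, where p(w) is the relative word frequency."""
    q = np.asarray(counts, dtype=float) ** alpha
    return q / q.sum()

def sample_negatives(q, r, rng=None):
    """Draw r negative word ids i.i.d. from q."""
    rng = rng or np.random.default_rng()
    return rng.choice(len(q), size=r, p=q)

q = noise_distribution([50, 30, 15, 5])   # toy unigram counts
print(q, sample_negatives(q, r=5))
```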

Pointwise Mutual Information

Let $p(v, w)$ be the true distribution of word co-occurrences, $q(v, w) = p(w)\, q(v)$ the distribution used for negative sampling, and $\pi$ the prior probability that a presented pair is a true co-occurrence. The optimal Bayesian classifier that could be achieved by the embedding model is

$$\Pr[(v,w) = \text{true}] = \frac{\pi\, p(v,w)}{\pi\, p(v,w) + (1 - \pi)\, q(v,w)}.$$

Its pre-image under the logistic function is

$$h_{vw} = \sigma^{-1}\!\left(\frac{\pi\, p(v,w)}{\pi\, p(v,w) + (1-\pi)\, q(v,w)}\right) = \log \frac{p(v,w)}{q(v,w)} + \log \frac{\pi}{1 - \pi}.$$
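
A quick numerical check of this identity, with arbitrary values for $p$, $q$, and $\pi$:

```python
import numpy as np

def logit(x):
    """Inverse of the logistic function sigma."""
    return np.log(x / (1.0 - x))

p, q, pi = 0.02, 0.005, 0.25    # arbitrary test values for p(v,w), q(v,w), pi
posterior = pi * p / (pi * p + (1 - pi) * q)
lhs = logit(posterior)
rhs = np.log(p / q) + np.log(pi / (1 - pi))
print(np.isclose(lhs, rhs))     # True
```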

When $\pi = \tfrac{1}{2}$ and $\alpha = 1$, $h_{vw}$ becomes the pointwise mutual information

$$h_{vw} = \log \frac{p(v,w)}{p(v)\, p(w)}.$$

The word2vec approach with balanced negative sampling ($\pi = \tfrac{1}{2}$) and $\alpha = 1$ can thus be interpreted as a method that implicitly estimates the pointwise mutual information of word co-occurrences.
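
Accordingly, the empirical PMI values can be computed directly from a co-occurrence count matrix; a minimal sketch (the indexing `N[v, w]` matches the counts defined above):

```python
import numpy as np

def pmi_matrix(N):
    """Empirical PMI log p(v,w) / (p(v) p(w)) from co-occurrence counts N[v, w];
    entries with N[v, w] = 0 come out as -inf."""
    p_vw = N / N.sum()
    p_v = p_vw.sum(axis=1, keepdims=True)   # marginal over w
    p_w = p_vw.sum(axis=0, keepdims=True)   # marginal over v
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.log(p_vw / (p_v * p_w))
```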

Global Word Vectors (GloVe)

The global word vector objective minimizes the following weighted least-squares criterion:

$$\sum_{v,w : N_{vw} > 0} f(N_{vw}) \big[\log N_{vw} - \log \hat{N}_{vw}\big]^2,$$

where

$$\hat{N}_{vw} := \tilde{p}(v, w; \theta),$$

and $f(N)$ is the weighting function defined by

$$f(N) = \min\left\{1,\ \left(\frac{N}{N_{\max}}\right)^{\alpha}\right\},$$

where typically $\alpha = \tfrac{3}{4}$.

Unnormalized Models

The GloVe objective does not require the model distribution to be normalized. One can simply choose

$$\log \hat{p}(v, w) = \langle \zeta_w, z_v \rangle.$$

Hence GloVe avoids the expensive computation of normalization constants.

GloVe as Matrix Factorization

Defining $U^\top := [\zeta_{w_1}, \dots, \zeta_{w_n}]$ and $V := [z_{w_1}, \dots, z_{w_n}]$, where $n = |\Sigma|$, we can write

$$\log \hat{N} = U V.$$

Therefore, we want to find U and V such that

$$\operatorname*{arg\,min}_{U, V}\ \sum_{i,j : N_{ij} > 0} f(N_{ij}) \big(\ln N_{ij} - (UV)_{ij}\big)^2,$$

which can be optimized using SGD, i.e., sampling a pair $(v, w)$ with $N_{vw} > 0$ at random and performing updates with step size $\eta > 0$:

$$\begin{aligned}
\zeta_w &\leftarrow \zeta_w + 2\eta\, f(N_{vw}) \big(\ln N_{vw} - \langle \zeta_w, z_v \rangle\big)\, z_v, \\
z_v &\leftarrow z_v + 2\eta\, f(N_{vw}) \big(\ln N_{vw} - \langle \zeta_w, z_v \rangle\big)\, \zeta_w.
\end{aligned}$$
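
A minimal SGD sketch of these updates, including the weighting function $f$; the hyperparameters (`eta`, `n_max`, `alpha`, the number of epochs, the initialization scale) are illustrative choices, not prescribed values:

```python
import numpy as np

def glove_sgd(N, m=50, eta=0.05, n_max=100, alpha=0.75, epochs=20, seed=0):
    """SGD on the GloVe objective; N[v, w] are co-occurrence counts."""
    rng = np.random.default_rng(seed)
    n = N.shape[0]
    Z = 0.1 * rng.normal(size=(n, m))       # target embeddings z_v
    Zeta = 0.1 * rng.normal(size=(n, m))    # context embeddings zeta_w
    pairs = np.argwhere(N > 0)              # only pairs with N_vw > 0 enter the objective
    f = lambda x: min(1.0, (x / n_max) ** alpha)
    for _ in range(epochs):
        rng.shuffle(pairs)                  # shuffle the (v, w) pairs
        for v, w in pairs:
            zeta_old = Zeta[w].copy()       # both updates use the old values
            resid = np.log(N[v, w]) - zeta_old @ Z[v]
            step = 2 * eta * f(N[v, w]) * resid
            Zeta[w] = zeta_old + step * Z[v]
            Z[v] = Z[v] + step * zeta_old
    return Z, Zeta
```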

Word Analogy Problems

One can use GloVe embeddings to solve word analogy problems. For instance, the analogy "man is to king as woman is to $w$" can be solved by

$$w = \operatorname*{arg\,max}_{v}\ \langle \zeta_{\text{king}} - \zeta_{\text{man}} + \zeta_{\text{woman}},\ \zeta_v \rangle.$$
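
A sketch of such an analogy query, assuming a hypothetical matrix `Zeta` of embeddings (one row per word) and a `vocab` list mapping row indices to word strings; excluding the query words themselves is a common practical tweak that is not part of the formula above:

```python
import numpy as np

def analogy(Zeta, vocab, a, b, c):
    """Return the word w maximizing <zeta_a - zeta_b + zeta_c, zeta_v>,
    e.g. analogy(Zeta, vocab, "king", "man", "woman") should ideally give "queen"."""
    idx = {word: i for i, word in enumerate(vocab)}
    query = Zeta[idx[a]] - Zeta[idx[b]] + Zeta[idx[c]]
    scores = Zeta @ query
    scores[[idx[a], idx[b], idx[c]]] = -np.inf   # the query words themselves often win otherwise
    return vocab[int(np.argmax(scores))]
```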