Word Embeddings

Word2Vec

Each word $w$ in the vocabulary $\Sigma$ can be represented by an embedding vector $z_w \in \mathbb{R}^m$. We treat the embedding of each word as a latent variable that predicts co-occurring context words.

Skip-Gram Model

We only consider the co-occurrence of context words in a set of positional displacements:

$$\mathcal{I} = \{-R, \dots, -1, 1, \dots, R\}.$$

Given a sequence of words $x = [x_1, \dots, x_T]$, its log-likelihood is

$$\log p(x; \theta) = \sum_{t=1}^{T} \sum_{\Delta \in \mathcal{I}} \log p(x_{t+\Delta} \mid x_t; \theta).$$

The log-likelihood of a pair of words $v$ and $w$ can be represented by the inner product of the corresponding embedding vectors $z_v$ and $z_w$:

$$\log p(v \mid w) = \langle z_w, z_v \rangle + \text{const}.$$

Therefore,

$$p(v \mid w) = \frac{\exp[\langle z_w, z_v \rangle]}{\sum_{u} \exp[\langle z_w, z_u \rangle]}.$$
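
As a small illustration, the following sketch (in NumPy, with a hypothetical embedding matrix `Z` holding one row $z_w$ per vocabulary word) evaluates this softmax from inner products of embedding vectors:

```python
import numpy as np

def skipgram_probs(Z, w_idx):
    """Softmax distribution p(. | w) over the vocabulary,
    where row u of Z is the embedding z_u."""
    scores = Z @ Z[w_idx]          # <z_w, z_u> for every word u
    scores -= scores.max()         # shift for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# Example: a vocabulary of 5 words embedded in R^3 (random toy values)
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 3))
print(skipgram_probs(Z, w_idx=2))  # non-negative, sums to 1
```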

Refined Model

In a refined model, we introduce explicit biases, and we use different embeddings for the conditioned word and the predicted word:

$$p(v \mid w; \theta) = \frac{\exp[\zeta_w^\top z_v + b_v]}{\sum_{u} \exp[\zeta_w^\top z_u + b_u]},$$

where the parameters associated with each word $w$ are

$$\theta_w = (z_w, \zeta_w, b_w) \in \mathbb{R}^{2m+1}.$$

By defining the pair co-occurrence count within the displacement window:

$$N_{vw} := \big|\{\,t : x_t = w,\ x_{t+\Delta} = v,\ \Delta \in \mathcal{I}\,\}\big|,$$

the final log-likelihood becomes

$$\log p(x; \theta) = \sum_{v,w} N_{vw} \Bigg( \langle \zeta_w, z_v \rangle + b_v - \underbrace{\log \sum_{u \in \Sigma} \exp[\zeta_w^\top z_u + b_u]}_{\text{normalization constant}} \Bigg).$$
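
A minimal sketch of how the counts $N_{vw}$ could be accumulated from a token sequence, assuming the tokens are already mapped to integer ids; the function name and the window radius `R` are illustrative:

```python
import numpy as np

def cooccurrence_counts(tokens, vocab_size, R=2):
    """N[v, w] = number of positions t with x_t = w and x_{t+Delta} = v
    for some Delta in {-R, ..., -1, 1, ..., R}."""
    N = np.zeros((vocab_size, vocab_size), dtype=np.int64)
    T = len(tokens)
    for t, w in enumerate(tokens):
        for delta in range(-R, R + 1):
            if delta == 0 or not (0 <= t + delta < T):
                continue
            v = tokens[t + delta]
            N[v, w] += 1
    return N

# Example: a toy sequence over a vocabulary of 4 word ids
print(cooccurrence_counts([0, 1, 2, 1, 0, 3], vocab_size=4, R=1))
```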

One can optimize this objective with generic first-order methods. However, computing the normalization constant is expensive, since it requires a sum over the entire vocabulary for every conditioning word.
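
To make the bottleneck explicit, the following sketch evaluates the count-weighted log-likelihood of the refined model with hypothetical parameter arrays `Z`, `Zeta`, `b` (rows $z_v$, rows $\zeta_w$, and biases $b_v$); note the log-sum-exp over the whole vocabulary for every conditioning word:

```python
import numpy as np
from scipy.special import logsumexp

def refined_loglik(N, Z, Zeta, b):
    """Count-weighted log-likelihood
    sum_{v,w} N_vw * (<zeta_w, z_v> + b_v - log sum_u exp(<zeta_w, z_u> + b_u))."""
    scores = Zeta @ Z.T + b                              # scores[w, v] = <zeta_w, z_v> + b_v
    log_norm = logsumexp(scores, axis=1, keepdims=True)  # one sum over the whole vocabulary per w
    return float(np.sum(N.T * (scores - log_norm)))      # N[v, w] -> transpose to index [w, v]
```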

Negative Sampling

The complexity bottleneck can be overcome by reducing the problem to binary classification. For an observed pair of words $(v, w)$, we model the probability that the pair is a true co-occurrence with logistic regression:

$$p(v \mid w) = \sigma(\langle \zeta_w, z_v \rangle + b_v),$$

where

$$\sigma(z) = \frac{1}{1 + \exp(-z)}.$$

The total log-likelihood is the sum of the log-likelihoods of positive and negative samples. The positive samples are the set of actually co-occurring pairs:

$$\mathcal{S}^+ = \{(x_t, x_{t+\Delta}) : t = 1, \dots, T,\ \Delta \in \mathcal{I}\}.$$

For negative samples, one pairs each observed word with $r$ words drawn at random from a noise distribution $q$:

$$\mathcal{S}^- = \{(x_t, v_{tj}) : t = 1, \dots, T,\ v_{tj} \stackrel{\text{iid}}{\sim} q,\ j = 1, \dots, r\}.$$

The logistic log-likelihood is given by

$$\ell(x; \theta) = \sum_{(w,v) \in \mathcal{S}^+} \log \sigma(\langle \zeta_w, z_v \rangle + b_v) + \sum_{(w,u) \in \mathcal{S}^-} \log\big(1 - \sigma(\langle \zeta_w, z_u \rangle + b_u)\big).$$
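
A sketch of this objective for a single positive pair and its $r$ sampled negatives, again with hypothetical parameter arrays `Z`, `Zeta`, `b`:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_logistic_loglik(w, v, neg_ids, Z, Zeta, b):
    """log sigma(<zeta_w, z_v> + b_v) + sum_j log(1 - sigma(<zeta_w, z_uj> + b_uj))."""
    pos = np.log(sigmoid(Zeta[w] @ Z[v] + b[v]))
    neg_scores = Z[neg_ids] @ Zeta[w] + b[neg_ids]   # one score per negative word u_j
    neg = np.sum(np.log(1.0 - sigmoid(neg_scores)))
    return pos + neg
```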

In general, the distribution q can be chosen as

$$q(w) \propto p(w)^{\alpha},$$

where $p(w)$ is the relative frequency of $w$ in the corpus, and typically $\alpha = \tfrac{3}{4}$.
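
A sketch of how the noise distribution and the negative draws could be implemented, assuming `counts` holds raw unigram counts:

```python
import numpy as np

def noise_distribution(counts, alpha=0.75):
    """q(w) proportional to p(w)^alpha, where p(w) is the relative word frequency."""
    q = np.asarray(counts, dtype=float) ** alpha
    return q / q.sum()

def sample_negatives(q, r, rng=None):
    """Draw r negative word ids i.i.d. from q."""
    rng = rng or np.random.default_rng()
    return rng.choice(len(q), size=r, p=q)

q = noise_distribution([50, 30, 15, 5])   # toy unigram counts
print(q, sample_negatives(q, r=5))
```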

Pointwise Mutual Information

Let $p(v, w)$ be the true distribution of word co-occurrences, $q(v, w) = p(w)\, q(v)$ the distribution used for negative sampling, and $\pi$ the prior probability that a presented pair is a true co-occurrence. The optimal Bayesian classifier that could be achieved by the embedding model is

$$\Pr[(v,w) = \text{true}] = \frac{\pi\, p(v,w)}{\pi\, p(v,w) + (1 - \pi)\, q(v,w)}.$$

Its pre-image under the logistic function is

$$h_{vw} = \sigma^{-1}\!\left(\frac{\pi\, p(v,w)}{\pi\, p(v,w) + (1-\pi)\, q(v,w)}\right) = \log \frac{p(v,w)}{q(v,w)} + \log \frac{\pi}{1 - \pi}.$$
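
A quick numerical check of this identity, with arbitrary values for $p$, $q$, and $\pi$:

```python
import numpy as np

def logit(x):
    """Inverse of the logistic function sigma."""
    return np.log(x / (1.0 - x))

p, q, pi = 0.02, 0.005, 0.25    # arbitrary test values for p(v,w), q(v,w), pi
posterior = pi * p / (pi * p + (1 - pi) * q)
lhs = logit(posterior)
rhs = np.log(p / q) + np.log(pi / (1 - pi))
print(np.isclose(lhs, rhs))     # True
```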

When $\pi = \tfrac{1}{2}$ and $\alpha = 1$, $h_{vw}$ becomes the pointwise mutual information

$$h_{vw} = \log \frac{p(v,w)}{p(v)\, p(w)}.$$

The word2vec approach with balanced negative sampling ($\pi = \tfrac{1}{2}$) and $\alpha = 1$ can thus be interpreted as a method that implicitly estimates the pointwise mutual information of word co-occurrences.
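
Accordingly, the empirical PMI values can be computed directly from a co-occurrence count matrix; a minimal sketch (the indexing `N[v, w]` matches the counts defined above):

```python
import numpy as np

def pmi_matrix(N):
    """Empirical PMI log p(v,w) / (p(v) p(w)) from co-occurrence counts N[v, w];
    entries with N[v, w] = 0 come out as -inf."""
    p_vw = N / N.sum()
    p_v = p_vw.sum(axis=1, keepdims=True)   # marginal over w
    p_w = p_vw.sum(axis=0, keepdims=True)   # marginal over v
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.log(p_vw / (p_v * p_w))
```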

Global Word Vectors (GloVe)

The global word vector objective minimizes the following weighted least-squares criterion:

$$\sum_{v,w : N_{vw} > 0} f(N_{vw}) \big[\log N_{vw} - \log \hat{N}_{vw}\big]^2,$$

where

$$\hat{N}_{vw} := \tilde{p}(v, w; \theta),$$

and $f(N)$ is the weighting function defined by

$$f(N) = \min\left\{1,\ \left(\frac{N}{N_{\max}}\right)^{\alpha}\right\},$$

where typically $\alpha = \tfrac{3}{4}$.

Unnormalized Models

The GloVe objective does not require the model distribution to be normalized. One can simply choose

$$\log \hat{p}(v, w) = \langle \zeta_w, z_v \rangle.$$

Hence GloVe avoids the expensive computation of normalization constants.

GloVe as Matrix Factorization

Defining $U^\top := [\zeta_{w_1}, \dots, \zeta_{w_n}]$ and $V := [z_{w_1}, \dots, z_{w_n}]$, where $n = |\Sigma|$, we can write

$$\log \hat{N} = U V.$$

Therefore, we want to find U and V such that

$$\operatorname*{arg\,min}_{U, V}\ \sum_{i,j : N_{ij} > 0} f(N_{ij}) \big(\ln N_{ij} - (UV)_{ij}\big)^2,$$

which can be optimized using SGD, i.e., sampling a pair $(v, w)$ with $N_{vw} > 0$ at random and performing updates with step size $\eta > 0$:

$$\begin{aligned}
\zeta_w &\leftarrow \zeta_w + 2\eta\, f(N_{vw}) \big(\ln N_{vw} - \langle \zeta_w, z_v \rangle\big)\, z_v, \\
z_v &\leftarrow z_v + 2\eta\, f(N_{vw}) \big(\ln N_{vw} - \langle \zeta_w, z_v \rangle\big)\, \zeta_w.
\end{aligned}$$
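
A minimal SGD sketch of these updates, including the weighting function $f$; the hyperparameters (`eta`, `n_max`, `alpha`, the number of epochs, the initialization scale) are illustrative choices, not prescribed values:

```python
import numpy as np

def glove_sgd(N, m=50, eta=0.05, n_max=100, alpha=0.75, epochs=20, seed=0):
    """SGD on the GloVe objective; N[v, w] are co-occurrence counts."""
    rng = np.random.default_rng(seed)
    n = N.shape[0]
    Z = 0.1 * rng.normal(size=(n, m))       # target embeddings z_v
    Zeta = 0.1 * rng.normal(size=(n, m))    # context embeddings zeta_w
    pairs = np.argwhere(N > 0)              # only pairs with N_vw > 0 enter the objective
    f = lambda x: min(1.0, (x / n_max) ** alpha)
    for _ in range(epochs):
        rng.shuffle(pairs)                  # shuffle the (v, w) pairs
        for v, w in pairs:
            zeta_old = Zeta[w].copy()       # both updates use the old values
            resid = np.log(N[v, w]) - zeta_old @ Z[v]
            step = 2 * eta * f(N[v, w]) * resid
            Zeta[w] = zeta_old + step * Z[v]
            Z[v] = Z[v] + step * zeta_old
    return Z, Zeta
```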

Word Analogy Problems

One can use GloVe embeddings to solve word analogy problems. For instance, the analogy "man is to king as woman is to $w$" can be solved by

$$w = \operatorname*{arg\,max}_{v}\ \langle \zeta_{\text{king}} - \zeta_{\text{man}} + \zeta_{\text{woman}},\ \zeta_v \rangle.$$
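
A sketch of such an analogy query, assuming a hypothetical matrix `Zeta` of embeddings (one row per word) and a `vocab` list mapping row indices to word strings; excluding the query words themselves is a common practical tweak that is not part of the formula above:

```python
import numpy as np

def analogy(Zeta, vocab, a, b, c):
    """Return the word w maximizing <zeta_a - zeta_b + zeta_c, zeta_v>,
    e.g. analogy(Zeta, vocab, "king", "man", "woman") should ideally give "queen"."""
    idx = {word: i for i, word in enumerate(vocab)}
    query = Zeta[idx[a]] - Zeta[idx[b]] + Zeta[idx[c]]
    scores = Zeta @ query
    scores[[idx[a], idx[b], idx[c]]] = -np.inf   # the query words themselves often win otherwise
    return vocab[int(np.argmax(scores))]
```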