Entropy

Intuition

We can measure the "surprise" of an event $A$ using the negative log-probability $-\log P(A)$.

Definition

Given a random variable $X$, we define the entropy as the expectation of the surprise:

$$H(p_X) = \mathbb{E}_p\!\left[\log\frac{1}{p_X(x)}\right] = \begin{cases} \sum_{x \in \mathcal{X}} p_X(x)\log\frac{1}{p_X(x)}, & \text{for discrete } p; \\[4pt] \int_{\mathcal{X}} p_X(x)\log\frac{1}{p_X(x)}\,dx, & \text{for continuous } p. \end{cases}$$

We usually only sum/integrate over $x$ with $p_X(x) > 0$.
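For the discrete case, here is a minimal sketch (plain NumPy; the function name `entropy` and the nats-by-default convention are my own choices, not from these notes):

```python
import numpy as np

def entropy(p, base=None):
    """Entropy of a discrete distribution given as an array of probabilities.

    Terms with p(x) = 0 are skipped, matching the convention above.
    Natural log (nats) by default; pass base=2 for bits.
    """
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # only sum where p_X(x) > 0
    h = np.sum(p * np.log(1.0 / p))       # H = sum p(x) log(1 / p(x))
    return h / np.log(base) if base else h

print(entropy([0.5, 0.5]))                # fair coin: log 2 ≈ 0.6931 nats
print(entropy([0.5, 0.5], base=2))        # 1.0 bit
print(entropy([1.0, 0.0]))                # deterministic outcome: 0.0
```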

Interpretation

The entropy can be interpreted as a measure of uncertainty: the higher the entropy, the more uncertain (less predictable) the value of $X$ is.

Properties

For discrete $p$: the entropy is non-negative, $H(p) \ge 0$, with equality when $X$ is deterministic.

For continuous $p$: the (differential) entropy can be negative and is not invariant under a change of variables.

Examples

Bernoulli Distribution

For a Bernoulli distribution parameterized by $\theta$, if $X \sim \mathrm{Bern}(\theta)$ such that $P(X=1) = \theta$ and $P(X=0) = 1 - \theta$:

$$H(p) = -\theta\log\theta - (1-\theta)\log(1-\theta).$$

$H(p)$ is maximized when $\theta = \frac{1}{2}$.
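To see this, one can set the derivative to zero (a standard calculus step added here, not spelled out in the original):

$$\frac{dH}{d\theta} = -\log\theta - 1 + \log(1-\theta) + 1 = \log\frac{1-\theta}{\theta} = 0 \iff \theta = \frac{1}{2}.$$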

Gaussian Distribution

The entropy of a $d$-dimensional Gaussian is:

$$H(\mathcal{N}(\mu, \Sigma)) = \frac{1}{2}\log\det(2\pi e\Sigma) = \frac{d}{2} + \frac{d}{2}\log(2\pi) + \frac{1}{2}\log\det(\Sigma).$$

In the univariate case,

$$H(\mathcal{N}(\mu, \sigma^2)) = \frac{1}{2}\log(2\pi e\sigma^2).$$
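As a quick numerical sanity check (a sketch assuming SciPy is available; `scipy.stats.norm` and `scipy.stats.multivariate_normal` expose an `entropy()` method returning values in nats):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Univariate: H(N(mu, sigma^2)) = 0.5 * log(2 * pi * e * sigma^2)
sigma = 1.7
closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(closed_form, norm(loc=0.0, scale=sigma).entropy())   # both ≈ 1.95 nats

# Multivariate: H(N(mu, Sigma)) = 0.5 * log det(2 * pi * e * Sigma)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
d = Sigma.shape[0]
closed_form = 0.5 * np.log(np.linalg.det(2 * np.pi * np.e * Sigma))
print(closed_form, multivariate_normal(mean=np.zeros(d), cov=Sigma).entropy())
```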

Entropy in Information Theory

Entropy can be used to calculate the smallest expected amount of information required to convey a message.

Consider the transmission of sequences over a character set $\mathcal{X}$. Given the frequency $p(X)$ of each character $X \in \mathcal{X}$, the minimum expected number of bits to encode a character is

$$-\sum_{X \in \mathcal{X}} p(X)\log_2 p(X),$$

where each character is encoded with $-\log_2 p(X)$ bits.

Suppose we want to encode a sequence of 4 characters A, B, C, D with equal frequency 0.25. If we encode each character with $-\log_2(0.25) = 2$ bits, the expected number of bits to encode a character reaches this minimum:
$$\sum_{i=1}^{4} 0.25 \times 2 = 2.$$
In this case, A, B, C, D are encoded as 00, 01, 10, 11, respectively.
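A small sketch of this calculation (the helper names `entropy_bits` and `expected_code_length` are mine, chosen to mirror the A, B, C, D example):

```python
import numpy as np

def entropy_bits(freqs):
    """Minimum expected bits per character: -sum p(X) log2 p(X)."""
    p = np.asarray(freqs, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def expected_code_length(freqs, code_lengths):
    """Expected bits per character under a fixed code with the given lengths."""
    return float(np.dot(freqs, code_lengths))

freqs = [0.25, 0.25, 0.25, 0.25]                    # A, B, C, D
print(entropy_bits(freqs))                          # 2.0 bits -- the lower bound
print(expected_code_length(freqs, [2, 2, 2, 2]))    # 2.0, achieved by 00, 01, 10, 11
```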

Cross Entropy

For two distributions p and q, the cross entropy of p relative to q is

$$H(p, q) = \mathbb{E}_p[-\log q] = \begin{cases} -\sum_{x \in \mathcal{X}} p(x)\log q(x), & \text{for discrete } p, q; \\[4pt] -\int_{\mathcal{X}} p(x)\log q(x)\,dx, & \text{for continuous } p, q. \end{cases}$$

The cross entropy can be interpreted as the expected number of bits needed to encode a data sample drawn from $p$ using a code based on $q$.

Suppose we want to encode a sequence of 4 characters A, B, C, D with equal frequency 0.25. If we encode the characters with 0, 10, 110, 111 respectively, this code is optimal for the distribution $q$ that assigns probabilities 0.5, 0.25, 0.125, and 0.125 to A, B, C, D, since an optimal code uses $-\log_2 q(x)$ bits for character $x$. The expected number of bits to encode a character with this encoding is then:
$$-\left[0.25\log_2 0.5 + 0.25\log_2 0.25 + 0.25\log_2 0.125 + 0.25\log_2 0.125\right] = 2.25.$$
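Continuing in the same spirit, a sketch that reproduces the 2.25-bit figure (the helper `cross_entropy_bits` is a hypothetical name of mine):

```python
import numpy as np

def cross_entropy_bits(p, q):
    """Expected bits to encode samples from p with a code that is optimal for q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

p = [0.25, 0.25, 0.25, 0.25]        # true frequencies of A, B, C, D
q = [0.5, 0.25, 0.125, 0.125]       # distribution implied by the code 0, 10, 110, 111
print(cross_entropy_bits(p, q))     # 2.25 bits > H(p) = 2.0 bits
```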

Gibbs' Inequality

The entropy of a distribution $p$ is less than or equal to its cross entropy with any other distribution $q$:

$$H(p) \le H(p, q),$$

where equality holds if and only if $p = q$.

proof

$$H(p, q) - H(p) = \mathbb{E}_p[-\log q] + \mathbb{E}_p[\log p] = \mathbb{E}_p\!\left[\log\frac{p}{q}\right] = -\mathbb{E}_p\!\left[\log\frac{q}{p}\right].$$
Since $-\log(x)$ is a convex function, using [[Jensen's inequality]] and noting that $\mathbb{E}_p\!\left[\frac{q}{p}\right] = \sum_x p(x)\frac{q(x)}{p(x)} = \sum_x q(x) = 1$,
$$-\mathbb{E}_p\!\left[\log\frac{q}{p}\right] \ge -\log\mathbb{E}_p\!\left[\frac{q}{p}\right] = -\log 1 = 0.$$
Therefore,
$$H(p, q) \ge H(p).$$

The difference between $H(p, q)$ and $H(p)$ gives rise to the definition of the KL divergence.
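A quick numerical illustration (my own sketch with randomly drawn categorical distributions, not from the original notes) that the gap $H(p, q) - H(p)$, i.e. the KL divergence, is never negative:

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

# Draw random pairs of categorical distributions and check Gibbs' inequality.
for _ in range(5):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    gap = cross_entropy(p, q) - entropy(p)   # equals KL(p || q)
    assert gap >= -1e-12
    print(f"H(p,q) - H(p) = {gap:.4f} >= 0")
```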

Using Gibbs' inequality, we can also show that the entropy of the uniform categorical distribution upper bounds the entropy of any categorical distribution.

proof

Let the categories be $\mathcal{X} = \{C_1, \ldots, C_n\}$ and let $q$ be the uniform distribution, $q(C_i) = \frac{1}{n}$. For any categorical distribution $p$, using Gibbs' inequality:
$$H(p, q) = -\sum_{i=1}^{n} p(C_i)\log\frac{1}{n} = \log n \ge H(p).$$
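A minimal numerical check of this bound (again my own illustration, using randomly drawn categorical distributions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# The uniform distribution attains the bound log n; random ones stay at or below it.
uniform = np.full(n, 1.0 / n)
print(entropy(uniform), np.log(n))           # both ≈ 1.7918

for _ in range(3):
    p = rng.dirichlet(np.ones(n))
    assert entropy(p) <= np.log(n) + 1e-12
    print(f"H(p) = {entropy(p):.4f} <= log n = {np.log(n):.4f}")
```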