Entropy

Intuition

We can measure the "surprise" of an event $A$ using the negative log-probability $-\log P(A)$.

Definition

Given a random variable $X$, we define the entropy as the expectation of the surprise:

$$H(p_X) = \mathbb{E}_p\!\left[\log\frac{1}{p_X(x)}\right] = \begin{cases} \sum_{x \in \mathcal{X}} p_X(x)\log\frac{1}{p_X(x)}, & \text{for discrete } p; \\[4pt] \int_{\mathcal{X}} p_X(x)\log\frac{1}{p_X(x)}\,dx, & \text{for continuous } p. \end{cases}$$

We usually only sum/integrate over $x$ with $p_X(x) > 0$.
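For the discrete case, here is a minimal sketch (plain NumPy; the function name `entropy` and the nats-by-default convention are my own choices, not from these notes):

```python
import numpy as np

def entropy(p, base=None):
    """Entropy of a discrete distribution given as an array of probabilities.

    Terms with p(x) = 0 are skipped, matching the convention above.
    Natural log (nats) by default; pass base=2 for bits.
    """
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # only sum where p_X(x) > 0
    h = np.sum(p * np.log(1.0 / p))       # H = sum p(x) log(1 / p(x))
    return h / np.log(base) if base else h

print(entropy([0.5, 0.5]))                # fair coin: log 2 ≈ 0.6931 nats
print(entropy([0.5, 0.5], base=2))        # 1.0 bit
print(entropy([1.0, 0.0]))                # deterministic outcome: 0.0
```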

Interpretation

The entropy can be interpreted as a measure of uncertainty: the higher the entropy, the more uncertain (less predictable) the value of $X$ is.

Properties

For discrete $p$: the entropy is non-negative, $H(p) \ge 0$, with equality when $X$ is deterministic.

For continuous $p$: the (differential) entropy can be negative and is not invariant under a change of variables.

Examples

Bernoulli Distribution

For a Bernoulli distribution parameterized by $\theta$, if $X \sim \mathrm{Bern}(\theta)$ such that $P(X=1) = \theta$ and $P(X=0) = 1 - \theta$:

$$H(p) = -\theta\log\theta - (1-\theta)\log(1-\theta).$$

$H(p)$ is maximized when $\theta = \frac{1}{2}$.
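To see this, one can set the derivative to zero (a standard calculus step added here, not spelled out in the original):

$$\frac{dH}{d\theta} = -\log\theta - 1 + \log(1-\theta) + 1 = \log\frac{1-\theta}{\theta} = 0 \iff \theta = \frac{1}{2}.$$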

Gaussian Distribution

The entropy of a $d$-dimensional Gaussian is:

$$H(\mathcal{N}(\mu, \Sigma)) = \frac{1}{2}\log\det(2\pi e\Sigma) = \frac{d}{2} + \frac{d}{2}\log(2\pi) + \frac{1}{2}\log\det(\Sigma).$$

In the univariate case,

$$H(\mathcal{N}(\mu, \sigma^2)) = \frac{1}{2}\log(2\pi e\sigma^2).$$
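As a quick numerical sanity check (a sketch assuming SciPy is available; `scipy.stats.norm` and `scipy.stats.multivariate_normal` expose an `entropy()` method returning values in nats):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Univariate: H(N(mu, sigma^2)) = 0.5 * log(2 * pi * e * sigma^2)
sigma = 1.7
closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(closed_form, norm(loc=0.0, scale=sigma).entropy())   # both ≈ 1.95 nats

# Multivariate: H(N(mu, Sigma)) = 0.5 * log det(2 * pi * e * Sigma)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
d = Sigma.shape[0]
closed_form = 0.5 * np.log(np.linalg.det(2 * np.pi * np.e * Sigma))
print(closed_form, multivariate_normal(mean=np.zeros(d), cov=Sigma).entropy())
```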

Entropy in Information Theory

Entropy can be used to calculate the smallest expected amount of information required to convey a message.

Consider the transmission of sequences over a character set $\mathcal{X}$. Given the frequency $p(X)$ of each character $X \in \mathcal{X}$, the minimum expected number of bits to encode a character is

$$-\sum_{X \in \mathcal{X}} p(X)\log_2 p(X),$$

where each character is encoded with $-\log_2 p(X)$ bits.

Suppose we want to encode a sequence of 4 characters A, B, C, D with equal frequency 0.25. If we encode each character with $-\log_2(0.25) = 2$ bits, the expected number of bits to encode a character reaches this minimum:
$$\sum_{i=1}^{4} 0.25 \times 2 = 2.$$
In this case, A, B, C, D are encoded as 00, 01, 10, 11, respectively.
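A small sketch of this calculation (the helper names `entropy_bits` and `expected_code_length` are mine, chosen to mirror the A, B, C, D example):

```python
import numpy as np

def entropy_bits(freqs):
    """Minimum expected bits per character: -sum p(X) log2 p(X)."""
    p = np.asarray(freqs, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def expected_code_length(freqs, code_lengths):
    """Expected bits per character under a fixed code with the given lengths."""
    return float(np.dot(freqs, code_lengths))

freqs = [0.25, 0.25, 0.25, 0.25]                    # A, B, C, D
print(entropy_bits(freqs))                          # 2.0 bits -- the lower bound
print(expected_code_length(freqs, [2, 2, 2, 2]))    # 2.0, achieved by 00, 01, 10, 11
```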

Cross Entropy

For two distributions p and q, the cross entropy of p relative to q is

$$H(p, q) = \mathbb{E}_p[-\log q] = \begin{cases} -\sum_{x \in \mathcal{X}} p(x)\log q(x), & \text{for discrete } p, q; \\[4pt] -\int_{\mathcal{X}} p(x)\log q(x)\,dx, & \text{for continuous } p, q. \end{cases}$$

The cross entropy can be interpreted as the expected number of bits needed to encode a data sample drawn from $p$ using a code based on $q$.

Suppose we want to encode a sequence of 4 characters A, B, C, D with equal frequency 0.25. If we encode the characters with 0, 10, 110, 111 respectively, this code is optimal for the distribution $q$ that assigns probabilities 0.5, 0.25, 0.125, and 0.125 to A, B, C, D, since an optimal code uses $-\log_2 q(x)$ bits for character $x$. The expected number of bits to encode a character with this encoding is then:
$$-\left[0.25\log_2 0.5 + 0.25\log_2 0.25 + 0.25\log_2 0.125 + 0.25\log_2 0.125\right] = 2.25.$$
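Continuing in the same spirit, a sketch that reproduces the 2.25-bit figure (the helper `cross_entropy_bits` is a hypothetical name of mine):

```python
import numpy as np

def cross_entropy_bits(p, q):
    """Expected bits to encode samples from p with a code that is optimal for q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

p = [0.25, 0.25, 0.25, 0.25]        # true frequencies of A, B, C, D
q = [0.5, 0.25, 0.125, 0.125]       # distribution implied by the code 0, 10, 110, 111
print(cross_entropy_bits(p, q))     # 2.25 bits > H(p) = 2.0 bits
```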

Gibbs' Inequality

The entropy of a distribution $p$ is less than or equal to its cross entropy with any other distribution $q$:

$$H(p) \le H(p, q),$$

where equality holds if and only if $p = q$.

proof

$$H(p, q) - H(p) = \mathbb{E}_p[-\log q] + \mathbb{E}_p[\log p] = \mathbb{E}_p\!\left[\log\frac{p}{q}\right] = -\mathbb{E}_p\!\left[\log\frac{q}{p}\right].$$
Since $-\log(x)$ is a convex function, using [[Jensen's inequality]] and noting that $\mathbb{E}_p\!\left[\frac{q}{p}\right] = \sum_x p(x)\frac{q(x)}{p(x)} = \sum_x q(x) = 1$,
$$-\mathbb{E}_p\!\left[\log\frac{q}{p}\right] \ge -\log\mathbb{E}_p\!\left[\frac{q}{p}\right] = -\log 1 = 0.$$
Therefore,
$$H(p, q) \ge H(p).$$

The difference between $H(p, q)$ and $H(p)$ gives rise to the definition of the KL divergence.
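A quick numerical illustration (my own sketch with randomly drawn categorical distributions, not from the original notes) that the gap $H(p, q) - H(p)$, i.e. the KL divergence, is never negative:

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

# Draw random pairs of categorical distributions and check Gibbs' inequality.
for _ in range(5):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    gap = cross_entropy(p, q) - entropy(p)   # equals KL(p || q)
    assert gap >= -1e-12
    print(f"H(p,q) - H(p) = {gap:.4f} >= 0")
```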

Using Gibbs' inequality, we can also show that the entropy of the uniform categorical distribution upper bounds the entropy of any categorical distribution.

proof

Let the categories be $\mathcal{X} = \{C_1, \ldots, C_n\}$ and let $q$ be the uniform distribution, $q(C_i) = \frac{1}{n}$. For any categorical distribution $p$, using Gibbs' inequality:
$$H(p, q) = -\sum_{i=1}^{n} p(C_i)\log\frac{1}{n} = \log n \ge H(p).$$
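A minimal numerical check of this bound (again my own illustration, using randomly drawn categorical distributions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# The uniform distribution attains the bound log n; random ones stay at or below it.
uniform = np.full(n, 1.0 / n)
print(entropy(uniform), np.log(n))           # both ≈ 1.7918

for _ in range(3):
    p = rng.dirichlet(np.ones(n))
    assert entropy(p) <= np.log(n) + 1e-12
    print(f"H(p) = {entropy(p):.4f} <= log n = {np.log(n):.4f}")
```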