Consider a discrete random variable $X$ taking values in $\mathcal{X}$. Its entropy is defined as:
$$ H(X)=-\sum_{x\in\mathcal{X}}p(x)\log p(x) $$
or equivalently, written as an expectation:
$$ H(X)=\mathbb{E}\left[\log\frac{1}{p(X)}\right] $$
<aside> 💡 The entropy is the expected number of bits we need to encode the value of a random variable if we do it in the shortest possible way. If the entropy is high, we need a lot of bits, meaning that outcomes are generally surprising (packed with information). Hence we think of entropy as a measure of information!
</aside>
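To make the definition concrete, here is a minimal Python sketch (the `entropy` helper and the example distributions are my own, not from the text) that evaluates the sum above for a finite distribution:

```python
import math

def entropy(p, base=2):
    """Shannon entropy H(X) = -sum_x p(x) log p(x), in bits by default."""
    return -sum(px * math.log(px, base) for px in p if px > 0)

# A fair 4-sided die: every outcome is equally surprising and costs log2(4) = 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0

# A skewed distribution is less surprising on average, so its entropy is lower.
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36
```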
Examples

Let $X$ be a Bernoulli random variable with $P(X=1)=p$. Then
$$ H(X)=-p\log p-(1-p)\log(1-p)=:H(p) $$
We call this the binary entropy function. It is concave in $p$, equals $0$ at $p=0$ and $p=1$, and reaches its maximum of $1$ bit at $p=1/2$.
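A short numerical sketch (using the formula above; the `binary_entropy` name is mine) confirms this shape:

```python
import math

def binary_entropy(p):
    """H(p) = -p log2(p) - (1-p) log2(1-p), with the convention H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"H({p}) = {binary_entropy(p):.3f} bits")
# Values rise to the maximum of 1 bit at p = 0.5 and fall off symmetrically.
```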
The first important property of entropy is that it is non-negative:
<aside> 💡 Entropy is non-negative: $H(X) \geq 0$. This follows because $0 \le p(x) \le 1$ for every $x$, so each term $p(x)\log\frac{1}{p(x)}$ in the sum is non-negative.
</aside>
We can extend the definition of entropy to multiple random variables. The joint entropy of a pair of discrete random variables $(X,Y)$ with a joint distribution $p(x,y)$ is defined as:
$$ H(X,Y)=-\sum_{x,y}p(x,y)\log p(x,y)=\mathbb{E}\left[\log\frac{1}{p(X,Y)}\right] $$
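As a sketch, the joint entropy is just the entropy of the joint distribution treated as a single distribution over pairs; the table below is a made-up example, not from the text:

```python
import math

# Hypothetical joint distribution p(x, y) over X, Y in {0, 1}.
p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

# H(X, Y) = -sum_{x,y} p(x,y) log2 p(x,y)
H_XY = -sum(p * math.log2(p) for p in p_xy.values() if p > 0)
print(H_XY)  # 1.75 bits
```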
The conditional entropy $H(Y\mid X)$ is the entropy of $Y$ given the value of $X$, averaged over the distribution of $X$:
$$ H(Y\mid X)=\mathbb{E}\left[\log\frac{1}{p(Y\mid X)}\right]=\sum_{x\in\mathcal{X}}p(x)\,H(Y\mid X=x) $$
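Continuing the same made-up joint distribution, the sketch below evaluates $H(Y\mid X)$ both ways, as the expectation of $\log\frac{1}{p(Y\mid X)}$ and as the weighted average $\sum_x p(x)\,H(Y\mid X=x)$; the two agree, as they must:

```python
import math

# Same hypothetical joint distribution as above.
p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

# Marginal p(x).
p_x = {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p

# Expectation form: H(Y|X) = E[log 1/p(Y|X)] = -sum_{x,y} p(x,y) log2 p(y|x).
H_expect = -sum(p * math.log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)

# Averaging form: H(Y|X) = sum_x p(x) H(Y | X = x).
H_avg = 0.0
for x, px in p_x.items():
    cond = [p / px for (xx, y), p in p_xy.items() if xx == x]
    H_avg += px * -sum(q * math.log2(q) for q in cond if q > 0)

print(H_expect, H_avg)  # both ~0.939 bits
```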