Consider a discrete random variable $X$ taking values in $\mathcal{X}$. Its entropy is defined as:
$$ H(X)=-\sum_{x\in\mathcal{X}}p(x)\log p(x) $$
or equivalently, written as an expectation:
$$ H(X)=\mathbb{E}\left[\log\frac{1}{p(X)}\right] $$
<aside> 💡 The entropy is the expected number of bits we need to encode the value of a random variable if we do it in the shortest possible way. If the entropy is high, we need a lot of bits, meaning that outcomes are generally surprising (packed with information). Hence we think of entropy as a measure of information!
</aside>
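To make the definition concrete, here is a minimal Python sketch (the `entropy` helper and the example distributions are my own, not from the text) that evaluates the sum above for a finite distribution:

```python
import math

def entropy(p, base=2):
    """Shannon entropy H(X) = -sum_x p(x) log p(x), in bits by default."""
    return -sum(px * math.log(px, base) for px in p if px > 0)

# A fair 4-sided die: every outcome is equally surprising and costs log2(4) = 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0

# A skewed distribution is less surprising on average, so its entropy is lower.
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36
```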
Examples

Let $X$ be a Bernoulli random variable with $P(X=1)=p$. Then
$$ H(X)=-p\log p-(1-p)\log(1-p)=:H(p) $$
We call this the binary entropy function. It is concave in $p$, equals $0$ at $p=0$ and $p=1$, and reaches its maximum of $1$ bit at $p=1/2$.
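A short numerical sketch (using the formula above; the `binary_entropy` name is mine) confirms this shape:

```python
import math

def binary_entropy(p):
    """H(p) = -p log2(p) - (1-p) log2(1-p), with the convention H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"H({p}) = {binary_entropy(p):.3f} bits")
# Values rise to the maximum of 1 bit at p = 0.5 and fall off symmetrically.
```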
The first important property of entropy is that it is non-negative:
<aside> 💡 Entropy is non-negative: $H(X) \geq 0$. This follows because $0 \le p(x) \le 1$ for every $x$, so each term $p(x)\log\frac{1}{p(x)}$ in the sum is non-negative.
</aside>
We can extend the definition of entropy to multiple random variables. The joint entropy of a pair of discrete random variables $(X,Y)$ with a joint distribution $p(x,y)$ is defined as:
$$ H(X,Y)=-\sum_{x,y}p(x,y)\log p(x,y)=\mathbb{E}\left[\log\frac{1}{p(X,Y)}\right] $$
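As a sketch, the joint entropy is just the entropy of the joint distribution treated as a single distribution over pairs; the table below is a made-up example, not from the text:

```python
import math

# Hypothetical joint distribution p(x, y) over X, Y in {0, 1}.
p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

# H(X, Y) = -sum_{x,y} p(x,y) log2 p(x,y)
H_XY = -sum(p * math.log2(p) for p in p_xy.values() if p > 0)
print(H_XY)  # 1.75 bits
```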
The conditional entropy $H(Y\mid X)$ is the entropy of $Y$ given the value of $X$, averaged over the distribution of $X$:
$$ H(Y\mid X)=\mathbb{E}\left[\log\frac{1}{p(Y\mid X)}\right]=\sum_{x\in\mathcal{X}}p(x)\,H(Y\mid X=x) $$
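Continuing the same made-up joint distribution, the sketch below evaluates $H(Y\mid X)$ both ways, as the expectation of $\log\frac{1}{p(Y\mid X)}$ and as the weighted average $\sum_x p(x)\,H(Y\mid X=x)$; the two agree, as they must:

```python
import math

# Same hypothetical joint distribution as above.
p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

# Marginal p(x).
p_x = {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p

# Expectation form: H(Y|X) = E[log 1/p(Y|X)] = -sum_{x,y} p(x,y) log2 p(y|x).
H_expect = -sum(p * math.log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)

# Averaging form: H(Y|X) = sum_x p(x) H(Y | X = x).
H_avg = 0.0
for x, px in p_x.items():
    cond = [p / px for (xx, y), p in p_xy.items() if xx == x]
    H_avg += px * -sum(q * math.log2(q) for q in cond if q > 0)

print(H_expect, H_avg)  # both ~0.939 bits
```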