Entropy in machine learning
A brief overview of entropy, cross-entropy, and their usefulness in machine learning
Entropy is a familiar concept in physics, where it is used to measure the amount of “disorder” in a system. In 1948, mathematician Claude Shannon extended this concept to information theory in a paper titled “A Mathematical Theory of Communication”. In this article, I’ll give a brief explanation of what entropy is and why it is relevant to machine learning.
What is entropy?
Entropy in information theory is analogous to entropy in thermodynamics. In short, entropy measures the amount of “surprise” or “uncertainty” inherent to a random variable. Formally, if a random variable X is distributed according to a discrete distribution p, its entropy H is given by
$$H(p) = -\sum_{x} p(x) \log p(x)$$
where the sum runs over all possible values x of X.
This formula is somewhat cryptic, so let’s clarify with a simple example. Suppose we have a fair coin with equal probabilities of landing heads (X = 0) or tails (X = 1). The entropy of a single coin flip is
$$H(p) = -\left(0.5 \log_2 0.5 + 0.5 \log_2 0.5\right) = 1 \text{ bit}$$
Note that I have implicitly assumed the logarithm has base 2 to fit Shannon’s interpretation of entropy as bits. According to this interpretation, entropy is the average number of bits of information conveyed by the realization of a random variable (also called a random variate). In the case of our coin flip, there are two variates with equal probability, so the flip conveys one bit of information. Now consider an unfair coin that always lands on heads. The entropy is now
$$H(p) = -\left(1 \log_2 1 + 0 \log_2 0\right) = 0 \text{ bits}$$
where the term 0 log 0 is taken to be 0 by convention.
The unfair coin has only a single variate, so the coin flip conveys zero bits of information. There is no uncertainty about the outcome; the coin always lands on heads.
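To make the two coin examples concrete, here is a minimal Python sketch of the entropy calculation. The function name `entropy` and the choice to represent a distribution as a plain list of probabilities are my own assumptions for illustration, not part of any particular library.

```python
import math


def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution.

    `p` is a sequence of probabilities that sum to 1 (hypothetical helper,
    written for this article).
    """
    # Outcomes with zero probability are skipped, which matches the
    # convention that 0 * log(0) counts as 0.
    return sum(-px * math.log2(px) for px in p if px > 0)


fair_coin = [0.5, 0.5]    # heads and tails equally likely
unfair_coin = [1.0, 0.0]  # always lands on heads

print(entropy(fair_coin))    # one bit of uncertainty
print(entropy(unfair_coin))  # no uncertainty at all
```

Skipping the zero-probability terms is a deliberate choice here: it avoids taking the logarithm of zero while giving the same result as the convention described above.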
Let’s say we have an estimate q for the true distribution p of X. We can measure the cross-entropy of p and q, which is the expected number of bits conveyed by an outcome drawn from p when we assume it follows the distribution q:
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
When p = q, this formula reduces to the regular definition of entropy. It’s also important to note that cross-entropy is never less than entropy, and is strictly greater whenever q differs from p, since the information conveyed by an outcome assuming the wrong distribution is, on average, higher than the information conveyed assuming the right distribution. If you’re familiar with the Kullback–Leibler divergence, you can express the relationship between entropy and cross-entropy as
$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q)$$
where H(p) is the entropy of p, H(p, q) is the cross-entropy of p and q, and the divergence D_KL is never negative.
Let’s go back to our coin example. We’re given a fair coin, but we’re skeptical that the coin is actually fair. That is, we have a coin with probability 0.5 of landing heads (corresponding to the true distribution p), but we believe that it actually has probability 0.7 of landing heads (the estimated distribution q). The cross-entropy of a coin flip is then
$$H(p, q) = -\left(0.5 \log_2 0.7 + 0.5 \log_2 0.3\right) \approx 1.126 \text{ bits}$$
As expected, the cross-entropy is greater than the entropy.
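The skeptical-coin example can be checked numerically with a short sketch like the one below. The helper names (`entropy`, `cross_entropy`, `kl_divergence`) are hypothetical and chosen for this article, and the code assumes q assigns nonzero probability to every outcome that p does.

```python
import math


def entropy(p):
    """Entropy of p in bits, skipping zero-probability outcomes."""
    return sum(-px * math.log2(px) for px in p if px > 0)


def cross_entropy(p, q):
    """Cross-entropy of p and q in bits.

    Assumes q[i] > 0 wherever p[i] > 0; otherwise the value is infinite.
    """
    return sum(-px * math.log2(qx) for px, qx in zip(p, q) if px > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence from q to p in bits (same assumption)."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)


p = [0.5, 0.5]  # true distribution: a fair coin
q = [0.7, 0.3]  # our belief: the coin favors heads

h_p = entropy(p)
h_pq = cross_entropy(p, q)
d_kl = kl_divergence(p, q)

print(round(h_pq, 3))        # cross-entropy, roughly 1.126 bits
print(h_pq > h_p)            # cross-entropy exceeds entropy
print(round(h_p + d_kl, 3))  # entropy plus KL divergence matches the cross-entropy
```

Computing the KL divergence separately, rather than as a difference, makes the last line a genuine check of the relationship between the three quantities.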
Applications in machine learning
I’ve laid out some neat facts about entropy, but it’s not clear why we should care about it. To see why it’s useful, suppose we are building a logistic regression model to classify points into two possible classes, 0 and 1. For each point, the true distribution p puts probability 1 on the point’s true label and 0 on the other class. The model gives us a predicted distribution q for each point. The cross-entropy of p and q measures how far the prediction is from the truth; the closer q is to p, the lower the cross-entropy and the better the model. This means we can use cross-entropy as a loss function to optimize our model. That is, we try to find the parameters w of the model which produce the lowest average binary cross-entropy, or the lowest “surprise” in the variable we are trying to predict. In this simple example, the cross-entropy takes a nice form:
$$H(p, q) = -\left[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right]$$
Here, y represents the true label and y hat is the predicted probability that the point belongs to class 1. The total loss function is simply the average of the cross-entropy across all N samples,
$$L(w) = -\frac{1}{N} \sum_{i=1}^{N} \left[y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)\right]$$
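As an illustration, the averaged loss might be computed along these lines. This is only a sketch under a few assumptions of mine: the function name `binary_cross_entropy` is hypothetical, the predictions are taken to be probabilities of class 1, the natural logarithm is used (which only rescales the loss relative to base 2), and predictions are clipped by a small `eps` to avoid taking the logarithm of zero.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average binary cross-entropy over all samples, in nats.

    y_true: sequence of labels, each 0 or 1.
    y_pred: sequence of predicted probabilities of class 1.
    eps:    clipping constant (an assumption, used to keep log() finite).
    """
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        # Keep the prediction strictly inside (0, 1).
        y_hat = min(max(y_hat, eps), 1 - eps)
        # Only one of the two terms is nonzero for a given label.
        total += -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return total / len(y_true)


labels = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.95]  # predictions close to the labels
hesitant = [0.6, 0.4, 0.55, 0.6]   # predictions closer to 0.5

print(binary_cross_entropy(labels, confident))
print(binary_cross_entropy(labels, hesitant))
```

The confident predictions give the lower loss, which is the behavior an optimizer exploits when it adjusts the parameters w to push the predictions toward the true labels.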
And voilà! We’ve shown that entropy is useful not only in physics but also in information theory and machine learning.