Cross-entropy is a measure of error, while mutual information measures the shared information between two variables. Both concepts come from information theory, but they serve different purposes and are applied in different contexts.
Let’s understand both in complete detail.
Cross-Entropy
Cross-entropy measures the difference between two probability distributions. Specifically, it quantifies the average code length (for example, in bits) needed to encode samples drawn from a true distribution (P) using a code optimized for an estimated distribution (Q).
Cross-entropy tells you how “off” your predictions are compared to the true outcomes. For example: If you’re trying to guess the weather (sunny or rainy), and you predict sunny with 90% confidence but it’s actually rainy, cross-entropy would give you a score that shows how far off your prediction was.
It’s often used as a loss function in machine learning models to help improve predictions by showing how wrong they are.
Mathematical Formulation:
$$ H(P, Q) = -\sum_x P(x) \log Q(x) $$
Here, (P(x)) is the true distribution, and (Q(x)) is the estimated distribution.
Cross-entropy is commonly used as a loss function in machine learning, particularly in classification tasks where it measures how well the predicted probability distribution (from the model) matches the true distribution (ground truth).
Interpretation: A lower cross-entropy value indicates that the predicted distribution (Q) is closer to the true distribution (P).
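To make this concrete, here is a minimal Python/NumPy sketch of the formula above. The helper name `cross_entropy`, the small `eps` guard against log(0), and the choice of base-2 logarithms (so the result is in bits) are illustrative assumptions, not anything prescribed by the formula itself.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(P, Q) = -sum_x P(x) * log2(Q(x)), reported in bits."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # eps guards against log(0) when Q assigns zero probability to an outcome.
    return float(-np.sum(p * np.log2(q + eps)))

# True distribution puts all its mass on class 1; the model predicts 20% / 80%.
print(cross_entropy([0.0, 1.0], [0.2, 0.8]))  # ~0.322 bits
```

The closer the predicted distribution gets to the true one, the smaller this value becomes.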
Mutual Information
Mutual information measures the amount of information shared between two random variables. It quantifies the reduction in uncertainty about one variable given knowledge of the other.
It tells you how much knowing one thing helps you know another.
For example: If you know it’s raining, mutual information tells you how much that knowledge helps you predict if people are carrying umbrellas. If everyone carries an umbrella when it rains, the mutual information is high.
In simple terms, mutual information is used to see how strongly two things are related, or to pick out important features in data.
Mathematical Formulation:
$$
I(X; Y) = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x)P(y)}
$$
Here, (P(x, y)) is the joint distribution of (X) and (Y), and (P(x)) and (P(y)) are the marginal distributions.
Mutual information is used in feature selection, image registration, clustering, and other areas where understanding the dependence or relationship between variables is crucial.
Mutual information is non-negative, with a higher value indicating a stronger relationship between the variables. If (X) and (Y) are independent, their mutual information is zero.
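As a sketch of how this formula can be computed from a joint probability table, here is a small Python/NumPy function. The function name `mutual_information`, the 2×2 example table, and the base-2 logarithm are assumptions made for this illustration.

```python
import numpy as np

def mutual_information(joint):
    """I(X; Y) = sum_{x,y} P(x,y) * log2(P(x,y) / (P(x) P(y))), in bits."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal P(x), column vector
    py = joint.sum(axis=0, keepdims=True)   # marginal P(y), row vector
    # Terms with P(x, y) = 0 contribute nothing to the sum, so mask them out.
    nonzero = joint > 0
    terms = joint[nonzero] * np.log2(joint[nonzero] / (px * py)[nonzero])
    return float(terms.sum())

# A small 2x2 joint distribution (the rain/umbrella table used later on).
joint = [[0.4, 0.1],
         [0.1, 0.4]]
print(mutual_information(joint))  # ~0.278 bits
```

If the joint table factorized exactly as P(x)P(y), every log term would be zero and the result would be 0, matching the independence case described above.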
Key Differences
- Cross-Entropy: Measures the difference between two distributions and is often used as a loss function in classification problems. It is asymmetric: it depends on the order of ( P ) and ( Q ), so ( H(P, Q) \neq H(Q, P) ) in general.
- Mutual Information: Measures the shared information between two variables. It is used to quantify the dependency between variables and is often employed in feature selection and information retrieval. It is symmetric: ( I(X; Y) = I(Y; X) ). The short sketch after this list illustrates both properties.
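A quick numerical sketch of these two properties; the distributions and the joint table below are hypothetical, chosen only to make the point.

```python
import numpy as np

# Cross-entropy is asymmetric: swapping the two distributions changes the value.
p = np.array([0.9, 0.1])
q = np.array([0.6, 0.4])
print(-np.sum(p * np.log2(q)))  # H(P, Q) ~ 0.80 bits
print(-np.sum(q * np.log2(p)))  # H(Q, P) ~ 1.42 bits -> not the same

# Mutual information is symmetric: transposing the joint table (swapping the
# roles of X and Y) leaves the result unchanged.
def mi_bits(joint):
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    return float(np.sum(joint * np.log2(joint / (px * py))))

joint = np.array([[0.3, 0.2],
                  [0.1, 0.4]])
print(mi_bits(joint), mi_bits(joint.T))  # identical values
```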
Let’s explore both concepts with simple mathematical examples.
Cross-Entropy Example Calculation
Imagine you have a binary classification problem where you’re predicting whether an email is spam (1) or not spam (0). You have the true label ( P ) and your model’s predicted probability ( Q ).
Data:
- True label ( P(y) ): 1 (the email is spam).
- Predicted probability ( Q(y) ): 0.8 (your model predicts it’s spam with 80% confidence).
The cross-entropy for a binary classification problem is given by:
$$
H(P, Q) = -[P(y) \log Q(y) + (1 - P(y)) \log (1 - Q(y))]
$$
Substituting the values (using base-2 logarithms, so the result is in bits):
$$
H(P, Q) = -[1 \cdot \log(0.8) + 0 \cdot \log(0.2)]
$$
$$
H(P, Q) = -[\log(0.8)]
$$
$$
H(P, Q) \approx -[-0.322] = 0.322 \text{ bits}
$$
So, the cross-entropy is about 0.322 bits, which is the penalty the model pays for predicting the correct class with only 80% confidence.
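The same number can be checked with a couple of lines of Python (base-2 logarithm assumed, so the result is in bits):

```python
import math

# Spam example above: true label y = 1, predicted probability q = 0.8.
y, q = 1.0, 0.8
h = -(y * math.log2(q) + (1 - y) * math.log2(1 - q))
print(round(h, 3))  # 0.322
```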
Mutual Information Example Calculation
Now, let’s consider two random variables ( X ) and ( Y ):
- ( X ): Whether it rains (0 = no, 1 = yes).
- ( Y ): Whether people carry an umbrella (0 = no, 1 = yes).
Data (Joint Distribution):
- Probability that it rains and people carry an umbrella: ( P(X=1, Y=1) = 0.4 ).
- Probability that it rains and people don’t carry an umbrella: ( P(X=1, Y=0) = 0.1 ).
- Probability that it doesn’t rain and people carry an umbrella: ( P(X=0, Y=1) = 0.1 ).
- Probability that it doesn’t rain and people don’t carry an umbrella: ( P(X=0, Y=0) = 0.4 ).
Marginal Probabilities:
- ( P(X=1) = P(X=1, Y=1) + P(X=1, Y=0) = 0.4 + 0.1 = 0.5 ) (Probability of rain).
- ( P(Y=1) = P(X=1, Y=1) + P(X=0, Y=1) = 0.4 + 0.1 = 0.5 ) (Probability of carrying an umbrella).
Mutual Information Calculation:
Mutual information ( I(X; Y) ) is calculated as:
$$
I(X; Y) = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x)P(y)}
$$
Let’s compute it step by step, using base-2 logarithms so the results are in bits:
- For ( X=1, Y=1 ):
$$
P(X=1, Y=1) \log \frac{P(X=1, Y=1)}{P(X=1)P(Y=1)} = 0.4 \log \frac{0.4}{0.5 \times 0.5} = 0.4 \log \frac{0.4}{0.25} = 0.4 \log 1.6
$$
$$
= 0.4 \times 0.678 \approx 0.271 \text{ bits}
$$
- Similarly, calculate the terms for ( (X=1, Y=0) ), ( (X=0, Y=1) ), and ( (X=0, Y=0) ).
Summing these up gives you the total mutual information. Let’s do that:
- For ( X=1, Y=0 ): ( 0.1 \log \frac{0.1}{0.5 \times 0.5} = 0.1 \log 0.4 \approx -0.132 ) bits.
- For ( X=0, Y=1 ): ( 0.1 \log \frac{0.1}{0.5 \times 0.5} \approx -0.132 ) bits.
- For ( X=0, Y=0 ): ( 0.4 \log \frac{0.4}{0.5 \times 0.5} \approx 0.271 ) bits.
Adding these up:
$$
I(X; Y) \approx 0.271 - 0.132 - 0.132 + 0.271 = 0.278 \text{ bits}
$$
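As a sanity check, here is a short Python computation of the same quantity from the joint table above (base-2 logarithms assumed):

```python
import math

# Joint probabilities from the rain/umbrella example above.
joint = {(1, 1): 0.4, (1, 0): 0.1, (0, 1): 0.1, (0, 0): 0.4}

# Marginals P(x) and P(y), obtained by summing out the other variable.
px = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
py = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}

mi = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())
print(round(mi, 3))  # 0.278 bits
```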
Summary
The cross-entropy example shows how much “wrongness” there is in a prediction of a given variable.
The mutual information example shows how much knowing one variable helps in predicting the other.
That is, cross-entropy is a measure of error, while mutual information measures the shared information between two variables.