Cross-entropy is a measure of error, while mutual information measures the shared information between two variables. Both concepts come from information theory, but they serve different purposes and are applied in different contexts.
Let’s understand both in complete detail.
Cross-Entropy
Cross-entropy measures the difference between two probability distributions. Specifically, it quantifies the average code length (for example, in bits) needed to encode samples drawn from a true distribution (P) using a code optimized for an estimated distribution (Q).
Cross-entropy tells you how “off” your predictions are compared to the true outcomes. For example: If you’re trying to guess the weather (sunny or rainy), and you predict sunny with 90% confidence but it’s actually rainy, cross-entropy would give you a score that shows how far off your prediction was.
It’s often used as a loss function in machine learning models to help improve predictions by showing how wrong they are.
Mathematical Formulation:
$$ H(P, Q) = -\sum_x P(x) \log Q(x) $$
Here, (P(x)) is the true distribution, and (Q(x)) is the estimated distribution.
Cross-entropy is commonly used as a loss function in machine learning, particularly in classification tasks where it measures how well the predicted probability distribution (from the model) matches the true distribution (ground truth).
Interpretation: A lower cross-entropy value indicates that the predicted distribution (Q) is closer to the true distribution (P).
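To make this concrete, here is a minimal Python/NumPy sketch of the formula above. The helper name `cross_entropy`, the small `eps` guard against log(0), and the choice of base-2 logarithms (so the result is in bits) are illustrative assumptions, not anything prescribed by the formula itself.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(P, Q) = -sum_x P(x) * log2(Q(x)), reported in bits."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # eps guards against log(0) when Q assigns zero probability to an outcome.
    return float(-np.sum(p * np.log2(q + eps)))

# True distribution puts all its mass on class 1; the model predicts 20% / 80%.
print(cross_entropy([0.0, 1.0], [0.2, 0.8]))  # ~0.322 bits
```

The closer the predicted distribution gets to the true one, the smaller this value becomes.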
Mutual Information
Mutual information measures the amount of information shared between two random variables. It quantifies the reduction in uncertainty about one variable given knowledge of the other.
It tells you how much knowing one thing helps you know another.
For example: If you know it’s raining, mutual information tells you how much that knowledge helps you predict if people are carrying umbrellas. If everyone carries an umbrella when it rains, the mutual information is high.
In simple terms, mutual information is used to see how strongly two things are related, or to pick out important features in data.
Mathematical Formulation:
$$
I(X; Y) = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x)P(y)}
$$
Here, (P(x, y)) is the joint distribution of (X) and (Y), and (P(x)) and (P(y)) are the marginal distributions.
Mutual information is used in feature selection, image registration, clustering, and other areas where understanding the dependence or relationship between variables is crucial.
Mutual information is non-negative, with a higher value indicating a stronger relationship between the variables. If (X) and (Y) are independent, their mutual information is zero.
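As a sketch of how this formula can be computed from a joint probability table, here is a small Python/NumPy function. The function name `mutual_information`, the 2×2 example table, and the base-2 logarithm are assumptions made for this illustration.

```python
import numpy as np

def mutual_information(joint):
    """I(X; Y) = sum_{x,y} P(x,y) * log2(P(x,y) / (P(x) P(y))), in bits."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal P(x), column vector
    py = joint.sum(axis=0, keepdims=True)   # marginal P(y), row vector
    # Terms with P(x, y) = 0 contribute nothing to the sum, so mask them out.
    nonzero = joint > 0
    terms = joint[nonzero] * np.log2(joint[nonzero] / (px * py)[nonzero])
    return float(terms.sum())

# A small 2x2 joint distribution (the rain/umbrella table used later on).
joint = [[0.4, 0.1],
         [0.1, 0.4]]
print(mutual_information(joint))  # ~0.278 bits
```

If the joint table factorized exactly as P(x)P(y), every log term would be zero and the result would be 0, matching the independence case described above.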
Key Differences
- Cross-Entropy: Measures the difference between two distributions and is often used as a loss function in classification problems. It is asymmetric: it depends on the order of ( P ) and ( Q ), so ( H(P, Q) \neq H(Q, P) ) in general.
- Mutual Information: Measures the shared information between two variables. It is used to quantify the dependency between variables and is often employed in feature selection and information retrieval. It is symmetric: ( I(X; Y) = I(Y; X) ). The short sketch after this list illustrates both properties.
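A quick numerical sketch of these two properties; the distributions and the joint table below are hypothetical, chosen only to make the point.

```python
import numpy as np

# Cross-entropy is asymmetric: swapping the two distributions changes the value.
p = np.array([0.9, 0.1])
q = np.array([0.6, 0.4])
print(-np.sum(p * np.log2(q)))  # H(P, Q) ~ 0.80 bits
print(-np.sum(q * np.log2(p)))  # H(Q, P) ~ 1.42 bits -> not the same

# Mutual information is symmetric: transposing the joint table (swapping the
# roles of X and Y) leaves the result unchanged.
def mi_bits(joint):
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    return float(np.sum(joint * np.log2(joint / (px * py))))

joint = np.array([[0.3, 0.2],
                  [0.1, 0.4]])
print(mi_bits(joint), mi_bits(joint.T))  # identical values
```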
Let’s explore both concepts with simple mathematical examples.
Cross-Entropy Example Calculation
Imagine you have a binary classification problem where you’re predicting whether an email is spam (1) or not spam (0). You have the true label ( P ) and your model’s predicted probability ( Q ).
Data:
- True label ( P(y) ): 1 (the email is spam).
- Predicted probability ( Q(y) ): 0.8 (your model predicts it’s spam with 80% confidence).
The cross-entropy for a binary classification problem is given by:
$$
H(P, Q) = -[P(y) \log Q(y) + (1 - P(y)) \log (1 - Q(y))]
$$
Substituting the values (using base-2 logarithms, so the result is in bits):
$$
H(P, Q) = -[1 \cdot \log(0.8) + 0 \cdot \log(0.2)]
$$
$$
H(P, Q) = -[\log(0.8)]
$$
$$
H(P, Q) \approx -[-0.322] = 0.322 \text{ bits}
$$
So, the cross-entropy is about 0.322 bits, which is the penalty the model pays for predicting the correct class with only 80% confidence.
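The same number can be checked with a couple of lines of Python (base-2 logarithm assumed, so the result is in bits):

```python
import math

# Spam example above: true label y = 1, predicted probability q = 0.8.
y, q = 1.0, 0.8
h = -(y * math.log2(q) + (1 - y) * math.log2(1 - q))
print(round(h, 3))  # 0.322
```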
Mutual Information Example Calculation
Now, let’s consider two random variables ( X ) and ( Y ):
- ( X ): Whether it rains (0 = no, 1 = yes).
- ( Y ): Whether people carry an umbrella (0 = no, 1 = yes).
Data (Joint Distribution):
- Probability that it rains and people carry an umbrella: ( P(X=1, Y=1) = 0.4 ).
- Probability that it rains and people don’t carry an umbrella: ( P(X=1, Y=0) = 0.1 ).
- Probability that it doesn’t rain and people carry an umbrella: ( P(X=0, Y=1) = 0.1 ).
- Probability that it doesn’t rain and people don’t carry an umbrella: ( P(X=0, Y=0) = 0.4 ).
Marginal Probabilities:
- ( P(X=1) = P(X=1, Y=1) + P(X=1, Y=0) = 0.4 + 0.1 = 0.5 ) (Probability of rain).
- ( P(Y=1) = P(X=1, Y=1) + P(X=0, Y=1) = 0.4 + 0.1 = 0.5 ) (Probability of carrying an umbrella).
Mutual Information Calculation:
Mutual information ( I(X; Y) ) is calculated as:
$$
I(X; Y) = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x)P(y)}
$$
Let’s compute it step by step, using base-2 logarithms so the results are in bits:
- For ( X=1, Y=1 ):
$$
P(X=1, Y=1) \log \frac{P(X=1, Y=1)}{P(X=1)P(Y=1)} = 0.4 \log \frac{0.4}{0.5 \times 0.5} = 0.4 \log \frac{0.4}{0.25} = 0.4 \log 1.6
$$
$$
= 0.4 \times 0.678 \approx 0.271 \text{ bits}
$$
- Similarly, calculate the terms for ( (X=1, Y=0) ), ( (X=0, Y=1) ), and ( (X=0, Y=0) ).
Summing these up gives you the total mutual information. Let’s do that:
- For ( X=1, Y=0 ): ( 0.1 \log \frac{0.1}{0.5 \times 0.5} = 0.1 \log 0.4 \approx -0.132 ) bits.
- For ( X=0, Y=1 ): ( 0.1 \log \frac{0.1}{0.5 \times 0.5} \approx -0.132 ) bits.
- For ( X=0, Y=0 ): ( 0.4 \log \frac{0.4}{0.5 \times 0.5} \approx 0.271 ) bits.
Adding these up:
$$
I(X; Y) \approx 0.271 - 0.132 - 0.132 + 0.271 = 0.278 \text{ bits}
$$
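As a sanity check, here is a short Python computation of the same quantity from the joint table above (base-2 logarithms assumed):

```python
import math

# Joint probabilities from the rain/umbrella example above.
joint = {(1, 1): 0.4, (1, 0): 0.1, (0, 1): 0.1, (0, 0): 0.4}

# Marginals P(x) and P(y), obtained by summing out the other variable.
px = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
py = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}

mi = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())
print(round(mi, 3))  # 0.278 bits
```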
Summary
The cross-entropy example shows how much “wrongness” there is in a prediction of a given variable.
The mutual information example shows how much knowing one variable helps in predicting the other.
That is, cross-entropy is a measure of error, while mutual information measures the shared information between two variables.