(1/2) Same Story (MLE), Different Endings: Mean Square Error, Cross Entropy, KL Divergence

Mathematically Proving That They Are All the Same

Kowshik chilamkurthy
5 min read · Aug 1, 2022



Data scientists and machine learning practitioners often overlook the mathematical and intuitive relationships between Maximum Likelihood Estimation and common loss functions such as Negative Log Likelihood, Cross Entropy, Kullback-Leibler (KL) divergence, and, most importantly, Mean Square Error. Would you be surprised if I said that KL divergence and Mean Square Error are mathematically the same?
As a seasoned data scientist, I am surprised that these mathematical relations do not get the emphasis they deserve in AI/ML courses and textbooks. In this blog, I aim to establish the mathematical and intuitive relations between these losses, which are used in different problems such as classification, regression, and GANs.

This blog should help data scientists deepen their understanding of these loss functions, and help aspiring data scientists crack machine learning interviews.

The Mother of All Loss Functions: Maximum Likelihood Estimation

The maximum likelihood method is used for parameter estimation. In machine learning, each model has its own set of parameters. For example, in the linear model y = mx + c, the weight (slope) m and the intercept c are the parameters that ultimately define the model.

Now the challenge is to find the model parameters from the given data. Maximum likelihood estimation is a method that determines values for these parameters. But how is it done? Intuitively, the parameter values are chosen so that they maximize the likelihood of the observed data under the model, that is, so that what the model predicts is as close as possible to what was observed.

The parameter set in the total search space that maximizes the likelihood function is called the maximum likelihood estimate.

Math Behind MLE

The logic of maximum likelihood is both intuitive and flexible. The math is simple and elegant, so just follow along.

1: Let’s assume that we want to build a model with parameters θ = [θ₀, θ₁, θ₂, θ₃ … θₙ]^T. For example, in the linear regression model (y = mx + c), θ = [m, c]. The set 𝞗 of all possible values of θ is called the parameter space. In the linear regression case, 𝞗 is the search space of all candidate combinations [(m₁, c₁), (m₂, c₂), (m₃, c₃) … ].

2: The goal of maximum likelihood estimation is to determine the best parameters θₖ ∈ 𝞗. For example, in linear regression, θₖ is the best pair (m, c).

3: The way to find the right parameter set θₖ is to use the likelihood function. The concept is simple if carefully understood. Let’s take our linear model (y = mx + c) again, with a given data point (xₚ, yₚ) and parameters θ = (m, c).

4: The probability density function (PDF) f(yₚ; θₖ) tells us how probable the observed label yₚ is under the model with parameters θₖ. Simple, right? If you flip a coin and see heads, f(Head) tells us how likely it is that you will see heads.

Probability Density Function: f(yₚ; θₖ)

5: f(yₚ; θₖ) is for a single data point p, but we need to evaluate this function for all the data points (y₀, y₁, y₂, y₃ … yₙ). How do we do this? We can use the joint probability distribution to take all data points into consideration.

Joint Probability Distribution: f(y; θₖ) = ∏ₚ f(yₚ; θₖ)


For independent and identically distributed random variables, the joint probability distribution f(y; θₖ) is the product of the univariate density functions f(yₚ; θₖ) over all data points p.
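The product form can be sketched in a few lines of Python. This is only an illustration under an assumption the text has not made: each yₚ is taken to be normally distributed around the line m·xₚ + c with a fixed standard deviation sigma. The helper names (normal_pdf, joint_density) and the toy data are made up for this example.

```python
import math

def normal_pdf(y, mean, sigma):
    """Density of a normal distribution N(mean, sigma^2) evaluated at y."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def joint_density(xs, ys, m, c, sigma=1.0):
    """Joint density of i.i.d. points: the product of the per-point densities.

    Assumption (not from the article): Gaussian noise around the line m*x + c.
    """
    joint = 1.0
    for x, y in zip(xs, ys):
        joint *= normal_pdf(y, m * x + c, sigma)
    return joint

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]   # toy data, roughly y = 2x + 1

good = joint_density(xs, ys, m=2.0, c=1.0)   # parameters close to the data
bad = joint_density(xs, ys, m=0.0, c=0.0)    # parameters far from the data
print(good > bad)   # True
```

Parameters that sit close to the data give a much larger joint density than parameters that sit far from it, which is exactly what the likelihood function in the next step exploits.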

6: For a given parameter θₖ, the joint density function f(y; θₖ) tells us how likely we are to see exactly the y values that were observed. Now reverse the situation: the observed y is fixed, and we want to find the θₖ under which those observed values are most likely. The joint density function, read this way as a function of θ, is called the likelihood function.

Likelihood Function: L(θ; y) = f(y; θ)

7: So we search the whole parameter space θ ∈ 𝞗, and the specific value θₖ that maximizes the likelihood function is called the Maximum Likelihood Estimate (MLE).

Maximum Likelihood Estimate: θₖ = argmax over θ ∈ 𝞗 of the likelihood L(θ; y)
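The search in step 7 can be sketched as a brute-force scan. Everything here is an illustrative assumption rather than the method used in practice: the noise is taken to be Gaussian, the parameter space is replaced by a coarse grid of (m, c) pairs, and the data are generated exactly by y = 2x + 1.

```python
import math

def likelihood(xs, ys, m, c, sigma=1.0):
    """L(theta; y): product of Gaussian densities of each y_p around m*x_p + c.

    Assumption (not from the article): Gaussian noise with fixed sigma.
    """
    value = 1.0
    for x, y in zip(xs, ys):
        z = (y - (m * x + c)) / sigma
        value *= math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return value

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # toy data generated exactly by y = 2x + 1

# A coarse, finite stand-in for the parameter space: m and c in {0.0, 0.5, ..., 4.0}
grid = [i * 0.5 for i in range(9)]
candidates = [(m, c) for m in grid for c in grid]

# The maximum likelihood estimate is the candidate with the largest likelihood.
theta_hat = max(candidates, key=lambda theta: likelihood(xs, ys, *theta))
print(theta_hat)   # (2.0, 1.0)
```

A real parameter space is continuous, so in practice the maximum is found analytically or with an optimizer instead of a grid, but the idea of "try θ, score it with the likelihood, keep the best" is the same.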

8: In practice, it is often convenient to work with the natural logarithm of the likelihood function L(θ; y), called the log-likelihood:

ℓ(θ; y) = ln L(θ; y) = Σₚ ln f(yₚ; θ)


Maximizing the log-likelihood is the same as maximizing the likelihood. Since ‘log’ is a strictly increasing function, the value of θ that maximizes the log-likelihood function also maximizes the likelihood function.
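A small sketch can make both points concrete: the two functions peak at the same parameter value, and the log version is also numerically safer. It uses the coin-flip setting from step 4 as an assumed model (each flip is heads with probability p), and the function names and data are invented for the example.

```python
import math

def likelihood(flips, p):
    """Product of per-flip probabilities: p for heads (1), 1 - p for tails (0)."""
    value = 1.0
    for flip in flips:
        value *= p if flip == 1 else 1.0 - p
    return value

def log_likelihood(flips, p):
    """Sum of log-probabilities; the logarithm turns the product into a sum."""
    return sum(math.log(p if flip == 1 else 1.0 - p) for flip in flips)

candidates = [i / 10 for i in range(1, 10)]   # p in {0.1, ..., 0.9}

# Small data set: 7 heads, 3 tails. Both criteria pick the same p.
small = [1] * 7 + [0] * 3
best_by_l = max(candidates, key=lambda p: likelihood(small, p))
best_by_ll = max(candidates, key=lambda p: log_likelihood(small, p))
print(best_by_l, best_by_ll)   # 0.7 0.7

# Large data set: the raw product underflows to 0.0, the log-sum stays finite.
large = [1] * 1200 + [0] * 800
underflowed = likelihood(large, 0.6)
stable = log_likelihood(large, 0.6)
print(underflowed, math.isfinite(stable))   # 0.0 True
```

With thousands of data points, the product of many numbers below 1 is too small for floating point, which is one practical reason the log-likelihood is preferred.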

Loss: Negative Log Likelihood (Teaser)

Before concluding the blog, let me give a teaser: the loss that follows most directly from MLE is the negative log-likelihood. It is a loss function commonly used in multi-class classification. Losses are generally minimized, so we put a negative sign in front of the log-likelihood, which gives the negative log-likelihood loss, −Σₚ ln f(yₚ; θ). Minimizing the negative log-likelihood loss therefore yields the maximum likelihood estimate.
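As a rough sketch of what this loss looks like for multi-class classification, assume the model outputs one probability per class for every data point, and the loss adds up −log of the probability given to the true class. The function name, the three toy data points, and the probability values are all made up for illustration.

```python
import math

def negative_log_likelihood(predicted_probs, labels):
    """Sum over data points of -log(probability the model gave the true class)."""
    return -sum(math.log(probs[label]) for probs, label in zip(predicted_probs, labels))

labels = [0, 1, 2]   # true class index of each of three toy data points

# Hypothetical model outputs: one row of class probabilities per data point.
confident = [[0.8, 0.1, 0.1], [0.1, 0.7, 0.2], [0.2, 0.2, 0.6]]
uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in labels]

nll_confident = negative_log_likelihood(confident, labels)
nll_uniform = negative_log_likelihood(uniform, labels)
print(nll_confident < nll_uniform)   # True
```

The model that puts more probability on the true classes gets the lower loss, so minimizing this quantity pushes the model toward the maximum likelihood estimate.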


Almost all common loss functions can be derived from maximum likelihood estimation. In my next article, we will see how they can be derived mathematically and appreciate the similarities between these seemingly different loss functions used in regression, classification, and GANs.

Thanks for your time!


