Deep Learning 101: Lesson 14: Loss Functions

Muneeb S. Ahmad
10 min read · Aug 31, 2024


This article is part of the “Deep Learning 101” series. Explore the full series for more insights and in-depth learning here.

Loss functions play a pivotal role in the training of machine learning models. They are mathematical functions that quantify the difference between the values predicted by the model and the actual values in the training data. This difference is commonly known as “loss” or “error.” The primary objective of a machine learning algorithm during the training phase is to minimize this loss, which essentially means improving the accuracy of predictions made by the model.

Key Aspects of Loss Functions:

The following points highlight the importance and role of loss functions in machine learning:

  • Guiding the Learning Process: Loss functions guide the optimization process in learning algorithms. By calculating the error, they provide a clear objective for the algorithms to achieve — minimizing this error.
  • Quantifying Model Accuracy: They offer a quantitative measure of model performance. A lower value of the loss function indicates a model that makes predictions more closely aligned with the actual values.
  • Type Selection: Different types of loss functions are suitable for different types of machine learning problems, like regression, classification, etc. Choosing the right loss function is crucial for model performance.

Relationship Between Loss Functions and Model Performance

The relationship between loss functions and model performance is a direct and significant one. The choice of a loss function can greatly influence the behavior of the learning algorithm and, consequently, the performance of the model. The following points elaborate on how loss functions influence the performance of machine learning models:

  • Convergence to Minima: Certain loss functions can lead to faster convergence during training, meaning the model reaches its optimal state quicker. This can be especially important in large-scale applications where computational efficiency is a concern.
  • Handling Outliers: Some loss functions, like Mean Squared Error, are more sensitive to outliers in the data. Others, like Huber Loss, offer a balance between the sensitivity to outliers and convergence properties.
  • Overfitting vs. Underfitting: The complexity of a loss function can contribute to overfitting or underfitting. Overfitting occurs when a model learns the training data too well, including the noise, while underfitting happens when the model does not capture the underlying pattern in the data.
  • Impact on Probabilistic Interpretation: Loss functions like Cross-Entropy Loss in classification tasks have a probabilistic interpretation. They not only measure the accuracy but also quantify the certainty of predictions.

As we explore the different loss functions in the following sections, we will delve into their mathematical formulations, practical applications, and the nuances that make each of them unique. This exploration will provide a comprehensive understanding of how these functions shape the landscape of machine learning models.

Mean Square Error (MSE)

Mean Square Error (MSE) is a commonly used loss function for regression problems. It measures the average of the squares of the errors — that is, the average squared difference between the estimated values and the actual value.

The MSE is calculated as:

MSE = (1/n) · Σᵢ (yᵢ − ŷᵢ)²

where n is the number of observations, the sum runs over all observations, yᵢ is the actual value of the i-th observation, and ŷᵢ is the corresponding predicted value.

The squaring of the errors has significant implications: it penalizes larger errors more severely than smaller ones, which can be both advantageous and disadvantageous, depending on the context.

MSE is widely used in linear regression and other types of regression analyses where a continuous output is predicted. It’s particularly useful in scenarios where we need to emphasize larger errors more than smaller ones, as the squaring operation magnifies the errors.

Advantages:

  • Sensitivity to Large Errors: Due to the squaring term, MSE is highly sensitive to larger errors. This can be beneficial in cases where avoiding large errors is crucial.
  • Differentiability: MSE is smooth and differentiable. This property makes it easy to calculate gradients for optimization algorithms like gradient descent; the gradient expression is shown just after this list.
  • Analytical Convenience: MSE has a simple and convenient mathematical form, which makes it easy to work with analytically.
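
For instance, the gradient of MSE with respect to a single prediction has the simple closed form ∂MSE/∂ŷᵢ = (2/n) · (ŷᵢ − yᵢ), which is exactly the quantity that gradient-based optimizers propagate backwards during training.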

Limitations:

  • Sensitivity to Outliers: The same sensitivity to large errors can become a disadvantage when dealing with outliers, as it can skew the model performance.
  • Scale Dependency: The value of MSE is not scale-invariant. It depends on the scale of the output variable, making comparisons across different datasets or models challenging.
  • Mean-Oriented: MSE gives an estimate that minimizes the variance of the errors, which may not always be the desirable property, especially if the distribution of data is not symmetric.

Example Calculation

Figure 1: Mean Squared Error (MSE) Example Calculation

The Mean Square Error (MSE) in the above example is calculated from actual and predicted values. It involves squaring the difference between each actual value (y) and predicted value (ŷ) so that errors are always positive, then averaging these squared errors. With the actual values ranging from 3 to 13 and predictions from 1 to 15, the MSE comes out to approximately 1.833, meaning the average squared deviation between predictions and actual values is about 1.833. This metric helps in assessing the accuracy of a predictive model, with a lower MSE indicating a better fit to the observed data.
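
As a quick sanity check, this calculation can be reproduced in a few lines of NumPy. The six value pairs below are illustrative stand-ins chosen to be consistent with the ranges and the ≈1.833 result described above; they are not necessarily the exact numbers shown in Figure 1.

```python
import numpy as np

# Hypothetical data chosen to be consistent with the example above:
# actual values range from 3 to 13, predictions from 1 to 15.
y_true = np.array([3, 5, 7, 9, 11, 13])
y_pred = np.array([1, 5, 8, 10, 12, 15])

# MSE = (1/n) * sum((y - y_hat)^2)
mse = np.mean((y_true - y_pred) ** 2)
print(f"MSE = {mse:.3f}")  # prints 1.833 for this illustrative data
```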

Mean Absolute Error (MAE)

Mean Absolute Error (MAE) is another loss function used to measure accuracy for continuous variables in regression models. Unlike the Mean Square Error, MAE measures the average magnitude of errors in a set of predictions, without considering their direction (positive or negative).

The MAE is calculated as:

MAE = (1/n) · Σᵢ |yᵢ − ŷᵢ|

where n is the number of observations, yᵢ is the actual value of the i-th observation, and ŷᵢ is the corresponding predicted value.

The absolute value of the errors means that all errors are treated equally, regardless of their direction, making MAE less sensitive to outliers compared to MSE.

MAE is widely applied in regression problems where it is important to treat all errors on the same scale. It is particularly beneficial in contexts where outliers are expected but should not significantly influence the model’s performance.

Advantages:

  • Robustness to Outliers: MAE is less sensitive to outliers than MSE. Outliers will not contribute disproportionately to the total error.
  • Interpretability: The MAE is more directly interpretable in terms of average error magnitude, making it easier to explain in layman’s terms.
  • Equal Weighting: All errors are weighted equally, which can be advantageous when each error contributes equally to the overall model performance.

Limitations:

  • Lack of Differentiability: The absolute value function is not differentiable at zero, which can pose problems for certain optimization algorithms that rely on derivatives.
  • Scale Sensitivity: Similar to MSE, MAE is also scale-dependent, and hence comparisons across different scales can be misleading.
  • No Emphasis on Larger Errors: Since all errors are treated equally, MAE does not emphasize larger errors, which can sometimes be a disadvantage, especially in cases where larger errors are particularly undesirable.

Calculation Example:

Figure 2: Mean Absolute Error (MAE) Example Calculation

The Mean Absolute Error (MAE) in the provided example is a measure of the average magnitude of errors between the actual values (y) and the predicted values (ŷ). It is calculated by taking the absolute difference between each actual and predicted value and then averaging these differences, giving the average error without regard to its direction. With actual values ranging from 3 to 13 and predictions from 1 to 15, the MAE is computed as approximately 1.167. This indicates that, on average, the model’s predictions are about 1.167 units away from the actual values. MAE is particularly useful as it gives an even weighting to all errors, providing a straightforward representation of model accuracy without being overly sensitive to outliers.
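
The same check works for MAE. Reusing the illustrative values from the MSE sketch above (again, stand-ins consistent with the reported ≈1.167 rather than the exact numbers in Figure 2):

```python
import numpy as np

# Same hypothetical data as in the MSE sketch above.
y_true = np.array([3, 5, 7, 9, 11, 13])
y_pred = np.array([1, 5, 8, 10, 12, 15])

# MAE = (1/n) * sum(|y - y_hat|)
mae = np.mean(np.abs(y_true - y_pred))
print(f"MAE = {mae:.3f}")  # prints 1.167 for this illustrative data
```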

Huber Loss

The Huber Loss function combines elements of Mean Squared Error (MSE) and Mean Absolute Error (MAE) to create a loss function that is robust to outliers yet still sensitive to small errors. It features a piecewise definition with a threshold parameter δ: for errors no larger than δ, the loss is quadratic, and for larger errors, the loss is linear. This dual nature makes Huber Loss a versatile tool for regression problems with noisy data. The Huber Loss for a single observation is calculated using the formula below:

Lδ(y, ŷ) = ½(y − ŷ)²  if |y − ŷ| ≤ δ
Lδ(y, ŷ) = δ·|y − ŷ| − ½δ²  otherwise

The overall Huber Loss is the average of Lδ over all observations.

The value of δ is chosen based on the specific needs of the problem and the desired sensitivity to outliers. If δ is set very high, the Huber Loss will resemble MSE, and if it is set very low, it will resemble MAE.

Calculation Example:

The calculation below is based on δ = 1.

Figure 3: Huber Loss Example Calculation

Below is what each column and row in the above table represents:

  • y: the actual values of the observations or targets.
  • ŷ: the predicted values generated by the model.
  • |y-ŷ|: the absolute difference between the actual and predicted values. It represents the absolute error for each observation.
  • ½(y-ŷ)²: This is the squared error for each observation, divided by 2. It is used in the Huber Loss calculation for smaller errors (those within the threshold δ).
  • δ|y-ŷ| − ½δ²: This column is used for larger errors (those exceeding the threshold δ). It calculates a modified version of the absolute error that is less sensitive to outliers. The constant δ is a predefined threshold that determines where the loss function switches from quadratic to linear.
  • Huber Loss: This column shows the final Huber Loss for each observation. It applies the quadratic formula for errors within δ and the linear formula for errors exceeding δ, taking the appropriate value from one of the two previous columns for each observation.
  • AVG (Average Huber Loss): The final row under “Huber Loss” is the mean of all the individual Huber Loss values, giving the overall average Huber Loss for the model on the dataset. A lower average indicates better predictive performance with a balanced sensitivity to outliers.
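
To make the column logic concrete, here is a minimal NumPy sketch of the same procedure with δ = 1. It reuses the illustrative data from the earlier sketches, so the individual values will not match Figure 3 exactly.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Average Huber loss: quadratic for |y - y_hat| <= delta, linear beyond it."""
    error = np.abs(y_true - y_pred)
    quadratic = 0.5 * error ** 2               # ½(y - ŷ)²  (small errors)
    linear = delta * error - 0.5 * delta ** 2  # δ|y - ŷ| − ½δ²  (large errors)
    return np.mean(np.where(error <= delta, quadratic, linear))

# Illustrative data (same as the MSE/MAE sketches above)
y_true = np.array([3, 5, 7, 9, 11, 13])
y_pred = np.array([1, 5, 8, 10, 12, 15])

print(f"Huber loss (delta = 1) = {huber_loss(y_true, y_pred):.3f}")  # 0.750
```

Note how each element is routed to either the quadratic or the linear branch, exactly as described in the column walkthrough above.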

Log Loss

Log Loss, also known as logistic loss or cross-entropy loss, is a pivotal loss function used in classification problems, especially with models that predict probabilities. The most critical aspect of Log Loss is its ability to quantify the accuracy of a classifier by penalizing false classifications. It achieves this by taking into account the uncertainty of the predictions — assigning a higher loss to predictions that are confidently incorrect, and a lower loss to those that are correct or less confident. This property of Log Loss is crucial because it encourages the model not only to classify examples correctly but also to refine the probability estimations for its predictions. The use of Log Loss leads to classifiers that are well-calibrated, which means the predicted probabilities reflect true probabilities of the observed outcomes, an essential feature for decision-making processes that rely on probabilistic interpretations.
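
This asymmetric penalty is easy to see numerically. The exact formula follows below, but for a single positive example the loss is simply −ln(p), where p is the probability the model assigned to the positive class:

```python
import math

# Per-example log loss for a positive example (y = 1) is -ln(p),
# where p is the predicted probability of the positive class.
for p in (0.99, 0.7, 0.5, 0.3, 0.01):
    print(f"p(1) = {p:.2f} -> log loss = {-math.log(p):.3f}")

# Confident and correct (p = 0.99) costs only ~0.010,
# while confidently wrong (p = 0.01) costs ~4.605.
```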

The Log Loss is calculated using the formula below:

Log Loss = −(1/n) · Σᵢ [ yᵢ · ln(pᵢ) + (1 − yᵢ) · ln(1 − pᵢ) ]

The formula is structured to penalize predictions that diverge from the actual labels, and it operates as follows:

  • n represents the number of observations or samples in the dataset.
  • yᵢ represents the actual label of the i-th observation, with 1 indicating the positive class and 0 the negative class.
  • pᵢ represents the predicted probability that the i-th observation belongs to the positive class (shown as p(1) in the example below).

Calculation Example:

Figure 4: Log Loss Example Calculation

Below is what each column and row in the above table represents:

  • y: the actual labels for each observation in the dataset. A value of 1 indicates the positive class, and a value of 0 indicates the negative class.
  • p(1): the predicted probability that the observation belongs to the positive class (class 1), as estimated by the model.
  • ŷ: the predicted label based on the predicted probability p(1). If p(1) is greater than or equal to 0.5, the prediction is usually classified as 1 (positive); if it’s less than 0.5, as 0 (negative).
  • p(y): This column adjusts the predicted probabilities to match the actual labels. If the actual label y is 1, p(y) is the same as p(1). If y is 0, p(y) is 1 − p(1), reflecting the probability of the negative class.
  • y*ln(p(y)): This column represents the first part of the Log Loss formula, y * ln(p(y)). It computes the log loss contribution from the positive class predictions. For actual negatives (y=0), this part contributes 0 to the log loss, as seen in the sheet.
  • (1-y)ln(1-p(y)): This is the second part of the Log Loss formula, (1-y) * ln(1-p(y)). It calculates the log loss contribution from the negative class predictions. For actual positives (y=1), this part contributes 0.
  • Log loss: This column is the negative of the sum of the two previous columns (reflecting the leading minus sign in the formula) and represents the individual log loss for each observation. It measures how far each prediction lies from the actual label.
  • AVG (Average Log Loss): The final row under “Log loss” calculates the average of the individual log losses for all observations, which gives the overall Log Loss for the model on this dataset.
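
Here is a short sketch that reproduces the table’s procedure end to end. The labels and probabilities below are hypothetical placeholders rather than the exact values in Figure 4; the steps (deriving p(y), taking logarithms, and averaging) follow the column descriptions above.

```python
import numpy as np

# Hypothetical labels and predicted probabilities of the positive class.
y = np.array([1, 0, 1, 1, 0, 1])
p1 = np.array([0.9, 0.2, 0.6, 0.4, 0.7, 0.8])

# p(y): the probability the model assigned to the label that actually occurred.
p_y = np.where(y == 1, p1, 1 - p1)

# Per-observation log loss; equivalent to -[y*ln(p1) + (1-y)*ln(1-p1)],
# because one of the two terms is always zero.
log_loss = -np.log(p_y)
print(f"Average log loss = {log_loss.mean():.3f}")
```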

Summary

Loss functions are crucial in guiding the optimization process of machine learning models by quantifying the difference between predicted and actual values. Different loss functions, such as Mean Squared Error, Mean Absolute Error, Huber Loss, and Log Loss, offer unique advantages and are suited to different types of problems. The choice of loss function significantly affects model performance, sensitivity to outliers, convergence speed, and the probabilistic interpretation of predictions. Understanding the nuances of each loss function helps in selecting the appropriate one for specific applications, thereby enhancing the model’s accuracy and robustness.

4 Ways to Learn

1. Read the article: Loss Functions

2. Play with the visual tool: Loss Functions

3. Watch the video: Loss Functions

4. Practice with the code: Loss Functions

Previous Article: Optimizers
Next Article: Deep Learning Visual Demo
