Deep Learning 101: Lesson 4: Gradient Descent

Muneeb S. Ahmad
6 min read · Aug 28, 2024


This article is part of the “Deep Learning 101” series. Explore the full series for more insights and in-depth learning here.

Gradient descent is a cornerstone in the world of machine learning and artificial intelligence, forming the backbone of many optimization and learning algorithms. This chapter delves into the intricacies and nuances of gradient descent, uncovering its fundamental principles and examining its practical applications.

The concept of Gradient Descent

Figure 1: Illustration of Gradient Descent Concept

Gradient Descent can be compared to a treasure hunter’s journey to find the lowest point in a valley. Imagine being blindfolded and placed on top of a mountain. Your mission? To find the lowest point possible. Your only guide is the slope you feel under your feet, forcing you to take tiny, incremental steps downhill in the steepest direction you can perceive. As you repeat this process, each step being a small adjustment in direction, you gradually descend closer to the bottom of the valley. In machine learning, the gradient descent algorithm works in a similar way. Here, the valley symbolizes the error, or loss, in our model’s predictions, and our goal is to minimize that error. The slope at each point corresponds to the gradient, which points us along the path of steepest descent.

Figure 2: Gradient Descent Process

The graph above illustrates this “treasure hunt” in machine learning, where each learning step, or epoch, aims to reduce the value of the loss function. It visually represents the journey of gradient descent, with the x-axis showing the number of epochs and the y-axis showing the value of the loss function. As the epochs progress, one can observe a steep, roughly exponential decrease in the loss, demonstrating the effectiveness of the algorithm in reducing the error over time.

Consider a straightforward two-dimensional x-y dataset:

x: 0.2, 0.4, 0.6, 0.8, 1.0, 1.2

y: 0.18, 0.32, 0.42, 0.48, 0.58, 0.72

We aim to fit this data with a linear equation:

ŷ = ax + b

Let’s choose the initial values of the parameters a and b, and the learning rate (alpha), as below:

a=1

b=1

alpha= 0.1

This learning rate in gradient descent is like the size of the steps we take as we descend the mountain; too large a step might overshoot the valley, and too small a step might take too long or get stuck in a small dip.
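To make the setup concrete, here is a minimal Python sketch of the data, the model, and the chosen starting values (the names x, y, predict, a, b, and alpha are illustrative, not code from the article):

```python
# Dataset from the example above
x = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
y = [0.18, 0.32, 0.42, 0.48, 0.58, 0.72]

# Linear model: y_hat = a * x + b
def predict(a, b, xi):
    return a * xi + b

# Initial parameter values and learning rate
a, b = 1.0, 1.0
alpha = 0.1
```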

We calculate several intermediate values, as shown in the table below:

Figure 3: Initial Data Points and Predicted Values

This table contains the x and y values, along with their predicted counterparts (ŷ), the difference (ŷ − y), twice the difference (2(ŷ − y)), and the squared difference ((ŷ − y)²), which is used to compute the loss.

The loss, representing the average squared error over the n data points, is calculated as:

L = (1/n) Σ (ŷᵢ − yᵢ)²

The computed loss (L) is:

L = 1.591
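As a quick check, the loss can be reproduced with a few lines of Python, continuing the sketch above:

```python
# Predictions and mean squared error (loss) for a = 1, b = 1
y_hat = [predict(a, b, xi) for xi in x]
loss = sum((yh - yi) ** 2 for yh, yi in zip(y_hat, y)) / len(x)
print(round(loss, 3))  # prints 1.591
```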

Updating the values of a and b:

Using the chain rule, the partial derivatives of the loss with respect to a and b are:

∂L/∂a = ∂L/∂ŷ × ∂ŷ/∂a and ∂L/∂b = ∂L/∂ŷ × ∂ŷ/∂b

and the parameters are updated using the learning rate alpha:

a ← a − alpha × ∂L/∂a and b ← b − alpha × ∂L/∂b

Using the values from the table and plugging them into the above equations, we arrive at the following:

∂L/∂ŷ = 2.5, ∂ŷ/∂a = 0.7, ∂ŷ/∂b = 1

∂L/∂a = 2.5 × 0.7 = 1.75, ∂L/∂b = 2.5 × 1 = 2.5

a = 1 − 0.1 × 1.75 = 0.825, b = 1 − 0.1 × 2.5 = 0.75
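The same update can be expressed in Python, following the worked example above, in which each factor of the chain rule is averaged over the dataset before being multiplied together (again continuing the earlier sketch):

```python
n = len(x)

# Averaged factors of the chain rule, matching the table above
dL_dyhat = sum(2 * (yh - yi) for yh, yi in zip(y_hat, y)) / n  # 2.5
dyhat_da = sum(x) / n                                          # 0.7 (mean of x)
dyhat_db = 1.0

# Chain rule and gradient-descent update with learning rate alpha
dL_da = dL_dyhat * dyhat_da   # 1.75
dL_db = dL_dyhat * dyhat_db   # 2.5
a = a - alpha * dL_da         # 0.825
b = b - alpha * dL_db         # 0.75
```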

Below is the X-Y plot of the input data and the graph of the regression line at this initial stage:

This chart displays a scatter plot of the observed data (as dots) and the initial linear regression line (as a solid line). The line’s poor fit at this stage shows that further training is needed for a more accurate model.

Given that the values of a and b are 0.825 and 0.75 at this initial stage, the equation of the regression line is:

ŷ = 0.825x + 0.75

Now, to iterate through the gradient descent procedure, we continue to compute the partial derivatives and the new values of a and b at each iteration, or epoch. As the line converges toward the best straight-line fit of the input data, the value of the loss function also converges. Once the loss becomes very low, close to zero, the final values of a and b describe the fitted regression line.

The table below captures the progressive updates of the parameter values and the corresponding loss during each iteration of the gradient descent process.

Learning rate (α) = 0.1

It’s clear from the table that as the iterations progress, the values of ∂L/∂a and ∂L/∂b are decreasing, leading to an update in ‘a’ and ‘b’ values, which in turn results in a significant reduction in the loss (L). This trend is indicative of the gradient descent algorithm successfully moving towards a minimum of the loss function. The change from iteration 2 to 3, for instance, shows a notable decrease in loss, suggesting that the learning rate and direction of the gradient descent are effectively optimizing the parameters. Iteration 13 demonstrates a very low loss, indicating that the algorithm is close to finding the optimal values of a and b for the given problem.

The chart below shows how the loss (L) decreases at each iteration.

Initially, at epoch 0, the loss is at its highest, close to 1.5. As the number of epochs increases, the loss sharply decreases, with the most significant drop occurring between epoch 0 and 5. After this steep decline, the loss continues to decrease but at a much slower rate, leveling off and approaching 0 as the epochs approach and surpass 10. The graph indicates a rapid improvement in model performance in the initial epochs, followed by slower, incremental improvements in later epochs.

After 13 iterations, the equation of the regression line becomes:

ŷ = 0.417x + 0.167
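Putting the pieces together, a compact loop in the same style converges to approximately this line; the exact final digits depend on how the iterations are counted, but after roughly 13 to 14 updates a and b settle near 0.417 and 0.167:

```python
# Full gradient-descent loop for the linear model y_hat = a * x + b
x = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
y = [0.18, 0.32, 0.42, 0.48, 0.58, 0.72]
a, b, alpha = 1.0, 1.0, 0.1
n = len(x)

for epoch in range(14):
    y_hat = [a * xi + b for xi in x]
    loss = sum((yh - yi) ** 2 for yh, yi in zip(y_hat, y)) / n
    dL_dyhat = sum(2 * (yh - yi) for yh, yi in zip(y_hat, y)) / n
    a -= alpha * dL_dyhat * (sum(x) / n)  # dL/da = dL/dyhat * dyhat/da
    b -= alpha * dL_dyhat * 1.0           # dL/db = dL/dyhat * dyhat/db
    print(epoch, round(loss, 4), round(a, 4), round(b, 4))
```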

The chart below shows the input x-y data points and the regression line fitted to the data.

Please note that the data points are fairly close to the regression line, suggesting a good fit.

The essence of the gradient descent algorithm in machine learning is to iteratively adjust the parameters of a model to minimize the loss function, that is, the difference between the predicted output and the actual output. Through successive iterations, as seen in the change from the initial equation to the updated one and in the movement of the regression line in the graphs, the model gradually becomes more accurate and reliable. This method, which is fundamental to machine learning, illustrates how continuous improvement and fine-tuning are critical to developing effective predictive models. Just as a treasure hunter gets closer to the treasure with each step, gradient descent moves incrementally toward the optimal solution, symbolizing the continuous journey of learning and improvement in AI and data science.

Summary

Gradient descent is a fundamental algorithm in machine learning, essential for optimizing models by iteratively adjusting parameters to minimize the loss function. This process is akin to a treasure hunt, where each step towards the valley’s lowest point symbolizes reducing the error between predicted and actual values. By continuously refining parameters through successive iterations, gradient descent helps models become more accurate and reliable, as demonstrated through the progressive updates and loss reduction in the provided examples.

4 Ways to Learn

1. Read the article: Gradient Descent

2. Play with the visual tool: Gradient Descent


3. Watch the video: Gradient Descent

4. Practice with the code: Gradient Descent

Previous Article: Loss and Metric
Next Article: Stochastic Gradient Descent



Written by Muneeb S. Ahmad

Muneeb Ahmad is a Senior Microservices Architect and Recognized Educator at IBM. He is pursuing his passion for ABC (AI, Blockchain, and Cloud).