Deep Learning 101: Lesson 5: Stochastic Gradient Descent

Muneeb S. Ahmad
8 min read · Aug 28, 2024


This article is part of the “Deep Learning 101” series. Explore the full series for more insights and in-depth learning here.

As discussed in the previous section, gradient descent is a fundamental optimization algorithm used in machine learning to minimize the loss function, which represents the error between the results predicted by our model and the actual values. The process involves computing the gradient, which is essentially the slope of the loss function at a given point, by considering each individual example within the data set. By understanding the direction and steepness of this slope, we can adjust the model parameters — such as the weights in a neural network — so that they move closer to the optimal values that minimize the error with each iteration. This method is particularly efficient for smaller data sets.

However, as datasets grow larger, which is common in the era of big data, traditional gradient descent can become computationally expensive and time-consuming. It requires processing all examples in the dataset to make a single update to the model parameters, which can be inefficient when dealing with large numbers of samples. In scenarios where datasets contain thousands or even millions of examples, Stochastic Gradient Descent (SGD) becomes a more viable alternative. SGD tackles the problem by updating model parameters based on the gradient of the loss function computed from a randomly selected subset of the data, rather than the entire dataset. This stochastic approach introduces noise into the parameter updates, which can help the model escape local minima and potentially find a better overall solution faster than the traditional method. It's a trade-off: computational efficiency and speed are gained at the cost of higher variance in the updates, which often requires careful tuning of the learning rate and other hyperparameters to achieve the best results.

Figure 1: Gradient Descent Overview

In the Stochastic Gradient Descent algorithm, instead of looking at the entire dataset at once, we take a small subset of examples from the dataset and compute the gradient and the parameter updates, such as those for a and b, based only on that subset. We repeat this process until we have worked through the entire dataset. Typically, the subset of examples is selected randomly. Because of this random sampling, the algorithm is called "stochastic", which means it involves a random component. The subset of examples is called a batch.
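To make the batching idea concrete, here is a minimal Python sketch (using NumPy, with the same x and y values as the example below and an illustrative batch size of 2) of how one epoch can be split into randomly selected batches:

```python
import numpy as np

# Toy dataset (the same x and y values used in the worked example below)
x = np.array([0.2, 0.4, 0.6, 0.8, 1.0, 1.2])
y = np.array([0.18, 0.32, 0.42, 0.48, 0.58, 0.72])

batch_size = 2
rng = np.random.default_rng(seed=0)

# Shuffle the indices once per epoch, then walk through them batch by batch
indices = rng.permutation(len(x))
for start in range(0, len(x), batch_size):
    batch = indices[start:start + batch_size]
    x_batch, y_batch = x[batch], y[batch]
    print(f"batch indices {batch}: x={x_batch}, y={y_batch}")
```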

Let’s look at the same example we covered in the Gradient Descent section of this chapter, with the following values of x, y, ŷ, a, b, and alpha.

x: 0.2, 0.4, 0.6, 0.8, 1.0, 1.2

y: 0.18, 0.32, 0.42, 0.48, 0.58, 0.72

ŷ = ax + b

a = 1

b = 1

alpha = 0.1

Since in the Stochastic Gradient Descent algorithm there is a concept of batch, which is the number of data samples used in calculating the gradient descent parameters, we choose the batch size for this example as below:

batch size = 2

Let’s create a table of x, y, ŷ, and other parameters and variables, just as we did for the gradient descent algorithm. We start by transferring the x and y data to the table. Next, since the batch size is 2, we calculate the ŷ values for only the first two data samples, using the initial values of a = 1 and b = 1 and the equation ŷ = ax + b.


Next, we calculate the values of ŷ − y, loss (L), ∂L/∂a, ∂L/∂b, updated a, and updated b using the gradient descent formulas discussed in the previous section. Note that these formulas require the summation (Σ) of values and the number of data samples (n) in the calculation. In Stochastic Gradient Descent, the scope of the summation and the number of samples are limited to a single batch, not the entire dataset. This means that for the first batch of the first epoch, Batch 0, we calculate the values of the columns ŷ − y, Loss (L), and others by summing over the first two rows of the data table and averaging over two data samples (i.e., dividing the sums by 2). The table below shows all the column values for this first epoch and first batch. Below is an example of how we calculated some of these values:

ŷ − y = ((1.2 − 0.18) + (1.4 − 0.32)) / 2 = 1.05

Loss = ((1.2 − 0.18)² + (1.4 − 0.32)²) / 2 ≈ 1.103

Figure 2: Initial Batch Processing in Stochastic Gradient Descent (Epoch: 0 Batch: 0 a = 1 b = 1 Batch size: 2 alpha = 0.1)

We use the following formulas to calculate the values of loss (L), ∂L/∂a, ∂L/∂b, updated a, and updated b. Note that only data from the first two rows are used to calculate these values, and n = 2.

L = (1/n) Σ (ŷ − y)²

∂L/∂a = (2/n) Σ (ŷ − y) · x

∂L/∂b = (2/n) Σ (ŷ − y)

a ← a − α · ∂L/∂a

b ← b − α · ∂L/∂b
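As a quick check of the arithmetic, here is a short Python sketch that reproduces the Epoch 0, Batch 0 update from these formulas (the printed values match the worked example above up to rounding):

```python
import numpy as np

# First batch: the first two rows of the data table
x = np.array([0.2, 0.4])
y = np.array([0.18, 0.32])
a, b, alpha = 1.0, 1.0, 0.1
n = len(x)  # n = 2, the batch size

y_hat = a * x + b      # predictions: [1.2, 1.4]
error = y_hat - y      # ŷ − y: [1.02, 1.08]

loss = (error ** 2).sum() / n          # ≈ 1.103
dL_da = 2 * (error * x).sum() / n      # = 0.636
dL_db = 2 * error.sum() / n            # = 2.10

a -= alpha * dL_da     # ≈ 0.936 (the worked tables show 0.937)
b -= alpha * dL_db     # = 0.790
print(f"loss={loss:.3f}, a={a:.3f}, b={b:.3f}")
```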

The following tables show the values of all parameters at various stages of the epochs and batches, starting with Epoch: 0, Batch: 1 and ending with Epoch: 3, Batch: 1. Note how the loss (L) decreases as we go through the epochs and batches.

Figure 3: Batch Processing in Subsequent Epochs (Epoch: 0 Batch: 1 a = 0.937 b = 0.790 Batch size: 2 alpha = 0.1)
Figure 4: Convergence Across Epochs and Batches (Epoch: 0 Batch: 2 a = 0.798 b = 0.591 Batch size: 2 alpha = 0.1)
Figure 5: Final Epoch and Batch Calculations (Epoch: 3 Batch: 1 a = 0.456 b = 0.166 Batch size: 2 alpha = 0.1)
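Putting the whole procedure together, the sketch below runs the same sequential batches (rows 1–2, 3–4, and 5–6) for four epochs. The printed loss shrinks from batch to batch, and a and b end up close to the values shown in the figures; small differences come from rounding in the worked tables.

```python
import numpy as np

x = np.array([0.2, 0.4, 0.6, 0.8, 1.0, 1.2])
y = np.array([0.18, 0.32, 0.42, 0.48, 0.58, 0.72])
a, b = 1.0, 1.0
alpha, batch_size, epochs = 0.1, 2, 4

for epoch in range(epochs):
    for start in range(0, len(x), batch_size):
        xb = x[start:start + batch_size]
        yb = y[start:start + batch_size]
        n = len(xb)
        error = a * xb + b - yb           # ŷ − y for this batch
        loss = (error ** 2).sum() / n
        dL_da = 2 * (error * xb).sum() / n
        dL_db = 2 * error.sum() / n
        a -= alpha * dL_da
        b -= alpha * dL_db
        print(f"Epoch {epoch} Batch {start // batch_size}: "
              f"loss={loss:.3f}, a={a:.3f}, b={b:.3f}")
```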

Stochastic Gradient Descent with Momentum

Stochastic Gradient Descent with Momentum, or SGDM, is an extension of the traditional Stochastic Gradient Descent (SGD) algorithm. While SGD is effective due to its computational efficiency, it has the disadvantage of sometimes being erratic in its search for the global minimum. This is where SGDM comes in, providing a more refined approach.

SGDM incorporates the concept of momentum, which is analogous to a heavy ball rolling through the loss function landscape. Just as a ball gains speed and continues on its path, momentum helps the parameter updates maintain direction and be less perturbed by the "noisy" gradients. This is achieved by combining the gradient of the current batch with the previous gradients, effectively "smoothing" the updates. Incorporating momentum allows the algorithm to dampen oscillations and converge more quickly and smoothly.

In SGDM, instead of calculating the gradient based solely on the current batch, the algorithm keeps track of past gradients and adds a fraction of them to the current gradient. This fraction is controlled by a parameter referred to as β (beta), which typically has a value such as 0.9. This means that 90% of the previous step's velocity is carried into the current update, giving the algorithm a memory of past updates and thus a smoother trajectory towards the minimum.

Below are the formulas used in Stochastic Gradient Descent with Momentum (SGDM):

v_a = β · v_a + (1 − β) · ∂L/∂a

v_b = β · v_b + (1 − β) · ∂L/∂b

a = a − α · v_a

b = b − α · v_b

Where:

  • a and b are the parameters of the linear regression model that we are trying to optimize.
  • α is the learning rate.
  • v_a and v_b are the momentum (velocity) terms for the parameters a and b.
  • β is the momentum coefficient. It controls the mixing of the current gradient with the past velocity to calculate the new velocity. A common value for β is 0.9.
  • ∂L/∂a and ∂L/∂b are the partial derivatives of the loss function L with respect to the parameters a and b.
  • (1 − β) scales the contribution of the current gradient relative to the momentum coefficient. The closer β is to 1, the larger the influence of the previous velocities on the current update.
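A minimal Python sketch of these update rules (a single SGDM step, following the (1 − β)-scaled form defined above):

```python
def sgdm_step(a, b, v_a, v_b, dL_da, dL_db, alpha=0.1, beta=0.9):
    """One SGDM update: mix the current gradient into the running
    velocity, then move the parameters along the smoothed velocity."""
    v_a = beta * v_a + (1 - beta) * dL_da
    v_b = beta * v_b + (1 - beta) * dL_db
    a -= alpha * v_a
    b -= alpha * v_b
    return a, b, v_a, v_b
```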

To demonstrate how momentum can help overcome divergence of the loss function, let's use the same data examples discussed earlier with the initial values a = 1, b = 2, and alpha = 1.0. These values are summarized below:

x: 0.2, 0.4, 0.6, 0.8, 1.0, 1.2

y: 0.18, 0.32, 0.42, 0.48, 0.58, 0.72

ŷ = ax + b

a = 1

b = 2

alpha = 1.0

If we use the above data and parameters to run Stochastic Gradient Descent for a few iterations WITHOUT incorporating momentum, the loss function starts to diverge and gets out of control, as shown in the chart below.

Figure 6: Loss Function Divergence Without Momentum

Obviously, this is an undesirable outcome. To solve this problem, we try the same example data and parameters as above, but this time we use the formulas for Stochastic Gradient Descent WITH Momentum. Note that the loss function now converges to a relatively acceptable value.
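As a sanity check, the sketch below runs both variants on this data: plain SGD with alpha = 1.0, and SGDM with beta = 0.9. With these settings the plain-SGD loss blows up within a few batches, while the momentum version gradually settles:

```python
import numpy as np

x = np.array([0.2, 0.4, 0.6, 0.8, 1.0, 1.2])
y = np.array([0.18, 0.32, 0.42, 0.48, 0.58, 0.72])

def run(alpha=1.0, beta=0.0, epochs=30, batch_size=2):
    """beta = 0.0 reduces to plain SGD; beta = 0.9 gives SGDM."""
    a, b, v_a, v_b = 1.0, 2.0, 0.0, 0.0
    losses = []
    for _ in range(epochs):
        for start in range(0, len(x), batch_size):
            xb = x[start:start + batch_size]
            yb = y[start:start + batch_size]
            error = a * xb + b - yb
            losses.append((error ** 2).mean())
            dL_da = 2 * (error * xb).mean()
            dL_db = 2 * error.mean()
            v_a = beta * v_a + (1 - beta) * dL_da
            v_b = beta * v_b + (1 - beta) * dL_db
            a -= alpha * v_a
            b -= alpha * v_b
    return losses

print("plain SGD, first 6 losses:", [round(l, 1) for l in run(beta=0.0, epochs=2)])
print("SGDM, last 3 losses:", [round(l, 4) for l in run(beta=0.9)[-3:]])
```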

Stochastic Gradient Descent (SGD) is an optimization algorithm used to minimize the loss function in machine learning models, and is particularly effective on large datasets. Unlike traditional gradient descent, which processes the entire dataset to update model parameters, SGD updates parameters using a randomly selected subset of the data, called a batch.

Summary

Stochastic Gradient Descent (SGD) extends the traditional gradient descent algorithm by introducing batch processing, where parameters are updated using randomly selected subsets of data. This approach makes SGD computationally efficient and effective for large data sets by reducing the time required for each iteration. In addition, incorporating momentum into SGD helps achieve smoother and more stable convergence of the loss function, further improving the optimization process and allowing models to escape local minima more effectively.

4 Ways to Learn

1. Read the article: Stochastic Gradient Descent

2. Play with the visual tool: Stochastic Gradient Descent

3. Watch the video: Stochastic Gradient Descent

4. Practice with the code: Stochastic Gradient Descent

Previous Article: Gradient Descent
Next Article: Tensors


Muneeb S. Ahmad

Muneeb Ahmad is a Senior Microservices Architect and Recognized Educator at IBM. He is pursuing his passion for ABC (AI, Blockchain, and Cloud).