Deep Learning 101: Lesson 13: Optimizers

Muneeb S. Ahmad
10 min read · Aug 30, 2024

--

This article is part of the “Deep Learning 101” series. Explore the full series for more insights and in-depth learning here.

Optimizers in machine learning are advanced algorithms that fine-tune the parameters of models to minimize the error in predictions. They are critical in determining the speed and accuracy with which models learn from data. This section will elaborate on the role of optimizers by referencing the foundational concept of gradient descent, as explored in Chapter 3, and will further dissect how these optimizers refine the gradient descent process to achieve efficient convergence.

Gradient descent can be described as a blindfolded trek down a mountain in search of the lowest point in a valley. Each step represents an iteration where the model parameters are updated to reduce the loss — the measure of error between the model’s predictions and the actual outcomes. However, gradient descent is merely the starting point. Optimizers enhance this journey by intelligently controlling the size and direction of these steps, or parameter updates, to reach the valley’s bottom — our point of convergence — more swiftly and accurately.

Role of Optimizers

Optimizers refine the gradient descent process in various ways, building on the plain update rule sketched just after this list:

  • Adaptive Step Size: While gradient descent moves with a fixed step size, optimizers like Adam and RMSProp adjust the step size dynamically. They make it possible to take larger steps when we’re far from the goal and smaller steps as we approach the target, preventing overshooting.
  • Momentum: Just like a ball rolling down a hill gathers momentum, certain optimizers accumulate past gradients. This helps in pushing through noisy gradients and plateaus in the loss landscape, a method used by Momentum and Adam.
  • Stochastic Approach: SGD, mentioned in Chapter 3, updates parameters more frequently using a random subset of data, which can lead to faster convergence, especially with large datasets.
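
To make the baseline concrete, here is a minimal sketch, in Python with NumPy, of the fixed-step gradient descent update that all of the optimizers below modify. The parameter and gradient values are hypothetical stand-ins for what a real model and backpropagation would produce:

```python
import numpy as np

def gradient_descent_step(params, grads, lr=0.01):
    # Baseline update: a fixed-size step against the gradient.
    # Each optimizer in this article alters this rule in some way.
    return params - lr * grads

# Hypothetical values for illustration; in practice the gradients
# come from backpropagation through the model.
params = np.array([2.0, -3.0])
grads = np.array([0.4, -0.6])
params = gradient_descent_step(params, grads)
```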

Importance of Optimizers

The primary importance of optimizers is their impact on the efficiency of convergence:

  • Convergence Speed: Optimizers can significantly reduce the number of iterations required to reach convergence, translating to faster training times, which is crucial for large-scale applications and complex models.
  • Stability and Accuracy: By controlling the step sizes and directions, optimizers prevent erratic updates that could lead to divergence or suboptimal convergence, thus enhancing the stability and accuracy of the model.
  • Practicality: In real-world scenarios, where data can be noisy and functions non-convex, optimizers ensure that models remain robust and less sensitive to the initial choice of parameters or learning rates.

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is a powerful optimization algorithm that represents an evolution of the basic gradient descent technique discussed in Chapter 3. While traditional gradient descent updates model parameters using the entire dataset to calculate the gradient of the loss function, SGD introduces randomness into the process, significantly improving efficiency, especially with large datasets.

SGD is predicated on the premise that the gradient of the loss function can be estimated using a randomly selected subset of data, rather than the full dataset. This subset, known as a mini-batch, provides a stochastic approximation of the gradient, thus the name Stochastic Gradient Descent.

The core advantage of SGD lies in its use of mini-batches. By using a small, randomly selected subset of the data at each iteration, SGD reduces the computational burden significantly. This stochastic approach also confers several benefits:

  • Faster Iterations: Each update is quicker since it processes less data, allowing the algorithm to take more frequent steps towards the minimum.
  • Beneficial Noise: The randomness of mini-batch sampling helps the optimizer escape local minima that can trap standard gradient descent.
  • Scalability: SGD can handle massive datasets that are otherwise infeasible to process with standard gradient descent due to memory constraints.

In each iteration, SGD performs the following steps (a runnable sketch follows the update equation below):

  1. Randomly shuffle the dataset at the beginning of each epoch (a full pass through the dataset).
  2. Select a mini-batch of data points based on a pre-defined batch size.
  3. Calculate the gradient of the loss function for that mini-batch.
  4. Update the model parameters according to the calculated gradient.
  5. Repeat steps 2–4 for each mini-batch until the entire dataset is processed.

The parameters converge to the optimum as the algorithm processes more mini-batches. Unlike batch gradient descent, which might only make one update per epoch, SGD could make as many updates as there are mini-batches. This frequent updating enables SGD to converge more quickly in practice, despite the higher variance in the updates.

Below is the fundamental SGD update equation, which provides the backbone of the algorithm’s operation:

θ ← θ − η ∇θ J(θ; x⁽ⁱ⁾, y⁽ⁱ⁾)

where θ denotes the model parameters, η is the learning rate, and ∇θ J(θ; x⁽ⁱ⁾, y⁽ⁱ⁾) is the gradient of the loss function J computed on the randomly selected mini-batch (x⁽ⁱ⁾, y⁽ⁱ⁾).

In the SGD algorithm the parameters are updated incrementally, guided by a randomly selected subset of data, which is the source of the algorithm’s ‘stochastic’ nature.
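
As a minimal sketch of this procedure, the NumPy snippet below runs mini-batch SGD on a toy linear-regression problem; the dataset, batch size, and learning rate are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # toy dataset (hypothetical)
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)                                # model parameters
lr, batch_size = 0.1, 32

for epoch in range(5):
    order = rng.permutation(len(X))            # 1. shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]        # 2. select a mini-batch
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)   # 3. mini-batch gradient
        w -= lr * grad                                 # 4. parameter update
```

Note that each epoch performs len(X) / batch_size parameter updates rather than just one, which is exactly where SGD’s faster practical convergence comes from.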

Figure 1: Visualizing the action of Stochastic Gradient Descent (SGD)

To visualize the action of SGD, consider a 3D graph that plots the loss function landscape. The z-axis represents the loss, while the x and y-axes represent the model parameters. The image shows a surface with contours representing different loss values, and the goal of SGD is to find the lowest point on this surface.

Figure 2a: SGD path towards the global minimum
Figure 2b: Another SGD path towards the global minimum

Convergence with SGD can be toward a global minimum or a local minimum. A global minimum is the point where the function attains its lowest possible value, ideal for the most accurate model performance. The images above show the path of SGD as it successfully navigates toward the global minimum, represented by the deepest basin on the graph.

Figure 3: SGD path converging to a local minimum

However, SGD does not always lead to the global minimum; it can sometimes get trapped in a local minimum. The image above represents an instance where SGD converges to a local minimum, a point where the function value is lower than in the surrounding area but not the lowest overall.

Gradient Descent with Momentum (Momentum)

Gradient Descent with Momentum, often abbreviated as Momentum, is a method that accelerates the convergence of stochastic gradient descent by incorporating the direction of previous gradients into the current update. This technique is especially effective in addressing the oscillations and slow convergence rates that can occur with standard SGD. The Momentum algorithm enhances the standard SGD update rule by adding a fraction of the previous update vector to the current update. This fraction is called the momentum coefficient, and it is typically set between 0 and 1.
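
A minimal sketch of this update rule, assuming the common formulation in which a velocity vector accumulates past gradients (the coefficient value of 0.9 is a typical but hypothetical choice):

```python
import numpy as np

def momentum_step(params, grads, velocity, lr=0.01, beta=0.9):
    # Blend the previous update direction with the new gradient;
    # beta is the momentum coefficient (between 0 and 1).
    velocity = beta * velocity + grads
    # Step along the accumulated direction.
    return params - lr * velocity, velocity

# Illustrative values; real gradients come from backpropagation.
params = np.array([2.0, -3.0])
velocity = np.zeros_like(params)
for grads in [np.array([0.4, -0.6]), np.array([0.5, -0.5])]:
    params, velocity = momentum_step(params, grads, velocity)
```

Because consecutive gradients that point the same way reinforce the velocity, the effective step grows in consistent directions and shrinks where gradients keep flipping sign, which is what damps the zigzagging.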

In a side-by-side comparison of the convergence plots for standard SGD and SGD with Momentum, one can observe a stark contrast. The plots reveal that while standard SGD makes progress towards the minimum, its path may zigzag due to gradient variance. In contrast, Momentum’s path is smoother and often reaches the minimum in fewer iterations. This comparison highlights the effectiveness of Momentum in navigating the loss landscape more efficiently.

In summary, Gradient Descent with Momentum is a crucial advancement in optimization algorithms for machine learning. By considering the direction and magnitude of prior updates, Momentum mitigates the irregularities inherent in the gradient descent process and accelerates convergence. The visual contrasts between SGD with and without Momentum, as seen in the charts from Chapter 3, illustrate the substantial benefits of this method. Momentum not only smooths the convergence path but also often results in faster training times and a more robust convergence to the global minimum, making it a valuable asset in the optimization toolbox.

Root Mean Squared Propagation (RMSProp)

Root Mean Squared Propagation, commonly known as RMSProp, is an adaptive learning rate method that addresses some of the inefficiencies of classical stochastic gradient descent. RMSProp modifies the learning rate for each weight individually, often leading to better performance and stability, especially in the context of large-scale neural networks.

RMSProp adjusts the learning rate for each weight by dividing it by a running average of the magnitudes of recent gradients. This means that the update for each weight is scaled by the inverse of the square root of the mean of the squares of recent gradients for that weight.
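
A minimal sketch of that per-weight scaling; the decay rate of 0.9 and the small epsilon added to avoid division by zero are conventional but hypothetical choices:

```python
import numpy as np

def rmsprop_step(params, grads, sq_avg, lr=0.001, rho=0.9, eps=1e-8):
    # Running average of squared gradients, kept per parameter.
    sq_avg = rho * sq_avg + (1 - rho) * grads**2
    # Scale each parameter's step by the inverse root of that average.
    return params - lr * grads / (np.sqrt(sq_avg) + eps), sq_avg

# Illustrative values; real gradients come from backpropagation.
params = np.array([2.0, -3.0])
sq_avg = np.zeros_like(params)
grads = np.array([0.4, -0.6])
params, sq_avg = rmsprop_step(params, grads, sq_avg)
```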

The key to RMSProp’s effectiveness is its ability to modulate the learning rate based on the recent history of gradients. Weights associated with infrequent but large gradients receive a smaller update since the denominator in the RMSProp update rule becomes larger for these weights. Conversely, weights associated with consistent but small gradients receive larger updates. This adaptive process results in a more stable and reliable convergence by preventing drastic changes in the parameter values due to erratic gradients.

RMSProp is particularly adept at dealing with functions where the ideal learning rate varies across dimensions. By allowing the learning rate to adapt based on the historical gradients, RMSProp can take larger steps in flatter directions and smaller steps in steeper directions, efficiently finding its way through different curvatures in the loss landscape.

The introduction of RMSProp has been a significant milestone in the development of optimization algorithms for deep learning. It paved the way for more sophisticated methods, like Adam, which combine the ideas of momentum and adaptive learning rates to provide even more robust convergence properties.

Adaptive Moment Estimation (Adam)

Adaptive Moment Estimation, known as Adam, is a sophisticated optimizer that has gained widespread popularity in training deep learning models. It combines the benefits of two other methods, the momentum principle and the adaptive learning rate from RMSProp, to deliver an optimization algorithm that can handle sparse gradients on noisy problems.

Adam’s power comes from its two-fold approach: it borrows the idea of momentum by using the moving average of the gradient, which helps in smoothing out the path towards the objective; and it employs the concept of adaptive learning rates from RMSProp, where the learning rate is adjusted based on the recent magnitudes of the gradients. This hybrid approach enables Adam to adjust its learning rate for each parameter based on the history of gradients, making it well-suited for dealing with noisy or sparse gradients.
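
A minimal sketch of that two-fold update, including the bias correction Adam applies because both moving averages start at zero; beta1 = 0.9 and beta2 = 0.999 are the commonly quoted defaults, and the remaining values are illustrative:

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grads      # first moment: momentum-style average
    v = beta2 * v + (1 - beta2) * grads**2   # second moment: RMSProp-style average
    m_hat = m / (1 - beta1**t)               # bias correction, where t is
    v_hat = v / (1 - beta2**t)               # the 1-based step count
    return params - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Illustrative values; real gradients come from backpropagation.
params = np.array([2.0, -3.0])
m, v = np.zeros_like(params), np.zeros_like(params)
grads = np.array([0.4, -0.6])
params, m, v = adam_step(params, grads, m, v, t=1)
```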

In practice, Adam is particularly effective on problems with noisy or sparse gradient information. The optimizer can dampen the oscillations in the optimization trajectory, which is a common issue with momentum, by scaling the step size with the square root of the second moment estimate. This adaptability makes Adam highly reliable for a wide range of models and data modalities, from computer vision tasks with large, correlated datasets to natural language processing problems where the data is more sparse.

In summary, Adam stands out as a state-of-the-art optimizer in machine learning, adept at handling the diverse challenges presented by modern deep learning tasks. Its design, which thoughtfully combines momentum with an individualized adaptive learning rate for each parameter, gives it an edge in both convergence speed and stability. By automatically tuning the step size based on the gradients’ historical data, Adam offers a robust and efficient route to train models, often leading to rapid improvements in performance on a wide array of problems, including those with noisy or incomplete gradient information. Its versatility and ease of use have made it one of the go-to optimizers in the field.

Figure 4: Comparison of various optimizers in terms of how they converge to local and global minima.

In the quest for the optimal solution within machine learning models, different optimization algorithms can navigate the loss landscape in diverse ways. The above images offer a visual comparison of four such algorithms — Stochastic Gradient Descent (SGD), Momentum (SGDM), RMSProp, and Adaptive Moment Estimation (Adam) — illustrating their paths through a function’s contour as they seek to minimize the loss.

When targeting the global minimum (left side of the images), the paths taken by these optimizers highlight their distinctive strategies:

  • SGD starts with a direct approach towards the global minimum but then veers off, illustrating the challenges it faces with oscillations and potentially getting stuck on suboptimal paths.
  • Momentum accelerates SGD by propelling it across the valleys and over the hills of the loss landscape, providing a smoother and more direct route to the global minimum, as indicated by its path.
  • RMSProp shows an adaptive approach, adjusting its path more responsively to the curvature of the loss function, which helps it stay on course towards the global minimum without the wide arcs seen in SGD and SGDM.
  • Adam combines the strengths of both Momentum and RMSProp, which allows it to navigate the contours with an adaptive, smooth trajectory, often reaching the global minimum more efficiently, as depicted by its path.

In the context of local minima (right side of the images), these algorithms display varied behaviors:

  • SGD, without the aid of momentum or adaptive learning rates, can easily be trapped in local minima, as its path towards the right side indicates a potential premature convergence.
  • Momentum may overshoot local minima due to its accumulated velocity, which can be beneficial for escaping shallow basins but might also miss the broader context of the landscape.
  • RMSProp adjusts its learning rate based on the gradient’s magnitude, allowing it to slow down and take careful steps around sharp curves, which can help it detect and avoid local minima more effectively.
  • Adam takes a balanced approach, leveraging both the smoothing effect of momentum and the adaptive learning rate to discern between local and global minima effectively.
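
To experiment with these behaviors directly, the sketch below runs all four update rules on a simple elongated quadratic bowl; the loss function, learning rates, and step count are all hypothetical choices for illustration:

```python
import numpy as np

def grad(w):
    # Gradient of f(w) = 0.1 * w[0]**2 + 2 * w[1]**2: a bowl that is
    # flat along the first axis and steep along the second.
    return np.array([0.2 * w[0], 4.0 * w[1]])

def sgd(w, g, s, t, lr=0.4):
    return w - lr * g, s

def momentum(w, g, s, t, lr=0.04, beta=0.9):
    v = beta * s.get("v", 0.0) + g
    return w - lr * v, {"v": v}

def rmsprop(w, g, s, t, lr=0.1, rho=0.9, eps=1e-8):
    sq = rho * s.get("sq", 0.0) + (1 - rho) * g**2
    return w - lr * g / (np.sqrt(sq) + eps), {"sq": sq}

def adam(w, g, s, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * s.get("m", 0.0) + (1 - b1) * g
    v = b2 * s.get("v", 0.0) + (1 - b2) * g**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), {"m": m, "v": v}

for name, step in [("SGD", sgd), ("Momentum", momentum),
                   ("RMSProp", rmsprop), ("Adam", adam)]:
    w, state = np.array([5.0, 5.0]), {}    # same start for every optimizer
    for t in range(1, 201):
        w, state = step(w, grad(w), state, t)
    print(f"{name:8s} final parameters: {w}")  # the minimum is at (0, 0)
```

Printing the final parameters after a fixed budget of steps gives a rough feel for how quickly each method settles toward the minimum at the origin.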

Summary

Optimizers play a critical role in machine learning by fine-tuning model parameters to minimize prediction error efficiently. Advanced methods like Momentum, RMSProp, and Adam enhance traditional gradient descent by introducing adaptive step sizes, momentum accumulation, and adaptive learning rates, leading to faster and more stable convergence. These optimizers address challenges such as noisy gradients, local minima, and the need for faster training times, making them indispensable for training complex models in real-world scenarios.

4 Ways to Learn

1. Read the article: Optimizers

2. Play with the visual tool: Optimizers

3. Watch the video: Optimizers

4. Practice with the code: Optimizers

Previous Article: Activation Functions
Next Article: Loss Functions
