Understanding the Forward and Backward Pass in PyTorch: A Step-by-Step Walkthrough
In this article, we’ll dive into the fundamental concepts of the forward and backward pass using PyTorch, demonstrating how gradient descent works with a simple linear regression model. We’ll explain the process by visualizing each step, using charts to track loss reduction and parameter updates (weights and bias). By the end, you’ll understand how PyTorch automates the optimization process using backpropagation.
Problem Setup: Linear Regression
We will use a simple linear regression model defined by the equation:
y′ = a⋅x + b
Here, a is the slope, and b is the intercept. Our task is to adjust these parameters such that the predicted values y′ closely match the true values y.
Input and Output Data
We start by defining our input and output data as shown below:
X_values = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
Y_values = [0.18, 0.32, 0.42, 0.48, 0.58, 0.72]
These represent six data points that follow a linear trend. We aim to fit a line through these points using Stochastic Gradient Descent (SGD) in PyTorch.
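The later code snippets operate on tensors named X_train and Y_train, which the text does not define explicitly. Here is a minimal sketch of one way to create them, reshaping each list into a column of shape (6, 1) so it matches the nn.Linear(1, 1) layer used below:

import torch

# Reshape into column vectors of shape (6, 1): one feature per sample
X_train = torch.tensor(X_values, dtype=torch.float32).reshape(-1, 1)
Y_train = torch.tensor(Y_values, dtype=torch.float32).reshape(-1, 1)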
Model Initialization
Defining the Model
We define our linear regression model using PyTorch’s nn.Module:
import torch
import torch.nn as nn
import torch.optim as optim

class LinearRegressionModel(nn.Module):
    def __init__(self):
        super(LinearRegressionModel, self).__init__()
        self.linear = nn.Linear(1, 1)  # 1 input feature, 1 output feature

    def forward(self, x):
        return self.linear(x)
Here, self.linear contains the parameters a and b we want to optimize. The forward method computes the predicted values y′ for the given input x.
Initializing Parameters
We instantiate the model and initialize its parameters manually for better control:

model = LinearRegressionModel()

initial_a = 1.0
initial_b = 1.0
model.linear.weight.data.fill_(initial_a)
model.linear.bias.data.fill_(initial_b)
This ensures that we start with a = 1.0 and b = 1.0, and we can observe how these values evolve as training progresses.
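To confirm that the initialization took effect, we can print the parameter values (a quick check, not part of the original training code):

print(model.linear.weight.item())  # 1.0  (the slope a)
print(model.linear.bias.item())    # 1.0  (the intercept b)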
Optimizer and Loss Function
To train the model, we need two key components: an optimizer and a loss function.
Optimizer: We use Stochastic Gradient Descent (SGD) to optimize the model parameters:
optimizer = optim.SGD(model.parameters(), lr=0.1)
The learning rate (lr) of 0.1 controls how much we adjust the parameters on each step. A smaller learning rate means smaller updates, while a larger one means more aggressive updates.
Loss Function: We use Mean Squared Error (MSE) to measure how well our predictions match the actual values:
criterion = nn.MSELoss()
The MSE loss calculates the average squared difference between predicted and actual values. The goal is to minimize this loss.
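As a quick sanity check on what nn.MSELoss computes, the same value can be obtained by averaging the squared differences by hand (a small illustrative sketch with made-up numbers):

pred   = torch.tensor([[0.5], [1.0]])
target = torch.tensor([[0.0], [0.0]])

print(criterion(pred, target).item())        # 0.625, via nn.MSELoss
print(((pred - target) ** 2).mean().item())  # 0.625, computed manually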
Training Loop with Forward and Backward Passes
The training process involves alternating between forward passes (where the model makes predictions) and backward passes (where gradients are computed and used to update the parameters). We use Mean Squared Error (MSE) as the loss function and Stochastic Gradient Descent (SGD) as the optimizer.
Let’s break down the key steps of the training loop:
1. Forward Pass
In the forward pass, the model computes the predicted value y′ using the current values of a and b:
Y_pred = model(X_batch)
For each batch of data points, we calculate the loss, which is the difference between the predicted and true values:
loss = criterion(Y_pred, Y_batch)
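For example, with the initial values a = 1.0 and b = 1.0, the first batch (x = 0.2 and x = 0.4) gives predictions y′ = 1.2 and y′ = 1.4 against targets 0.18 and 0.32, so the loss is ((1.2 − 0.18)² + (1.4 − 0.32)²) / 2 ≈ 1.10 — far from zero, which is why the parameters need large corrections at the start of training.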
2. Backward Pass
After calculating the loss, we perform the backward pass to compute gradients. PyTorch automatically computes the gradients of the loss with respect to a and b using backpropagation:
loss.backward()
This step computes how much we should adjust a and b to reduce the loss.
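After backward() has run, the gradients are stored on the parameters themselves and can be inspected (a useful check when following along, though not required for training). For MSE with this linear model, they correspond to ∂L/∂a = (2/N) Σ (ŷ − y)·x and ∂L/∂b = (2/N) Σ (ŷ − y):

print(model.linear.weight.grad)  # dL/da for the current batch
print(model.linear.bias.grad)    # dL/db for the current batch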
3. Zeroing Gradients
By default, PyTorch accumulates gradients across successive backward passes. If we never reset them, the gradients from previous batches would be added to the current ones. We therefore zero the gradients at the start of each iteration, before calling loss.backward():
optimizer.zero_grad()
This ensures that the parameter update is based only on the gradients from the current batch.
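To see what accumulation means in practice, here is a small illustrative sketch (not part of the training loop) that runs two backward passes on the same batch without zeroing in between:

optimizer.zero_grad()                      # start from clean gradients

loss1 = criterion(model(X_batch), Y_batch)
loss1.backward()
g1 = model.linear.weight.grad.clone()      # gradient after one backward pass

loss2 = criterion(model(X_batch), Y_batch)
loss2.backward()                           # adds to the existing gradient
g2 = model.linear.weight.grad.clone()

print(g2 / g1)                             # ≈ 2.0: the second gradient was accumulated, not replaced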
4. Updating Parameters
Finally, we update the parameters using the computed gradients:
optimizer.step()
The optimizer adjusts a and b by moving them in the direction that reduces the loss.
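Under plain SGD, step() is equivalent to a simple update rule. The sketch below shows what it does for our two parameters; it is only for intuition, not something to run in addition to optimizer.step():

with torch.no_grad():
    # new_param = old_param - learning_rate * gradient
    model.linear.weight -= 0.1 * model.linear.weight.grad
    model.linear.bias   -= 0.1 * model.linear.bias.grad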
Epochs and Batching
An epoch refers to a full pass through the entire dataset. In each epoch, we divide the data into smaller batches. For this example, we use a batch size of 2:
batch_size = 2
n_batches = len(X_train) // batch_size
In each batch, the model processes two data points, performs the forward and backward passes, and updates the parameters.
For each epoch, we loop through all batches, as shown in the code snippet below:
num_epochs = 100  # number of passes over the dataset (example value)

for epoch in range(num_epochs):
    for batch_idx in range(n_batches):
        # Get the mini-batch
        start_idx = batch_idx * batch_size
        end_idx = start_idx + batch_size
        X_batch = X_train[start_idx:end_idx]
        Y_batch = Y_train[start_idx:end_idx]

        # Forward pass
        Y_pred = model(X_batch)
        loss = criterion(Y_pred, Y_batch)

        # Backward pass and optimization
        optimizer.zero_grad()  # Zero the gradients
        loss.backward()        # Backpropagate the loss
        optimizer.step()       # Update the weights
For each batch, the model adjusts its parameters based on the batch-specific loss. Over time, this helps the model converge toward the optimal values for a and b.
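After enough epochs, we can read off the learned parameters. For this dataset they should settle near the least-squares fit, roughly a ≈ 0.5 and b ≈ 0.1 (approximate values; mini-batch SGD will hover around them rather than match them exactly):

a_learned = model.linear.weight.item()
b_learned = model.linear.bias.item()
print(f"a = {a_learned:.3f}, b = {b_learned:.3f}")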
Visualizing the Training Process
Understanding the internal workings of gradient descent and backpropagation can be challenging, so visualizing the training process helps clarify how model parameters are updated and how the loss decreases over time. In this section, we will break down the key visual elements that track the progress of the model’s training. By examining how predictions, loss, and gradients evolve batch by batch, we can observe how the model improves and converges toward an optimal solution.
Tracking Parameter Updates and Loss Reduction through Stochastic Gradient Descent
In the animation, we visualize how each mini-batch of data affects the weight (a), bias (b), and loss (L). With each forward pass, the model makes predictions based on the current values of a and b, calculates the loss, and then adjusts these parameters during the backward pass. The table captures these adjustments for each data point, showing the residuals (ŷ − y), the computed gradients (∂L/∂a, ∂L/∂b), and how the parameters are updated as training progresses. Over time, the model becomes more accurate, which is reflected in the decreasing loss.
Visualizing the X-Y Plot for Predictions
This animation provides a real-time view of how the model’s predictions (ŷ) fit the actual data points (y). The X-Y plot dynamically updates as the model adjusts its weight and bias values. Initially, the predicted line may not align well with the data points, but as training continues, the line shifts to better fit the data, indicating that the model is improving its predictions.
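If you want to reproduce a static version of this plot yourself, a minimal matplotlib sketch (assuming the model has already been trained) could look like this:

import matplotlib.pyplot as plt

with torch.no_grad():
    Y_fit = model(X_train)  # predictions with the current a and b

plt.scatter(X_values, Y_values, label="data (y)")
plt.plot(X_values, Y_fit.squeeze().tolist(), label="model (y')")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()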
Monitoring the Loss Reduction over Batches
Another essential visualization is the loss curve, which shows how the loss decreases as more batches are processed. This curve starts with a relatively high loss, but with each iteration, the loss gradually decreases, signifying that the model is learning from the data. As the curve flattens out, it indicates that the model is nearing its optimal parameters, and further improvements become minimal.
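A similar sketch works for the loss curve, assuming the per-batch loss values were appended to a list (here called losses) inside the training loop:

import matplotlib.pyplot as plt

# Inside the training loop:  losses.append(loss.item())
plt.plot(losses)
plt.xlabel("batch update")
plt.ylabel("MSE loss")
plt.title("Loss reduction over batches")
plt.show()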
Try It Yourself: Run the Code in Google Colab
To fully understand and experiment with the concepts discussed in this article, you can run the complete code directly in your browser using Google Colab. Colab provides an easy-to-use environment for running Python code, including PyTorch, without any setup required on your local machine. You’ll be able to visualize the forward and backward passes, track the updates of the model parameters, and see how the loss decreases over time.
Click the link below to open the Colab notebook and try it for yourself:
In this notebook, you can modify the learning rate, batch size, and number of epochs to see how these hyperparameters impact the training process. This hands-on experience will solidify your understanding of gradient descent and backpropagation in PyTorch.
Explore More: Interactive Visual Demo
To deepen your understanding of Stochastic Gradient Descent (SGD) and the forward-backward pass, you can explore an interactive visual demo on this topic. This tool allows you to see how the gradient descent algorithm adjusts parameters in real-time and provides visualizations similar to what we’ve discussed in this article.
Click the link below to access the interactive tool:
Explore the Interactive Demo on Gradient Descent
In this demo, you can experiment with different learning rates and batch sizes, and see how adding momentum affects convergence. This hands-on experimentation will provide valuable insight into how SGD works and further solidify your grasp of the concepts.
Conclusion
In this article, we explained the forward and backward pass in PyTorch through a simple linear regression model. We walked through the initialization of the model, optimizer, and loss function, and explored how PyTorch automates the optimization process using backpropagation. By using epochs and batching, we processed the data efficiently and tracked how the model improved over time.
Through this understanding, you can confidently approach more complex models in PyTorch and implement similar training loops for various machine learning tasks.