Understanding the Forward and Backward Pass in PyTorch: A Step-by-Step Walkthrough

Muneeb S. Ahmad
Oct 19, 2024

In this article, we’ll dive into the fundamental concepts of the forward and backward pass using PyTorch, demonstrating how gradient descent works with a simple linear regression model. We’ll explain the process by visualizing each step, using charts to track loss reduction and parameter updates (weights and bias). By the end, you’ll understand how PyTorch automates the optimization process using backpropagation.

Problem Setup: Linear Regression

We will use a simple linear regression model defined by the equation:

y′ = a⋅x + b

Here, a is the slope, and b is the intercept. Our task is to adjust these parameters such that the predicted values y′ closely match the true values y.

Input and Output Data

We start by defining our input and output data as shown below:

X_values = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
Y_values = [0.18, 0.32, 0.42, 0.48, 0.58, 0.72]

These six data points follow a roughly linear trend. We aim to fit a line through them using mini-batch Stochastic Gradient Descent (SGD) in PyTorch.
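The snippets later in this article refer to X_train and Y_train without showing how they are built. A minimal sketch, assuming they are simply the two lists above converted to float32 tensors; the reshape to one column per sample is an assumption based on nn.Linear(1, 1) expecting inputs of shape (N, 1):

```python
import torch

X_values = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
Y_values = [0.18, 0.32, 0.42, 0.48, 0.58, 0.72]

# Assumption: one feature per sample, so each tensor gets shape (6, 1).
X_train = torch.tensor(X_values, dtype=torch.float32).view(-1, 1)
Y_train = torch.tensor(Y_values, dtype=torch.float32).view(-1, 1)

print(X_train.shape, Y_train.shape)
```

Keeping inputs and targets in the same (N, 1) shape also avoids accidental broadcasting when the loss is computed.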

Model Initialization

Defining the Model

We define our linear regression model using PyTorch’s nn.Module:

import torch
import torch.nn as nn

class LinearRegressionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)  # 1 input feature, 1 output

    def forward(self, x):
        # Computes y′ = a⋅x + b
        return self.linear(x)

Here, self.linear contains the parameters a (its weight) and b (its bias) that we want to optimize. The forward method computes the predicted values y′ for the given input x.
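One way to see where a and b live is to list the parameters the module has registered. A minimal sketch that re-declares the class so it runs on its own:

```python
import torch.nn as nn

class LinearRegressionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

model = LinearRegressionModel()

# Each entry is a tensor that autograd tracks and the optimizer can update.
for name, param in model.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)
```

The weight has shape (1, 1) and the bias has shape (1,), which correspond to the scalars a and b in the equation above.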

Initializing Parameters

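We create the model and set its parameters manually so that every run starts from the same known values:

model = LinearRegressionModel()
initial_a = 1.0
initial_b = 1.0
with torch.no_grad():  # keep the in-place initialization out of the autograd graph
    model.linear.weight.fill_(initial_a)
    model.linear.bias.fill_(initial_b)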

This ensures that we start with a=1.0 and b=1.0, and we can observe how these values evolve as the training progresses.
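A quick sanity check on this starting point: with a = 1.0 and b = 1.0, the model should predict x + 1 for any input. The sketch below uses a bare nn.Linear layer as a stand-in for model.linear, which is an assumption made only to keep the example short:

```python
import torch
import torch.nn as nn

linear = nn.Linear(1, 1)  # stand-in for model.linear
with torch.no_grad():
    linear.weight.fill_(1.0)
    linear.bias.fill_(1.0)
    x = torch.tensor([[0.2], [0.4]])
    y_pred = linear(x)  # expected to equal x + 1

print(y_pred)
```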

Optimizer and Loss Function

To train the model, we need two key components: an optimizer and a loss function.

Optimizer: We use Stochastic Gradient Descent (SGD) to optimize the model parameters:

import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=0.1)

The learning rate (lr) of 0.1 controls how much we adjust the parameters on each step. A smaller learning rate means smaller updates, while a larger one means more aggressive updates.
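To make the effect of lr concrete, here is a small sketch of the plain SGD update rule, param ← param − lr ⋅ grad, applied with three different learning rates. The helper sgd_step and the gradient value are illustrative assumptions, not part of the article's code:

```python
def sgd_step(param, grad, lr):
    """Plain SGD update: move against the gradient, scaled by lr."""
    return param - lr * grad

a = 1.0
grad_a = 0.636  # illustrative gradient value

for lr in (0.01, 0.1, 1.0):
    print(f"lr={lr}: a goes from {a} to {sgd_step(a, grad_a, lr):.4f}")
```

The same gradient produces a step one hundred times larger at lr=1.0 than at lr=0.01, which is why an overly large learning rate can overshoot the minimum.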

Loss Function: We use Mean Squared Error (MSE) to measure how well our predictions match the actual values:

criterion = nn.MSELoss()

The MSE loss calculates the average squared difference between predicted and actual values. The goal is to minimize this loss.
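The calculation is easy to reproduce by hand. The sketch below computes the MSE in plain Python for the first two data points at the starting values a = 1.0 and b = 1.0; the helper name mse is hypothetical, and nn.MSELoss with its default mean reduction computes the same quantity on tensors:

```python
def mse(preds, targets):
    """Average of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

a, b = 1.0, 1.0
xs = [0.2, 0.4]
ys = [0.18, 0.32]

preds = [a * x + b for x in xs]  # [1.2, 1.4]
loss = mse(preds, ys)            # (1.02**2 + 1.08**2) / 2

print(round(loss, 4))  # 1.1034
```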

Training Loop with Forward and Backward Passes

The training process alternates between forward passes (where the model makes predictions and the loss is computed) and backward passes (where gradients are computed and then used to update the parameters).

Let’s break down the key steps of the training loop:

1. Forward Pass
In the forward pass, the model computes the predicted value y′ using the current values of a and b:

Y_pred = model(X_batch)

For each batch of data points, we then calculate the loss, which here is the mean squared difference between the predicted and true values:

loss = criterion(Y_pred, Y_batch)

2. Backward Pass
After calculating the loss, we perform the backward pass to compute gradients. PyTorch automatically computes the gradients of the loss with respect to a and b using backpropagation:

loss.backward()

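This step stores ∂L/∂a and ∂L/∂b in the .grad attribute of each parameter. The gradients tell us in which direction, and how steeply, the loss changes as a and b change; the parameters themselves are not modified yet.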
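For this model the gradients can also be written in closed form, which makes it possible to check what autograd produces. A sketch under the assumption of a bare nn.Linear layer standing in for the model, using the first two data points at a = b = 1.0:

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)  # stand-in for LinearRegressionModel
with torch.no_grad():
    model.weight.fill_(1.0)
    model.bias.fill_(1.0)

X_batch = torch.tensor([[0.2], [0.4]])
Y_batch = torch.tensor([[0.18], [0.32]])

Y_pred = model(X_batch)
loss = nn.MSELoss()(Y_pred, Y_batch)
loss.backward()  # fills model.weight.grad and model.bias.grad

# Analytic gradients of the mean squared error for y' = a*x + b:
#   dL/da = (2/N) * sum((y' - y) * x)
#   dL/db = (2/N) * sum(y' - y)
residual = (Y_pred - Y_batch).detach()
grad_a = 2 * (residual * X_batch).mean()
grad_b = 2 * residual.mean()

print(model.weight.grad.item(), grad_a.item())
print(model.bias.grad.item(), grad_b.item())
```

Both pairs should agree up to floating-point error, at roughly 0.636 for a and 2.1 for b.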

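3. Zeroing Gradients

PyTorch accumulates gradients in each parameter’s .grad attribute across calls to backward(), so they must be reset once per batch. The reset has to happen before loss.backward(), not between loss.backward() and optimizer.step(): zeroing at that point would erase the gradients that were just computed and leave the optimizer with nothing to apply. In the training loop, this call therefore comes before the backward pass:

optimizer.zero_grad()

This ensures that the update for the current batch uses only that batch’s gradients.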
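The accumulation behavior is easy to observe on a single tensor. In this sketch, the toy function loss_fn and the direct call to w.grad.zero_() are illustrative assumptions; optimizer.zero_grad() performs the equivalent reset for every parameter the optimizer manages:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

def loss_fn(w):
    return (w * 3.0) ** 2  # derivative is 18 * w, so 18 at w = 1

loss_fn(w).backward()
first = w.grad.item()          # gradient from one backward pass

loss_fn(w).backward()          # no reset: the new gradient is added to the old one
accumulated = w.grad.item()

w.grad.zero_()                 # reset, then compute a fresh gradient
loss_fn(w).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)  # 18.0 36.0 18.0
```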

4. Updating Parameters
Finally, we update the parameters using the computed gradients:

optimizer.step()

The optimizer adjusts a and b by moving them in the direction that reduces the loss.
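For plain SGD without momentum or weight decay, the update amounts to subtracting lr times the gradient from each parameter. The sketch below compares optimizer.step() against that rule for the first two data points; the bare nn.Linear layer is an assumed stand-in for the model:

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(1, 1)  # stand-in for LinearRegressionModel
with torch.no_grad():
    model.weight.fill_(1.0)
    model.bias.fill_(1.0)

lr = 0.1
optimizer = optim.SGD(model.parameters(), lr=lr)

X_batch = torch.tensor([[0.2], [0.4]])
Y_batch = torch.tensor([[0.18], [0.32]])

optimizer.zero_grad()
loss = nn.MSELoss()(model(X_batch), Y_batch)
loss.backward()

# What the update rule predicts: param - lr * grad
expected_a = 1.0 - lr * model.weight.grad.item()
expected_b = 1.0 - lr * model.bias.grad.item()

optimizer.step()

print(model.weight.item(), expected_a)
print(model.bias.item(), expected_b)
```

After this single step, a moves from 1.0 to about 0.9364 and b from 1.0 to about 0.79.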

Epochs and Batching

An epoch refers to a full pass through the entire dataset. In each epoch, we divide the data into smaller batches. For this example, we use a batch size of 2:

batch_size = 2
n_batches = len(X_train) // batch_size

In each batch, the model processes two data points, performs the forward and backward passes, and updates the parameters.
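The index arithmetic behind the batching can be tried out on the plain Python list of inputs. A small sketch (the batches list exists only for illustration):

```python
X_values = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
batch_size = 2
n_batches = len(X_values) // batch_size  # 6 // 2 = 3

batches = []
for batch_idx in range(n_batches):
    start_idx = batch_idx * batch_size
    end_idx = start_idx + batch_size
    batches.append(X_values[start_idx:end_idx])

print(batches)  # [[0.2, 0.4], [0.6, 0.8], [1.0, 1.2]]
```

Note that integer division silently drops leftover samples when the dataset size is not a multiple of batch_size; with six points and a batch size of 2, nothing is lost.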

For each epoch, we loop through all batches, as shown in the code snippet below:

for epoch in range(num_epochs):
    for batch_idx in range(n_batches):
        # Get the mini-batch
        start_idx = batch_idx * batch_size
        end_idx = start_idx + batch_size

        X_batch = X_train[start_idx:end_idx]
        Y_batch = Y_train[start_idx:end_idx]

        # Forward pass
        Y_pred = model(X_batch)
        loss = criterion(Y_pred, Y_batch)

        # Backward pass and optimization
        optimizer.zero_grad()  # Zero the gradients
        loss.backward()        # Backpropagate the loss
        optimizer.step()       # Update weights

For each batch, the model adjusts its parameters based on the batch-specific loss. Over time, this helps the model converge toward the optimal values for a and b.
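Putting the pieces together, the whole procedure fits in a short script. This is a sketch rather than the article's exact notebook: the value of num_epochs, the use of a bare nn.Linear layer in place of LinearRegressionModel, and the before/after loss measurement are all assumptions made to keep it self-contained:

```python
import torch
import torch.nn as nn
import torch.optim as optim

X_train = torch.tensor([0.2, 0.4, 0.6, 0.8, 1.0, 1.2]).view(-1, 1)
Y_train = torch.tensor([0.18, 0.32, 0.42, 0.48, 0.58, 0.72]).view(-1, 1)

model = nn.Linear(1, 1)  # stand-in for LinearRegressionModel
with torch.no_grad():
    model.weight.fill_(1.0)  # a
    model.bias.fill_(1.0)    # b

criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

num_epochs = 200  # assumption: the article does not state a value
batch_size = 2
n_batches = len(X_train) // batch_size

with torch.no_grad():
    initial_loss = criterion(model(X_train), Y_train).item()

for epoch in range(num_epochs):
    for batch_idx in range(n_batches):
        start_idx = batch_idx * batch_size
        end_idx = start_idx + batch_size
        X_batch = X_train[start_idx:end_idx]
        Y_batch = Y_train[start_idx:end_idx]

        Y_pred = model(X_batch)            # forward pass
        loss = criterion(Y_pred, Y_batch)

        optimizer.zero_grad()              # clear old gradients
        loss.backward()                    # backward pass
        optimizer.step()                   # parameter update

with torch.no_grad():
    final_loss = criterion(model(X_train), Y_train).item()

print(f"loss over all points: {initial_loss:.4f} -> {final_loss:.4f}")
print(f"a = {model.weight.item():.4f}, b = {model.bias.item():.4f}")
```

Because the batches are visited in a fixed order and the starting values are fixed, repeated runs should give the same trajectory.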

Visualizing the Training Process

Understanding the internal workings of gradient descent and backpropagation can be challenging, so visualizing the training process helps clarify how model parameters are updated and how the loss decreases over time. In this section, we will break down the key visual elements that track the progress of the model’s training. By examining how predictions, loss, and gradients evolve batch by batch, we can observe how the model improves and converges toward an optimal solution.

Tracking Parameter Updates and Loss Reduction through Stochastic Gradient Descent

In the animation, we visualize how each mini-batch of data affects the weight (a), bias (b), and loss (L). With each forward pass, the model makes predictions based on the current values of a and b, calculates the loss, and then adjusts these parameters during the backward pass. The table captures these adjustments for each data point, showing the residuals (ŷ - y), the computed gradients (∂L/∂a, ∂L/∂b), and how the parameters are updated as training progresses. Over time, the model becomes more accurate, which is reflected in the decreasing loss.
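A table of this kind can be approximated in plain Python by carrying out the updates by hand for one epoch. The column layout below is a guess at what such a table might contain, not a reproduction of the article's figure, and the gradient formulas are the closed-form MSE gradients for a line:

```python
X_values = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
Y_values = [0.18, 0.32, 0.42, 0.48, 0.58, 0.72]

a, b = 1.0, 1.0
lr, batch_size = 0.1, 2

rows = []
for start in range(0, len(X_values), batch_size):
    xs = X_values[start:start + batch_size]
    ys = Y_values[start:start + batch_size]
    n = len(xs)

    residuals = [a * x + b - y for x, y in zip(xs, ys)]            # prediction minus target
    loss = sum(r * r for r in residuals) / n                       # MSE for this batch
    grad_a = 2 * sum(r * x for r, x in zip(residuals, xs)) / n     # dL/da
    grad_b = 2 * sum(residuals) / n                                # dL/db

    a -= lr * grad_a                                               # SGD update
    b -= lr * grad_b
    rows.append((start // batch_size, loss, grad_a, grad_b, a, b))

print("batch   loss  dL/da  dL/db      a      b")
for batch_idx, batch_loss, g_a, g_b, a_new, b_new in rows:
    print(f"{batch_idx:5d} {batch_loss:6.4f} {g_a:6.4f} {g_b:6.4f} {a_new:6.4f} {b_new:6.4f}")
```

Each row shows the loss and gradients computed before the update and the parameter values after it.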

Visualizing the X-Y Plot for Predictions

This animation provides a real-time view of how the model’s predictions (ŷ) fit the actual data points (y). The X-Y plot dynamically updates as the model adjusts its weight and bias values. Initially, the predicted line may not align well with the data points, but as the training continues, the line shifts to better fit the data, indicating that the model is improving its predictions.

Monitoring the Loss Reduction over Batches

Another essential visualization is the loss curve, which shows how the loss decreases as more batches are processed. This curve starts with a relatively high loss, but with each iteration, the loss gradually decreases, signifying that the model is learning from the data. As the curve flattens out, it indicates that the model is nearing its optimal parameters, and further improvements become minimal.

Try It Yourself: Run the Code in Google Colab

To fully understand and experiment with the concepts discussed in this article, you can run the complete code directly in your browser using Google Colab. Colab provides an easy-to-use environment for running Python code, including PyTorch, without any setup required on your local machine. You’ll be able to visualize the forward and backward passes, track the updates of the model parameters, and see how the loss decreases over time.

Click the link below to open the Colab notebook and try it for yourself:

Run the Code in Google Colab

In this notebook, you can modify the learning rate, batch size, and number of epochs to see how these hyperparameters impact the training process. This hands-on experience will solidify your understanding of gradient descent and backpropagation in PyTorch.

Explore More: Interactive Visual Demo

To deepen your understanding of Stochastic Gradient Descent (SGD) and the forward-backward pass, you can explore an interactive visual demo on this topic. This tool allows you to see how the gradient descent algorithm adjusts parameters in real-time and provides visualizations similar to what we’ve discussed in this article.

Click the link below to access the interactive tool:

Explore the Interactive Demo on Gradient Descent

In this demo, you can experiment with different learning rates, batch sizes, and see how using momentum affects the convergence process. This hands-on experience will provide valuable insights into how SGD works and further solidify your grasp of the concept through direct experimentation.

Conclusion

In this article, we explained the forward and backward pass in PyTorch through a simple linear regression model. We walked through the initialization of the model, optimizer, and loss function, and explored how PyTorch automates the optimization process using backpropagation. By using epochs and batching, we processed the data efficiently and tracked how the model improved over time.

Through this understanding, you can confidently approach more complex models in PyTorch and implement similar training loops for various machine learning tasks.

Muneeb Ahmad is a Senior Microservices Architect and Recognized Educator at IBM. He is pursuing his passion for ABC (AI, Blockchain, and Cloud).