Deep Learning 101: Lesson 9: Multi-layer Neural Network

Muneeb S. Ahmad
7 min read · Aug 29, 2024


This article is part of the “Deep Learning 101” series. Explore the full series for more insights and in-depth learning here.

A multi-layer neural network is an advanced model used in artificial intelligence and machine learning. Think of it as a collection of interconnected perceptrons, the topic we covered in a previous lesson. However, unlike a perceptron, which has only one layer of neurons, a multi-layer neural network has multiple layers stacked on top of each other. Each layer receives input from the previous layer, computes a weighted sum, and applies a mathematical operation called an activation function, such as the sigmoid function. This allows the network to capture complex relationships between inputs and outputs. To make the network learn, we use backpropagation, a technique that adjusts the weights connecting the neurons based on the error between the predicted and actual output. This adjustment is guided by an optimization algorithm such as gradient descent. Through this iterative process, the network improves its ability to make accurate predictions, ultimately enabling it to solve complex problems.

Forward Propagation

Figure 1: Forward Propagation Process in a Multi-layer Network

The first step in forward propagation is to calculate, for each neuron in the hidden layer, the weighted sum of the inputs plus a bias. For the first hidden neuron h1:

h1 = w1⋅x1 + w2⋅x2 + b1

Where x1 and x2 are inputs, w1 and w2 are weights, and b1 is the bias for the neuron. The same process applies to h2 with its corresponding weights and bias:

h2 = w3⋅x1 + w4⋅x2 + b2
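
As a quick sanity check, this is all the first step amounts to in code. The input, weight, and bias values below are made up for illustration; they are not values from the article.

```python
# Illustrative inputs, weights, and biases (not taken from the article)
x1, x2 = 1.0, 0.0
w1, w2, w3, w4 = 0.5, -0.4, 0.3, 0.8
b1, b2 = 0.1, -0.2

# Weighted sums (pre-activations) for the two hidden neurons
h1 = w1 * x1 + w2 * x2 + b1   # 0.6
h2 = w3 * x1 + w4 * x2 + b2   # roughly 0.1
```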

The next step is to pass these weighted sums through an activation function a(⋅), which introduces non-linearity into the model. This is crucial for the network to learn complex patterns. The output of the activation function for the first hidden neuron is:

a(h1) = ActivationFunction(h1)

And similarly for h2:

a(h2) = ActivationFunction(h2)
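
For concreteness, suppose the activation is the sigmoid mentioned earlier; any differentiable non-linearity would work. Continuing with the made-up values from the previous sketch:

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

h1, h2 = 0.6, 0.1    # pre-activations from the previous sketch (illustrative)
a_h1 = sigmoid(h1)   # a(h1) ≈ 0.646
a_h2 = sigmoid(h2)   # a(h2) ≈ 0.525
```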

The final output of the network, before applying the output activation function, is the weighted sum of the activated hidden neurons (signals) plus the output bias b3:

y = w5⋅a(h1) + w6⋅a(h2) + b3
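
Continuing the same illustrative pass, with arbitrary (made-up) values for w5, w6, and b3:

```python
a_h1, a_h2 = 0.646, 0.525     # activated hidden signals from above
w5, w6, b3 = 1.2, -0.7, 0.05  # illustrative output-layer parameters

y = w5 * a_h1 + w6 * a_h2 + b3   # output pre-activation, ≈ 0.458
```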

The output y is then passed through an output activation function a(⋅) to produce the final prediction a(y). This could be a sigmoid, softmax, or any other activation function depending on the task:

a(y) = OutputActivationFunction(y)

The predicted output a(y) is then compared to the actual target t to calculate the error E, which will be used during backpropagation. Taking the mean squared error as the example loss, the error can be calculated as:

E = 1/2 ⋅ (t − a(y))²

(the factor of 1/2 is a common convention that simplifies the derivative).
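
Finishing the illustrative pass: squashing y with the output activation and comparing against a target t gives the error that backpropagation will use. Here I assume a sigmoid output and the 1/2 ⋅ (t − a(y))² form of the error; both are just the examples used above, and the numbers remain made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

y = 0.458                 # output pre-activation from the previous sketch
a_y = sigmoid(y)          # predicted output a(y) ≈ 0.613
t = 1.0                   # target for this (made-up) training example
E = 0.5 * (t - a_y) ** 2  # error ≈ 0.075
```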

Forward propagation is the essential first step that leads to the generation of predictions in a neural network. It sets the stage for backpropagation, where the network learns from its errors and updates its weights accordingly. Together, these processes allow neural networks to learn from data and improve their predictions over time.

Backpropagation and The Chain Rule

In neural network training, the chain rule is a cornerstone of the backpropagation algorithm, which is used to update the network’s weights and biases. This algorithm relies on the chain rule to compute the gradient of the loss function with respect to each weight, allowing the network to learn from data.

Figure 2: Applying the Chain Rule in Backpropagation

In the above diagram, each neuron’s output, such as h1 and h2, is a function of its inputs, which are affected by weights (w1,w2,…) and biases (b1,b2). The final output a(y) is then a function of these neurons’ outputs.

To update the weight w5, the chain rule helps us decompose the derivative of the loss E with respect to w5 into a product of simpler derivatives, as shown in the equation above.
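
Written out in the article's notation, that decomposition reads:

∂E/∂w5 = ∂E/∂a(y) ⋅ ∂a(y)/∂y ⋅ ∂y/∂w5

Each factor is simple on its own: with the mean-squared-error example, ∂E/∂a(y) = a(y) − t; ∂a(y)/∂y is the derivative of the output activation; and, since y = w5⋅a(h1) + w6⋅a(h2) + b3, the last factor ∂y/∂w5 is just a(h1).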

Figure 3: Gradient Calculation for Weights in Backpropagation

For a weight that is not directly connected to the output, like w1, we must apply the chain rule across multiple layers. The gradient of the error with respect to w1 is computed as shown in the equation below.
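
In the same notation, the chain of factors for w1 runs through both layers:

∂E/∂w1 = ∂E/∂a(y) ⋅ ∂a(y)/∂y ⋅ ∂y/∂a(h1) ⋅ ∂a(h1)/∂h1 ⋅ ∂h1/∂w1

The first two factors are the same as for w5; the remaining ones follow from the forward-pass equations: ∂y/∂a(h1) = w5, ∂a(h1)/∂h1 is the derivative of the hidden activation, and ∂h1/∂w1 = x1.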

Practical Example: XOR Problem

Multi-layer networks are especially adept at handling datasets where the relationship between inputs and outputs is not linearly separable, and the classic XOR problem, where the goal is to classify input pairs based on their exclusivity, is an ideal example for demonstrating a multi-layer network's ability to capture such non-linear relationships. The XOR function outputs true only when its two inputs differ. Our dataset is therefore:

x1   x2   target
0    0    0
0    1    1
1    0    1
1    1    0

A simple multi-layer neural network (MLNN) for the XOR problem consists of an input layer with two neurons, a hidden layer with two neurons, and an output layer with one neuron, as shown in the diagram below.

After training the network for a number of epochs, until the loss (error) is reduced to about 0.01, the weights and biases converge to the values shown in the diagram. With these values of the network parameters, the predicted outputs for the four input pairs are calculated as shown in the figure.

Figure 4: Network Parameters and Predicted Outputs After Training on XOR Problem
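
As a concrete, runnable complement, below is a minimal NumPy sketch of this 2-2-1 sigmoid network trained on XOR with plain gradient descent. The learning rate, initialization, random seed, and epoch count are illustrative choices rather than the article's exact settings, and with an unlucky initialization the loss can plateau in a local minimum, in which case a different seed usually fixes it.

```python
import numpy as np

# XOR dataset: the four input pairs and their targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 1.0, (2, 2))   # hidden-layer weights (w1..w4)
b1 = np.zeros((1, 2))               # hidden-layer biases (b1, b2)
W2 = rng.normal(0.0, 1.0, (2, 1))   # output-layer weights (w5, w6)
b2 = np.zeros((1, 1))               # output bias (b3)

alpha = 2.0                         # learning rate (illustrative)

for epoch in range(20000):
    # Forward propagation
    H = X @ W1 + b1                 # pre-activations h1, h2 for every sample
    A1 = sigmoid(H)                 # a(h1), a(h2)
    Y = A1 @ W2 + b2                # output pre-activation y
    A2 = sigmoid(Y)                 # predicted output a(y)

    # Mean squared error over the four samples
    E = 0.5 * np.mean((T - A2) ** 2)
    if E < 0.01:
        break

    # Backpropagation: the chain rule written in tensor form
    dY = (A2 - T) * A2 * (1 - A2) / len(X)   # dE/dy
    dW2 = A1.T @ dY                          # dE/dW2
    db2 = dY.sum(axis=0, keepdims=True)      # dE/db2
    dH = (dY @ W2.T) * A1 * (1 - A1)         # dE/dh through the hidden layer
    dW1 = X.T @ dH                           # dE/dW1
    db1 = dH.sum(axis=0, keepdims=True)      # dE/db1

    # Gradient-descent updates
    W2 -= alpha * dW2
    b2 -= alpha * db2
    W1 -= alpha * dW1
    b1 -= alpha * db1

print("loss:", round(float(E), 4))
print("predictions:", np.round(A2.ravel(), 2))
```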

Tensor Form of Forward and Backward Propagation

Figure 5: Tensor Form of Forward and Backward Propagation

Let’s explore the structure and parameters of a simple multi-layer neural network. The network consists of an input layer with two inputs, x1 and x2, a hidden layer with two neurons, and an output layer with a single neuron. These layers are shown below with the associated mathematical symbols and equations.

Input Layer:

Weights and biases: Layer 1

Weights and biases: Layer 2

The hidden layer and output calculations can be expressed as:

The activation function a(⋅) introduces non-linearity, allowing the network to learn complex patterns.
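
In case the symbols in the figure are hard to read, one common way to write these quantities out is the following (my notation, which may differ slightly from the figure's; the layer index is written as a superscript in parentheses):

x = [x1, x2]ᵀ
W(1) = [[w1, w2], [w3, w4]],   b(1) = [b1, b2]ᵀ
W(2) = [w5, w6],   b(2) = [b3]

h = W(1)⋅x + b(1)
a(h) = the activation applied element-wise to h
y = W(2)⋅a(h) + b(2)
a(y) = OutputActivationFunction(y)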

Below are the equations for the backpropagation algorithm:

The overall goal is to update each weight and bias by subtracting the product of the learning rate (α) and the respective gradient:

W(l) = W(l) − α ⋅ ∂E/∂W(l)
b(l) = b(l) − α ⋅ ∂E/∂b(l)

where l indicates the layer number.

The gradient of the error with respect to the output-layer weights can be represented as shown below:

The gradient with respect to the hidden-layer weights is a bit more complex because it requires applying the chain rule across layers. It can be represented as shown below,

where each factor in the product is the local derivative of one layer's output with respect to its input.
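
For reference, one standard way to write these gradients in tensor form, assuming the sigmoid/mean-squared-error example used earlier (⊙ denotes element-wise multiplication and a′ is the derivative of the activation), is:

δ(2) = (a(y) − t) ⋅ a′(y)
∂E/∂W(2) = δ(2) ⋅ a(h)ᵀ,   ∂E/∂b(2) = δ(2)

δ(1) = (W(2)ᵀ ⋅ δ(2)) ⊙ a′(h)
∂E/∂W(1) = δ(1) ⋅ xᵀ,   ∂E/∂b(1) = δ(1)

These are exactly the expressions implemented by the dW2, db2, dW1, and db1 lines in the NumPy sketch above, applied to all four samples at once.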

Multi-layer neural networks, with their ability to capture complex, non-linear interactions between features, are a cornerstone of modern AI. Through the processes of forward propagation and backpropagation, these networks can be trained to solve a wide array of problems, continually improving their performance as they learn from more data.

Summary

A multi-layer neural network, with its ability to capture complex, non-linear interactions between features, is a cornerstone of modern AI. By leveraging forward propagation to generate predictions and backpropagation to learn from errors, these networks iteratively refine their weights and biases, enabling them to solve complex problems. The XOR problem exemplifies the power of these networks, showcasing their capability to learn and predict non-linear relationships effectively.

4 Ways to Learn

1. Read the article: Multi-layer Network

2. Play with the visual tool: Multi-layer Network

3. Watch the video: Multi-layer Network

4. Practice with the code: Multi-layer Network

Previous Article: Backpropagation
Next Article: Key Concepts and Techniques


Muneeb S. Ahmad

Muneeb Ahmad is a Senior Microservices Architect and Recognized Educator at IBM. He is pursuing his passion for ABC (AI, Blockchain, and Cloud).