Deep Learning 101: Lesson 12: Activation Functions
This article is part of the “Deep Learning 101” series. Explore the full series for more insights and in-depth learning here.
An “activation function” is a crucial component of a neural network that helps introduce non-linearity and enables the network to learn complex patterns and relationships in data. Think of it as a mathematical function that takes the weighted sum of inputs from the previous layer and applies a transformation to produce an output for each neuron. It acts as a decision-maker, determining whether the neuron should be activated or not based on the input it receives. Activation functions like the sigmoid function or the rectified linear unit (ReLU) function are commonly used. By applying these functions, the network becomes capable of learning and modeling complicated relationships in the data, enhancing its ability to solve complex problems and make accurate predictions.
Consider a basic neuron: inputs x1, x2, x3 are fed to it with weights w1, w2, and w3. The weighted sum is calculated as Σ (xi · wi). Next, we apply the activation function, which produces the output. This output is either the final output or is fed into the next layer of the network. There are several different activation functions available because different functions have unique properties that make them suitable for specific tasks. Some functions, like the sigmoid function, squash the input values between 0 and 1, which is helpful for tasks that involve probabilities or binary decisions. Other functions, like the rectified linear unit (ReLU), are simpler and computationally efficient, making them effective for many tasks.
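As a concrete illustration, here is a minimal NumPy sketch of this computation. The input values, weights, and the choice of ReLU as the activation are illustrative assumptions, not values from the article.

```python
import numpy as np

def neuron_output(x, w, b, activation):
    """Weighted sum of inputs plus bias, passed through an activation function."""
    z = np.dot(x, w) + b       # z = sum_i x_i * w_i + b
    return activation(z)       # the activation decides the neuron's output

# Illustrative inputs x1, x2, x3 and weights w1, w2, w3
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2, 0.1])
relu = lambda z: np.maximum(0.0, z)

print(neuron_output(x, w, b=0.0, activation=relu))  # 0.5 - 0.4 + 0.3 = 0.4
```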
Before we explore the specifics of some well-known activation functions, it’s essential to understand why different functions are used in various scenarios. Activation functions are not one-size-fits-all; they are selected based on the specific needs of the neural network and the problem it is designed to solve. The choice of an activation function can significantly impact the performance and learning ability of a neural network. Some functions are better suited for certain types of data and tasks, such as classification or regression, while others excel in different aspects like computational efficiency or the ability to handle non-linear relationships. Understanding the properties and behaviors of each activation function helps in making informed decisions about which one to use in a given neural network architecture. This choice is crucial for optimizing the network’s performance and ensuring efficient learning during the training process. Below are some commonly used activation functions, followed by a short code sketch of them after the list.
Linear Activation Function: This function passes the input through unchanged (up to scaling), without applying any non-linear transformation. It’s useful in situations where we want to preserve the numerical values of the input.
Sigmoid Function: Often used in binary classification problems, the sigmoid function squashes input values into a range between 0 and 1, resembling a probability.
Softmax Function: Extending the concept of the sigmoid function, softmax is used primarily in multi-class classification problems. It converts the outputs into probability distributions, where the sum of all probabilities is 1.
Tanh (Hyperbolic Tangent) Function: Similar to the sigmoid but with a range from -1 to 1, the tanh function is useful when we want to center the data, maintaining a zero mean.
ReLU (Rectified Linear Unit) Function: Popular in deep learning, ReLU is simple yet effective, converting all negative inputs to zero while maintaining positive inputs as they are.
ELU (Exponential Linear Unit) Function: ELU is similar to ReLU but tries to make the mean activations closer to zero, speeding up learning. It allows a small gradient when the unit is not active.
Hard Sigmoid Function: A computationally efficient approximation of the sigmoid function, it’s used in situations where computational resources are limited.
Softplus Function: This function provides a smooth approximation to the ReLU function. It’s differentiable and is used in scenarios where a differentiable approximation of ReLU is needed.
Softsign Function: This function has an ‘S’ shape similar to tanh but is computed as x / (1 + |x|). It saturates more gradually than tanh, offering a gentler transition.
ReLU6 Function: A variant of the ReLU function, ReLU6 caps the maximum output value at 6. This is particularly useful in scenarios where a bounded output is necessary.
SELU (Scaled Exponential Linear Unit) Function: SELU multiplies the ELU function by a fixed scale factor chosen so that, with suitable weight initialization, activations tend to keep a stable mean and variance across layers, which improves performance in certain deep learning scenarios.
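For readers who prefer code, below is a minimal NumPy sketch of the functions listed above. The hard sigmoid follows one common convention (a line with slope 0.2 clipped to [0, 1]) and the SELU constants are the commonly published approximate values; individual frameworks may differ slightly in these details.

```python
import numpy as np

def linear(x):        return x                                  # identity mapping
def sigmoid(x):       return 1.0 / (1.0 + np.exp(-x))           # squashes to (0, 1)
def tanh(x):          return np.tanh(x)                         # squashes to (-1, 1)
def relu(x):          return np.maximum(0.0, x)                 # zero for negative inputs
def relu6(x):         return np.minimum(relu(x), 6.0)           # ReLU capped at 6
def softplus(x):      return np.log1p(np.exp(x))                # smooth approximation of ReLU
def softsign(x):      return x / (1.0 + np.abs(x))              # gentler tanh-like curve
def hard_sigmoid(x):  return np.clip(0.2 * x + 0.5, 0.0, 1.0)   # cheap sigmoid approximation

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))        # small negative tail instead of 0

def selu(x, alpha=1.6733, scale=1.0507):                        # approximate published constants
    return scale * elu(x, alpha)

def softmax(x):
    e = np.exp(x - np.max(x))                                   # shift for numerical stability
    return e / e.sum()                                          # outputs sum to 1
```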
Let’s explore some of these activation functions in more detail.
Linear Activation Function
The Linear activation function, exemplified in the provided graph, is fundamentally a straight-line mapping: the output is a linear transformation of the input. Mathematically, this is represented as f(x) = w · x + b for a single input/output scenario, where x is the input to the neuron, w denotes the weight, and b is the bias. If we set the bias b to zero and assign the weight w a value of 1, the function effectively becomes the identity function f(x) = x, indicating that the output is identical to the input. Therefore, an input of 2 yields an output of 2, as the graph below illustrates with a slope of 1.
Please note that altering the weight changes the steepness of the function’s slope. For instance, if the weight w is adjusted to 2, the relationship between the input and output doubles, making the function f(x) = 2x. Consequently, an input of 2 will now result in an output of 4, demonstrating that the output is twice the input, as shown in the chart below.
This linear transformation allows the model to scale and shift the input data, providing a fundamental building block in neural networks for tasks where the proportionality of input and output is essential.
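A tiny sketch of this behavior, using the same numbers as the example above (the weight values are illustrative):

```python
def linear_activation(x, w=1.0, b=0.0):
    """Linear activation: the output is simply w * x + b."""
    return w * x + b

print(linear_activation(2))          # 2.0 -> identity function (w = 1, b = 0)
print(linear_activation(2, w=2.0))   # 4.0 -> doubling the weight doubles the slope
```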
Sigmoid Function
In the realm of neural networks, the sigmoid function plays a pivotal role in shaping the curve of decision — a transition from one state to another. Visually represented in the attached graph, the sigmoid function showcases a characteristic ‘S’-shaped curve that elegantly transitions from a near-zero value to a value close to one.
Below is the simple form of the sigmoid function:
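σ(x) = 1 / (1 + e^(−x))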
This is a beautiful mathematical representation where the function’s output is clearly bounded between 0 and 1, regardless of the input value. This boundedness makes the sigmoid function particularly useful for binary classification tasks, where the output can be interpreted as a probability: the likelihood of the input belonging to one class or the other.
The sigmoid function serves as a critical transformation in machine learning, converting any real-valued number into a value between 0 and 1, thereby framing it as a probability-like output. Its graph displays an ‘S’-shaped curve, which compresses large positive inputs towards 1 and large negative inputs towards 0, ensuring outputs remain within the bounds of a probability range. The central part of the curve, characterized by a steep slope, acts as a sensitive detector of input variations, making it especially useful for binary classification tasks. This region, where the sigmoid function mimics a linear behavior, allows for a clear distinction between outputs, thereby providing a sharp transition from lower to higher probability outcomes.
Within a neural network, where a neuron receives a weighted combination of many inputs rather than a single raw value, the generalized form of the sigmoid function incorporates weights and a bias, adjusting the curve’s steepness and position. Below is the generalized form of the equation for the sigmoid function:
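σ(w · x + b) = 1 / (1 + e^(−(w · x + b)))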
This allows the function to be more flexible. Here w · x represents the weighted sum of the inputs, and b is the bias, which shifts the curve laterally. This form maintains the same characteristic ‘S’ shape but enables the function to adapt to the specific data and decision boundaries of the task at hand.
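A minimal sketch of such a sigmoid neuron, assuming illustrative input, weight, and bias values (not taken from the article):

```python
import numpy as np

def sigmoid_neuron(x, w, b):
    """Squash the weighted sum w . x + b into (0, 1), a probability-like output."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.4, 0.7])     # input features (illustrative)
w = np.array([1.5, -2.0])    # weights
b = 0.3                      # bias shifts the 'S' curve left or right

print(sigmoid_neuron(x, w, b))   # about 0.378, i.e. a probability below 0.5
```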
The table below presents the mathematical expressions for some of the most prominent activation functions used in neural networks today, together with their derivatives. This pairing of functions with their derivatives is more than a reference: it is a map of how neural networks learn and adapt to complex data patterns. Understanding these equations is fundamental for anyone looking to delve into neural network design and the intricacies of machine learning algorithms.
Figure 5: Common Activation Functions and Their Derivatives
Figure 5 presents a comparison of common activation functions and their derivatives. It serves as a critical reference for understanding how different functions impact neural network learning and behavior. The derivatives highlight the sensitivity of each function to input changes, which is essential for backpropagation and optimizing neural network weights. This understanding aids in selecting appropriate activation functions based on the specific requirements of the neural network and the nature of the data.
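As a quick reference, here are the standard forms and derivatives of several of the functions covered in this article:

Sigmoid: f(x) = 1 / (1 + e^(−x)); f′(x) = f(x)(1 − f(x))
Tanh: f(x) = tanh(x); f′(x) = 1 − tanh²(x)
ReLU: f(x) = max(0, x); f′(x) = 1 for x > 0, 0 for x < 0
Linear: f(x) = w · x + b; f′(x) = w
Softplus: f(x) = ln(1 + e^x); f′(x) = 1 / (1 + e^(−x)), the sigmoid
ELU: f(x) = x for x > 0, α(e^x − 1) otherwise; f′(x) = 1 for x > 0, α e^x otherwise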
Summary
Activation functions play a crucial role in neural networks by introducing non-linearity, enabling the network to learn complex patterns and relationships in data. They transform the weighted sum of inputs from the previous layer into an output for each neuron. Various activation functions, including linear, sigmoid, softmax, tanh, ReLU, ELU, hard sigmoid, softplus, softsign, ReLU6, and SELU, each have unique properties that make them suitable for specific tasks. These functions help neural networks model complicated relationships, optimize performance, and ensure efficient learning during training.
4 Ways to Learn
1. Read the article: Activation Functions
2. Play with the visual tool: Activation Functions
3. Watch the video: Activation Functions
4. Practice with the code: Activation Functions
Previous Article: Data Preparation for Training Models
Next Article: Optimizers