Deep Learning 101: Lesson 1: Data Scaling

Muneeb S. Ahmad
4 min read · Aug 28, 2024

This article is part of the “Deep Learning 101” series. Explore the full series for more insights and in-depth learning.

Data scaling is a critical step in preparing data for machine learning and AI. Its purpose is to bring features measured on different scales onto a comparable range, so that no feature dominates the model simply because its values are numerically larger, which could otherwise bias predictions. Scaling also tends to help the optimization algorithms used to train deep neural networks converge. Normalization and standardization are two commonly used data scaling methods in AI and machine learning.

Figure 1: Comparison of Original, Normalized, and Standardized Data

Normalization, also called min-max scaling, rescales feature values to a fixed range, typically [0, 1] or [-1, 1]. Because that range is set by the minimum and maximum values, the method is sensitive to outliers. In contrast, standardization shifts and rescales the data to have a mean of zero and a standard deviation of one. It does not change the shape of the distribution, so the result follows a standard normal distribution only if the original data were normally distributed. The worked example that follows applies both methods to a small data set, giving you a useful guide for applying them to data preparation.

Let’s say we need to convert the following data, a list of test scores for a group of students, into its normalized and standardized forms.

Student scores:

85, 86, 100, 76, 81, 93, 84, 99, 71, 69, 93, 85, 81, 87, 89

Below is the formula for calculating the normalized form:

$$\text{normalized} = \frac{\text{data} - \text{min}}{\text{max} - \text{min}}$$

In the above equation, the symbol ‘data’ represents each data element in the data set, min is the minimum value and max is the maximum value of the data set. From the above data set, we can obtain the min and max values as shown below:

min = 69

max = 100

Using these values for min, max, and data, we can calculate the normalized data set, rounded to 2 decimal places, as follows:

normalized dataset:

0.52, 0.55, 1.00, 0.23, 0.39, 0.77, 0.48, 0.97, 0.06, 0.00, 0.77, 0.52, 0.39, 0.58, 0.65
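
The min-max calculation above can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's own code: the function name `min_max_scale` and its `low`/`high` parameters are assumptions introduced here. For the first score, the arithmetic is (85 - 69) / (100 - 69) = 16 / 31 ≈ 0.52.

```python
scores = [85, 86, 100, 76, 81, 93, 84, 99, 71, 69, 93, 85, 81, 87, 89]


def min_max_scale(data, low=0.0, high=1.0):
    """Linearly rescale data so its minimum maps to low and its maximum to high.

    Hypothetical helper for illustration only. It assumes data is non-empty
    and contains at least two distinct values (otherwise max - min is zero).
    """
    data_min, data_max = min(data), max(data)
    if data_max == data_min:
        raise ValueError("cannot scale data whose values are all identical")
    span = data_max - data_min
    return [low + (x - data_min) * (high - low) / span for x in data]


# Default target range is [0, 1], matching the worked example.
normalized = min_max_scale(scores)
print([f"{v:.2f}" for v in normalized])
```

Calling `min_max_scale(scores, low=-1.0, high=1.0)` would give the [-1, 1] variant of the same scaling.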

Next, let’s convert the data set to its standardized form. Below is the formula:

$$\text{standardized} = \frac{\text{data} - \text{mean}}{\text{std}}$$

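In the equation above, the symbol “data” represents each data item in the data set, “mean” is the mean, and “std” is the standard deviation of the data set. Below are the formulas for calculating the mean and standard deviation, where N is the number of items in the data set (here N = 15):

$$\text{mean} = \frac{1}{N}\sum_{i=1}^{N} \text{data}_i$$

$$\text{std} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(\text{data}_i - \text{mean}\right)^2}$$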

Using the data set above and applying these formulas, we calculate the mean and standard deviation (std) as shown below:

mean = 85.27

std = 8.70
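
As a quick check, these two values can be reproduced with Python's standard `statistics` module. The figure of 8.70 corresponds to the population standard deviation (dividing by N), which `statistics.pstdev` computes. `statistics.stdev` gives the sample version (dividing by N - 1), which is slightly larger. The variable names below are illustrative.

```python
import statistics

scores = [85, 86, 100, 76, 81, 93, 84, 99, 71, 69, 93, 85, 81, 87, 89]

mean = statistics.fmean(scores)        # arithmetic mean
std = statistics.pstdev(scores)        # population standard deviation (divides by N)
sample_std = statistics.stdev(scores)  # sample standard deviation (divides by N - 1)

print(f"mean = {mean:.2f}")
print(f"std = {std:.2f}")
print(f"sample std = {sample_std:.2f}")
```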

By substituting these mean and std values into the Standardized Data Set formula, we get the converted data set below.

standardized dataset:

-0.03, 0.08, 1.69, -1.07, -0.49, 0.89, -0.15, 1.58, -1.64, -1.87, 0.89, -0.03, -0.49, 0.20, 0.43
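
The standardization step can be sketched the same way. This is a minimal illustration under stated assumptions: `standardize` is a hypothetical helper name, and it uses the population standard deviation so that the output matches the values above. The final two checks confirm that the result has a mean of approximately zero and a standard deviation of approximately one.

```python
import math
import statistics

scores = [85, 86, 100, 76, 81, 93, 84, 99, 71, 69, 93, 85, 81, 87, 89]


def standardize(data):
    """Return (x - mean) / std for each x, using the population std.

    Hypothetical helper for illustration only. It assumes data contains
    at least two distinct values (otherwise std is zero).
    """
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)
    if std == 0:
        raise ValueError("cannot standardize data with zero standard deviation")
    return [(x - mean) / std for x in data]


standardized = standardize(scores)
print([f"{z:.2f}" for z in standardized])

# Sanity checks: mean close to 0 and population std close to 1.
mean_is_zero = math.isclose(statistics.fmean(standardized), 0.0, abs_tol=1e-9)
std_is_one = math.isclose(statistics.pstdev(standardized), 1.0, abs_tol=1e-9)
print(mean_is_zero, std_is_one)
```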

Summary

Data scaling is an essential preprocessing step in machine learning and AI that puts features on a comparable scale so that no single feature dominates the model. Normalization, or min-max scaling, rescales feature values to a range such as [0, 1] or [-1, 1], but it is sensitive to outliers because the range is set by the minimum and maximum values. Standardization shifts and rescales the data to have a mean of zero and a standard deviation of one without changing the shape of its distribution. Scaled inputs also tend to help the optimization algorithms used to train neural networks converge, which can improve model accuracy. This article illustrates these concepts by converting a list of student test scores into normalized and standardized forms, providing a clear guide for applying these techniques in data preparation.

4 Ways to Learn

1. Read the article: Data Scaling

2. Play with the visual tool: Data Scaling

3. Watch the video: Data Scaling

4. Practice with the code: Data Scaling

Next Article: Linear Regression

Muneeb S. Ahmad

Muneeb Ahmad is a Senior Microservices Architect and Recognized Educator at IBM. He is pursuing his passion for ABC (AI, Blockchain, and Cloud).