Deep Learning 101: Lesson 11: Data Preparation for Training Models
This article is part of the “Deep Learning 101” series.
Data is the fuel that powers the learning process of artificial intelligence models. It serves as the foundation upon which these models are built and trained to recognize patterns, make predictions, and generate meaningful insights. The availability and quality of data directly affect the performance of deep learning algorithms. When training a deep learning model, a large and diverse dataset exposes the model to a wide range of examples and variations, enabling it to learn robust and generalizable representations. In general, more data helps the model learn complex patterns and relationships, provided the additional examples are representative of the problem and correctly labeled. Quality matters as much as quantity: noisy or biased data limits the model’s ability to generalize to new, unseen examples.
The subsequent sections will walk you through two distinct examples that illuminate the practical aspects of dataset structure, preparation, and utilization. We’ll explore how a relatively simple dataset can teach a model to classify flowers while a more complex dataset can be used to predict real estate prices. These examples will illustrate the importance of data diversity and quality, and how they empower models to make accurate predictions. By examining these datasets, we will gain insights into how to effectively prepare and leverage data for training purposes, setting the stage for our exploration into the world of machine learning through hands-on examples.
Dataset Example: Iris Flower Classification
The Iris Flower dataset is a classic in the field of machine learning, offering a straightforward classification challenge. This dataset is essential for beginners to familiarize themselves with the concepts of machine learning. It involves predicting the species of an iris flower based on the measurements of its petals and sepals.
Dataset Overview
- Homepage: The Iris Flower dataset can be accessed from the UCI Machine Learning Repository, a popular source for machine learning datasets.
Homepage URL: https://archive-beta.ics.uci.edu/dataset/53/iris
- Sample Data: The dataset contains four features (sepal length, sepal width, petal length, and petal width) and a label for the species of the iris flower, which can be one of three types: Iris-setosa, Iris-versicolor, or Iris-virginica. The sheet below shows a sample of the dataset.
Data Representation
The data sheet above uses a tabular format with five columns: four for the features (sepal length, sepal width, petal length, and petal width) and one for the label (species). For instance:
- A sample with 5.1 cm sepal length, 3.5 cm sepal width, 1.4 cm petal length, 0.2 cm petal width is labeled as Iris-setosa.
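As a rough sketch of how such a table could be turned into features and labels, the snippet below parses a few rows with Python's standard `csv` module. The header names, the species assigned to the last two rows, and the name-to-index mapping are assumptions made for illustration; the measurements are taken from the examples in this article.

```python
import csv
import io

# Sample rows in the column order described above. The header names are
# hypothetical, and the species for the last two rows are assumed.
RAW_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
6.1,2.8,4.7,1.2,Iris-versicolor
7.9,3.8,6.4,2.0,Iris-virginica
"""

# Assumed mapping from species name to integer class index.
SPECIES_TO_INDEX = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

FEATURE_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def load_iris_rows(text):
    """Parse CSV text into a list of feature rows and a list of class indices."""
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(text)):
        features.append([float(row[name]) for name in FEATURE_COLUMNS])
        labels.append(SPECIES_TO_INDEX[row["species"]])
    return features, labels


features, labels = load_iris_rows(RAW_CSV)
print(features[0], labels[0])  # [5.1, 3.5, 1.4, 0.2] 0
```

Converting the species names to integers at load time keeps the rest of the pipeline purely numeric, which is what a neural network expects.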
Deep Learning Model Architecture
The neural network for this classification task begins with an input layer that takes in the four features. It processes the data through one or more hidden layers and outputs the class probabilities in the output layer. For example, the network might predict a 95% probability for Iris-setosa.
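The flow from four input features to three class probabilities can be sketched in plain Python. This is a minimal illustration, not a trained model: the hidden layer size (8), the ReLU activation, and the random weights are all assumptions, so the printed probabilities carry no meaning beyond showing the shape of the computation.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output neuron."""
    return [
        sum(x * w for x, w in zip(inputs, neuron_weights)) + b
        for neuron_weights, b in zip(weights, biases)
    ]


def relu(values):
    """Replace negative activations with zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_in, n_out, rng):
    """Create untrained weights and zero biases for a layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


rng = random.Random(0)
hidden_w, hidden_b = random_layer(4, 8, rng)  # 4 features -> 8 hidden units (assumed size)
output_w, output_b = random_layer(8, 3, rng)  # 8 hidden units -> 3 species

sample = [5.1, 3.5, 1.4, 0.2]
hidden = relu(dense(sample, hidden_w, hidden_b))
probabilities = softmax(dense(hidden, output_w, output_b))
print(probabilities)  # three values, one per species, summing to 1
```

Training would adjust the weights so that the probability for the correct species approaches 1 for each sample.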
Tensors in the Model
- Input/Features Tensor: A rank-2 tensor with a shape that corresponds to the number of samples (n) by 4 (the number of features).
rank: 2
shape: [n, 4]
Tensor:
[[7.9, 3.8, 6.4, 2.0],
[5.1, 3.5, 1.4, 0.2],
[6.7, 2.5, 5.8, 1.8],
…
[6.1, 2.8, 4.7, 1.2],
[6.4, 2.7, 5.3, 1.9],
[5.6, 2.9, 3.6, 1.3],
…]
- Output/Labels Tensor: A rank-2 tensor with shape [n, 1], holding one integer class index (0, 1, or 2) per sample. A one-hot encoding of the same labels would instead have shape [n, 3], with one column per class.
rank: 2
shape: [n, 1]
Tensor:
[[2],
[0],
[2],
…,
[1],
[2],
[1]]
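To make rank and shape concrete, here is a small sketch that uses nested Python lists as stand-in tensors. The helper names `shape_of` and `one_hot` are hypothetical; deep learning frameworks provide their own equivalents. The one-hot conversion is included because a three-way softmax output is commonly compared against labels in that form.

```python
def shape_of(tensor):
    """Return the shape of a regular, non-empty nested list."""
    shape = []
    while isinstance(tensor, list):
        shape.append(len(tensor))
        tensor = tensor[0]
    return shape


def one_hot(labels, num_classes):
    """Convert an [n, 1] tensor of class indices into an [n, num_classes] tensor."""
    encoded = []
    for (index,) in labels:
        row = [0] * num_classes
        row[index] = 1
        encoded.append(row)
    return encoded


# First three samples from the tensors shown above.
features = [[7.9, 3.8, 6.4, 2.0], [5.1, 3.5, 1.4, 0.2], [6.7, 2.5, 5.8, 1.8]]
labels = [[2], [0], [2]]

print(shape_of(features), len(shape_of(features)))  # [3, 4] 2  (shape, rank)
print(shape_of(labels))                             # [3, 1]
print(one_hot(labels, num_classes=3))               # [[0, 0, 1], [1, 0, 0], [0, 0, 1]]
```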
The Iris Flower Classification dataset is an excellent starting point for understanding the basics of machine learning. This dataset, which consists of measurements for sepal length, sepal width, petal length, and petal width, along with species labels, allows us to build and train a neural network for classifying iris flowers into three species. Through the preprocessing, training, and evaluation steps, this example demonstrates how to handle simple yet effective data preparation techniques, network architecture, and model training to achieve accurate classification results.
Dataset Example: House Price Prediction
The Boston House Prices Dataset provides a more complex scenario, ideal for regression tasks in machine learning. It includes various features such as crime rate, average room number, and more, to predict the median value of homes in different Boston areas.
Dataset Overview
- Homepage: The dataset is documented on the University of Toronto's Delve project page linked below, and copies are available on platforms such as Kaggle.
Homepage URL: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html
- Sample Data: Includes 14 attributes: 13 features, such as crime rate (CRIM) and proportion of non-retail business acres per town (INDUS), plus the median value (MEDV) of owner-occupied homes, which serves as the label.
Data Representation
The dataset is presented in a tabular format with each column representing a feature and the last column containing the house prices, serving as the labels for our prediction model.
Deep Learning Model Architecture
The network designed for predicting house prices includes an input layer with one neuron per feature. The hidden layers process the input data, and the output layer yields a single continuous value: the predicted price of the house, such as $15,159 for a given set of features. Because MEDV is recorded in thousands of dollars, that prediction corresponds to a raw network output of about 15.159.
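The difference from the classification case shows up at the output: a single neuron with no activation, typically trained against a squared-error loss. The sketch below illustrates this with made-up hidden activations and weights (both hypothetical); mean squared error is one common choice of loss, not necessarily the one used elsewhere in this series.

```python
def linear_output(hidden, weights, bias):
    """Regression output neuron: a weighted sum with no activation."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def mean_squared_error(predictions, targets):
    """Average squared difference between predicted and true values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical hidden-layer activations for two samples, and hypothetical weights.
hidden_activations = [[1.0, 2.0], [0.5, 1.5]]
weights, bias = [10.0, 5.0], 1.0

predictions = [linear_output(h, weights, bias) for h in hidden_activations]
targets = [24.0, 21.6]  # MEDV values from the sample data

print(predictions)  # [21.0, 13.5]
print(mean_squared_error(predictions, targets))
```

Unlike softmax probabilities, the output is unbounded, so the network can produce any price the data supports.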
Tensors in the Model
- Input/Features Tensor: A rank-2 tensor whose shape is the number of samples by the number of features, [n, 13]. The MEDV column is excluded because it is the label.
rank: 2
shape: [n, 13]
Tensor:
[[0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98],
[0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14],
[0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03],
…
[0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43],
[0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15],
[0.21124,12.5,7.87,0,0.524,5.631,100,6.0821,5,311,15.2,386.63,29.93],
…]
- Output/Labels Tensor: A rank-2 tensor with shape [n, 1], as the output is a single continuous value (MEDV) per sample.
rank: 2
shape: [n, 1]
Tensor:
[[24],
[21.6],
[34.7],
…
[22.9],
[27.1],
[16.5],
…]
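In the raw table, each row carries 14 values: the 13 features followed by MEDV. A minimal sketch of separating such rows into the two tensors described above is shown here; `split_features_labels` is a hypothetical helper name, and the rows are the first three from the sample data.

```python
# Three raw rows from the table: 13 feature values followed by MEDV.
rows = [
    [0.00632, 18, 2.31, 0, 0.538, 6.575, 65.2, 4.09, 1, 296, 15.3, 396.9, 4.98, 24],
    [0.02731, 0, 7.07, 0, 0.469, 6.421, 78.9, 4.9671, 2, 242, 17.8, 396.9, 9.14, 21.6],
    [0.02729, 0, 7.07, 0, 0.469, 7.185, 61.1, 4.9671, 2, 242, 17.8, 392.83, 4.03, 34.7],
]


def split_features_labels(rows):
    """Split each row into its feature values and a one-element label row."""
    features = [row[:-1] for row in rows]    # everything except the last column
    labels = [[row[-1]] for row in rows]     # last column, kept as [n, 1]
    return features, labels


features, labels = split_features_labels(rows)
print(len(features), len(features[0]))  # 3 13
print(labels)                           # [[24], [21.6], [34.7]]
```

Keeping the label out of the features tensor matters: a model that sees the price among its inputs would simply learn to copy it.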
The House Price Prediction dataset offers a more complex scenario than Iris: 13 numeric features with very different ranges (compare a crime rate near 0.006 with a value of 396.9 in the same row) and a continuous target instead of a class. This example illustrates how to work with more extensive and intricate datasets, demonstrating the importance of data preprocessing, feature scaling, and the design of neural network architectures tailored for regression tasks. By training a model on this dataset, we can predict house prices based on given features, showcasing the practical applications of deep learning in real estate valuation.
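Feature scaling can take several forms. One common option, shown here as an assumption and not as the method this series prescribes, is standardization: subtracting each column's mean and dividing by its standard deviation. The sketch uses three columns (CRIM, RM, TAX) from the first three sample rows, and `standardize_columns` is a hypothetical helper name.

```python
import statistics


def standardize_columns(features):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*features))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    scaled = [
        [
            (value - mean) / stdev if stdev > 0 else 0.0  # constant columns map to 0
            for value, mean, stdev in zip(row, means, stdevs)
        ]
        for row in features
    ]
    return scaled, means, stdevs


# CRIM, RM, and TAX values from the first three sample rows.
features = [
    [0.00632, 6.575, 296],
    [0.02731, 6.421, 242],
    [0.02729, 7.185, 242],
]

scaled, means, stdevs = standardize_columns(features)
for row in scaled:
    print([round(value, 3) for value in row])
```

In practice the means and standard deviations are usually computed on the training split only and then reused for validation and test data, so that no information leaks from the held-out samples.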
Summary
Understanding the role of data in training deep learning models is crucial for developing accurate and robust AI systems. High-quality, diverse datasets enable models to learn complex patterns and generalize well to new, unseen examples. The examples of Iris Flower Classification and House Price Prediction highlight the importance of proper dataset preparation and utilization. These foundational concepts set the stage for more advanced machine learning applications, demonstrating how to leverage data effectively for training deep learning models.