Introduction to Supervised Machine Learning

Bashir Alam

Supervised Machine Learning is a family of algorithms that learn from historical labeled data and then predict outputs for new, unseen inputs. Because it is often accurate and comparatively cheap to train, it is one of the most common types of machine learning. Spam filtering, facial recognition, disease identification, and fraud detection are among the most common problems that Supervised learning can solve. This article covers the types of Supervised Machine Learning, the datasets involved, and the steps of the learning process.

In Supervised Machine Learning, you train the model using a labeled dataset containing inputs and corresponding outputs. Here’s what the training process looks like:

[Figure: the Supervised Machine Learning training process]

The labeled data contains pictures of dogs and cats (it’s our job to label each picture as a dog or a cat). The algorithm uses this information for training. Once the model is trained, it can take new input data (unlabeled data/test data) and classify each new picture as a dog or a cat.

Datasets in Supervised Machine Learning

A dataset in machine learning is a collection of data items that a computer can treat as a single unit for analysis and prediction. That means the gathered data should be uniform and understandable by a machine, which does not see data in the same way humans do. In this article, we will take a sample dataset, apply the splitting method, and divide it into a training dataset and a testing dataset.

Here is a simple dataset of hours and scores, where hours is the input (independent variable) and scores is the output (dependent variable).

You can download this dataset from this link.

[Figure: the hours and scores dataset]

Let’s visualize the given data using Python and the matplotlib library. We will use the pandas module to import the dataset and split it into inputs and outputs. We’ll use Amazon SageMaker and Jupyter Notebook to process and visualize the dataset.

# importing the modules
import pandas as pd
import matplotlib.pyplot as plt

# importing the dataset
dataset = pd.read_csv('hours_and_scores.csv')

# inputs: every column except the last one (kept 2-D for scikit-learn)
X = dataset.iloc[:, :-1].values

# outputs: the second column (index 1) as a 1-D array
y = dataset.iloc[:, 1].values

# visualization part
plt.scatter(X, y, color='red')
plt.title('Hours vs Score')
plt.xlabel('Hours')
plt.ylabel('Score')
plt.show()


[Figure: scatter plot of hours vs. score]

Note: the data seems to have a positive linear correlation because as the input value increases, the output also increases. In Supervised Machine Learning, we feed such a dataset to the machine, and it finds the relation between the inputs and outputs to train itself.
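One way to put a number on that visual impression is the Pearson correlation coefficient. The sketch below uses a few made-up hours and scores values (they are not taken from the article's CSV file) and NumPy's corrcoef function; a value close to 1 suggests a strong positive linear relationship.

```python
import numpy as np

# Hypothetical sample values, for illustration only (not the article's CSV file)
hours = np.array([1.5, 2.5, 3.2, 4.5, 5.1, 6.9, 7.8, 9.2])
scores = np.array([20, 27, 30, 41, 47, 66, 76, 88])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry
# is the Pearson correlation between hours and scores
r = np.corrcoef(hours, scores)[0, 1]
print(f"Pearson correlation: {r:.3f}")
```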

In Supervised Machine Learning, we continually train the machine using the training data set and evaluate the training performance using the testing dataset.

Training dataset in Supervised Machine Learning

The training dataset is the data we use to train Machine Learning models. We feed it to Machine Learning algorithms to teach the machine how to make predictions for similar data in the future. The ML algorithms use the training data to find relationships between inputs and outputs and to make decisions based on them. The model generally works better when the training data is sufficient in both quantity and quality.

The data used for training can be classified as:

  • Labeled data: a set of data samples tagged with one or more relevant labels. Labels describe certain qualities, traits, categories, or characteristics. In simple words, labeled data has inputs and corresponding outputs. In Supervised learning, machine learning models use labeled training data to learn the traits associated with specific labels, which can then be used to classify new data points.
  • Unlabeled data: the opposite of labeled data. It is raw data that hasn’t been labeled with any classifications, features, or attributes. It is used in Unsupervised Machine Learning, where ML models have to detect patterns or similarities in the data on their own.
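The difference can be sketched with two tiny, made-up tables: the labeled one carries a target column, and the unlabeled one does not. The column names (hours, passed) and the values are assumptions chosen only for illustration.

```python
import pandas as pd

# Labeled data: every row has an input (hours) and a known output (passed)
labeled = pd.DataFrame({
    "hours": [1.0, 2.5, 4.0, 6.5],
    "passed": ["no", "no", "yes", "yes"],
})

# Unlabeled data: only the inputs are known
unlabeled = pd.DataFrame({"hours": [3.0, 5.5]})

# For supervised learning, separate the inputs (X) from the labels (y)
X = labeled[["hours"]]
y = labeled["passed"]
print(X.shape, y.shape)
print("Label column present in unlabeled data:", "passed" in unlabeled.columns)
```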

While training the model in Supervised Machine Learning, the dataset is usually divided into a training dataset and a testing dataset. We will use the sklearn module to split our dataset into these two parts:

# importing train_test_split method from sklearn
from sklearn.model_selection import train_test_split

# 30% data for testing, random state 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)

This piece of Python code will split our dataset into the training and the testing parts.

The following parameters are provided to the train_test_split() method:

  • X contains all the input values of our dataset.
  • y contains all the output values of our dataset.
  • test_size specifies the fraction of the original dataset to use as testing data. In the example above, we have set it to 0.3, which means 30% of the whole dataset will be assigned to the testing part and the remaining 70% will be assigned to the training part.
  • random_state controls the shuffling applied to the data before the split. It is a seed for the random number generator: passing any fixed integer (0 in our example) makes the split reproducible across runs, while leaving it unset gives a different split each time. To turn shuffling off altogether, pass shuffle=False.

The X_train variable will contain 70% of the input values for training purposes, and y_train will contain the corresponding 70% of the output values. Supervised Machine Learning uses these two arrays to train the model.
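To see how these parameters behave, here is a minimal sketch on ten made-up samples (not the hours and scores file). It checks the sizes of the two parts, shows that repeating the call with the same random_state reproduces the split, and shows that shuffle=False keeps the original row order.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with one input feature each
X = np.arange(10).reshape(-1, 1)
y = np.arange(10) * 10

# 30% of 10 samples gives 3 test samples and 7 training samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(X_train.shape, X_test.shape)

# The same random_state reproduces exactly the same split
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.3, random_state=0)
print(np.array_equal(X_train, X_train_again))

# shuffle=False keeps the original order: the last 30% becomes the test set
_, X_test_ordered, _, _ = train_test_split(X, y, test_size=0.3, shuffle=False)
print(X_test_ordered.ravel())
```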

Testing dataset in Supervised Machine Learning

After the model is built and trained, the testing data is used to check how well it predicts. The testing dataset is a separate part of the data that the model never sees during training. In simple words, the model receives only the input values of the testing dataset, while the known outputs are held back for comparison. This gives an unbiased estimate of the final model’s performance in terms of precision, accuracy, and other metrics that tell us how well the model predicts unknown outputs.

In Supervised Machine Learning, the testing data is provided so that the model can make predictions and give us the predicted outputs. These outputs are then compared with the actual outputs, which helps to calculate the performance of the Supervised Learning model.

The X_test variable contains the input values of the testing dataset, while the y_test variable contains the corresponding outputs.
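The comparison between predicted and actual outputs could look like the sketch below. It uses synthetic hours/scores data and a LinearRegression model, both chosen here only for illustration; the article itself does not prescribe a particular algorithm or metric.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical hours/scores data (score is roughly 10 * hours plus noise)
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(40, 1))
y = 10 * X.ravel() + rng.normal(0, 2, size=40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Train on the training part only
model = LinearRegression().fit(X_train, y_train)

# The model sees only X_test; y_test is held back for the comparison
y_pred = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```

A low mean absolute error and an R^2 close to 1 on the testing data suggest that the model generalizes to inputs it has not seen.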

Dataset visualization and splitting using AWS SageMaker Studio

Amazon SageMaker Studio is a fully managed service that allows data scientists and developers to build, train, and deploy machine learning models quickly and at scale. It is a web-based integrated development environment (IDE) for machine learning that lets you build, train, debug, deploy, and monitor your Machine Learning models. Amazon SageMaker Studio includes Jupyter notebooks, which we’ll be using to run all our examples.

The best thing about Amazon SageMaker Studio is that we can directly clone any repository from GitHub and start working on it. We can create a new notebook by clicking on the File menu on the top left and then selecting the new notebook option. Or we can manually create a new directory and start working from scratch.

Once the Jupyter notebook is created, we can upload our dataset and start the visualization part. To import, visualize, and split the dataset, we’ll need the following Python modules:

  • pandas
  • sklearn
  • matplotlib

If they are not installed, you can run the following commands in a cell of the SageMaker Studio notebook to install them (the sklearn module is distributed as the scikit-learn package):

%pip install pandas
%pip install matplotlib
%pip install scikit-learn

Once the modules are installed successfully, we can jump to the visualization part. We will write the same Python code for the visualization of our data.

# importing the modules
import pandas as pd
import matplotlib.pyplot as plt

# importing the dataset
dataset = pd.read_csv('hours_and_scores.csv')

# inputs: every column except the last one (kept 2-D for scikit-learn)
X = dataset.iloc[:, :-1].values

# outputs: the second column (index 1) as a 1-D array
y = dataset.iloc[:, 1].values

# visualization part
plt.scatter(X, y, color='red')
plt.title('Hours vs Score')
plt.xlabel('Hours')
plt.ylabel('Score')
plt.show()


[Figure: the dataset visualization in SageMaker Studio]

Similarly, we can use the same code to split our dataset into the training and testing parts.

Let’s print X_train and y_train:

# importing train_test_split method from sklearn
from sklearn.model_selection import train_test_split

# 30% data for testing, random state 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
print("Input values:", X_train)
print("Output values:",y_train)


[Figure: output of the dataset split in SageMaker Studio]

X_train is a two-dimensional array because each sample can have multiple input features. In our data, there is a single input for each output, so the array has just one column.
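The shape convention can be checked with a few made-up values. This sketch assumes only NumPy; the numbers are illustrative and not taken from the dataset.

```python
import numpy as np

# Hypothetical values: five samples of a single input feature
hours = np.array([1.5, 3.2, 4.5, 6.9, 9.2])
print(hours.shape)          # one-dimensional: (5,)

# scikit-learn estimators expect inputs shaped (n_samples, n_features)
X = hours.reshape(-1, 1)
print(X.shape)              # two-dimensional: (5, 1)

# With two input features per sample, the same layout simply gets wider
X_two_features = np.column_stack([hours, hours ** 2])
print(X_two_features.shape)
```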

Types of Supervised Machine Learning

On a broader scale, Supervised Machine Learning can be divided into two main classes: Classification and Regression. When the output variable is discrete or categorical, the task is called classification, and when the output variable takes continuous values, it is called regression.


Classification

Classification is a type of Supervised Machine Learning that learns the relationship between input features and an output variable in order to separate the data into classes or groups, for example, yes/no, dog/cat, True/False, or disease/no disease. Most algorithms work on numbers, so the dataset we feed to the model is usually encoded numerically rather than as raw words or phrases.

There are two classes of classification problems in Supervised learning:

  • Binary classification: only two possible output classes are available, for example, yes/no, true/false, 1/0.
  • Multi-class classification: more than two possible output classes are available, for example, grades in class, types of movies, types of diseases, etc.
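Both cases can be sketched with the Iris dataset that ships with scikit-learn. The choice of dataset and of LogisticRegression as the classifier is an assumption made for illustration; any classifier would show the same distinction.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Iris has three flower species, so it is a multi-class problem
X, y = load_iris(return_X_y=True)
multi_clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Multi-class labels:", multi_clf.classes_)

# Keeping only species 0 and 1 turns it into a binary problem
mask = y < 2
binary_clf = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
print("Binary labels:", binary_clf.classes_)
```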


Regression

Regression learns relationships and correlations between input features and a continuous output variable in order to predict that variable. For example, predicting temperature from weather conditions or a stock price from other stocks. In contrast to classification, the output can take infinitely many possible values.

There are two classes of regression problems in Supervised learning:

  • Simple regression: there is only one input variable and one corresponding output, often with a linear relationship between them. For example, height depending on age or car velocity depending on acceleration.
  • Multiple regression: there are multiple independent variables and one dependent variable in the dataset, for example, in market trend analysis or weather forecasting.

The main difference between regression and classification is that the output variable in regression is numerical (or continuous), while in classification it is categorical (or discrete).
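As a rough illustration of the two regression cases, the sketch below fits scikit-learn's LinearRegression to noise-free, made-up data with one input and then with two inputs. The formulas used to generate the data are assumptions chosen so that the learned coefficients are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Simple regression: one input feature, y = 3*x + 5 (noise-free, made-up data)
x = rng.uniform(0, 10, size=(50, 1))
y_simple = 3 * x.ravel() + 5
simple = LinearRegression().fit(x, y_simple)
print(simple.coef_, simple.intercept_)

# Multiple regression: two input features, y = 2*x1 - 4*x2 + 1
X_multi = rng.uniform(0, 10, size=(50, 2))
y_multi = 2 * X_multi[:, 0] - 4 * X_multi[:, 1] + 1
multi = LinearRegression().fit(X_multi, y_multi)
print(multi.coef_, multi.intercept_)
```

In both cases the fitted coefficients recover the values used to generate the data, and the outputs are continuous numbers rather than class labels.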

Steps involved in Supervised Machine Learning

The following simple steps are involved in Supervised Machine Learning:

  • Review the dataset and understand what kind of prediction our model is going to make (classification or regression).
  • Label the dataset.
  • Split the dataset into training and testing datasets.
  • Define the input features of the training dataset (features should carry enough information for the model to predict the output accurately).
  • Choose a suitable algorithm for the model depending on the type of dataset.
  • Apply the algorithm to the training dataset.
  • Evaluate the accuracy of the model using the test dataset. If the predictions are close enough to the known outputs, the model can be considered for production use; otherwise, revisit the data, the features, or the algorithm.
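The steps above can be sketched end to end in a few lines. The example assumes the Iris dataset bundled with scikit-learn and a k-nearest-neighbors classifier; both are illustrative choices, not the only way to follow the steps.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Review and label: Iris is already labeled, and the task is classification
X, y = load_iris(return_X_y=True)

# Split into training and testing parts (stratify keeps class proportions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Choose the features (all four here) and an algorithm, then train
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Evaluate on data the model has not seen during training
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```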

Advantages and disadvantages of Supervised Machine Learning

There can be many pros and cons of Supervised Machine Learning depending on the type of problem and dataset.


Advantages

  • The model can predict the result based on prior experience
  • In Supervised learning, we have a clear idea of the output classes
  • Supervised learning is a simple process to understand. By contrast, in unsupervised learning we don’t always know what’s going on within the machine and how it is learning
  • We don’t need to keep the training data in memory after training is complete. Instead, the decision boundary can be kept as a mathematical formula
  • It helps us to optimize performance criteria using experience
  • Supervised learning can be very helpful in classification problems
  • We can use a Supervised learning model to handle a variety of real-world problems, such as fraud detection and spam filtering
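The point about not keeping the training data in memory can be illustrated with a linear model, where the learned "formula" is just a slope and an intercept. The data below is made up, and the sketch assumes scikit-learn's LinearRegression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data following y = 2*x + 1
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(X_train, y_train)

# After training, the learned formula is just a slope and an intercept
slope, intercept = model.coef_[0], model.intercept_
del X_train, y_train  # the training data is no longer needed

# Predicting by hand with the stored formula matches model.predict
x_new = 10.0
manual = slope * x_new + intercept
print(manual, model.predict([[x_new]])[0])
```

This holds for models that summarize the data in a fixed set of parameters. Some algorithms, such as k-nearest neighbors, do keep the training samples in order to make predictions.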


Disadvantages

  • Supervised learning models can struggle with very complicated tasks, especially when labeled data is scarce
  • If the test data differs from the training dataset, Supervised learning might not be able to predict the proper output
  • When it comes to classification, if we provide an input that is not from any of the classes in the training data, the result could be an incorrect class label. Let’s imagine we used data from cats and dogs to train an image classifier. If we give it a giraffe image, the output could be either a cat or a dog, which is incorrect
  • Training can require a lot of computation time
  • It cannot cluster or classify data by discovering its features on its own
  • Input features that aren’t relevant to the training data could lead to inaccurate outcomes
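The giraffe problem can be reproduced on a toy scale. In this sketch, two made-up numeric features stand in for images, and LogisticRegression stands in for the image classifier; both are assumptions for illustration. Whatever input we pass in, the model can only answer with a label it saw during training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 2-feature data standing in for "cat" and "dog" images
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [3.0, 3.0], [3.2, 2.9], [2.8, 3.1]])
y = np.array(["cat", "cat", "cat", "dog", "dog", "dog"])

clf = LogisticRegression().fit(X, y)

# A point far away from both groups (our stand-in for a giraffe)
giraffe = np.array([[50.0, -40.0]])
prediction = clf.predict(giraffe)[0]

# The model can only answer with a label it has seen during training
print(prediction, prediction in clf.classes_)
```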

AWS Machine Learning services

Amazon Web Services is one of the most popular public cloud providers, offering a wide range of cloud services and technologies. AWS provides a broad and deep set of Machine Learning and AI services for different businesses. The AWS Machine Learning tools and services are primarily intended to help customers overcome the obstacles that prevent developers from fully utilizing the power of ML. In addition to that, AWS offers Amazon SageMaker Studio for creating, training, and deploying ML models more quickly at scale. SageMaker users can also create custom models while maintaining compatibility with major open-source frameworks.

Here’s a list of some of the AWS services built with machine learning support:

  • Amazon Fraud Detector simplifies the time-consuming and costly steps of building, training, and deploying an ML model for fraud detection, allowing customers to take advantage of the technology more quickly. The accuracy of models created by Amazon Fraud Detector is higher than that of current one-size-fits-all machine learning solutions since each model is customized to a customer’s specific dataset.
  • Amazon HealthLake: In the AWS Cloud, healthcare providers can utilize HealthLake to store, transform, query, and analyze data. We can evaluate unstructured clinical material from a variety of sources using the HealthLake integrated medical natural language processing (NLP) capabilities.
  • Amazon Lex is a service for integrating speech and text-based conversational interfaces into any application. Lex offers advanced deep learning capabilities such as automatic speech recognition (ASR) for converting speech to text and natural language understanding (NLU) for recognizing the text’s intent, allowing you to create apps with highly engaging user experiences and lifelike conversational interactions.
  • Amazon Lookout for Equipment analyzes data from our equipment’s sensors to automatically develop a machine learning model for our equipment using just our data.
  • Amazon Lookout for Metrics: Anomalies (i.e. outliers from the norm) in business and operational data, such as a rapid drop in sales revenue or customer acquisition rates, are automatically detected and diagnosed by Amazon Lookout for Metrics.
  • Amazon Lookout for Vision is a machine learning service that uses computer vision to detect flaws and anomalies in visual representations.  Manufacturing companies can improve quality and cut costs with Amazon Lookout for Vision by swiftly detecting discrepancies in images of products at scale.
  • Amazon Monitron is a complete solution that employs machine learning to detect anomalous behavior in industrial machinery, allowing us to perform predictive maintenance and minimize unexpected downtime.


Summary

Supervised Machine Learning is a type of Machine Learning that trains a model on historical labeled data in order to make predictions. It is widely regarded as one of the most accurate ML approaches when enough labeled data is available. Real-life examples of Supervised Machine Learning include spam filtering, recommendation systems, weather prediction, and stock value prediction. In this article, we covered Supervised Machine Learning and some of the Machine Learning services provided by the AWS cloud.
