Linear Regression Algorithm using Python

Bashir Alam


Linear Regression is one of the easiest and most popular Supervised Machine Learning algorithms. It is a technique for predicting a target value from one or more independent variables. Linear Regression is mostly used for forecasting and for studying relationships among variables. This article covers the implementation of the Linear Regression algorithm in Python. We will use AWS SageMaker Studio and AWS Jupyter Notebook for the implementation and visualization. In addition, we will describe how to set up the environment for the Linear Regression algorithm in AWS SageMaker Studio and AWS Jupyter Notebook.

As discussed in the Overview of Supervised Machine Learning Algorithms article, Linear Regression is a supervised machine learning algorithm that trains a model on data with one or more independent variables and a continuous dependent variable. Based on the number of independent variables, linear regression can be divided into two main categories:

  • Simple Linear Regression: There is only one independent variable and one corresponding dependent variable. That means the output Y is predicted from a single input variable X.
  • Multiple Linear Regression: There are multiple independent variables and one corresponding dependent variable. That means the output Y is predicted from more than one input variable Xi.
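The difference between the two categories can be sketched in a few lines of plain Python. This is only an illustration: the `predict` helper and all the numbers below are hypothetical and do not come from the article's dataset.

```python
def predict(intercept, coefficients, features):
    """Return intercept + sum(coefficient_i * feature_i)."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Simple linear regression: one input variable (made-up values)
simple = predict(intercept=1.0, coefficients=[2.0], features=[3.0])

# Multiple linear regression: three input variables (made-up values)
multiple = predict(intercept=1.0, coefficients=[2.0, 0.5, -1.0], features=[3.0, 4.0, 1.0])

print(simple)    # 7.0
print(multiple)  # 8.0
```

In both cases there is a single output; only the number of inputs changes.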

Training the model in Linear Regression Algorithm

The Linear Regression algorithm performs best when the relationship between the inputs and the output is linear. We suggest you always analyze the data before applying a linear regression algorithm. You can visualize the data to see what kind of relationship exists between the input and output variables. If the points are scattered and show no visible trend, Linear Regression is not a good choice for that data.

For example, we have a linear dataset with one input and the corresponding output. You can download the dataset from here. Now let us visualize this dataset in an AWS SageMaker Jupyter Notebook.

# Importing the modules
import pandas as pd
import matplotlib.pyplot as plt

# Importing the dataset
dataset = pd.read_csv('DataForLR.csv')

# get all columns except the last one (the input)
Input = dataset.iloc[:, :-1].values

# get the values of the second column (the output)
output = dataset.iloc[:, 1].values

# visualization part
viz_train = plt

# drawing the scatter plot
viz_train.scatter(Input, output, color='blue')
viz_train.title('Hours vs Score')

# X label and Y label
viz_train.xlabel('Hours')
viz_train.ylabel('Score')

# showing the graph
viz_train.show()

The output:

[Figure: scatter plot of the dataset, Hours vs Score]

The graph shows a linear relationship between the input and output variable which means this data can be used to train the Linear Regression algorithm.

Positive Linear Relationship

A positive linear relationship is when the dependent variable increases on the Y-axis as the independent variable increases on the X-axis. In simple words, as the input variable increases, the output variable also increases. The slope of such a linear relationship is positive.

[Figure: positive linear relationship]

Negative Linear Relationship

A negative linear relationship is when the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis. In simple words, as the input variable increases, the output variable decreases. The slope of such a linear relationship is negative.

[Figure: negative linear relationship]
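Besides looking at a plot, the direction and strength of a linear relationship can be summarized with the Pearson correlation coefficient. The following is a minimal sketch in plain Python; the `pearson_r` helper and the sample values are assumptions made for illustration and are not part of the article's dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: close to +1 rising, close to -1 falling, near 0 no linear trend."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up sample values
hours = [1, 2, 3, 4, 5]
rising_scores = [20, 40, 60, 80, 100]    # positive linear relationship
falling_scores = [100, 80, 60, 40, 20]   # negative linear relationship

r_pos = pearson_r(hours, rising_scores)
r_neg = pearson_r(hours, falling_scores)
print(round(r_pos, 3), round(r_neg, 3))  # 1.0 -1.0
```

A value near zero would suggest that a straight line is unlikely to describe the data well.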

Mathematical calculation of training model

The Linear Regression model provides a sloped straight line representing the relationship between the variables. The following is the simple training model of the Linear Regression:

f(x) = Mx + C

  • f(x) – The output of the dataset
  • M – The slope of the line
  • C – Constant value (the intercept)
  • x – The input value of the dataset

The Linear Regression algorithm takes the labeled training dataset and calculates the values of M and C. Once the model has found the values of M and C that best fit the training data, it is said to be a trained model. It can then take any value of x and give us the predicted output.
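For a single input variable, the slope and the constant (intercept) of the best-fitting line can be computed directly with the ordinary least squares formulas. The sketch below is a plain Python illustration under that assumption; the `fit_line` helper and the sample points are hypothetical, and a library such as scikit-learn may use a different numerical method internally.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points that lie exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)        # 2.0 1.0

# Using the fitted line to predict the output for a new input
print(slope * 5 + intercept)   # 11.0
```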

Python code for linear regression algorithm

This article uses Python 3.8.13 in the AWS Jupyter Notebook and Python 3.7.10 in AWS SageMaker Studio. You can check the Python version running in your notebook by executing the following code in a cell:

# importing the required module
from platform import python_version

# printing the version
print(python_version())

This will print out the Python version running on your Jupyter Notebook or SageMaker Studio.

Setting up the environment in AWS Jupyter Notebook and SageMaker Studio

Before writing the Python program for the Linear Regression algorithm, ensure that you have installed the required Python modules. We will be using the following Python modules in this article to import the data set and train our model:

  • scikit-learn, imported as sklearn (v0.24.2)
  • pandas (v1.1.5)
  • matplotlib (v3.3.4)

Use the following commands to install the required modules on AWS Jupyter Notebook and SageMaker Studio:

%pip install scikit-learn
%pip install matplotlib
%pip install pandas

Once the modules are installed successfully, you can check the versions by typing the following Python code.

#importing the required modules
import matplotlib
import sklearn
import pandas

#printing the versions of installed modules
print("matplotlib: ", matplotlib.__version__)
print("sklearn :", sklearn.__version__)
print("pandas :", pandas.__version__)
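If you prefer to check versions without importing the packages, the standard library offers `importlib.metadata`. The sketch below assumes Python 3.8 or newer, so it would not run on a Python 3.7 kernel; the `installed_version` helper is a hypothetical name introduced here.

```python
from importlib import metadata  # standard library, Python 3.8+

def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

# Distribution names as published on PyPI
for name in ("scikit-learn", "pandas", "matplotlib"):
    print(name, installed_version(name))
```

A result of None means the package still needs to be installed.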

Implementation of Linear Regression in Python

As we have installed all the required modules for the Linear Regression, we have to import them. Let us jump into the Python program and train our model using the above-mentioned dataset.

# Importing the required modules for
# Linear Regression using Python
import matplotlib.pyplot as plt
import pandas as pd

The matplotlib module is used to visualize the training and testing datasets. The pandas module is used to import the dataset and divide it into input variables and output variables.

The second step is to import the dataset and split it into input and output variables using the pandas module.

# Importing the dataset
dataset = pd.read_csv('DataForLR.csv')

# get all columns except the last one (the input)
Input = dataset.iloc[:, :-1].values

# get the values of the second column (the output)
output = dataset.iloc[:, 1].values

If we look back at our dataset, it has only two columns: an input column named hours and an output column named score. So, in the above program, we assign all the rows of every column except the last one to a variable named Input, and all the rows of the last column only to the output variable.
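The same slicing idea can be shown with plain Python lists. This is a sketch with made-up rows rather than the article's actual data; it illustrates why the inputs end up as a list of rows (two-dimensional) while the outputs are a flat list.

```python
# Each row is [hours, score]; the values are made up for illustration
rows = [
    [2.5, 21],
    [5.1, 47],
    [3.2, 27],
]

# Everything except the last column, similar in spirit to iloc[:, :-1]
inputs = [row[:-1] for row in rows]

# Only the last column, similar in spirit to iloc[:, 1] for a two-column table
outputs = [row[-1] for row in rows]

print(inputs)   # [[2.5], [5.1], [3.2]]
print(outputs)  # [21, 47, 27]
```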

Now the Input variable contains all the independent values, and the output variable contains the dependent values. The next step is to divide the dependent and independent variables into training and testing parts.

# Splitting the dataset into the Training data set and Testing data set
from sklearn.model_selection import train_test_split

# 30% data for testing, random state 1
X_train, X_test, y_train, y_test = train_test_split(Input, output, train_size=.7, random_state=1)

This is an important part of the implementation because the model is trained on one part of the original dataset and evaluated on another part that it has not seen during training.

First, we imported the train_test_split() function from the sklearn.model_selection module, which splits the dataset. We passed four arguments:

  • Input: This provides all the independent values to the function.
  • output: This provides all the corresponding output values to the function.
  • train_size: This controls how the dataset is split. 0.7 means we assign 70% of the dataset to the training part and the remaining 30% to the testing part. We can change the train_size depending on the performance of our model.
  • random_state: The dataset is shuffled randomly before it is split, and this value seeds that shuffle. We can assign any non-negative integer value. The same integer value will return the same training and testing datasets each time we run the code.

X_train and y_train contain 70% of the original dataset, and we will use them for training our model. X_test and y_test contain the remaining 30% of the original dataset, and we will use them to test our model and see whether its predictions are accurate.
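To make the roles of the split ratio and the seed more concrete, here is a simplified sketch of a seeded shuffle-and-split in plain Python. It is not scikit-learn's implementation and will not select the same rows as train_test_split(); the `split_train_test` helper and the sample data are hypothetical.

```python
import random

def split_train_test(inputs, outputs, train_size=0.7, seed=1):
    """Shuffle row indices with a fixed seed, then cut them into train and test parts."""
    indices = list(range(len(inputs)))
    random.Random(seed).shuffle(indices)
    cut = round(train_size * len(indices))
    train_idx, test_idx = indices[:cut], indices[cut:]
    x_train = [inputs[i] for i in train_idx]
    x_test = [inputs[i] for i in test_idx]
    y_train = [outputs[i] for i in train_idx]
    y_test = [outputs[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Ten made-up rows: one input column and one output value each
inputs = [[h] for h in range(1, 11)]
outputs = [10 * h for h in range(1, 11)]

first = split_train_test(inputs, outputs, train_size=0.7, seed=1)
second = split_train_test(inputs, outputs, train_size=0.7, seed=1)

print(len(first[0]), len(first[1]))  # 7 3
print(first == second)               # True, the same seed gives the same split
```

Each input row stays paired with its output because both lists are indexed with the same shuffled indices.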

We can verify the splitting of the original dataset by printing either the training part or testing part as shown below:

# Splitting the dataset into the Training data set and Testing data set
from sklearn.model_selection import train_test_split

# 30% data for testing, random state 1
X_train, X_test, y_train, y_test = train_test_split(Input, output, train_size=.7, random_state=1)

# printing the split output values
print("training output values: \n",y_train)
print("Testing output values:\n",y_test)

Output:

[Figure: printed training and testing output values]

Notice that 70% of the data is in the training part, and 30% is in the testing part. Once we have successfully split the dataset, the next step is to train the model by feeding the training dataset.

# Importing linear regression from sklearn
from sklearn.linear_model import LinearRegression

# initializing the algorithm
regressor = LinearRegression()

# Fitting Simple Linear Regression to the Training set
regressor.fit(X_train, y_train)

# Predicting the Test set results
y_pred = regressor.predict(X_test)

This is where the training of our model takes place. First, we initialize the linear regression algorithm and then provide the independent training data and the corresponding dependent training data to fit our model. Then we test our model by providing only the independent testing data to the trained model and saving the predicted output in a variable named y_pred.
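Once predictions are available, comparing them with the actual test values gives a numeric sense of accuracy. The sketch below computes two common measures, the mean absolute error and the R² score, in plain Python. The helper names `mae` and `r_squared` and the sample values are assumptions for illustration; scikit-learn provides comparable functions, `mean_absolute_error` and `r2_score`, in `sklearn.metrics`.

```python
def mae(actual, predicted):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """R² score: 1.0 is a perfect fit, 0.0 is no better than predicting the mean."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up actual and predicted scores
actual_scores = [20, 40, 60, 80]
predicted_scores = [22, 38, 61, 79]

print(mae(actual_scores, predicted_scores))                  # 1.5
print(round(r_squared(actual_scores, predicted_scores), 3))  # 0.995
```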

Visualizing the results

Our model is trained, and if we provide any testing value, it will give us the predicted output.

# Predicting the score for a single input value (3 hours)
pred_score = regressor.predict([[3]])
print(pred_score)

Output:

[Figure: predicted value for the single input]

This means that a student who studies for 3 hours is predicted to score about 75.5. This is not an actual value from the dataset but a value predicted by our trained model. Now, let us visualize the training dataset and the trained model.

# Visualizing the Training set results
viz_train = plt

# plotting the training dataset as a scatter plot
viz_train.scatter(X_train, y_train, color='red')

# plotting the regression line over the training inputs
viz_train.plot(X_train, regressor.predict(X_train), color='blue')
viz_train.title('Hours vs Score')

# labeling the input and outputs
viz_train.xlabel('Hours')
viz_train.ylabel('Score')

# showing the graph
viz_train.show()

Output:

[Figure: regression line over the training data]

The scattered points show the actual values of the training dataset, and the blue line shows our model trained on the scattered points.

Now, let us provide the testing dataset to our model and visualize the predicted values. We will show the actual testing data as a scatter plot and the predicted output as a line.

# Visualizing the Test set results
viz_test = plt

# red dot colors for actual values
viz_test.scatter(X_test, y_test, color='red')

# Blue line for the predicted values
viz_test.plot(X_test, regressor.predict(X_test), color='blue')

# defining the title
viz_test.title('Hours vs Score')

# x label
viz_test.xlabel('Hours')

# y label
viz_test.ylabel('Score')

# showing the graph
viz_test.show()

Output:

[Figure: regression line over the testing data]

The red dots show the actual values of the testing data, and the blue line shows the predicted outputs of our trained model. In this article, we used a small sample dataset to understand the working and implementation of Linear Regression. In real projects, datasets are usually much larger, which generally gives the model more information to learn from.

Linear regression using sklearn and SageMaker

Now let us run the linear regression using Python in AWS SageMaker, where we have Python 3.7.10 installed. The interface and running process are similar to those of the AWS Jupyter Notebook:

# Importing the required modules for linear regression using python
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('DataForLR.csv')

# get all columns except the last one (the input)
Input = dataset.iloc[:, :-1].values

# get the values of the second column (the output)
output = dataset.iloc[:, 1].values

# Splitting the dataset into the Training data set and Testing data set
from sklearn.model_selection import train_test_split

# 30% data for testing, random state 1
X_train, X_test, y_train, y_test = train_test_split(Input, output, train_size=.7, random_state=1)

# Importing linear regression from sklearn
from sklearn.linear_model import LinearRegression

# initializing the algorithm
regressor = LinearRegression()

# Fitting Simple Linear Regression to the Training set
regressor.fit(X_train, y_train)

# Predicting the Test set results
y_pred = regressor.predict(X_test)

# Visualizing the Training set results
viz_train = plt

# plotting the training dataset as a scatter plot
viz_train.scatter(X_train, y_train, color='red')

# plotting the regression line over the training inputs
viz_train.plot(X_train, regressor.predict(X_train), color='blue')
viz_train.title('Hours vs Score')

# labeling the input and outputs
viz_train.xlabel('Hours')
viz_train.ylabel('Score')

# showing the graph
viz_train.show()

# Visualizing the Test set results
viz_test = plt

# red dot colors for actual values
viz_test.scatter(X_test, y_test, color='red')

# Blue line for the predicted values
viz_test.plot(X_test, regressor.predict(X_test), color='blue')

# defining the title
viz_test.title('Hours vs Score')

# x label
viz_test.xlabel('Hours')

# y label
viz_test.ylabel('Score')

# showing the graph
viz_test.show()

The output:

[Figure: linear regression results in SageMaker]

The above code shows the full implementation of Linear Regression, from importing the dataset and splitting it to training the model and visualizing the results.

Summary

Linear regression is one of the easiest and most popular Machine Learning algorithms. It predicts a continuous output value from the input variables. It uses a simple mathematical formula to train the model based on the training data. This article covered the implementation of the Linear Regression algorithm using Python, AWS SageMaker Studio, and Jupyter Notebook.
