Linear regression is a fundamental machine learning algorithm used for predictive analysis. This post covers the basics of linear regression, its mathematical foundations, and practical applications. We’ll also provide a step-by-step guide to implementing linear regression in your projects, helping you harness its predictive power effectively.

Simple Linear Regression

The equation for Simple Linear Regression, which models the relationship between a dependent variable \(y\) and an independent variable \(x\), is typically written as follows:

\(y = \beta_0 + \beta_1x + \epsilon\)

Here’s the breakdown of the terms:

  • \(y\) is the dependent variable (the variable we are trying to predict).
  • \(x\) is the independent variable (the variable we use to make predictions).
  • \(\beta_0\) is the y-intercept (the value of \(y\) when \(x=0\)).
  • \(\beta_1\) is the slope of the line (it represents the change in \(y\) for a one-unit change in \(x\)).
  • \(\epsilon\) represents the error term (the difference between the observed values and the values predicted by the model).

This equation forms the basis for the line of best fit in Simple Linear Regression, where \(\beta_0\) and \(\beta_1\) are coefficients determined during the model fitting process to minimize the error term \(\epsilon\).
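To make the fitting step concrete, ordinary least squares has closed-form solutions for these two coefficients: \(\beta_1\) is the covariance of \(x\) and \(y\) divided by the variance of \(x\), and \(\beta_0 = \bar{y} - \beta_1\bar{x}\). Here is a minimal NumPy sketch of that computation (the data values are purely illustrative):

import numpy as np

# Illustrative data points
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])

# Ordinary least squares estimates:
# beta_1 = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)**2)
# beta_0 = mean_y - beta_1 * mean_x
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

print(f"Fitted line: y = {beta_0:.2f} + {beta_1:.2f}x")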

For example, consider a simple dataset where \(x\) represents the number of hours studied, and \(y\) represents the test score obtained:

Hours Studied (x)    Test Score (y)
1                    51
2                    55
3                    61
4                    63
5                    66
6                    70
7                    75
8                    76
9                    79
10                   82

In this dataset, the “Hours Studied” is the independent variable (x), and the “Test Score” is the dependent variable (y). We can use simple linear regression to model the relationship between hours studied and test scores, predicting the test score based on the number of hours studied.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Dataset
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)  # Hours Studied
y = np.array([51, 55, 61, 63, 66, 70, 75, 76, 79, 82])  # Test Scores
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # fixed seed keeps the split reproducible
# Linear Regression Model
model = LinearRegression()
model.fit(X_train, y_train)
# Slope and Intercept
slope = model.coef_[0]
intercept = model.intercept_
# Predictions on the test set (compare y_pred with y_test to assess the fit)
y_pred = model.predict(X_test)
# Visualizing results
plt.scatter(X_train, y_train, color='blue', label='Training set')
plt.scatter(X_test, y_test, color='red', label='Test set')
plt.plot(X_train, model.predict(X_train), color='grey', label=f'Linear Regression Line: y = {slope:.2f}x + {intercept:.2f}')
# Adding labels and legend
plt.xlabel('Hours Studied')
plt.ylabel('Test Score')
plt.legend()
# Display the plot
plt.show()
[Figure: Simple Linear Regression - regression line fitted to the Hours Studied vs. Test Score data]

For the dataset above, the fitted line indicates that, on average, a student who does not study at all (\(x=0\)) scores about 48.96, and that each additional hour of study increases the score by about 3.37 points (the exact values depend on how the train/test split falls).
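Once fitted, the model can also predict scores for inputs that are not in the dataset. For example, continuing from the code above (7.5 hours is an arbitrary value chosen for illustration):

# Predict the expected score for 7.5 hours of study
hours = np.array([[7.5]])
predicted_score = model.predict(hours)
print(f"Predicted score for 7.5 hours: {predicted_score[0]:.1f}")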

Multiple Linear Regression

In the case of Multiple Linear Regression, where there are multiple independent variables, the equation expands to accommodate each of these variables:

\(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_nx_n + \epsilon\)

In this equation:

  • \(y\) is the dependent variable you are trying to predict or explain.
  • \(\beta_0\) is the intercept term. It represents the value of \(y\) when all \(x\) variables are 0.
  • \(\beta_1, \beta_2, \ldots, \beta_n\) are the coefficients of the independent variables \(x_1, x_2, \ldots, x_n\). Each coefficient represents the change in \(y\) for a one-unit change in the corresponding \(x\) variable, holding all other variables constant.
  • \(x_1, x_2, \ldots, x_n\) are the independent variables used to predict \(y\).
  • \(\epsilon\) is the error term, representing the portion of \(y\) that cannot be explained by the independent variables.
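Fitting a multiple linear regression in scikit-learn looks much like the simple case, except that the feature matrix now has one column per independent variable. Here is a minimal sketch with made-up numbers, where the second column might represent, say, hours slept:

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: column 1 is hours studied (x1), column 2 is hours slept (x2)
X = np.array([[1, 8], [2, 7], [3, 8], [4, 6], [5, 7],
              [6, 8], [7, 6], [8, 7], [9, 8], [10, 7]])
y = np.array([52, 55, 62, 61, 67, 71, 73, 77, 81, 82])

model = LinearRegression()
model.fit(X, y)

# One coefficient per independent variable, plus the shared intercept
print("Intercept (beta_0):", model.intercept_)
print("Coefficients (beta_1, beta_2):", model.coef_)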

Polynomial Linear Regression

In Polynomial Linear Regression, the relationship between the independent variable and the dependent variable is modeled as an \(n\)th degree polynomial:

\(y = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \ldots + \beta_nx^n + \epsilon\)

Here’s the breakdown of the terms:

  • \(y\) is the dependent variable that you are trying to predict or explain.
  • \(\beta_0\) is the intercept term. It represents the value of \(y\) when \(x\) is 0.
  • \(\beta_1, \beta_2, \ldots, \beta_n\) are the coefficients for the respective terms of the polynomial. Each coefficient represents the impact of a change in the corresponding power of \(x\) on \(y\).
  • \(x\) is the independent variable. In polynomial regression, instead of having multiple different independent variables, you have multiple powers of a single independent variable.
  • \(x^2, x^3, \ldots, x^n\) represent the independent variable raised to the power of 2, 3, …, \(n\), respectively.
  • \(\epsilon\) is the error term, accounting for the variation in \(y\) not explained by the polynomial terms of \(x\).
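Because the model is still linear in its coefficients, it can be fitted with ordinary linear regression after expanding \(x\) into its powers. A minimal sketch using scikit-learn's PolynomialFeatures (the data values are illustrative):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Illustrative data with a curved trend
x = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
y = np.array([3, 7, 13, 22, 34, 48, 65, 85])

# Expand x into the columns [x, x^2]; include_bias=False because
# LinearRegression fits the intercept beta_0 itself
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(x)

model = LinearRegression()
model.fit(X_poly, y)

print("Intercept (beta_0):", model.intercept_)
print("Coefficients (beta_1, beta_2):", model.coef_)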