Implementing anomaly detection using Python


Anomaly detection identifies unusual items, data points, events, or observations that differ significantly from the norm. In Machine Learning and Data Science, you can use this process to clean up outliers from your datasets during the data preparation stage or to build computer systems that react to unusual events. Examples of anomaly detection use cases include analyzing network traffic spikes, deviations in application monitoring metrics, or security threat detection. This article will cover how to detect anomalies in your datasets, their effect on prediction algorithms, and automatic anomaly detection using Unsupervised Learning algorithms. We will be using Python, AWS SageMaker, and Jupyter Notebook for implementation and visualization purposes.

The performance of any Machine Learning algorithm is highly dependent on the accuracy of the provided dataset. In real-world scenarios, we usually deal with raw data that has to be analyzed and preprocessed before running Machine Learning tasks. Exploring and preparing a dataset before training is known as Exploratory Data Analysis (EDA), and anomaly detection is one of the steps of this process.

What is an anomaly?

An anomaly is an unusual item, data point, event, or observation significantly different from the norm. Anomaly detection algorithms help to automatically identify data points in the dataset that do not match other data points. In Data Science and Machine Learning, the anomaly data point in the dataset is also called the “outlier,” and these terms are used interchangeably.

Here’s how anomalies or outliers from the dataset usually look in the charts:

Anomaly detection - Outliers example

There are several types of anomalies:

  • A point anomaly is an object that lies far away from the mean or median of the distribution in the dataset. An example of a point anomaly might be a single credit card transaction for a huge amount of money.
  • A contextual anomaly is context-specific and commonly occurs in time-series datasets. For example, high traffic volume to a website might be normal during a weekday but not during a weekend, so an unexpected traffic spike on a weekend might represent a contextual anomaly.
  • A collective anomaly describes a group of related anomalous objects. An individual data instance in a collective anomaly may not be an anomaly by itself, but multiple occurrences of such data points together might be one. For example, a single slow network connection to a website might not be an issue, but thousands of such connections together might represent a DDoS attack.
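To make the point-anomaly case concrete, here is a minimal sketch that flags values lying far from the mean using z-scores. The transaction amounts and the 2-standard-deviation cutoff are illustrative assumptions, not part of any dataset used below:

```python
import numpy as np

# Toy credit card transaction amounts with one obvious point anomaly
amounts = np.array([12.5, 8.0, 23.4, 15.1, 9.9, 18.2, 5000.0, 11.3])

# Standardize: how many standard deviations each value is from the mean
z_scores = (amounts - amounts.mean()) / amounts.std()

# Flag values more than 2 standard deviations away (a simple heuristic)
outliers = amounts[np.abs(z_scores) > 2]
print(outliers)  # -> [5000.]
```

Note that a single extreme value inflates the standard deviation itself, which is why robust alternatives (median-based statistics, or the algorithms covered later in this article) are preferred on real data.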

Visual anomaly detection

The quickest way to find anomalies in a dataset is to visualize its data points. For example, outliers are easily identifiable by visualizing data series using box plots, scatter plots, or line charts.

Box plot

The box plot is a standardized way of displaying data distribution based on five metrics: minimum, first quartile (Q1), median, third quartile (Q3), and maximum:

Box plot - Outlier example

The box plot doesn’t show the shape of the data distribution the way a histogram does. However, it’s still handy for indicating whether a distribution contains potential unusual data points (outliers).

The box plot has the following characteristics:

  • The bottom and top sides of the box are the lower and upper quartiles. The box covers the interquartile interval which contains 50% of the data.
  • The median is the line that splits the box into two parts.
  • The whiskers are the two lines outside the box that go from the minimum to the lower quartile and from the upper quartile to the maximum.
  • Any data point that lies outside of the whiskers is considered to be an outlier.
  • A variation of the box and whisker plot restricts the length of the whiskers to a maximum of 1.5 times the interquartile range. Data points that are outside this interval are represented as points on the graph and considered as potential outliers.
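The 1.5 × IQR whisker rule described above takes only a few lines to sketch in Python (the sample values are made up for illustration):

```python
import numpy as np

# Sample data with one value far outside the rest
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 13])

# The box covers the interquartile interval Q1..Q3
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Whiskers extend at most 1.5 * IQR beyond the box
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside the whiskers is a potential outlier
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # -> [102]
```

This is exactly the rule plotting libraries apply when they draw individual points beyond the whiskers.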

Line chart

The line chart is ideal for visualizing a series of data points. If the data series contains any anomalies, they are usually easy to identify visually.

Line chart - Outlier example

Scatter plot

A scatter plot uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables.

charts.io – What is a scatter plot

If the dataset contains anomalies, you can see them on that chart. Here’s a visualization of the famous Iris dataset where we can easily see at least one outlier:

Scatter plot - Outlier example
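The scatter plot above can be reproduced with scikit-learn's bundled copy of the Iris dataset. Which pair of features the original chart plots is an assumption, so this sketch uses sepal length vs. sepal width:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load Iris as a DataFrame (150 samples, 4 features + target column)
iris = load_iris(as_frame=True)
df = iris.frame

# Plot two features; isolated points far from their cluster
# stand out as candidate outliers
plt.scatter(df['sepal length (cm)'], df['sepal width (cm)'],
            c=df['target'], cmap='viridis', s=20)
plt.xlabel('sepal length (cm)')
plt.ylabel('sepal width (cm)')
plt.title('Iris dataset')
plt.savefig('iris_scatter.png')
```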

Detecting and fixing anomalies in datasets

In this section of the article, we’ll show how anomalies (or outliers) can significantly affect the outcomes of any Machine Learning model by analyzing a simple dataset.

Let’s install several required Python modules by running the following commands in the cell of the Jupyter Notebook:

%pip install scikit-learn
%pip install pandas
%pip install numpy
%pip install matplotlib
%pip install plotly
%pip install seaborn
%pip install sktime
%pip install statsmodels

Exploring the dataset

The first step is to import the dataset and familiarize ourselves with the data. We will analyze a simple dataset containing catfish sales from 1986 to 2001. You can download the dataset from this link.

# Import pandas
import pandas as pd


# Read data
dataset = pd.read_csv('catfish_sales_1986_2001.csv', parse_dates=[0])

# Printing head of the DataFrame
dataset.head()

Output:

Catfish sales dataset 1986-2001

The output shows that our data has two columns containing the date and number of sales each month. Now let us visualize the dataset to see sales information more clearly:

import plotly.express as px

# Limiting DataFrame to specific date
mask = (dataset['Date'] <= '2000-01-01')
dataset = dataset.loc[mask]

# Plotting a part of DataFrame
fig = px.line(dataset, x='Date', y="Sales", title='Catfish sales 1986-2000')
fig.show()

Output:

The output looks good, and it seems we don’t have any anomalies in the dataset. Let’s double-check using a box plot:

import plotly.express as px
fig = px.box(dataset, y="Sales", title='Catfish sales 1986-2000')
fig.show()

The box plot chart does not show any outliers.

Re-indexing the dataset

As you’ve seen above, the DataFrame’s index is an integer type. It would be helpful to re-index the entire DataFrame using the information from the Date column as a new index:

# convert the column (it's a string) to datetime type
datetime_series = pd.to_datetime(dataset['Date'])

# create datetime index passing the datetime series
datetime_index = pd.DatetimeIndex(datetime_series.values)

# create a monthly period index from the datetime index
period_index = pd.PeriodIndex(datetime_index, freq='M')

# re-index the DataFrame using the period index
dataset = dataset.set_index(period_index)

# we don't need the column anymore
dataset.drop('Date', axis=1, inplace=True)

dataset.head()

Output:

Catfish sales dataset 1986-2000 (re-indexed)

Price prediction (dataset without anomalies)

Now, let’s predict the last 12 months of sales (1999) based on the historical time-series data starting from 1986.

First, let’s split our dataset:

import plotly.graph_objects as go
from sktime.forecasting.model_selection import temporal_train_test_split

# Splitting dataset (test dataset size is last 12 periods/months)
y_train, y_test = temporal_train_test_split(dataset, test_size=12)

# Visualizing train/test dataset
fig = go.Figure()
fig.add_trace(go.Scatter(
    name="Train DataSet", x=y_train.index.astype(str), y=y_train['Sales']
))
fig.add_trace(go.Scatter(
    name="Test DataSet", x=y_test.index.astype(str), y=y_test['Sales']
))
fig.update_layout(
    title="Split dataset"
)
fig.show()

Output:

For demo purposes, we’ll use the SARIMA algorithm to model the catfish market and forecast sales based on our historical dataset.

from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(y_train['Sales'], order=(1, 1, 1), seasonal_order=(1,0,1,12))
model_fit = model.fit()
y_pred = model_fit.predict(start=len(y_train), end=len(y_train)+11, exog=None, dynamic=True)

Let’s visualize prediction results:

fig = go.Figure()
fig.add_trace(go.Scatter(
    name="Train DataSet", x=y_train.index.astype(str), y=y_train['Sales']
))
fig.add_trace(go.Scatter(
    name="Test DataSet", x=y_test.index.astype(str), y=y_test['Sales']
))
fig.add_trace(go.Scatter(
    name="Prediction", x=y_pred.index.astype(str), y=y_pred.values
))
fig.update_layout(
    title="Predicted vs actual values"
)
fig.show()

Output:

As you can see, the SARIMA algorithm predicted future prices with high accuracy.

Here are Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) values:

from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
mae = mean_absolute_error(list(y_test['Sales']), list(y_pred))
mape = mean_absolute_percentage_error(list(y_test['Sales']), list(y_pred))
print('MAE: %.3f' % mae)
print('MAPE: %.3f' % mape)

Output:

  • MAE: 930.138
  • MAPE: 0.036

Price prediction (dataset with anomalies)

Let’s break the dataset and introduce an anomaly point to see the influence of anomalies on the same prediction algorithm:

from datetime import datetime

# Cloning good dataset
broken_dataset = dataset.copy()

# Breaking the cloned dataset with an artificial anomaly
broken_dataset.loc[datetime(1998, 12, 1),['Sales']] = 1000

Here’s the visualization of the broken dataset:

import plotly.express as px

# Plotting DataFrame
fig = px.line(
    broken_dataset,
    x=broken_dataset.index.astype(str),
    y=broken_dataset['Sales']
)
fig.update_layout(
    yaxis_title='Sales',
    xaxis_title='Date',
    title='Catfish sales 1986-2000 (broken)'
)
fig.show()

Output:

Let’s use the box plot to see the outlier:

import plotly.express as px
fig = px.box(broken_dataset, y="Sales")
fig.show()

Output:

The box plot shows one anomaly point below the lower whisker.

We can run the same algorithm to visualize the difference in predictions.

Let’s split the dataset:

import plotly.graph_objects as go
from sktime.forecasting.model_selection import temporal_train_test_split

# Splitting dataset (test dataset size is last 12 periods/months)
y_train, y_test = temporal_train_test_split(broken_dataset, test_size=12)

# Visualizing train/test dataset
fig = go.Figure()
fig.add_trace(go.Scatter(
    name="Train DataSet", x=y_train.index.astype(str), y=y_train['Sales']
))
fig.add_trace(go.Scatter(
    name="Test DataSet", x=y_test.index.astype(str), y=y_test['Sales']
))
fig.update_layout(
    title="Split dataset"
)
fig.show()

Now, let’s run the SARIMA algorithm:

from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(y_train['Sales'], order=(1, 1, 1), seasonal_order=(1,0,1,12))
model_fit = model.fit()
y_pred = model_fit.predict(start=len(y_train), end=len(y_train)+11, exog=None, dynamic=True)

And visualize prediction results:

import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(
    name="Train DataSet", x=y_train.index.astype(str), y=y_train['Sales']
))
fig.add_trace(go.Scatter(
    name="Test DataSet", x=y_test.index.astype(str), y=y_test['Sales']
))
fig.add_trace(go.Scatter(
    name="Prediction", x=y_pred.index.astype(str), y=y_pred.values
))
fig.update_layout(
    yaxis_title='Sales',
    xaxis_title='Date',
    title='Catfish sales 1986-2000 (incorrect predictions)'
)
fig.show()

As you can see, predictions follow the pattern but are not even close to the actual values.

from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
mae = mean_absolute_error(list(y_test['Sales']), list(y_pred))
mape = mean_absolute_percentage_error(list(y_test['Sales']), list(y_pred))
print('MAE: %.3f' % mae)
print('MAPE: %.3f' % mape)

Output:

  • MAE: 8401.406
  • MAPE: 0.345

Algorithms for finding anomalies

It is challenging to find data anomalies, especially when dealing with large datasets. Fortunately, the sklearn Python module has many built-in algorithms to help us solve this problem, such as Isolation Forests, DBSCAN, Local Outlier Factors (LOF), and many others.

Isolation Forests

Isolation Forests is an unsupervised learning algorithm that identifies anomalies by isolating outliers in the data based on the Decision Tree Algorithm. It separates the outliers by randomly selecting a feature from the given set of features and then selecting a split value between the max and min values. This random partitioning of features will produce shorter paths in trees for the anomalous data points, thus distinguishing them from the rest of the data.

Let’s implement the Isolation Forests algorithm on the same broken dataset to find anomalies using Python.

# importing the Isolation Forest
from sklearn.ensemble import IsolationForest

# copying dataset
isf_dataset = broken_dataset.copy()

# initializing Isolation Forest
clf = IsolationForest(max_samples='auto', contamination=0.01)

# training
clf.fit(isf_dataset)

# finding anomalies
isf_dataset['Anomaly'] = clf.predict(isf_dataset)

# saving anomalies to a separate dataset for visualization purposes
anomalies = isf_dataset.query('Anomaly == -1')

Let’s visualize our findings:

import plotly.graph_objects as go

b1 = go.Scatter(x=isf_dataset.index.astype(str),
                y=isf_dataset['Sales'],
                name="Dataset",
                mode='markers'
               )
b2 = go.Scatter(x=anomalies.index.astype(str),
                y=anomalies['Sales'],
                name="Anomalies",
                mode='markers',
                marker=dict(color='red', size=6,
                            line=dict(color='red', width=1))
               )

layout = go.Layout(
    title="Isolation Forest results",
    yaxis_title='Sales',
    xaxis_title='Date',
    hovermode='closest'
)

data = [b1, b2]

fig = go.Figure(data=data, layout=layout)
fig.show()

Output:

As you can see, the Isolation Forests algorithm detected two anomalies, including the one that we introduced ourselves.

Local Outlier Factor (LOF)

The Local Outlier Factor (LOF) algorithm helps identify outliers based on the density of data points for every local data point in the dataset. The algorithm performs well when the data density is not the same throughout the dataset.

Let’s apply the Local Outlier Factor algorithm to our dataset and find anomalies.

# importing the Local Outlier Factor
from sklearn.neighbors import LocalOutlierFactor

# copying dataset
lof_dataset = broken_dataset.copy()

# initializing the Local Outlier Factor algorithm
clf = LocalOutlierFactor(n_neighbors=10)

# training and finding anomalies
lof_dataset['Anomaly'] = clf.fit_predict(lof_dataset)

# saving anomalies to another dataset for visualization purposes
anomalies = lof_dataset.query('Anomaly == -1')

Let’s visualize our findings:

As with the Isolation Forests algorithm, the Local Outlier Factor algorithm detected two anomalies, including the one that we introduced ourselves.

Summary

Anomaly detection is the process of locating unusual items, data points, events, or observations that raise suspicion because they differ from the rest of the data. In this article, we’ve covered anomalies (outliers) and their effect on prediction algorithms.
