Time-series forecasting using Amazon SageMaker Canvas

Time-series forecasting is a challenging, compute- and time-intensive task that is hard to implement with accurate results. At re:Invent 2021, Amazon announced Amazon SageMaker Canvas, a service that lets you use Machine Learning to generate predictions without writing code. This article covers how to use Amazon SageMaker Canvas to create a forecasting model and make predictions for time-series datasets.

In this article, we’ll load Amazon’s historical stock data, prepare it for Amazon SageMaker Canvas, run the Canvas workflow, import the predictions back, and compare them with the actual future stock data and with SARIMA algorithm predictions.

We highly recommend you check out the Time Series Forecasting Principles with Amazon Forecast guide, since Amazon SageMaker Canvas uses Amazon Forecast under the hood to make predictions from time-series datasets.

Requirements

For demo purposes, we’ll use the following Python modules:

  • pandas – a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool
  • plotly – a data analytics and visualization library
  • pandas_datareader – a remote data access module for pandas that reads data from various sources
  • statsmodels – a module providing classes and functions for estimating many different statistical models (we’ll use its SARIMA implementation)

You can install the required dependencies by executing the following code in your Jupyter notebook:

%pip install pandas
%pip install pandas_datareader
%pip install plotly
%pip install statsmodels
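
If you’re not sure the installs succeeded, you can run a quick dependency-free check in a regular code cell. This is just a convenience sketch; the module names match the install commands above:

```python
import importlib.util

# modules used throughout this walkthrough
required = ["pandas", "pandas_datareader", "plotly", "statsmodels"]

for name in required:
    # find_spec returns None when the module cannot be imported
    found = importlib.util.find_spec(name) is not None
    print(f"{name}: {'installed' if found else 'missing'}")
```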

Loading dataset

As a time-series dataset, we’ll take Amazon stock data from 2008-01-01 to 2022-02-09 and try to make predictions for the next several days.

To load historical stock data, we’ll use the pandas_datareader module:

from pandas_datareader import data

start_date = '2008-01-01'
end_date = '2022-02-09'

stock_data = data.get_data_yahoo('AMZN', start_date, end_date)

stock_data.head(10)

As a result, we’ll get a pandas DataFrame named stock_data:

Amazon historical stock data 2008-01-01 - 2022-02-09

Let’s visualize the dataset using Plotly:

import pandas as pd

pd.options.plotting.backend = "plotly"

columns = ['High', 'Low', 'Open', 'Close', 'Adj Close']

fig = stock_data[columns].plot()
fig.update_layout(title_text=f"Amazon stocks historical data: {start_date} - {end_date}")
fig.show()

Preparing dataset for Amazon SageMaker Canvas

Next, we need to prepare our time-series dataset for Amazon SageMaker Canvas.

The Amazon SageMaker Canvas has several requirements for the dataset:

  • Column to predict – The column that contains examples of data for forecasting
  • Item ID column – The column that contains unique identifiers for each item in your dataset. In our example, a number uniquely identifies the stock
  • Timestamp column – The column containing the timestamps in your dataset (Format: yyyy-MM-dd HH:mm:ss)

For additional information about the model configuration, check out the Make a time series forecast documentation.
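
Before uploading anything, it can help to sanity-check that your rows satisfy the three requirements above. Here’s a minimal, dependency-free sketch; the column names `ID`, `Date`, and `Close` match the dataset we prepare below, but adjust them to your own data:

```python
from datetime import datetime

def validate_row(row, target="Close", item_id="ID", timestamp="Date"):
    """Check one record against the Canvas dataset requirements:
    a target column, an item ID column, and a timestamp column
    in yyyy-MM-dd HH:mm:ss format."""
    for column in (target, item_id, timestamp):
        if column not in row:
            return False
    try:
        datetime.strptime(row[timestamp], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return False
    return True

row = {"Date": "2008-01-02 00:00:00", "ID": 0, "Close": 96.25}
print(validate_row(row))  # a well-formed row passes
```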

We’ll use the period’s close price (the Close column) to predict the stock price. Let’s add a column with a unique identifier for the Amazon stock (we’ll use 0) and reformat the index to match the required timestamp format.

# create a copy of the dataframe
df = stock_data.copy()

# add column with unique identifier for the stock
df['ID'] = 0

# reformat the datetime index to the
# format required by Amazon SageMaker Canvas (yyyy-MM-dd HH:mm:ss)
df.index = df.index.strftime('%Y-%m-%d %H:%M:%S')

# setting up dataframe index as a column
df = df.reset_index()

df.head()

Now, our dataset looks like this:

Amazon historical stock data 2008-01-01 - 2022-02-09 (prepared for Amazon SageMaker Canvas)

Let’s save our dataset for Amazon SageMaker Canvas:

df.to_csv(f'amazon_stock_daily_{start_date}_{end_date}.csv', index=False)

Amazon SageMaker Canvas workflow

In this section of the article, we’ll import the prepared dataset to Amazon SageMaker Canvas, run its workflow, generate predictions, and export them back to the Jupyter Notebook for comparison with the SARIMA algorithm predictions and actual stock values.

Creating model

First, we need to open Amazon SageMaker Canvas and press the New Model button:

Executing Amazon SageMaker Canvas workflow - New model

Provide the new model name (do not use brackets or quotes, as they cause strange, hard-to-investigate issues at the build step later):

Executing Amazon SageMaker Canvas workflow - Create new model

Importing dataset

Press the Import data to Canvas button to start the import process:

Executing Amazon SageMaker Canvas workflow - Import data to Canvas

Upload the CSV file from your laptop (you can also use an S3 bucket as your data source):

Executing Amazon SageMaker Canvas workflow - Upload files to import

Preview the imported data:

Executing Amazon SageMaker Canvas workflow - Preview import

If everything’s OK, hit the Import data button:

Executing Amazon SageMaker Canvas workflow - Import data

Building the model

To build the model, we need to configure it first. Set the Target column value to Close and make sure that only the ID and Date columns are selected in the columns list. Press the Configure link to set up the model configuration.

Executing Amazon SageMaker Canvas workflow - Build - Selecting columns

Set up the following parameters for the time series forecasting configuration:

  • Item ID column: ID
  • Timestamp column: Date
  • Duration: the number of days you want predictions for (we’ll use 10)
  • Use holiday schedule: United States

Executing Amazon SageMaker Canvas workflow - Build - Time series forecasting configuration

Hit the Save button and start the model build process by pressing the Standard build button.

The process will take a couple of hours. It’s time for a coffee break.

Executing Amazon SageMaker Canvas workflow - Building process

Generating predictions

As soon as the model build process is complete, press the Predict button:

Executing Amazon SageMaker Canvas workflow - Predict

Press the Start predictions button to generate forecasted values (the process will take several minutes):

Executing Amazon SageMaker Canvas workflow - Start predictions

Download the generated predictions CSV file and upload it to your Jupyter Studio:

Executing Amazon SageMaker Canvas workflow - Download predictions

While Amazon SageMaker Canvas generates the file, you can review the forecasts by clicking the Single item prediction type:

Executing Amazon SageMaker Canvas workflow - Predictions preview

Importing data to Jupyter Notebook

To import the Amazon SageMaker Canvas predictions to Jupyter Notebook, use the following code:

# loading data from CSV file
canvas_predictions = pd.read_csv('Canvas_1645191819_2022-02-18T13-58-30Z_part0.csv')

# dropping the ID column (we don't need it)
canvas_predictions.drop(['ID'], axis=1, inplace=True)

# setting up Date column as a new index
canvas_predictions= canvas_predictions.set_index('Date')

# removing weekend datapoints with negative values from the dataset
canvas_predictions = canvas_predictions[canvas_predictions.p10 > 0]
canvas_predictions = canvas_predictions[canvas_predictions.p50 > 0]
canvas_predictions = canvas_predictions[canvas_predictions.p90 > 0]

canvas_predictions.head()

Amazon SageMaker Canvas imported predictions DataFrame

Let’s quickly visualize a part of this dataset:

pd.options.plotting.backend = "plotly"

canvas_predictions['2022-02-10':'2022-02-16'].plot()

Generating predictions using SARIMA algorithm

Let’s generate our predictions using the SARIMA algorithm.

First, let’s check which model parameters work best. To list all possible combinations of parameters, run the following code:

import itertools

p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 5) for x in list(itertools.product(p, d, q))]

print('Examples of parameter combinations for SARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))

Now, we can run the SARIMA algorithm with all parameter combinations and choose the best one (the one with the lowest Akaike Information Criterion (AIC) value):

import statsmodels.api as sm

close = stock_data['Close'].copy()
close.index = pd.DatetimeIndex(close.index).to_period('d')

params = {'param': None, 'param_seasonal': None, 'lowest': None}

for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(close, order=param, seasonal_order=param_seasonal, enforce_stationarity=False, enforce_invertibility=False)
            results = mod.fit()
            if params['lowest'] is None or results.aic < params['lowest']:
                params = {'param': param, 'param_seasonal': param_seasonal, 'lowest': results.aic}
        except Exception:
            continue
print('====> SARIMA{}x{}5 - AIC:{}'.format(params['param'],params['param_seasonal'], params['lowest']))

After that, we can execute the SARIMA algorithm with the best working parameters and print the summary:

mod = sm.tsa.statespace.SARIMAX(close,
                                order=params['param'],
                                seasonal_order=params['param_seasonal'],
                                enforce_stationarity=False,
                                enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])

Amazon stocks prediction - SARIMA algorithm summary

Now, we can generate the SARIMA predictions DataFrame. Note that we’ve adjusted start_date because the SARIMA algorithm requires at least one data point from the original dataset:

start_date = '2022-02-09'
end_date = '2022-02-16'

pred = results.get_prediction(start=pd.to_datetime(start_date), end=pd.to_datetime(end_date), dynamic=False)
pred_ci = pred.conf_int()

pred_uc = results.get_forecast(steps=7)
pred_ci = pred_uc.conf_int()

sarimax_predictions = pd.DataFrame(
    {
        'Sarimax': pred_uc.predicted_mean,
        'Date': pd.date_range(start='02/10/2022', periods=7, freq='D')
    }
)
sarimax_predictions.set_index("Date", inplace=True)
sarimax_predictions.head()

Amazon stocks prediction - SARIMA predictions DataFrame

Comparing results

Let’s load additional stock values and compare Amazon SageMaker Canvas and SARIMA algorithm predictions with the actual stock values:

start_date = '2022-02-10'
end_date = '2022-02-16'

stock_data_extended = data.get_data_yahoo('AMZN', start_date, end_date)
stock_data_extended = stock_data_extended[start_date:end_date]
stock_data_extended.head()

Amazon stocks extended DataFrame

Let’s select only the required data from the Amazon SageMaker Canvas DataFrame for the date range we’re interested in:

start_date = '2022-02-10'
end_date = '2022-02-17'

canvas_predictions = canvas_predictions[start_date:end_date]

Now, we can put all data points to the same plot to see the results:

import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(x=stock_data_extended.index.astype(str), y=stock_data_extended['Close'],
                    mode='lines',
                    name='Real values'))
fig.add_trace(go.Scatter(x=canvas_predictions.index.astype(str), y=canvas_predictions['p10'],
                    mode='lines',
                    name='Predicted (p10)'))
fig.add_trace(go.Scatter(x=canvas_predictions.index.astype(str), y=canvas_predictions['p50'],
                    mode='lines',
                    name='Predicted (p50)'))
fig.add_trace(go.Scatter(x=canvas_predictions.index.astype(str), y=canvas_predictions['p90'],
                    mode='lines',
                    name='Predicted (p90)'))
fig.add_trace(go.Scatter(x=sarimax_predictions.index.astype(str), y=sarimax_predictions['Sarimax'],
                    mode='lines',
                    name='SARIMA'))

fig.update_layout(
    showlegend=True
)

fig.show()

We can see that the SARIMA algorithm and Amazon SageMaker Canvas (the p50, i.e., 50th-percentile, values) generated quite similar predictions of the Amazon stock close prices for the specified date range.
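
To go beyond a visual comparison, you could also compute an error metric such as the mean absolute percentage error (MAPE) between each forecast and the actual close prices. Here’s a minimal sketch; the numbers below are made up for illustration only, not taken from the plots above:

```python
def mape(actual, predicted):
    """Mean absolute percentage error between two equal-length series."""
    return 100.0 * sum(
        abs(a - p) / abs(a) for a, p in zip(actual, predicted)
    ) / len(actual)

# hypothetical close prices and forecasts, for illustration only
actual = [3180.0, 3065.0, 3093.0, 3130.0]
canvas_p50 = [3150.0, 3110.0, 3080.0, 3155.0]
sarima = [3200.0, 3090.0, 3120.0, 3100.0]

print(f"Canvas p50 MAPE: {mape(actual, canvas_p50):.2f}%")
print(f"SARIMA MAPE: {mape(actual, sarima):.2f}%")
```

A lower MAPE means the forecast tracks the actual series more closely, which makes it easy to rank the two approaches on the same holdout range.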

Summary

This article covered how to use Amazon SageMaker Canvas to create a forecasting model and make predictions for time-series datasets. We used a Jupyter Notebook to prepare the data for Amazon SageMaker Canvas, run the SARIMA algorithm, and compare the obtained predictions with actual Amazon stock values.
