Working with AWS Glue in Python using Boto3

Sathiya Sarathi


AWS Glue is a serverless, fully managed extract, transform, and load (ETL) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. In this article, we’ll cover how to use the AWS SDK for Python (the Boto3 library) to interact with AWS Glue: automating ETL jobs and crawlers and defining metadata in the Data Catalog.

Prerequisites

To start working with AWS Glue using Boto3, you need to set up your Python environment on your laptop.

In summary, this is what you will need:

  • Python 3
  • Boto3
  • AWS CLI tools

Alternatively, you can set up and launch a Cloud9 IDE.
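Once the tools are installed and your AWS credentials are configured (for example, via aws configure), you can verify that Boto3 picks up the credentials with a quick STS call; a minimal sanity check:

import boto3

# Print the AWS account ID of the currently configured credentials
sts = boto3.client('sts')
print(sts.get_caller_identity()['Account'])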

Working with AWS Glue Data Crawlers

AWS Glue allows you to use crawlers to populate the AWS Glue Data Catalog tables. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. Extract, transform, and load (ETL) jobs that you define in AWS Glue use these Data Catalog tables as sources and targets. The ETL job reads from and writes to the data stores that are specified in the source and target Data Catalog tables.

In this section of the article, we’ll explain how to manage AWS Glue crawlers by using the Boto3 library.

Creating an AWS Glue Crawler

To start managing AWS Glue service through the API, you need to instantiate the Boto3 client:

import boto3

client = boto3.client('glue', region_name="us-east-1")

To create an AWS Glue Data Crawler, you need to use the create_crawler() method of the Boto3 library. This method creates a crawler that retrieves metadata from your data sources and stores it in the AWS Glue Data Catalog. A single crawler can process multiple data sources at a time.

In the following example, the defined crawler reads from two locations in an S3 bucket. It also has a schedule associated with it, which defines the crawling intervals. Depending on your use case, you can configure the crawler to update the AWS Glue Data Catalog structure or only log the change when the schema of your data changes.

import boto3
import json

client = boto3.client('glue', region_name="us-east-1")

response = client.create_crawler(
    Name='S3Crawler',
    Role='GlueFullAccess',
    DatabaseName='S3CrawlerHOC',
    Targets={
        # Each S3 target also accepts an optional 'Exclusions' list of glob
        # patterns for paths the crawler should skip
        'S3Targets': [
            {
                'Path': 's3://glue-source-hoc/read',
                'SampleSize': 2
            },
            {
                'Path': 's3://glue-source-hoc/write',
                'SampleSize': 2
            },
        ]
    },
    Schedule='cron(15 12 * * ? *)',
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'DEPRECATE_IN_DATABASE'
    },
    RecrawlPolicy={
        'RecrawlBehavior': 'CRAWL_EVERYTHING'
    },
    LineageConfiguration={
        'CrawlerLineageSettings': 'DISABLE'
    }
)

print(json.dumps(response, indent=4, sort_keys=True, default=str))

Here’s an execution output:

Creating an AWS Glue Data Crawler using Boto3

The created S3Crawler becomes visible in the AWS Glue console as well:

Creating an AWS Glue Data Crawler using Boto3 – AWS Console

Listing AWS Glue Crawlers

To list AWS Glue Crawlers, you need to use the list_crawlers() method of the Boto3 client:

import boto3
import json

client = boto3.client('glue', region_name="us-east-1")

response = client.list_crawlers()

print(json.dumps(response, indent=4, sort_keys=True, default=str))

The execution returns a list of all the existing AWS Glue Data Crawlers in the given AWS Region:

Listing AWS Glue Crawlers using Boto3

Starting an AWS Glue Data Crawler

To start an AWS Glue Data Crawler, you need to use the start_crawler() method of the Boto3 client. This method requires the Name argument, which defines the crawler to start.

In the following example, we’ll run the first crawler from the list of available crawlers:

import boto3
import json

client = boto3.client('glue', region_name="us-east-1")

response = client.list_crawlers()

response2 = client.start_crawler(
    Name=response['CrawlerNames'][0]
)

print(json.dumps(response2, indent=4, sort_keys=True, default=str))

Here’s an execution output:

Starting an AWS Glue Data Crawler using Boto3
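The start_crawler() call returns immediately while the crawl continues in the background. Boto3 defines no waiter for Glue crawlers, so if your automation needs to block until the crawl finishes, you can poll the get_crawler() method until the crawler returns to the READY state; a minimal sketch:

import time

import boto3

client = boto3.client('glue', region_name="us-east-1")

# The crawler moves through RUNNING and STOPPING before returning to READY
while True:
    state = client.get_crawler(Name='S3Crawler')['Crawler']['State']
    print(f"Crawler state: {state}")
    if state == 'READY':
        break
    time.sleep(30)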

After execution, the crawler generates the database:

Starting an AWS Glue Data Crawler using Boto3 – Created Database

The created database contains two tables, whose structure describes the metadata stored in the S3 bucket:

Starting an AWS Glue Data Crawler using Boto3 – Created Tables

Finally, you can explore every table to see the metadata information:

Starting an AWS Glue Data Crawler using Boto3 – Table View – Columns

The AWS Glue crawler infers the schema from the uploaded CSV files, detects the column data types, and saves this information in the form of regular tables for future use.
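You can read the same metadata programmatically with the get_tables() method. A short sketch, assuming the database defined in the crawler above (note that the Data Catalog stores database and table names in lowercase):

import boto3

client = boto3.client('glue', region_name="us-east-1")

# List every table the crawler created, along with the inferred columns
response = client.get_tables(DatabaseName='s3crawlerhoc')

for table in response['TableList']:
    print(table['Name'])
    for column in table['StorageDescriptor']['Columns']:
        print(f"  {column['Name']}: {column['Type']}")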

Deleting an AWS Glue Data Crawler

To delete an AWS Glue Data Crawler, you need to use the delete_crawler() method of the Boto3 client. This method requires the Name argument, which is the name of the crawler to delete:

import boto3
import json

client = boto3.client('glue', region_name="us-east-1")

response = client.list_crawlers()

response2 = client.delete_crawler(
    Name=response['CrawlerNames'][0]
)

print(json.dumps(response2, indent=4, sort_keys=True, default=str))

Here’s an execution output:

Deleting an AWS Glue Data Crawler using Boto3

Working with AWS Glue Jobs

An AWS Glue job drives the ETL process from source to target, launched either on demand by triggers or on a schedule. Each job run executes a Python script stored in an S3 location. The Glue console can generate this script dynamically as a boilerplate that you edit to include your own logic.

An AWS Glue job can be one of the following:

  • Batch job – runs in a Spark environment
  • Streaming job – runs in a Spark Structured Streaming environment
  • Plain Python shell job – runs in a simple Python environment
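When you create a job through the API (covered in the next section), the Name field of the Command parameter selects the job type: 'glueetl' for batch jobs, 'gluestreaming' for streaming jobs, and 'pythonshell' for Python shell jobs. For example, a Python shell job would use a Command like the following (the script path is a hypothetical placeholder):

# The Command name selects the job type: 'glueetl', 'gluestreaming',
# or 'pythonshell'
command = {
    'Name': 'pythonshell',
    'ScriptLocation': 's3://glue-source-hoc/my_shell_script.py',  # hypothetical path
    'PythonVersion': '3'
}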

For this exercise, let’s clone this repository by running the following command:

git clone https://github.com/datawrangl3r/hoc-glue-example.git

Upload the Python file to the root directory and the CSV data file to the read directory of your S3 bucket. The script reads the CSV file present inside the read directory.

Here’s an S3 bucket structure example:

AWS Glue Jobs – S3 Bucket layout for the exercise
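If you prefer to stage these files with Boto3 instead of the console, here’s a sketch using the S3 client (the local file names are assumptions; adjust them to match the cloned repository):

import boto3

s3 = boto3.client('s3', region_name="us-east-1")

# Upload the job script to the bucket root and the data file to the read prefix
s3.upload_file('iris_onboarder.py', 'glue-source-hoc', 'iris_onboarder.py')
s3.upload_file('iris.csv', 'glue-source-hoc', 'read/iris.csv')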

Creating an AWS Glue Job

To create an AWS Glue job, you need to use the create_job() method of the Boto3 client. This method accepts several parameters, such as the Name of the job, the Role to be assumed during the job execution, the Command to run with its arguments, and other parameters related to the job execution.

In the following example, we will upload a Glue job script to an S3 bucket and use two G.1X workers to execute the job script. You can adjust the number of workers if you need to process a massive amount of data.

import boto3
import json

client = boto3.client('glue', region_name="us-east-1")

response = client.create_job(
    Name='IrisJob',
    Role='AWSGlueServiceRole-Demo',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://glue-source-hoc/iris_onboarder.py',
        'PythonVersion': '3'
    },
    DefaultArguments={
      '--TempDir': 's3://glue-source-hoc/temp_dir',
      '--job-bookmark-option': 'job-bookmark-disable'
    },
    MaxRetries=1,
    GlueVersion='3.0',
    NumberOfWorkers=2,
    # Glue 3.0 Spark jobs require the G.1X or G.2X worker type
    WorkerType='G.1X'
)

print(json.dumps(response, indent=4, sort_keys=True, default=str))

Here’s an execution output:

Creating an AWS Glue Job using Boto3

You can review the created job in the AWS console:

Creating an AWS Glue Job using Boto3 – AWS Console

Note: the job creation process might take some time, so you may have to wait a few minutes before running the job. At the time of writing, Boto3 defines no waiters for the Glue service.

Listing AWS Glue Jobs

To list AWS Glue jobs, you need to use the list_jobs() method of the Boto3 Glue client.

import boto3
import json

client = boto3.client('glue', region_name="us-east-1")

response = client.list_jobs()

print(json.dumps(response, indent=4, sort_keys=True, default=str))

The method execution returns all the existing AWS Glue jobs in the given AWS Region:

Listing AWS Glue Jobs using Boto3

Starting an AWS Glue Job

To start an AWS Glue Job, you need to use the start_job_run() method of the Boto3 Glue client. This method triggers the job execution, which will invoke the Python script located in the S3 bucket.

import boto3
import json

client = boto3.client('glue', region_name="us-east-1")

response = client.start_job_run(
    JobName='IrisJob'
)

print(json.dumps(response, indent=4, sort_keys=True, default=str))

Here’s an execution output:

Starting an AWS Glue Job using Boto3

The AWS console shows the job execution status under the Jobs tab for every single job:

Starting an AWS Glue Job using Boto3 – AWS Console
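Because there are no Glue waiters, you can track the run programmatically instead: start_job_run() returns a JobRunId that you can pass to the get_job_run() method. A minimal polling sketch:

import time

import boto3

client = boto3.client('glue', region_name="us-east-1")

response = client.start_job_run(JobName='IrisJob')
run_id = response['JobRunId']

# Poll until the run reaches a terminal state
while True:
    state = client.get_job_run(JobName='IrisJob', RunId=run_id)['JobRun']['JobRunState']
    print(f"Job run state: {state}")
    if state in ('SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT', 'ERROR'):
        break
    time.sleep(30)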

Deleting an AWS Glue Job

To delete an AWS Glue job, you need to use the delete_job() method of the Boto3 Glue client:

import boto3
import json

client = boto3.client('glue', region_name="us-east-1")

response = client.delete_job(
    JobName='IrisJob'
)

print(json.dumps(response, indent=4, sort_keys=True, default=str))

Here’s an execution output:

Deleting an AWS Glue Job using Boto3

Working with AWS Glue Blueprints and Workflows

In AWS Glue, you can use workflows to create and visualize complex extract, transform, and load (ETL) activities involving multiple crawlers, jobs, and triggers. Each workflow manages the execution and monitoring of all its jobs and crawlers. As a workflow runs each component, it records execution progress and status. This provides you with an overview of the larger task and the details of each step.

AWS Glue blueprints provide a way to create and share AWS Glue workflows. When there is a complex ETL process that could be used for similar use cases, rather than creating an AWS Glue workflow for each use case, you can create a single blueprint.

The blueprint specifies the jobs and crawlers to include in a workflow and specifies parameters that the workflow user supplies when they run the blueprint to create a workflow.

Check out the official AWS documentation on Developing AWS Glue Blueprints for more information.

In the following sections, we will use Boto3 to deploy a demo blueprint that creates a workflow to crawl multiple S3 locations. First, clone the AWS samples repository and package the blueprint:

git clone https://github.com/awslabs/aws-glue-blueprint-libs.git

cd aws-glue-blueprint-libs/samples/

zip crawl_s3_locations.zip crawl_s3_locations/*

Upload the crawl_s3_locations.zip file to your S3 bucket.

Creating an AWS Glue Blueprint

To create an AWS Glue Blueprint, you need to use the create_blueprint() method of the Boto3 Glue client:

import boto3
import json

client = boto3.client('glue', region_name="us-east-1")

response = client.create_blueprint(
    Name='Crawler_Blueprint_From_S3',
    BlueprintLocation='s3://glue-source-hoc/crawl_s3_locations.zip'
)

print(json.dumps(response, indent=4, sort_keys=True, default=str))

Here’s an execution output:

Creating an AWS Glue Blueprint using Boto3

It might take a couple of seconds to deploy an AWS Glue Blueprint. On successful deployment, the Blueprint status will change to ACTIVE:

Creating an AWS Glue Blueprint using Boto3 – AWS Console
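To wait for this programmatically, you can poll the get_blueprint() method until the Status field becomes ACTIVE (or FAILED); a short sketch:

import time

import boto3

client = boto3.client('glue', region_name="us-east-1")

# A new blueprint starts in CREATING and moves to ACTIVE or FAILED
while True:
    status = client.get_blueprint(Name='Crawler_Blueprint_From_S3')['Blueprint']['Status']
    print(f"Blueprint status: {status}")
    if status in ('ACTIVE', 'FAILED'):
        break
    time.sleep(5)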

Listing AWS Glue Blueprints

To list AWS Glue Blueprints, you need to use the list_blueprints() method of the Boto3 Glue client:

import boto3
import json

client = boto3.client('glue', region_name="us-east-1")

response = client.list_blueprints()

print(json.dumps(response, indent=4, sort_keys=True, default=str))

This method returns the names of the blueprints that are currently available in the specified AWS Region in your account:

Listing AWS Glue Blueprints using Boto3

Creating an AWS Glue Workflow from a Blueprint

Once the blueprint is ready, you need to invoke the start_blueprint_run() method of the Boto3 Glue client with the parameters defined in the blueprint.cfg config file from the cloned repository. These parameters include the WorkflowName, IAMRole, S3Paths, and DatabaseName.

import boto3
import json

client = boto3.client('glue', region_name="us-east-1")

response = client.start_blueprint_run(
    BlueprintName='Crawler_Blueprint_From_S3',
    Parameters='{"WorkflowName": "s3_crawl_wflow", \
        "IAMRole": "arn:aws:iam::585584209241:role/GlueFullAccess", \
        "S3Paths": ["s3://covid19-lake/enigma-aggregation/json/global/", \
        "s3://covid19-lake/enigma-aggregation/json/global_countries/", \
        "s3://covid19-lake/enigma-aggregation/json/us_counties/", \
        "s3://covid19-lake/enigma-aggregation/json/us_states/"], \
        "DatabaseName": "blueprint_tutorial"}',
    RoleArn='arn:aws:iam::585584209241:role/GlueForDemo'
)

print(json.dumps(response, indent=4, sort_keys=True, default=str))

Here’s an execution output:

Creating an AWS Glue Workflow from a Blueprint using Boto3
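Since the Parameters argument is a JSON string, a cleaner alternative to hand-escaping it is to build a regular dictionary and serialize it with json.dumps(); a sketch reusing the client and imports from the snippet above:

params = {
    "WorkflowName": "s3_crawl_wflow",
    "IAMRole": "arn:aws:iam::585584209241:role/GlueFullAccess",
    "S3Paths": [
        "s3://covid19-lake/enigma-aggregation/json/global/",
        "s3://covid19-lake/enigma-aggregation/json/global_countries/",
        "s3://covid19-lake/enigma-aggregation/json/us_counties/",
        "s3://covid19-lake/enigma-aggregation/json/us_states/"
    ],
    "DatabaseName": "blueprint_tutorial"
}

response = client.start_blueprint_run(
    BlueprintName='Crawler_Blueprint_From_S3',
    Parameters=json.dumps(params),
    RoleArn='arn:aws:iam::585584209241:role/GlueForDemo'
)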

To review the workflow creation status, select the AWS Glue Blueprint in the AWS console and choose View from the Actions drop-down menu:

Creating an AWS Glue Workflow from a Blueprint using Boto3 – Blueprint details

Scroll down to see the blueprint runs:

Creating an AWS Glue Workflow from a Blueprint using Boto3 – Running Blueprints

On completion, AWS Glue assigns the workflow name to the Blueprint run:

Creating an AWS Glue Workflow from a Blueprint using Boto3 – Succeeded Blueprint run

Click on the workflow to see the graph containing the sequence of steps that will be executed during the workflow run:

Creating an AWS Glue Workflow from a Blueprint using Boto3 – Workflow Graph

Listing AWS Glue Workflows

To list AWS Glue Workflows, you need to use the list_workflows() method of the Boto3 Glue client:

import boto3
import json

client = boto3.client('glue', region_name="us-east-1")

response = client.list_workflows()

print(json.dumps(response, indent=4, sort_keys=True, default=str))

This method returns the workflows in the specified AWS Region:

Listing AWS Glue Workflows using Boto3

Starting an AWS Glue Workflow

To start an AWS Glue Workflow, you need to use the start_workflow_run() method of the Boto3 Glue client:

import boto3
import json

client = boto3.client('glue', region_name="us-east-1")

response = client.start_workflow_run(
    Name='s3_crawl_wflow'
)

print(json.dumps(response, indent=4, sort_keys=True, default=str))

You can check the status of the workflow in the AWS console:

Running an AWS Glue Workflow using Boto3 – AWS Console

Click on the View run details button to see the workflow execution graph, which shows the current status of the workflow run.

Running an AWS Glue Workflow using Boto3 – AWS Console – Workflow execution graph
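The start_workflow_run() call returns a RunId, which you can pass to the get_workflow_run() method to check the same status from code (reusing the client and response from the snippet above):

# Fetch the current status of the run started above
# (RUNNING, COMPLETED, STOPPED, or ERROR)
run_status = client.get_workflow_run(
    Name='s3_crawl_wflow',
    RunId=response['RunId']
)
print(run_status['Run']['Status'])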

Deleting an AWS Glue Workflow

To delete an AWS Glue Workflow, you need to use the delete_workflow() method of the Boto3 Glue client:

import boto3
import json

client = boto3.client('glue', region_name="us-east-1")

response = client.delete_workflow(Name='s3_crawl_wflow')

print(json.dumps(response, indent=4, sort_keys=True, default=str))


Deleting an AWS Glue Blueprint

To delete an AWS Glue Blueprint, you need to use the delete_blueprint() method of the Boto3 Glue client. This method takes the Name of the blueprint as an argument:

import boto3
import json

client = boto3.client('glue', region_name="us-east-1")

response = client.delete_blueprint(Name='Crawler_Blueprint_From_S3')

print(json.dumps(response, indent=4, sort_keys=True, default=str))

Here’s an execution output:

Deleting an AWS Glue Blueprint using Boto3

Summary

In this article, we’ve covered how to use the Boto3 library to interact with AWS Glue to automate ETL jobs, crawlers, blueprints, and workflows, and to define metadata in the Glue Data Catalog.
