Working with AWS Batch in Python using Boto3

Tuvshinsanaa Tuul

AWS Batch enables developers, scientists, and engineers to quickly and efficiently run hundreds of thousands of batch computing jobs on AWS. AWS Batch dynamically provisions the optimal quantity and type of computing resources (e.g., CPU- or memory-optimized instances) based on the volume and specific resource requirements of the batch jobs submitted. This article will show how to work with AWS Batch in Python using the Boto3 library by implementing a job that imports records into a DynamoDB table from a file uploaded to an S3 bucket.

What is AWS Batch?

AWS Batch plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as AWS Fargate, Amazon EC2, and Spot Instances.

AWS Batch organizes its work into four components:

  • Jobs – the unit of work submitted to AWS Batch, whether it be implemented as a shell script, executable, or Docker container image.
  • Job Definition – describes how your work is executed, including the CPU and memory requirements and IAM role that provides access to other AWS services.
  • Job Queues – the list of work to be completed by your Jobs. You can leverage multiple queues with different priority levels.
  • Compute Environment – the compute resources that run your Jobs. Environments can be managed by AWS or by you, and you can control the number and types of instances on which your Jobs run. You can also let AWS select the right instance type.

Features

  • EC2 Instances will run only for the time that’s needed, taking advantage of per-second billing. You can also lower your costs by using spot instances.
  • It’s possible to configure how many retries you’d like for any job.
  • It offers queues where you send the jobs. Each queue could be configured with a certain priority so you can configure which jobs will run first. You can also have queues that use better resources to speed up the process.
  • It supports Docker containers so that you can focus only on your code.

Job states

Here’s the list of AWS Batch job states (a short Boto3 sketch for checking a job’s status follows the list):

  • SUBMITTED: Accepted into the queue, but not yet evaluated for execution
  • PENDING: Your job has dependencies on other jobs which have not yet completed
  • RUNNABLE: Your job has been evaluated by the scheduler and is ready to run
  • STARTING: Your job is in the process of being scheduled to a compute resource
  • RUNNING: Your job is currently running
  • SUCCEEDED: Your job has finished with exit code 0
  • FAILED: Your job finished with a non-zero exit code, was cancelled or terminated
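You can also check these states programmatically. Here’s a minimal sketch that queries a job’s status with the Boto3 describe_jobs() call (the job ID below is a placeholder; the real one is returned by submit_job(), shown later in this article):

import boto3

batch = boto3.client('batch')

# 'example-job-id' is a placeholder; use the jobId returned by submit_job()
response = batch.describe_jobs(jobs=['example-job-id'])

for job in response['jobs']:
    print(job['jobName'], job['status'], job.get('statusReason', ''))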

Job scheduler

The AWS Batch scheduler evaluates when, where, and how to run jobs that have been submitted to a job queue. Jobs run in the order they are introduced as long as all dependencies on other jobs have been met.
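Dependencies are declared with the dependsOn parameter of submit_job(). A minimal sketch, using the queue and job definition created later in this article (the job names here are hypothetical):

import boto3

batch = boto3.client('batch')

# Submit the first job
job_a = batch.submit_job(
    jobName='prepare-data',
    jobQueue='dynamodb_import_queue',
    jobDefinition='dynamodb_import_job_definition',
)

# The second job stays PENDING until the first one has completed
job_b = batch.submit_job(
    jobName='import-data',
    jobQueue='dynamodb_import_queue',
    jobDefinition='dynamodb_import_job_definition',
    dependsOn=[{'jobId': job_a['jobId']}],
)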

Prerequisites

Let’s create a Docker container and IAM role for AWS Batch job execution, DynamoDB table, and S3 bucket.

Docker container

You can skip this section and use the already existing Docker image from Docker Hub: luckytuvshee/importuser:latest.

First of all, we need to create a Docker image responsible for the computing task that we’ll run as an AWS Batch job.

Here’s a working folder structure:

Creating Docker image for AWS Batch

The content of the Dockerfile:

FROM amazonlinux:latest

# Install Python 3 and pip (the pip package is named python3-pip in Amazon Linux)
RUN yum -y install which unzip python3 python3-pip

RUN pip3 install boto3

# Copy the job script into the image and make sure it is executable
ADD importUser.py /usr/local/bin/importUser.py
RUN chmod +x /usr/local/bin/importUser.py

WORKDIR /tmp

USER nobody

ENTRYPOINT ["/usr/local/bin/importUser.py"]

Now, let’s create the importUser.py Python script that imports data from a CSV file uploaded to the S3 bucket into the DynamoDB table:

#!/usr/bin/python3

import os
import boto3
import csv
from datetime import datetime, timezone


s3_resource = boto3.resource('s3')

# Debug output: dump the environment variables the job received
print('os environ:', os.environ)

# The job configuration is passed in through environment variables
# (set via containerOverrides when the job is submitted)
table_name = os.environ['table_name']
bucket_name = os.environ['bucket_name']
key = os.environ['key']

table = boto3.resource('dynamodb').Table(table_name)
csv_file = s3_resource.Object(bucket_name, key)

# Read the CSV file from S3 and split it into lines
items = csv_file.get()['Body'].read().decode('utf-8').splitlines()
reader = csv.reader(items)
header = next(reader)

# Current UTC timestamp in ISO 8601 format with a trailing 'Z'
current_date = datetime.now(timezone.utc).isoformat()[:-6] + 'Z'

# Put every data row into the DynamoDB table
for row in reader:
    table.put_item(
        Item={
            'id': row[header.index('id')],
            'number': row[header.index('number')],
            'createdAt': current_date,
        }
    )

print('records imported successfully')


Let’s build a Docker image:

docker build -f Dockerfile -t luckytuvshee/importuser .
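Before pushing the image, you can optionally test it locally. A sketch, assuming the DynamoDB table and S3 bucket created later in this article already exist and that you fill in the angle-bracket placeholders with your own AWS credentials:

docker run --rm \
  -e AWS_ACCESS_KEY_ID=<your-access-key-id> \
  -e AWS_SECRET_ACCESS_KEY=<your-secret-access-key> \
  -e AWS_DEFAULT_REGION=ap-northeast-1 \
  -e table_name=batch-test-table \
  -e bucket_name=batch-test-bucket-ap-1 \
  -e key=sample-zip.csv \
  luckytuvshee/importuser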

As soon as the image has been built, you can push it to the Docker registry:

docker push luckytuvshee/importuser
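Alternatively, if you prefer to keep the image in your own AWS account rather than on Docker Hub, you can push it to Amazon ECR. A sketch, assuming the AWS CLI is configured; <account-id> is your AWS account ID, and the repository name importuser is arbitrary:

aws ecr create-repository --repository-name importuser --region ap-northeast-1

aws ecr get-login-password --region ap-northeast-1 | \
  docker login --username AWS --password-stdin <account-id>.dkr.ecr.ap-northeast-1.amazonaws.com

docker tag luckytuvshee/importuser:latest <account-id>.dkr.ecr.ap-northeast-1.amazonaws.com/importuser:latest
docker push <account-id>.dkr.ecr.ap-northeast-1.amazonaws.com/importuser:latest

If you go this route, reference the ECR image URI instead of luckytuvshee/importuser:latest when registering the job definition later.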


DynamoDB table

Let’s create a DynamoDB table, which will store records imported by the AWS Batch job.


import boto3 

dynamodb = boto3.resource('dynamodb')

response = dynamodb.create_table(
    TableName='batch-test-table',
    KeySchema=[
        {
            'AttributeName': 'id',
            'KeyType': 'HASH'
        }
    ],
    AttributeDefinitions=[
        {
            'AttributeName': 'id',
            'AttributeType': 'S'
        },
    ],
    ProvisionedThroughput={
        'ReadCapacityUnits': 1,
        'WriteCapacityUnits': 1
    }
)

print(response)
Create DynamoDB table
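Note that create_table() is asynchronous, so the table may still be in the CREATING state right after the call returns. A minimal sketch that blocks until the table is ready, using the waiter built into the Boto3 Table resource:

import boto3

table = boto3.resource('dynamodb').Table('batch-test-table')

# Block until the table reaches the ACTIVE state
table.wait_until_exists()

print(table.table_status)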

S3 bucket

Now, we need to create an S3 bucket, which will store uploaded CSV files. The AWS Batch job will process these files.


import boto3 

s3 = boto3.resource('s3')

response = s3.create_bucket(
    Bucket='batch-test-bucket-ap-1',
    CreateBucketConfiguration={
        'LocationConstraint': 'ap-northeast-1'
    }
)

print(response)
Create S3 bucket

CSV file example

Here’s an example of the CSV file data, which we’ll upload to the S3 bucket:

Sample CSV file

We’ll name this file sample-zip.csv. Let’s upload it to the S3 bucket:
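You can do this from the AWS console or with Boto3; a minimal sketch of the upload:

import boto3

s3 = boto3.resource('s3')

# Upload the local sample-zip.csv into the bucket created earlier
s3.Bucket('batch-test-bucket-ap-1').upload_file('sample-zip.csv', 'sample-zip.csv')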

Uploaded sample CSV file

AWS Batch job’s IAM role

Now, let’s create the IAM role that the Docker container will use to run the Python Boto3 script.

This role requires access to the DynamoDB, S3, and CloudWatch services. For simplicity, we’ll use the AmazonDynamoDBFullAccess, AmazonS3FullAccess, and CloudWatchFullAccess managed policies, but we strongly encourage you to make a custom role with only the necessary permissions.


import boto3
import json 

client = boto3.client('iam')

assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ecs-tasks.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]   
}

response = client.create_role(
    RoleName='dynamodbImportRole',
    AssumeRolePolicyDocument=json.dumps(assume_role_policy)
)

client.attach_role_policy(
    RoleName=response['Role']['RoleName'],
    PolicyArn='arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess'
)

client.attach_role_policy(
    RoleName=response['Role']['RoleName'],
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess'
)

client.attach_role_policy(
    RoleName=response['Role']['RoleName'],
    PolicyArn='arn:aws:iam::aws:policy/CloudWatchFullAccess'
)

print(response)
Create IAM Role for Docker Container

Managing AWS Batch using Boto3

This section of the article will cover how to manage the AWS Batch service and create and run the AWS Batch job.

Create AWS Batch compute environment

To create a computing environment for AWS Batch, you need to use the create_compute_environment() method of the AWS Batch Boto3 client.

AWS Batch job queues are mapped to one or more compute environments:

  • MANAGED – Managed compute environments launch Amazon ECS container instances into the VPC and subnets that you specify when you create the compute environment. Amazon ECS container instances need external network access to communicate with the Amazon ECS service endpoint.
  • UNMANAGED – In an unmanaged compute environment, you manage your own compute resources. You must verify that the AMI you use for your compute resources meets the Amazon ECS container instance AMI specification.

You can also set the instance type to optimal, which means AWS Batch will evaluate each job’s requirements (whether it is CPU-bound, memory-bound, or a mix of both) and select an appropriate instance type on which to run it.

import boto3

client = boto3.client('batch')

response = client.create_compute_environment(
    computeEnvironmentName='dynamodb_import_environment',
    type='MANAGED',
    state='ENABLED',
    computeResources={
        'type': 'EC2',
        'allocationStrategy': 'BEST_FIT',
        'minvCpus': 0,
        'maxvCpus': 256,
        'subnets': [
            'subnet-0be50d51',
            'subnet-3fd16f77',
            'subnet-0092132b',
        ],
        'instanceRole': 'ecsInstanceRole',
        'securityGroupIds': [
            'sg-851667c7',
        ],
        'instanceTypes': [
            'optimal',
        ]
    }
)

print(response)
Create AWS Batch Compute Environment
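The compute environment is created asynchronously, and creating the job queue in the next step may fail until the environment’s status is VALID. There is no built-in waiter for this, so here’s a minimal polling sketch:

import boto3
import time

client = boto3.client('batch')

# Poll until the compute environment becomes VALID
while True:
    response = client.describe_compute_environments(
        computeEnvironments=['dynamodb_import_environment']
    )
    status = response['computeEnvironments'][0]['status']
    print('compute environment status:', status)
    if status == 'VALID':
        break
    time.sleep(5)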

Create AWS Batch job queue

To create a job queue for AWS Batch, you need to use the create_job_queue() method of the AWS Batch Boto3 client.

Jobs are submitted to a job queue, where they reside until they can be scheduled to a compute resource. Information related to completed jobs persists in the queue for 24 hours.

When you’re creating a queue, you have to define the queue state (ENABLED or DISABLED).

You can have different types of queues with varying kinds of priorities.

import boto3

client = boto3.client('batch')

response = client.create_job_queue(
    jobQueueName='dynamodb_import_queue',
    state='ENABLED',
    priority=1,
    computeEnvironmentOrder=[
        {
            'order': 100,
            'computeEnvironment': 'dynamodb_import_environment'
        },
    ],
)
print(response)
Create Job Queue
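Later, once jobs have been submitted to this queue, you can list them per state with list_jobs(). A minimal sketch:

import boto3

client = boto3.client('batch')

# List the jobs in the queue for each possible state
for status in ['SUBMITTED', 'PENDING', 'RUNNABLE', 'STARTING',
               'RUNNING', 'SUCCEEDED', 'FAILED']:
    response = client.list_jobs(jobQueue='dynamodb_import_queue', jobStatus=status)
    for job in response['jobSummaryList']:
        print(status, job['jobId'], job['jobName'])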

Register AWS Batch job definition

To register a job definition in AWS Batch, you need to use the register_job_definition() method of the AWS Batch Boto3 client.

AWS Batch job definitions specify how batch jobs need to be run.

Here are some of the attributes that you can specify in a job definition:

  • IAM role associated with the job
  • vCPU and memory requirements
  • Container properties
  • Environment variables
  • Retry strategy

import boto3

iam = boto3.client('iam')
client = boto3.client('batch')

dynamodbImportRole = iam.get_role(RoleName='dynamodbImportRole')

response = client.register_job_definition(
    jobDefinitionName='dynamodb_import_job_definition',
    type='container',
    containerProperties={
        'image': 'luckytuvshee/importuser:latest',
        'memory': 256,
        'vcpus': 16,
        'jobRoleArn': dynamodbImportRole['Role']['Arn'],
        'executionRoleArn': dynamodbImportRole['Role']['Arn'],
        'environment': [
            {
                'name': 'AWS_DEFAULT_REGION',
                'value': 'ap-northeast-1',
            }
        ]
    },
)

print(response)
Register Job Definition

Submit AWS Batch job for execution

Jobs are the unit of work executed by AWS Batch as containerized applications running on Amazon EC2 or AWS Fargate.

Containerized jobs can reference a container image, command, and parameters.

With the containerOverrides parameter, you can override at job submission some of the parameters defined in the job definition’s container properties. This lets you build a general-purpose container and then pass extra, job-specific configuration when you submit it.

You can also specify the retryStrategy, which allows you to define how many times you want the job to be restarted before it fails.
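For example, here’s a minimal sketch of a submission with a retry strategy (the job name is hypothetical); the full submission for our import job, with container overrides, follows:

import boto3

client = boto3.client('batch')

# Retry the job up to 3 times before it is marked as FAILED
response = client.submit_job(
    jobDefinition='dynamodb_import_job_definition',
    jobName='dynamodb_import_job_with_retries',
    jobQueue='dynamodb_import_queue',
    retryStrategy={'attempts': 3},
)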

import boto3

client = boto3.client('batch')

response = client.submit_job(
    jobDefinition='dynamodb_import_job_definition',
    jobName='dynamodb_import_job1',
    jobQueue='dynamodb_import_queue',
    containerOverrides={
        'environment': [
            {
                'name': 'table_name',
                'value': 'batch-test-table',
            },
            {
                'name': 'bucket_name',
                'value': 'batch-test-bucket-ap-1',
            },
            {
                'name': 'key',
                'value': 'sample-zip.csv',
            }
        ]
    },
)

print(response)
Submit AWS Batch job for execution

You can check AWS Batch job status in the AWS console:

AWS Batch job status

As soon as the AWS Batch job finishes its execution, you may check the imported data in the DynamoDB table.

DynamoDB imported records
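You can also verify the result programmatically; a quick sketch that scans the demo table (a full scan is fine for a table this small):

import boto3

table = boto3.resource('dynamodb').Table('batch-test-table')

# Print every imported record
response = table.scan()
for item in response['Items']:
    print(item)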

Summary

This article covered the fundamentals of AWS Batch and how to use Python and the Boto3 library to manage AWS Batch jobs. We’ve created a demo job that imports a CSV file from an S3 bucket into a DynamoDB table.
