Using SNS And SQS As Target For AWS Lambda Dead Letter Queue

As soon as you start developing microservice applications in the Serverless world, you have to accept that your microservices may sometimes fail. That's OK as long as the failure does not affect your application or your customers. In this article, we cover the pros and cons of using SQS and SNS as a Lambda Dead Letter Queue (DLQ).

If your Lambda function does something important, it becomes critical to know if the execution failed.

The first way to get notified about failures is to use a monitoring solution for your Lambda functions.

The second way to solve this problem is to build a monitoring solution yourself using SNS or SQS as a transport.

Lambda Dead Letter Queue (DLQ) is a feature released on Dec 1, 2016. It lets you collect the asynchronous invocation events that your Lambda function failed to process.
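To make "asynchronous invocation" concrete: the DLQ only comes into play when a function is invoked with the `Event` invocation type. Below is a minimal Python sketch of such a call. It assumes a boto3 Lambda client is passed in; the helper name `invoke_async` and its arguments are illustrative and not part of the original example.

```python
import json


def invoke_async(lambda_client, function_name, payload):
    """Invoke a Lambda function asynchronously.

    `lambda_client` is assumed to be a boto3 Lambda client, for example
    boto3.client("lambda"). With InvocationType="Event", Lambda queues the
    event and returns immediately; events that the function ultimately fails
    to process are the ones that go to its configured DLQ.
    """
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="Event",  # asynchronous; "RequestResponse" is synchronous
        Payload=json.dumps(payload).encode("utf-8"),
    )
    # Lambda answers 202 (Accepted) when an asynchronous invocation is queued.
    return response["StatusCode"] == 202
```

A synchronous (`RequestResponse`) call returns the error to the caller directly, so the DLQ is not involved in that case.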

Currently, a Lambda DLQ can target one of two services:

  • SQS.
  • SNS.
Dead Letter Queue Options

SQS as Dead Letter Queue

You can use an SQS queue as a Lambda DLQ: a durable store for failed events that you can monitor and pick up for resolution at your convenience. You can process the failed events in bulk, wait for a defined period before re-triggering the original event, or do something else entirely.

Here’s how it works:

  • Lambda receives an event from an AWS service, either directly from the service or through EventBridge.
  • It attempts to do something meaningful in response to the event but fails.
  • Once its retry attempts are exhausted, Lambda sends the incoming event (a JSON document) to the DLQ.
  • You can configure a CloudWatch Alarm to fire if the number of messages in the SQS queue exceeds a certain threshold.

SQS Pros

  • Bulk processing – you may collect error messages in the queue and process them in bulk later.
  • Guaranteed delivery – messages are deleted from the queue only when a consumer deletes them after processing, or when the retention period expires (up to 14 days; the SQS default is 4 days).

SQS Cons

  • Not event-driven – messages must be pulled from the queue.
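
Pulling and bulk processing could look like the Python sketch below. It is only an illustration under stated assumptions: `sqs_client` is expected to be a boto3 SQS client, and `drain_dlq` and the `handle_failure` callback are hypothetical names. Lambda attaches error details such as `ErrorMessage` to the DLQ message as message attributes, which is what the sketch reads.

```python
def drain_dlq(sqs_client, queue_url, handle_failure, batch_size=10):
    """Pull failed events from an SQS DLQ in batches and delete handled ones.

    `sqs_client` is assumed to be a boto3 SQS client; `handle_failure` is a
    hypothetical callback that takes (event_body, error_message).
    Returns the number of messages processed.
    """
    processed = 0
    while True:
        response = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=batch_size,  # SQS allows at most 10 per call
            WaitTimeSeconds=5,               # long polling
            MessageAttributeNames=["All"],   # include the error attributes
        )
        messages = response.get("Messages", [])
        if not messages:
            return processed

        handled = []
        for message in messages:
            attributes = message.get("MessageAttributes", {})
            error = attributes.get("ErrorMessage", {}).get("StringValue", "")
            handle_failure(message["Body"], error)
            handled.append(
                {"Id": message["MessageId"], "ReceiptHandle": message["ReceiptHandle"]}
            )

        # Delete only after processing, so unhandled messages stay in the queue.
        sqs_client.delete_message_batch(QueueUrl=queue_url, Entries=handled)
        processed += len(handled)
```

With real AWS resources this could be called as `drain_dlq(boto3.client("sqs"), queue_url, my_callback)`, for example from a scheduled job.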

SNS as Dead Letter Queue

SNS, or Simple Notification Service, on the other hand, is a key part of any event-driven architecture in AWS. It allows you to process events as soon as they arrive and fan them out to multiple subscribers.

You can use an SNS Topic as a Lambda Dead Letter Queue. This allows you to take action on the failure instantly. For example, you can attempt to re-process the event, alert an individual or a process, or store the event message in SQS for later follow-up. And you can do all those things at the same time in parallel.

Here’s how it works:

  • Lambda receives an event from an AWS service, either directly from the service or through EventBridge.
  • It attempts to do something meaningful in response to the event but fails.
  • Once its retry attempts are exhausted, Lambda sends the incoming event (a JSON document) to the DLQ.
  • SNS immediately sends the incoming message to multiple destinations.

The advantage of using SNS is its ability to send messages to multiple subscribers almost instantaneously in parallel.
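
One of those subscribers can itself be a Lambda function. The Python sketch below shows how such a subscriber might unpack the failed event. It assumes the standard SNS-to-Lambda event shape and the `RequestID`, `ErrorCode`, and `ErrorMessage` attributes that Lambda attaches to DLQ messages; the function names and the print-based "alert" are placeholders, not code from the original repository.

```python
import json


def parse_dlq_record(record):
    """Extract the failed event and error details from one SNS record."""
    sns = record["Sns"]
    attributes = sns.get("MessageAttributes", {})

    def attr(name):
        # SNS delivers attributes to Lambda as {"Type": ..., "Value": ...}.
        return attributes.get(name, {}).get("Value")

    return {
        # Assumes the original event was a JSON document.
        "failed_event": json.loads(sns["Message"]),
        "request_id": attr("RequestID"),
        "error_code": attr("ErrorCode"),
        "error_message": attr("ErrorMessage"),
    }


def handler(event, context):
    """Entry point for a Lambda function subscribed to the DLQ topic."""
    failures = [parse_dlq_record(record) for record in event["Records"]]
    for failure in failures:
        # Placeholder action: a real subscriber might retry or alert here.
        print(f"Request {failure['request_id']} failed: {failure['error_message']}")
    return failures
```

Because SNS fans out, this subscriber can run alongside an email subscription or an SQS queue that stores the same message for later follow-up.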

SNS Pros

  • Event-driven: SNS will take action instantly upon receiving a message.
  • Fan-out: SNS allows multiple actions to be taken by different subscribers simultaneously.

SNS Cons

  • SNS is non-durable storage – it does not keep messages for later retrieval. If a message cannot be delivered to a subscriber before the delivery retry policy is exhausted, the message is discarded.

Terraform Implementation

Here's a Terraform implementation that uses SNS as the Lambda DLQ. The complete source code, including scripts and the Lambda function, is available at our GitHub repository:

variable "region" {
    default = "us-east-1"
    description = "AWS Region to deploy to"
}

variable "app_env" {
    default = "failure_detection_example"
    description = "Application environment name, used as a prefix for resource names"
}

variable "sns_subscription_email_address_list" {
    type = string
    description = "List of email addresses as string(space separated)"
}

data "aws_caller_identity" "current" {}

data "archive_file" "lambda_zip" {
    source_dir  = "${path.module}/lambda/"
    output_path = "${path.module}/lambda.zip"
    type        = "zip"
}

provider "aws" {
    region = var.region
}

resource "aws_iam_policy" "lambda_policy" {
    name        = "${var.app_env}-lambda-policy"
    description = "${var.app_env}-lambda-policy"
 
    policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "sns:Publish"
      ],
      "Effect": "Allow",
      "Resource": "${aws_sns_topic.dlq.arn}"
    },
    {
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
EOF
}

resource "aws_iam_role" "iam_for_terraform_lambda" {
    name = "${var.app_env}-lambda-role"
    assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Effect": "Allow"
    }
  ]
}
EOF
}

resource "aws_iam_role_policy_attachment" "terraform_lambda_iam_policy_basic_execution" {
    role = aws_iam_role.iam_for_terraform_lambda.name
    policy_arn = aws_iam_policy.lambda_policy.arn
}

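resource "aws_lambda_function" "error_function" {
    # Use the archive produced by the archive_file data source above
    filename = data.archive_file.lambda_zip.output_path
    source_code_hash = data.archive_file.lambda_zip.output_base64sha256
    function_name = "${var.app_env}-lambda"
    role = aws_iam_role.iam_for_terraform_lambda.arn
    handler = "index.handler"
    # python3.6 is no longer supported by AWS Lambda; use a supported runtime
    runtime = "python3.12"

    # Failed asynchronous invocations are sent to this SNS topic
    dead_letter_config {
        target_arn = aws_sns_topic.dlq.arn
    }

    # The role needs sns:Publish on the topic before the DLQ can be configured
    depends_on = [aws_iam_role_policy_attachment.terraform_lambda_iam_policy_basic_execution]
}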

resource "aws_sns_topic" "dlq" {
    name = "${var.app_env}-errors-sns"

    provisioner "local-exec" {
        command = "sh sns_subscription.sh"
        environment = {
            sns_arn = self.arn
            sns_emails = var.sns_subscription_email_address_list
        }
    }
}

resource "aws_cloudwatch_log_group" "lambda_loggroup" {
    name = "/aws/lambda/${aws_lambda_function.error_function.function_name}"
    retention_in_days = 14
}

output "lambda_name" {
    value = aws_lambda_function.error_function.id
}

This Terraform configuration deploys a deliberately failing Lambda function, which returns an error on every execution. The function has permissions to publish messages to the SNS topic and to write its logs to CloudWatch.
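
The function's code lives in the repository and is not reproduced here. A minimal stand-in might look like the sketch below, assuming it is saved as `lambda/index.py` so that it matches the `index.handler` setting and the `lambda/` source directory in the configuration; the error text is made up for illustration.

```python
def handler(event, context):
    """Always fail, so every asynchronous invocation ends up in the DLQ."""
    # The message contains the word "Exception" so that the logged error
    # matches a log metric filter with the pattern "Exception".
    raise Exception("Simulated Exception for DLQ testing")
```

Invoking this function asynchronously should, after Lambda's retries, deliver the original event to the SNS topic configured in `dead_letter_config`.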

Now, you may use the following code block to add a CloudWatch Metric Filter and Alarm to the Lambda function logs as well:

resource "aws_cloudwatch_log_metric_filter" "lambda_exceptions" {
   name = "${var.app_env}_lambda_exceptions"
   # Match log events that contain the word "Exception"
   pattern = "\"Exception\""
   log_group_name = aws_cloudwatch_log_group.lambda_loggroup.name
   metric_transformation {
       name = "${var.app_env}_lambda_exceptions"
       namespace = "MyCustomMetrics"
       value = "1"
   }
}

resource "aws_cloudwatch_metric_alarm" "lambda_exceptions" {
   alarm_name = "${var.app_env}_lambda_exceptions"
   comparison_operator = "GreaterThanOrEqualToThreshold"
   evaluation_periods = 1
   metric_name = "${var.app_env}_lambda_exceptions"
   namespace = "MyCustomMetrics"
   # Metric filters emit standard-resolution metrics, so use a 60-second period
   # (periods of 10 or 30 seconds are only meant for high-resolution metrics)
   period = 60
   statistic = "Average"
   threshold = 1
   alarm_description = "This metric monitors Lambda logs for 'Exception' keyword"
   insufficient_data_actions = []
   # Send alarm notifications to the same SNS topic that serves as the DLQ
   alarm_actions = [aws_sns_topic.dlq.arn]
}

Summary

This article covered the differences between using SNS and SQS as Dead Letter Queue targets for your Lambda functions.

We hope that this article was helpful. If yes, please, help us spread it to the world!

If you have any questions, which are not covered by this blog, please, feel free to reach out. We’re willing to help.
