Using SNS And SQS As Target For AWS Lambda Dead Letter Queue

Andrei Maksimov

Once you start building microservice applications in the serverless world, you have to accept that your microservices will sometimes fail. That is fine as long as the failure does not affect your application or your customers. In this article, we cover the pros and cons of using SQS and SNS as a Lambda Dead Letter Queue (DLQ).

If your Lambda function does something important, it becomes critical to know if the function execution failed.

The first way to get notified about the failures is to start using monitoring solutions for your Lambda functions.

There are two common approaches:

  • Collecting and analyzing logs – you can set up a CloudWatch Logs Metric Filter and an Alarm that fires when the word “Error” or “Exception” appears in the logs within a given period.
  • Collecting and analyzing monitoring metrics – AWS provides a comprehensive list of Lambda invocation, performance, and concurrency metrics, which you can put on a CloudWatch Dashboard. You can set up CloudWatch Alarms on them as well.
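As a rough illustration of the metrics-based approach, the sketch below raises an alarm when the built-in Errors metric of a function is non-zero. The function_name variable and the alerts topic are hypothetical names introduced only for this example.

```hcl
# Hypothetical input: the name of an existing Lambda function.
variable "function_name" {
  type        = string
  description = "Name of the Lambda function to monitor"
}

# Hypothetical topic that receives the alarm notifications.
resource "aws_sns_topic" "alerts" {
  name = "${var.function_name}-alerts"
}

# Fires when the function reports at least one error within a minute.
resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name  = "${var.function_name}-errors"
  namespace   = "AWS/Lambda"
  metric_name = "Errors"

  dimensions = {
    FunctionName = var.function_name
  }

  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 1
  treat_missing_data  = "notBreaching" # no invocations means no errors
  alarm_actions       = [aws_sns_topic.alerts.arn]
}
```

An alarm like this tells you that something failed, but not which event failed. That is the gap a DLQ fills.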

The second way to solve this problem is to build a monitoring solution yourself using SNS or SQS as a transport.

Lambda Dead Letter Queue (DLQ) is a feature released on Dec 1, 2016. It lets you collect the asynchronous invocation events that your Lambda function failed to process.

Currently, you have two options for processing this information:

  • SQS.
  • SNS.
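
Whichever target you pick, Lambda hands an event to the DLQ only after its asynchronous retries are exhausted (two retries by default). As a minimal sketch, the retry behavior could be tuned as follows. The function_name variable is a hypothetical input, and the values are illustrative rather than recommended.

```hcl
# Hypothetical input: the name of an existing Lambda function.
variable "function_name" {
  type        = string
  description = "Name of the Lambda function whose async retries are tuned"
}

# Controls how asynchronous invocations are retried before the event is
# treated as failed and handed to the function's dead letter queue.
resource "aws_lambda_function_event_invoke_config" "async_retries" {
  function_name                = var.function_name
  maximum_retry_attempts       = 1    # allowed values: 0 to 2 (default 2)
  maximum_event_age_in_seconds = 3600 # allowed values: 60 to 21600
}
```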

SQS as Dead Letter Queue

You can use an SQS queue as a Lambda DLQ when you need a durable store for failed events that can be monitored and picked up for resolution at your convenience. You can process the failed events in bulk, wait for a defined period before re-triggering the original event, or handle them in some other way.

Here’s how it works:

  • Lambda receives an event from an AWS service, either directly from the service or through EventBridge.
  • It attempts to do something meaningful in response to the event, but fails.
  • Lambda then sends the incoming event (a JSON document) to the DLQ.
  • You can configure a CloudWatch Alarm to fire when the number of messages in the SQS queue exceeds a certain threshold.
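
The flow above could be wired up in Terraform roughly as follows. This sketch reuses var.app_env and the aws_iam_role.iam_for_terraform_lambda role from the Terraform implementation later in this article. The queue, policy, and alarm names are illustrative assumptions.

```hcl
# Queue that stores the failed events.
resource "aws_sqs_queue" "lambda_dlq" {
  name                      = "${var.app_env}-errors-sqs"
  message_retention_seconds = 1209600 # 14 days, the SQS maximum
}

# Allow the function's execution role to send failed events to the queue.
resource "aws_iam_role_policy" "lambda_dlq_send" {
  name = "${var.app_env}-lambda-dlq-send"
  role = aws_iam_role.iam_for_terraform_lambda.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "sqs:SendMessage"
      Resource = aws_sqs_queue.lambda_dlq.arn
    }]
  })
}

# In the aws_lambda_function resource, point the DLQ at the queue:
#   dead_letter_config {
#     target_arn = aws_sqs_queue.lambda_dlq.arn
#   }

# Alarm as soon as at least one failed event is waiting in the queue.
resource "aws_cloudwatch_metric_alarm" "dlq_not_empty" {
  alarm_name  = "${var.app_env}-dlq-not-empty"
  namespace   = "AWS/SQS"
  metric_name = "ApproximateNumberOfMessagesVisible"

  dimensions = {
    QueueName = aws_sqs_queue.lambda_dlq.name
  }

  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
}
```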

SQS Pros

  • Bulk processing – you can collect error messages in the queue and process them in bulk later.
  • Guaranteed delivery – messages are deleted from the queue only when a consumer processes and deletes them, or when the retention period expires (14 days at most).

SQS Cons

  • Not event-driven – messages must be pulled from the queue.
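
Pulling does not have to mean running your own poller. One option is a Lambda event source mapping, which polls the queue on your behalf and invokes a second function with batches of failed events. The sketch below assumes two hypothetical inputs: the ARN of the DLQ and the name of a processing function that already exists.

```hcl
# Hypothetical inputs, not part of the original configuration.
variable "dlq_arn" {
  type        = string
  description = "ARN of the SQS queue used as the Lambda DLQ"
}

variable "processor_function_name" {
  type        = string
  description = "Name of the Lambda function that handles failed events"
}

# Lambda polls the queue and passes up to 10 messages per invocation.
# The processor's execution role needs sqs:ReceiveMessage,
# sqs:DeleteMessage, and sqs:GetQueueAttributes on the queue.
resource "aws_lambda_event_source_mapping" "drain_dlq" {
  event_source_arn                   = var.dlq_arn
  function_name                      = var.processor_function_name
  batch_size                         = 10
  maximum_batching_window_in_seconds = 60 # wait up to a minute to fill a batch
}
```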

SNS as Dead Letter Queue

SNS (Simple Notification Service), on the other hand, is a key part of any event-driven architecture in AWS. It allows you to process events almost instantaneously and fan them out to multiple subscribers.

You can use an SNS Topic as a Lambda Dead Letter Queue. This allows you to take action on the failure instantly. For example, you can attempt to re-process the event, alert an individual or a process, or store the event message in SQS for later follow-up. And you can do all those things at the same time in parallel.

Here’s how it works:

  • Lambda receives an event from an AWS service, either directly from the service or through EventBridge.
  • It attempts to do something meaningful in response to the event, but fails.
  • Lambda then sends the incoming event (a JSON document) to the DLQ, which here is the SNS topic.
  • SNS immediately delivers the message to all of the topic’s subscribers.

The advantage of using SNS is its ability to send messages to multiple subscribers almost instantaneously in parallel.
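
To make the fan-out concrete, here is a minimal sketch with two subscribers on the DLQ topic: an SQS queue that archives failed events and an email address for alerts. The variables and the queue name are hypothetical. Note that an email subscription created this way stays pending until the recipient confirms it.

```hcl
# Hypothetical inputs, not part of the original configuration.
variable "dlq_topic_arn" {
  type        = string
  description = "ARN of the SNS topic used as the Lambda DLQ"
}

variable "alert_email" {
  type        = string
  description = "Email address that should be notified about failures"
}

# Subscriber 1: a queue that keeps failed events for later follow-up.
resource "aws_sqs_queue" "failed_events" {
  name = "failed-events-archive"
}

# SNS must be allowed to deliver messages into the queue.
resource "aws_sqs_queue_policy" "allow_sns" {
  queue_url = aws_sqs_queue.failed_events.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "sns.amazonaws.com" }
      Action    = "sqs:SendMessage"
      Resource  = aws_sqs_queue.failed_events.arn
      Condition = { ArnEquals = { "aws:SourceArn" = var.dlq_topic_arn } }
    }]
  })
}

resource "aws_sns_topic_subscription" "archive" {
  topic_arn = var.dlq_topic_arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.failed_events.arn
}

# Subscriber 2: an email alert (requires confirmation by the recipient).
resource "aws_sns_topic_subscription" "email" {
  topic_arn = var.dlq_topic_arn
  protocol  = "email"
  endpoint  = var.alert_email
}
```

Both subscribers receive every failed event, so alerting and archiving happen in parallel.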

SNS Pros

  • Event-driven: SNS will take action instantly upon receiving a message.
  • Fan-out: SNS allows multiple actions to be taken by different subscribers at the same time in parallel.

SNS Cons

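  • SNS is not durable storage – it does not keep messages. If a message cannot be delivered to a subscriber and the delivery retries are exhausted, the message is lost unless the subscription has its own dead-letter queue.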
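
One way to soften this drawback is to attach a redrive policy to a subscription, so that messages SNS cannot deliver end up in an SQS queue instead of being dropped. The sketch below is an assumption about how that could look, not part of the original setup. The variables and the queue name are hypothetical, and the subscribed Lambda function is assumed to already allow invocation by SNS.

```hcl
# Hypothetical inputs, not part of the original configuration.
variable "dlq_topic_arn" {
  type        = string
  description = "ARN of the SNS topic used as the Lambda DLQ"
}

variable "subscriber_lambda_arn" {
  type        = string
  description = "ARN of the Lambda function subscribed to the topic"
}

# Queue that keeps messages SNS failed to deliver to the subscriber.
resource "aws_sqs_queue" "undelivered" {
  name                      = "sns-undelivered-messages"
  message_retention_seconds = 1209600 # 14 days, the SQS maximum
}

# SNS must be allowed to move undelivered messages into the queue.
resource "aws_sqs_queue_policy" "allow_sns_redrive" {
  queue_url = aws_sqs_queue.undelivered.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "sns.amazonaws.com" }
      Action    = "sqs:SendMessage"
      Resource  = aws_sqs_queue.undelivered.arn
      Condition = { ArnEquals = { "aws:SourceArn" = var.dlq_topic_arn } }
    }]
  })
}

# Subscription with its own dead-letter queue for failed deliveries.
resource "aws_sns_topic_subscription" "processor" {
  topic_arn = var.dlq_topic_arn
  protocol  = "lambda"
  endpoint  = var.subscriber_lambda_arn

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.undelivered.arn
  })
}
```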

Terraform Implementation

Here’s a Terraform implementation that uses SNS as the Lambda DLQ. The complete source code, including scripts and the Lambda function, is available in our GitHub repository:

variable "region" {
    default = "us-east-1"
    description = "AWS Region to deploy to"
}

variable "app_env" {
    default = "failure_detection_example"
    description = "Application environment name, used as a prefix for resource names"
}

variable "sns_subscription_email_address_list" {
    type = string
    description = "List of email addresses as string(space separated)"
}

data "aws_caller_identity" "current" {}

data "archive_file" "lambda_zip" {
    source_dir  = "${path.module}/lambda/"
    output_path = "${path.module}/lambda.zip"
    type        = "zip"
}

provider "aws" {
    region = var.region
}

resource "aws_iam_policy" "lambda_policy" {
    name        = "${var.app_env}-lambda-policy"
    description = "${var.app_env}-lambda-policy"
 
    policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "sns:Publish"
      ],
      "Effect": "Allow",
      "Resource": "${aws_sns_topic.dlq.arn}"
    },
    {
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
EOF
}

resource "aws_iam_role" "iam_for_terraform_lambda" {
    name = "${var.app_env}-lambda-role"
    assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Effect": "Allow"
    }
  ]
}
EOF
}

resource "aws_iam_role_policy_attachment" "terraform_lambda_iam_policy_basic_execution" {
    role = aws_iam_role.iam_for_terraform_lambda.name
    policy_arn = aws_iam_policy.lambda_policy.arn
}

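resource "aws_lambda_function" "error_function" {
    # Use the archive produced by data.archive_file.lambda_zip so the path always matches.
    filename = data.archive_file.lambda_zip.output_path
    source_code_hash = data.archive_file.lambda_zip.output_base64sha256
    function_name = "${var.app_env}-lambda"
    role = aws_iam_role.iam_for_terraform_lambda.arn
    handler = "index.handler"
    # The original used python3.6, which Lambda no longer accepts for new functions.
    runtime = "python3.12"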

    dead_letter_config {
        target_arn = aws_sns_topic.dlq.arn
    }
}

resource "aws_sns_topic" "dlq" {
    name = "${var.app_env}-errors-sns"

    provisioner "local-exec" {
        command = "sh sns_subscription.sh"
        environment = {
            sns_arn = self.arn
            sns_emails = var.sns_subscription_email_address_list
        }
    }
}

resource "aws_cloudwatch_log_group" "lambda_loggroup" {
    name = "/aws/lambda/${aws_lambda_function.error_function.function_name}"
    retention_in_days = 14
}

# ---------------- CloudWatch Logs metric filter and alarm ----------------

resource "aws_cloudwatch_log_metric_filter" "lambda_exceptions" {
    name = "${var.app_env}_lambda_exceptions"
    pattern = "\"Exception\""
    log_group_name = aws_cloudwatch_log_group.lambda_loggroup.name

    metric_transformation {
        name = "${var.app_env}_lambda_exceptions"
        namespace = "MyCustomMetrics"
        value = "1"
    }
}

resource "aws_cloudwatch_metric_alarm" "lambda_exceptions" {
    alarm_name = "${var.app_env}_lambda_exceptions"
    comparison_operator = "GreaterThanOrEqualToThreshold"
    evaluation_periods = 1
    metric_name = "${var.app_env}_lambda_exceptions"
    namespace = "MyCustomMetrics"
    # Metric-filter metrics have one-minute resolution, so use a 60-second period.
    period = 60
    # Sum counts the matching log lines in each period.
    statistic = "Sum"
    threshold = 1
    alarm_description = "This metric monitors Lambda logs for 'Exception' keyword"
    insufficient_data_actions = []
    alarm_actions = [aws_sns_topic.dlq.arn]
}

output "lambda_name" {
    value = aws_lambda_function.error_function.id
}

This Terraform configuration deploys a deliberately failing Lambda function that returns an error on every execution. The function has permission to publish messages to the SNS topic and to write its logs to CloudWatch.

The same configuration also adds a CloudWatch Logs Metric Filter and an Alarm to the Lambda function logs (the aws_cloudwatch_log_metric_filter and aws_cloudwatch_metric_alarm resources above). The filter counts log lines that contain the word “Exception”, and the alarm publishes to the same SNS topic as soon as such a line appears.

Summary

In this article, we covered the differences between using SNS and SQS as Dead Letter Queue targets for your Lambda functions.

We hope you found this article helpful. If so, please help us spread the word!

If you have any questions that this article does not cover, feel free to reach out. We’re happy to help.
