Using SNS And SQS As Target For AWS Lambda Dead Letter Queue

As soon as you start developing microservice applications in the Serverless world, you have to accept that your microservices may sometimes fail. That's OK as long as the failure does not affect your application or your customers. In this article, we cover the pros and cons of using SQS and SNS as Lambda Dead Letter Queues (DLQ).

If your Lambda function does something important, it becomes critical to know when an execution failed and what went wrong while the message was being processed. Typical failures include network errors and errors in client dependencies.

The first way to get notified about failures is to use a monitoring solution for your Lambda functions.

There are several approaches to follow:

The second way to solve this problem is to build a monitoring solution yourself using SNS or SQS as a transport.

Lambda Dead Letter Queue (DLQ) is a feature released on Dec 1, 2016. It lets you capture asynchronous invocation events that your Lambda function failed to process, store them, and process them later. Dead-letter queues are useful for debugging your application or messaging system because they let you isolate unconsumed messages, determine why their processing did not succeed, and then have them processed correctly.

In SQS terminology, a dead-letter queue sits alongside several other queue concepts: the standard queue, which provides at-least-once delivery; the source queue, which is the queue whose unprocessed messages are moved to the dead-letter queue; and the FIFO (First-In-First-Out) queue, which has the capabilities of a standard queue but is designed for messaging between applications where the order of operations matters.

Currently, you have two options for the DLQ target:

  • SQS.
  • SNS.

SQS as Dead Letter Queue

You can use SQS as a Lambda DLQ to get a durable store for failed events that can be monitored and picked up for resolution at your convenience. You can process Lambda failure events in bulk, wait for a defined period before re-triggering the original event, or do something else entirely.

Here’s how it works:

  • Lambda receives an event from an AWS service, either directly from the service or through EventBridge.
  • Then it attempts to do something meaningful in response to the event but fails.
  • Finally, Lambda sends the incoming event (a JSON document) to the DLQ.
  • You can configure a CloudWatch Alarm to fire when the number of messages in the SQS queue exceeds a certain threshold.
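
To make the last two steps more concrete, here is a minimal Python sketch of what a consumer might do with a batch of messages pulled from the DLQ. It assumes the messages have the shape of the "Messages" list in an SQS ReceiveMessage response requested with message attributes, and that Lambda has attached its RequestID, ErrorCode, and ErrorMessage attributes to each message. The function names and the sample data are hypothetical and are not part of the article's repository.

```python
import json


def _attribute(message, name):
    """Return the string value of one SQS message attribute, or None."""
    # Each SQS message attribute is a dict such as
    # {"DataType": "String", "StringValue": "..."}.
    return message.get("MessageAttributes", {}).get(name, {}).get("StringValue")


def summarize_dlq_messages(messages):
    """Turn raw SQS messages from a Lambda DLQ into small failure records."""
    failures = []
    for message in messages:
        failures.append({
            "request_id": _attribute(message, "RequestID"),
            "error_code": _attribute(message, "ErrorCode"),
            "error_message": _attribute(message, "ErrorMessage"),
            # The body is assumed to be the original event as a JSON document.
            "event": json.loads(message["Body"]),
        })
    return failures


# Hypothetical sample of one failed event, for illustration only.
sample_messages = [{
    "Body": json.dumps({"order_id": 42}),
    "MessageAttributes": {
        "RequestID": {"DataType": "String", "StringValue": "req-1"},
        "ErrorCode": {"DataType": "Number", "StringValue": "200"},
        "ErrorMessage": {"DataType": "String", "StringValue": "boom"},
    },
}]

for failure in summarize_dlq_messages(sample_messages):
    print(failure["request_id"], failure["error_message"], failure["event"])
```

Keeping the parsing in a plain function like this makes it easy to reuse from a scheduled job or a script that drains the queue in bulk.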

SQS Pros

  • Bulk processing – you may collect error messages in the destination queue and process them in bulk later.
  • Guaranteed delivery – messages are removed from the queue only after a consumer has processed and deleted them, or when the retention period expires (14 days at most).

SQS Cons

  • Not event-driven – messages must be pulled from the queue.

SNS as Dead Letter Queue

SNS, or Simple Notification Service, on the other hand, is a key part of any event-driven architecture in AWS. It allows you to process events as soon as they arrive and fan them out to multiple subscribers.

You can use an SNS Topic as a Lambda Dead Letter Queue. This lets you act on a failure as soon as Lambda has exhausted its processing attempts. For example, you can attempt to re-process the event, alert an individual or a process, or store the event message in SQS for later follow-up. And you can do all of those things in parallel.

Here’s how it works:

  • Lambda receives an event from an AWS service, either directly from the service or through EventBridge.
  • Then it attempts to do something meaningful in response to the event but fails.
  • AWS Lambda sends the incoming event, in the form of a JSON document, to the DLQ.
  • SNS immediately forwards the incoming message to multiple destinations.
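
As a sketch of the last step, one of those destinations could be another Lambda function subscribed to the topic. The Python handler below assumes the standard SNS-to-Lambda event shape (a "Records" list whose items carry an "Sns" object) and assumes that the failing function's RequestID and ErrorMessage arrive as SNS message attributes. The name `handle_dlq_notification` and the sample event are hypothetical.

```python
import json


def handle_dlq_notification(event, context):
    """Hypothetical subscriber invoked by SNS for each dead-lettered event."""
    failures = []
    for record in event["Records"]:
        sns = record["Sns"]
        # SNS message attributes look like {"Type": "String", "Value": "..."}.
        attributes = sns.get("MessageAttributes", {})
        failures.append({
            "request_id": attributes.get("RequestID", {}).get("Value"),
            "error_message": attributes.get("ErrorMessage", {}).get("Value"),
            # The message is assumed to be the original event as JSON text.
            "event": json.loads(sns["Message"]),
        })
    # A real subscriber might retry the event, alert someone, or archive it.
    print(json.dumps(failures))
    return failures


# Hypothetical sample notification, for illustration only.
sample_event = {
    "Records": [{
        "EventSource": "aws:sns",
        "Sns": {
            "Message": json.dumps({"order_id": 42}),
            "MessageAttributes": {
                "RequestID": {"Type": "String", "Value": "req-1"},
                "ErrorMessage": {"Type": "String", "Value": "boom"},
            },
        },
    }]
}

handle_dlq_notification(sample_event, None)
```

Because every subscriber receives its own copy of the message, a handler like this can run alongside an email subscription or an SQS archive queue without any coordination between them.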

Note that the advantage of using SNS is its ability to send messages to multiple subscribers almost instantaneously in parallel.

SNS Pros

  • Event-driven: SNS takes action as soon as it receives a message.
  • Fan-out: SNS allows multiple actions to be taken by different subscribers simultaneously.

SNS Cons

  • SNS is not durable storage – it does not keep messages for later retrieval. If a message cannot be delivered to a subscriber within the delivery retry window, which depends on the endpoint type, it is discarded.

Terraform Implementation

Here's an example Terraform implementation of using SNS as a Lambda DLQ. The complete source code, including scripts and the Lambda function, is available at our GitHub repository:

variable "region" {
    default = "us-east-1"
    description = "AWS Region to deploy to"
}
variable "app_env" {
    default = "failure_detection_example"
    description = "Application environment name, used as a prefix for resource names"
}
variable "sns_subscription_email_address_list" {
    type = string
    description = "List of email addresses as string(space separated)"
}
data "aws_caller_identity" "current" {}
data "archive_file" "lambda_zip" {
    source_dir  = "${path.module}/lambda/"
    output_path = "${path.module}/lambda.zip"
    type        = "zip"
}
provider "aws" {
    region = var.region
}
resource "aws_iam_policy" "lambda_policy" {
    name        = "${var.app_env}-lambda-policy"
    description = "${var.app_env}-lambda-policy"
 
    policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "sns:Publish"
      ],
      "Effect": "Allow",
      "Resource": "${aws_sns_topic.dlq.arn}"
    },
    {
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
EOF
}
resource "aws_iam_role" "iam_for_terraform_lambda" {
    name = "${var.app_env}-lambda-role"
    assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Effect": "Allow"
    }
  ]
}
EOF
}
resource "aws_iam_role_policy_attachment" "terraform_lambda_iam_policy_basic_execution" {
    role = aws_iam_role.iam_for_terraform_lambda.name
    policy_arn = aws_iam_policy.lambda_policy.arn
}
resource "aws_lambda_function" "error_function" {
    # Use the archive built by data.archive_file.lambda_zip above
    filename = data.archive_file.lambda_zip.output_path
    source_code_hash = data.archive_file.lambda_zip.output_base64sha256
    function_name = "${var.app_env}-lambda"
    role = aws_iam_role.iam_for_terraform_lambda.arn
    handler = "index.handler"
    # python3.6 is no longer supported by Lambda; use any currently supported Python runtime
    runtime = "python3.12"
    # Events from failed asynchronous invocations are sent to this SNS topic
    dead_letter_config {
        target_arn = aws_sns_topic.dlq.arn
    }
}
resource "aws_sns_topic" "dlq" {
    name = "${var.app_env}-errors-sns"
    provisioner "local-exec" {
        command = "sh sns_subscription.sh"
        environment = {
            sns_arn = self.arn
            sns_emails = var.sns_subscription_email_address_list
        }
    }
}
resource "aws_cloudwatch_log_group" "lambda_loggroup" {
    name = "/aws/lambda/${aws_lambda_function.error_function.function_name}"
    retention_in_days = 14
}
output "lambda_name" {
    value = aws_lambda_function.error_function.id
}

This Terraform configuration deploys a deliberately failing Lambda function that raises an error on every execution. The Lambda function has permissions to publish messages to the SNS topic and to write its logs to CloudWatch.
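
The function code itself is not reproduced in this article. As a minimal sketch, assuming a file named index.py so that it matches the `index.handler` setting in the configuration, a handler that fails on every invocation could look like this (the error text is made up for illustration):

```python
def handler(event, context):
    """Deliberately fail so that the event ends up in the DLQ."""
    # The unhandled exception is written to the function's CloudWatch logs,
    # where a metric filter on the word "Exception" can pick it up.
    raise Exception(f"Simulated failure while processing event: {event}")
```

When the function is invoked asynchronously, an unhandled exception like this one is what causes Lambda to give up on the event after its retries and hand it over to the configured dead-letter target.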

Now, you can use the following code block to add a CloudWatch Metric Filter and Alarm on the Lambda function logs as well:

# Count log lines that contain the word "Exception"
resource "aws_cloudwatch_log_metric_filter" "lambda_exceptions" {
    name = "${var.app_env}_lambda_exceptions"
    pattern = "\"Exception\""
    log_group_name = aws_cloudwatch_log_group.lambda_loggroup.name
    metric_transformation {
        name = "${var.app_env}_lambda_exceptions"
        namespace = "MyCustomMetrics"
        value = 1
    }
}
# Notify the SNS topic as soon as at least one exception is logged
resource "aws_cloudwatch_metric_alarm" "lambda_exceptions" {
    alarm_name = "${var.app_env}_lambda_exceptions"
    comparison_operator = "GreaterThanOrEqualToThreshold"
    evaluation_periods = 1
    metric_name = "${var.app_env}_lambda_exceptions"
    namespace = "MyCustomMetrics"
    # Metric filters report once per minute, so a 60-second period is used instead of 10
    period = 60
    statistic = "Average"
    threshold = 1
    alarm_description = "This metric monitors Lambda logs for 'Exception' keyword"
    insufficient_data_actions = []
    alarm_actions = [aws_sns_topic.dlq.arn]
}

Summary

This article covered the differences between using SNS and SQS as Dead Letter Queue targets for your Lambda functions.

We hope that this article was helpful. If it was, please help us spread it to the world!

If you have any questions that are not covered by this blog, please feel free to reach out. We're willing to help.