Ivan-Site.com

Exponential backoff in SQS

Amazon Web Services offers Simple Queue Service (SQS), which makes it easy to decouple and scale the components of an asynchronous system. A message that a consumer receives but does not delete becomes visible again once its visibility timeout expires (30 seconds by default), so a failing message is retried at a constant interval. A common practice, however, is to use exponential backoff instead of constant wait times for better flow control (see Error Retries and Exponential Backoff in AWS and the Wikipedia page on Exponential backoff).

This post walks through a sample implementation of exponential backoff in an SQS consumer.

How it works

To make it work, we will use two features of SQS: message attributes and message visibility.

SQS provides several message attributes that you can request when receiving messages (see the ReceiveMessage API). One of these attributes is "ApproximateReceiveCount", which we can use to determine the wait time for a message that fails. With every receive attempt this number increases by 1, and thus our wait time also increases (exponentially).
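To make the relationship between receive count and wait time concrete, here is a minimal sketch that doubles the delay on every receive. The class and method names are hypothetical, and plain doubling is only an assumption for illustration; the sample implementation described later delegates this calculation to a library.

```java
import java.time.Duration;

/** Hypothetical helper: maps ApproximateReceiveCount to a wait time by doubling. */
final class ReceiveCountBackoff {
    private ReceiveCountBackoff() {}

    /**
     * Returns initialDelay * 2^(receiveCount - 1).
     * receiveCount is the ApproximateReceiveCount value, which is 1 on the first delivery.
     */
    static Duration delayFor(int receiveCount, Duration initialDelay) {
        if (receiveCount < 1) {
            throw new IllegalArgumentException("receiveCount must be at least 1");
        }
        // Cap the exponent (20 is an arbitrary choice) so the shift stays well within a long.
        int exponent = Math.min(receiveCount - 1, 20);
        return initialDelay.multipliedBy(1L << exponent);
    }
}
```

With an initial delay of 30 seconds, receive counts 1, 2, 3 and 4 map to waits of 30, 60, 120 and 240 seconds.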

SQS also allows you to adjust message visibility using its ChangeMessageVisibility API. This allows you to "hide" the message from other consumers for a specified period of time.

Combining these two features gives you exponential backoff for messages that fail to be processed. See the sample implementation on GitHub: sqsconsumer.

Implementation

The sample implementation retrieves the value of the ApproximateReceiveCount attribute for the message and uses the retry4j library to compute the next wait time for that count (any other exponential backoff implementation will also work). It also lets you specify the maximum amount of time a message may keep being delayed, measured from the ApproximateFirstReceiveTimestamp message attribute.

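// Number of times this message has been received so far.
int approximateReceiveCount = Integer.parseInt(getAttribute(message, SQSConstants.APPROXIMATE_RECEIVE_COUNT));
// Time the message was first received (the attribute is in epoch milliseconds).
Instant approximateFirstReceiveTime = Instant.ofEpochMilli(Long.parseLong(getAttribute(message, SQSConstants.APPROXIMATE_FIRST_RECEIVE_TIMESTAMP)));
// Latest point in time up to which the message may be delayed.
Instant maximumTime = approximateFirstReceiveTime.plus(maximumWaitTime);
// Exponential delay for this receive count, computed by the backoff strategy.
Duration nextDelay = backoffStrategy.getDurationToWait(approximateReceiveCount, initialDelayBetweenRetries);
Instant now = Instant.now();
// Take the smallest of: the SQS visibility limit, the backoff delay, and the time left until maximumTime.
Duration adjustedDelay = Collections.min(Arrays.asList(
        SQSConstants.MAX_VISIBILITY_TIMEOUT,
        nextDelay,
        Duration.between(now, maximumTime)
));

// A negative delay means maximumTime has already passed, so make the message visible immediately.
return adjustedDelay.isNegative() ? 0 : Math.toIntExact(adjustedDelay.getSeconds());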
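To see the capping logic in isolation, here is a self-contained sketch of the same idea that takes the already computed backoff delay and the current time as parameters. The class name, method name and parameters are hypothetical, and the 12-hour constant is an assumption standing in for SQSConstants.MAX_VISIBILITY_TIMEOUT (12 hours is the maximum visibility timeout SQS allows).

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.Collections;

/** Hypothetical helper: caps a backoff delay the same way as the snippet above. */
final class VisibilityTimeoutClamp {
    // Assumed stand-in for SQSConstants.MAX_VISIBILITY_TIMEOUT.
    static final Duration MAX_VISIBILITY_TIMEOUT = Duration.ofHours(12);

    private VisibilityTimeoutClamp() {}

    /** Returns the visibility timeout in whole seconds, never negative. */
    static int clampSeconds(Duration nextDelay, Instant firstReceive,
                            Duration maximumWait, Instant now) {
        // Time remaining until the overall deadline; negative once the deadline has passed.
        Duration untilDeadline = Duration.between(now, firstReceive.plus(maximumWait));
        Duration adjusted = Collections.min(
                Arrays.asList(MAX_VISIBILITY_TIMEOUT, nextDelay, untilDeadline));
        return adjusted.isNegative() ? 0 : Math.toIntExact(adjusted.getSeconds());
    }
}
```

For example, with a 10-minute maximum wait and a 4-minute backoff delay, a message first received 9 minutes ago gets a timeout of 60 seconds (only one minute remains before the deadline), and a message first received 11 minutes ago gets 0.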

We then use that wait time to change the visibility of this message:

// Computes the new visibility timeout (in seconds) for a message.
Function<Message, Integer> backoffFunction = visibilityTimeoutProvider.get();
// Build one batch entry per message, using the message ID as the entry ID.
List<ChangeMessageVisibilityBatchRequestEntry> changeMessageVisibilityRequests = messages.stream()
        .map(m -> new ChangeMessageVisibilityBatchRequestEntry(m.getMessageId(), m.getReceiptHandle())
                .withVisibilityTimeout(backoffFunction.apply(m)))
        .collect(Collectors.toList());
// Apply all visibility changes in a single request.
sqs.changeMessageVisibilityBatch(new ChangeMessageVisibilityBatchRequest(queueUrl, changeMessageVisibilityRequests));

Note that this implementation uses the batch API, but the same concept also applies to the non-batch API.

Further Reading

If you're interested in learning more about SQS failure handling, take a look at Dead-Letter Queues, which let you specify a maximum number of retries for a message and move the message to a separate queue once that number is reached.

Also take a look at the importance of jitter, or some amount of randomness, when using exponential backoff.
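As a rough illustration of jitter, the sketch below picks a random wait between zero and the computed backoff delay, a variant often called "full jitter". The class and method names are hypothetical, and this is not part of the sample implementation; it only shows where the randomness would go before the value is passed to ChangeMessageVisibility.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: adds "full jitter" to a computed backoff delay. */
final class JitteredBackoff {
    private JitteredBackoff() {}

    /**
     * Returns a uniformly random number of seconds in [0, computedDelaySeconds].
     * Non-positive inputs yield 0.
     */
    static int fullJitterSeconds(int computedDelaySeconds) {
        if (computedDelaySeconds <= 0) {
            return 0;
        }
        // The bound is exclusive, so add 1 (as a long, to avoid int overflow).
        return (int) ThreadLocalRandom.current().nextLong(computedDelaySeconds + 1L);
    }
}
```

Spreading retries over the whole interval means that messages which failed together are less likely to be retried together.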

Posted Sun 17 June 2018 by Ivan Dyedov in Java (Java, AWS)