Why Exponential Backoff in RabbitMQ or in Event-Driven Systems

August 23, 2022

Understanding Simple Message Workflow
Ways to Handle Failure Messages
Issue in Re-queue the Message Back to Queue
Issue in Dead-Letter-Exchange queue
The Saviour - Exponential Backoff Strategy

Understanding Simple Message Workflow

First, let's understand a simple workflow in an event-driven or messaging system.

A message (event) is published to a queue in response to some action
A consumer receives the message
The consumer processes the message
On completion, the consumer acknowledges (or deletes) the message

Event Driven System Positive Workflow
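
As a minimal in-process sketch of the four steps above, the snippet below uses Python's standard `queue.Queue` as a stand-in for a real broker (it is not RabbitMQ). `task_done()` plays the role of the acknowledgement, and the `orders` queue and `consume_one` helper are names made up for illustration.

```python
import queue


def consume_one(q, handler):
    """Take one message, process it, and acknowledge it on success."""
    message = q.get()          # the consumer receives the message
    result = handler(message)  # the consumer processes it
    q.task_done()              # stand-in for the acknowledgement
    return result


orders = queue.Queue()
orders.put({"order_id": 1})    # some action produces an event
processed = consume_one(orders, lambda m: m["order_id"])
```

If `handler` raises, `task_done()` is never reached, which mirrors a consumer that fails without acknowledging.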

Let's consider a failure scenario, where the format of the message is unknown and the consumer throws an exception without sending either an acknowledgement or a negative acknowledgement.

The message goes back to the original queue (in RabbitMQ, once the consumer's channel or connection closes) and is available to be consumed again.

Event Driven System Negative Workflow

Ways to Handle Failure Messages

There are three ways to handle the failure case:

Reject the message
Re-queue it in the RabbitMQ queue
Publish it to a Dead-Letter-Exchange queue

Note: We cannot afford to lose any message. Every message is important to the system, which rules out simply rejecting and dropping it.

Issue in Re-queue the Message Back to Queue

When a worker fails because it did not understand the message format, it will most likely fail the next time as well. Now think of what happens when you reject a message and it keeps coming back to you, indefinitely! In computing terms, we are wasting resources and effectively running a denial-of-service attack against our own systems, which is bad.
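
The loop described above can be illustrated with a small in-memory simulation. This is only a sketch: a `deque` stands in for the queue, `handle` is a hypothetical handler that understands only `dict` messages, and `max_deliveries` exists purely so the demo terminates (a real requeue loop has no such limit).

```python
from collections import deque


def handle(message):
    # Hypothetical handler: it only understands dict messages.
    if not isinstance(message, dict):
        raise ValueError("unknown message format")


def run_with_requeue(messages, max_deliveries):
    """Deliver messages, requeueing every failure, up to max_deliveries."""
    q = deque(messages)
    deliveries = 0
    while q and deliveries < max_deliveries:
        message = q.popleft()
        deliveries += 1
        try:
            handle(message)
        except ValueError:
            q.append(message)  # requeue: same message, same failure next time
    return deliveries, list(q)


deliveries, stuck = run_with_requeue(["not-a-dict"], max_deliveries=1000)
```

Every one of the allowed deliveries is spent on the same bad message, and it is still sitting in the queue at the end.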

Issue in Dead-Letter-Exchange queue

The idea behind pushing the message to a Dead-Letter-Exchange queue is that someone will handle the messages manually, and will probably push them back to the original queue after some modification or after a code change in the workers. Or maybe we will delete the message if it is not important or was produced by mistake.

But it's a manual step! Still, it saves our services from that self-inflicted denial of service.

The Saviour - Exponential Backoff Strategy

Remember what we do in normal exponential backoff retries: we retry after sleeping for some (often randomized) time, and we keep increasing this sleep time exponentially with each attempt.
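
A minimal sketch of that delay calculation follows. The base of 1 second, the cap of 60 seconds, and the choice to randomize within the upper half of the range are illustrative assumptions, not values from any particular library.

```python
import random


def backoff_ceiling_ms(attempt, base_ms=1000, cap_ms=60_000):
    """Upper bound of the delay before retry number `attempt` (1-based)."""
    return min(cap_ms, base_ms * 2 ** (attempt - 1))


def backoff_delay_ms(attempt, base_ms=1000, cap_ms=60_000, rng=random):
    """Random delay in [ceiling // 2, ceiling] so retries do not line up."""
    ceiling = backoff_ceiling_ms(attempt, base_ms, cap_ms)
    return rng.randint(ceiling // 2, ceiling)


# Ceilings for the first five attempts: 1000, 2000, 4000, 8000, 16000 ms.
ceilings = [backoff_ceiling_ms(n) for n in range(1, 6)]
```

The cap keeps the delay from growing without bound after many failures.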

The idea is the same here. We will have a separate retry queue for every such queue present in our system.

Let me list the steps:

Create a separate retry queue for each of your queues, with no consumers, configured to dead-letter its expired messages back to the original queue.
On failure, push the message to the retry queue with an Expiration (TTL) set on it, increasing the TTL exponentially with each failed attempt.
When the TTL elapses, the message expires from the retry queue and is sent back to the original queue, where it is processed again.

Event Driven System Negative with Retry Workflow
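
The steps above could be wired up roughly as follows. This is a sketch under assumptions: `x-dead-letter-exchange` and `x-dead-letter-routing-key` are RabbitMQ queue arguments, and the per-message expiration is carried as a string of milliseconds, but the `x-retry-count` header, the `orders` queue name, and the helper functions are invented for illustration.

```python
RETRY_HEADER = "x-retry-count"  # hypothetical header, not a RabbitMQ built-in


def retry_queue_arguments(original_queue):
    """Queue arguments that send expired messages back to the original queue."""
    return {
        "x-dead-letter-exchange": "",  # "" is the default exchange
        "x-dead-letter-routing-key": original_queue,
    }


def next_retry(headers, base_ms=1000, cap_ms=60_000):
    """Return (updated headers, expiration string) for the next retry publish."""
    attempt = int((headers or {}).get(RETRY_HEADER, 0)) + 1
    ttl_ms = min(cap_ms, base_ms * 2 ** (attempt - 1))
    new_headers = {**(headers or {}), RETRY_HEADER: attempt}
    return new_headers, str(ttl_ms)


args = retry_queue_arguments("orders")      # used when declaring the retry queue
headers, expiration = next_retry({})        # first failure: 1000 ms
headers, expiration = next_retry(headers)   # second failure: 2000 ms
```

With a client such as pika, `args` would typically be passed as `arguments=` to `queue_declare`, and `headers` and `expiration` would go into `BasicProperties` when publishing to the retry queue. One caveat worth checking against the RabbitMQ documentation: messages with a per-message TTL are only expired once they reach the head of the queue, so a message with a long TTL can hold up shorter ones behind it. Using one retry queue per delay tier is one way around that.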