software-design|August 23, 2022|2 min read

Why Exponential Backoff in RabbitMQ or in Event-Driven Systems

TL;DR

Without exponential backoff, a failing message re-queues instantly and creates an infinite retry storm that crushes your consumer. Backoff with dead-letter exchanges gives the downstream system time to recover.

Understanding Simple Message Workflow

First, let's look at a simple workflow in an event-driven or messaging system.

  • A message (event) is published to a queue in response to some action
  • A consumer receives the message
  • The consumer processes the message
  • On completion, the consumer acknowledges (or deletes) the message
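
As a rough illustration of these steps, here is a minimal sketch that uses an in-memory deque as a stand-in for a broker queue. The names consume_all and handler are made up for this example and are not part of any RabbitMQ API.

```python
from collections import deque


def consume_all(queue, handler):
    """Drain the queue: receive, process, then acknowledge each message."""
    acknowledged = []
    while queue:
        message = queue[0]        # the consumer receives the message
        handler(message)          # the consumer processes it
        queue.popleft()           # "ack": remove it only after success
        acknowledged.append(message)
    return acknowledged


# Two hypothetical events waiting on the queue.
queue = deque(["order-created:1", "order-created:2"])
processed = consume_all(queue, handler=lambda message: message.upper())
```

The point of the sketch is the ordering: the message leaves the queue only after processing has succeeded.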

Event Driven System Positive Workflow

Let's consider a failure scenario where the format of the message is unknown and the consumer throws an exception without sending either an acknowledgement or a negative acknowledgement.

The message goes back to the original queue (in RabbitMQ, once the consumer's channel or connection closes) and is available for consumption again.

Event Driven System Negative Workflow

Ways to Handle Failed Messages

There are three ways to handle the failure case:

  1. Reject the message
  2. Re-queue it in the RabbitMQ queue
  3. Publish it to a dead-letter-exchange queue

Note: We cannot afford to lose any message; every message is important to the system.

Issue with Re-queuing the Message Back to the Queue

When a worker fails because it did not understand the message format, it will most likely fail the next time as well. Now think of what happens when you reject something and it keeps coming back to you, indefinitely! In computing terms, we are wasting resources and effectively running a denial-of-service (DoS) attack against our own system, which is bad.
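
To make the cost concrete, here is a small simulation of instant re-queuing, again with an in-memory deque rather than a real broker. The function name and the max_deliveries cap are assumptions for this demo; a real broker would keep redelivering with no cap at all.

```python
from collections import deque


def run_with_instant_requeue(messages, handler, max_deliveries):
    """Deliver messages, putting failures straight back on the queue.

    max_deliveries is a safety cap for this demo only.
    """
    queue = deque(messages)
    deliveries = 0
    while queue and deliveries < max_deliveries:
        message = queue.popleft()
        deliveries += 1
        try:
            handler(message)
        except ValueError:
            queue.append(message)  # instant re-queue, no delay at all
    return deliveries, list(queue)


def handler(message):
    # Hypothetical consumer that only understands JSON-looking payloads.
    if not message.startswith("{"):
        raise ValueError("unknown format")


deliveries, stuck = run_with_instant_requeue(
    ['{"id": 1}', "garbage"], handler, max_deliveries=1000
)
```

One good message takes one delivery. The single bad message uses up every remaining delivery and is still sitting on the queue at the end.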

Issue with the Dead-Letter-Exchange Queue

The idea behind pushing a message to a dead-letter-exchange queue is that someone will handle it manually, and will probably push it back to the original queue after some modification or after a code change in the workers. Or we may delete the message if it is not important or was produced by mistake.

But it's a manual step! On the other hand, it does save our services from that self-inflicted DoS.

The Saviour - Exponential Backoff Strategy

Remember what we do in normal exponential backoff retries: we retry after sleeping for some (often randomized) amount of time, and we increase that sleep time exponentially with each attempt.
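
As a sketch of that calculation, the function below doubles the delay on each attempt and caps it. The base of 1 second, the cap of 60 seconds, and the jitter scheme are illustrative assumptions, not values prescribed by RabbitMQ or by this post.

```python
import random


def backoff_delay_ms(attempt, base_ms=1000, cap_ms=60000, jitter=0.0, rng=random):
    """Delay before retry number `attempt` (0-based): base * 2**attempt, capped.

    With jitter > 0, the delay is drawn uniformly from
    [delay * (1 - jitter), delay] so that retries do not all fire together.
    """
    delay = min(cap_ms, base_ms * (2 ** attempt))
    if jitter:
        delay = rng.uniform(delay * (1.0 - jitter), delay)
    return int(delay)


# Without jitter the schedule is deterministic.
delays = [backoff_delay_ms(n) for n in range(8)]
# delays == [1000, 2000, 4000, 8000, 16000, 32000, 60000, 60000]

# With 50% jitter, the fourth retry waits somewhere between 4 and 8 seconds.
jittered = backoff_delay_ms(3, jitter=0.5)
```

The cap keeps a long-failing message from being delayed for hours, and the jitter spreads out retries that would otherwise land at the same instant.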

The idea is the same. We will have a separate retry queue for every such queue in our system.

Let me list the steps:

  • Create a separate retry queue for each of your queues
  • On failure, push the message to the retry queue with Expiration (TTL) metadata, using a TTL that grows exponentially with each retry.
  • When that TTL elapses, the message expires from the retry queue and is sent back to the original queue (the retry queue dead-letters it there), where it is processed again.
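
The steps above can be sketched as two small helpers that build the settings a client library would pass to RabbitMQ. They return plain dictionaries so the example runs without a broker. The x-retry-count header name, the max_retries limit, and the function names are assumptions made for this sketch. x-dead-letter-exchange, x-dead-letter-routing-key and the expiration property are standard RabbitMQ settings.

```python
def retry_queue_arguments(original_queue):
    """Arguments for declaring the retry queue.

    With the default exchange (""), the routing key is the queue name,
    so expired messages are dead-lettered back to the original queue.
    """
    return {
        "x-dead-letter-exchange": "",
        "x-dead-letter-routing-key": original_queue,
    }


def retry_message_properties(headers, base_ms=1000, cap_ms=60000, max_retries=5):
    """Properties for republishing a failed message to the retry queue.

    Returns None once retries are exhausted, so the caller can park the
    message for manual handling instead of losing it.
    """
    headers = dict(headers or {})
    attempt = int(headers.get("x-retry-count", 0))  # hypothetical header name
    if attempt >= max_retries:
        return None
    ttl_ms = min(cap_ms, base_ms * 2 ** attempt)
    headers["x-retry-count"] = attempt + 1
    return {
        "expiration": str(ttl_ms),  # per-message TTL: milliseconds, as a string
        "headers": headers,
    }


queue_args = retry_queue_arguments("orders")
first_retry = retry_message_properties({})
fourth_retry = retry_message_properties({"x-retry-count": 3})
exhausted = retry_message_properties({"x-retry-count": 5})
```

One caveat worth checking against the RabbitMQ TTL documentation: messages with a per-message TTL are only expired when they reach the head of the queue, so mixing short and long TTLs in one retry queue can hold the short ones back.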

Event Driven System Negative with Retry Workflow

Hope it helps.
