Retries Can Make Systems Worse - How We Used Backoff Algorithms

A few years back, when I was working as an Associate Software Engineer, we worked on a batch processing system that handled large amounts of data through Kinesis streams and stored processed records in DynamoDB.

At that time, I didn’t know much about retry strategies or distributed system failure handling. I just knew one thing:

sometimes systems fail in weird ways 😄.

Very recently, I faced another similar scenario in a completely different project. We were working with an external service API that failed randomly. Sometimes it recovered immediately. Sometimes it stayed unstable for a few seconds under load before becoming healthy again.

After facing these kinds of issues multiple times across different projects, I thought it would be nice to share one simple mechanism that helped us handle failures much better:

exponential backoff retries.

There are definitely more advanced ways to solve these problems. But this post is just about how we handled things in our projects without making the systems overly complex.


Scenario 1 - DynamoDB and Kinesis Throttling

Once upon a time, we worked on a batch processing system where we used Kinesis streams to process large data flows. Once records were received, workers immediately processed and wrote them into DynamoDB.

Initially, everything worked fine.

But when larger batches started flowing through the system, DynamoDB began throttling write requests. The main reason was our partition key design. Some partition keys became “hot,” meaning too many writes were targeting the same partitions repeatedly. This issue is called hot partitioning.

Normally, one solution would be redistributing or redesigning partition keys to spread traffic more evenly.

But in our case, the client did not want to change the key structure because it matched their access patterns very well. So we had to handle the issue without redesigning the database.

At the same time, we were also working with Kinesis limitations.

Kinesis write APIs such as PutRecord are limited per shard, whichever limit is hit first:

  • up to 1,000 records per second per shard
  • up to 1 MB per second per shard

Technically, we could spread traffic across more shards.

But Kinesis guarantees ordering only within a single shard, and for some of our records we needed to preserve ordering consistency (I honestly can’t remember the exact reason now).

So scaling horizontally was not always straightforward.
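To put the per-shard limits above into numbers, here is a rough Python estimate of how many shards a given write rate would need. The function name `min_shards` and the sample traffic figures are made up for illustration, and 1 MB is treated as 1,000,000 bytes. It also ignores the ordering constraint, which is exactly what made adding shards harder in practice.

```python
import math

# Per-shard write limits quoted above. Treating 1 MB as 1,000,000 bytes
# is an assumption made for a rough estimate.
RECORDS_PER_SHARD_PER_SEC = 1_000
BYTES_PER_SHARD_PER_SEC = 1_000_000


def min_shards(records_per_sec: float, avg_record_bytes: float) -> int:
    """Smallest shard count that satisfies both per-shard write limits."""
    by_count = math.ceil(records_per_sec / RECORDS_PER_SHARD_PER_SEC)
    by_bytes = math.ceil(
        records_per_sec * avg_record_bytes / BYTES_PER_SHARD_PER_SEC
    )
    return max(1, by_count, by_bytes)


print(min_shards(3_500, 500))   # 4: the record-count limit dominates
print(min_shards(200, 20_000))  # 4: the byte-rate limit dominates
```

This assumes traffic is spread evenly across shards. With a hot key, one shard can hit its limit long before the total does.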


Scenario 2 - Randomly Failing External Service API

Recently, I worked on another project involving a very unstable external service.

The behavior was honestly weird sometimes.

Under load, the service would randomly fail for short periods.
Sometimes it recovered immediately.
Sometimes it took several seconds before becoming stable again.

Since the service was outside our control, there wasn’t much we could do on the provider side.

So we used a very similar retry strategy again:
exponential backoff with retry limits.

This pattern is especially useful for:

  • transient network failures
  • temporary overload situations
  • unstable downstream services
  • rate limiting scenarios

Instead of aggressively retrying failed requests immediately, we progressively delayed retries and allowed the service time to recover.

That small change alone reduced a lot of unnecessary failures.


Our First Mistake - Immediate Retries

Initially, our retry logic was very simple.

If a request failed, retry immediately.

But retrying immediately during throttling is usually a bad idea, because when the database or service is already overloaded, instant retries simply add more pressure. Things became even more interesting because we were using Lambda workers: when multiple Lambdas failed at the same time, they also retried at nearly the same time.

So failures started amplifying themselves. This is basically how retry storms begin.

We discussed several possible solutions internally. One option was introducing queues in between. But queues can also become bottlenecks depending on throughput requirements, so initially we wanted something simpler without redesigning the entire flow.

That’s when one of our tech leads mentioned exponential backoff algorithms.

Honestly, that was the first time I realized retries themselves need proper design.


What Is Exponential Backoff?

Instead of retrying immediately after a failure, exponential backoff gradually increases the waiting time between retries.

The idea is simple:
after every failed attempt, wait a little longer before retrying again.

For example, if the base delay is 1 second:

  • first retry waits 1 second
  • second retry waits 2 seconds
  • third retry waits 4 seconds
  • fourth retry waits 8 seconds

This gives the failing system some breathing room to recover instead of continuously hammering it with requests.
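The schedule above is just the base delay doubled on every retry. Here is a minimal Python sketch of that calculation. It is illustrative, not the code from either project. The function name `backoff_delay` and the `max_delay` cap are assumptions. A cap is a common addition so the wait does not grow without bound.

```python
def backoff_delay(attempt: int, base_delay: float = 1.0,
                  max_delay: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (1-based).

    The delay doubles with every retry and is capped at `max_delay`
    (the cap is an assumption, not something from the original setup).
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(max_delay, base_delay * 2 ** (attempt - 1))


delays = [backoff_delay(n) for n in range(1, 5)]
print(delays)  # [1.0, 2.0, 4.0, 8.0]
```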


Why We Added Jitter

There’s another important concept called jitter.

Without jitter, all failed workers may retry at exactly the same time.

Imagine hundreds of Lambda functions waiting exactly 8 seconds and then retrying together again.

That creates another traffic spike.

So instead of fixed retry timing, we added a small random delay on top of the backoff delay, which itself was calculated from the retry attempt count.

AWS recommends combining backoff with jitter for retries in distributed systems.
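As a rough sketch of the idea, the function below adds a small random amount on top of the exponential delay. The name `backoff_with_jitter`, the `max_jitter` value and the `rng` parameter are assumptions chosen for illustration, and the exact jitter range used in the projects is not reproduced here.

```python
import random


def backoff_with_jitter(attempt: int, base_delay: float = 1.0,
                        max_jitter: float = 0.5, rng=random) -> float:
    """Exponential delay for a 1-based `attempt` plus a random extra wait.

    `rng` only needs a `uniform(a, b)` method, so a seeded
    `random.Random` can be passed in to make tests repeatable.
    """
    exponential = base_delay * 2 ** (attempt - 1)
    return exponential + rng.uniform(0.0, max_jitter)


# Two workers on the same retry attempt now wake up at slightly
# different times instead of in lockstep.
worker_a = backoff_with_jitter(4, rng=random.Random(1))
worker_b = backoff_with_jitter(4, rng=random.Random(2))
print(worker_a, worker_b)  # both between 8.0 and 8.5 seconds
```

A common variation is to randomize the whole delay (anywhere between zero and the exponential value) rather than adding a small offset. That spreads retries out further, at the cost of some retries firing very early.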


One Important Thing - Not Every Failure Should Be Retried

One important lesson we learned was:

even with exponential backoff, retrying everything is a bad idea.

Before retrying, we first checked whether the failure was actually a retriable scenario.

For example:

  • retrying a 400 Bad Request usually makes no sense because the issue is with the payload itself, not the server
  • similarly in DynamoDB, if the issue was caused by invalid object structure or bad key data, retries would never fix the problem

So instead of blindly retrying every failure, we created a list of retryable scenarios.

Only errors inside that allowed retry group went through the backoff mechanism.

Everything else was immediately moved to the Dead Letter Queue (DLQ) without retrying.

That small decision helped us avoid:

  • unnecessary retries
  • wasted compute
  • retry storms caused by invalid payloads
  • noisy logs

This was honestly one of the most important improvements in the entire retry flow.
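A retry allow-list can be as simple as a couple of sets and one check. The sketch below is an assumption about what such a list might contain, not the actual list from our projects. The HTTP status codes are the usual "try again later" responses. `ProvisionedThroughputExceededException` and `ThrottlingException` are DynamoDB throttling error codes, and `ValidationException` is an example of an error that a retry will never fix.

```python
# Hypothetical allow-list: only these failures go through backoff.
RETRYABLE_HTTP_STATUSES = frozenset({429, 500, 502, 503, 504})
RETRYABLE_ERROR_CODES = frozenset({
    "ProvisionedThroughputExceededException",
    "ThrottlingException",
})


def is_retryable(http_status=None, error_code=None) -> bool:
    """Return True only for failures on the allow-list.

    Anything not listed (400 Bad Request, validation errors, unknown
    codes) is treated as non-retryable and should go straight to the DLQ.
    """
    return (error_code in RETRYABLE_ERROR_CODES
            or http_status in RETRYABLE_HTTP_STATUSES)


print(is_retryable(http_status=503))                 # True
print(is_retryable(http_status=400))                 # False
print(is_retryable(error_code="ValidationException"))  # False
```

Keeping this as an explicit allow-list means an unknown error defaults to "do not retry", which is the safer failure mode.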


Breaking the Retry Loop

Retries are useful. Infinite retries are dangerous.

So in both projects, we added maximum retry limits.

Once the retry threshold was reached, we stopped retrying completely. This is not exactly a full circuit breaker pattern, but it follows a similar idea:
don’t endlessly keep attacking an already failing dependency.

Instead:

  • stop retrying
  • isolate the failure
  • move the failed payload safely for later investigation
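Putting the pieces together, a bounded retry loop might look like the sketch below. Everything here is illustrative: `call_with_retries`, `RetriesExhausted` and the default values are made-up names and numbers, and the `sleep` parameter exists only so the waiting can be stubbed out in tests.

```python
import random
import time


class RetriesExhausted(Exception):
    """Raised when a retryable failure persists past the attempt limit."""


def call_with_retries(operation, is_retryable, max_attempts=5,
                      base_delay=1.0, max_jitter=0.5, sleep=time.sleep):
    """Call `operation()` until it succeeds or the attempt limit is hit.

    `is_retryable(exc)` decides whether an exception is worth retrying.
    Non-retryable errors are re-raised immediately, without any delay.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if not is_retryable(exc):
                raise  # retrying would not help, fail fast
            if attempt == max_attempts:
                raise RetriesExhausted(
                    f"gave up after {attempt} attempts"
                ) from exc
            # Exponential backoff plus a small random jitter.
            delay = base_delay * 2 ** (attempt - 1)
            delay += random.uniform(0.0, max_jitter)
            sleep(delay)
```

The caller can catch `RetriesExhausted` (or any non-retryable error) and hand the payload over to the DLQ step instead of looping again.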

Why We Added DLQ Handling

If retries still failed after several attempts:

  • the message was moved into a Dead Letter Queue (DLQ)
  • errors were logged
  • related DynamoDB job statuses were updated
  • failure reports were generated for visibility

This made the systems much safer operationally because failures became visible and manageable instead of silently retrying forever.
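The steps above can be sketched with in-memory stand-ins: a plain list plays the role of the DLQ and a dict plays the role of the DynamoDB job status table. The function name `send_to_dlq`, the record fields and the `job_id` key are all assumptions for illustration. A real version would call the queue and database clients instead.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("retry_flow")


def send_to_dlq(payload: dict, error: Exception, attempts: int,
                dlq: list, job_statuses: dict) -> dict:
    """Park a failed payload for later investigation.

    `payload` is assumed to be JSON-serializable and to carry a
    `job_id` key. `dlq` and `job_statuses` are in-memory stand-ins.
    """
    record = {
        "payload": payload,
        "error_type": type(error).__name__,
        "error_message": str(error),
        "attempts": attempts,
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }
    dlq.append(json.dumps(record))               # move message to the DLQ
    job_statuses[payload["job_id"]] = "FAILED"   # update the job status
    logger.error("Job %s moved to DLQ after %d attempts: %s",
                 payload["job_id"], attempts, error)
    return record
```

Storing the error type and attempt count next to the payload makes it much easier to build failure reports later and to decide which messages are safe to replay.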

There may be more advanced solutions depending on scale and architecture, but this is one practical approach that worked well for us in real production scenarios. As always, I’m open to your thoughts and would really like to start a conversation 😺.
