Ask HN: Are retries the wrong abstraction under rate limits?
Over the last few years, I’ve watched a lot of production systems fail in ways that feel… strangely predictable.
When services hit 429s or timeouts, the standard response is almost always the same: retries with backoff, sleep loops, jitter, etc. This is treated as a best practice across languages and platforms.
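For concreteness, that usually looks something like this minimal Python sketch (RateLimitedError here is just a stand-in for whatever your client raises on a 429, not a real library type):

    import random
    import time

    class RateLimitedError(Exception):
        """Stand-in for whatever your HTTP client raises on a 429."""

    def call_with_retries(fn, max_attempts=5, base=0.5, cap=30.0):
        """Textbook client-side retry: capped exponential backoff with full jitter."""
        for attempt in range(max_attempts):
            try:
                return fn()
            except RateLimitedError:
                if attempt == max_attempts - 1:
                    raise
                # sleep a random amount up to the capped exponential delay
                delay = min(cap, base * 2 ** attempt)
                time.sleep(random.uniform(0, delay))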
But in systems with high concurrency, fan-out, or shared downstream dependencies, retries often seem to amplify load instead of smoothing it. What starts as localized failure can turn into retry storms, thundering herds, and cascading outages.
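The worst-case arithmetic is what bothers me: if every layer in a call chain has its own retry policy, failed work gets multiplied at each hop, and the multiplication lands exactly when the shared dependency has the least headroom. Rough numbers, made up but representative:

    # Hypothetical chain A -> B -> C -> shared DB, each layer retrying up to 3 times,
    # so a failing call can be attempted up to 4 times per layer.
    layers = 3
    attempts_per_layer = 4
    print(attempts_per_layer ** layers)  # up to 64x the original load on the deepest dependency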
It’s made me wonder whether retries are solving the wrong problem at the wrong layer — treating a coordination issue as an application-level error-handling concern.
I wrote up a longer piece exploring this idea and arguing for making failure boring again by handling it at a different layer: https://www.ezthrottle.network/blog/making-failure-boring-again
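To make "a different layer" concrete, here is one shape it could take: client-side admission control that paces or sheds work before it ever becomes an error, instead of reacting after the 429. This is my own illustration with made-up numbers, not necessarily what the article proposes:

    import threading
    import time

    class TokenBucket:
        """Pace outgoing requests to stay under our share of the downstream quota,
        rather than sending, failing, and retrying. Illustrative only."""
        def __init__(self, rate_per_sec, burst):
            self.rate = rate_per_sec
            self.capacity = burst
            self.tokens = burst
            self.updated = time.monotonic()
            self.lock = threading.Lock()

        def acquire(self):
            with self.lock:
                now = time.monotonic()
                # refill proportionally to elapsed time, capped at burst capacity
                self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return True
                return False

    bucket = TokenBucket(rate_per_sec=50, burst=10)
    if not bucket.acquire():
        pass  # shed, queue, or degrade here, rather than retry-looping against a 429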
Curious how this matches others’ experience:
Have retries actually improved stability for you under sustained rate limiting?
Have you seen cases where they clearly made things worse?
If retries aren’t the right abstraction, what is?
Interested in war stories, counterexamples, and alternative approaches.