Ask HN: Is anyone losing sleep over retry storms or partial API outages?

2 points | by rjpruitt16 1 day ago

2 comments

  • toast0 21 hours ago
    Retry storms are "easy" exponential backoff with jitter. Like what ethernet on shared media has been doing since the 80s.

    If that's not enough to come back from an outage, you need to put in load shedding and/or back pressure. There's no sense accepting all the requests and then not servicing any in time.
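
    As a rough illustration of shedding at admission time (the queue-depth threshold is arbitrary, and accept/reject are caller-supplied callbacks, not any real framework's API):

        import queue

        work_queue = queue.Queue()
        MAX_QUEUE_DEPTH = 200   # arbitrary: past this depth we can't meet latency targets anyway

        def admit(request, accept, reject):
            # Shed up front rather than accepting work we will never finish in time.
            if work_queue.qsize() >= MAX_QUEUE_DEPTH:
                return reject(request)   # e.g. a 503 with Retry-After
            work_queue.put(request)
            return accept(request)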

    You want to be able to accept and do work on requests that are likely to succeed within reasonable latency bounds, and drop the rest --- but be careful: an instant error can feed back into a retry storm. Sometimes it's better if such errors come after a delay, so that the client is stuck waiting (back pressure).
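
    Something like this, as a sketch of the "error after a delay" idea (the delay value is arbitrary, and make_error_response is a placeholder for however you build a response):

        import asyncio

        SHED_DELAY = 2.0   # arbitrary: long enough to pace the client, short enough not to pile up connections

        async def shed_with_delay(make_error_response):
            # The rejected client sits in this await instead of spinning in a tight retry loop.
            await asyncio.sleep(SHED_DELAY)
            return make_error_response(503, retry_after=SHED_DELAY)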

    • rjpruitt16 13 hours ago
      Agree backoff+jitter is table stakes, and load shedding/backpressure is necessary under sustained overload. The tricky cases I’m digging into are shared rate limits (429s) and many concurrent clients/agents where local backoff isn’t coordinated and you still get herds after partial outages. Curious what patterns you’ve seen work well for coordinating retries/fairness across tenants or API keys?
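
      For concreteness, by "local backoff" I mean each client independently doing something like the sketch below (assuming a requests-style response object); even with jitter, hundreds of clients sharing one limit still wake up in a rough wave because nothing coordinates who retries when:

          import random
          import time

          def sleep_for_429(response, attempt, cap=60.0):
              # Honor Retry-After if the server sent one, then add local jitter.
              retry_after = float(response.headers.get("Retry-After", 1.0))
              time.sleep(retry_after + random.uniform(0, min(cap, 2.0 ** attempt)))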
  • HelloNurse 19 hours ago
    A worrying choice of words.

    "Losing sleep" implies an actual problem, which in turn implies that the mentioned mitigations and similar ones have not been applied (at least not properly) for dire reasons that are likely to be a more important problem than bad QoS.

    "Infrastructure" implies an expectation that you deploy something external to the troubled application: there is a defective, presumably simplistic application architecture, and fixing it is not an option. This puts you in an awkward position: someone else is incompetent or unreasonable, but the responsibility for keeping their dumpster fire running falls on you.

    • rjpruitt16 12 hours ago
      Fair pushback — to clarify, I’m not assuming incompetence or suggesting infra should paper over bad architecture.

      By “losing sleep” I really mean on-call fatigue during partial outages — the class of incidents where backoff, shedding, and breakers exist, but retry amplification, shared rate limits, or degraded dependencies still cause noisy pages and prolonged recovery.

      I’m trying to understand how teams coordinate retries and backpressure across many independent clients/services when refactors aren’t immediately available, not replace good architecture or take ownership of someone else’s system.

      If you’ve seen patterns that consistently avoid that on-call pain at scale, I’d genuinely love to learn from them.