Ask HN: How to scale agent systems when Layer 7 is unreliable?
Agent workflows often involve 10+ API calls to different services (LLMs, data APIs, web scraping). Layer 7 being unreliable = workflows fail or cause retry storms.
Common failure modes I'm thinking about: - 429 rate limits → agents retry → hammer API worse - Partial outages → synchronized retries across customers - LangGraph workflows fail mid-execution → how to resume?
For those running agent systems at scale: - How do you handle Layer 7 failures? - Retry coordination? Circuit breakers? - How do you prevent retry storms to downstream dependencies? - Do LangGraph workflows gracefully handle API failures?
Curious what the production reality looks like. There's nothing special about agents in this respect, the same techniques we use for other services. What do you use?