Idempotency bug leads to free Uber Eats food
twitter.com
It's hard to see how you could blame this on idempotency, even under the author's questionable definition[1]. He pretty much explains what went wrong in black and white:
"Don't assume "unknown" means "good". Assume the opposite." Ding! We have a winner.
"This API at the time retuned only 200s where the body had a message to be parsed which indicated success / status message / error." - and that's how you got yourself into this mess in the first place.
What I would really like to know is how the financial loss compares to the loss if Uber had handled the situation "correctly", i.e. stopped the transactions. I mean, is a restaurant really better off turning customers away all day, vs giving them free food all day? And if so, by how much?
[1] i.e. that idempotency means always getting the same response, as opposed to the usual meaning, which is that repeating the request n times leaves the system in the same state as after the first request.
Sounds very much like Uber’s fault, although there is not quite enough information. It’s perfectly reasonable for a payments API to return “error: not enough funds” on the first call and “error: rate limit exceeded” on the second. This just reads as a desperate attempt to pass on the blame.
The ex-Uber author is using a loose or colloquial definition of idempotency, which is concerned with getting an identical response across calls (implying something about system state changes that produce the responses). But the API providers may consider their system still idempotent because the HTTP response is not considered part of their ‘system state’ and so the response changing across calls is not relevant. What’s relevant is the payment provider’s internal system state being invariant across 1-N calls. They may claim “this is idempotent because it’s safe to retry repeatedly” without making the claim that the error code is identical each call.
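To make the distinction concrete, here is a minimal Python sketch. It assumes a hypothetical in-memory `PaymentProvider` with an idempotency key; none of these names come from the thread or from any real payments API. Retrying with the same key leaves the stored state unchanged, even though the second response differs from the first.

```python
class PaymentProvider:
    """Hypothetical provider: idempotent in state, not in response."""

    def __init__(self, balance):
        self.balance = balance
        self.seen = {}  # idempotency key -> outcome of the first attempt

    def charge(self, key, amount):
        if key in self.seen:
            # Retry: state is untouched, but the response message differs.
            return {"status": self.seen[key], "message": "duplicate request"}
        if amount > self.balance:
            self.seen[key] = "error"
            return {"status": "error", "message": "not enough funds"}
        self.balance -= amount
        self.seen[key] = "ok"
        return {"status": "ok", "message": "charged"}


provider = PaymentProvider(balance=10)
first = provider.charge("order-1", 25)   # fails: not enough funds
second = provider.charge("order-1", 25)  # still fails, different message
```

Under the state-based definition this provider is idempotent; under the "identical response" definition it is not, which is the disagreement described above.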
Repeated failures with the same error code are not idempotency, even under a loose or colloquial definition, which at some point means nothing at all beyond getting the same result when trying the same thing multiple times (of course you do, it fails!).
The API provider should have documented the change ahead of time. However, they were still returning an error, even if a new one, when the payment failed.
There should be a catch-all for errors and that should certainly not default to 'success'.
Now, if the API provider really did change the API to return something new that is not an error, this is indeed trickier. In general, good design is to check specifically for success and to deem everything else a failure, which avoids this sort of surprise.
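A rough sketch of that "check specifically for success" rule, for an API that always answers 200 and puts the outcome in the body. The JSON shape and the `"success"` value are assumptions made for illustration; the actual body format is not given in the thread.

```python
import json

SUCCESS_STATUSES = {"success"}  # assumed value; the real API's is unknown


def payment_succeeded(http_status, body):
    """Return True only for a response positively identified as success."""
    if http_status != 200:
        return False
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        return False  # an unparseable body is treated as failure
    if not isinstance(payload, dict):
        return False
    status = payload.get("status")
    # Anything not on the allowlist, including brand-new states, is a failure.
    return isinstance(status, str) and status in SUCCESS_STATUSES
```

With this shape, a status the client has never seen before falls through to `False` instead of being treated as a completed payment.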
At the bottom of the thread, there was another company that assumed anything non-successful was a failure (but there was a new success state), which resulted in customers retrying and getting charged multiple times.
It seems the safest option is to require a major version bump whenever a new API state is added.
This is still the saner thing to do. There can obviously be 'smart' failures, e.g. report and block when something unknown comes back, acknowledging that an 'unknown' condition needs to be looked into urgently while preventing further problems.
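One way to picture "report and block" is a three-way classification, sketched below in Python. The status strings and the `OrderGate` class are hypothetical, and the `alerts` list is a stand-in for whatever paging or logging a real system would use.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    UNKNOWN = "unknown"


# Assumed status values, for illustration only.
KNOWN_SUCCESS = {"success"}
KNOWN_FAILURE = {"insufficient_funds", "card_declined"}


def classify(status):
    """Map a status string to an outcome; unrecognized values are UNKNOWN."""
    if status in KNOWN_SUCCESS:
        return Outcome.SUCCESS
    if status in KNOWN_FAILURE:
        return Outcome.FAILURE
    return Outcome.UNKNOWN


class OrderGate:
    """Blocks the order on anything but success and records unknowns."""

    def __init__(self):
        self.alerts = []  # stand-in for paging or logging

    def allow_order(self, status):
        outcome = classify(status)
        if outcome is Outcome.UNKNOWN:
            self.alerts.append(f"unrecognized payment status: {status!r}")
        return outcome is Outcome.SUCCESS
```

An unknown status blocks the order like a failure would, but it also leaves a record so someone can check whether it is a new error or a new success state. The sketch deliberately does not retry the charge on UNKNOWN, since blind retries are what caused the multiple charges mentioned in the thread.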
I also find it rather hilarious that the author of the thread then tries to shift the 'blame' to "growth!"...