Self-Optimizing A/B Tests (chanind.github.io)

One of the assumptions of vanilla multi-armed bandits is that the underlying reward rates are fixed. That assumption doesn't hold in a lot of cases, including e-commerce.
To see how things could go wrong, imagine you are running this on a website with a control/treatment variant. After a while you end up sampling the treatment a little more (say 60:40). You then start running a sale, and the conversion rate for BOTH variants goes up equally (say). But since you are sampling the treatment variant more, its overall conversion rate goes up faster than the control's, meaning you start weighting even more towards that variant. This could happen purely because of the sale plus random noise at the start; you could even end up optimising towards the wrong variant. There are more sophisticated MAB approaches that try to remove the fixed reward-rate assumption; they have to model a lot more uncertainty, and so optimise more conservatively.
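A quick way to see the mechanism is to simulate it. The sketch below is my own illustration (not from the post): both variants always share the same true conversion rate, 5% before the sale and 10% during it; the only difference is that the treatment is already getting 60% of the traffic when the sale starts. The arm names, rates, and traffic shares are made up for the example.

    import random

    random.seed(0)

    # Both variants always share the SAME true conversion rate: 5% before the
    # sale, 10% during it. The only difference is how much traffic each gets.
    N_PER_PHASE = 20_000
    PHASES = [
        # (true conversion rate, share of traffic sent to the treatment)
        (0.05, 0.5),  # before the sale: traffic split evenly
        (0.10, 0.6),  # during the sale: bandit has already drifted to 60:40
    ]

    stats = {"control": [0, 0], "treatment": [0, 0]}  # [conversions, visitors]

    for true_rate, treatment_share in PHASES:
        for _ in range(N_PER_PHASE):
            arm = "treatment" if random.random() < treatment_share else "control"
            stats[arm][0] += random.random() < true_rate
            stats[arm][1] += 1

    for arm, (conv, n) in stats.items():
        print(f"{arm:9s}: cumulative observed rate {conv / n:.4f} ({n} visitors)")

Because more of the treatment's samples come from the high-rate sale period, its cumulative observed rate ends up noticeably higher (roughly 7.7% vs 7.2% in expectation) even though the two variants were never actually different. A bandit that keys off these cumulative rates would respond by sending the treatment even more traffic.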
Thanks for pointing this out! I updated the post to note that this only works well if the conversion rate stays constant, and linked to this comment.
One trick I used myself in a Bayesian bandit-esque approach (Thompson sampling from some distribution, e.g. a Beta distribution) was to implement a "forgetting rate" on the parameters of the distribution.
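For anyone curious, here is a minimal sketch of one way such a forgetting rate could look, assuming Thompson sampling with a Beta posterior per variant; the class name, decay constant, and update rule are illustrative guesses at the idea rather than the actual implementation described above.

    import random

    # Per-observation forgetting rate: older evidence gradually decays toward
    # the prior, so the posterior tracks recent behaviour more closely.
    GAMMA = 0.999

    class ForgetfulBetaArm:
        def __init__(self):
            self.alpha = 1.0  # pseudo-count of conversions (prior)
            self.beta = 1.0   # pseudo-count of non-conversions (prior)

        def sample(self):
            # Thompson sampling: draw a plausible conversion rate from the posterior.
            return random.betavariate(self.alpha, self.beta)

        def update(self, converted):
            # Shrink the existing pseudo-counts back toward the prior before
            # adding the new observation, so old data stops dominating.
            self.alpha = 1.0 + GAMMA * (self.alpha - 1.0)
            self.beta = 1.0 + GAMMA * (self.beta - 1.0)
            if converted:
                self.alpha += 1.0
            else:
                self.beta += 1.0

    # Usage: show each visitor the arm whose sampled rate is highest,
    # then update that arm with whether the visitor converted.
    arms = {"control": ForgetfulBetaArm(), "treatment": ForgetfulBetaArm()}
    chosen = max(arms, key=lambda name: arms[name].sample())
    arms[chosen].update(converted=True)

With GAMMA close to 1 this behaves almost like the standard conjugate update; smaller values forget faster and keep more uncertainty, trading some efficiency on truly stationary arms for the ability to re-adapt when conversion rates drift.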
I updated the post to mention Thompson Sampling, added a note about using only recent data as a potential improvement, and linked to this comment. Thanks for this note!
This is the classic multi-armed bandit problem: https://en.m.wikipedia.org/wiki/Multi-armed_bandit
I like the graphs, and the explanation leads the reader deeper, but it takes a naive approach to exploration without discussing the trade-offs.
Tangentially, nearly every self-optimising A/B test I have code reviewed has been significantly flawed.
Thanks for pointing this out! I updated the post to note that this is a Multi-Armed Bandit problem, and linked to this comment in the updates section.
I used to work for an A/B testing company, and we used both contextual and non-contextual Bayesian multi-armed bandit approaches.
Here's a cool talk my former colleague Austin Rochford gave at the 2018 PyData NYC conference about how we implemented it and made it work at scale: