Why Wikipedia's A/B testing is wrong
synference.blogspot.com

I won't comment on the virtues of bandit algorithms, but the author grossly mischaracterizes the requirements of a multivariate A/B test. Just do a regular 50/50 split and run a regression on the outcome using the observable characteristics as explanatory variables. Then take that model and feed the observables back in to get an individualized prediction of whether A or B will perform better. You can bin continuous variables or include higher-order terms to achieve a polynomial fit. The data requirements really aren't that large.
I believe the author's confusion stems from attempting to model every interaction term, e.g. if you think there will be some extra effect to being Male and using Windows and being Swedish that is more than the "Male effect" and "Windows effect" and the "Sweden effect" considered individually.
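The regression-plus-prediction approach above can be sketched in a few lines. This is an illustrative toy, not anyone's production system: the data, effect sizes, and feature names are all simulated, and it uses a plain linear probability model with variant-by-feature interaction terms so the predicted winner can differ per user.

```python
import numpy as np

# Simulated 50/50 test: each row is one user.
rng = np.random.default_rng(0)
n = 100_000
variant = rng.integers(0, 2, n)      # 0 = A, 1 = B
is_male = rng.integers(0, 2, n)
is_windows = rng.integers(0, 2, n)
# Assumed ground truth: B helps Windows users, slightly hurts everyone else.
p = 0.05 + 0.02 * variant * is_windows - 0.01 * variant * (1 - is_windows)
converted = (rng.random(n) < p).astype(float)

# Design matrix: intercept, main effects, and variant-by-feature interactions.
X = np.column_stack([
    np.ones(n), variant, is_male, is_windows,
    variant * is_male, variant * is_windows,
])
beta, *_ = np.linalg.lstsq(X, converted, rcond=None)

def predict(variant, is_male, is_windows):
    """Predicted conversion rate for one user under a given variant."""
    x = np.array([1, variant, is_male, is_windows,
                  variant * is_male, variant * is_windows])
    return x @ beta

# Personalised decision: serve whichever variant predicts higher conversion.
for m in (0, 1):
    for w in (0, 1):
        best = "B" if predict(1, m, w) > predict(0, m, w) else "A"
        print(f"male={m} windows={w} -> serve {best}")
```

With enough data per segment, the fitted interactions recover the simulated preference pattern: B for Windows users, A for the rest.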
I wrote a little article on implementing a simple multivariate A/B test from scratch:
http://www.evanmiller.org/linear-regression-for-fun-and-prof...
All you really need is a function that inverts a matrix and you're good to go. The cool part is that you can actually throw away most of the data that comes in from the firehose, just keep updating the Hessian until you're ready to invert it and bam, you have a model.
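The "throw away the firehose, just accumulate" trick looks like this in practice: fold each observation into the running X^T X and X^T y and discard it, then do a single solve at the end. A minimal sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4                       # number of explanatory variables (incl. intercept)
XtX = np.zeros((d, d))      # running X^T X (the "Hessian" of squared error)
Xty = np.zeros(d)           # running X^T y

# Stream: each observation is folded in and can then be discarded.
true_beta = np.array([0.5, 1.0, -2.0, 0.25])
for _ in range(20_000):
    x = np.append(1.0, rng.normal(size=d - 1))   # intercept + features
    y = x @ true_beta + rng.normal(scale=0.1)
    XtX += np.outer(x, x)
    Xty += x * y

# One solve at the end recovers the least-squares fit.
beta_hat = np.linalg.solve(XtX, Xty)
print(beta_hat.round(3))
```

Storage is O(d^2) no matter how many observations flow past, which is the whole point.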
(Synference cofounder here)
That's a good blog post.
There are many ways to handle multi-variate data, with different tradeoffs. The linear model you describe in your post is one of the classic machine learning models often used. In the post, we address the shortcoming of such models in the subsection "Trading off exploration vs. Exploitation."
As we point out, such a model ignores the trade-off between exploration and exploitation. That means that you have to gather all your data by randomly serving variants, before you can figure out the right variant to serve to each user.
That's very expensive in terms of serving-the-wrong-variant, if what you are trying to predict is conversion or revenue. By contrast, our method will quickly realise if a particular variant has a very high expected value for the current user (with the current user's particular combination of features), and so serve that user that variant - even if we aren't confident about other users.
So, like any bandit algorithm, we trade off exploration and exploitation; but further, we trade this off not just across the population as a whole, but on a personalised basis. That's a very important functionality that the approach you describe doesn't have.
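To make the personalised explore/exploit idea concrete, here is a minimal sketch of one standard way to do it (Thompson sampling with independent Beta posteriors per segment-variant pair). This is a generic illustration with made-up segments and rates, not Synference's actual algorithm:

```python
import random
from collections import defaultdict

# Beta(1, 1) priors per (user segment, variant).
successes = defaultdict(int)
failures = defaultdict(int)
VARIANTS = ["A", "B"]

def choose(segment):
    """Sample a conversion rate from each variant's posterior for this
    segment and serve the argmax: exploration happens automatically where
    the posterior is wide, exploitation where it is narrow."""
    draws = {v: random.betavariate(1 + successes[(segment, v)],
                                   1 + failures[(segment, v)])
             for v in VARIANTS}
    return max(draws, key=draws.get)

def update(segment, variant, converted):
    if converted:
        successes[(segment, variant)] += 1
    else:
        failures[(segment, variant)] += 1

# Toy simulation (assumed rates): B is much better for mobile users only.
rates = {("mobile", "A"): 0.02, ("mobile", "B"): 0.10,
         ("desktop", "A"): 0.05, ("desktop", "B"): 0.05}
random.seed(0)
for _ in range(20_000):
    seg = random.choice(["mobile", "desktop"])
    v = choose(seg)
    update(seg, v, random.random() < rates[(seg, v)])

served_b = successes[("mobile", "B")] + failures[("mobile", "B")]
print(f"B served to mobile on {served_b} of roughly 10000 visits")
```

Once the mobile posterior for B separates from A's, the algorithm funnels mobile traffic to B while it is still uncertain (and still exploring) for desktop users.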
Why is this an issue? Because if you have to gather data by randomly sampling, what if a certain sort of user doesn't appear very often? (E.g. let's say, for a particular site, that's a 65-year-old female from Canada.) Do you randomly sample until you see a user like this? If so, you are randomly sampling for a long time. Or do you stop your exploration before you see rare users? That totally neglects your rarer users. What if rare users have unusual preferences, or are particularly valuable? (E.g. maybe your rare users are your biggest spenders, which happens.)
And, if you have high dimensional data, all your users are effectively rare (because you rarely see users with that exact combination of features).
Furthermore, the 'gather data randomly' approach is also impractical if user behaviour is dynamic, as it often is in real-world settings. (i.e. the preference of your user base as a whole changes - maybe they've all recently read an article, so a different marketing copy now resonates with them)
Finally, sometimes those interactions you mention, between different features, are really important, and shouldn't be neglected.
I could go on - basically, there are a lot of subtle issues that the approach we are using solves and that simple multivariate testing doesn't handle.
What we're doing is trying to provide an API that wraps all this knowledge up into a service that's easy for people to use.
Your blog post makes an important mistake: it recommends actually inverting the matrix X^T X, which one never does in practice. In general, A^{-1} b is mathematical "slang" for "a solution to the linear system Ax = b", not "compute A^{-1} and multiply b by it".
This is discussed in many places, but here are a few examples:
http://www.johndcook.com/blog/2010/01/19/dont-invert-that-ma... http://en.wikipedia.org/wiki/Linear_least_squares_(mathemati...
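The distinction is one line of numpy. A quick sketch on simulated data, showing the solver route (and the even more stable QR/SVD-based least-squares route) instead of an explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -0.5, 2.0, 0.0, 0.3]) + rng.normal(size=1000)

# Avoid: beta = np.linalg.inv(X.T @ X) @ X.T @ y   (slower, less stable)
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)      # solve the normal equations
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # QR/SVD, more stable still

print(np.allclose(beta_solve, beta_lstsq))
```

On a well-conditioned problem like this both routes agree; on ill-conditioned design matrices the explicit inverse is the first to fall apart.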
"Highly successful" and "All wrong"? There's an oxymoron if ever I've heard one.
Perhaps this article title has been A-B tested and found to be highly successful at driving traffic to the site.
(Synference cofounder here)
It has successfully increased Wikipedia's donation revenue; however, some fundamental assumptions baked into the AB-testing approach are all wrong; this leaves a lot of money on the table.
Instead of trying to find a single best version for everyone, people should realise that different segments of their users are going to have different preferences. Machine learning can find these in an automated way, and there's a lot of value to be gained - but people need to see past the standard 'one size fits all' AB-testing approach to take advantage of it.
There's a huge difference between "all wrong" and "has some fundamental assumptions that could lead to non-optimal results if you wanted to push the results even further". The former implies, upon initial reading of the headline, that whatever Wikipedia is doing doesn't even give you a better solution.
A better headline might have been "Hidden assumptions in Wikipedia's A/B testing and how it could be improved". Also, why are we picking on Wikipedia here? Don't a lot of companies claim to do A/B testing and isn't this a problem inherent in all simple A/B testing?
AB-testing is much better than doing nothing.
But you have to be careful with any powerful tool, in case its success blinds you to its weaknesses. "When all you have is a hammer, everything starts to look like a nail". We think that's happening with AB-testing at the moment.
We think Wikipedia is awesome, and would love them to get more donations by using a more sophisticated approach.
Yes, most companies doing AB-testing, if they have the ability to personalise their user experience (i.e. they aren't trying to quickly find the single best UI) could benefit.
However, Wikipedia is a really good example to start with. The phrase 'Wikipedia needs those nickels' illustrates the point: it resonates well with US donors, and will probably work in Canada, but what about the UK? Australia?
It's obvious once it's pointed out - but wouldn't it be better if the system automatically realises this, and considers all the combinations? That's our point.
Yeah, that's fair enough, but a 50% increase in revenue per impression is hardly bad. You can make the argument that it could be better, and I agree with that sentiment within the article. But one of the first things to bear in mind when optimising is to always go for the low-hanging fruit first.
Although it's just my guesstimate, I imagine that the broad approach to A/B testing (or just doing it at all in the first place!) will be the most important and easiest fruit to grab, whilst a multi-armed bandit solution will probably only provide much smaller returns on top. So given a choice between a Wikipedia dev concentrating on improving the overall message, as opposed to trying to improve 5 or more cohorts at once, I'd go for the easier option until it stops giving significant returns.
That's not to say it shouldn't be done eventually. Like you said, it's potential money left on the table!
I believe the word for what you were doing here is "marketrolling".
I'm a big proponent of bandit algorithms (I'm a founder of Myna http://mynaweb.com/, something of a competitor to Synference). I think there is a bit of nuance here that is missing in the discussion.
Should Wikipedia use a contextual bandit system? Assuming the cost is equivalent, I say yes, for the following reasons:
- they have a shedload of traffic, and then some; and
- they have a limited time in which the campaign is running.
Let's address these in turn.
The regret bounds on contextual bandits are O(d sqrt(T)), where d is the dimensionality of the user profile you're using and T is time (see e.g. http://arxiv.org/abs/1209.3352). In contrast, the standard stochastic bandit has regret bounds of O(log(T)), which is much tighter. The summary is you need a lot more data to explore the bigger range of possibilities you've created. So I'm not sure contextual bandits are viable for smaller sites, but Wikipedia definitely has the traffic to make them work.
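For readers who haven't met contextual bandits, here is a minimal LinUCB-style sketch: one ridge-regression reward model per variant, serving the variant with the highest upper confidence bound for the current user's feature vector. The parameters, features, and payoff rates are all made up for illustration:

```python
import numpy as np

d, alpha = 3, 1.0                            # feature dim, exploration weight
A = {v: np.eye(d) for v in ("A", "B")}       # per-variant d x d accumulators
b = {v: np.zeros(d) for v in ("A", "B")}     # per-variant reward accumulators

def choose(x):
    """Serve the variant with the highest upper confidence bound at x."""
    ucb = {}
    for v in A:
        theta = np.linalg.solve(A[v], b[v])              # ridge estimate
        bonus = alpha * np.sqrt(x @ np.linalg.solve(A[v], x))
        ucb[v] = x @ theta + bonus
    return max(ucb, key=ucb.get)

def update(v, x, reward):
    A[v] += np.outer(x, x)
    b[v] += reward * x

# Toy run: context = [1, is_mobile, is_weekend]; B pays off only on mobile.
rng = np.random.default_rng(3)
for _ in range(5000):
    x = np.array([1.0, rng.integers(0, 2), rng.integers(0, 2)])
    v = choose(x)
    p = 0.10 if (v == "B" and x[1] == 1) else 0.04
    update(v, x, float(rng.random() < p))

print(np.linalg.solve(A["B"], b["B"]).round(3))   # B's fitted reward model
```

The d-dependence in the regret bound shows up directly here: the confidence bonus shrinks roughly like sqrt(d/n), so a richer user profile needs proportionally more traffic before the per-context estimates dominate the exploration term.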
If your test is running for a limited time you can just delete the code when you're done. This has several advantages. People running A/B tests typically want to answer the question "what is best?" (To crush your enemies, obviously, but you can't usually say that) under the assumption that you'll then display the best variant forever. Contextual bandit algorithms don't answer this question; instead they minimise regret (equivalently, making you the most money) which makes more sense in the limited time setting. In fact there is no "best" with a contextual bandit -- the best variant depends on the user profile. This means you need to run the algorithm forever: good if you're collecting monthly subscriptions based on algorithm use, but bad if you're trying to manage a site with an exploding number of variants under test that you can never delete.
Ob plug: If you've found this post interesting you should consider signing up for the bandit course I'm running: http://noelwelsh.com/data/2013/08/16/free-bandit-course-draf... It will cover contextual bandits, and a whole heap of other interesting stuff.
>The regret bounds on contextual bandits are O(d sqrt(T)), where d is the dimensionality of the user profile you're using and T is time (see e.g. http://arxiv.org/abs/1209.3352). In contrast, the standard stochastic bandit has regret bounds of O(log(T)), which is much tighter. The summary is you need a lot more data to explore the bigger range of possibilities you've created. So I'm not sure contextual bandits are viable for smaller sites, but Wikipedia definitely has the traffic to make them work.
It's definitely viable for smaller sites. Your argument is about the bounds in a particular theoretical situation ("when the contexts are provided by an adaptive adversary").
In a real world situation, your contexts are provided by the normal behaviour of your users, not an adversary. And, rather than the patterns being maximally hard to find, there'll instead be a lot of low-hanging-fruit - in terms of very obvious combinations of features that define different segments of users.
What's important is not the bound of the contextual bandit in a theoretical situation, but rather what performance it gives in 90% of business situations. In many situations, doing some personalisation to automatically find the most obvious user groups is a huge win.
And later, as the amount of data increases, the granularity of segmentation will effectively increase.
I agree with your general comments on bandits, and love what Myna is doing.
Implementing more complex machine learning methods isn't free, and neither is buying one, so the system should bring in dramatically more money just to cover its costs.
The company whose ad this post is doesn't show much data about the difference between the solution it's selling and A/B testing. I wonder if there isn't any, or if it just doesn't support its price?
I'm a cofounder of the early 2-person startup whose blog post this is.
Yes, it's currently difficult and expensive to bring complex ML to bear on this problem; we are trying to make that easy and cheap. The reason people adopted AB-testing was that it has a positive impact on the metrics they care about; we think the workflow can be easier and more valuable than AB testing. If it's easy to use, the extra value should cover the cost.
We're currently focused on trying to communicate that this approach can create value - this blog post, written for a technical audience, is to give some intuition around how the approach works - does that make sense?
We genuinely believe in making it easy to use ML optimisation in this context - that's why we're working on this.
It's a minor point, and perhaps only a quibble, but the title is wrong.
Wikipedia's AB testing is correct insofar as the problem and approach have been defined. This article suggests an improvement.
Wrong is when they're trying to test for something but due to some error they aren't actually testing for what they think they are.
Something can be accurate/correct but also improvable.
That's because this post exists as attempted marketing for Synference.
Yes, it's marketing, at least in the broad sense of the word - we think people should use contextual bandits more, and we want people to know that.
That's why we're working on this problem.
As I said elsewhere, we think AB-testing is a big improvement over nothing, but, like any powerful tool, it can blind people to its downsides.
Our blog title is designed to point out, succinctly, that we think the vanilla AB-testing framework is wrong for Wikipedia, as an easy-to-understand example. We believe that.
I notice that you volunteer for Wikipedia; I want to be clear, we're not trying to pick on Wikipedia here; I personally love Wikipedia. It's just a good, well-known example of a successful AB-testing campaign that we think could be greatly improved.
Regarding the title: I've updated it to say 'conceptually wrong' - it's more of a mouthful, but maybe what we mean is clearer? I hope our intent was clear in the intro paragraph anyway.
I'll take it.
It's awesome to see a Bandit-based algorithm being used. I think this is really similar to what is called Adaptive Sampling (also based on the Bandit problem), and it's been used in multiple contexts. My favorite has been in clinical trials where adaptive sampling was able to save lives.
Read more about the research done in Clinical Trials, and one of my favorite Professors here: http://web.eecs.umich.edu/~qstout/AdaptiveDesign.html
I found it was useful to brush up on how the multi-armed bandit problem works in order to understand this article. http://en.wikipedia.org/wiki/Multi-armed_bandit
This is a little silly. Of course every form would convert better if you perfectly customized it to the current user, no one really disagrees with that. But it doesn't mean it's always practical or worth the effort, or that the Wikipedia team have the time to tackle that at this moment.
Hey there, I wrote that post. It's a valid criticism you make, but the main point of the article was not to pick on Wikipedia, which is actually way ahead of most other organizations in terms of testing its content.
We made a case study of Wikipedia because their banners are a well-known example of A/B testing. Plus, because their banners make tens of millions of dollars each year and they have an entire team of people working on the A/B tests over a period of months, I think they have the capacity to use something more sophisticated. Maybe, like you suggest, they're just focused on other aspects of the fundraising campaign. I guess the only way to know would be to hear from some of the Wikipedia team!
Really interesting concept, but what other information can you grab from a user who has never visited your site before?
There's the example of location, and I guess you can also get browser information (mobile device vs. desktop), but other than that, I feel there isn't much context to work with for first-time anonymous users, which is the most common case for A/B testing.
Would love to hear how else you're classifying users, and what other buckets you're using.
You could divide articles into various categories.
Maybe people reading computer science articles prefer light blue, while people reading medicine articles prefer gold.
Of course, a quick look at Wikipedia's categories shows there are many possible article categorisation schemes (are sneakers footwear like stilettos, or sports equipment like cricket bats? Of course they're both), so this would introduce many additional variables - more than you could exhaustively A/B test.
I'm one of the Synference co-founders, and we've thought a lot about how to make it easy to extract relevant information from users who land on your site for the first time.
There are many important pieces of information you can grab from a user who's never been on your site based on basic information available to all web servers, such as the user's IP address and user agent: (1) the time of day/day of week they're visiting -- relevant because, e.g., night users or weekend users might be a different demographic (2) information related to the particular page they're visiting, such as article category on wikipedia (3) operating system type and age (4) device type -- smartphone, tablet, pc (5) geographic location.
You can use location to infer a range of demographic features, such as income, education, etc. Due to its importance, location is an attribute that Synference adaptively coarsens/zooms in on. For example, if you have a lot of users from a certain city, then each user within that city will have a location whose granularity goes down to the neighborhood. But if you have relatively few users from Europe, then each user's location might only be at the level of the country.
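The adaptive-coarsening idea can be illustrated with a toy: pick the finest location level that still has enough observed traffic, falling back to coarser levels otherwise. This is my own sketch of the concept with an assumed threshold and made-up counts, not the actual Synference implementation:

```python
# Pick the finest location granularity with enough traffic behind it.
MIN_USERS = 500  # assumed threshold

# Made-up running counts of users seen per location prefix.
counts = {
    ("US",): 40_000, ("US", "NYC"): 9_000, ("US", "NYC", "Brooklyn"): 1_200,
    ("FR",): 300,
}

def location_feature(country, city=None, neighborhood=None):
    # Try neighborhood, then city, then country; drop to "rare" if even
    # the country has too little data to support an estimate.
    for level in [(country, city, neighborhood), (country, city), (country,)]:
        key = tuple(p for p in level if p)
        if counts.get(key, 0) >= MIN_USERS:
            return "/".join(key)
    return "rare"

print(location_feature("US", "NYC", "Brooklyn"))  # -> "US/NYC/Brooklyn"
print(location_feature("FR", "Lyon"))             # -> "rare"
```

As traffic from a region accumulates, the same user starts mapping to a finer-grained feature value, which is exactly the "granularity increases with data" behaviour described above.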
The Synference API makes it easy to extract this information. All you have to do is send the API the user's IP address and user agent, and the API parses those two pieces of information into most of the features I've just listed. We also add a timestamp (corrected for the user's timezone using the IP address's geographic location).
We don't need to split users into buckets---machine learning classifiers build models that automatically take care of deciding how a user's attributes can be used to predict the target metric.
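To make the feature-extraction step concrete, here is a rough sketch of deriving a few of the features listed above (device type, OS, time-of-day, weekday/weekend) from a user agent string and a timezone-corrected timestamp. The parsing rules are deliberately crude and illustrative; this is not the Synference API:

```python
from datetime import datetime

def request_features(user_agent, timestamp):
    """Derive coarse model features from information every web server sees.
    Illustrative only: real user-agent parsing needs a proper library."""
    ua = user_agent.lower()
    if "ipad" in ua or "tablet" in ua:
        device = "tablet"
    elif "mobile" in ua or "iphone" in ua or "android" in ua:
        device = "smartphone"
    else:
        device = "pc"
    os = next((o for o in ("windows", "android", "iphone", "mac os", "linux")
               if o in ua), "other")
    return {
        "device": device,
        "os": os,
        "hour": timestamp.hour,                 # time-of-day effect
        "weekend": timestamp.weekday() >= 5,    # weekday/weekend demographic
    }

f = request_features(
    "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) Mobile Safari",
    datetime(2013, 8, 17, 23, 30),   # local time, already timezone-corrected
)
print(f)
```

The resulting dictionary of categorical and numeric attributes is what gets handed to the classifier, which then decides for itself which of them predict the target metric.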
You're disregarding the necessity of a "small non-profit" not wasting time running broken tests, which is not necessarily optimized by complicating the bejeezus out of everything.
This is just advertising for their company, how does this algorithm actually work?
Stopped reading at "much of it is organized by volunteers", as that's not really true anymore.
I'm sure the A/B testing can be improved but not at all convinced how it's done now is "wrong".