Winning A/B results were not translating into improved user acquisition
The red flag here for me was that Optimizely encourages you to stop the test as soon as it "reaches significance." You shouldn't do that. What you should do is precalculate a sample size based on the statistical power you need, which involves deciding on your tolerance for the probability of making an error and the minimum effect size you need to detect. Then you run the test to completion and crunch the numbers afterward. This helps prevent the scenario where your page tests 18% better than itself, by minimizing the probability that your "results" are just a consequence of a streak of positive results in one branch of the test.
I was also disturbed that the effect size wasn't taken into account in the sample size selection. You need to know this before you do any type of statistical test. Otherwise, you are likely to get "positive" results that just don't mean anything.
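For concreteness, here's a rough sketch of the kind of precalculation I mean, in Python with statsmodels. Every number in it (baseline rate, minimum detectable effect, alpha, power) is a placeholder you'd pick for your own situation, not anything from the article:

    # Sketch: precompute the per-variation sample size for a two-proportion test.
    # All of the numbers below are placeholders.
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    baseline_rate = 0.05        # conversion rate you believe you have today
    min_detectable_rate = 0.06  # smallest improvement you actually care about
    alpha = 0.05                # tolerated false-positive probability
    power = 0.80                # 1 - tolerated false-negative probability

    effect_size = proportion_effectsize(min_detectable_rate, baseline_rate)
    n_per_variation = NormalIndPower().solve_power(
        effect_size=effect_size, alpha=alpha, power=power, ratio=1.0,
        alternative='larger',   # one-tailed, per my comments above
    )
    print(f"Run each variation to ~{n_per_variation:.0f} visitors before analyzing.")

Pick those numbers up front, run until you hit the sample size, and only then look at significance.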
OTOH, I wasn't too concerned that the test was a one-tailed test. Honestly, in a website A/B test, all I really am concerned about is whether my new page is better than the old page. A one-tailed test tells you that. It might be interesting to run two-tailed tests just so you can get an idea what not to do, but for this use I think a one-tailed test is fine. It's not like you're testing drugs, where finding any effect, either positive or negative, can be valuable.
I should also note that I only really know enough about statistics to not shoot myself in the foot in a big, obvious way. You should get a real stats person to work on this stuff if your livelihood depends on it.
Hi pmiller, Dan from Optimizely here. Thanks for your thoughtful response. This is a really important issue for us, so I wanted to set the record straight on a couple of points:
#1 - “Optimizely encourages you to stop the test as soon as it reaches ‘statistical significance.’” - This actually isn’t true. We recommend you calculate your sample size before you start your test using a statistical significance calculator and wait until you reach that sample size before stopping your test. We wrote a detailed article about how long to run a test, here: https://help.optimizely.com/hc/en-us/articles/200133789-How-...
We also have a sample size calculator you can use, here: https://www.optimizely.com/resources/sample-size-calculator
#2 - Optimizely uses a one-tailed test, rather than a 2-tailed test. - This is a point the article makes and it came up in our customer community a few weeks ago. One of our statisticians wrote a detailed reply, and here’s the TL;DR:
- Optimizely actually uses two 1-tailed tests, not one.
- There is no mathematical difference between a 2-tailed test at 95% confidence and two 1-tailed tests at 97.5% confidence.
- There is a difference in the way you describe error, and we believe we define error in a way that is most natural within the context of A/B testing.
- You can achieve the same result as a 2-tailed test at 95% confidence in Optimizely by requiring the Chance to Beat Baseline to exceed 97.5%.
- We’re working on some exciting enhancements to our methodologies to make results even easier to interpret and more meaningfully actionable for those with no formal Statistics background. Stay tuned!
Here’s the full response if you’re interested in reading more: http://community.optimizely.com/t5/Strategy-Culture/Let-s-ta...
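To make the TL;DR concrete, here's a small sketch of the equivalence (a generic pooled two-proportion z-test on made-up counts, not our actual implementation):

    # A two-sided test at 95% confidence rejects exactly when one of the two
    # corresponding one-sided tests at 97.5% confidence rejects.
    # The counts below are made up for illustration.
    from math import sqrt
    from scipy.stats import norm

    def z_statistic(conv_a, n_a, conv_b, n_b):
        """Pooled two-proportion z statistic."""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        p_pool = (conv_a + conv_b) / (n_a + n_b)
        se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        return (p_b - p_a) / se

    z = z_statistic(conv_a=480, n_a=10_000, conv_b=560, n_b=10_000)

    two_sided_95 = abs(z) > norm.ppf(0.975)   # two-tailed test at 95%
    one_sided_up = z > norm.ppf(0.975)        # "B beats A" at 97.5%
    one_sided_down = z < -norm.ppf(0.975)     # "A beats B" at 97.5%

    assert two_sided_95 == (one_sided_up or one_sided_down)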
Overall I think it’s great that we’re having this conversation in a public forum because it draws attention to the fact that statistics matter in interpreting test results accurately. All too often, I see people running A/B tests without thinking about how to ensure their results are statistically valid.
Dan
Thanks for replying. I agree with all the points you say your statistician covered, but you should make sure your users know what kind of test you're using. The only reason I say this is because this article gives me the impression that you were using a single one-tailed test (which, as I said in my post, is a perfectly acceptable thing to do in the context of website A/B testing).
But, as far as "Optimizely encourages you to stop the test as soon as it reaches 'statistical significance,'" I'm not saying your user documentation or anything encourages people to stop tests early. I'm saying (and this is based only on the article, as I've never used Optimizely) that your platform is psychologically encouraging users to stop tests early. E.g., from the article:
> Most A/B testing tools recommend terminating tests as soon as they show significance, even though that significance may very well be due to short-term bias. A little green indicator will pop up, as it does in Optimizely, and the marketer will turn the test off. <image with a green check mark saying "Variation 1 is beating Variation 2 by 18.1%"> But most tests should run longer and in many cases it’s likely that the results would be less impressive if they did. Again, this is a great example of the default settings in these platforms being used to increase excitement and keep the users coming back for more.

I am aware of literature in experimental design that talks about criteria for stopping an experiment before its designed conclusion. Such things are useful in, say, medical research, where if you see a very strong positive or negative result early on, you want to have that safety valve to either get the drug/treatment to market more quickly or to avoid hurting people unnecessarily. But unless you've built that kind of analysis into when you display your "success message" that "Variation 1 is beating Variation 2 by 18.1%," I'd argue that you're doing users a disservice. When I see that message, I want to celebrate, declare victory, and stop the test; and that's not what you should encourage people to do unless it's statistically sound to do so.
The other thing in the article that led me to this position is that you display "conversion rate over time" as a time series graph. Again, if I see that and I notice one variation is outperforming the other, what I want to do is declare victory and stop the test. That might not be mathematically/statistically warranted.
IMO, as a provider of statistical software, I think you'd do your users a service to not display anything about a running experiment by default until it's either finished or you can mathematically say it's safe to stop the trial. Some people will want their pretty graphs and such, so give them a way to see them, but make them expend some effort to do so. Same thing with prematurely ended experiments; don't provide any conclusions based on an incomplete trial. Give users the ability to download the raw data from a prematurely ended experiment, but don't make it easy or the default.
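To illustrate why I'm harping on this, here's a toy simulation of the "peek after every batch of visitors and stop at the first green check" pattern on an A/A test; the nominal 5% false-positive rate balloons. All the parameters are arbitrary, and this isn't a model of Optimizely's actual stats engine:

    # Sketch: how "peek and stop at significance" inflates false positives
    # on an A/A test. Parameters are arbitrary.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    true_rate = 0.05
    n_tests, n_checks, visitors_per_check = 2_000, 20, 500

    false_positives = 0
    for _ in range(n_tests):
        conv_a = conv_b = n_a = n_b = 0
        for _ in range(n_checks):
            conv_a += rng.binomial(visitors_per_check, true_rate)
            conv_b += rng.binomial(visitors_per_check, true_rate)
            n_a += visitors_per_check
            n_b += visitors_per_check
            p_pool = (conv_a + conv_b) / (n_a + n_b)
            se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
            z = (conv_b / n_b - conv_a / n_a) / se
            if abs(z) > norm.ppf(0.975):   # "significant at 95%" -- stop and celebrate
                false_positives += 1
                break

    print(f"A/A 'wins' declared: {false_positives / n_tests:.1%} (nominal: 5%)")

Fix the sample size first and the problem mostly goes away.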
For a second I thought you were Evan Miller who wrote about the exact same thing: http://www.evanmiller.org/how-not-to-run-an-ab-test.html
No, I'm not him, but stuff like fixing a sample size in advance and not stopping tests early without careful analysis are things I learned in my stat classes. This stuff should be stressed in any good intro stats class covering hypothesis testing. (I was a math major in college, so I had all of 2 courses in mathematical statistics and 0 in experimental design. I didn't go to the best school, but my stats teacher was a former industry statistician focusing on quality control.)
"Honestly, in a website A/B test, all I really am concerned about is whether my new page is better than the old page. A one-tailed test tells you that."
No, it's the other way around. A one-tailed test is only usable for testing whether the new design is worse than the old one, because its being better than the old one doesn't matter as long as it's not worse. If you are testing whether the new design is better, you definitely need to test both tails, or else you may well switch to a design that's worse than the old one.
More precisely, before you start the test you need to choose a "default" choice. If the default choice is the old version, then it's safe to switch to the new version provided it isn't worse. Apply the converse if your default choice is the new version.
The key point here is that you aren't choosing a testing procedure, you are choosing a decision procedure.
Frequentism rears its ugly head again...
This problem exists with Bayesian techniques also; it's just more obvious how to set up the problem.
Exactly! The problems arise because of the disconnect between what the math is actually saying and what people think the math is saying. Or rather: what people wish it was saying. Frequentist methods give you "if page A performs the same as page B, then the likelihood of observing something at least as extreme as this measurement is less than X%". In practice we never want to know this information. What people actually want to know is "given this measurement, the probability of page A being better than page B is X%", so they interpret whatever number comes out of the frequentist method like that... wishful thinking.
Just give them 2 posterior distributions of the conversion rate of page A and page B. It may look more daunting than a single number at first, but it's much easier to interpret than that single number that comes out of hypothesis testing, and, you know, it's the information they actually need to make a decision whether to pick page A or page B.
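Something like this, for instance (flat Beta(1,1) priors and made-up counts; with the conjugate Beta update you also get P(B > A) almost for free by sampling):

    # Sketch: posterior conversion-rate distributions for pages A and B.
    # Flat Beta(1, 1) priors, conjugate binomial update; counts are made up.
    import numpy as np
    from scipy.stats import beta

    visitors_a, conversions_a = 4_000, 200
    visitors_b, conversions_b = 4_000, 230

    a1, b1 = 1 + conversions_a, 1 + visitors_a - conversions_a
    a2, b2 = 1 + conversions_b, 1 + visitors_b - conversions_b

    # 95% credible intervals -- the thing you'd actually show the user.
    print("A:", beta(a1, b1).interval(0.95))
    print("B:", beta(a2, b2).interval(0.95))

    # And the number people really want: P(page B is better than page A).
    rng = np.random.default_rng(1)
    draws_a = rng.beta(a1, b1, 100_000)
    draws_b = rng.beta(a2, b2, 100_000)
    print("P(B > A) ~", (draws_b > draws_a).mean())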
"given this measurement and our prior beliefs, the probability of page A being better than page B is X%"
FTFY ;). I think Bayesian methods add a lot of interpretive power, but I'm not sure that it would help people make a correct interpretation. I suspect that if practitioners are neglecting the difference between a one-sided and two-sided test, they will likely forget (or gloss over) what priors are (and their non-trivial implementation).
I definitely agree that there is a disconnect between the math and its interpretation, though.
In an A/B test where you usually get so much data, priors honestly don't matter much. Just use a flat prior. You'll overestimate the uncertainty a bit, so you may need a couple more data points than necessary but it's still way less than you'd need for a frequentist method. An A/B testing company could even automatically come up with better priors based on A/B tests that their customers have done in the past.
Even in the Bayesian case, you need more than 2 posteriors. You need a decision rule. Comparing posteriors is not sufficient.
http://www.bayesianwitch.com/blog/2014/bayesian_ab_test.html
You can just show the posterior and let your brain be the decision rule. You can visually see the difference in conversion rate and the uncertainty around it. That info makes it easy to decide whether to continue the test or stop the test and pick the best performer. Much better information to base a decision on than a hypothesis test with a significance threshold that people pull out of their ass.
If you want to be fancy you could even implement a strategy that maximizes the total conversions based on bayesian decision theory, so that it automatically tends to show the best performer as time goes on.
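Thompson sampling is about the simplest version of that idea: show each visitor the variation that a random draw from the posteriors says is best. A toy sketch, with made-up "true" rates standing in for reality:

    # Sketch: Thompson sampling -- a simple Bayesian strategy that shifts traffic
    # toward the better performer as evidence accumulates. Rates are made up.
    import numpy as np

    rng = np.random.default_rng(2)
    true_rates = {"A": 0.050, "B": 0.056}   # unknown in real life
    wins = {"A": 0, "B": 0}
    losses = {"A": 0, "B": 0}

    for _ in range(20_000):                 # one iteration per visitor
        # Draw a plausible rate for each page from its Beta(1+wins, 1+losses)
        # posterior and show the page with the higher draw.
        draws = {p: rng.beta(1 + wins[p], 1 + losses[p]) for p in true_rates}
        page = max(draws, key=draws.get)
        if rng.random() < true_rates[page]:
            wins[page] += 1
        else:
            losses[page] += 1

    for p in true_rates:
        shown = wins[p] + losses[p]
        print(p, "shown", shown, "times, observed rate", round(wins[p] / max(shown, 1), 4))

Over time almost all the traffic ends up on the better page, without anyone having to call the test.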
That article is weird. It uses a normal distribution as the prior for the conversion rate. That could produce a negative conversion rate or a conversion rate above 100%. Then in the section "So why doesn’t everyone already do this?" they say "The answer is simple - it’s computationally inefficient.". No shit if you are using a normal prior. A much better way to do this is to use a beta prior (or a Dirichlet prior in case you have more than 2 alternatives). Then the math becomes trivial & fast and you don't have nonsense negative or above 100% conversion rates.
I didn't say hypothesis test, I said decision rule. The method I describe in the article has only two quantities "pulled out of the ass" - the threshold of caring and the prior. If you visually inspect the posterior, you are implicitly pulling out of your ass an unknown "threshold of visual similarity".
> That article is weird. It uses a normal distribution as the prior for the conversion rate.
That's incorrect. From the article: "To begin we will choose a Beta distribution prior." The computational intensiveness is not caused by the choice of prior, it's caused by the need to evaluate an integral over the joint posterior.
A Dirichlet prior is also not what you'd use for more than 2 alternatives - you have two beta distributions, one representing the posterior for the control and the other for the variation. If you had a second variation, you'd have 3 beta distributions, and you'd need to evaluate a 3 dimensional integral.
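For what it's worth, in practice that kind of multi-dimensional integral is usually just estimated by Monte Carlo over the per-variation Beta posteriors. A sketch with made-up counts:

    # Sketch: Monte Carlo estimate of P(each variation is best) with 3 variations,
    # one Beta posterior per variation (flat priors, made-up counts).
    import numpy as np

    rng = np.random.default_rng(3)
    data = {                     # variation -> (visitors, conversions); placeholders
        "control": (5_000, 250),
        "var_1":   (5_000, 270),
        "var_2":   (5_000, 262),
    }

    n_draws = 200_000
    samples = np.column_stack([
        rng.beta(1 + conv, 1 + n - conv, n_draws) for n, conv in data.values()
    ])
    best_counts = np.bincount(samples.argmax(axis=1), minlength=len(data))

    for name, count in zip(data, best_counts):
        print(f"P({name} is best) ~ {count / n_draws:.3f}")

That sidesteps the dimensionality at the cost of a little sampling noise.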
> I didn't say hypothesis test, I said decision rule.
I did not say that you said hypothesis test.
> If you visually inspect the posterior, your are implicitly pulling out of your ass an unknown "threshold of visual similarity".
Yes, but you are "implicitly pulling a number out of your ass" based on a lot more information. When you ask somebody to come up with a mechanical decision rule before seeing the posterior, it's unlikely that you will get as good a decision as when you just show them the posterior.
> That's incorrect. From the article: "To begin we will choose a Beta distribution prior." The computational intensiveness is not caused by the choice of prior, it's caused by the need to evaluate an integral over the joint posterior.
Ah, I was confused because they are specifying the prior in terms of a mean and standard deviation. That is a very weird way to represent a beta distribution.
> The computational intensiveness is not caused by the choice of prior, it's caused by the need to evaluate an integral over the joint posterior.
I see, they are computing expected_value(max(ctr[A]-ctr[B], 0.0)). That is still weird though. What you want to know is if it's worth it to run the test another time. So you want to compare E(final conversion rate if stop now) with E(final conversion rate if run another time), and if the latter is not much greater than the former you stop the test. Both of those have a closed form. Even better would be to compare E(final conversion rate if stop now) and E(final conversion rate if we test A) and E(final conversion rate if we test B). Then you would also automatically decide the best version to show (e.g. if the uncertainty about A is small and the uncertainty about B is big, you'll show B).
> A Dirichlet prior is also not what you'd use for more than 2 alternatives
Hm? Let's say you have a free plan, basic plan, and enterprise plan. This is a very common scenario in practice. A Dirichlet prior would be the natural thing to use here, IMO.
> E(final conversion rate if run another time)... Both of those have a closed form.
I'm curious - where can I learn more?
> Let's say you have a free plan, basic plan, and enterprise plan... A Dirichlet prior would be the natural thing to use here, IMO.
This would be handled via Dirichlet, and then the results multiplied by their LTV. I thought you were referring to multiple variants - i.e., landing page A, landing page B, landing page C.
Suppose you have Beta(a1,b1) and Beta(a2,b2) at the current step. The expected conversion rates are:

    M(a,b) = a/(a+b)
    E1 = M(a1,b1)
    E2 = M(a2,b2)

If we stop now, the expected conversion rate is E = max(E1,E2). If we continue for another timestep with option 1, then the question is whether that can make us switch from 1 to 2 or from 2 to 1 or not. If it can't, then the expected conversion rate is the same whether or not we execute one more step. Let's assume without loss of generality that option 2 is currently winning, but if option 1 gets another conversion then 1 is winning. So the new expected conversion rate is:

    E' = int(p_1(r)*(r*r + (1-r)*E2), r=0..1)

where p_1 is the probability density of Beta(a1,b1). All the moments of the beta distribution have a closed form, so E' also has a closed form. You could generalize this to running it for n more times instead of one more time; you'd get an expression of the form:

    E = int(p_1(r1)*p_2(r2)*polynomial(r1,r2), r1=0..1, r2=0..1)

I suspect that also has a closed form, but I'm not sure at first glance.
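Concretely, in the one-step case the integral reduces to E[r^2] + (1 - E[r])*E2 under the Beta(a1,b1) posterior, which is easy to sanity-check numerically. The parameters below are arbitrary, picked so that one more conversion on option 1 would flip the winner:

    # Sketch: the one-step lookahead E' has a closed form via the Beta moments
    # E[r] = a/(a+b) and E[r^2] = a(a+1)/((a+b)(a+b+1)). Parameters are arbitrary.
    from scipy.stats import beta
    from scipy.integrate import quad

    a1, b1 = 10, 90        # posterior for option 1 (currently losing)
    a2, b2 = 101, 899      # posterior for option 2 (currently winning)
    E2 = a2 / (a2 + b2)

    Er = a1 / (a1 + b1)
    Er2 = a1 * (a1 + 1) / ((a1 + b1) * (a1 + b1 + 1))
    E_prime_closed = Er2 + (1 - Er) * E2

    # Numerical check of int(p_1(r) * (r*r + (1-r)*E2), r=0..1).
    E_prime_numeric, _ = quad(lambda r: beta.pdf(r, a1, b1) * (r * r + (1 - r) * E2), 0, 1)

    print(E_prime_closed, E_prime_numeric)   # should agree to numerical precision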
Why not run a two-tailed test and double the alpha? If I'm understanding it correctly, you'll still make the same conclusion at either tail as a one-tailed test, but this way you have both directions covered. I could be missing something, just thinking out loud.
Note on SumAll
Anyone who uses SumAll should be wary of their service. We tried them out and then found out that they used our social media accounts to spam our followers and users with their advertising. We contacted them asking for answers and never heard from them. Our suggestion: avoid SumAll.
Hey Antr, Jacob from SumAll here. Sorry to hear you had a bad experience with us. The tweets you're talking about that "spam" your accounts were most likely the performance tweets that you are free to toggle on and off. Here's how you can do that: https://support.sumall.com/customer/portal/articles/1378662-...
Best, Jacob
As the tweets contain both SumAll-related hashtags and links to SumAll, this is definitely marketing that should be opt-in, not opt-out. Unless the user of your service is explicitly made aware of these automated tweets in clear terms when they sign up, this is a bit shady and dishonest, to say the least.
Even if it's in the terms, make it opt-in.
So SumAll spams your followers by default but you can turn it off if you know how?
There's no need for scare quotes. You were clearly spamming the guy's followers.
You are also free to toggle that feature off, and should.
>the performance tweets that you are free to toggle on and off.
It's opt-out, isn't it?
I couldn't imagine a worse target audience to use that line on.
This is opt-out? Srsly?
maybe it is a revenue stream for them?
And...? I'm sure it is. It markets their product at the expense of their users' credibility with their social circles. There's no downside! (For SumAll.)
There is a downside: word-of-mouth marketing is effectively dead for them. At best, if someone really likes it, they no longer need to tell their friends; it's already been done for them.
Of course it is, but such posting on behalf should imho always be opt-in.
This article comes off as a bit boastful and somewhat of an advertisement for the company...
"What threw a wrench into the works was that SumAll isn’t your typical company. We’re a group of incredibly technical people, with many data analysts and statisticians on staff. We have to be, as our company specializes in aggregating and analyzing business data. Flashy, impressive numbers aren’t enough to convince us that the lifts we were seeing were real unless we examined them under the cold, hard light of our key business metrics."
I was expecting some admission of how their business is actually different/unusual, not just "incredibly technical". Secondly, I was expecting to hear that these "technical" people monkeyed with the A/B testing (or simply over-thought it), which got them into trouble... but no, just a statement about how "flashy" numbers don't appeal to them.
I think the article would be much better without some of that background.
They are incredible as in literally not credible.
>We decided to test two identical versions of our homepage against each other... we saw that the new variation, which was identical to the first, saw an 18.1% improvement. Even more troubling was that there was a “100%” probability of this result being accurate.
Wow. Cool explanation of one-tailed vs. two-tailed tests. Somehow I have never run across that. Here's a link with more detail (I think it's the one intended in the article, but a different one was used): http://www.ats.ucla.edu/stat/mult_pkg/faq/general/tail_tests...
Oh great, another misuse of A/B testing
Here's the thing: stop A/B testing every little thing (and/or "just because") and you'll get more significant results.
Do you think the true success of something is due to A/B testing? A/B testing is optimizing, not architecting.
Indeed. A/B testing will get you stuck on local optimums.
It seems like I see these articles pop up on a regular basis over at Inbound or GrowthHackers.
I think the problem is two-sided: one on the part of the tester and one on the part of the tools. The tools' "statistically significant" winners MUST be taken with a grain of salt.
On the user side, you simply cannot trust the tools. To avoid these pitfalls, I'd recommend a few key things. One, know your conversion rates. If you're new to a site and don't know patterns, run A/A tests, run small A/B tests, dig into your analytics. Before you run a serious A/B test, you'd better know historical conversion rates and recent conversion rates. If you know your variances, it's even better, but you could probably heuristically understand your rate fluctuations just by looking at analytics and doing A/A test. Two, run your tests for long after you get a "winning" result. Three, have the traffic. If you don't have enough traffic, your ability to run A/B tests is greatly reduced and you become more prone to making mistakes because you're probably an ambitious person and want to keep making improvements! The nice thing here is that if you don't have enough traffic to run tests, you're probably better off doing other stuff anyway.
On the tools side (and I speak from using VWO, not Optimizely, so things could be different), VWO tags are on all my pages. VWO knows what my goals are. Even if I'm not running active tests on pages, why can't they collect data anyway and get a better idea of what my typical conversion rates are? That way, that data can be included and considered before they tell me I have a "winner". Maybe this is nitpicky, but I keep seeing people who are actively involved in A/B testing write articles like this, and I have to think the tools could do a better job of not steering intermediate-level users down the wrong path, let alone novice users.
What he did in that article is more commonly known as an "A/A test"
Optimizely actually has a decent article on it: https://help.optimizely.com/hc/en-us/articles/200040355-Run-...
I just checked in one possible R calculation of two-sided significance under a binomial model, under the simple null hypothesis that A and B have the same common rate (and that that rate is exactly what was observed, a simplifying assumption), here: http://winvector.github.io/rateTest/rateTestExample.html . The long and short of it is that you get slightly different significances depending on what model you assume, but in all cases you should consider it easy to calculate an exact significance subject to your assumptions. In this case it says differences this large would only be seen about 1.8% to 2% of the time (a two-sided test). So the result isn't that likely under the null hypothesis (and then you make a leap of faith that maybe the rates are different). I've written a lot about these topics at the Win-Vector blog: http://www.win-vector.com/blog/2014/05/a-clear-picture-of-po... .
They said they ran an A/A test (a very good idea), but the numbers seem slightly implausible under the assumption that the two variations are identical (which, again, doesn't immediately imply the two variations are in fact different).
The important thing to remember is that your exact significances/probabilities are a function of the unknown true rates, your data, and your modeling assumptions. The usual advice is to control the undesirable dependence on modeling assumptions by using only "brand name tests." I actually prefer using ad-hoc tests, but discussing what is assumed in them (one-sided/two-sided, pooled data for the null, and so on). You definitely can't assume away a thumb on the scale.
Also this calculation is not compensating for any multiple trial or early stopping effect. It (rightly or wrongly) assumes this is the only experiment run and it was stopped without looking at the rates.
This may look like a lot of code, but the code doesn't change over different data.
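A rough Python analogue of that kind of calculation, for anyone who'd rather not read R. The counts below are placeholders (the real numbers are in the linked writeup), and the two-sided significance is estimated by simulation rather than computed exactly:

    # Null: both branches share the pooled observed rate exactly.
    import numpy as np

    rng = np.random.default_rng(4)
    n_a, conv_a = 3_000, 160     # branch A visitors / conversions (made up)
    n_b, conv_b = 3_000, 196     # branch B visitors / conversions (made up)

    pooled = (conv_a + conv_b) / (n_a + n_b)
    observed_gap = abs(conv_b / n_b - conv_a / n_a)

    sims_a = rng.binomial(n_a, pooled, 500_000) / n_a
    sims_b = rng.binomial(n_b, pooled, 500_000) / n_b
    p_two_sided = float((np.abs(sims_b - sims_a) >= observed_gap).mean())
    print(f"Two-sided significance under this null: {p_two_sided:.4f}")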
What do you mean by "brand name tests"?
I was looking for a much more personal article from the headline.
I would be curious to know what percentage of teams with statisticians / data people actually use tools like Optimizely? A lot of people seem to be building their own frameworks that use a lot of different algorithms (two-armed bandits, etc.). From my understanding, Optimizely is really aimed at marketers without much statistical knowledge.
Of course, if you're a startup, building an A/B testing tool is your last priority, so you would use an existing solution.
Are there much more advanced 'out-of-the-box' tools for testing out there besides the usual suspects, i.e. Optimizely, Monetate, VWO, etc.?
This title used to read "How Optimizely (Almost) Got Me Fired", which is the actual title of the article.
It seems a mod (?) changed it to "Winning A/B results were not translating into improved user acquisition".
I've seen a descriptive title left by the submitter changed back to the less descriptive original by a mod. But I'm curious why a mod would editorialize certain titles and change them away from their originals, but undo the editorializing of others and change them back to the less descriptive originals.
I feel that the second title is better, as it talks about the kind of testing they are using, instead of being clickbait along the lines of "HOW DID IT GET YOU FIRED?".
My question is why mods change some headlines away from the originals to be more descriptive (good) and why they change back to the originals even though they are less descriptive (bad).
FWIW the change to this headline seems like the right decision to me.
The guideline is to use the original title unless it is misleading or linkbait [1]. It's astonishing how often that qualifier gets dropped from these discussions. It's pretty critical, and makes the reason for most title changes pretty obvious.
Thanks for the response. I'd humbly submit that there are occasions where the guidelines should be ignored in service of a more descriptive (non-linkbaity) title.
I can't find the submission but one recent example that comes to mind is a presentation on radar detectors that was fascinating. I clicked because the submitter described the article; the original title was (IIRC) the model number of the radar gun.
Later a mod changed the HN post back to the model number, which had zero relevance to anybody not in the radar gun industry.
The original title was clearly linkbait.
> The kicker with one-tailed tests is that they only measure – to continue with the example above – whether the new drug is better than the old one. They don’t measure whether the new drug is the same as the old drug, or if the old drug is actually better than the new one. They only look for indications that the new drug is better...
I don't understand this paragraph. They only look for indications that the drug is better... than what?
Do any of these tools show you a distribution of the variable you're trying to optimize? I am just thinking that some product features might be polarizing, but if you measure the mean, it might give you different results than expected. I am thinking that's where the two-tailed test comes in.
Perhaps the most troubling element is that Optimizely seems comfortable claiming 100% certainty in anything. That requires (in Bayesian terminology) infinite evidence, or equivalently (in frequentist terminology), if they have finite data, an infinite gap between mean performances.
Peculiar use of the word bug in this context:
"They make it easy to catch the A/B testing bug..."
meaning "fever" - generally cured by more cowbell, but in this case only "curable" by more A/B testing
This is all fine and good, but if your goal is to see what works best between X new versions of a page and you are rigorous in creating variants, Optimizely is a great tool for figuring out the best-converting variant.
Except, apparently, they aren't actually that good at _that_. If an A/A test can yield a "100%" chance of an 18% uplift, what gives you any degree of certainty that other tests won't have equally skewed results?
Run an A/A/B (or A/A/B/B) test, decide on traffic levels before you start the test, and let it run until you reach those levels before you peek.
In my experience Optimizely does everything they can to mislead their users into overestimating their gains.
Optimizely is best suited at creating exciting graphs and numbers that will impress the management, which I guess is a more lucrative business than providing real insight.
The headline isn't really what this article is about, particularly the disparaging of Optimizely. Might I suggest "The dangers of naive A/B testing" or "Buyer beware -- A/B methodologies dissected" or "Don't Blindly Trust A/B Test Results".
Where's the part where he "(almost)" got fired?
Maybe that's the headline that did best in an A/B test.