The partial F-test: Solving for A/B test interactions and conditional treatment effects


Moving beyond the standard t-test in experimentation

When it comes to advanced A/B testing, product and data teams often need to measure complex A/B test interactions and conditional treatment effects (CATE) in order to understand how different users actually respond. Yet, somewhat surprisingly, almost all the analysis in experimentation platforms, even those with extra complexities such as CUPED, is ultimately reduced to some form of univariate test, most often the basic t-test. The t-test is most useful when we have a single hypothesis we want to test: for example, whether some treatment performs better than a control, or whether two simple A/B tests are interacting with one another. However, there are many cases where what is needed is a multivariate test: a way to test more than one null hypothesis at the same time.

Is there such a test? Yes! It is the partial F-test for nested linear regression models. The partial F-test is perhaps one of the most useful, underappreciated, and underused approaches in experimentation. It is such a flexible framework that all of the following can be recast as partial F-tests of nested regression models: 

  • A/B t-tests
  • omnibus ANOVA/ANCOVA tests
  • conditional treatment effects (CATE)
  • and even interaction effects between different concurrent A/B tests

So, if you want to be able to answer questions like “Do my A/B test results differ by customer?” and/or “Are my A/B tests interacting with each other?”, then read on!

TL;DR: The 4 steps of a partial F-test

If you already know about multivariate regression and nested models, and would just like a TL;DR version, here are the basic steps for the (homoscedastic) partial F-test:

1. Compare two regression models:
a. A more complex ‘full’ regression model;
b. A nested, simpler ‘partial’ regression model.

2. See how well each ‘predicts’ the outcome data based on the residual sum of squares.

3. Compare the models’ results using an F-statistic.

4. Find the upper-tail area of the F-distribution beyond the F-statistic (one minus the cumulative distribution at that value) to calculate the associated p-value.
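The steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy and SciPy are available; the function name `partial_f_test` and the simulated data are hypothetical and only there to show the mechanics of the homoscedastic version of the test.

```python
import numpy as np
from scipy import stats


def partial_f_test(y, X_full, X_nested):
    """Homoscedastic partial F-test of a nested model against a full model.

    Both design matrices must carry their own intercept column, and the
    columns of X_nested must lie in the column space of X_full.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)

    def rss(X):
        # Step 2: fit by OLS and measure fit by the residual sum of squares.
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return float(resid @ resid)

    k_full = np.linalg.matrix_rank(X_full)
    k_nested = np.linalg.matrix_rank(X_nested)
    p = k_full - k_nested      # number of joint null hypotheses
    df_resid = n - k_full      # residual degrees of freedom, N - k

    rss_full, rss_nested = rss(X_full), rss(X_nested)

    # Step 3: extra error of the simpler model, relative to the full model.
    f_stat = ((rss_nested - rss_full) / p) / (rss_full / df_resid)

    # Step 4: upper-tail area of the F(p, N - k) distribution.
    p_value = float(stats.f.sf(f_stat, p, df_resid))
    return f_stat, p_value, p, df_resid


# Simulated data, for illustration only: a binary treatment with no true effect.
rng = np.random.default_rng(0)
treat = np.repeat([0.0, 1.0], 50)
y = 5.0 + rng.normal(size=100)

# Step 1: the two models being compared.
X_nested = np.ones((100, 1))                      # intercept only
X_full = np.column_stack([np.ones(100), treat])   # intercept + treatment

f_stat, p_value, p, df_resid = partial_f_test(y, X_full, X_nested)
print(f"F({p}, {df_resid}) = {f_stat:.3f}, p-value = {p_value:.3f}")
```

With a single treatment dummy, this should agree with an ordinary pooled two-sample t-test on the same data, which is a handy sanity check.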

If that was clear, great, jog on. If not, there are a few key concepts to cover before getting into the details of running the F-test. First, we need to review the linear model and nesting.

Understanding the linear model in A/B testing 

By linear model, we mean a model that can be solved with OLS regression. These are models in the form of Y = Xβ where X is the design matrix (the data about the experiment, such as test assignment and any possible covariates) and β are the estimated weights (the things we want to learn from the experiment).

A simple A/B test in this framework can be modeled as Y = β0 + β1 * Treat, where ‘Treat’ is a binary indicator (dummy) variable encoding, with a value of  ‘1’ for participants who received the treatment and ‘0’ for those in the control group. 

The regression weight for treatment β1 is the estimate for the average treatment effect (ATE). The ATE is the difference in the estimated value of the outcome measure, Y, between those in the control and treatment groups.
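To make the link between the regression weights and the group means explicit, the dummy-variable model can be written out for each arm. The bar notation for sample means is introduced here for illustration and does not appear elsewhere in the article.

```latex
\begin{aligned}
E[Y \mid \text{Treat} = 0] &= \beta_0 \\
E[Y \mid \text{Treat} = 1] &= \beta_0 + \beta_1 \\
\text{ATE} &= E[Y \mid \text{Treat} = 1] - E[Y \mid \text{Treat} = 0] = \beta_1 \\
\hat{\beta}_0 &= \bar{Y}_{\text{control}}, \qquad
\hat{\beta}_1 = \bar{Y}_{\text{treatment}} - \bar{Y}_{\text{control}}
\end{aligned}
```

So the OLS estimate of the intercept is the control-group mean, and the OLS estimate of the treatment weight is the difference in group means.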

The standard t-test approach evaluates whether the null hypothesis H0 : β1 = 0 should be rejected by seeing how far the t-statistic β1 / St_err(β1) is from zero. When its absolute value exceeds a critical value (e.g., 1.96 for a two-sided test at the 5% level in large samples), we reject H0 (the null); otherwise we ‘fail to reject’ H0.

Let’s look at a very simple A/B test example where we have a revenue measure, and subjects have been assigned to treatment or control.

Revenue    Treatment
$ 4.90     0
$ 8.52     0
$ 6.34     0
$ 5.37     0
$ 2.00     0
$ 3.93     0
$ 8.93     1
$ 6.15     1
$ 10.06    1
$ 6.62     1
$ 8.11     1
$ 7.95     1

Using Excel (yeah, yeah, insert a joke about Excel, but feel free to use whatever deterministic stats software you like), I first calculated the mean values for the treatment and control groups. Then I regressed revenue onto treatment to get the following results:

  1. The mean value for each test arm and the average treatment effect (ATE)

Test arm     Means
Control      $ 5.18
Treatment    $ 7.97
ATE          $ 2.79

  2. The regression ANOVA table

ANOVA         df    SS        MS        F        Significance F
Regression    1     23.421    23.421    6.747    0.027
Residual      10    34.715    3.471
Total         11    58.135

  3. And the coefficients (regression weights) table

             Coefficients    Standard Error    t Stat    P-value
Intercept    $ 5.18          0.761             6.804     0.000
Treatment    $ 2.79          1.076             2.597     0.027

A few things to notice: 

  1. The regression coefficient for the Treatment indicator variable is the same as the ATE calculated from the difference in means. 
  2. The t-statistic from the coefficients table, 2.597, is the square root of the F-statistic in the ANOVA table: sqrt(6.747) = 2.597. 
  3. And the P-values for the t-test and the F-test are exactly the same! In other words, we could (with a tiny bit of fiddling for dealing with signs in one-tailed tests) use either the t-test of the Treatment coefficient or the Omnibus F-test for the regression and get the same result. 

Conclusion: A/B tests are just a special case of the partial F-test framework. 
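The three observations above can be checked with a short NumPy sketch. It uses the twelve revenue values exactly as printed in the table, so it is an approximation: the printed values are rounded to cents, and the results agree with the Excel output only to about two decimal places.

```python
import numpy as np

# Revenue and treatment assignment from the example table (rounded to cents).
revenue = np.array([4.90, 8.52, 6.34, 5.37, 2.00, 3.93,
                    8.93, 6.15, 10.06, 6.62, 8.11, 7.95])
treat = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], dtype=float)

# Design matrix: a column of ones (intercept) plus the treatment dummy.
X = np.column_stack([np.ones_like(treat), treat])
n, k = X.shape

# OLS estimates: beta = (X'X)^-1 X'y.
xtx_inv = np.linalg.inv(X.T @ X)
beta = xtx_inv @ X.T @ revenue

# Residual variance and the usual (homoscedastic) standard errors.
resid = revenue - X @ beta
rss_full = resid @ resid
sigma2 = rss_full / (n - k)
std_err = np.sqrt(np.diag(sigma2 * xtx_inv))
t_stat = beta / std_err

# Omnibus F: compare against the intercept-only fit (the overall mean).
rss_nested = np.sum((revenue - revenue.mean()) ** 2)
f_stat = (rss_nested - rss_full) / sigma2

ate = revenue[treat == 1].mean() - revenue[treat == 0].mean()
print(f"ATE from means        : {ate:.2f}")
print(f"Treatment coefficient : {beta[1]:.2f}")
print(f"t-stat squared        : {t_stat[1] ** 2:.3f}")
print(f"F-statistic           : {f_stat:.3f}")
```

The treatment coefficient and the difference in means coincide, and the squared t-statistic equals the F-statistic up to floating-point error.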

The partial F-test and nested regression models

By nested, we mean the family of simpler models that are subsets of some larger ‘full’ model. 


In our simple example, the ‘full’ model is the one-variable dummy regression Y = β0 + β1 * Treat, where the dummy (binary indicator) variable marks the users who were assigned to the treatment. Dropping the treatment variable gives the nested model, which reduces to the intercept-only model Y = β0. This model is equivalent to using the overall mean value to predict Y. Notice that the nested model is just the full model with β1 = 0: Y = β0 + β1 * Treat = β0 when β1 = 0. But β1 = 0 is exactly what we are trying to test: do we have evidence to reject H0 : β1 = 0? The intercept-only model implicitly forces β1 = 0. We can then compare the two models and ask whether the full model, which includes the Treatment variable, explains more of the variation in Y than the model without it.

This comparison between the nested and the full model is the partial F-test. To make the initial discussion simpler, we will use the equal error (homoscedastic) version of the partial F-statistic:

$$
F\text{-stat} = \frac{\frac{(Res\_SS_{nested} - Res\_SS_{full})}{p}}{\frac{(Res\_SS_{full})}{(N-k)}}
$$

The Res_SSnested term is the total sum of squared residuals using the nested model to estimate Y (you can think of the Res_SS as a measure of how well the model predicts Y over all of the data points in the sample), and Res_SSfull is the total sum of squared residuals of the full model. N is the number of observations, p is the number of joint null hypotheses, and k is the total number of parameters in the full model, so N - k is the residual degrees of freedom of the full model.

To see where everything comes from, we can calculate all of those values directly ‘by hand’. Below, I have used the predicted value of Revenue (Y) from each model to calculate its respective total residual sum of squares. 

The nested model uses the overall mean, 6.57, as the predicted value of each subject’s revenue. The full model uses β0 = 5.18 as the estimated revenue for those in the Control group and β0 + β1 = 7.97 as the estimate for the treatment group. After squaring the residuals for each subject and then summing those values, we find that Res_SSnested = 58.14 and Res_SSfull = 34.71.

Nested model

Revenue Nested Model Estimate Residual Residual Squared
$4.90 $ 6.57 $ (1.67) 2.78
$8.52 $ 6.57 $ 1.95 3.79
$6.34 $ 6.57 $ (0.24) 0.06
$5.37 $ 6.57 $ (1.21) 1.46
$2.00 $ 6.57 $ (4.57) 20.92
$3.93 $ 6.57 $ (2.64) 6.99
$8.93 $ 6.57 $ 2.36 5.56
$6.15 $ 6.57 $ (0.42) 0.18
$10.06 $ 6.57 $ 3.48 12.13
$6.62 $ 6.57 $ 0.04 0.00
$8.11 $ 6.57 $ 1.54 2.38
$7.95 $ 6.57 $ 1.38 1.89
Total Residual Sum of Squares 58.14

Full model

Revenue Full Model Estimate Residual Residual Squared
$4.90 $ 5.18 $ (0.27) 0.07
$8.52 $ 5.18 $ 3.34 11.18
$6.34 $ 5.18 $ 1.16 1.35
$5.37 $ 5.18 $ 0.19 0.04
$2.00 $ 5.18 $ (3.18) 10.10
$3.93 $ 5.18 $ (1.25) 1.55
$8.93 $ 7.97 $ 0.96 0.92
$6.15 $ 7.97 $ (1.82) 3.30
$10.06 $ 7.97 $ 2.09 4.35
$6.62 $ 7.97 $ (1.35) 1.83
$8.11 $ 7.97 $ 0.15 0.02
$7.95 $ 7.97 $ (0.02) 0.00
Total Residual Sum of Squares 34.71

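These are the same values as in the ANOVA table: Res_SSnested is the Total sum of squares (58.135) and Res_SSfull is the Residual sum of squares (34.715). This is because when we have just one hypothesis or just one variable in our model, the omnibus ANOVA F-test is equivalent to the partial F-test for the single t-test.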

To complete the exercise, take both sums of squared residuals and plug them into the F-statistic:

$$
F\text{-stat} = \frac{\frac{(Res\_SS_{nested} - Res\_SS_{full})}{p}}{\frac{(Res\_SS_{full})}{(N-k)}} = \frac{\frac{(58.14 - 34.71)}{1}}{\frac{(34.71)}{(12-2)}} = \frac{23.42}{3.471} = 6.747
$$

This matches the F-statistic from the Excel regression ANOVA table (and will match those from R, Stata, etc.).
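The same ‘by hand’ calculation can be written in plain Python with no libraries. This sketch starts from the revenue values as printed (rounded to cents), so its sums of squares and F-statistic differ from the tables in the second or third decimal place.

```python
# Revenue by test arm, as listed in the example table (rounded to cents).
control = [4.90, 8.52, 6.34, 5.37, 2.00, 3.93]
treatment = [8.93, 6.15, 10.06, 6.62, 8.11, 7.95]
everyone = control + treatment


def mean(values):
    return sum(values) / len(values)


def sum_sq_resid(values, prediction):
    # Sum of squared differences between each value and a single prediction.
    return sum((v - prediction) ** 2 for v in values)


# Nested model: every subject is predicted by the overall mean.
rss_nested = sum_sq_resid(everyone, mean(everyone))

# Full model: each subject is predicted by the mean of their own test arm.
rss_full = (sum_sq_resid(control, mean(control))
            + sum_sq_resid(treatment, mean(treatment)))

n = len(everyone)  # N, number of observations
k = 2              # parameters in the full model (intercept + treatment)
p = 1              # joint null hypotheses (beta1 = 0)

f_stat = ((rss_nested - rss_full) / p) / (rss_full / (n - k))

print(f"Res_SS nested = {rss_nested:.2f}")
print(f"Res_SS full   = {rss_full:.2f}")
print(f"F-statistic   = {f_stat:.2f}")
```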

Just looking at those equations can be confusing, so it helps me to think of the F-stat as roughly how much extra relative error we see in our estimates if we just use the simpler model. Put another way, it measures the marginal benefit of adding complexity to the model, and whether that marginal improvement is of a size that would be consistent with random fluctuations in our sample.

Of course, so far, you might think this is just an interesting relationship for the nerds. But that would be wrong. While yes, it is interesting for nerds, it is also super useful for everyone interested in answering more complex experimentation questions than basic A/B tests. The partial F-test provides a general framework for comparing any linear model with any of its nested versions. And that is hugely useful, especially for questions that can be recast into interaction problems.

Are my A/B tests interfering with each other? Interactions and conditional treatment effects (CATE)

By interactions, we mean when two or more independent variables don’t just have separate effects, but they combine in a way that changes the outcome. For example:

  • Are there heterogeneous treatment effects, e.g., do different customer segments prefer different treatments/experiences?
  • ANOVA and Factorial Multivariate Designs, where we want to decompose the share of the explainable variance in our measure of interest over several multivariate factors. ‘Does changing the headline and image affect user behavior, and if so, what share of that behavior is due to changing the headlines vs. changing the images?’
  • And the evergreen, ‘Are my A/B tests interfering with each other?’

    (Side note: I know it’s fashionable to dismiss this question, but it seems totally reasonable to be able to provide a principled approach to an answer rather than a bunch of handwaving and appeals to the industry Hippos that it never matters. If the C-suite is concerned about it, maybe it is a good idea to have a better answer than ‘Booking and Microsoft told us it doesn’t matter’). 

Interactions are equivalent to logical ‘And’ statements. They are multiplicative models in which we need to know both this and that to describe an effect.

For example, maybe your organization wants to check whether there are conditional treatment effects across customer loyalty segments. The test data now includes a customer loyalty segment variable with values ‘Low’, ‘Med’, and ‘High’.

Revenue    Treatment    Segment
$ 4.90     0            Low
$ 8.52     0            Med
$ 6.34     0            High
$ 5.37     0            Low
$ 2.00     0            Med
$ 3.93     0            High
$ 8.93     1            Low
$ 6.15     1            Med
$ 10.06    1            High
$ 6.62     1            Low
$ 8.11     1            Med
$ 7.95     1            High

We might ask if knowing the customer’s loyalty status is informative for estimating the effect of the Treatment on revenue.

Visualizing interactions with means charts 

A useful tool for visualizing these two-way multivariate questions (Treatment*Loyalty) is a means chart. These charts plot the average value of the outcome variable for each combination (or stratum) in the data. The mean value of the outcome measure for each arm in the experiment is plotted as the lines in the plot. Each arm is crossed by the potentially interacting external variable. For example, in all of the charts below, ‘This Test’ is the test that we think might be affected by some external influence – perhaps some other test or some customer dimension. The horizontal axis represents the potentially interacting external variables.

No interaction = parallel segments
When the lines are parallel, it is unlikely that there is an interaction effect. 

[Figure: two means charts with perfectly parallel lines across test assignments, indicating the absence of an interaction effect between A/B tests.]

Magnitude interaction = nonparallel
When the lines are not parallel but don’t cross each other, then that is suggestive of a magnitude interaction – the direction of the effect is the same over the values in the user segment (or another A/B test when looking for test interactions), but the magnitude isn’t the same.

[Figure: two means charts illustrating a magnitude interaction. The lines for the test variations are nonparallel, indicating different effect sizes, but they do not cross.]

Sign interaction = crossing lines
If the lines cross, that is evidence of sign inversions and suggests that targeted treatments might be needed (or that the A/B test is invalidated due to corruption from another A/B test running simultaneously).

[Figure: two means charts showing a sign interaction. The lines cross each other, indicating that the treatment effect reverses direction across segments.]

Below we have a means chart of our test data broken out by Loyalty status:

[Figure: means chart of the A/B test effects broken out by customer loyalty segment. The control and treatment lines are nonparallel but do not cross.]

The lines don’t cross, so we don’t have evidence for a sign interaction. However, the lines aren’t exactly parallel, as the difference between Treatment and Control is at least nominally larger for highly loyal customers. Maybe there is a magnitude interaction, maybe not. 
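The points behind that means chart are just the cell means of the data. As a minimal sketch in plain Python, using the rows of the loyalty table as printed, the cell means and the per-segment gap between the Treatment and Control lines can be computed like this (the variable names are illustrative):

```python
from collections import defaultdict

# (revenue, treatment, segment) rows from the example table.
rows = [
    (4.90, 0, "Low"), (8.52, 0, "Med"), (6.34, 0, "High"),
    (5.37, 0, "Low"), (2.00, 0, "Med"), (3.93, 0, "High"),
    (8.93, 1, "Low"), (6.15, 1, "Med"), (10.06, 1, "High"),
    (6.62, 1, "Low"), (8.11, 1, "Med"), (7.95, 1, "High"),
]

# Collect revenue values for every (treatment, segment) cell.
cells = defaultdict(list)
for revenue, treat, segment in rows:
    cells[(treat, segment)].append(revenue)

cell_means = {cell: sum(v) / len(v) for cell, v in cells.items()}

# Each line in the means chart is one test arm; each point is a cell mean.
# The per-segment gap between the lines is the conditional treatment effect.
gaps = {}
for segment in ("Low", "Med", "High"):
    control_mean = cell_means[(0, segment)]
    treatment_mean = cell_means[(1, segment)]
    gaps[segment] = treatment_mean - control_mean
    print(f"{segment:>4}: control={control_mean:.2f} "
          f"treatment={treatment_mean:.2f} gap={gaps[segment]:.2f}")

# Parallel lines would mean equal gaps; a sign flip would mean crossing lines.
lines_cross = min(gaps.values()) < 0 < max(gaps.values())
print("lines cross:", lines_cross)
```

The gaps are all positive but unequal, which is the ‘nonparallel but not crossing’ pattern described above. Whether the inequality is more than noise is the question the partial F-test answers.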

This visual inspection approach gets even harder when there are many levels in one or more of our variables. For example, if we had four test treatments and eight segments, our data visualization might look something like this:

[Figure: a complex means chart of four test options across eight user segments, with numerous crossing lines.]

There are crossing lines that suggest a sign interaction, but perhaps because there are many combinations, this behavior is due to unexplained, random variability in the outcome measure.

This is where having a multivariate statistical test, rather than just the simple univariate A/B test, enables us to test many combinations at once. Looking at it another way, the partial F-test lets us test whether the lines in the means chart are parallel. Providing both the statistical and visual approaches gives a fuller picture and makes the results accessible to everyone in the organization. 


To help clarify how this works, let’s walk through the analysis for our simple test of loyalty effects. First, write out the fully interacted model of the loyalty segment with Treatment assignment: Y = β0 + β1 * Treat + β2 * Med + β3 * High + β4 * (Med * Treat) + β5 * (High * Treat), where β4 is the interaction term between treatment and customers in the medium loyalty group, and β5 is the interaction term between treatment and customers in the high loyalty group. The low loyalty group is the reference level. Hypothesizing that there are no conditional treatment effects with respect to Loyalty is equivalent to hypothesizing that both β4 and β5 are equal to zero. Setting both to zero reduces the full model to the simpler additive, main effects model, which will be the nested comparison for our test. The main effects model can be written as Y = β0 + β1 * Treat + β2 * Med + β3 * High.
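It can help to read the conditional treatment effects directly off the fully interacted model. Taking the difference between Treat = 1 and Treat = 0 within each segment, with the low loyalty group as the reference level, gives:

```latex
\begin{aligned}
\text{Effect}_{\text{Low}}  &= \beta_1 \\
\text{Effect}_{\text{Med}}  &= \beta_1 + \beta_4 \\
\text{Effect}_{\text{High}} &= \beta_1 + \beta_5 \\
H_0 : \beta_4 = \beta_5 = 0 \;&\Longleftrightarrow\;
\text{Effect}_{\text{Low}} = \text{Effect}_{\text{Med}} = \text{Effect}_{\text{High}}
\end{aligned}
```

So the joint null says that the treatment effect is the same in every segment, which is the ‘parallel lines’ condition from the means charts.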

After running both regression models, we find their respective residual sums of squares and degrees of freedom are as follows: 

ANOVA                    df    Residual SS
Full Interacted Model    6     31.08
Nested Main Effects      8     33.11

The number of joint hypotheses, p, is calculated as the difference between the residual degrees of freedom of the nested model, df_nested (8), and of the full model, df_full (6), which is 2, as expected.

Plugging these values into the simple F-stat formula yields the following F-stat:

$$
F\text{-stat} = \frac{\frac{(Res\_SS_{nested} - Res\_SS_{full})}{p}}{\frac{(Res\_SS_{full})}{(N-k)}} = \frac{\frac{(33.11 - 31.08)}{2}}{\frac{(31.08)}{(12-6)}} = \frac{1.013}{5.181} = 0.196
$$

An F-stat of 0.196 has a p-value of 0.827, so there is no evidence of conditional treatment effects based on loyalty. Equivalently, there is no evidence that our lines are not parallel. (This isn’t too surprising, since we only have 12 data points.)
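The whole interaction test can be reproduced with a short NumPy sketch. It assumes dummy coding with ‘Low’ as the reference level and uses the table values as printed (rounded to cents), so the residual sums of squares match the table only to about two decimals. The p-value uses a closed form that holds only when there are exactly two numerator degrees of freedom, which happens to be the case here.

```python
import numpy as np

# Data from the loyalty table (rounded to cents); rows cycle Low, Med, High.
revenue = np.array([4.90, 8.52, 6.34, 5.37, 2.00, 3.93,
                    8.93, 6.15, 10.06, 6.62, 8.11, 7.95])
treat = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], dtype=float)
segment = np.array(["Low", "Med", "High"] * 4)

# Dummy-code the segment with 'Low' as the reference level.
med = (segment == "Med").astype(float)
high = (segment == "High").astype(float)
ones = np.ones_like(treat)

# Nested main effects model and fully interacted model.
X_nested = np.column_stack([ones, treat, med, high])
X_full = np.column_stack([ones, treat, med, high, med * treat, high * treat])


def residual_ss(X, y):
    # Fit by OLS and return the residual sum of squares.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


rss_nested = residual_ss(X_nested, revenue)
rss_full = residual_ss(X_full, revenue)

n = len(revenue)
df_full = n - X_full.shape[1]        # 12 - 6 = 6
df_nested = n - X_nested.shape[1]    # 12 - 4 = 8
p = df_nested - df_full              # 2 joint hypotheses: beta4 = beta5 = 0

f_stat = ((rss_nested - rss_full) / p) / (rss_full / df_full)

# With exactly two numerator degrees of freedom the F upper-tail area has a
# closed form, P(F > f) = (1 + 2 f / d2) ** (-d2 / 2). For other values of p,
# a library routine such as scipy.stats.f.sf would be needed instead.
p_value = (1.0 + 2.0 * f_stat / df_full) ** (-df_full / 2.0)

print(f"Res_SS nested = {rss_nested:.2f}, Res_SS full = {rss_full:.2f}")
print(f"F({p}, {df_full}) = {f_stat:.3f}, p-value = {p_value:.3f}")
```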

The reality of unbalanced data: Heteroskedasticity and the robust F-test

Unfortunately, in the real world, we often can’t use the homoscedastic version of the F-test. While it is likely robust for well-defined and balanced factorial multivariate ANOVA tests, for more ad hoc tests, like testing for CATE and A/B test interactions, the homoscedastic version of the test is unlikely to be robust and will tend to have poor type 1 error control. This is because this version of the test is very sensitive to unequal cell sizes, which will often be the case when testing segment-level effects, as segment assignments are rarely uniformly distributed (for example, there are often many fewer customers in the High loyalty tier than in the Low loyalty tier). If we can’t confidently use this test, does that mean this is all for naught? No, there is another …

For the nerds – The robust F-test 

The robust F-test uses robust standard errors that account for unequal variance in the error terms across different covariate values. While there are various versions of these robust estimates, they are all much more complicated to construct than the standard OLS standard errors. The standard OLS variance-covariance matrix reduces to σ²(X’X)⁻¹, where σ² is the homoscedastic error variance. However, for robust estimates, we don’t assume homoscedastic errors and need to account for unequal error variance across different data configurations.

The basic Eicker-White (HC0) standard errors incorporate the squared residual for each record rather than using a single (homoscedastic) error estimate. The HC0 formula requires constructing a diagonal matrix Ω̂ that contains each squared residual, which is used to adjust the weight each data point carries in the standard errors.

$$
\mathrm{Var}_{HC0}(\hat{\beta}) = (X'X)^{-1}(X'\hat{\Omega}X)(X'X)^{-1}, \text{ where } \hat{\Omega} = \begin{bmatrix} \hat{\varepsilon}_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \hat{\varepsilon}_n^2 \end{bmatrix}
$$

There are several versions of robust standard errors. While the choice of the best version to use is subjective, Conductrics uses HC3 standard errors, which adjust the squared residuals by (1 - diag(X(X’X)⁻¹X’))⁻², where the diagonal elements represent the leverage of each data point. This version has been shown to provide much stronger type 1 error control when data is highly imbalanced. For those interested, the full formula for this robust F-test is:

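$$
F_{HC3} = \left(\frac{1}{q}\right)(R\hat{\beta} - r)^T(R\hat{V}_{HC3} R^T)^{-1}(R\hat{\beta} - r)
$$

Here R is the q × k matrix that selects the coefficients (or linear combinations of coefficients) under test, r holds their hypothesized values (typically zeros), q is the number of joint hypotheses (the p of the earlier formula), and V̂_HC3 is the HC3 estimate of the variance-covariance matrix of β̂.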
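To make the formulas concrete, here is a minimal NumPy sketch that builds the classical, HC0, and HC3 covariance matrices for the fully interacted loyalty model and evaluates the F-statistic in the R, r, q form above. It illustrates the textbook formulas only and is not Conductrics’ implementation; the helper names `sandwich` and `wald_f` are hypothetical, and with only 12 data points the numbers are purely mechanical.

```python
import numpy as np

# Loyalty example data (rounded to cents); rows cycle Low, Med, High.
revenue = np.array([4.90, 8.52, 6.34, 5.37, 2.00, 3.93,
                    8.93, 6.15, 10.06, 6.62, 8.11, 7.95])
treat = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], dtype=float)
segment = np.array(["Low", "Med", "High"] * 4)
med = (segment == "Med").astype(float)
high = (segment == "High").astype(float)

# Fully interacted design matrix, 'Low' as the reference level.
X = np.column_stack([np.ones(12), treat, med, high, med * treat, high * treat])
n, k = X.shape

xtx_inv = np.linalg.inv(X.T @ X)
beta = xtx_inv @ X.T @ revenue
resid = revenue - X @ beta

# Leverage of each observation: diagonal of the hat matrix X (X'X)^-1 X'.
leverage = np.einsum("ij,jk,ik->i", X, xtx_inv, X)


def sandwich(weights):
    # (X'X)^-1 X' diag(weights) X (X'X)^-1
    meat = X.T @ (X * weights[:, None])
    return xtx_inv @ meat @ xtx_inv


v_classic = (resid @ resid) / (n - k) * xtx_inv        # sigma^2 (X'X)^-1
v_hc0 = sandwich(resid ** 2)                           # Eicker-White
v_hc3 = sandwich(resid ** 2 / (1.0 - leverage) ** 2)   # leverage-adjusted

# H0: beta4 = beta5 = 0, written as R beta = r.
R = np.zeros((2, k))
R[0, 4] = 1.0
R[1, 5] = 1.0
r = np.zeros(2)
q = R.shape[0]


def wald_f(V):
    diff = R @ beta - r
    return float(diff @ np.linalg.solve(R @ V @ R.T, diff) / q)


f_classic = wald_f(v_classic)   # reproduces the homoscedastic partial F
f_hc0 = wald_f(v_hc0)
f_hc3 = wald_f(v_hc3)
print(f"classical F = {f_classic:.3f}, HC0 F = {f_hc0:.3f}, HC3 F = {f_hc3:.3f}")
```

Plugging the classical covariance into the same expression recovers the homoscedastic partial F-statistic, which shows that the robust test changes only the covariance estimate, not the structure of the test.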

Scaling the robust F-test for enterprise experimentation

While a full discussion of the HC3 robust F-test is beyond the scope of this post, it is the approach used here at Conductrics. It is also likely the reason that no other publicly available experimentation platform offers these types of tests natively. Without thoughtful design and access to performant linear algebra routines (SQL really isn’t designed for inverting largish (1k by 1k) matrices), these calculations can be extremely compute-intensive.

Somewhat ironically, it is Conductrics’ privacy-by-design architecture that makes it possible for us to offer these tests at scale. In a way, data minimization is dual to efficient data storage and statistical computing. If one has been thoughtful, all of the matrix operations can be performed not in the space of the observations, but in the space of the covariates, which is most often orders of magnitude smaller. Often, less can really be more. 
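One way to read ‘in the space of the covariates’ is that OLS only needs a few small summaries of the data: X’X (k by k), X’y (length k), and y’y (a scalar). The sketch below is an assumption about the general idea, not a description of Conductrics’ architecture. It accumulates those summaries over chunks of simulated data and then solves for the weights and the residual sum of squares without revisiting the raw rows. The robust (HC) covariance needs additional aggregates that are not shown here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated experiment, for illustration only.
n = 10_000
treat = rng.integers(0, 2, size=n).astype(float)
segment = rng.integers(0, 3, size=n)          # 0 = Low, 1 = Med, 2 = High
med = (segment == 1).astype(float)
high = (segment == 2).astype(float)
X = np.column_stack([np.ones(n), treat, med, high, med * treat, high * treat])
y = 5.0 + 2.0 * treat + rng.normal(size=n)

k = X.shape[1]
xtx = np.zeros((k, k))   # running X'X (k x k)
xty = np.zeros(k)        # running X'y (length k)
yty = 0.0                # running y'y (scalar)

# Stream over the observations in chunks; only the small summaries are kept.
chunk = 1_000
for start in range(0, n, chunk):
    Xc, yc = X[start:start + chunk], y[start:start + chunk]
    xtx += Xc.T @ Xc
    xty += Xc.T @ yc
    yty += float(yc @ yc)

# Solve the k x k normal equations instead of working with all n rows.
beta = np.linalg.solve(xtx, xty)

# Residual sum of squares from the summaries alone: y'y - beta'X'y.
rss = yty - float(beta @ xty)
print("beta:", np.round(beta, 3))
print(f"residual SS: {rss:.1f}")
```

The memory needed for the summaries depends on the number of covariates k, not on the number of observations n.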

If you’d like to see how Conductrics generates these multivariate means charts and the associated statistical analyses, and how our machine learning agent uses these analyses as a preprocessing step for advanced predictive audiences, reach out to us to schedule a walk-through.

