Why does a linear least squares fit appear to have a bias when applied to simple test data?


The answer lies in the data generating process. As other answers have pointed out, the simple OLS model assumes

$$ Y = \beta_0 + \beta_1 X + \varepsilon $$

and that $X$ and $\varepsilon$ are independent. It also assumes that we know $X$ with certainty.

Let’s consider the case where we have a perfectly fit regression and all error terms are zero. Because all the data points lie along the regression line, there is no variance in any other direction, so if we did PCA on these data, the first principal component would point in exactly the same direction as the regression line, matching your intuition. Once we start to add noise (i.e. increase the error term), which acts strictly in the vertical direction, PCA will capture some of that vertical variance in the first principal component and leave what’s left over for the second. That is, the direction of maximum variance tilts further toward the vertical. That’s what you’re observing in your own plots (I’m only talking about the slope discrepancy here, not the centroid, which is covered by other answers).
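To see this numerically, here is a minimal sketch (not part of the original answer; n, beta and the sigma values are just illustrative) comparing the OLS slope with the slope of the first principal component as vertical noise is added:

import numpy as np

np.random.seed(0)
n, beta = 100_000, 1.0
x = np.random.randn(n)

for sigma in [0.0, 0.5, 2.0, 5.0]:
    y = beta * x + np.random.randn(n) * sigma
    ols_slope = np.polyfit(x, y, 1)[0]      # stays close to beta
    eigvals, eigvecs = np.linalg.eigh(np.cov(x, y))
    v = eigvecs[:, np.argmax(eigvals)]      # first principal component
    print(f"sigma={sigma}: OLS slope {ols_slope:.2f}, PCA slope {v[1] / v[0]:.2f}")

With sigma = 0 the two slopes coincide; as sigma grows, the OLS slope stays near beta while the PCA slope climbs toward vertical.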

Now consider the opposite extreme, when error dominates. Here it’s useful to pull apart the variance of $Y$:

$$ \operatorname{Var}(Y) = \operatorname{Var}(\beta_1 X + \varepsilon) = \operatorname{Var}(\beta_1 X) + \operatorname{Var}(\varepsilon) = \beta_1^2 \operatorname{Var}(X) + \operatorname{Var}(\varepsilon) $$

So in the case where

$$ \operatorname{Var}(\varepsilon) \gg \beta_1^2 \operatorname{Var}(X), $$

nearly all variation is vertical. The principal component direction approaches vertical as error variance increases.
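A quick numerical check of that decomposition (a sketch with arbitrary values for $\beta_1$, $\operatorname{Var}(X)$ and $\operatorname{Var}(\varepsilon)$, not taken from the plots below):

import numpy as np

np.random.seed(0)
n, beta1 = 1_000_000, 2.0
x = np.random.randn(n) * 3.0    # Var(X) ≈ 9
eps = np.random.randn(n) * 5.0  # Var(ε) ≈ 25, independent of X
y = beta1 * x + eps

print(np.var(y))                           # ≈ 61
print(beta1**2 * np.var(x) + np.var(eps))  # ≈ 4 * 9 + 25 = 61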

[Figure: variance domination]

Another source of confusion may be that the covariance ellipse only serves as a meaningful data contour when $(X, Y)$ are jointly normal. Here are examples where $X$ has different distributions:

[Figure: data, OLS line, principal component, and covariance ellipse for different X distributions]

Again, we see the same pattern: when the $X$ variance dominates, the principal component lines up far more closely with the regression line. But when the ellipse doesn’t coincide with a tight contour of the data, it is easier to see that PCA and OLS are doing different things.

Notice that when

$$ \operatorname{Var}(X) \gg \operatorname{Var}(\varepsilon), $$

PCA and OLS align, but the ellipse no longer describes the shape of the data: it indicates the direction of maximum variance, not a density contour.
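Because the ellipse is built from the covariance matrix alone, it only tracks second moments. Here is a small sketch (separate from the plotting code below) in which a normal $X$ and a uniform $X$ are given the same variance; the sample covariance matrices, and hence the ellipse and principal direction, come out essentially identical even though the point clouds look nothing alike:

import numpy as np

np.random.seed(1)
n, beta, sigma = 100_000, 1.0, 0.3

x_normal = np.random.randn(n) * 50.0
x_uniform = np.random.uniform(-50 * np.sqrt(3), 50 * np.sqrt(3), n)  # same variance as x_normal

for x in (x_normal, x_uniform):
    y = beta * x + np.random.randn(n) * sigma
    print(np.cov(x, y))  # nearly identical 2x2 matrices -> same ellipse, same PC direction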

Here’s the code used to generate this data. Notice that the make_y_from_x function follows the precise assumptions of OLS. $Y$ is a linear function of $X$ plus i.i.d. Gaussian noise (if you want to have correlated residuals to try to violate the assumption, you can create that in the data generating process). Importantly, $X$ isn’t necessarily a random variable. $X$ could be from an experiment we designed with fixed values, or non-random for any other reason. It can also be random (but still known with certainty), so it’s important to think about this when specifying a model. (See What is a random variable and what isn't in regression models). Notice that the covariance ellipse treats $X$ and $Y$ symmetrically, but the data generating process is quite different.

Quick summary:

  • When $\operatorname{Var}(X) \gg \operatorname{Var}(\varepsilon)$: PCA aligns with OLS (the textbook case)
  • When $\operatorname{Var}(\varepsilon) \gg \operatorname{Var}(X)$: PCA approaches vertical
  • $X$ is not necessarily random, and definitely doesn’t have to be normally distributed.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

SLOPE = 1.0
INTERCEPT = -100.0
X_MEAN = 150.0
X_SCALE = 50.0


def make_y_from_x(x, noise_std, slope=SLOPE, intercept=INTERCEPT):
    # Y is a linear function of X plus i.i.d. Gaussian noise: exactly the OLS assumptions.
    return intercept + slope * x + np.random.randn(len(x)) * noise_std


def deterministic_binary_x(n):
    x_low = X_MEAN - X_SCALE
    x_high = X_MEAN + X_SCALE
    x = np.repeat([x_low, x_high], n // 2)
    if len(x) < n:
        x = np.append(x, x_high)
    return x


def deterministic_uniform_x(n):
    return np.linspace(X_MEAN - X_SCALE, X_MEAN + X_SCALE, n)


def bimodal_x(n):
    mix = np.random.rand(n) < 0.5
    x = np.empty(n)
    x[mix] = np.random.randn(mix.sum()) * (X_SCALE / 3) + (X_MEAN - X_SCALE)
    x[~mix] = np.random.randn((~mix).sum()) * (X_SCALE / 3) + (X_MEAN + 3 * X_SCALE)
    return x


def triangular_x(n):
    return np.random.triangular(
        X_MEAN - X_SCALE,
        X_MEAN + X_SCALE,
        X_MEAN + X_SCALE,
        size=n
    )


def plot_from_cov(ax, x, y, title):
    # Standardize both variables so the panels share a common scale;
    # the covariance below is then (essentially) the correlation matrix.
    x = (x - np.mean(x)) / np.std(x)
    y = (y - np.mean(y)) / np.std(y)

    mu_x, mu_y = np.mean(x), np.mean(y)   # both ~0 after standardizing

    cov = np.cov(x, y)
    eigvals, eigvecs = np.linalg.eigh(cov)
    idx = np.argmax(eigvals)              # index of the largest eigenvalue

    v = eigvecs[:, idx]                   # first principal component direction
    angle = np.degrees(np.arctan2(v[1], v[0]))

    a = 3 * np.sqrt(eigvals[idx])         # 3-sigma semi-axes of the covariance ellipse
    b = 3 * np.sqrt(eigvals[1 - idx])

    ax.plot(x, y, '.', alpha=0.05)

    slope, intercept = np.polyfit(x, y, 1)
    xs = np.array([x.min(), x.max()])
    ax.plot(xs, slope * xs + intercept, lw=2)

    ax.quiver(
        mu_x, mu_y,
        a * v[0], a * v[1],
        angles='xy',
        scale_units='xy',
        scale=1,
        width=0.01,
        color='black',
        zorder=20
    )

    ell = Ellipse((mu_x, mu_y), 2 * a, 2 * b,
                  angle=angle, edgecolor='red',
                  facecolor='none', lw=2)
    ax.add_patch(ell)

    xmin = min(x.min(), mu_x - a)
    xmax = max(x.max(), mu_x + a)
    ymin = min(y.min(), mu_y - a)
    ymax = max(y.max(), mu_y + a)

    ax.set_xlim(xmin, xmax)
    ax.set_ylim(ymin, ymax)
    ax.set_aspect('equal', adjustable='box')

    ax.text(0.02, 0.98, title,
            transform=ax.transAxes,
            ha='left', va='top',
            fontsize=11, weight='bold')


# -------- Joint normal: variance domination --------
fig, axs = plt.subplots(2, 2, figsize=(12, 8))
axs = axs.ravel()

n = 20000
beta = 1.0

settings = [
    (0.3, "var(X) >> var(ε)"),
    (5.0, "var(ε) >> var(X)")
]

for ax, (sigma, label) in zip(axs, settings):
    X = np.random.randn(n)
    Y = beta * X + np.random.randn(n) * sigma
    plot_from_cov(ax, X, Y, label)

for ax in axs[len(settings):]:
    ax.axis("off")

plt.tight_layout()


# -------- Different X distributions --------
fig, axs = plt.subplots(2, 2, figsize=(12, 8))
axs = axs.ravel()

make_xs = [
    deterministic_binary_x,
    deterministic_uniform_x,
    bimodal_x,
    triangular_x
]

sigma = 30.0

for ax, make_x in zip(axs, make_xs):
    X = make_x(n)
    Y = make_y_from_x(X, sigma)  # the OLS data generating process described in the text
    plot_from_cov(ax, X, Y, make_x.__name__)

plt.tight_layout()
plt.show()