A frequent problem in User Acquisition is to estimate ARPU of users in a cohort.
Cohort — a group of users acquired together (same date, source, channel, etc.).
Individual user revenue follows a heavy-tailed distribution:
- Majority (~75%) generate zero revenue
- Some (~20%) generate modest revenue
- Tiny minority (~5%) generate the significant part (whales)
Assumption: users inside a cohort are i.i.d., individual revenue ∼ LogNormal (or any other heavy-tailed distribution with finite mean and variance).
Because mean μ and variance σ² exist and are finite, we can apply the Central Limit Theorem to the cohort mean.
So the problem is to estimate unknown parameters μ and σ² using data. For modeling purposes, it would be convenient to define σ in terms of some other parameter k and μ , for that we need observed standard deviation of revenue inside cohorts.In practice, revenue SD is roughly proportional to mean (common for positive heavy-tailed data).This leads to the hierarchical Bayesian model that works extremely well as long as cohort size n ≳ 50–100
import numpy as np import pymc as pm import pandas as pd df_agg = pd.DataFrame({ 'cohort': ['app1', 'app2', 'app3', 'app4', 'app5'], 'installs': [ 325, 430, 280, 190, 500], 'arpu': [0.15, 0.22, 0.30, 0.25, 0.05], 'revenue_sd': [0.48, 0.46, 0.55, 0.50, 0.20], }) df_agg['idx'], cats = pd.factorize(df_agg.cohort) coords = {'cohorts': cats} model = pm.Model(coords=coords) idx = df_agg.idx.values with model: mu_global = pm.Normal('mu_global', mu=np.log(1.2), sigma=0.3) sigma_global = pm.HalfNormal('sigma_global', sigma=0.3) log_mu = pm.Normal('log_mu', mu=mu_global, sigma=sigma_global, dims=('cohorts',),) k = pm.Gamma('k', mu=2, sigma=0.8, dims=('cohorts',)) mu = pm.Deterministic('mu', pm.math.exp(log_mu), dims=('cohorts',)) sigma_k = pm.HalfNormal('sigma_k', sigma=0.5) pm.Normal('y_sd', mu=k[idx] * mu[idx],sigma=sigma_k, observed=df_agg.revenue_sd.values) sigma_obs = k[idx] * mu[idx] / pm.math.sqrt(df_agg.installs.values) pm.Normal('y', mu=mu[idx], sigma=sigma_obs, observed=df_agg.arpu.values) idata = pm.sample(500, target_accept=0.97)
Then there is a question why it's possible:
pm.Normal('y_sd', mu=k[idx] * mu[idx],sigma=sigma_k, observed=df_agg.revenue_sd.values)
k[idx]*mu[idx] is a standard deviation of user revenue distribution and some math to explain why it's the case:
# sd / sqrt(n) sigma_obs = k[idx] * mu[idx] / pm.math.sqrt(df_agg.installs.values)