Model Comparison

You have spent your career comparing models with AIC, BIC, adjusted $R^2$, and the occasional likelihood ratio test. These tools are elegant, fast, and deeply embedded in econometric practice. They are also, in the Bayesian setting, either inapplicable or subtly misleading. This document explains the Bayesian model comparison toolkit — LOO-CV, ELPD, posterior predictive checks — by mapping each concept back to something you already understand. We also discuss the pitfalls that arise when comparing ELPD across models, because this is where even experienced practitioners make mistakes.

1. Why AIC and BIC Do Not Transfer Cleanly

AIC and BIC are derived from the maximised log-likelihood and a penalty term that counts the number of free parameters. The logic is intuitive: a model that fits the data well (high log-likelihood) but uses many parameters (high complexity) is penalised, preventing overfitting.
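For reference, the standard textbook forms of the two criteria are shown below, with $k$ the number of free parameters, $N$ the sample size, and $L(\hat\theta)$ the maximised likelihood. The symbols are notation introduced here for illustration.

$$
\text{AIC} = -2 \log L(\hat\theta) + 2k, \qquad \text{BIC} = -2 \log L(\hat\theta) + k \log N
$$

In both cases the penalty depends on $k$ as a fixed integer, which is exactly the quantity that becomes ambiguous once priors start pooling or shrinking the parameters.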

In a Bayesian model, the concept of “number of free parameters” becomes ambiguous. Consider a hierarchical prior on media coefficients: eight channel-level coefficients are partially pooled toward a shared group mean. Are there eight free parameters, or one? The answer depends on how much pooling the data induces. If the group mean dominates, the effective number of parameters is closer to one. If each channel estimate ignores the group mean, the effective number is closer to eight. The truth lies somewhere in between, and it changes depending on the data.
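The eight-or-one question can be made concrete in a deliberately simplified setting. The sketch below assumes a normal-normal partial-pooling model with a known, equal sampling standard deviation per channel (`sigma`), a known between-channel standard deviation (`tau`), and a flat prior on the group mean. Under those assumptions the effective parameter count (the trace of the smoother matrix) is $1 + (J - 1)w$, with pooling weight $w = \tau^2 / (\tau^2 + \sigma^2)$. The function name and the numbers are hypothetical, and a real MMM will not satisfy these assumptions exactly.

```python
def effective_parameters(n_channels: int, tau: float, sigma: float) -> float:
    """Effective parameter count for a simplified normal-normal pooling model.

    Assumes known, equal sampling sd `sigma`, known between-channel sd `tau`,
    and a flat prior on the shared group mean (illustrative assumptions).
    """
    # Weight each channel places on its own data rather than the group mean.
    w = tau**2 / (tau**2 + sigma**2)
    # One parameter for the group mean, plus a fraction w of each remaining one.
    return 1.0 + (n_channels - 1) * w


# Heavy pooling, moderate pooling, almost no pooling.
for tau in (0.01, 1.0, 100.0):
    print(tau, round(effective_parameters(8, tau, sigma=1.0), 2))
# roughly 1.0, 4.5 and 8.0 effective parameters respectively
```

The same eight nominal coefficients count as anything between one and eight parameters depending on how strongly the data and prior pool them.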

BIC fares no better. Its derivation assumes that the posterior concentrates on a single point (the MLE) as the sample size grows. In a fully Bayesian model with informative priors and moderate sample sizes — precisely the setting of most MMMs — this assumption fails. The posterior is a genuine distribution, not a spike, and BIC’s penalty term does not account for the regularisation imposed by the prior.

You can still compute AIC and BIC from a Bayesian model by plugging in the posterior mean and the nominal parameter count, and some software will do this for you. But the resulting numbers do not have their usual theoretical justification, and they can mislead you into selecting the wrong model.

2. LOO-CV: The Gold Standard for Predictive Model Comparison

The Bayesian replacement for information criteria is Leave-One-Out Cross-Validation (LOO-CV), computed via an efficient approximation called Pareto-Smoothed Importance Sampling (PSIS-LOO). The implementation in ArviZ (which Abacus uses) makes this computation fast enough to run routinely.

The intuition maps directly to something every econometrician understands: out-of-sample prediction. Imagine you have $N$ observations. For each observation $i$, you refit the model on the remaining $N - 1$ observations and compute the log predictive density for the held-out observation $i$. The sum of these $N$ log predictive densities is the Expected Log Pointwise Predictive Density (ELPD).
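Written out, the quantity being estimated is a sum over observations of log leave-one-out predictive densities. Here $y_{-i}$ denotes the data with observation $i$ removed and $\theta$ the model parameters; both symbols are notation introduced for this illustration.

$$
\text{elpd}_\text{loo} = \sum_{i=1}^{N} \log p(y_i \mid y_{-i}), \qquad p(y_i \mid y_{-i}) = \int p(y_i \mid \theta)\, p(\theta \mid y_{-i})\, d\theta
$$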

In practice, you do not actually refit the model $N$ times. PSIS-LOO uses importance sampling to approximate each leave-one-out posterior from the full-data posterior, making the computation nearly free once the model has been fitted. The Pareto-smoothing step stabilises the importance weights, and the shape parameter of the fitted Pareto distribution (the Pareto-$k$ diagnostic) tells you how reliable each approximation is.
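The importance-sampling idea can be sketched on a toy problem where exact leave-one-out is available analytically. The example assumes a normal model with known observation standard deviation and a flat prior on the mean, so the posterior is available in closed form. It uses raw importance sampling only: the Pareto-smoothing step that PSIS adds is omitted, so this illustrates the principle and is not what ArviZ computes. All names are hypothetical.

```python
import math
import random

random.seed(0)
SIGMA = 1.0  # known observation sd (assumption that keeps everything analytic)
y = [random.gauss(0.0, SIGMA) for _ in range(50)]
n = len(y)
y_bar = sum(y) / n


def log_norm_pdf(x, mean, sd):
    return -0.5 * math.log(2 * math.pi * sd**2) - 0.5 * ((x - mean) / sd) ** 2


def log_mean_exp(values):
    # Numerically stable log of the average of exp(values).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


# Full-data posterior for the mean under a flat prior: N(y_bar, SIGMA^2 / n).
draws = [random.gauss(y_bar, SIGMA / math.sqrt(n)) for _ in range(4000)]

# Importance-sampling LOO: p(y_i | y_-i) ~= 1 / mean_s[ 1 / p(y_i | mu_s) ].
elpd_is = sum(
    -log_mean_exp([-log_norm_pdf(yi, mu, SIGMA) for mu in draws]) for yi in y
)

# Exact LOO for comparison: the posterior without observation i is also normal.
elpd_exact = 0.0
for yi in y:
    mean_minus_i = (n * y_bar - yi) / (n - 1)
    pred_sd = math.sqrt(SIGMA**2 + SIGMA**2 / (n - 1))
    elpd_exact += log_norm_pdf(yi, mean_minus_i, pred_sd)

print(round(elpd_is, 2), round(elpd_exact, 2))
```

The two totals should agree closely here because no single observation is influential. When an observation is influential, the raw weights become heavy-tailed, which is the situation the Pareto-smoothing step and the Pareto-$k$ diagnostic are designed for.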

For an econometrician, ELPD is the Bayesian analogue of the out-of-sample log-likelihood that motivates AIC. In fact, AIC can be interpreted as an asymptotic approximation to LOO-CV. The difference is that LOO-CV makes no asymptotic assumptions, fully accounts for the prior, and works correctly even when the effective number of parameters is ambiguous.

3. Reading the ELPD Output

When you run az.loo() in ArviZ (or access LOO diagnostics through an Abacus model), the output reports several quantities that deserve careful interpretation.

The first is elpd_loo, the estimated expected log pointwise predictive density. This is a single number that summarises the model’s out-of-sample predictive performance. Higher values indicate better predictive accuracy; the value is usually negative, in which case higher means less negative. On its own, the absolute value of ELPD is not very informative, because it depends on the scale of the data and the number of observations. ELPD becomes useful only when you compare it across models fitted to the same data.

The second is p_loo, the effective number of parameters. This quantity captures the complexity of the model as measured by how much each observation influences its own prediction. A model with strong regularisation (tight priors, heavy pooling) will have a small $p_\text{loo}$ relative to its nominal parameter count, because the priors constrain the flexibility. A model with weak regularisation will have $p_\text{loo}$ closer to the nominal count. If $p_\text{loo}$ exceeds the nominal number of parameters, the model is likely misspecified or the PSIS approximation has broken down; check the Pareto-$k$ diagnostics before trusting the estimate.
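The usual definition makes the "influence on its own prediction" reading explicit: $p_\text{loo}$ is the gap between the in-sample log predictive density and the leave-one-out estimate. The notation $y_{-i}$ (the data without observation $i$) is introduced here for illustration.

$$
p_\text{loo} = \sum_{i=1}^{N} \log p(y_i \mid y) \; - \; \sum_{i=1}^{N} \log p(y_i \mid y_{-i})
$$

An observation that the model predicts much better when it is included in the fit than when it is held out contributes a large term to this sum.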

The third is the standard error of the ELPD estimate, which ArviZ's output labels se (se_elpd_loo is the name used by the R loo package). This is crucial for model comparison and is where many practitioners make errors. We address this in detail below.

4. Comparing Models: The ELPD Difference and Its Standard Error

Suppose you have fitted two models to the same dataset and computed ELPD for each. Model A has $\text{ELPD}_A = -320$ and Model B has $\text{ELPD}_B = -315$. Model B appears to predict better. But is the difference meaningful, or is it within noise?

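The function az.compare() in ArviZ ranks the models by ELPD and reports, for each model, its ELPD difference from the top-ranked model (the elpd_diff column) together with the standard error of that difference (the dse column). In the two-model case this amounts to $\Delta\text{ELPD} = \text{ELPD}_B - \text{ELPD}_A$ and its standard error. The standard error of the difference is computed from the pointwise ELPD differences (one per observation), so it automatically accounts for the correlation between the two models’ predictions.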

The interpretation is analogous to a classical hypothesis test. If $|\Delta\text{ELPD}|$ is large relative to its standard error (say, greater than 2 SE), you have reasonable evidence that one model predicts better than the other. If the difference is smaller than 2 SE, the models are indistinguishable in predictive performance, and you should prefer the simpler or more interpretable model on non-statistical grounds.
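The arithmetic behind the difference and its standard error is short enough to write out by hand. The sketch below uses hypothetical pointwise ELPD values for two models on the same eight observations and the common estimator $\sqrt{N \cdot \text{Var}(d_i)}$ for the standard error of the summed differences. Implementations differ on whether the variance uses an $N$ or $N - 1$ denominator, so the result may not match a given library to the last digit.

```python
import math

# Hypothetical pointwise ELPD contributions, one per observation, same data.
elpd_a = [-1.10, -0.95, -1.30, -1.05, -2.40, -0.90, -1.20, -1.00]
elpd_b = [-1.00, -0.90, -1.25, -1.00, -1.60, -0.95, -1.10, -0.98]

# Paired differences: pairing is what accounts for the correlation
# between the two models' predictions on each observation.
diffs = [b - a for a, b in zip(elpd_a, elpd_b)]
n = len(diffs)

elpd_diff = sum(diffs)
mean_diff = elpd_diff / n
var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)  # n - 1 is a choice
se_diff = math.sqrt(n * var_diff)

print(f"elpd_diff = {elpd_diff:.2f}, se = {se_diff:.2f}, ratio = {elpd_diff / se_diff:.2f}")
```

In this made-up example Model B is ahead by about 1.1 units of ELPD, but the difference is under 2 standard errors, so by the rule of thumb above the two models would be treated as predictively indistinguishable.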

However — and this is the critical caveat — the standard error of $\Delta\text{ELPD}$ is itself an estimate, and it can be unreliable when the pointwise ELPD differences are heavy-tailed. A handful of influential observations (outliers that one model handles much better than the other) can inflate the standard error dramatically, making a genuine difference look insignificant. Conversely, if both models fail on the same outliers in the same way, the standard error can be artificially small, making a meaningless difference look significant.

The practical recommendation is to always inspect the pointwise ELPD differences alongside the aggregate comparison. If a small number of observations drive most of the difference, investigate those observations individually before concluding that one model is superior.
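A minimal way to carry out this inspection is to rank observations by their contribution and see how much of the total survives without the largest one. The pointwise differences below are hypothetical (Model B minus Model A); in practice they would come from the pointwise output of your LOO computation.

```python
# Hypothetical pointwise ELPD differences (model B minus model A).
diffs = [0.10, 0.05, 0.05, 0.05, 0.80, -0.05, 0.10, 0.02]

total = sum(diffs)

# Rank observation indices by the absolute size of their contribution.
ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]), reverse=True)
top = ranked[0]
total_without_top = total - diffs[top]

print(f"total difference:       {total:.2f}")
print(f"largest contributor:    observation {top} ({diffs[top]:+.2f})")
print(f"difference without it:  {total_without_top:.2f}")
```

Here a single observation accounts for most of the apparent advantage, which is the pattern that warrants looking at that observation before declaring a winner.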

5. Pareto-k Diagnostics: When to Trust the Approximation

The PSIS-LOO approximation relies on importance sampling, and importance sampling can fail when individual observations are highly influential — that is, when removing a single observation substantially changes the posterior. The Pareto-$k$ diagnostic measures this influence for each observation.

For an econometrician, Pareto-$k$ plays a role analogous to Cook’s distance or leverage in OLS diagnostics. A high-leverage observation in OLS disproportionately influences the coefficient estimates. A high Pareto-$k$ observation in LOO-CV disproportionately influences the ELPD estimate, and the importance sampling approximation for that observation may be unreliable.

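The conventional thresholds are straightforward. Pareto-$k$ values below 0.7 indicate that the PSIS approximation is reliable for that observation (values below 0.5 are better still). Values between 0.7 and 1.0 indicate that the estimate for that observation is unreliable: the importance weights are so heavy-tailed that the approximation converges very slowly and may be biased. Values above 1.0 indicate that the importance sampling approximation has broken down entirely for that observation. In either of the last two cases, the reported ELPD should not be trusted until the flagged observations have been dealt with.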
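A small helper for tallying diagnostics against the 0.7 and 1.0 thresholds might look like the sketch below. The function name, the labels, and the example values are hypothetical; in practice the $k$ values would come from the pointwise LOO output.

```python
from collections import Counter


def classify_pareto_k(k: float) -> str:
    """Bin a Pareto-k value using the 0.7 and 1.0 thresholds."""
    if k < 0.7:
        return "ok"
    if k <= 1.0:
        return "bad"
    return "very bad"


# Hypothetical Pareto-k values, one per observation.
pareto_k = [0.12, 0.35, 0.08, 0.74, 0.51, 1.30, 0.22, 0.66]

counts = Counter(classify_pareto_k(k) for k in pareto_k)
flagged = [i for i, k in enumerate(pareto_k) if k >= 0.7]

print(dict(counts))
print("observations to investigate:", flagged)
```

Keeping the indices of the flagged observations, and not just the counts, makes it easier to investigate why those points are influential.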

When you encounter high Pareto-$k$ values, several remedies are available. The simplest is moment matching, which improves the approximation for problematic observations without refitting the model; it is implemented in the R loo package, and you should check whether your ArviZ version offers it. If that fails or is unavailable, you can refit the model with the offending observations actually held out (exact LOO-CV for those points only). More fundamentally, high Pareto-$k$ values often signal that the model is misspecified for those observations: perhaps they are genuine outliers, or the model’s functional form fails in that region of the data. Investigating why specific observations are influential is often more valuable than fixing the diagnostic.

6. Posterior Predictive Checks: The Bayesian Goodness-of-Fit Test

ELPD and LOO-CV are relative metrics: they tell you which model predicts better, but they cannot tell you whether any of your models predict well in an absolute sense. For that, you need posterior predictive checks.

The idea is simple. Once you have fitted a model, you generate simulated datasets from the posterior predictive distribution — that is, you sample parameter values from the posterior and then simulate new data from the likelihood. You then compare the distribution of these simulated datasets to the observed data. If the simulations look like the real data, the model is capturing the key features of the data-generating process. If not, the model is missing something important.
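The procedure can be sketched with a toy model. The example assumes a normal likelihood with the noise standard deviation fixed at the sample value and a flat prior on the mean, so that posterior draws can be generated without an MCMC sampler. The weekly sales figures are hypothetical, and the check statistic (the largest weekly value) is just one of many you could choose.

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical weekly sales with one promotional spike.
y_obs = [102, 98, 105, 99, 101, 97, 103, 100, 180, 99, 104, 96]
n = len(y_obs)
y_bar = statistics.mean(y_obs)
s = statistics.stdev(y_obs)  # plug-in noise sd (simplifying assumption)


def replicate():
    """Draw a mean from its approximate posterior, then simulate a dataset."""
    mu = random.gauss(y_bar, s / math.sqrt(n))
    return [random.gauss(mu, s) for _ in range(n)]


# Check statistic: the largest weekly value.
t_obs = max(y_obs)
t_rep = [max(replicate()) for _ in range(2000)]

# Share of simulated datasets whose maximum is at least as large as observed.
p_value = sum(t >= t_obs for t in t_rep) / len(t_rep)
print(f"observed max = {t_obs}, posterior predictive p-value = {p_value:.3f}")
```

A p-value close to 0 or 1 means the simulated datasets rarely reproduce that feature of the observed data. Here the symmetric normal model almost never generates a spike as large as the observed one, which suggests the spike needs its own explanation, such as a promotion indicator.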

For an econometrician, posterior predictive checks are the Bayesian analogue of residual diagnostics, but more powerful. Instead of checking whether residuals are normally distributed or homoscedastic, you can check any feature of the data. Does the model reproduce the seasonal pattern? Does it capture the right degree of week-to-week volatility? Does the distribution of simulated total annual sales match the observed total? Each of these questions becomes a visual or numerical comparison between the real data and the posterior predictive distribution.

The key advantage over classical residual analysis is that posterior predictive checks incorporate parameter uncertainty. Classical residuals are computed at the point estimate, which can mask model deficiencies when the standard errors are large. Posterior predictive simulations are drawn from the full posterior, so they honestly reflect how much the model’s predictions could vary even if the model is correctly specified.

In practice, we recommend running posterior predictive checks before computing ELPD or comparing models. If the posterior predictive distribution fails to reproduce basic features of the data (the mean, the variance, the seasonal pattern), the model is misspecified at a fundamental level, and comparing its ELPD to another model’s ELPD is an exercise in choosing the least bad option rather than selecting a good model.

7. When Model Comparison Is Meaningful and When It Is Not

Not all model comparisons are informative, and econometricians should exercise the same caution here that they would when comparing nested versus non-nested classical specifications.

ELPD comparisons are meaningful when the two models are fitted to exactly the same dataset, with exactly the same observations and the same target variable. If one model drops missing values differently, or transforms the target variable (e.g., one model predicts $y$ and the other predicts $\log(y)$), the ELPD values are on different scales and cannot be compared directly. This is analogous to the well-known prohibition against comparing $R^2$ across models with different dependent variables in classical econometrics.
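The scale mismatch in the $y$ versus $\log(y)$ case follows from the change-of-variables formula. If a model is specified for $z = \log y$ with $y > 0$, its implied density for $y$ picks up a Jacobian term. The subscripts $Y$ and $Z$ are notation introduced here, and the adjustment is a standard result, not something the tools discussed above apply automatically.

$$
\log p_Y(y_i) = \log p_Z(\log y_i) - \log y_i \quad \Longrightarrow \quad \text{ELPD}_{y} = \text{ELPD}_{\log y} - \sum_{i=1}^{N} \log y_i
$$

Only after the log-scale model's pointwise values have been shifted by $-\log y_i$ are the two ELPDs statements about the same quantity.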

ELPD comparisons are also meaningful only when the Pareto-$k$ diagnostics are acceptable for both models. If one model has many observations with Pareto-$k$ above 0.7, its ELPD estimate is unreliable, and the comparison is confounded by approximation error rather than genuine predictive differences.

ELPD comparisons are less informative when the models differ in ways that do not affect prediction but do affect causal interpretation. Two models might produce nearly identical ELPD values — predicting sales equally well out of sample — while attributing completely different proportions of sales to TV versus search. This is the identification problem discussed in the causal identification FAQ: predictive equivalence does not imply causal equivalence. A model that attributes 30% of sales to TV and 10% to search might predict just as well as a model that attributes 20% to each, because the total media contribution is the same. ELPD cannot distinguish between these models, because it evaluates prediction, not attribution.
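A stylised example shows how this can happen. The sketch assumes an extreme case in which TV and search spend move in perfect lockstep, with a linear response and a fixed baseline. All numbers are hypothetical, and real media data are rarely this collinear, but the mechanism is the same when the correlation is merely high.

```python
import math

# Hypothetical weekly spend; TV and search are perfectly collinear here.
tv = [10.0, 12.0, 8.0, 15.0, 11.0]
search = list(tv)
baseline = 100.0


def predict(beta_tv, beta_search):
    """Linear sales prediction for given channel coefficients."""
    return [baseline + beta_tv * t + beta_search * s for t, s in zip(tv, search)]


pred_1 = predict(3.0, 1.0)  # most of the media effect credited to TV
pred_2 = predict(2.0, 2.0)  # media effect split evenly

same_predictions = all(math.isclose(a, b) for a, b in zip(pred_1, pred_2))
tv_share_1 = 3.0 / (3.0 + 1.0)
tv_share_2 = 2.0 / (2.0 + 2.0)

print("identical predictions:", same_predictions)
print("TV share of media effect:", tv_share_1, "vs", tv_share_2)
```

Because the predictions are identical, every pointwise predictive density is identical too, so no predictive criterion can separate the two attributions.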

For this reason, we recommend treating ELPD as a necessary but not sufficient criterion for model selection. Use it to eliminate models that predict poorly. Use posterior predictive checks to verify that the surviving models capture the essential features of the data. Then use substantive economic reasoning, lift test calibration, and domain expertise to choose among predictively equivalent models based on the plausibility of their causal attributions.

8. A Practical Mapping from Classical to Bayesian Model Selection

To consolidate the discussion, here is how each classical tool maps to its Bayesian replacement.

Adjusted $R^2$ measures in-sample fit penalised by the number of parameters. The Bayesian analogue is the Bayesian $R^2$ proposed by Gelman, Goodrich, Gabry, and Vehtari (2019), which computes $R^2$ from the posterior predictive distribution rather than a point estimate. Unlike classical adjusted $R^2$, Bayesian $R^2$ comes with a full distribution (one value per posterior draw), so you can report its uncertainty. It carries no explicit penalty for the parameter count; the regularisation comes from the prior.
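The per-draw calculation can be sketched as follows, using the residual-based form: for each posterior draw, the variance of the fitted values divided by the sum of that variance and the variance of the residuals. The observations and the three "posterior draws" of fitted values are hypothetical; a real analysis would use thousands of draws from the fitted model.

```python
import statistics

y = [10.0, 12.0, 9.0, 14.0, 11.0]

# Hypothetical posterior draws of the fitted values (one list per draw).
fitted_draws = [
    [10.2, 11.5, 9.4, 13.6, 11.1],
    [9.8, 12.3, 9.1, 13.9, 10.7],
    [10.5, 11.8, 8.8, 14.2, 11.3],
]


def bayes_r2(y_obs, fitted):
    """Residual-based R^2 for a single posterior draw of the fitted values."""
    resid = [yi - fi for yi, fi in zip(y_obs, fitted)]
    var_fit = statistics.variance(fitted)
    var_res = statistics.variance(resid)
    return var_fit / (var_fit + var_res)


r2_draws = [bayes_r2(y, f) for f in fitted_draws]
print([round(r, 3) for r in r2_draws])
```

Because the denominator adds the two variances, each draw's value lies between 0 and 1 by construction, and the spread across draws is the uncertainty you report.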

AIC measures asymptotic out-of-sample predictive performance. The Bayesian analogue is ELPD estimated via PSIS-LOO. ELPD is more general (no asymptotic assumptions), fully accounts for the prior, and handles hierarchical models correctly.

BIC targets model identification rather than prediction (it is consistent for the true model as $N \to \infty$). ELPD does not serve this purpose, because it is inherently predictive. The closest Bayesian counterpart is the Bayes factor; BIC itself arises as a large-sample approximation to the log marginal likelihood on which Bayes factors are built. If you want to identify the “true” model in a Bayesian framework, you would use Bayes factors, but they are sensitive to the prior specification in ways that ELPD is not, and we do not generally recommend them for MMM applications.

The likelihood ratio test compares nested models by examining whether the additional parameters significantly improve the likelihood. The Bayesian replacement is the ELPD difference with its standard error. If the ELPD difference exceeds roughly 2 standard errors, the more complex model predicts meaningfully better. If not, prefer the simpler model.

Classical residual diagnostics (Durbin-Watson, Breusch-Pagan, Q-Q plots) check model assumptions after fitting. The Bayesian replacement is posterior predictive checking, which is more flexible (you can check any data feature, not just residual properties) and more honest (it incorporates parameter uncertainty).

In every case, the Bayesian tool is at least as informative as its classical counterpart and often more so. The cost is unfamiliarity. We hope this document has reduced that cost.