MCMC Diagnostics

If you have spent your career reading Stata output — coefficient tables, standard errors, t-statistics, p-values, and the occasional Durbin-Watson statistic — then your first encounter with MCMC output will feel disorienting. There are no p-values. There is no single “estimate.” Instead, there are thousands of draws from something called a posterior distribution, accompanied by diagnostics you have never seen: R-hat, ESS, divergences, trace plots. This document maps every one of these concepts back to something you already understand, so you can read Bayesian output with the same confidence you bring to a regression table.

1. What the Sampler Actually Does

In classical econometrics, estimation is an optimisation problem. You write down a likelihood function and find the parameter values that maximise it (MLE) or minimise a loss function (OLS, GMM). The result is a single point estimate for each parameter, and the standard errors come from the curvature of the likelihood at that point (the inverse of the information matrix).

In Bayesian estimation, we do not optimise. We integrate. The goal is to characterise the entire posterior distribution — the full landscape of parameter values that are consistent with both the data and the prior. For most models of practical interest, this integral has no closed-form solution. We cannot write down a formula for the posterior the way you can write down the OLS estimator $\hat{\beta} = (X'X)^{-1}X'y$.

MCMC (Markov Chain Monte Carlo) solves this problem by constructing a random walk through the parameter space. At each step, the sampler proposes a new set of parameter values, evaluates how well they fit the data (the likelihood) and the prior, and decides whether to accept or reject the proposal. An accepted proposal becomes the next draw; after a rejection, the current values are recorded again. After enough steps, the resulting sequence of draws (the “chain”) converges to a representative sample from the posterior distribution.
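The propose, evaluate, accept-or-reject loop can be sketched in a few lines. This is a minimal random-walk Metropolis sampler, the simplest member of the MCMC family, and not the NUTS algorithm described in this document. The target is a toy standard normal, and the names `log_posterior` and `metropolis` are illustrative assumptions.

```python
import math
import random


def log_posterior(theta):
    """Unnormalised log-density of a standard normal, a toy stand-in for a real posterior."""
    return -0.5 * theta * theta


def metropolis(log_post, start, n_steps, step_size, seed=0):
    """Random-walk Metropolis: propose a move, then accept it or stay put."""
    rng = random.Random(seed)
    current, current_lp = start, log_post(start)
    chain, n_accepted = [], 0
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step_size)
        proposal_lp = log_post(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
            n_accepted += 1
        chain.append(current)  # a rejected proposal repeats the current value
    return chain, n_accepted / n_steps


# Deliberately start far from the mode, then drop the early draws.
chain, accept_rate = metropolis(log_posterior, start=5.0, n_steps=20000, step_size=1.0)
kept = chain[2000:]
post_mean = sum(kept) / len(kept)
post_var = sum((x - post_mean) ** 2 for x in kept) / (len(kept) - 1)
print(f"acceptance {accept_rate:.2f}, mean {post_mean:.2f}, variance {post_var:.2f}")
```

The retained draws should have a mean near 0 and a variance near 1, matching the toy target, even though the chain started at 5.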

The specific algorithm used in Abacus and PyMC is called NUTS (the No-U-Turn Sampler), a variant of Hamiltonian Monte Carlo (HMC). Think of it as a physics simulation: the sampler treats the negative log-posterior as a potential energy surface and launches a particle across it with a random momentum. The particle rolls downhill towards regions of high posterior density, and its momentum carries it back uphill into regions of lower density, so the tails are visited as well as the mode. NUTS automatically tunes the trajectory length so the particle explores efficiently without doubling back on itself.

The critical point for an econometrician: the output of this process is not a single number. It is a collection of, say, 4,000 parameter vectors (2 chains × 2,000 draws each). Every summary statistic you will ever compute — the mean, the median, credible intervals, the probability that a coefficient exceeds zero — derives from this collection of draws.
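To make "every summary derives from the draws" concrete, here is a small sketch. The draws are simulated from a normal distribution as a stand-in for real sampler output, and the coefficient values and the `quantile` helper are assumptions for illustration only.

```python
import random
import statistics

rng = random.Random(42)
# Hypothetical stand-in for sampler output: 2 chains x 2,000 draws of one coefficient.
chains = [[rng.gauss(0.30, 0.10) for _ in range(2000)] for _ in range(2)]
draws = [x for chain in chains for x in chain]  # pool the chains: 4,000 draws


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


ordered = sorted(draws)
summary = {
    "mean": statistics.fmean(draws),
    "median": statistics.median(draws),
    "sd": statistics.stdev(draws),  # plays the role of a standard error
    "q2.5": quantile(ordered, 0.025),  # equal-tailed 95% interval, lower end
    "q97.5": quantile(ordered, 0.975),  # equal-tailed 95% interval, upper end
}
for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")
```

Nothing here requires a formula for the posterior. Each number is an ordinary descriptive statistic of the pooled draws.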

2. Trace Plots: The First Thing to Check

A trace plot displays the sampled values of a single parameter across the iterations of the chain. The horizontal axis represents the iteration number. The vertical axis represents the parameter value. If everything has gone well, the trace plot looks like a “fuzzy caterpillar” — a dense, stationary band of values oscillating around a stable mean with no visible trends, steps, or sticky regions.

If you are an econometrician, think of the trace plot as the time-series plot of an MCMC residual. You want it to look like white noise. Specifically, you want three properties.

The first is stationarity. The chain should not drift upward or downward over time. If you see a clear trend, the chain has not converged: the sampler is still searching for the high-density region of the posterior, and the draws from the early part of the chain are not representative. This is analogous to simulating an AR(1) process from an arbitrary starting value: the early observations still reflect the starting point rather than the stationary distribution.

The second is good mixing. The chain should move rapidly across the full support of the posterior. If you see long stretches where the chain gets “stuck” at a particular value before jumping to another region, the sampler is struggling to explore the parameter space. Poor mixing inflates your effective standard errors, just as strong autocorrelation in a time series reduces the effective information content of the data.

The third is agreement across chains. If you run multiple independent chains (and you always should — Abacus defaults to at least two), they should all settle into the same band. If one chain is exploring a different region of the parameter space from the others, the model has not converged, and you cannot trust any summary statistics.
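The first two properties can also be checked numerically, as a complement to looking at the plot. The sketch below simulates AR(1) series as stand-ins for chains and computes two rough indicators. The helper names and the half-versus-half comparison are illustrative assumptions, not diagnostics reported by Abacus or ArviZ.

```python
import random
import statistics


def ar1_chain(n, rho, start, seed):
    """Simulate x[t] = rho * x[t-1] + noise, scaled so the stationary variance is 1."""
    rng = random.Random(seed)
    scale = (1.0 - rho * rho) ** 0.5
    x, out = start, []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, scale)
        out.append(x)
    return out


def drift(chain):
    """Second-half mean minus first-half mean, in units of the chain's standard deviation."""
    half = len(chain) // 2
    gap = statistics.fmean(chain[half:]) - statistics.fmean(chain[:half])
    return gap / statistics.stdev(chain)


def lag1_autocorr(chain):
    """Lag-1 autocorrelation: near 0 suggests fast mixing, near 1 a sticky chain."""
    mean = statistics.fmean(chain)
    num = sum((a - mean) * (b - mean) for a, b in zip(chain, chain[1:]))
    den = sum((a - mean) ** 2 for a in chain)
    return num / den


healthy = ar1_chain(2000, rho=0.2, start=0.0, seed=1)
trending = ar1_chain(2000, rho=0.999, start=30.0, seed=2)  # bad start, very slow decay

for name, chain in [("healthy", healthy), ("trending", trending)]:
    print(f"{name}: drift {drift(chain):+.2f}, lag-1 autocorrelation {lag1_autocorr(chain):.3f}")
```

The healthy series shows drift near zero and low autocorrelation. The trending series shows a large negative drift and autocorrelation close to 1, which is what a sloping, sticky trace plot looks like in numbers.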

3. R-hat: The Convergence Diagnostic

R-hat ($\hat{R}$) is the single most important diagnostic number in Bayesian computation. It measures whether multiple independent chains have converged to the same distribution.

The intuition is straightforward. R-hat compares the variance of a parameter within each chain to the variance of the same parameter across chains. If all chains are sampling from the same distribution, these two variances should be roughly equal, and R-hat should be close to 1.0. If the chains disagree — one chain has settled around 0.5 while another has settled around 2.3 — the between-chain variance will be large relative to the within-chain variance, and R-hat will be substantially greater than 1.0.

For an econometrician, R-hat needs little translation: it is the Gelman-Rubin statistic in its modern split-chain formulation. The thresholds are conventional. R-hat below 1.01 is considered safe. Values between 1.01 and 1.05 warrant caution, and anything higher should be treated as a failure to converge. Above 1.1, the chains have clearly not converged, and you should not interpret the results.
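The within-versus-between comparison can be written out directly. This is a sketch of the basic split-chain formula only. ArviZ reports a rank-normalised refinement, so its numbers will not match this function exactly. The simulated chains are assumptions for illustration.

```python
import random
import statistics


def split_rhat(chains):
    """Basic split R-hat: Gelman-Rubin applied to half-chains, no rank normalisation."""
    halves = []
    for chain in chains:
        half = len(chain) // 2
        halves.append(chain[:half])
        halves.append(chain[half:2 * half])
    n = len(halves[0])
    within = statistics.fmean([statistics.variance(h) for h in halves])  # W
    between = n * statistics.variance([statistics.fmean(h) for h in halves])  # B
    pooled = (n - 1) / n * within + between / n  # overall variance estimate
    return (pooled / within) ** 0.5


rng = random.Random(0)
# Two chains sampling the same distribution.
agreeing = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(2)]
# One chain settled around 0.5, the other around 2.3.
disagreeing = [
    [rng.gauss(0.5, 0.5) for _ in range(2000)],
    [rng.gauss(2.3, 0.5) for _ in range(2000)],
]

rhat_good = split_rhat(agreeing)
rhat_bad = split_rhat(disagreeing)
print(f"agreeing chains: {rhat_good:.3f}, disagreeing chains: {rhat_bad:.3f}")
```

Splitting each chain in half means a single chain that drifts over time also inflates the statistic, because its two halves disagree with each other.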

When R-hat is too high, the remedy is usually to run the sampler for more iterations (increase tune and draws), reparameterise the model (e.g., use non-centred parameterisations for hierarchical models), or simplify the model.

4. Effective Sample Size (ESS): Your True Degrees of Freedom

The sampler produces, say, 4,000 draws. But consecutive draws are autocorrelated — each draw is a small perturbation of the previous one. The effective sample size (ESS) measures how many independent draws your 4,000 autocorrelated draws are actually worth.

If you are an econometrician, you already understand this concept. It is closely analogous to the Newey-West correction for autocorrelated errors in time-series regression. When your regression residuals are positively autocorrelated, the “effective” number of independent observations is smaller than the nominal sample size $N$, and your standard errors are too small if you ignore the autocorrelation. ESS makes the same kind of adjustment for MCMC draws: the Monte Carlo error of a posterior mean shrinks with the square root of ESS, not of the raw number of draws.
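A rough version of the calculation divides the number of draws by one plus twice the sum of the autocorrelations. The sketch below uses a simple truncation rule (stop at the first non-positive autocorrelation) on a single chain. Production implementations such as ArviZ use more careful multi-chain estimators, so treat this as an illustration of the idea and not a replacement.

```python
import random
import statistics


def ar1_chain(n, rho, seed):
    """Simulate an AR(1) series with stationary variance 1 as a stand-in for a chain."""
    rng = random.Random(seed)
    scale = (1.0 - rho * rho) ** 0.5
    x, out = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, scale)
        out.append(x)
    return out


def effective_sample_size(chain, max_lag=500):
    """ESS = N / (1 + 2 * sum of autocorrelations), truncated at the first non-positive lag."""
    n = len(chain)
    mean = statistics.fmean(chain)
    centred = [x - mean for x in chain]
    var = sum(c * c for c in centred) / n
    tau = 1.0  # integrated autocorrelation time
    for lag in range(1, min(max_lag, n - 1) + 1):
        acov = sum(centred[i] * centred[i + lag] for i in range(n - lag)) / n
        rho = acov / var
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return n / tau


independent = ar1_chain(4000, rho=0.0, seed=11)
sticky = ar1_chain(4000, rho=0.9, seed=12)

ess_independent = effective_sample_size(independent)
ess_sticky = effective_sample_size(sticky)
print(f"independent draws: ESS about {ess_independent:.0f} of 4000")
print(f"autocorrelation 0.9: ESS about {ess_sticky:.0f} of 4000")
```

For an AR(1) series with autocorrelation 0.9, theory gives 4000 × (1 − 0.9) / (1 + 0.9), or about 210 effective draws. The estimate from a finite chain will be noisy around that value.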

There are two flavours of ESS reported in PyMC and ArviZ output. ESS-bulk measures the effective sample size in the centre of the posterior distribution (around the mean and median). ESS-tail measures the effective sample size in the tails (relevant for credible interval estimation). Both matter.

The practical threshold is simple: you want ESS-bulk and ESS-tail both above 400 for reliable inference. Below 400, your posterior summaries are noisy — the mean might be reasonable, but the 95% credible interval endpoints could shift substantially if you re-ran the sampler. Below 100, the results are unreliable and should not be reported.

When ESS is too low, the remedies are to increase the number of draws or to improve the model parameterisation. Thinning the chains is sometimes suggested, but it only discards draws: it reduces storage and the autocorrelation between retained draws without increasing ESS, so more draws is almost always preferable.

5. Divergences: The Red Flag You Must Not Ignore

A divergence is an event during sampling where the NUTS trajectory encounters a region of the posterior that changes so sharply that the numerical integration breaks down. The sampler detects that its simulated particle has deviated from the true Hamiltonian trajectory and flags the draw.
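The breakdown of the numerical integration is easy to reproduce on a toy problem. The sketch below runs a leapfrog integrator (the scheme HMC-type samplers use to simulate trajectories) on a one-dimensional Gaussian potential and measures how far the total energy drifts. The threshold of 1000 and all names are assumptions chosen for illustration, not the values used by any particular sampler.

```python
def leapfrog_energy_error(sigma, step_size, n_steps):
    """Simulate a particle on U(q) = q^2 / (2 sigma^2); return the absolute energy error."""

    def grad_u(q):
        return q / sigma ** 2

    def energy(q, p):
        return 0.5 * q * q / sigma ** 2 + 0.5 * p * p

    q, p = sigma, 1.0  # start one standard deviation from the mode
    h_start = energy(q, p)
    for _ in range(n_steps):
        p -= 0.5 * step_size * grad_u(q)  # half step for momentum
        q += step_size * p                # full step for position
        p -= 0.5 * step_size * grad_u(q)  # half step for momentum
    return abs(energy(q, p) - h_start)


ENERGY_ERROR_LIMIT = 1000.0  # hypothetical cut-off for flagging a divergence

# Gentle curvature: the step size is small relative to the posterior scale.
stable = leapfrog_energy_error(sigma=1.0, step_size=0.1, n_steps=20)
# Sharp curvature: the same step size is now ten times the posterior scale.
unstable = leapfrog_energy_error(sigma=0.01, step_size=0.1, n_steps=20)

for name, err in [("wide posterior", stable), ("narrow posterior", unstable)]:
    flag = "DIVERGENT" if err > ENERGY_ERROR_LIMIT else "ok"
    print(f"{name}: energy error {err:.3g} -> {flag}")
```

An exact trajectory conserves energy, so a large energy error means the simulated particle has left the true path. In a funnel-shaped posterior, the narrow neck plays the role of the small `sigma` case.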

For an econometrician, think of a divergence as the Bayesian equivalent of a near-singular Hessian in MLE optimisation. When the likelihood surface has extremely steep ridges or sharp funnels, the MLE optimiser either fails to converge or converges to a local maximum. In MCMC, the analogous pathology manifests as divergences.

Divergences are not merely a computational nuisance. They indicate that the sampler has failed to explore some region of the posterior, which means the resulting draws are a biased sample from the true posterior. Even a handful of divergences can systematically exclude an important region of the parameter space, leading to overconfident and potentially wrong inference.

The practical rule is unforgiving: zero divergences is the target. A small number (fewer than 10 out of 4,000 draws) may be tolerable if they are scattered rather than clustered in a particular region of the parameter space. Divergences that occur during warmup matter less, because warmup draws are discarded. But if you see hundreds of divergences, the model is misspecified or poorly parameterised, and no amount of additional sampling will fix the problem.

The most common remedies are increasing target_accept (the target acceptance probability for NUTS; a higher target forces the sampler to adapt to a smaller step size), reparameterising the model (switching from a centred to a non-centred parameterisation for hierarchical priors), or simplifying the model to remove the pathological geometry.
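The centred versus non-centred switch is only a change of variables, which the sketch below illustrates with plain simulation. The values of `mu` and `tau` are hypothetical, and a real model would express this in the modelling library and not by hand.

```python
import random
import statistics

rng = random.Random(7)
mu, tau = 2.0, 0.5  # hypothetical group-level mean and standard deviation

# Centred: draw the group effect directly from Normal(mu, tau).
centred = [rng.gauss(mu, tau) for _ in range(20000)]

# Non-centred: draw a standardised offset z ~ Normal(0, 1), then shift and scale it.
z = [rng.gauss(0.0, 1.0) for _ in range(20000)]
non_centred = [mu + tau * zi for zi in z]

for name, draws in [("centred", centred), ("non-centred", non_centred)]:
    print(f"{name}: mean {statistics.fmean(draws):.3f}, sd {statistics.stdev(draws):.3f}")
```

Both versions imply the same distribution for the group effect. The difference is what the sampler has to explore: in the non-centred form it works with `z`, whose scale does not change when `tau` becomes small, which removes the narrow funnel neck that tends to trigger divergences.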

6. Posterior Credible Intervals Replace Confidence Intervals

In classical econometrics, a 95% confidence interval means: “If we repeated this experiment infinitely many times and constructed an interval each time, 95% of those intervals would contain the true parameter.” Crucially, it does not mean that there is a 95% probability that the true parameter lies in this particular interval. The true parameter is fixed. The interval is random.

A 95% Bayesian credible interval means exactly what you always wished the confidence interval meant: “Given the data and the model, there is a 95% probability that the parameter lies in this interval.” The parameter is treated as a random variable (with a posterior distribution), and the interval directly quantifies our uncertainty about its value.

The Highest Density Interval (HDI), which Abacus and ArviZ report by default, is a specific type of credible interval: the narrowest interval that contains 95% (or 94%, the ArviZ default) of the posterior mass. For symmetric, unimodal posteriors, the HDI coincides with the equal-tailed credible interval. For skewed posteriors (common for variance parameters or media effects bounded at zero), the HDI is narrower and more informative.
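The difference between the two interval types shows up clearly on a skewed example. The sketch below computes both from draws, assuming a unimodal posterior. The exponential draws are a stand-in for a parameter bounded at zero, and the function names are illustrative and not part of any library.

```python
import math
import random


def hdi(draws, mass=0.94):
    """Narrowest interval containing `mass` of the draws (assumes a unimodal posterior)."""
    ordered = sorted(draws)
    n = len(ordered)
    window = max(1, math.ceil(mass * n))  # number of draws inside the interval
    start = min(range(n - window + 1), key=lambda i: ordered[i + window - 1] - ordered[i])
    return ordered[start], ordered[start + window - 1]


def equal_tailed(draws, mass=0.94):
    """Interval that leaves the same share of draws in each tail."""
    ordered = sorted(draws)
    n = len(ordered)
    tail = (1.0 - mass) / 2.0
    return ordered[int(tail * (n - 1))], ordered[round((1.0 - tail) * (n - 1))]


rng = random.Random(5)
skewed = [rng.expovariate(1.0) for _ in range(4000)]  # right-skewed, bounded at zero

hdi_lo, hdi_hi = hdi(skewed)
et_lo, et_hi = equal_tailed(skewed)
print(f"94% HDI:          [{hdi_lo:.3f}, {hdi_hi:.3f}]  width {hdi_hi - hdi_lo:.3f}")
print(f"94% equal-tailed: [{et_lo:.3f}, {et_hi:.3f}]  width {et_hi - et_lo:.3f}")
```

On these draws the HDI starts essentially at zero, where the density is highest, while the equal-tailed interval gives up that dense region to cover more of the long right tail.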

7. Mapping Bayesian Output to Classical Hypothesis Testing

Econometricians are trained to ask: “Is this coefficient statistically significant?” In Bayesian inference, the question is reframed as: “What is the probability that this coefficient exceeds (or falls below) a particular threshold?”

The mapping is approximate but useful. When a 94% HDI for a media coefficient excludes zero, meaning the entire interval lies above zero, this is the Bayesian analogue of rejecting the null hypothesis at roughly the 6% significance level. When a 90% HDI excludes zero, the analogy is rejection at the 10% level. The correspondence is closest when the priors are weak and the posterior is roughly symmetric.

But Bayesian inference offers richer answers than a binary significant/not-significant verdict. You can compute the exact posterior probability that the coefficient exceeds zero: $P(\beta > 0 \mid \text{data})$. If this probability is 0.98, you have strong evidence that the media channel has a positive effect. If it is 0.62, you have weak and inconclusive evidence. The posterior probability gives you a continuous measure of evidential strength, not a binary decision forced by an arbitrary 5% threshold.

You can also compute the posterior probability that the coefficient exceeds a practically meaningful threshold. “Is there at least a 90% probability that the ROI for TV exceeds 1.0?” is a more useful question for a media planner than “Is the TV coefficient significantly different from zero?” Bayesian inference answers the first question naturally.
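Both kinds of probability statement are just proportions of draws. The sketch below uses simulated draws as stand-ins for a media coefficient and a TV ROI. The distributions and their parameters are assumptions for illustration.

```python
import random

rng = random.Random(3)
# Hypothetical posterior draws for a media coefficient and for a TV ROI.
beta_draws = [rng.gauss(0.20, 0.10) for _ in range(4000)]
roi_draws = [rng.lognormvariate(0.40, 0.25) for _ in range(4000)]


def prob_exceeds(draws, threshold):
    """Posterior probability that the parameter exceeds `threshold`."""
    return sum(d > threshold for d in draws) / len(draws)


p_beta_positive = prob_exceeds(beta_draws, 0.0)
p_roi_above_one = prob_exceeds(roi_draws, 1.0)

print(f"P(beta > 0 | data) = {p_beta_positive:.3f}")
print(f"P(ROI > 1 | data)  = {p_roi_above_one:.3f}")
print("At least 90% probability that ROI exceeds 1.0:", p_roi_above_one >= 0.90)
```

Changing the threshold costs nothing: the same draws answer "greater than zero", "greater than 1.0", or any other business-relevant cut-off.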

8. A Diagnostic Checklist

When you receive MCMC output from an Abacus model run, work through the following checks in order.

Start with R-hat. Examine R-hat for every parameter. If any R-hat exceeds 1.01, stop. The chains have not converged, and every downstream summary is unreliable. Increase tune and draws, or investigate the model parameterisation.

Next, check for divergences. If the sampler reports more than a handful of divergences, the posterior geometry is pathological. Increase target_accept to 0.95 or 0.99. If divergences persist, the model likely needs reparameterisation or simplification.

Then examine ESS. Verify that ESS-bulk and ESS-tail exceed 400 for every parameter of interest. If ESS is low despite good R-hat, the chains are highly autocorrelated. Increase the number of draws.

Now inspect trace plots. Visually confirm that each chain looks like stationary white noise and that multiple chains overlap. Look for any sticky regions, trends, or bimodality.

Finally, interpret the posteriors. Report the posterior mean or median as your point estimate, the HDI as your interval estimate, and the posterior probability of exceeding zero (or any substantive threshold) as your measure of evidential strength.

Only after all four computational diagnostics pass — R-hat, divergences, ESS, and trace plots — should you proceed to interpret the substantive results. A Bayesian model with poor diagnostics is no more trustworthy than an OLS regression with autocorrelated residuals and a Durbin-Watson statistic of 0.4. The numbers may look plausible, but they are not reliable.
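The three numeric checks in this list can be wrapped in a small gatekeeping function. The sketch assumes the diagnostics have already been extracted into a plain dictionary. The keys mirror the column names used in ArviZ summary tables, while the function itself, the parameter names, and the example values are hypothetical. Trace plots still need a human eye.

```python
def check_diagnostics(summary, n_divergences, rhat_max=1.01, ess_min=400, max_divergences=0):
    """Return a list of failed checks in checklist order; an empty list means all passed."""
    problems = []
    # 1. R-hat for every parameter.
    for name, stats in summary.items():
        if stats["r_hat"] > rhat_max:
            problems.append(f"{name}: R-hat {stats['r_hat']:.3f} exceeds {rhat_max}")
    # 2. Divergences.
    if n_divergences > max_divergences:
        problems.append(f"{n_divergences} divergences (target is {max_divergences})")
    # 3. Bulk and tail effective sample size.
    for name, stats in summary.items():
        worst = min(stats["ess_bulk"], stats["ess_tail"])
        if worst < ess_min:
            problems.append(f"{name}: ESS {worst:.0f} is below {ess_min}")
    return problems


# Hypothetical diagnostics for two parameters.
summary = {
    "beta_tv": {"r_hat": 1.00, "ess_bulk": 1850, "ess_tail": 1420},
    "sigma_channel": {"r_hat": 1.04, "ess_bulk": 310, "ess_tail": 95},
}
problems = check_diagnostics(summary, n_divergences=3)
if problems:
    print("Do not interpret the posteriors yet:")
    for line in problems:
        print(" -", line)
else:
    print("Numeric diagnostics pass; now inspect the trace plots.")
```

The defaults follow the thresholds given in this document (R-hat at most 1.01, ESS at least 400, zero divergences) and can be relaxed explicitly when there is a reason to do so.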