This section collects longer-form answers to recurring MMM, Bayesian, and
panel-econometrics questions that come up when practitioners move from
classical econometrics to PanelMMM.
The pages are written for technical readers who already understand regression,
panel data, and causal inference, but want the Abacus framing.
This document addresses common concerns that econometricians have about Bayesian priors, reframes them using familiar econometric concepts, and discusses the practical trade-offs between “tight” and “loose” prior specifications in the context of Marketing Mix Modeling.
1. Are priors subjective? Don’t they bias the results?
This is the most common objection from econometricians. The short answer is: you are already using priors, you just call them something else.
Priors You Already Use in Classical Econometrics
Every constraint or modelling decision an econometrician makes is, mathematically, a prior belief imposed on the parameter space:
| Classical Econometric Practice | Bayesian Equivalent |
| --- | --- |
| “Media coefficients must be non-negative” (sign restriction) | A HalfNormal or truncated prior that places zero probability on negative values |
| “The intercept should be positive because sales can’t be negative” | A LogNormal prior on the intercept |
| Ridge regression (L2 penalty) | A Normal(0, sigma) prior on all coefficients, where sigma controls the penalty strength |
| LASSO regression (L1 penalty) | A Laplace(0, b) prior on all coefficients |
| Excluding a variable from the model entirely | An infinitely tight prior at exactly zero (a point mass) |
| Including a variable with no constraints | A uniform prior over $(-\infty, +\infty)$, the so-called “non-informative” prior |
The difference is not whether you impose assumptions, but whether you are explicit about them. In classical econometrics, these assumptions are hidden inside the model specification (variable selection, functional form, sign restrictions). In Bayesian modeling, they are declared openly as Prior objects, making them auditable, debatable, and reproducible.
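The ridge row of the table can be checked numerically. The sketch below is illustrative only: it uses synthetic data, assumes Gaussian noise with a known standard deviation, and the helper names are hypothetical. It shows that the ridge estimate with penalty $\sigma^2_{\text{noise}} / \sigma^2_{\text{prior}}$ coincides with the posterior mean under a Normal(0, $\sigma_{\text{prior}}$) prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small dataset: 40 observations, 3 regressors.
n, k = 40, 3
X = rng.normal(size=(n, k))
true_beta = np.array([0.5, 0.0, 1.5])
noise_sd = 1.0
y = X @ true_beta + rng.normal(scale=noise_sd, size=n)


def ridge_estimate(X, y, penalty):
    """Closed-form ridge solution: argmin ||y - Xb||^2 + penalty * ||b||^2."""
    n_coef = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_coef), X.T @ y)


def normal_prior_posterior_mean(X, y, noise_sd, prior_sd):
    """Posterior mean of b under b ~ Normal(0, prior_sd^2) with known noise_sd."""
    n_coef = X.shape[1]
    precision = X.T @ X / noise_sd**2 + np.eye(n_coef) / prior_sd**2
    return np.linalg.solve(precision, X.T @ y / noise_sd**2)


prior_sd = 0.5
penalty = noise_sd**2 / prior_sd**2  # ridge penalty implied by the prior

beta_ridge = ridge_estimate(X, y, penalty)
beta_bayes = normal_prior_posterior_mean(X, y, noise_sd, prior_sd)
print(np.allclose(beta_ridge, beta_bayes))  # True
```

A tighter prior (smaller `prior_sd`) maps to a larger penalty, which is the sense in which sigma "controls the penalty strength".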
Why “Letting the Data Speak” Is Itself a Prior
When a classical econometrician says “I let the data speak,” they are implicitly choosing a uniform (flat) prior: every parameter value from $-\infty$ to $+\infty$ is equally plausible before seeing the data. This sounds objective, but it has real consequences:
It assigns equal prior probability to a media ROI of 0.01 and a media ROI of 10,000,000.
In small samples (typical in marketing data: 100–200 weekly observations), this flat prior provides no regularization, leading to extreme, unstable coefficient estimates.
It is equivalent to running OLS with no penalty — which econometricians already know is fragile when $p$ is large relative to $N$.
A well-chosen weakly informative prior (e.g., HalfNormal(sigma=2) for media coefficients) does not “bias” the model. It says: “We believe media effects are positive and probably modest, but we are open to being surprised.” If the data strongly disagrees, the posterior will override the prior. If the data is ambiguous (as it often is with 150 weekly observations and 7 correlated media channels), the prior prevents the model from hallucinating absurd coefficient values.
2. How does Abacus specify priors?
In Abacus, priors are declared using Prior objects from the pymc_extras library. These are composable, hierarchical, and fully serializable. Here is a simple example:
    from pymc_extras.prior import Prior

    # A weakly informative prior for media channel betas:
    # "Media effects are positive, probably modest, but could be larger"
    beta_channel = Prior("HalfNormal", sigma=2)

    # A prior for the intercept:
    # "Baseline sales are positive and log-normally distributed"
    intercept = Prior("LogNormal", mu=0, sigma=5)

    # A prior for adstock decay:
    # "Carryover is moderate, skewed toward shorter decay"
    alpha = Prior("Beta", alpha=1, beta=3)
Each Prior object is a first-class citizen in the model configuration. It can be inspected, overridden, serialized to YAML, and version-controlled — unlike classical econometric constraints, which are typically buried in code or verbal documentation.
3. What is the difference between “tight” and “loose” priors?
This is one of the most consequential modelling decisions in Bayesian MMM. Two real-world configurations from our repositories illustrate the spectrum.
Tight Priors: The DSAMbayes Approach
In the DSAMbayes R/Stan library, tight priors are implemented via explicit boundary constraints on media coefficients:
What this does: Every media coefficient is hard-bounded to be non-negative. Combined with the package’s default priors (which are relatively concentrated), this creates a model that is strongly constrained. The data can move the coefficients within the allowed region, but the model will never produce a negative media effect.
Pros of tight priors:
Stability: Results are robust even with very small sample sizes (e.g., 52 weeks). The model cannot produce economically nonsensical results like “TV advertising reduces sales.”
Interpretability: Stakeholders can trust the sign and rough magnitude of every coefficient.
Convergence: The MCMC sampler explores a smaller parameter space, converging faster and with fewer divergences.
Reproducibility: Different analysts fitting the same data will obtain very similar results because the prior dominates the likelihood in ambiguous regions.
Cons of tight priors:
Risk of masking genuine effects: If a media channel truly has zero or negligible effect, a tight positive prior will force the model to assign it some positive contribution, creating a false positive. The model cannot “discover” that a channel is worthless.
Prior-data conflict: If the data strongly suggests a negative relationship (e.g., due to confounding — heavy TV spend coincides with a recession), the tight prior will suppress this signal. The analyst will not see the conflict unless they explicitly check for it.
Overconfidence: The posterior credible intervals will be artificially narrow, because the prior has eliminated large regions of the parameter space. This can make the model appear more certain than it actually is.
Loose Priors: The AMMM Approach
In the AMMM Python library, priors are specified with wider distributions and fewer hard constraints:
    # From: AMMM data-config/demo_config.yml
    custom_priors:
      intercept:
        dist: LogNormal
        kwargs:
          mu: 0
          sigma: 5  # Very wide: allows intercept to range enormously
      beta_channel:
        dist: HalfNormal
        kwargs:
          sigma: 1  # Moderately wide positive prior
      alpha:  # Adstock decay
        dist: Beta
        kwargs:
          alpha: 1
          beta: 3  # Weakly informative, skewed toward short decay
      lam:  # Saturation rate
        dist: Gamma
        kwargs:
          alpha: 3
          beta: 1  # Moderately informative
What this does: The priors are “weakly informative” — they encode soft directional beliefs (media effects are positive via HalfNormal, intercept is positive via LogNormal) but with wide spreads that allow the data substantial room to determine the final estimates.
Pros of loose priors:
Data-driven: The posterior is dominated by the likelihood, not the prior. Results are closer to what an unconstrained MLE would produce, which may feel more “honest” to econometricians.
Discovery: The model can reveal surprising patterns (e.g., a channel with near-zero effect will have a posterior concentrated near zero, rather than being artificially inflated).
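Honest uncertainty: Posterior credible intervals reflect genuine estimation uncertainty. Where sign constraints are dropped altogether (for example, a Normal rather than a HalfNormal prior), this includes uncertainty about effect direction.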
Cons of loose priors:
Instability in small samples: With only 100–200 weekly observations and 7+ correlated media channels, a loose prior provides insufficient regularization. Coefficients can be wildly unstable across different random seeds or slight data perturbations.
Economically nonsensical results: Without strong regularization, the model may produce results that are statistically plausible but economically absurd (e.g., display advertising having a larger effect than TV despite 10x less spend).
Harder convergence: The MCMC sampler must explore a vast parameter space, leading to longer runtimes, more divergences, and lower effective sample sizes.
4. Which should we use: tight or loose?
Neither extreme is correct in isolation. The right choice depends on your sample size, number of media channels, and tolerance for false positives vs. false negatives.
The Practical Recommendation
| Scenario | Recommended Approach |
| --- | --- |
| Small sample ($N < 104$ weeks), many channels ($k > 5$) | Tight priors. The data simply cannot identify more than five correlated media effects independently. Without strong regularization, the model is fundamentally underidentified. |
| Medium sample ($104 \leq N \leq 208$ weeks), moderate channels | Weakly informative priors (the Abacus default). Encode directional beliefs (positive media effects) but allow the data to determine magnitude. |
| Large sample ($N > 208$ weeks), few channels ($k \leq 3$) | Loose priors are defensible. The data volume is sufficient to overwhelm even a weak prior, so the choice matters less. |
| Any sample size, with lift test calibration | Loose priors become safer, because the lift test data injects external causal evidence that compensates for the weak regularization of the prior. |
The Key Insight for Econometricians
In classical econometrics, you are trained to be wary of constraints: a restriction that is wrong biases the estimator, so the safe default is to leave parameters free. In small samples, the Bayesian view is that the trade-off usually runs the other way: a well-chosen prior improves precision by concentrating posterior mass on the economically plausible region of the parameter space. It is the Bayesian equivalent of using economic theory to improve your estimator, which is exactly what structural econometricians (e.g., in IO or macro) have always done.
The prior is not a bias. It is a statement of economic theory. If you believe advertising cannot reduce sales, encoding that belief is not “cheating” — it is incorporating domain knowledge, just as a structural econometrician incorporates equilibrium conditions or rational expectations into their likelihood.
5. Can I check whether the prior is dominating the posterior?
Yes. This is a critical diagnostic step. In Abacus (and any PyMC-based workflow), you should always compare the prior distribution to the posterior distribution for each parameter.
If the posterior looks very similar to the prior, the data has not updated your beliefs. This means either: (a) the prior is too tight and is suppressing the data, or (b) the data genuinely contains no information about that parameter.
If the posterior is substantially narrower or shifted relative to the prior, the data has successfully updated your beliefs, and the prior served only as a sensible starting point.
This comparison is the Bayesian analogue of checking whether your classical constraints are binding. If they are always binding, you should question whether the constraints are appropriate.
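One way to turn this comparison into a number is a variance-contraction score per parameter. The sketch below is an illustration, not an Abacus API: the function name is hypothetical and the draws are simulated. In a real workflow the two arrays would be prior and posterior draws of the same parameter, for example taken from the `prior` and `posterior` groups of the fitted model's inference data.

```python
import numpy as np


def prior_posterior_contraction(prior_draws, posterior_draws):
    """Return 1 - Var(posterior) / Var(prior).

    Values near 0 mean the data barely updated the prior;
    values near 1 mean the posterior is much narrower than the prior.
    """
    prior_var = np.var(prior_draws, ddof=1)
    posterior_var = np.var(posterior_draws, ddof=1)
    return 1.0 - posterior_var / prior_var


rng = np.random.default_rng(1)

# Simulated draws for one media coefficient with a HalfNormal(sigma=2) prior.
prior_draws = np.abs(rng.normal(0.0, 2.0, size=4000))

# Case A: informative data, so the posterior is tight.
informed_posterior = rng.normal(0.8, 0.1, size=4000)
# Case B: uninformative data, so the posterior mirrors the prior.
uninformed_posterior = np.abs(rng.normal(0.0, 2.0, size=4000))

contraction_informed = prior_posterior_contraction(prior_draws, informed_posterior)
contraction_uninformed = prior_posterior_contraction(prior_draws, uninformed_posterior)
print(round(contraction_informed, 2), round(contraction_uninformed, 2))
```

A score near zero does not say which of cases (a) or (b) applies. Refitting with a wider prior and checking whether the posterior moves is one way to tell them apart.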
Adstock and Saturation
In classical econometrics, you model diminishing returns by taking the logarithm of spend: $\log(\text{spend})$ enters the regression, and the coefficient captures an elasticity. Carryover effects, if considered at all, are handled with lagged dependent variables or Koyck distributed lags. These approaches are simple and familiar. They are also, for media measurement, inadequate.
This document explains the two non-linear transformations at the heart of every modern Marketing Mix Model — adstock (carryover) and saturation (diminishing returns) — and shows why they are more flexible, more interpretable, and more economically grounded than the classical alternatives. We also address a subtle but important modelling decision: whether to apply adstock before saturation, or saturation before adstock.
1. The Problem with Log-Linear Specifications
The classical $\log(\text{spend})$ specification makes a single, rigid assumption: the marginal return to an additional pound of media spend is proportional to the reciprocal of current spend. Doubling spend from £100 to £200 produces the same incremental effect as doubling from £1,000 to £2,000. The curvature is fixed by the functional form. You cannot learn it from the data.
This creates two problems in practice.
The first is that the log transform cannot capture saturation at high spend levels. If a channel is already saturated — say, you have bought every available TV slot in the UK — the log transform will still predict positive incremental returns for every additional pound. The curve never flattens. In reality, the marginal return from saturated media is effectively zero, and you need a function that can reach a ceiling.
The second is that the log transform says nothing about carryover. A TV advertisement aired in week 10 does not affect sales only in week 10. Viewers remember the ad. Brand salience persists. The effect decays over subsequent weeks. A pure $\log(\text{spend}_t)$ specification attributes the entire effect to the week the money was spent, ignoring the temporal diffusion of advertising impact. You can add lagged terms manually ($\log(\text{spend}_{t-1})$, $\log(\text{spend}_{t-2})$, and so on), but each lag consumes a degree of freedom, and you must choose the lag length arbitrarily.
Abacus replaces both of these ad hoc treatments with two purpose-built, parameterised transformations whose shapes are learned jointly from the data inside the Bayesian graph.
2. Adstock: Modelling Carryover
Adstock captures a simple economic intuition: advertising has a lingering effect. A pound spent on TV in week 10 generates some response in week 10, a smaller response in week 11, a still smaller response in week 12, and so on until the effect has fully decayed.
The default implementation in Abacus is geometric adstock. The transformation takes the raw weekly spend series and replaces each observation with a weighted sum of current and past spend, where the weights decay geometrically:
$$x^*_t = x_t + \alpha \cdot x^*_{t-1}$$
The parameter $\alpha$ (between 0 and 1) controls the rate of decay. When $\alpha$ is close to zero, the effect is concentrated in the week of exposure — the ad is forgotten almost immediately. When $\alpha$ is close to one, the effect persists for many weeks — the brand impression lingers. The maximum lag length l_max truncates the convolution at a finite horizon for computational efficiency.
Econometricians will recognise this as a Koyck distributed lag model, with two critical differences. First, the decay parameter $\alpha$ is not estimated from lagged dependent variables (which introduces Nickell bias in short panels). It is estimated directly as a parameter of the transformation, with its own Bayesian prior: by default a Beta(1, 3) distribution that gently favours shorter decay while allowing the data to push toward longer persistence if warranted. Second, you do not need to choose the lag length by hand. You set l_max as a generous upper bound (say, 8 or 12 weeks), and the geometric decay structure ensures that distant lags receive negligible weight automatically.
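The truncated recursion can be written as a finite convolution, $x^*_t = \sum_{l=0}^{l_{\max}-1} \alpha^l \, x_{t-l}$. The following is a minimal NumPy sketch of that sum, not the Abacus implementation. The function name is hypothetical, and weight normalisation, which some libraries apply, is deliberately left out.

```python
import numpy as np


def geometric_adstock(x, alpha, l_max):
    """Truncated geometric adstock: sum over l < l_max of alpha**l * x[t - l]."""
    x = np.asarray(x, dtype=float)
    weights = alpha ** np.arange(l_max)
    # Full convolution, then keep the first len(x) entries so that
    # output[t] depends only on current and past spend.
    return np.convolve(x, weights)[: len(x)]


# A single pulse of spend in week 0, nothing afterwards.
spend = np.array([100.0, 0.0, 0.0, 0.0, 0.0])
adstocked = geometric_adstock(spend, alpha=0.5, l_max=3)
print(adstocked)  # values 100, 50, 25, 0, 0: the pulse halves each week, then is cut off at l_max
```

With `alpha=0.5` the weight at lag 8 would already be below 0.4% of the initial effect, which is why a generous `l_max` costs little accuracy.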
Abacus also provides alternative adstock functions, including Weibull PDF and Weibull CDF adstock. The Weibull PDF form allows non-monotonic decay patterns (an effect that peaks one or two weeks after exposure rather than immediately), while the Weibull CDF form, in its usual parameterisation, keeps the decay monotone but lets its rate vary with the lag. The delayed peak captures the empirical reality that some channels, particularly upper-funnel brand advertising, take time to build mental availability before generating measurable sales response.
3. Saturation: Modelling Diminishing Returns
Saturation captures the second economic intuition: each additional pound of spend on a channel is worth less than the last. The first £10,000 of TV spend reaches new audiences and generates substantial incremental sales. The next £10,000 reaches many of the same people again and generates less. Eventually, you have saturated the available audience, and further spend generates almost nothing.
The default implementation in Abacus is logistic saturation:
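One common parameterisation of logistic saturation, assumed here rather than taken from the Abacus source code, maps a non-negative input $x$ to:

$$f(x) = \beta \cdot \frac{1 - e^{-\lambda x}}{1 + e^{-\lambda x}}, \qquad x \geq 0$$

Under this form the response is zero at $x = 0$, rises monotonically, and approaches $\beta$ as $x \to \infty$. The fraction equals $\tanh(\lambda x / 2)$.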
Two parameters govern the shape. The parameter $\lambda$ controls the steepness of the curve — how quickly diminishing returns set in. A large $\lambda$ means the channel saturates rapidly (steep initial response, early flattening). A small $\lambda$ means the channel has a long runway before saturation (gradual response, late flattening). The parameter $\beta$ controls the asymptotic maximum — the ceiling of the response, representing the maximum possible contribution from this channel regardless of spend.
Compare this to the classical $\log(\text{spend})$ specification. The logistic saturation curve has a genuine asymptote: beyond a certain spend level, the curve is effectively flat. The log specification has no such ceiling. The logistic curve also has tunable curvature (governed by $\lambda$), allowing the data to determine how quickly diminishing returns set in. The log curve always bends at the same relative rate.
The default priors in Abacus encode mild economic beliefs. The prior on $\lambda$ is Gamma(3, 1), which centres mass on moderate saturation rates while allowing the data to push toward very steep or very gradual curves. The prior on $\beta$ is HalfNormal(sigma=2), which keeps the channel contribution positive and moderately scaled.
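The contrast between a curve with a ceiling and the log transform can be made concrete. This sketch assumes the form $\beta (1 - e^{-\lambda x}) / (1 + e^{-\lambda x})$ for logistic saturation, which is an assumption rather than a confirmed Abacus formula. The function name and the spend values are illustrative.

```python
import numpy as np


def logistic_saturation(x, lam, beta=1.0):
    """Assumed form: beta * (1 - exp(-lam * x)) / (1 + exp(-lam * x))."""
    x = np.asarray(x, dtype=float)
    return beta * (1.0 - np.exp(-lam * x)) / (1.0 + np.exp(-lam * x))


# Spend on a rescaled axis (illustrative values).
spend = np.array([0.0, 0.5, 1.0, 2.0, 5.0, 10.0])

fast = logistic_saturation(spend, lam=3.0)  # large lambda: saturates early
slow = logistic_saturation(spend, lam=0.3)  # small lambda: long runway

# Incremental response from doubling spend from 5 to 10 under each curve.
gain_fast = fast[-1] - fast[-2]
gain_slow = slow[-1] - slow[-2]
gain_log = np.log(10.0) - np.log(5.0)  # log gives the same gain for any doubling

print(gain_fast, gain_slow, gain_log)
```

Under the fast curve the doubling adds essentially nothing because the channel is already at its ceiling, whereas the log transform credits the same gain to every doubling regardless of the starting level.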
4. Joint Estimation Inside the Bayesian Graph
Here is the critical difference between the Abacus approach and classical pre-processing. In many legacy MMM implementations (and in some textbook treatments), the adstock and saturation transformations are applied as a pre-processing step: the analyst picks fixed values for $\alpha$ and $\lambda$ (perhaps through grid search or “expert judgement”), transforms the raw spend data, and then runs a linear regression on the transformed data.
This approach severs the chain of uncertainty. The regression treats the transformed spend as a known quantity, ignoring the fact that $\alpha$ and $\lambda$ were estimated (or guessed). The standard errors on the media coefficients are conditional on the pre-selected transformation parameters being exactly correct. They are too narrow.
In Abacus, the adstock parameter $\alpha$, the saturation rate $\lambda$, and the media coefficient $\beta$ are all estimated simultaneously inside a single PyMC model. The MCMC sampler explores the joint posterior over all parameters at once. When the sampler draws a high value of $\alpha$ (long carryover), it simultaneously adjusts $\lambda$ and $\beta$ to maintain consistency with the observed data. The resulting posterior credible intervals for media contribution honestly reflect uncertainty about the transformation shape, the coefficient magnitude, and their interactions.
This is analogous to the distinction between two-stage least squares (where estimation error in the first-stage fitted values carries into the second stage, requiring corrected standard errors) and full-information maximum likelihood (where all parameters are estimated jointly). The Bayesian joint estimation in Abacus is closer in spirit to FIML, but with the added benefit of prior regularisation.
5. The Ordering Decision: Adstock First or Saturation First
When you initialise a PanelMMM in Abacus, you choose adstock_first=True (the default) or adstock_first=False. This decision controls the order in which the two transformations are composed, and it encodes a substantive economic assumption about how the media channel operates.
When adstock_first=True, the pipeline is: raw spend → adstock → saturation. The economic interpretation is that carryover accumulates first in the consumer’s memory (brand salience builds up over multiple weeks of exposure), and only then does the accumulated stock of impressions hit diminishing returns. This makes sense for brand-building channels like TV, outdoor, and sponsorship, where the advertising effect is cumulative and the saturation constraint applies to the total accumulated exposure rather than to a single week’s spend.
When adstock_first=False, the pipeline is: raw spend → saturation → adstock. The economic interpretation is that diminishing returns apply immediately to each week’s spend (this week’s audience is saturated by this week’s spend alone), and only then does the already-saturated response carry over into future weeks. This makes sense for direct-response channels like paid search or performance display, where each week’s impressions hit a ceiling independently (you can only capture so many searches in a week), but the conversion effect persists.
The distinction matters quantitatively. Under adstock-first, the model allows a sequence of moderate spend weeks to accumulate into a heavily saturated state — even if no single week was high-spend on its own. Under saturation-first, each week’s spend is capped independently, so a steady moderate spend never reaches the saturation ceiling.
In practice, most MMM practitioners default to adstock-first for all channels, which is why Abacus sets adstock_first=True as the default. But if you have strong prior knowledge that a particular channel exhibits immediate per-week saturation (because the audience pool is fixed and refreshes weekly), switching the order is a principled modelling choice.
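The two orderings can be compared on a steady spend series. The sketch below is self-contained and illustrative: the helper functions are hypothetical stand-ins for the Abacus transformations, with an unnormalised geometric adstock and an assumed logistic form with a ceiling of 1.

```python
import numpy as np


def geometric_adstock(x, alpha, l_max):
    """Unnormalised truncated geometric adstock (illustrative)."""
    weights = alpha ** np.arange(l_max)
    return np.convolve(np.asarray(x, dtype=float), weights)[: len(x)]


def logistic_saturation(x, lam):
    """Assumed logistic form with a ceiling of 1."""
    x = np.asarray(x, dtype=float)
    return (1.0 - np.exp(-lam * x)) / (1.0 + np.exp(-lam * x))


def media_response(x, alpha, lam, l_max, adstock_first=True):
    """Compose the two transformations in the requested order."""
    if adstock_first:
        return logistic_saturation(geometric_adstock(x, alpha, l_max), lam)
    return geometric_adstock(logistic_saturation(x, lam), alpha, l_max)


steady_spend = np.full(12, 1.0)  # twelve weeks of moderate, constant spend
params = dict(alpha=0.7, lam=1.0, l_max=8)

adstock_first = media_response(steady_spend, adstock_first=True, **params)
saturation_first = media_response(steady_spend, adstock_first=False, **params)

# Position on the saturation curve in the final week under each ordering.
level_adstock_first = adstock_first[-1]
level_saturation_first = logistic_saturation(steady_spend, params["lam"])[-1]
print(level_adstock_first, level_saturation_first)
```

In week one the two orderings agree. By week twelve the adstock-first series sits above 90% of the saturation ceiling, while under saturation-first each week's spend is still evaluated at under half of it. The saturation-first output is a carried-over sum of per-week saturated responses, so its level can exceed the single-week ceiling and is rescaled by the media coefficient.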
6. Why This Matters for Econometricians
The adstock-saturation framework replaces several ad hoc classical specifications with a coherent, jointly estimated non-linear model. To summarise the mapping:
The classical Koyck lag model is replaced by geometric adstock with a Bayesian prior on the decay rate. You no longer need to choose lag lengths manually or worry about Nickell bias from lagged dependent variables.
The classical $\log(\text{spend})$ specification is replaced by logistic saturation with learnable steepness and ceiling parameters. You gain a genuine asymptote (something $\log$ cannot provide) and data-driven curvature (something $\log$ fixes by assumption).
The classical two-stage approach (transform then regress) is replaced by joint Bayesian estimation. Your credible intervals honestly propagate uncertainty from the transformation parameters through to the media contribution estimates.
The result is a media response model that is more flexible than any classical specification, more honest about uncertainty, and grounded in the same economic intuitions — carryover and diminishing returns — that econometricians have always recognised. The difference is that Abacus lets the data determine the shape of these phenomena rather than imposing it through functional form.
HSGP
This document answers common questions econometricians may have when encountering HSGP (Hilbert Space Gaussian Process) approximations in the codebase, particularly regarding model flexibility and the number of basis functions.
1. Does a Hilbert Space Gaussian Process use up degrees of freedom when modelling?
Yes, but not in the strict $N - k$ counting sense used in classical OLS econometrics. Instead, Gaussian Processes (and their HSGP approximations) use “effective degrees of freedom” (EDF) due to Bayesian regularization.
Here is how to map HSGPs to classical econometrics concepts:
The Mechanical Setup (Looks like it uses $m$ degrees of freedom)
In classical econometrics, if you want to model a non-linear time trend, you might add polynomial terms or a Fourier series (sines and cosines). If you add $m$ sine/cosine terms to your OLS model, you lose exactly $m$ degrees of freedom.
An HSGP is mathematically very similar to a Fourier series. It approximates an infinite-dimensional Gaussian Process by using $m$ basis functions (the m parameter in the code, often set to 50–200).
If this were OLS, estimating those 200 basis function coefficients would cost 200 degrees of freedom, potentially breaking your model if $N < 200$.
The Bayesian Reality (Effective Degrees of Freedom)
In an HSGP, those $m$ coefficients are not freely estimated. They are bound together by a hierarchical prior structure governed by hyperparameters, specifically the lengthscale ($\ell$) and the amplitude/variance ($\eta$).
Because the coefficients share a prior that heavily shrinks most of them toward zero, we measure the flexibility using Effective Degrees of Freedom (EDF).
Like Ridge Regression: Think of HSGP as running a Ridge Regression (L2 regularization) on 200 Fourier terms. Even though there are 200 parameters, the L2 penalty restricts their variance. The “effective” degrees of freedom might only be 4 or 5.
Data-driven penalty: The amount of shrinkage is controlled by the lengthscale ($\ell$).
If the data shows a smooth, slowly moving trend, the model learns a large lengthscale. This imposes massive shrinkage on the high-frequency (wiggly) basis functions, meaning the HSGP uses very few effective degrees of freedom (acting almost like a simple linear trend).
If the data is highly volatile, the model learns a short lengthscale, relaxing the shrinkage, allowing the curve to wiggle, and consuming more effective degrees of freedom.
Summary: While you might instantiate an HSGP with 100 basis functions ($m=100$), it does not subtract 100 from your denominator. It consumes roughly as much “effective” flexibility as the data supports, heavily penalizing unnecessary complexity (wiggliness) via its priors. Because the prior keeps the problem well-posed even when $m$ exceeds $N$, you avoid the classical $N - k < 0$ matrix inversion failures.
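Effective degrees of freedom can be computed directly for a fixed set of hyperparameters. The sketch below is a simplified illustration, not the Abacus internals. It uses a sine basis on a unit interval, takes prior variances from the spectral density of a squared-exponential kernel, holds the noise variance and lengthscale fixed instead of sampling them, and measures EDF as the trace of the implied smoother matrix. All function names are hypothetical.

```python
import numpy as np


def squared_exponential_spectral_density(omega, lengthscale, amplitude=1.0):
    """Prior variance assigned to a basis function with frequency omega."""
    return (
        amplitude**2
        * np.sqrt(2.0 * np.pi)
        * lengthscale
        * np.exp(-0.5 * (lengthscale * omega) ** 2)
    )


def effective_degrees_of_freedom(basis, prior_variances, noise_variance):
    """Trace of K (K + noise_variance * I)^-1, with K = basis diag(prior_variances) basis^T."""
    kernel = (basis * prior_variances) @ basis.T
    eigenvalues = np.clip(np.linalg.eigvalsh(kernel), 0.0, None)
    return float(np.sum(eigenvalues / (eigenvalues + noise_variance)))


n_weeks, m = 156, 100
t = np.linspace(0.0, 1.0, n_weeks)       # three years of weeks, rescaled to [0, 1]
omega = np.pi * np.arange(1, m + 1)      # frequency of each of the m basis functions
basis = np.sqrt(2.0) * np.sin(np.outer(t, omega))

edf_smooth = effective_degrees_of_freedom(
    basis, squared_exponential_spectral_density(omega, lengthscale=0.5), noise_variance=1.0
)
edf_wiggly = effective_degrees_of_freedom(
    basis, squared_exponential_spectral_density(omega, lengthscale=0.05), noise_variance=1.0
)
print(edf_smooth, edf_wiggly)
```

Both fits carry 100 coefficients, yet the long-lengthscale case uses only a couple of effective degrees of freedom and the short-lengthscale case uses more, still well below 100.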
2. Is it up to the analyst to decide how many basis functions to set? Will this result in specification hunting?
This is a very valid concern. In standard OLS, if Analyst A uses a 5th-order Fourier series and Analyst B uses a 20th-order Fourier series, they will get wildly different results, opening the door for specification hunting.
In the Abacus HSGP implementation, this risk is mitigated in two ways: Automated Heuristics (code design) and Approximation Limits (mathematical design).
1. Automated Selection (The Code Design)
The library is specifically designed so analysts do not have to guess or manually set the number of basis functions ($m$).
In the HSGP class, the factory method parameterize_from_data calculates $m$ automatically using an algorithm (approx_hsgp_hyperparams) based on published literature (Riutort-Mayol et al., 2022).
It calculates $m$ deterministically based on two things:
The span of the time-series data (e.g., 3 years of weekly data).
The lower bound of the lengthscale prior (the shortest time-span over which we believe the effect could realistically change).
This guarantees that two analysts modeling the same dataset with the same assumptions will end up with the exact same $m$.
2. $m$ dictates “Resolution”, not “Complexity” (The Mathematical Design)
Even if an analyst decided to bypass the automation and manually force a massive number of basis functions, it would not result in overfitting or specification hunting.
In an HSGP, $m$ is just the resolution limit of the approximation to the true infinite-dimensional Gaussian Process.
If $m$ is too small: The model lacks the resolution to capture fast-moving trends (it will artificially smooth things out).
If $m$ is large enough (e.g., $m=50$): The model approximates the true Gaussian Process closely.
If $m$ is absurdly large (e.g., $m=500$): The model will yield essentially the same curve as $m=50$.
Why? Because the extra 450 basis functions represent very high-frequency, rapid wiggles. The kernel's spectral density, which is governed by the lengthscale, gives those high-frequency basis functions prior variances that are effectively zero, so their coefficients are shrunk to negligible values.
The main penalty for setting $m$ too high is computation time. The MCMC sampler will run much slower because it has to drag around useless matrices, but the statistical fit will remain essentially identical, provided the smaller $m$ was already adequate for the lengthscales the prior allows. Therefore, an analyst has little room to “p-hack” or specification-hunt by artificially inflating $m$.
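The claim about the extra basis functions can be checked by looking at the prior variance each one receives. This is a sketch under stated assumptions: a squared-exponential kernel, a domain of half-width 1, basis frequencies $j\pi / (2L)$, and a fixed lengthscale of 0.2. The function name is hypothetical and the numbers are not Abacus defaults.

```python
import numpy as np


def basis_prior_variance(m, lengthscale, half_width, amplitude=1.0):
    """Prior variance of each of m HSGP coefficients for a squared-exponential kernel."""
    j = np.arange(1, m + 1)
    omega = j * np.pi / (2.0 * half_width)  # frequency of basis function j
    return (
        amplitude**2
        * np.sqrt(2.0 * np.pi)
        * lengthscale
        * np.exp(-0.5 * (lengthscale * omega) ** 2)
    )


variance_500 = basis_prior_variance(m=500, lengthscale=0.2, half_width=1.0)

# Share of total prior variance carried by basis functions 51..500.
share_beyond_50 = variance_500[50:].sum() / variance_500.sum()
print(share_beyond_50)
```

For this lengthscale the share is numerically indistinguishable from zero, so the 450 extra coefficients have almost no room to move. With a much shorter lengthscale the cut-off shifts to higher frequencies, which is why the required $m$ depends on the lower bound of the lengthscale prior.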
3. We often model trend/seasonality using explicit Fourier terms (e.g., sin52_1 + cos52_1 + ...). This uses up degrees of freedom and often causes severe multicollinearity (high VIF) with our media or control variables. Does HSGP solve this?
Yes. Explicitly adding Fourier terms to a linear formula creates structural problems that HSGP elegantly sidesteps.
1. The Degrees of Freedom Problem
As discussed in Section 1, explicitly adding 10 sine/cosine terms to a regression permanently burns 10 degrees of freedom. The model is forced to independently estimate an unpenalized coefficient for every single wave, regardless of whether that specific frequency is actually present in the data.
The HSGP Solution:
HSGP uses Effective Degrees of Freedom (EDF). It evaluates a large number of basis functions (which are essentially Fourier terms), but ties them all together under a single hierarchical Gaussian Process prior. If the data doesn’t exhibit a certain high-frequency wiggle, the GP lengthscale prior dynamically crushes the coefficients of those specific basis functions toward zero. You get the flexibility of 100 sine waves, but only “pay” for the effective degrees of freedom the data actually demands.
2. The Multicollinearity (High VIF) Problem
When you add explicit Fourier terms, they act as independent regressors. If one of your media channels (e.g., m_tv) happens to have a seasonal spending pattern that correlates strongly with sin52_1, the model suffers from classic multicollinearity. The VIF skyrockets, standard errors blow up, and the coefficient for m_tv becomes unstable.
The HSGP Solution:
HSGP mitigates this through structured regularization.
Orthogonal Basis: The basis functions generated internally by the HSGP are orthogonal to each other.
Shared Shrinkage: More importantly, the coefficients for the HSGP basis functions are not estimated independently. They are strictly regularized by the GP’s lengthscale ($\ell$) and variance ($\eta$) hyperparameters.
Because the GP is mathematically constrained to behave like a smooth, cohesive curve, it cannot arbitrarily spike a single basis function’s coefficient just to “steal” variance from a highly correlated m_tv variable. The GP prior strongly penalizes such isolated, un-smooth coefficient spikes. Consequently, the model focuses on capturing the true underlying baseline trend, leaving the media coefficients much more stable than they would be against unpenalized, explicit Fourier regressors.
4. Should we feed in holiday dummy variables instead?
No. You do not need to manually construct binary 1/0 dummy variables (e.g., is_black_friday) or step functions in your input data.
The recommendation is to pass the raw dates of the holidays directly into the model via a separate DataFrame. Abacus’s EventAdditiveEffect API will internally calculate the distance in days from your time series to the holiday, and wrap that in a continuous basis function (like a Gaussian curve). This provides a smoother, more realistic “build-up and cool-down” effect compared to the harsh structural breaks of traditional dummy variables.
Example: Ingesting a Holidays DataFrame into Abacus
If you have a CSV of holidays (like data-config/holidays.csv), you load it as a standard Pandas DataFrame and inject it into the model configuration before building.
    import pandas as pd
    from pymc_extras.prior import Prior

    from abacus.mmm.panel import PanelMMM
    from abacus.mmm.events import EventEffect, GaussianBasis

    # 1. Load your raw holidays
    # The dataframe must contain exactly: "name", "start_date", "end_date"
    df_holidays = pd.DataFrame(
        {
            "name": ["Black Friday 2023", "Black Friday 2024", "Christmas 2023"],
            "start_date": ["2023-11-24", "2024-11-29", "2023-12-25"],
            "end_date": ["2023-11-25", "2024-11-30", "2023-12-26"],
        }
    )

    # 2. Define the mathematical shape of the holiday effect
    # We use a GaussianBasis so the effect smoothly ramps up and down
    holiday_effect = EventEffect(
        basis=GaussianBasis(),
        effect_size=Prior("Normal", mu=0, sigma=1),
        dims="holiday",
    )

    # 3. Initialize your MMM
    mmm = PanelMMM(
        date_column="date",
        target_column="sales",
        channel_columns=["tv", "social"],
        dims=("country",),
    )

    # 4. Inject the raw dataframe into the API
    # Abacus handles all the distance calculations and basis mappings internally
    mmm.add_events(df_events=df_holidays, prefix="holiday", effect=holiday_effect)

    # 5. Build and fit as normal
    # X and y are assumed to be your prepared feature DataFrame and target series
    mmm.build_model(X, y)
    mmm.fit()
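To see why a distance-based basis gives a smoother effect than a dummy, the calculation can be sketched by hand. This is an illustration of the idea only, not the Abacus implementation: the function name is hypothetical, and the width `sigma_days` is fixed here whereas a model would typically learn it.

```python
import numpy as np


def gaussian_event_basis(dates, event_date, sigma_days):
    """Smooth bump centred on event_date; sigma_days sets its width in days."""
    days_from_event = (dates - event_date) / np.timedelta64(1, "D")
    return np.exp(-0.5 * (days_from_event / sigma_days) ** 2)


# Six weekly observation dates (Mondays) around Black Friday 2023.
weeks = np.arange(
    np.datetime64("2023-11-06"), np.datetime64("2023-12-18"), np.timedelta64(7, "D")
)
black_friday = np.datetime64("2023-11-24")

bump = gaussian_event_basis(weeks, black_friday, sigma_days=7.0)

# Traditional alternative: a 1/0 dummy on the week containing the event.
dummy = (weeks == np.datetime64("2023-11-20")).astype(float)

print(np.round(bump, 2))
print(dummy)
```

The dummy switches on for a single week, while the bump gives partial weight to the weeks on either side of the event. That partial weight is the build-up and cool-down described above.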
5. If HSGP is statistically superior for seasonality, why does the fourier.py module still exist?
This is not a contradiction. Model building requires balancing statistical elegance with computational constraints and structural assumptions. There are four reasons explicit Fourier terms are retained alongside HSGP in the library:
1. Computation Speed
HSGPs are statistically efficient but computationally more expensive than fixed regressors. Although the Hilbert-space approximation avoids the large matrix inversions of an exact Gaussian Process, the sampler must still estimate the basis-function coefficients jointly with the kernel hyperparameters, which adds parameters and makes the posterior harder to explore. Explicit Fourier terms, by contrast, are just static columns in the design matrix. Estimating a Bayesian regression with 4 sine/cosine columns takes seconds; fitting an HSGPPeriodic can be substantially slower. For analysts iterating rapidly on a prototype or running models on large datasets, explicit Fourier terms offer a fast, “good enough” approximation.
2. Static vs. Drifting Seasonality
HSGPPeriodic allows the seasonal shape to drift slowly over time (e.g., consumer behaviour shifting gradually across 5 years). This is more realistic but requires learning extra GP hyperparameters.
Explicit Fourier forces the seasonality to be completely static: the December peak in 2021 is mathematically identical to the December peak in 2024. If the econometrician has a strong prior belief that the seasonal structure is structurally invariant, explicit Fourier terms enforce that belief more rigidly than an HSGP can.
3. The “Trend = HSGP, Seasonality = Fourier” Hybrid Pattern
A very common and practically effective architecture in Bayesian MMMs is:
Standard HSGP for the baseline trend, because trend is unbounded, unpredictable, and highly prone to overfitting.
A low-order YearlyFourier (e.g., n_order=2 or 3) for seasonality, because seasonality is bounded, predictable, and structurally repetitive.
By keeping the Fourier order very low, the degrees of freedom penalty is minimal (only 4–6 parameters), and the analyst avoids the computational overhead of running two separate HSGPs simultaneously. This hybrid is often the most practical choice for weekly marketing data.
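The parameter count quoted above follows directly from how explicit Fourier regressors are built: each order contributes one sine and one cosine column. The sketch below is a generic construction, not the contents of fourier.py; the function name and the 365.25-day period are assumptions for illustration.

```python
import math


def yearly_fourier_features(day_of_year: float, n_order: int,
                            period: float = 365.25) -> list[float]:
    """Return [sin_1, cos_1, ..., sin_n, cos_n] for a single date."""
    features = []
    for k in range(1, n_order + 1):
        angle = 2.0 * math.pi * k * day_of_year / period
        features.extend([math.sin(angle), math.cos(angle)])
    return features


row_order2 = yearly_fourier_features(350.0, n_order=2)
row_order3 = yearly_fourier_features(350.0, n_order=3)

# The same calendar position one period later gives the same regressors,
# which is what "completely static" seasonality means in practice.
same_day_next_year = yearly_fourier_features(350.0 + 365.25, n_order=2)

print(len(row_order2), len(row_order3))  # prints: 4 6
```

Because the columns repeat exactly every period, any seasonal shape fitted with them repeats exactly too; only the coefficients are learned.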
4. Backwards Compatibility and Migration
Many teams migrate to Abacus from legacy OLS frameworks or tools like Prophet, which relies heavily on explicit Fourier terms. To build trust in the new Bayesian framework, econometricians often want to first build a “baseline” model that perfectly mirrors their old model’s architecture and verify they obtain comparable results. The fourier.py module enables this 1:1 apples-to-apples comparison before upgrading the architecture to use HSGPs.
MCMC Diagnostics
If you have spent your career reading Stata output — coefficient tables, standard errors, t-statistics, p-values, and the occasional Durbin-Watson statistic — then your first encounter with MCMC output will feel disorienting. There are no p-values. There is no single “estimate.” Instead, there are thousands of draws from something called a posterior distribution, accompanied by diagnostics you have never seen: R-hat, ESS, divergences, trace plots. This document maps every one of these concepts back to something you already understand, so you can read Bayesian output with the same confidence you bring to a regression table.
1. What the Sampler Actually Does
In classical econometrics, estimation is an optimisation problem. You write down a likelihood function and find the parameter values that maximise it (MLE) or minimise a loss function (OLS, GMM). The result is a single point estimate for each parameter, and the standard errors come from the curvature of the likelihood at that point (the inverse of the information matrix).
In Bayesian estimation, we do not optimise. We integrate. The goal is to characterise the entire posterior distribution — the full landscape of parameter values that are consistent with both the data and the prior. For most models of practical interest, this integral has no closed-form solution. We cannot write down a formula for the posterior the way you can write down the OLS estimator $\hat{\beta} = (X'X)^{-1}X'y$.
MCMC (Markov Chain Monte Carlo) solves this problem by constructing a random walk through the parameter space. At each step, the sampler proposes a new set of parameter values, evaluates how well they fit the data (the likelihood) and the prior, and decides whether to accept or reject the proposal. After enough steps, the collection of accepted values — the “chain” — converges to a representative sample from the posterior distribution.
The specific algorithm used in Abacus and PyMC is called NUTS (the No-U-Turn Sampler), a variant of Hamiltonian Monte Carlo (HMC). Think of it as a physics simulation: the sampler treats the negative log-posterior as a potential energy surface and launches a particle across it. The particle is pulled downhill into regions of high posterior density, while its momentum carries it back uphill into lower-density regions, so the posterior is visited in proportion to its probability. NUTS automatically tunes the trajectory length so the particle explores efficiently without doubling back on itself.
The critical point for an econometrician: the output of this process is not a single number. It is a collection of, say, 4,000 parameter vectors (2 chains × 2,000 draws each). Every summary statistic you will ever compute — the mean, the median, credible intervals, the probability that a coefficient exceeds zero — derives from this collection of draws.
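The propose, evaluate, accept-or-reject loop can be made concrete with the simplest member of the MCMC family. The sketch below is plain random-walk Metropolis, not NUTS, and the one-parameter Normal(1.0, 0.5) target, step size, and warm-up length are all assumptions chosen for illustration.

```python
import math
import random


def log_posterior(theta: float) -> float:
    """Unnormalised log-posterior of an illustrative Normal(1.0, 0.5) target."""
    return -0.5 * ((theta - 1.0) / 0.5) ** 2


def random_walk_metropolis(n_draws: int, step: float, seed: int) -> list[float]:
    rng = random.Random(seed)
    theta = 0.0
    chain = []
    for _ in range(n_draws):
        proposal = theta + rng.gauss(0.0, step)
        log_ratio = log_posterior(proposal) - log_posterior(theta)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
        # On rejection the current value is recorded again.
        chain.append(theta)
    return chain


chain = random_walk_metropolis(n_draws=20000, step=1.0, seed=1)
kept = chain[2000:]  # discard warm-up draws
posterior_mean = sum(kept) / len(kept)
posterior_sd = (sum((x - posterior_mean) ** 2 for x in kept) / len(kept)) ** 0.5
print(f"mean={posterior_mean:.2f} sd={posterior_sd:.2f}")
```

Every summary is computed from the retained draws, never from a single optimum. NUTS replaces the blind Gaussian proposal with gradient-guided trajectories, but the output has the same form.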
2. Trace Plots: The First Thing to Check
A trace plot displays the sampled values of a single parameter across the iterations of the chain. The horizontal axis represents the iteration number. The vertical axis represents the parameter value. If everything has gone well, the trace plot looks like a “fuzzy caterpillar” — a dense, stationary band of values oscillating around a stable mean with no visible trends, steps, or sticky regions.
If you are an econometrician, think of the trace plot as the time-series plot of an MCMC residual. You want it to look like white noise. Specifically, you want three properties.
The first is stationarity. The chain should not drift upward or downward over time. If you see a clear trend, the chain has not converged: the sampler is still searching for the high-density region of the posterior, and the draws from the early part of the chain are not representative. This is analogous to estimating an AR(1) process that has not yet reached its stationary distribution.
The second is good mixing. The chain should move rapidly across the full support of the posterior. If you see long stretches where the chain gets “stuck” at a particular value before jumping to another region, the sampler is struggling to explore the parameter space. Poor mixing inflates your effective standard errors, just as strong autocorrelation in a time series reduces the effective information content of the data.
The third is agreement across chains. If you run multiple independent chains (and you always should — Abacus defaults to at least two), they should all settle into the same band. If one chain is exploring a different region of the parameter space from the others, the model has not converged, and you cannot trust any summary statistics.
3. R-hat: The Convergence Diagnostic
R-hat ($\hat{R}$) is the single most important diagnostic number in Bayesian computation. It measures whether multiple independent chains have converged to the same distribution.
The intuition is straightforward. R-hat compares the variance of a parameter within each chain to the variance of the same parameter across chains. If all chains are sampling from the same distribution, these two variances should be roughly equal, and R-hat should be close to 1.0. If the chains disagree — one chain has settled around 0.5 while another has settled around 2.3 — the between-chain variance will be large relative to the within-chain variance, and R-hat will be substantially greater than 1.0.
For an econometrician, think of R-hat as a convergence test analogous to the Gelman-Rubin statistic (because that is exactly what it is, in its modern split-chain formulation). The threshold is conventional: R-hat below 1.01 is considered safe. Values between 1.01 and 1.05 warrant caution. Values above 1.1 indicate that the chains have not converged, and you should not interpret the results.
When R-hat is too high, the remedy is usually to run the sampler for more iterations (increase tune and draws), reparameterise the model (e.g., use non-centered parameterisations for hierarchical models), or simplify the model.
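The within-versus-between comparison can be written out in a few lines. This is a simplified split-chain R-hat without the rank normalisation that ArviZ applies, so treat it as a teaching sketch; the simulated chains and the function name are assumptions.

```python
import random
import statistics


def split_rhat(chains: list[list[float]]) -> float:
    """Split-chain R-hat without rank normalisation (a simplification)."""
    halves = []
    for chain in chains:
        mid = len(chain) // 2
        halves.append(chain[:mid])
        halves.append(chain[mid:2 * mid])
    n = len(halves[0])
    within = statistics.fmean([statistics.variance(h) for h in halves])
    between = n * statistics.variance([statistics.fmean(h) for h in halves])
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


rng = random.Random(42)
# Two chains sampling the same distribution.
agreeing = [[rng.gauss(1.0, 1.0) for _ in range(2000)] for _ in range(2)]
# Two chains settled in different places (0.5 and 2.3).
disagreeing = [[rng.gauss(mu, 1.0) for _ in range(2000)] for mu in (0.5, 2.3)]

rhat_ok = split_rhat(agreeing)
rhat_bad = split_rhat(disagreeing)
print(f"agreeing: {rhat_ok:.3f}  disagreeing: {rhat_bad:.3f}")
```

Splitting each chain in half means a single chain that is still drifting also inflates the statistic, not only chains that disagree with each other.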
4. Effective Sample Size (ESS): Your True Degrees of Freedom
The sampler produces, say, 4,000 draws. But consecutive draws are autocorrelated — each draw is a small perturbation of the previous one. The effective sample size (ESS) measures how many independent draws your 4,000 autocorrelated draws are actually worth.
If you are an econometrician, you already understand this concept. It is closely analogous to the Newey-West (HAC) correction for autocorrelated errors in time-series regression. When your regression residuals are positively autocorrelated, the “effective” number of independent observations is smaller than the nominal sample size $N$, and your standard errors are too small if you ignore the autocorrelation. ESS makes the same kind of adjustment for MCMC draws: it deflates the nominal number of draws according to the sum of their autocorrelations.
There are two flavours of ESS reported in PyMC and ArviZ output. ESS-bulk measures the effective sample size in the centre of the posterior distribution (around the mean and median). ESS-tail measures the effective sample size in the tails (relevant for credible interval estimation). Both matter.
The practical threshold is simple: you want ESS-bulk and ESS-tail both above 400 for reliable inference. Below 400, your posterior summaries are noisy — the mean might be reasonable, but the 95% credible interval endpoints could shift substantially if you re-ran the sampler. Below 100, the results are unreliable and should not be reported.
When ESS is too low, the remedies are to increase the number of draws or to improve the model parameterisation. Thinning the chains does not help: it discards draws and so cannot raise ESS, and it is useful mainly for reducing storage. More draws are almost always preferable.
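The autocorrelation adjustment behind ESS can be sketched for a single chain as $N / (1 + 2\sum_k \rho_k)$. The version below truncates the sum at the first non-positive autocorrelation, which is a simplification of the estimator ArviZ uses; the AR(1) chain with coefficient 0.9 is an assumed stand-in for a poorly mixing sampler.

```python
import random


def effective_sample_size(chain: list[float]) -> float:
    """Single-chain ESS = N / (1 + 2 * sum of autocorrelations).

    The sum stops at the first non-positive autocorrelation (a simplification).
    """
    n = len(chain)
    mean = sum(chain) / n
    centred = [x - mean for x in chain]
    variance = sum(c * c for c in centred) / n
    rho_sum = 0.0
    for lag in range(1, n):
        acov = sum(centred[i] * centred[i + lag] for i in range(n - lag)) / n
        rho = acov / variance
        if rho <= 0.0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)


rng = random.Random(0)
iid_chain = [rng.gauss(0.0, 1.0) for _ in range(4000)]

ar1_chain = [0.0]
for _ in range(3999):
    ar1_chain.append(0.9 * ar1_chain[-1] + rng.gauss(0.0, 1.0))

ess_iid = effective_sample_size(iid_chain)
ess_ar1 = effective_sample_size(ar1_chain)
print(f"independent draws: {ess_iid:.0f}  autocorrelated draws: {ess_ar1:.0f}")
```

Both chains contain 4,000 draws, but the autocorrelated one is worth only a small fraction of that for estimating the posterior mean.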
5. Divergences: The Red Flag You Must Not Ignore
A divergence is an event during sampling where the NUTS trajectory encounters a region of the posterior that changes so sharply that the numerical integration breaks down. The sampler detects that its simulated particle has deviated from the true Hamiltonian trajectory and flags the draw.
For an econometrician, think of a divergence as the Bayesian equivalent of a near-singular Hessian in MLE optimisation. When the likelihood surface has extremely steep ridges or sharp funnels, the MLE optimiser either fails to converge or converges to a local maximum. In MCMC, the analogous pathology manifests as divergences.
Divergences are not merely a computational nuisance. They indicate that the sampler has failed to explore some region of the posterior, which means the resulting draws are a biased sample from the true posterior. Even a handful of divergences can systematically exclude an important region of the parameter space, leading to overconfident and potentially wrong inference.
The practical rule is unforgiving: zero divergences is the target. A small number (fewer than 10 out of 4,000 draws) may be tolerable if they are scattered and do not cluster in a particular region of the parameter space. Divergences that occur only during warmup are discarded along with the warmup draws and are not the concern here. But if you see hundreds of divergences, the model is misspecified or poorly parameterised, and no amount of additional sampling will fix the problem.
The most common remedies are increasing target_accept (the target acceptance probability for NUTS; a higher value forces a smaller step size), reparameterising the model (switching from a centred to a non-centred parameterisation for hierarchical priors), or simplifying the model to remove the pathological geometry.
6. Credible Intervals vs. Confidence Intervals
In classical econometrics, a 95% confidence interval means: “If we repeated this experiment infinitely many times and constructed an interval each time, 95% of those intervals would contain the true parameter.” Crucially, it does not mean that there is a 95% probability that the true parameter lies in this particular interval. The true parameter is fixed. The interval is random.
A 95% Bayesian credible interval means exactly what you always wished the confidence interval meant: “Given the data and the model, there is a 95% probability that the parameter lies in this interval.” The parameter is treated as a random variable (with a posterior distribution), and the interval directly quantifies our uncertainty about its value.
The Highest Density Interval (HDI), which Abacus and ArviZ report by default, is a specific type of credible interval: the narrowest interval that contains 95% (or 94%, the ArviZ default) of the posterior mass. For symmetric posteriors, the HDI coincides with the equal-tailed credible interval. For skewed posteriors (common for variance parameters or media effects bounded at zero), the HDI is narrower and more informative.
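The difference between the HDI and an equal-tailed interval is easiest to see on a skewed sample. The sketch below finds the narrowest window of sorted draws, which assumes a unimodal posterior; the log-normal draws stand in for a real posterior and both function names are illustrative.

```python
import math
import random


def hdi(draws: list[float], prob: float = 0.94) -> tuple[float, float]:
    """Narrowest interval containing `prob` of the draws (unimodal case)."""
    ordered = sorted(draws)
    n = len(ordered)
    window = math.ceil(prob * n)
    start = min(range(n - window + 1),
                key=lambda i: ordered[i + window - 1] - ordered[i])
    return ordered[start], ordered[start + window - 1]


def equal_tailed(draws: list[float], prob: float = 0.94) -> tuple[float, float]:
    """Interval that cuts the same probability mass from each tail."""
    ordered = sorted(draws)
    n = len(ordered)
    lower = ordered[int((1.0 - prob) / 2.0 * n)]
    upper = ordered[int((1.0 + prob) / 2.0 * n) - 1]
    return lower, upper


rng = random.Random(3)
skewed_draws = [rng.lognormvariate(0.0, 1.0) for _ in range(4000)]

hdi_low, hdi_high = hdi(skewed_draws)
eti_low, eti_high = equal_tailed(skewed_draws)
hdi_width = hdi_high - hdi_low
eti_width = eti_high - eti_low
print(f"HDI width={hdi_width:.2f}  equal-tailed width={eti_width:.2f}")
```

For a right-skewed posterior the HDI shifts toward the dense region near zero and gives up the long upper tail, which is why it is the narrower of the two.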
7. Mapping Bayesian Output to Classical Hypothesis Testing
Econometricians are trained to ask: “Is this coefficient statistically significant?” In Bayesian inference, the question is reframed as: “What is the probability that this coefficient exceeds (or falls below) a particular threshold?”
The mapping is direct. When a 94% HDI for a media coefficient excludes zero — meaning the entire interval lies above zero — this is the Bayesian analogue of rejecting the null hypothesis at roughly the 6% significance level. When a 90% HDI excludes zero, the analogy is rejection at the 10% level.
But Bayesian inference offers richer answers than a binary significant/not-significant verdict. You can compute the exact posterior probability that the coefficient exceeds zero: $P(\beta > 0 \mid \text{data})$. If this probability is 0.98, you have strong evidence that the media channel has a positive effect. If it is 0.62, you have weak and inconclusive evidence. The posterior probability gives you a continuous measure of evidential strength, not a binary decision forced by an arbitrary 5% threshold.
You can also compute the posterior probability that the coefficient exceeds a practically meaningful threshold. “Is there at least a 90% probability that the ROI for TV exceeds 1.0?” is a more useful question for a media planner than “Is the TV coefficient significantly different from zero?” Bayesian inference answers the first question naturally.
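Both kinds of probability statement reduce to counting draws. In the sketch below the ROI draws are simulated from an assumed Normal(1.3, 0.25) purely to have something to count; with a fitted model they would be the posterior draws of the quantity of interest.

```python
import random

rng = random.Random(7)
# Hypothetical posterior draws of a TV ROI, standing in for real MCMC output.
roi_draws = [rng.gauss(1.3, 0.25) for _ in range(4000)]


def prob_exceeds(draws: list[float], threshold: float) -> float:
    """Posterior probability that the quantity exceeds `threshold`."""
    return sum(d > threshold for d in draws) / len(draws)


p_positive = prob_exceeds(roi_draws, 0.0)
p_above_one = prob_exceeds(roi_draws, 1.0)
print(f"P(ROI > 0) = {p_positive:.3f}")
print(f"P(ROI > 1) = {p_above_one:.3f}")
```

The same function answers any threshold question, so the planner's break-even question costs no more than the significance question.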
8. A Diagnostic Checklist
When you receive MCMC output from an Abacus model run, work through the following checks in order.
Start with R-hat. Examine R-hat for every parameter. If any R-hat exceeds 1.01, stop. The chains have not converged, and every downstream summary is unreliable. Increase tune and draws, or investigate the model parameterisation.
Next, check for divergences. If the sampler reports more than a handful of divergences, the posterior geometry is pathological. Increase target_accept to 0.95 or 0.99. If divergences persist, the model likely needs reparameterisation or simplification.
Then examine ESS. Verify that ESS-bulk and ESS-tail exceed 400 for every parameter of interest. If ESS is low despite good R-hat, the chains are highly autocorrelated. Increase the number of draws.
Now inspect trace plots. Visually confirm that each chain looks like stationary white noise and that multiple chains overlap. Look for any sticky regions, trends, or bimodality.
Finally, interpret the posteriors. Report the posterior mean or median as your point estimate, the HDI as your interval estimate, and the posterior probability of exceeding zero (or any substantive threshold) as your measure of evidential strength.
Only after all four computational diagnostics pass — R-hat, divergences, ESS, and trace plots — should you proceed to interpret the substantive results. A Bayesian model with poor diagnostics is no more trustworthy than an OLS regression with autocorrelated residuals and a Durbin-Watson statistic of 0.4. The numbers may look plausible, but they are not reliable.
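The three numerical checks can be collected into a small gate that runs before anyone looks at contributions. This is a sketch using the thresholds quoted in this document; the function name and inputs are hypothetical. In an ArviZ workflow the inputs would typically come from the r_hat, ess_bulk, and ess_tail columns of az.summary() and from the diverging entry in the sampler statistics. Trace plots still need a visual check.

```python
def diagnostic_problems(max_rhat: float, n_divergences: int,
                        min_ess_bulk: float, min_ess_tail: float) -> list[str]:
    """Return the failed checks, in the order the checklist prescribes."""
    problems = []
    if max_rhat > 1.01:
        problems.append(f"R-hat {max_rhat:.3f} exceeds 1.01: chains have not converged")
    if n_divergences > 0:
        problems.append(f"{n_divergences} divergences: inspect posterior geometry")
    if min(min_ess_bulk, min_ess_tail) < 400:
        problems.append("ESS below 400: posterior summaries are noisy")
    return problems


clean_run = diagnostic_problems(max_rhat=1.003, n_divergences=0,
                                min_ess_bulk=1200, min_ess_tail=950)
bad_run = diagnostic_problems(max_rhat=1.04, n_divergences=12,
                              min_ess_bulk=150, min_ess_tail=90)

print("clean run:", clean_run or "all numerical checks passed")
for problem in bad_run:
    print("bad run:", problem)
```

The gate flags any divergence rather than applying a tolerance, which matches the zero-divergence target and leaves the judgement about a handful of scattered divergences to the analyst.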
Prior Predictive Checks
If you come from classical econometrics, you are used to checking assumptions
after estimation: residual plots, heteroskedasticity tests, outlier influence,
and maybe out-of-sample fit. Bayesian workflow adds one earlier question:
Before fitting anything, do my priors imply plausible behaviour for the
target variable?
That is what prior predictive checking answers.
1. Why parameter-level priors are not enough
A prior can look sensible when you inspect it in isolation and still imply
absurd behaviour once it flows through the whole model.
For example:
an intercept prior may look “weakly informative” on paper
a channel coefficient prior may look “reasonably positive”
a likelihood sigma prior may look “safely diffuse”
But jointly, those choices might imply:
weekly revenue that is far above anything you could ever observe
negative conversions for a business where the target is always non-negative
far more volatility than the real series could possibly have
Classical econometrics rarely forces you to check this explicitly because you
usually specify penalties or constraints directly on the coefficient space.
Bayesian MMM requires one more layer of discipline: inspect the implied
distribution of y, not just the configured priors on the parameters.
2. What prior predictive checking does
Prior predictive checking asks:
If the priors were true, what kinds of target series would this model
generate before seeing the actual data?
The workflow is:
Build the model with your chosen priors and structure.
Sample from the prior predictive distribution.
Compare those simulated target draws with the scale and shape of the real
target series.
This is not a convergence check and it is not a causal test. It is a
plausibility check on the model you are about to fit.
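The workflow above can be sketched without any modelling library for a toy one-channel model on a target scaled to roughly the unit interval. Everything here is an assumption made for illustration: the Normal intercept centred at 0.5, the half-normal coefficient and noise priors, the media series, and the plausibility bounds of 0 and 3.

```python
import random


def prior_predictive_draws(n_draws, media, intercept_sd, beta_sd, sigma_sd, seed=0):
    """Simulate target series from the priors alone (no data used)."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_draws):
        intercept = rng.gauss(0.5, intercept_sd)
        beta = abs(rng.gauss(0.0, beta_sd))    # half-normal media coefficient
        sigma = abs(rng.gauss(0.0, sigma_sd))  # half-normal likelihood sigma
        sims.append([rng.gauss(intercept + beta * x, sigma) for x in media])
    return sims


def share_implausible(sims, lower=0.0, upper=3.0):
    """Share of simulated values outside the range the business could observe."""
    values = [v for series in sims for v in series]
    return sum(v < lower or v > upper for v in values) / len(values)


media = [0.0, 0.2, 0.5, 0.8, 1.0, 0.6, 0.3, 0.1]  # scaled spend, 8 weeks

loose = share_implausible(prior_predictive_draws(500, media, 5.0, 5.0, 5.0))
tight = share_implausible(prior_predictive_draws(500, media, 0.2, 0.3, 0.1))
print(f"implausible share, loose priors: {loose:.2f}")
print(f"implausible share, tight priors: {tight:.2f}")
```

Each loose prior looks harmless on its own, yet together they put a large share of simulated weeks below zero or far above the observed scale, which is exactly what inspecting parameters one at a time would miss.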
3. How Abacus supports it
Abacus exposes prior predictive sampling directly on PanelMMM:
In the structured runner, this is Stage 10, the preflight stage. The pipeline
writes:
10_pre_diagnostics/prior_predictive.nc
10_pre_diagnostics/prior_predictive.png
Abacus currently gives you the sampled draws and the plot. It does not
apply an automatic plausibility score or a hard pass/fail gate for you.
4. What to look for
A useful prior predictive check is not about matching the data exactly. That
would defeat the point of a prior. The question is whether the implied target
behaviour is at least in the right universe.
Look for the following.
Level
Do the simulated draws live on roughly the same order of magnitude as the
observed target?
If your historical weekly revenue is in the low millions, prior predictive
draws in the billions are a red flag.
Dispersion
Is the implied volatility remotely plausible?
If the prior predictive distribution is much wider than the observed series,
your likelihood sigma or contribution priors are probably too loose.
Sign and support
Does the model imply values that violate business reality?
For example:
negative conversions
implausibly negative revenue
large oscillations around zero for a strictly positive KPI
These are often signs that the prior scale is too permissive relative to the
data scaling and likelihood choice.
Time pattern
Do the implied trajectories look structurally plausible?
You are not looking for a perfect seasonal pattern before fitting, but you
should ask whether the prior predictive draws look like something that could
have come from your business rather than from a random-number generator with no
economic interpretation.
5. Common failure modes
Several practical pathologies show up repeatedly.
The intercept is too loose
A very wide intercept prior can dominate the prior predictive distribution,
especially when the target has been scaled but the intercept prior is still too
diffuse for the transformed space.
The likelihood sigma is too loose
If the prior predictive draws look far too noisy, the problem is often not the
media priors at all. It is the observation model allowing implausibly large
residual variance.
Media transformation priors are too permissive
Adstock and saturation priors that allow unrealistically persistent carryover
or unrealistically steep response can imply contributions that are wildly too
large before the data has had any say.
Flexible baseline terms are too unconstrained
Time-varying intercepts, seasonality, events, and other additive effects can
all inject structure into the prior predictive distribution. If those priors
are too loose, the target series can become implausibly volatile or
pattern-heavy before fitting.
6. What to do when the prior predictive check looks bad
Do not proceed directly to posterior interpretation. Fix the model first.
Typical remedies:
tighten the intercept prior
tighten the likelihood sigma prior
make media priors weakly informative, concentrated on the economically
plausible region, rather than completely diffuse
reduce unnecessary model flexibility before the data has justified it
check whether your scaling choices make the configured priors too wide or too
narrow on the model scale
This is the Bayesian analogue of catching a broken specification before you
start arguing about p-values.
7. What prior predictive checks do not tell you
Passing a prior predictive check does not mean:
the model is causally identified
the model will fit well
the posteriors will converge cleanly
the attribution decomposition will be trustworthy
It only means the configured priors do not imply obviously absurd target
behaviour before seeing the data.
Treat prior predictive checking as a standard pre-fit step, not as an optional
extra for purists.
In Abacus terms, the workflow should usually be:
Specify the model and priors.
Run sample_prior_predictive(...).
Inspect the implied target behaviour.
Revise the priors if needed.
Fit only once the prior predictive behaviour is broadly plausible.
That sequence is usually cheaper than fitting a badly specified Bayesian MMM
and then discovering that the posterior is unstable for reasons you could have
caught before sampling.
Posterior Predictive Checks
Posterior predictive checking asks a simple question:
After fitting the model, can it reproduce the main features of the observed
data?
For a classically trained econometrician, this is the Bayesian analogue of
residual diagnostics, fitted-versus-observed checks, and out-of-sample
sanity-checking, but with one important difference: the checks are based on the
full posterior distribution, not a single point estimate.
1. What the check actually is
After fitting, you sample from the posterior predictive distribution:
That assessment stage is the closest Abacus comes to a retained,
systematically-produced posterior predictive diagnostics bundle.
4. What to inspect
Observed versus fitted over time
Start with the time-series overlay.
Ask:
Does the fitted mean track the major movements in the target?
Are the predictive intervals wide enough to cover the observed series
reasonably often?
Does the model systematically lag turning points or seasonal peaks?
If the observed line keeps sitting outside the predictive interval in
structured ways, the model is missing something systematic rather than merely
being noisy.
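"Reasonably often" can be quantified as empirical coverage: the share of observations that fall inside the predictive interval. The sketch below uses simulated data and two stand-in sets of predictive draws, one that tracks the trend and one that is flat; the function name, the 94% level, and the simulated series are assumptions for illustration.

```python
import random


def interval_coverage(observed, predictive_draws, prob=0.94):
    """Share of observations inside the central `prob` predictive interval."""
    tail = (1.0 - prob) / 2.0
    inside = 0
    for t, y in enumerate(observed):
        values = sorted(draw[t] for draw in predictive_draws)
        lower = values[int(tail * len(values))]
        upper = values[int((1.0 - tail) * len(values)) - 1]
        inside += lower <= y <= upper
    return inside / len(observed)


rng = random.Random(11)
n_weeks = 200
trend = [10.0 + 0.05 * t for t in range(n_weeks)]
observed = [m + rng.gauss(0.0, 1.0) for m in trend]

# Stand-in predictive draws from a model that captures the trend...
good_draws = [[m + rng.gauss(0.0, 1.0) for m in trend] for _ in range(500)]
# ...and from a model with a flat baseline that misses it.
flat_draws = [[15.0 + rng.gauss(0.0, 1.0) for _ in trend] for _ in range(500)]

coverage_good = interval_coverage(observed, good_draws)
coverage_flat = interval_coverage(observed, flat_draws)
print(f"coverage with trend: {coverage_good:.2f}  flat baseline: {coverage_flat:.2f}")
```

A well-specified model lands near the nominal level, while the flat model covers only the middle of the sample and misses the early and late weeks in a structured way.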
Residual structure
Residuals should not show strong unresolved patterns.
In practice, look for:
long runs of positive residuals followed by long runs of negative residuals
clear seasonality left in the residuals
residual variance increasing with fitted values
one panel slice fitting much worse than the others
The presence of structure in the residuals usually means the model is still
under-specified for the data.
Scatter of fitted versus observed
The fitted-versus-observed scatter is not a formal test, but it quickly shows:
compression toward the mean
systematic underprediction at high values
systematic overprediction at low values
This is the Bayesian cousin of the fitted-value plots you would inspect after a
classical regression.
5. What “good” posterior predictive behaviour looks like
A good posterior predictive check does not mean the model matches every
wiggle exactly.
You are looking for something more practical:
the main level and variation are captured
the observed series falls inside plausible predictive ranges often enough
residuals are not strongly structured
panel slices are not failing in obviously asymmetric ways
The question is whether the model is adequate for interpretation, not whether
it is perfect.
6. What posterior predictive checks cannot prove
This is the most important warning.
A model can pass posterior predictive checks and still fail as a causal model.
Why? Because posterior predictive checks evaluate prediction of the target,
not causal attribution of the components.
Two models can predict sales equally well while assigning very different shares
of those sales to:
baseline
media
controls
seasonality
events
That is why posterior predictive checking must be paired with:
If the fitted line misses broad movements or regime changes, the model may need
more structural flexibility, for example in trend, seasonality, controls, or
events.
The model is too flexible in the wrong place
You may see good in-sample fit but strange residual behaviour or unstable
attribution because the model is fitting noise through components that should
remain more constrained.
Media is carrying baseline structure
If media spend is strongly correlated with time patterns, the model may let
media soak up baseline variation that should have been handled by intercept,
seasonality, controls, or other additive structure.
Baseline is carrying media structure
The reverse can also happen: a very flexible baseline can absorb variation that
you would otherwise attribute to media.
8. What to do when checks fail
If posterior predictive checks look bad, resist the temptation to jump straight
to interpreting coefficients anyway.
Instead:
Check convergence first.
Inspect residual structure rather than only aggregate fit.
Revisit baseline specification, controls, seasonality, events, and media
transformation choices.
Refit and compare again.
In other words, use posterior predictive checking as a model-development tool,
not just as a reporting plot.
9. Practical recommendation
In Abacus, the robust sequence is:
Run prior predictive checks before fitting.
Fit the model and verify MCMC diagnostics.
Run posterior predictive checks and inspect residuals.
Only then move to contributions, optimisation, or causal interpretation.
That order mirrors how a careful econometrician would already work, except that
the Bayesian workflow makes the predictive-check step much richer and more
honest about uncertainty.
Model Comparison
You have spent your career comparing models with AIC, BIC, adjusted $R^2$, and the occasional likelihood ratio test. These tools are elegant, fast, and deeply embedded in econometric practice. They are also, in the Bayesian setting, either inapplicable or subtly misleading. This document explains the Bayesian model comparison toolkit — LOO-CV, ELPD, posterior predictive checks — by mapping each concept back to something you already understand. We also discuss the pitfalls that arise when comparing ELPD across models, because this is where even experienced practitioners make mistakes.
1. Why AIC and BIC Do Not Transfer Cleanly
AIC and BIC are derived from the maximised log-likelihood and a penalty term that counts the number of free parameters. The logic is intuitive: a model that fits the data well (high log-likelihood) but uses many parameters (high complexity) is penalised, preventing overfitting.
In a Bayesian model, the concept of “number of free parameters” becomes ambiguous. Consider a hierarchical prior on media coefficients: eight channel-level coefficients are partially pooled toward a shared group mean. Are there eight free parameters, or one? The answer depends on how much pooling the data induces. If the group mean dominates, the effective number of parameters is closer to one. If each channel estimate ignores the group mean, the effective number is closer to eight. The truth lies somewhere in between, and it changes depending on the data.
BIC fares no better. Its derivation assumes that the posterior concentrates on a single point (the MLE) as the sample size grows. In a fully Bayesian model with informative priors and moderate sample sizes — precisely the setting of most MMMs — this assumption fails. The posterior is a genuine distribution, not a spike, and BIC’s penalty term does not account for the regularisation imposed by the prior.
You can still compute AIC and BIC from a Bayesian model by plugging in the posterior mean and the nominal parameter count, and some software will do this for you. But the resulting numbers do not have their usual theoretical justification, and they can mislead you into selecting the wrong model.
2. LOO-CV: The Gold Standard for Predictive Model Comparison
The Bayesian replacement for information criteria is Leave-One-Out Cross-Validation (LOO-CV), computed via an efficient approximation called Pareto-Smoothed Importance Sampling (PSIS-LOO). The implementation in ArviZ (which Abacus uses) makes this computation fast enough to run routinely.
The intuition maps directly to something every econometrician understands: out-of-sample prediction. Imagine you have $N$ observations. For each observation $i$, you refit the model on the remaining $N - 1$ observations and compute the predictive density for the held-out observation $i$. The sum of the $N$ log predictive densities gives you the Expected Log Pointwise Predictive Density (ELPD).
In practice, you do not actually refit the model $N$ times. PSIS-LOO uses importance sampling to approximate each leave-one-out posterior from the full-data posterior, making the computation nearly free once the model has been fitted. The Pareto-smoothing step stabilises the importance weights, and the shape parameter of the fitted Pareto distribution (the Pareto-$k$ diagnostic) tells you how reliable each approximation is.
For an econometrician, ELPD is the Bayesian analogue of the out-of-sample log-likelihood that motivates AIC. In fact, AIC can be interpreted as an asymptotic approximation to LOO-CV. The difference is that LOO-CV makes no asymptotic assumptions, fully accounts for the prior, and works correctly even when the effective number of parameters is ambiguous.
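Written out, the quantity being estimated and the importance-sampling shortcut look as follows. This is the standard PSIS-LOO formulation; the symbols $S$ (number of posterior draws), $\theta^{(s)}$ (draw $s$ from the full-data posterior), $r_i^{(s)}$ (raw importance ratios), and $w_i^{(s)}$ (their Pareto-smoothed versions) are notation introduced here for illustration.

```latex
\mathrm{elpd}_{\mathrm{loo}} = \sum_{i=1}^{N} \log p(y_i \mid y_{-i}),
\qquad
p(y_i \mid y_{-i}) = \int p(y_i \mid \theta)\, p(\theta \mid y_{-i})\, d\theta

% Approximation using S draws from the full-data posterior:
p(y_i \mid y_{-i}) \approx
  \frac{\sum_{s=1}^{S} w_i^{(s)}\, p(y_i \mid \theta^{(s)})}
       {\sum_{s=1}^{S} w_i^{(s)}},
\qquad
r_i^{(s)} = \frac{1}{p(y_i \mid \theta^{(s)})}
```

The raw ratios $r_i^{(s)}$ up-weight draws under which observation $i$ was poorly predicted, mimicking a posterior that never saw it. Pareto smoothing replaces the largest ratios with fitted values to obtain $w_i^{(s)}$, and the shape of that fit is the Pareto-$k$ diagnostic.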
3. Reading the ELPD Output
When you run az.loo() in ArviZ (or access LOO diagnostics through an Abacus model), the output reports several quantities that deserve careful interpretation.
The first is elpd_loo, the estimated expected log pointwise predictive density. This is a single number that summarises the model’s out-of-sample predictive performance. Higher (less negative) values indicate better predictive accuracy. On its own, the absolute value of ELPD is not very informative — it depends on the scale of the data and the number of observations. ELPD becomes useful only when you compare it across models fitted to the same data.
The second is p_loo, the effective number of parameters. This quantity captures the complexity of the model as measured by how much each observation influences its own prediction. A model with strong regularisation (tight priors, heavy pooling) will have a small $p_\text{loo}$ relative to its nominal parameter count, because the priors constrain the flexibility. A model with weak regularisation will have $p_\text{loo}$ closer to the nominal count. If $p_\text{loo}$ exceeds the nominal number of parameters, the model is misspecified or the PSIS approximation has broken down.
The third is the standard error of the ELPD estimate, shown in the SE column of the ArviZ output (some other tools label it se_elpd_loo). This is crucial for model comparison and is where many practitioners make errors. We address this in detail below.
4. Comparing Models: The ELPD Difference and Its Standard Error
Suppose you have fitted two models to the same dataset and computed ELPD for each. Model A has $\text{ELPD}_A = -320$ and Model B has $\text{ELPD}_B = -315$. Model B appears to predict better. But is the difference meaningful, or is it within noise?
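The function az.compare() in ArviZ ranks the models by ELPD and reports, for each model, the difference from the top-ranked model (the elpd_diff column) together with the standard error of that difference (the dse column). For the two models here, $\Delta\text{ELPD} = \text{ELPD}_B - \text{ELPD}_A = 5$. The standard error of the difference is computed from the pointwise ELPD values (one per observation), so it accounts for the correlation between the two models’ predictions.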
The interpretation is analogous to a classical hypothesis test. If $|\Delta\text{ELPD}|$ is large relative to its standard error (say, greater than 2 SE), you have reasonable evidence that one model predicts better than the other. If the difference is smaller than 2 SE, the models are indistinguishable in predictive performance, and you should prefer the simpler or more interpretable model on non-statistical grounds.
However — and this is the critical caveat — the standard error of $\Delta\text{ELPD}$ is itself an estimate, and it can be unreliable when the pointwise ELPD differences are heavy-tailed. A handful of influential observations (outliers that one model handles much better than the other) can inflate the standard error dramatically, making a genuine difference look insignificant. Conversely, if both models fail on the same outliers in the same way, the standard error can be artificially small, making a meaningless difference look significant.
The practical recommendation is to always inspect the pointwise ELPD differences alongside the aggregate comparison. If a small number of observations drive most of the difference, investigate those observations individually before concluding that one model is superior.
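The arithmetic behind the comparison and the pointwise inspection can be sketched in a few lines. This is an illustration of the calculation, not ArviZ's code: the function names are hypothetical, the pointwise values are made up, and implementations differ on whether the variance uses an $n$ or $n-1$ denominator.

```python
import math
import statistics


def elpd_difference(pointwise_a, pointwise_b):
    """Return (delta, se, diffs) for ELPD_B - ELPD_A from pointwise values."""
    diffs = [b - a for a, b in zip(pointwise_a, pointwise_b)]
    delta = sum(diffs)
    # Paired standard error: sqrt(n * variance of the pointwise differences).
    se = math.sqrt(len(diffs) * statistics.pvariance(diffs))
    return delta, se, diffs


def share_from_top_k(diffs, k):
    """Fraction of the total absolute difference due to the k largest |diff|."""
    ranked = sorted((abs(d) for d in diffs), reverse=True)
    return sum(ranked[:k]) / sum(ranked)


# Made-up pointwise ELPD values for eight observations.
# Observation 5 is an outlier that model B handles far better than model A.
elpd_a = [-3.2, -3.1, -3.4, -3.0, -3.3, -9.0, -3.1, -3.2]
elpd_b = [-3.1, -3.2, -3.3, -3.1, -3.2, -4.0, -3.2, -3.1]

delta, se, diffs = elpd_difference(elpd_a, elpd_b)
print(f"delta ELPD = {delta:.2f}, SE = {se:.2f}, ratio = {delta / se:.2f}")
print(f"share of difference from the single largest point: {share_from_top_k(diffs, 1):.0%}")
```

In this toy case the aggregate difference is well under 2 SE, and almost all of it comes from one observation. That is exactly the situation in which the observation itself deserves attention before either model is declared the winner.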
5. Pareto-k Diagnostics: When to Trust the Approximation
The PSIS-LOO approximation relies on importance sampling, and importance sampling can fail when individual observations are highly influential — that is, when removing a single observation substantially changes the posterior. The Pareto-$k$ diagnostic measures this influence for each observation.
For an econometrician, Pareto-$k$ plays a role analogous to Cook’s distance or leverage in OLS diagnostics. A high-leverage observation in OLS disproportionately influences the coefficient estimates. A high Pareto-$k$ observation in LOO-CV disproportionately influences the ELPD estimate, and the importance sampling approximation for that observation may be unreliable.
The conventional thresholds are straightforward. Pareto-$k$ values below 0.5 indicate that the PSIS approximation is reliable for that observation, and values between 0.5 and 0.7 are generally still acceptable. Values between 0.7 and 1.0 indicate that the estimate for that observation is unreliable: the importance-sampling estimate converges slowly and may be biased, so it should not be taken at face value. Values above 1.0 indicate that the importance sampling approximation has broken down for that observation, and the reported ELPD is not trustworthy.
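A small helper makes it easy to see which observations need follow-up. The sketch below is hypothetical and library-independent: it bins a list of Pareto-$k$ values at the 0.7 and 1.0 cut-offs and returns the indices in each bin. In practice the values would come from your LOO output (for example, the pointwise Pareto-$k$ array that ArviZ returns when LOO is computed with pointwise=True); the numbers here are invented.

```python
def summarise_pareto_k(k_values, warn=0.7, fail=1.0):
    """Group observation indices by Pareto-k band."""
    bands = {"ok": [], "check": [], "failed": []}
    for index, k in enumerate(k_values):
        if k < warn:
            bands["ok"].append(index)
        elif k <= fail:
            bands["check"].append(index)   # investigate or refit these points
        else:
            bands["failed"].append(index)  # PSIS estimate not trustworthy
    return bands


# Invented Pareto-k values for five observations.
example_k = [0.10, 0.45, 0.72, 1.30, 0.69]
bands = summarise_pareto_k(example_k)
print(bands)
```

The indices in the "check" and "failed" bands are the ones to look up in the original data: which weeks, which panel units, and what was happening commercially at the time.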
When you encounter high Pareto-$k$ values, several remedies are available. One is moment matching, which adjusts the importance-sampling draws for the problematic observations and is available in some PSIS-LOO implementations (for example, the R loo package). If that is unavailable or fails, you can refit the model with the offending observations actually held out (exact LOO-CV for those points only); ArviZ provides az.reloo() for this purpose. More fundamentally, high Pareto-$k$ values often signal that the model is misspecified for those observations: perhaps they are genuine outliers, or the model’s functional form fails in that region of the data. Investigating why specific observations are influential is often more valuable than fixing the diagnostic.
6. Posterior Predictive Checks: The Bayesian Goodness-of-Fit Test
ELPD and LOO-CV are relative metrics: they tell you which model predicts better, but they cannot tell you whether any of your models predict well in an absolute sense. For that, you need posterior predictive checks.
The idea is simple. Once you have fitted a model, you generate simulated datasets from the posterior predictive distribution — that is, you sample parameter values from the posterior and then simulate new data from the likelihood. You then compare the distribution of these simulated datasets to the observed data. If the simulations look like the real data, the model is capturing the key features of the data-generating process. If not, the model is missing something important.
For an econometrician, posterior predictive checks are the Bayesian analogue of residual diagnostics, but more powerful. Instead of checking whether residuals are normally distributed or homoscedastic, you can check any feature of the data. Does the model reproduce the seasonal pattern? Does it capture the right degree of week-to-week volatility? Does the distribution of simulated total annual sales match the observed total? Each of these questions becomes a visual or numerical comparison between the real data and the posterior predictive distribution.
The key advantage over classical residual analysis is that posterior predictive checks incorporate parameter uncertainty. Classical residuals are computed at the point estimate, which can mask model deficiencies when the standard errors are large. Posterior predictive simulations are drawn from the full posterior, so they honestly reflect how much the model’s predictions could vary even if the model is correctly specified.
In practice, we recommend running posterior predictive checks before computing ELPD or comparing models. If the posterior predictive distribution fails to reproduce basic features of the data (the mean, the variance, the seasonal pattern), the model is misspecified at a fundamental level, and comparing its ELPD to another model’s ELPD is an exercise in choosing the least bad option rather than selecting a good model.
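The mechanics of a posterior predictive check can be sketched without any modelling library. The example below is deliberately artificial: the "observed" series is synthetic and seasonal, the model is a plain i.i.d. normal, and the "posterior draws" of its mean and scale are fabricated around the sample moments rather than produced by a sampler. All names are illustrative. It compares two test statistics, the mean (which the model should reproduce) and the lag-1 autocorrelation (which it should not).

```python
import math
import random
import statistics


def lag1_autocorrelation(series):
    mean = statistics.fmean(series)
    num = sum((a - mean) * (b - mean) for a, b in zip(series, series[1:]))
    den = sum((a - mean) ** 2 for a in series)
    return num / den


def posterior_predictive_p_value(observed, draws, statistic, rng):
    """Share of replicated datasets whose statistic is >= the observed one."""
    t_obs = statistic(observed)
    exceed = 0
    for mu, sigma in draws:
        replicated = [rng.gauss(mu, sigma) for _ in observed]
        exceed += statistic(replicated) >= t_obs
    return exceed / len(draws)


rng = random.Random(42)
weeks = 104
# Synthetic "observed" sales with a yearly cycle the i.i.d. model ignores.
observed = [100 + 20 * math.sin(2 * math.pi * t / 52) + rng.gauss(0, 5)
            for t in range(weeks)]

# Stand-in posterior draws (mu, sigma) for an i.i.d. normal model.
y_mean, y_sd = statistics.fmean(observed), statistics.stdev(observed)
draws = [(rng.gauss(y_mean, y_sd / math.sqrt(weeks)),
          y_sd * math.exp(rng.gauss(0, 0.07)))
         for _ in range(500)]

p_mean = posterior_predictive_p_value(observed, draws, statistics.fmean, rng)
p_acf = posterior_predictive_p_value(observed, draws, lag1_autocorrelation, rng)
print(f"p-value for the mean: {p_mean:.2f}")
print(f"p-value for lag-1 autocorrelation: {p_acf:.2f}")
```

A posterior predictive p-value near 0 or 1 means the replicated data almost never look like the observed data on that feature. Here the mean passes while the autocorrelation fails, which points to the missing seasonal structure rather than to a generic "bad fit".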
7. When Model Comparison Is Meaningful and When It Is Not
Not all model comparisons are informative, and econometricians should exercise the same caution here that they would when comparing nested versus non-nested classical specifications.
ELPD comparisons are meaningful when the two models are fitted to exactly the same dataset, with exactly the same observations and the same target variable. If one model drops missing values differently, or transforms the target variable (e.g., one model predicts $y$ and the other predicts $\log(y)$), the ELPD values are on different scales and cannot be compared directly. This is analogous to the well-known prohibition against comparing $R^2$ across models with different dependent variables in classical econometrics.
ELPD comparisons are also meaningful only when the Pareto-$k$ diagnostics are acceptable for both models. If one model has many observations with Pareto-$k$ above 0.7, its ELPD estimate is unreliable, and the comparison is confounded by approximation error rather than genuine predictive differences.
ELPD comparisons are less informative when the models differ in ways that do not affect prediction but do affect causal interpretation. Two models might produce nearly identical ELPD values — predicting sales equally well out of sample — while attributing completely different proportions of sales to TV versus search. This is the identification problem discussed in the causal identification FAQ: predictive equivalence does not imply causal equivalence. A model that attributes 30% of sales to TV and 10% to search might predict just as well as a model that attributes 20% to each, because the total media contribution is the same. ELPD cannot distinguish between these models, because it evaluates prediction, not attribution.
For this reason, we recommend treating ELPD as a necessary but not sufficient criterion for model selection. Use it to eliminate models that predict poorly. Use posterior predictive checks to verify that the surviving models capture the essential features of the data. Then use substantive economic reasoning, lift test calibration, and domain expertise to choose among predictively equivalent models based on the plausibility of their causal attributions.
8. A Practical Mapping from Classical to Bayesian Model Selection
To consolidate the discussion, here is how each classical tool maps to its Bayesian replacement.
Adjusted $R^2$ measures in-sample fit penalised by the number of parameters. The Bayesian analogue is the posterior predictive $R^2$ proposed by Gelman, Goodrich, Gabry, and Vehtari (2019), which computes $R^2$ from the posterior predictive distribution rather than a point estimate. Unlike classical adjusted $R^2$, posterior predictive $R^2$ comes with a full distribution (one value per posterior draw), so you can report its uncertainty.
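As a rough illustration of the per-draw calculation, the sketch below uses the residual-based form from the cited paper: for each posterior draw, $R^2$ is the variance of the fitted values divided by the sum of that variance and the variance of the residuals. The function name and the tiny dataset are made up; in a real workflow the fitted-value draws would come from the fitted model.

```python
import statistics


def bayesian_r2(y, fitted_draws):
    """One R^2 value per posterior draw of the fitted values."""
    values = []
    for y_hat in fitted_draws:
        residuals = [obs - fit for obs, fit in zip(y, y_hat)]
        var_fit = statistics.variance(y_hat)
        var_res = statistics.variance(residuals)
        values.append(var_fit / (var_fit + var_res))
    return values


# Made-up observations and three made-up draws of the fitted values.
y = [10.0, 12.0, 15.0, 11.0, 18.0, 20.0]
fitted_draws = [
    [10.5, 12.0, 14.0, 12.0, 17.0, 19.5],
    [9.5, 12.5, 15.5, 11.0, 18.5, 19.0],
    [11.0, 11.5, 14.5, 12.5, 16.5, 20.5],
]

r2_draws = bayesian_r2(y, fitted_draws)
print([round(value, 3) for value in r2_draws])
```

Because the result is a list with one value per draw, its spread can be summarised with an interval in the same way as any other posterior quantity.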
AIC measures asymptotic out-of-sample predictive performance. The Bayesian analogue is ELPD estimated via PSIS-LOO. ELPD is more general (no asymptotic assumptions), fully accounts for the prior, and handles hierarchical models correctly.
BIC targets model identification rather than prediction (it is consistent for the true model as $N \to \infty$). There is no direct Bayesian analogue that serves the same purpose, because Bayesian model comparison via ELPD is inherently predictive. If you want to identify the “true” model in a Bayesian framework, you would use Bayes factors, but Bayes factors are sensitive to the prior specification in ways that ELPD is not, and we do not generally recommend them for MMM applications.
The likelihood ratio test compares nested models by examining whether the additional parameters significantly improve the likelihood. The Bayesian replacement is the ELPD difference with its standard error. If the ELPD difference exceeds roughly 2 standard errors, the more complex model predicts meaningfully better. If not, prefer the simpler model.
Classical residual diagnostics (Durbin-Watson, Breusch-Pagan, Q-Q plots) check model assumptions after fitting. The Bayesian replacement is posterior predictive checking, which is more flexible (you can check any data feature, not just residual properties) and more honest (it incorporates parameter uncertainty).
In every case, the Bayesian tool is at least as informative as its classical counterpart and often more so. The cost is unfamiliarity. We hope this document has reduced that cost.
Causal Identification
If you are a classically trained econometrician, you have every right to be sceptical of Marketing Mix Models. The causal identification strategy underpinning MMM is weaker than the methods you were taught to trust. This document confronts that reality head-on: we explain what MMM can and cannot claim causally, where the identifying assumptions break down, and how modern calibration techniques partially rescue the framework. We also place MMM on the “causal ladder” relative to the gold-standard methods you already know.
Our goal is not to oversell MMM. It is to give you an honest accounting of the trade-offs, so you can deploy the tool where it is defensible and flag where it is not.
1. The Identification Problem, Plainly Stated
Every causal claim rests on an identification strategy — a logical argument for why the estimated relationship reflects a true causal effect rather than a statistical artefact. In classical econometrics, you learned several strategies, each with a well-understood set of assumptions. Consider three that you know well.
A randomised controlled trial (RCT) identifies a causal effect by physically randomising treatment assignment. Because randomisation breaks the link between treatment and all confounders (observed and unobserved), the simple difference in means is an unbiased estimator of the average treatment effect. The assumption is minimal: the randomisation was executed correctly.
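An instrumental variables (IV/2SLS) estimator identifies a causal effect by exploiting an instrument, a variable that affects the outcome only through the endogenous treatment. The identifying assumptions are relevance (the instrument predicts the treatment) and the exclusion restriction (the instrument has no direct effect on the outcome). These assumptions are only partly testable: relevance can be checked directly from the first-stage regression, whereas the exclusion restriction can be probed only indirectly, for example through overidentification tests when there are more instruments than endogenous regressors.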
A difference-in-differences (DiD) estimator identifies a causal effect by comparing the change in outcomes over time between a treated and control group. The identifying assumption is parallel trends: absent treatment, the two groups would have followed the same trajectory. Again, this assumption is partially testable using pre-treatment data.
Now consider what MMM does. An MMM estimates media effects by regressing sales (or another KPI) on media spend and controls over time. The variation it exploits is temporal: weeks when TV spend was high are compared to weeks when TV spend was low, after controlling for seasonality, trend, and other observables.
The identifying assumption is strict exogeneity of the media regressors, conditional on the controls. In plain language: after we account for trend, seasonality, holidays, and any included control variables, the remaining variation in media spend is “as good as random” with respect to the error term. If an unobserved, time-varying confounder drives both media spend and sales simultaneously — and we have not controlled for it — the media coefficient is biased.
This is a strong assumption. And unlike the IV exclusion restriction or the DiD parallel trends assumption, it is essentially untestable. You cannot run a placebo check on an unobserved confounder you have not measured.
2. Where the Assumptions Break Down
The strict exogeneity assumption fails in practice more often than MMM practitioners care to admit. Consider three common violations.
The first is simultaneity. Media planners increase spend during periods when they expect sales to be high (Christmas, product launches, promotional windows). Sales are high in those periods not because of the advertising but because of the underlying demand shock. The MMM attributes the demand shock to the media channel, inflating its estimated effect. This is textbook endogeneity, identical to the problem that motivates IV estimation in labour economics or IO.
The second is omitted variable bias from time-varying confounders. Suppose a competitor launches an aggressive pricing campaign in Q3, simultaneously causing your sales to drop and your marketing team to increase defensive spend. The MMM sees high spend coinciding with low sales and may underestimate the media effect. If instead the competitor withdraws, the reverse happens. Without a “competitor activity” control, the media coefficient absorbs the confounding variation.
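A short simulation shows how large this kind of bias can be. The sketch below is a toy data-generating process, not an MMM: an unobserved demand shock raises both spend and sales, the true spend effect is set to 0.5, and all coefficients are invented for illustration. It compares the naive regression slope with the slope obtained when the confounder is included as a control.

```python
import random


def centred_sum(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))


def simple_slope(x, y):
    """OLS slope of y on x (with intercept)."""
    return centred_sum(x, y) / centred_sum(x, x)


def slope_controlling_for(x, z, y):
    """OLS coefficient on x when regressing y on x and z (with intercept)."""
    sxx, szz, sxz = centred_sum(x, x), centred_sum(z, z), centred_sum(x, z)
    sxy, szy = centred_sum(x, y), centred_sum(z, y)
    return (szz * sxy - sxz * szy) / (sxx * szz - sxz**2)


rng = random.Random(7)
TRUE_EFFECT = 0.5  # assumed causal effect of spend on sales

demand = [rng.gauss(0, 1) for _ in range(5000)]            # confounder
spend = [1.0 + 0.8 * d + rng.gauss(0, 0.5) for d in demand]
sales = [2.0 + TRUE_EFFECT * s + 1.0 * d + rng.gauss(0, 0.5)
         for s, d in zip(spend, demand)]

naive = simple_slope(spend, sales)
controlled = slope_controlling_for(spend, demand, sales)
print(f"true effect: {TRUE_EFFECT}, naive: {naive:.2f}, controlled: {controlled:.2f}")
```

With these settings the naive slope comes out well above the true effect, while the controlled slope sits close to it. In a real engagement the confounder is usually not observed, which is precisely why the bias cannot be corrected by regression alone.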
The third is functional form misspecification. Even if the true data-generating process satisfies strict exogeneity, specifying the wrong functional form (linear when the truth is concave, or missing an interaction between channels) introduces bias. MMM frameworks like Abacus mitigate this with flexible non-linear transforms (adstock, saturation), but no parametric family can guarantee correct specification.
3. How Lift Test Calibration Partially Rescues MMM
Modern Bayesian MMM frameworks, including Abacus, address the endogeneity problem through calibration with incrementality experiments (lift tests or geo-experiments). The logic works as follows.
A lift test is a controlled experiment — typically a geo-randomised or matched-market design — in which media exposure is deliberately varied across treatment and control regions. Because the variation is experimentally induced, the resulting incremental estimate is causally identified in the RCT sense, at least for the specific channel, time window, and geography tested.
When you feed this lift test estimate into the MMM (via the EventAdditiveEffect or lift test calibration API in Abacus), you inject an external piece of causal evidence into the model’s likelihood. The Bayesian machinery then updates the media coefficient posterior to be consistent with both the observational time-series data and the experimental result. In effect, the lift test acts as an anchor: it constrains the media coefficient to a causally credible region, even if the observational data alone would have produced a biased estimate.
Think of the lift test as playing a role analogous to an instrumental variable. The IV provides exogenous variation that identifies the causal effect. The lift test provides exogenous variation (from the experiment) that calibrates the observational estimate. The difference is that the IV is embedded inside the estimator, whereas the lift test enters as an informative prior or likelihood penalty.
This approach does not eliminate all bias. The lift test identifies the causal effect for one channel in one time window. Extrapolating that result across all channels and all time periods requires additional assumptions (stability of the effect over time, no interaction between the calibrated and uncalibrated channels). But it is a genuine improvement over pure observational MMM, and it brings the framework closer to the causal credibility that econometricians demand.
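The anchoring logic can be illustrated with a deliberately simplified calculation. Assume, purely for illustration, that the observational estimate and the lift test estimate of the same channel effect are independent and approximately normal; combining them is then a precision-weighted average. This is not how Abacus implements calibration, and the numbers are invented.

```python
import math


def combine_normal(mean_a, sd_a, mean_b, sd_b):
    """Precision-weighted combination of two independent normal estimates."""
    precision_a, precision_b = 1 / sd_a**2, 1 / sd_b**2
    total = precision_a + precision_b
    mean = (precision_a * mean_a + precision_b * mean_b) / total
    return mean, math.sqrt(1 / total)


# Invented numbers: a possibly inflated observational estimate and a
# tighter experimental estimate for the same channel.
observational = (1.4, 0.30)
lift_test = (0.6, 0.15)

calibrated_mean, calibrated_sd = combine_normal(*observational, *lift_test)
print(f"calibrated effect: {calibrated_mean:.2f} (sd {calibrated_sd:.3f})")
```

The combined estimate is pulled most of the way towards the lift test because the experiment is the more precise source, and the combined uncertainty is smaller than either input. A full model does this jointly with every other parameter, but the direction of the adjustment is the same.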
4. MMM on the Causal Ladder
We can place MMM relative to the methods you trust by thinking about a hierarchy of identification strategies, ordered from those that rest on the weakest (most credible) identifying assumptions to those that rest on the strongest.
At the top sits the RCT. Randomisation eliminates all confounding, and the only threat to validity is implementation failure (non-compliance, attrition, spillovers). For media measurement, the RCT analogue is a well-executed geo-experiment or a randomised holdout test. When you can run one, run one.
One rung below sits IV/2SLS. The instrument provides exogenous variation, but only if the exclusion restriction holds. In media measurement, genuine instruments are rare. Weather shocks that affect outdoor advertising exposure, or regulatory changes that force abrupt spend shifts, occasionally qualify. But most marketing datasets lack a credible instrument.
Below IV sit DiD and synthetic control methods. These exploit a treatment event (a campaign launch, a market entry) and compare treated versus control units under a parallel trends assumption. Geo-experiments with a staggered rollout fit naturally into this framework. The assumption is testable but not guaranteed.
Below DiD sits regression discontinuity (RD), which exploits a sharp threshold in treatment assignment. Media applications are uncommon because advertising spend rarely exhibits the kind of sharp discontinuity that RD requires.
And then we arrive at the observational regression, which is where standard MMM lives. The identifying assumptions are the strongest, and therefore the hardest to defend, in the hierarchy: conditional exogeneity given controls, correct functional form, and no unobserved time-varying confounders. Without external calibration, this is the least credible causal claim on the ladder.
However, MMM calibrated with lift tests occupies a hybrid position. The observational regression provides the structure and the time-series variation. The lift test provides a causally identified anchor point. Together, they produce an estimate that is stronger than pure observational regression but weaker than a full RCT across all channels. In practice, this hybrid is the best that most marketing organisations can achieve at scale, because running a separate RCT for every channel, every quarter, in every market, is prohibitively expensive.
5. The Role of DAGs and Structural Thinking
If you are trained in the Pearlian causal inference tradition (directed acyclic graphs, do-calculus, the structural causal model), you will recognise that MMM implicitly assumes a particular DAG. The assumed structure looks roughly like this: media spend causes sales, seasonality and trend cause sales, controls cause sales, and (critically) nothing unobserved simultaneously causes both media spend and sales after conditioning on the included controls.
Drawing this DAG explicitly is a powerful exercise. It forces you to articulate every backdoor path between media and sales, and to verify that your control set blocks them all. If you identify a backdoor path that your controls do not block — for example, “competitor pricing → our media spend” and “competitor pricing → our sales” — you have found a source of bias that the MMM cannot resolve without either adding a control for competitor pricing or calibrating with a lift test.
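Once the DAG is written down, the backdoor check itself can be automated. The sketch below is a minimal, dependency-free illustration for small graphs: it enumerates paths between media and sales that start with an arrow into media and reports those left open by a given control set. The node names and edges are hypothetical, and the check covers only path blocking (it does not verify that the controls exclude descendants of the treatment).

```python
EDGES = [  # (cause, effect); a hypothetical DAG for illustration
    ("competitor_pricing", "media_spend"),
    ("competitor_pricing", "sales"),
    ("seasonality", "media_spend"),
    ("seasonality", "sales"),
    ("media_spend", "sales"),
]


def simple_paths(edges, start, end):
    """Yield all simple paths between start and end, ignoring edge direction."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    stack = [[start]]
    while stack:
        path = stack.pop()
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt in path:
                continue
            if nxt == end:
                yield path + [nxt]
            else:
                stack.append(path + [nxt])


def descendants(edges, node):
    found, frontier = set(), [node]
    while frontier:
        current = frontier.pop()
        for a, b in edges:
            if a == current and b not in found:
                found.add(b)
                frontier.append(b)
    return found


def is_blocked(edges, path, controls):
    edge_set = set(edges)
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, mid) in edge_set and (nxt, mid) in edge_set
        if collider:
            # A collider blocks unless it or a descendant is conditioned on.
            if mid not in controls and not (descendants(edges, mid) & controls):
                return True
        elif mid in controls:
            return True  # conditioning on a chain or fork node blocks the path
    return False


def open_backdoor_paths(edges, treatment, outcome, controls):
    edge_set = set(edges)
    return [path for path in simple_paths(edges, treatment, outcome)
            if (path[1], path[0]) in edge_set            # arrow into treatment
            and not is_blocked(edges, path, set(controls))]


print(open_backdoor_paths(EDGES, "media_spend", "sales", {"seasonality"}))
print(open_backdoor_paths(EDGES, "media_spend", "sales",
                          {"seasonality", "competitor_pricing"}))
```

With only seasonality controlled, the path through competitor pricing remains open; adding competitor pricing to the control set closes it. The value of the exercise lies less in the code than in having to write the edge list in the first place.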
We strongly recommend that every MMM engagement begins with a causal DAG workshop, even an informal one. The DAG does not make the model causal. But it forces the team to be explicit about what they are assuming, and it provides a framework for discussing where the model’s causal claims are credible and where they are not.
6. Honest Counsel for Sceptical Econometricians
We close with five points of honest counsel.
First, do not treat MMM outputs as causal estimates with the same confidence you would place in a well-identified IV or DiD result. They are not. They are conditional associations, regularised by Bayesian priors and (ideally) anchored by experimental calibration.
Second, always ask: “What is the identifying variation?” If the answer is “weeks when spend was high versus weeks when spend was low,” follow up with: “Why was spend high in those weeks? Could the same factor that drove high spend also have driven high sales independently?” If the answer is “yes” or “maybe,” the estimate is potentially confounded.
Third, calibrate wherever possible. A single well-executed lift test for your largest channel does more for the credibility of the entire model than any amount of prior tuning or functional form experimentation.
Fourth, use the model for what it does well. MMM excels at relative channel comparison (channel A versus channel B), at budget allocation (given a fixed total budget, how should we distribute it?), and at scenario planning (what happens if we increase TV spend by 20%?). These tasks depend mainly on getting the relative ranking of media effects and the shape of the response curves right, rather than on unbiased point estimates of every coefficient. Even a moderately biased MMM can rank channels correctly if the bias is roughly proportional across channels, although scenario forecasts of absolute uplift inherit whatever bias remains.
Fifth, be transparent with stakeholders. Present posterior credible intervals, not point estimates. Discuss the assumptions openly. Flag where calibration data exists and where it does not. The credibility of the framework depends not on pretending the model is an RCT, but on demonstrating that the team understands its limitations and has taken concrete steps to mitigate them.
Baseline vs Media Trade-offs
One of the most confusing experiences in MMM is this:
- two specifications can fit the target series almost equally well
- both can have acceptable diagnostics
- yet they can assign very different amounts of the target to media versus baseline
This is not necessarily a bug in the software. It is a structural feature of
the problem.
This page explains how that trade-off appears in Abacus and why you should
expect it.
1. The decomposition problem
At a high level, Abacus builds the expected target from several additive
components.
In the retained PanelMMM build path, the mean function can include:
- intercept_contribution
- channel_contribution
- control_contribution, if you configure control_columns
- mundlak_contribution, if use_mundlak_cre=True
- yearly_seasonality_contribution, if yearly_seasonality is enabled
- additional additive effects you attach before build, such as events or trend effects
The likelihood sees the sum of these pieces, not a directly observed
“ground-truth baseline” and “ground-truth media” split.
That means the total fit can be easier to identify than the decomposition.
2. Why the trade-off exists
Suppose revenue rises every December and TV spend also rises every December.
Several stories can fit the same sales data reasonably well:
- December uplift is mostly seasonality
- December uplift is mostly TV
- December uplift is partly both
If the model includes both a seasonal term and media terms, they will compete
to explain the same observed movement.
This is the core baseline-versus-media trade-off:
The data often identify total explained variation better than they identify which component deserves the credit.
Classical econometricians already know this as collinearity and omitted-variable
competition. Bayesian MMM does not make that problem disappear. It makes the
uncertainty around it more explicit.
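The December example can be reduced to a few lines of arithmetic. The sketch below is an extreme, made-up case in which TV spend is perfectly collinear with a December indicator, so three different coefficient sets produce identical fitted values while attributing very different amounts to TV. Real data are rarely perfectly collinear, but the closer they get, the flatter the likelihood becomes along this direction.

```python
# Two years of monthly data; December is the only month with TV spend.
december = [1 if month == 12 else 0 for month in range(1, 13)] * 2
tv_spend = [2.0 * flag for flag in december]  # perfectly collinear with December


def fitted(intercept, seasonal_coef, tv_coef):
    return [intercept + seasonal_coef * s + tv_coef * tv
            for s, tv in zip(december, tv_spend)]


def tv_share(intercept, seasonal_coef, tv_coef):
    """Share of total fitted sales attributed to TV."""
    total = sum(fitted(intercept, seasonal_coef, tv_coef))
    return sum(tv_coef * tv for tv in tv_spend) / total


# (intercept, seasonal coefficient, TV coefficient); all values invented.
stories = {
    "mostly seasonality": (100.0, 30.0, 0.0),
    "mostly TV": (100.0, 0.0, 15.0),
    "partly both": (100.0, 10.0, 10.0),
}

for name, params in stories.items():
    print(f"{name}: December fit = {fitted(*params)[11]}, "
          f"TV share = {tv_share(*params):.2%}")
```

All three stories reproduce the same series exactly, so no fit statistic can choose between them. Only priors, additional variation in spend, or external evidence can.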
3. What counts as “baseline” in Abacus
In Abacus, the baseline side comes from the terms you specify inside the PyMC
graph.
Depending on configuration, that can include:
- a static intercept
- a time-varying intercept
- yearly Fourier seasonality
- controls
- events
- trend-like additive effects
- Mundlak CRE adjustments in panel settings
So when people say “baseline absorbed the effect”, they usually mean one or
more of those components, not a separate external decomposition engine.
4. How media can lose attribution
Media can lose attribution when the non-media side of the model is too good at
explaining the same movements.
Common cases:
- a flexible time-varying intercept captures medium-run swings that media could also explain
- strong seasonal terms absorb repeating peaks that coincide with campaign timing
- control variables proxy for media timing or market conditions too strongly
- event effects explain demand spikes that were previously being picked up by channel coefficients
In each case, the model may still predict well. The question is how the
variation is partitioned.
5. How media can steal attribution from baseline
The reverse failure is also common.
If the baseline side is under-specified, media channels can absorb variation
that is not truly incremental media response.
Examples:
- missing seasonality leaves recurring annual structure for media to explain
- missing controls leave competitor, pricing, or macro effects for media to explain
- missing events leave spikes for channels to absorb
- insufficient baseline flexibility forces media to act as a trend proxy
This usually inflates media contribution and makes optimisation outputs look
better than they should.
6. Why good fit does not settle the argument
You might hope that whichever specification predicts better must also have the
more trustworthy attribution split.
Unfortunately, that does not follow.
A model can reproduce the observed target series very well while still having
ambiguous attribution. Predictive adequacy is necessary, but it is not enough
to identify the correct media decomposition.
7. Signs that the trade-off is driving your result
Be cautious when you see any of the following:
- very similar model fit with materially different channel contributions
- large channel swings after adding or removing a seasonal or trend term
- media ROI rankings that flip after adding controls or events
- one highly flexible baseline term dominating decomposition while media contributions collapse
- implausibly smooth media contributions paired with a very wiggly baseline, or vice versa
These are not proofs of misspecification, but they are strong prompts for
sensitivity analysis.
8. What to do in practice
A disciplined Abacus workflow is usually better than trying to argue
theoretically about the “right” split in the abstract.
Recommended approach:
- Start with a specification that has the minimum baseline structure you can defend.
- Add seasonal, control, event, or time-varying terms only when you can justify them substantively or diagnostically.
- Refit and compare decomposition stability, not just target fit.
- Report instability when attribution changes materially across defensible specifications.
- Where possible, bring in external evidence such as lift tests or calibration.
The important point is not to force one narrative prematurely. It is to show
which attribution conclusions remain stable after reasonable specification
changes.
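Comparing decomposition stability is easy to mechanise once each refit has produced channel contribution shares. The sketch below is hypothetical throughout: the specification names, the shares, and the five-point tolerance are invented, and extracting the shares from a fitted model is left out. It reports, for each channel, the spread of its share across specifications and flags the channels whose spread exceeds the tolerance.

```python
def contribution_spread(shares_by_spec):
    """Max minus min contribution share per channel across specifications."""
    channels = next(iter(shares_by_spec.values())).keys()
    return {
        channel: max(spec[channel] for spec in shares_by_spec.values())
        - min(spec[channel] for spec in shares_by_spec.values())
        for channel in channels
    }


def unstable_channels(shares_by_spec, tolerance=0.05):
    spread = contribution_spread(shares_by_spec)
    return sorted(channel for channel, value in spread.items() if value > tolerance)


# Invented contribution shares from three defensible specifications.
shares_by_spec = {
    "minimal baseline": {"tv": 0.22, "search": 0.10},
    "plus yearly seasonality": {"tv": 0.12, "search": 0.09},
    "plus events": {"tv": 0.11, "search": 0.10},
}

print(contribution_spread(shares_by_spec))
print("unstable:", unstable_channels(shares_by_spec))
```

In this made-up case the search share barely moves while the TV share halves once seasonality is added, so the TV attribution is the one to report with a caveat.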
9. Abacus-specific interpretation
In Abacus, you should treat the decomposition outputs as conditional on the
configured structure:
- the chosen controls
- whether yearly_seasonality is on
- whether the intercept is time-varying
- whether media effects are time-varying
- whether you added events or other additive effects
- whether use_mundlak_cre=True
Change the structure, and the attribution can change even when predictive fit
does not move much.
That is normal. It is the software telling you where the data alone are not
decisive.
10. Bottom line
Baseline-versus-media trade-offs are unavoidable in MMM because the observed
target only reveals the sum of the contributing processes.
Abacus makes this explicit by fitting all configured terms inside one additive
Bayesian graph. That is a strength, but it also means you need to read the
decomposition as a conditional statement:
“Given this model structure, priors, and data, this is the most plausible attribution split.”
That is much more defensible than pretending the split is uniquely observed in
the data.
Mundlak Specification Test
Background
Classical panel econometrics uses the Mundlak specification test (also
called the Chamberlain–Mundlak test) to decide whether random effects (RE) or
fixed effects (FE) should be preferred. Stata 19 implements this as
estat mundlak — a Wald test on the auxiliary Mundlak γ coefficients:
- H₀: RE is consistent (γ = 0 jointly), so the simpler RE model is adequate.
- H₁: RE is inconsistent (γ ≠ 0), so CRE or FE is needed.
This test is the cluster-robust-compatible replacement for the classical
Hausman test, which breaks under heteroskedasticity or within-cluster
correlation.
Why It Does Not Apply to Abacus
Abacus is a fully Bayesian MMM framework. The Mundlak specification test
is a frequentist hypothesis test and does not translate directly:
- No frequentist rejection framework. There is no Wald statistic or asymptotic chi-squared distribution. Bayesian inference does not produce p-values or binary accept/reject decisions.
- The posterior already answers the question. When use_mundlak_cre=True, the Mundlak γ coefficients receive priors and are estimated jointly with all other model parameters. If the posteriors of γ are concentrated near zero, the baseline panel specification was adequate. If they are clearly non-zero, the CRE correction is absorbing meaningful between-group confounding. You read the posterior; you do not need a separate test.
- Bayesian pooling is a continuum, not a binary choice. In Abacus, hierarchical shrinkage only appears when you encode it in the priors. Default PanelMMM panel priors are indexed by the panel coordinates, not automatically hierarchical. Once you choose hierarchical priors, there is no clean “pure RE” versus “pure FE” dichotomy to test between.
What to Do Instead
Inspect the γ posteriors directly
After fitting with use_mundlak_cre=True, examine the Mundlak coefficients:
If the 94% HDI includes zero for all channels, the CRE correction is doing
little. If the HDI excludes zero, the correction is absorbing real
between-group correlation.
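The decision rule above amounts to computing a highest-density interval per coefficient and checking whether it contains zero. The sketch below is a library-free illustration on fabricated draws: the channel names and the draws are invented, and the helper is not part of Abacus. In a real workflow the draws would come from the fitted model's posterior, and a summary function such as az.summary() would report the interval directly.

```python
import math
import random


def hdi(draws, prob=0.94):
    """Narrowest interval containing at least `prob` of the draws."""
    ordered = sorted(draws)
    n = len(ordered)
    window = max(1, math.ceil(prob * n))  # number of draws inside the interval
    best = min(range(n - window + 1),
               key=lambda i: ordered[i + window - 1] - ordered[i])
    return ordered[best], ordered[best + window - 1]


def excludes_zero(interval):
    low, high = interval
    return low > 0 or high < 0


rng = random.Random(1)
# Fabricated posterior draws for two hypothetical Mundlak coefficients.
gamma_draws = {
    "tv": [rng.gauss(0.02, 0.10) for _ in range(2000)],
    "search": [rng.gauss(0.50, 0.10) for _ in range(2000)],
}

for channel, draws in gamma_draws.items():
    low, high = hdi(draws)
    verdict = "excludes zero" if excludes_zero((low, high)) else "includes zero"
    print(f"{channel}: 94% HDI = ({low:.2f}, {high:.2f}), {verdict}")
```

Treat the zero check as a reading aid rather than a test: an interval that narrowly includes zero and one that narrowly excludes it describe nearly the same posterior.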
The diagnostics surface also reports these:
mmm.diagnostics.mcmc_summary()
Bayesian model comparison (optional, not currently in scope)
The formal Bayesian analog of the specification test is model comparison via
LOO-CV (leave-one-out cross-validation using Pareto-smoothed importance
sampling):
- Fit with use_mundlak_cre=False.
- Fit with use_mundlak_cre=True.
- Compare ELPD (expected log predictive density) via az.compare().
This is currently out of scope for Abacus. LOO/WAIC were explicitly
deferred in the project backlog. If formal Bayesian model comparison is needed
later, it would be a separate feature.
Prior predictive checks
Verify that the prior on γ is not dominating the posterior. This is standard
Bayesian workflow and is already supported via mmm.sample_prior_predictive().
Summary
| Approach | Framework | Available in Abacus |
| --- | --- | --- |
| Mundlak specification test (estat mundlak) | Frequentist | No (does not apply) |
| Posterior inspection of γ | Bayesian | Yes: az.summary() / mmm.diagnostics |
| LOO-CV model comparison | Bayesian | Not yet (deferred in backlog) |
| Prior predictive check | Bayesian | Yes: sample_prior_predictive() |
The recommendation is to inspect the γ posteriors rather than implement a
frequentist specification test. The Bayesian posterior provides a richer and
more directly interpretable answer than a binary reject/fail-to-reject
decision.
References
- Mundlak, Y. (1978). “On the Pooling of Time Series and Cross Section Data.” Econometrica, 46(1), 69–85.
- Vehtari, A., Gelman, A., & Gabry, J. (2017). “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.” Statistics and Computing, 27(5), 1413–1432.