Data Preparation

This section explains how to prepare X and y for PanelMMM. It covers the required columns, how panel rows are organised when you use dims, and how Abacus scales channels and the target before fitting.

Pages

  • Input Data Requirements — Required X and y inputs, column roles, alignment rules, and common input errors.
  • Panel Data Layout — How to structure rows for no panel dims, one dim such as geo, or multiple dims such as geo and brand.
  • Scaling and Preprocessing — What Abacus scales automatically, how Scaling works, and what to preprocess yourself.

Input Data Requirements

Use this page together with Panel Data Layout and Scaling and Preprocessing when you prepare a dataset for PanelMMM.

Core contract

For direct Python use, PanelMMM expects:

  • X as a pandas.DataFrame
  • y as a pandas.Series named target_column, or a one-dimensional NumPy array of the same length as X

X must contain the date column, all media columns, and any configured control_columns or dims columns. y carries only the target values.

| Role | Where it must be present | Required | Notes |
| --- | --- | --- | --- |
| date_column | X | Yes | Normalise to datetimes or parseable date strings. |
| channel_columns | X | Yes | Every listed channel column must exist in X. |
| target_column | y | Yes | y.name should match target_column. |
| control_columns | X | No | If configured, every listed control column must exist in X. |
| dims | X | No | One column per configured panel dimension, such as geo or brand. |

X and y

When you call fit(X, y) or build_model(X, y):

  • Keep the target out of X.
  • Keep X and y row-aligned.
  • If both are pandas objects, keep the same index on both. The shared regression builder checks index equality before fitting.
  • If you pass y as a NumPy array, its length must match len(X).
  • For panel models, each date_column + dims combination must appear exactly once. Duplicate rows are rejected.
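
The rules above can be checked before you call fit(X, y). The following is a minimal sketch using pandas only; check_alignment is a hypothetical helper written for this page, not part of Abacus, and it only mirrors the checks described here rather than reproducing the library's own validation.

```python
import numpy as np
import pandas as pd


def check_alignment(X, y, date_column, target_column, dims=()):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical pre-flight helper, not an Abacus function.
    """
    problems = []
    if isinstance(y, pd.Series):
        if y.name != target_column:
            problems.append("y.name does not match target_column")
        if not X.index.equals(y.index):
            problems.append("X and y have different indexes")
    elif len(y) != len(X):
        problems.append("y length does not match len(X)")
    if target_column in X.columns:
        problems.append("target column is still present in X")
    # Each date + dims combination may appear only once.
    if X.duplicated(subset=[date_column, *dims]).any():
        problems.append("duplicate date + dims rows")
    return problems


X = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2025-01-06", "2025-01-06", "2025-01-13", "2025-01-13"]
        ),
        "geo": ["UK", "US", "UK", "US"],
        "tv": [120.0, 150.0, 125.0, 152.0],
    }
)
y = pd.Series([820.0, 910.0, 835.0, 925.0], name="sales")

ok = check_alignment(X, y, "date", "sales", dims=("geo",))

# Same values, but the index no longer matches X.
y_shifted = pd.Series(y.to_numpy(), index=[1, 2, 3, 4], name="sales")
shifted = check_alignment(X, y_shifted, "date", "sales", dims=("geo",))

# A NumPy target that is too short.
short = check_alignment(X, np.array([1.0, 2.0]), "date", "sales", dims=("geo",))
```

Returning a list of messages rather than raising on the first failure makes it easier to report every problem in a dataset at once.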

Abacus uses target_column as the target name throughout the panel reshape. If y is a Series, its name must match target_column.

Date column

date_column is required in X.

Abacus expects calendar dates, not integer date codes. In practice:

  • Use datetime64[ns] where possible.
  • Parse string dates with pd.to_datetime(...) before fitting when you use the Python API.
  • Do not rely on numeric date values such as 0, 1, 2. pd.to_datetime(...) reads bare integers as nanosecond offsets from the Unix epoch by default, which is usually not what you want.

The YAML builder normalises X[date_column] with pd.to_datetime(...) after loading the dataset. Direct Python use does not add an equivalent preprocessing step for you.
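
For the Python API, the date handling described above might look like the sketch below. The column names and values are made up for illustration; the second half only demonstrates how pandas treats integer codes, which is the failure mode to avoid.

```python
import pandas as pd

# String dates: parse them explicitly before fitting.
raw = pd.DataFrame({"date": ["2025-01-06", "2025-01-13"], "tv": [120.0, 125.0]})
raw["date"] = pd.to_datetime(raw["date"])

# Integer codes: pandas reads them as nanoseconds since 1970-01-01,
# so 0, 1, 2 all land within the first instant of 1970.
codes = pd.to_datetime(pd.Series([0, 1, 2]))
years = codes.dt.year.unique().tolist()
```

If your source data really does carry integer period codes, map them to calendar dates yourself before building X.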

Channel columns

channel_columns is a required constructor argument and must be a non-empty list.

Each listed channel:

  • must be present in X
  • must be fully observed for every row you pass into fit or posterior prediction; Abacus does not silently convert missing channel values to zero
  • should represent the raw media variable that you want the adstock and saturation transformations to consume
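
A quick way to confirm the first two requirements is to look for absent columns and count missing values per channel. This is a sketch with made-up data, not an Abacus utility; the same pattern applies to control_columns and to the target.

```python
import numpy as np
import pandas as pd

channel_columns = ["tv", "search"]

X = pd.DataFrame(
    {
        "date": pd.to_datetime(["2025-01-06", "2025-01-13"]),
        "tv": [120.0, np.nan],  # one unobserved value
        "search": [40.0, 42.0],
    }
)

# Channels listed in the configuration but absent from X.
missing_columns = [c for c in channel_columns if c not in X.columns]

# Channels that contain at least one missing value.
nan_counts = X[channel_columns].isna().sum()
incomplete = nan_counts[nan_counts > 0].to_dict()
```

Whether a missing value should become an explicit zero or an imputed estimate is a modelling decision, so the sketch reports the gaps without filling them.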

Target column

target_column names the dependent variable. It defaults to "y", but you can set a different name such as "sales" or "conversions".

For direct Python use:

  • pass the target as y
  • name the Series with target_column
  • keep the target fully observed; missing target values are rejected rather than zero-filled

For combined-file YAML or pipeline flows:

  • keep the target column in the source dataset
  • Abacus splits it out of the combined dataset before fitting

Control columns

control_columns is optional.

If you configure it, every listed control column must be present in X. Controls stay in the design matrix as separate regressors; they are not part of y.

Like channels, configured controls must be fully observed for every row passed into fit or posterior prediction.

Abacus does not automatically scale controls. See Scaling and Preprocessing.

Panel dimensions with dims

dims is optional. Use it when you want a panel model, for example by geo, brand, or market.

If you set dims=("geo", "brand"):

  • X must contain geo and brand columns
  • each row in X represents one date + geo + brand observation
  • each new date must include every fitted panel slice when you later call posterior-predictive methods with new data

Do not use reserved internal names in dims:

  • date
  • channel
  • control
  • fourier_mode
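
If one of your grouping columns happens to use a reserved name, rename it before passing it in dims. The sketch below assumes a hypothetical dataset whose sales-channel column is called channel; the helper name and the replacement name sales_channel are illustrative only.

```python
import pandas as pd

# Names taken from the list above.
RESERVED_DIM_NAMES = {"date", "channel", "control", "fourier_mode"}


def reserved_dims_used(dims):
    """Return the entries of dims that collide with reserved names."""
    return sorted(set(dims) & RESERVED_DIM_NAMES)


X = pd.DataFrame(
    {
        "date": pd.to_datetime(["2025-01-06", "2025-01-06"]),
        "channel": ["online", "retail"],  # a sales channel, not a media channel
        "tv": [120.0, 150.0],
    }
)

clashes = reserved_dims_used(("geo", "channel"))

# Rename the offending column, then use the new name in dims.
X = X.rename(columns={"channel": "sales_channel"})
dims = ("sales_channel",)
```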

For row layout and rectangularity guidance, see Panel Data Layout.

Supported shapes and alignment

| Workflow | Supported shape |
| --- | --- |
| Direct PanelMMM.fit() / build_model() | X: DataFrame; y: Series or 1D ndarray |
| YAML builder with data.dataset_path | One tabular file containing both predictors and the target column |
| Pipeline runner with dataset_path | Same as above |
| Pipeline runner with x_path and y_path | Separate feature and target files; the runner extracts target_column from the target file |

Abacus also has an internal alignment helper that can work with a MultiIndex target Series indexed by [date_column, *dims], but that is mainly used in fit-data rebuild and load flows. For normal fitting, keep y row-aligned with X.

Python example

import pandas as pd

from abacus.mmm import GeometricAdstock, LogisticSaturation
from abacus.mmm.panel import PanelMMM

dataset = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2025-01-06", "2025-01-06", "2025-01-13", "2025-01-13"]
        ),
        "geo": ["UK", "US", "UK", "US"],
        "tv": [120.0, 150.0, 125.0, 152.0],
        "search": [40.0, 55.0, 42.0, 58.0],
        "price_index": [1.02, 0.99, 1.01, 1.00],
        "sales": [820.0, 910.0, 835.0, 925.0],
    }
)

X = dataset.drop(columns=["sales"])
y = dataset["sales"].rename("sales")

mmm = PanelMMM(
    date_column="date",
    channel_columns=["tv", "search"],
    target_column="sales",
    control_columns=["price_index"],
    dims=("geo",),
    adstock=GeometricAdstock(l_max=8),
    saturation=LogisticSaturation(),
)

mmm.fit(X, y)

YAML note

If you use a combined dataset in YAML, the file at data.dataset_path must contain every configured column:

  • date_column
  • every entry in channel_columns
  • every entry in control_columns, if any
  • every entry in dims, if any
  • target_column

Example:

data:
  dataset_path: panel_dataset.csv
  date_column: date

target:
  column: sales
  type: revenue

dimensions:
  panel: [geo]

media:
  channels: [tv, search]
  controls: [price_index]
  adstock:
    type: geometric
    l_max: 8
  saturation:
    type: logistic

Common pitfalls

  • Missing date_column, channel, control, or dimension columns in X
  • Passing a y Series whose name does not match target_column
  • Passing pandas X and y with different indexes
  • Passing a NumPy y with a different length from X
  • Passing duplicate panel rows or incomplete panel slices for a given date
  • Passing missing observed channel, control, or target values and expecting Abacus to treat them as structural zeroes
  • Expecting the YAML builder or pipeline to find a target column that is not present in the combined dataset
  • Leaving date values as numeric codes instead of normalising them first

Panel Data Layout

This page explains how PanelMMM expects panel rows to be organised in X. For the column-level contract, see Input Data Requirements.

What “panel” means in Abacus

In Abacus, a panel dataset repeats the same time axis across one or more categorical dimensions in dims.

Each row represents:

  • one date_column value
  • one combination of dims values, if any
  • one set of channel and optional control values for that slice

With no extra panel dims, each date appears once. With dims=("geo",), each date appears once per geo. With dims=("geo", "brand"), each date appears once per geo + brand combination.

How dims work

Pass panel dimensions when you construct the model:

from abacus.mmm import GeometricAdstock, LogisticSaturation
from abacus.mmm.panel import PanelMMM

mmm = PanelMMM(
    date_column="date",
    channel_columns=["tv", "search"],
    target_column="sales",
    dims=("geo", "brand"),
    adstock=GeometricAdstock(l_max=8),
    saturation=LogisticSaturation(),
)

dims columns stay in X. They are not moved into y.

Abacus reserves these names for internal coordinates, so do not use them in dims:

  • date
  • channel
  • control
  • fourier_mode

No extra panel dims

If dims=(), X should have one row per date.

| date | tv | search | sales |
| --- | --- | --- | --- |
| 2025-01-06 | 120 | 40 | 820 |
| 2025-01-13 | 125 | 42 | 835 |
| 2025-01-20 | 130 | 45 | 850 |

Internally, Abacus reshapes this into:

  • channels: (date, channel)
  • target: (date,)
  • controls, if present: (date, control)

Single panel dim example: geo

If dims=("geo",), each date should appear once for each geo value.

| date | geo | tv | search | sales |
| --- | --- | --- | --- | --- |
| 2025-01-06 | UK | 120 | 40 | 820 |
| 2025-01-06 | US | 150 | 55 | 910 |
| 2025-01-13 | UK | 125 | 42 | 835 |
| 2025-01-13 | US | 152 | 58 | 925 |

Internally, Abacus reshapes this into:

  • channels: (date, geo, channel)
  • target: (date, geo)
  • controls, if present: (date, geo, control)

Multiple panel dims example: geo and brand

If dims=("geo", "brand"), each row identifies one date, one geo, and one brand.

import pandas as pd

X = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2025-01-06",
                "2025-01-06",
                "2025-01-06",
                "2025-01-06",
                "2025-01-13",
                "2025-01-13",
                "2025-01-13",
                "2025-01-13",
            ]
        ),
        "geo": ["UK", "UK", "US", "US", "UK", "UK", "US", "US"],
        "brand": ["A", "B", "A", "B", "A", "B", "A", "B"],
        "tv": [80.0, 55.0, 92.0, 60.0, 82.0, 58.0, 95.0, 63.0],
        "search": [20.0, 18.0, 24.0, 19.0, 21.0, 18.5, 25.0, 20.0],
    }
)

y = pd.Series(
    [510.0, 370.0, 590.0, 405.0, 520.0, 380.0, 605.0, 418.0],
    name="sales",
)

For a rectangular panel, the row count is:

n_dates * n_geo * n_brand
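
The row-count formula can double as a rectangularity test: if the table has no duplicate keys and its length equals the product of the unique counts, every combination is present. The helper below is a hypothetical sketch for your own checks, not an Abacus function.

```python
import itertools

import pandas as pd

keys = ["date", "geo", "brand"]

# Build every date x geo x brand combination: 2 * 2 * 2 = 8 rows.
dates = pd.to_datetime(["2025-01-06", "2025-01-13"])
rows = list(itertools.product(dates, ["UK", "US"], ["A", "B"]))
X = pd.DataFrame(rows, columns=keys)


def is_rectangular(df, keys):
    """True when each combination of key values appears exactly once."""
    expected_rows = 1
    for key in keys:
        expected_rows *= df[key].nunique()
    has_duplicates = bool(df.duplicated(subset=keys).any())
    return len(df) == expected_rows and not has_duplicates


full_panel = is_rectangular(X, keys)
one_row_short = is_rectangular(X.iloc[:-1], keys)
```

Note that the check is relative to the values present in the table: a geo that is absent from every date cannot be detected this way.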

Internal reshape

Abacus converts the pandas inputs into xarray datasets before building the PyMC model.

| Input role | Internal variable | xarray dims |
| --- | --- | --- |
| X[channel_columns] | _channel | (date, *dims, channel) |
| X[control_columns] | _control | (date, *dims, control) |
| y | _target | (date, *dims) |

The channel and control dimensions come from the configured column names, not from row values.

Rectangularity, duplicates, and missing rows

Abacus builds xarray coordinates from the unique values it sees in:

  • date_column
  • each configured dimension column
  • the configured channel or control names

That has four practical consequences:

  • Keep the panel rectangular. Provide one row for every expected date_column + dims combination.
  • Use explicit zeroes for structural no-spend or no-activity rows.
  • Keep declared channel, control, and target values observed within those rows. Abacus rejects missing metric cells instead of silently converting them to zeroes.
  • Do not use missing rows to mean “unknown”. Abacus validates panel shape before reshape and raises an error if panel cells are missing.

Abacus also requires each date_column + dims combination to appear exactly once. It does not aggregate duplicates for you. If you have duplicate rows, deduplicate or aggregate them before fitting or posterior prediction.
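
One way to find missing panel cells before Abacus rejects them is to compare the table against the full date × dims grid. The sketch below uses pandas only and made-up data. Filling the gap with zero is shown purely as an example and is only appropriate when the absent row genuinely means no activity.

```python
import pandas as pd

keys = ["date", "geo"]

# The US row for 2025-01-13 is absent.
X = pd.DataFrame(
    {
        "date": pd.to_datetime(["2025-01-06", "2025-01-06", "2025-01-13"]),
        "geo": ["UK", "US", "UK"],
        "tv": [120.0, 150.0, 125.0],
    }
)

# Every combination of the observed dates and geos.
full_index = pd.MultiIndex.from_product(
    [X["date"].unique(), X["geo"].unique()], names=keys
)
indexed = X.set_index(keys)
missing_cells = full_index.difference(indexed.index).tolist()

# Make the gap explicit. Zero is an assumption about the data, not a default.
completed = indexed.reindex(full_index).fillna({"tv": 0.0}).reset_index()
```

If the absent row means "unknown" rather than "no spend", impute or drop the affected slice instead of zero-filling it.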

Sorting and uniqueness

Sort your data before fitting:

  • first by date_column
  • then by each entry in dims

Abacus keeps dates in the order they appear in X, and time-varying features infer time resolution from adjacent rows. A sorted dataset makes the time axis deterministic and easier to reason about.

Also make sure that each date_column + dims combination appears once in the prepared table, and that every expected panel slice is present for every date.
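
When you sort X, reorder y with the same permutation so the two stay row-aligned. A minimal sketch, assuming X and y start with a shared index:

```python
import pandas as pd

X = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2025-01-13", "2025-01-06", "2025-01-13", "2025-01-06"]
        ),
        "geo": ["US", "UK", "UK", "US"],
        "tv": [152.0, 120.0, 125.0, 150.0],
    }
)
y = pd.Series([925.0, 820.0, 835.0, 910.0], name="sales")

# Sort by date first, then by each dim, and apply the same order to y.
order = X.sort_values(["date", "geo"]).index
X_sorted = X.loc[order].reset_index(drop=True)
y_sorted = y.loc[order].reset_index(drop=True)
```

Resetting the index on both objects afterwards keeps the indexes equal, which matters when X and y are both pandas objects.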

DataFrame versus MultiIndex handling

For normal fitting:

  • use a regular DataFrame for X
  • keep date_column and any dims as columns in that DataFrame
  • use a row-aligned Series for y

Abacus does have internal helpers that can align a MultiIndex target Series indexed by [date_column, *dims], but that is not the main user-facing data preparation pattern for fit().
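
If your target already lives in a MultiIndex Series, you can convert it to the row-aligned form yourself rather than relying on the internal helpers. This is a sketch in plain pandas with made-up data; it does not call any Abacus function.

```python
import pandas as pd

keys = ["date", "geo"]

X = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2025-01-06", "2025-01-06", "2025-01-13", "2025-01-13"]
        ),
        "geo": ["UK", "US", "UK", "US"],
        "tv": [120.0, 150.0, 125.0, 152.0],
    }
)

# Target indexed by (date, geo), stored in a different row order from X.
y_mi = pd.Series(
    [925.0, 820.0, 835.0, 910.0],
    index=pd.MultiIndex.from_tuples(
        [
            (pd.Timestamp("2025-01-13"), "US"),
            (pd.Timestamp("2025-01-06"), "UK"),
            (pd.Timestamp("2025-01-13"), "UK"),
            (pd.Timestamp("2025-01-06"), "US"),
        ],
        names=keys,
    ),
    name="sales",
)

# Look up each row of X by its (date, geo) key, then adopt X's index.
row_keys = pd.MultiIndex.from_frame(X[keys])
y = pd.Series(y_mi.reindex(row_keys).to_numpy(), index=X.index, name=y_mi.name)
```

Any key in X that is absent from the MultiIndex target would come back as NaN, so check y.isna() before fitting.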

Practical checklist

  • One row per date_column + dims combination
  • No duplicate rows for the same panel cell
  • Same set of dates for every panel slice
  • Explicit zeroes for true zero activity
  • No missing observed channel, control, or target values
  • Sorted rows before fitting

For scaling choices once the layout is correct, see Scaling and Preprocessing.

Scaling and Preprocessing

Abacus scales channels and the target automatically before it builds the PyMC graph for PanelMMM. This page explains what is scaled, how the Scaling configuration works, and what you still need to preprocess yourself.

What Abacus scales automatically

Abacus computes scales from the reshaped xarray dataset immediately before model construction.

| Variable role | Automatic scaling | Notes |
| --- | --- | --- |
| Target (y) | Yes | Divided by target_scale before the likelihood is built. |
| Channels (channel_columns) | Yes | Divided by channel_scale before adstock and saturation. |
| Controls (control_columns) | No | Controls enter the model on their original scale. |
| Date and dims columns | No | These define coordinates, not modelled numeric inputs. |

Abacus stores the resulting scalers in the model as xarray data:

  • _target scaler data in model.scalers["_target"]
  • _channel scaler data in model.scalers["_channel"]

Default behaviour

If you do not pass scaling, PanelMMM uses:

Scaling(
    target=VariableScaling(method="max", dims=dims),
    channel=VariableScaling(method="max", dims=dims),
)

This means:

  • the target is divided by the maximum over date and all configured dims
  • each channel is divided by its maximum over date and all configured dims

With no extra panel dims:

  • target_scale is a scalar
  • channel_scale has dimension channel

With dims=("geo",) and the default scaling:

  • target_scale is still a scalar, because scaling reduces over both date and geo
  • channel_scale still has dimension channel, so each channel is pooled across all geos

If you want per-panel scales instead of pooled scales, set dims=() inside the relevant VariableScaling. See Dimension semantics.

Scaling and VariableScaling

Use abacus.mmm.scaling.Scaling and abacus.mmm.scaling.VariableScaling to control automatic scaling.

| Setting | Purpose | Allowed values |
| --- | --- | --- |
| VariableScaling.method | Reduction used to compute the scale | "max" or "mean" |
| VariableScaling.dims | Extra dimensions to reduce across, in addition to date | String or tuple of strings |
| Scaling.target | Scaling rule for the target | VariableScaling |
| Scaling.channel | Scaling rule for channels | VariableScaling |

Rules enforced by the implementation:

  • date is always assumed in the reduction and must not be listed in VariableScaling.dims.
  • Duplicate scaling dims are not allowed.
  • Target scaling dims must come from the model dims.
  • Channel scaling dims must come from the model dims, with optional inclusion of channel.

You can pass either:

  • a Scaling object
  • a plain dictionary with target and channel keys

If the dictionary omits one side, Abacus fills the missing target or channel rule with the default method="max", dims=dims configuration.

Dimension semantics

VariableScaling.dims tells Abacus which dimensions to reduce across in addition to date. It does not tell Abacus which dimensions to keep.

Assume a model with dims=("geo",) so channel data has dimensions (date, geo, channel) and target data has dimensions (date, geo).

| Configuration | Reduction performed | Resulting scale dims | Meaning |
| --- | --- | --- | --- |
| target.dims=() | over date | (geo,) | One target scale per geo |
| target.dims=("geo",) | over date, geo | () | One pooled target scale |
| channel.dims=() | over date | (geo, channel) | One scale per geo-channel pair |
| channel.dims=("geo",) | over date, geo | (channel,) | One pooled scale per channel |
| channel.dims=("geo", "channel") | over date, geo, channel | () | One pooled scale for all channels |
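
The channel rows of this table can be reproduced on a toy array. The sketch below uses NumPy axes in place of named xarray dimensions and method="max"; it illustrates the reduction semantics only and is not how Abacus computes its scalers internally.

```python
import numpy as np

# Toy channel data with shape (date, geo, channel) = (3, 2, 2).
# Axis 1 is geo (UK, US); axis 2 is channel (tv, search).
channel = np.array(
    [
        [[120.0, 40.0], [150.0, 55.0]],
        [[125.0, 42.0], [152.0, 58.0]],
        [[130.0, 45.0], [149.0, 60.0]],
    ]
)

# channel.dims=(): reduce over date only, keep (geo, channel).
per_geo_channel = channel.max(axis=0)

# channel.dims=("geo",): reduce over date and geo, keep (channel,).
pooled_per_channel = channel.max(axis=(0, 1))

# channel.dims=("geo", "channel"): reduce over everything, scalar result.
pooled_all = channel.max()
```

The more dimensions you list, the fewer remain in the resulting scale, which is why dims reads as "reduce across" rather than "keep".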

Python example

This example keeps separate scales for each geo by reducing only over date:

from abacus.mmm import GeometricAdstock, LogisticSaturation
from abacus.mmm.panel import PanelMMM
from abacus.mmm.scaling import Scaling, VariableScaling

mmm = PanelMMM(
    date_column="date",
    channel_columns=["tv", "search"],
    target_column="sales",
    dims=("geo",),
    scaling=Scaling(
        target=VariableScaling(method="mean", dims=()),
        channel=VariableScaling(method="max", dims=()),
    ),
    adstock=GeometricAdstock(l_max=8),
    saturation=LogisticSaturation(),
)

In that configuration:

  • the target is divided by the per-geo mean over time
  • each channel is divided by the per-geo, per-channel maximum over time
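
To see what a per-geo mean scale does to the target, you can reproduce the arithmetic in pandas. This is an illustration of the calculation with made-up numbers, not a call into Abacus.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "geo": ["UK", "US", "UK", "US"],
        "sales": [820.0, 910.0, 835.0, 925.0],
    }
)

# One scale per geo: the mean of that geo's target over time.
target_scale = df.groupby("geo")["sales"].mean()

# Divide each row by the scale of its own geo.
scaled = df["sales"] / df["geo"].map(target_scale)
scaled_means = scaled.groupby(df["geo"]).mean()
```

After scaling, each geo's target averages 1, so geos of very different sizes enter the model on a comparable footing.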

YAML example

The YAML builder accepts the same structure through a top-level scaling block:

data:
  date_column: date

target:
  column: y
  type: revenue

dimensions:
  panel: [market]

media:
  channels: [channel_1, channel_2]
  adstock:
    type: geometric
    l_max: 8
  saturation:
    type: logistic

scaling:
  target:
    method: max
    dims: []
  channels:
    method: max
    dims: [market]

In this example:

  • the target is scaled separately for each market, because dims: [] reduces over date only
  • channels are scaled across date and market, leaving one scale per channel

Original units versus model scale

The model is fit on scaled target and channel data.

That affects downstream interpretation:

  • posterior likelihood and many contribution variables live in scaled target space
  • channel inputs are transformed after scaling, not in raw units

If you want stored deterministics in original target units, add them explicitly after build_model(...):

mmm.add_original_scale_contribution_variable(
    var=["channel_contribution", "y"]
)

The YAML builder supports the same workflow through original_scale_vars:

original_scale_vars:
  - channel_contribution
  - y

original_scale_vars adds extra original-scale deterministic variables. It does not change how the model is fit.
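
Because the target is divided by target_scale before fitting, converting a quantity from scaled target space back to original units is a multiplication by the same scale. The sketch below shows that round trip with NumPy, assuming the default pooled method="max" scale; the values are made up.

```python
import numpy as np

y = np.array([820.0, 910.0, 835.0, 925.0])

# Pooled "max" scaling: a single scalar for the whole target.
target_scale = y.max()
y_scaled = y / target_scale

# Back to original units: multiply by the same scale.
y_original = y_scaled * target_scale
```

With per-panel scales, the same multiplication applies slice by slice, using the scale that belongs to each panel cell.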

What Abacus does not preprocess for you

Abacus does not automatically:

  • scale controls
  • impute missing data in a domain-aware way
  • reinterpret missing observed channel, control, or target values as zeroes
  • sort the dataset for you
  • repair non-rectangular panel layouts
  • tolerate duplicate panel rows or incomplete panel slices
  • coerce Python-API dates to datetimes before fitting

Practical preprocessing advice

Before fitting:

  • normalise date_column with pd.to_datetime(...)
  • sort by date_column and then by dims
  • make panel gaps explicit instead of leaving missing rows
  • ensure every date_column + dims panel cell appears exactly once
  • impute missing observed channel, control, and target values before fitting or posterior prediction instead of relying on implicit zero-fill
  • decide whether controls should be centred, standardised, log-transformed, or otherwise prepared before they go into control_columns
  • choose scaling dims deliberately instead of relying on the default when you use panel data

Common pitfalls

  • Expecting the default scaling to be per-group when it actually pools across the configured panel dims
  • Adding date to VariableScaling.dims; Abacus rejects this
  • Forgetting that controls are left on their original scale
  • Treating VariableScaling.dims as dimensions to keep rather than dimensions to reduce across
  • Assuming original_scale_vars changes fitting scale rather than adding extra outputs

For the input table shape that scaling operates on, see Panel Data Layout.