Data Preparation

This section explains how to prepare X and y for PanelMMM. It covers the required columns, how panel rows are organised when you use dims, and how Abacus scales channels and the target before fitting.

Input Data Requirements

Use this page together with the Panel Data Layout page and the Scaling and Preprocessing page when you prepare a dataset for PanelMMM.

Core contract

For direct Python use, PanelMMM expects:

X as a pandas.DataFrame
y as a pandas.Series whose name equals the configured target_column, or a one-dimensional NumPy array of the same length as X

X must contain the date column, all media columns, and any configured control_columns or dims columns. y carries only the target values.

Role	Where it must be present	Required	Notes
`date_column`	`X`	Yes	Normalise to datetimes or parseable date strings.
`channel_columns`	`X`	Yes	Every listed channel column must exist in `X`.
`target_column`	`y`	Yes	If `y` is a Series, `y.name` must match `target_column`.
`control_columns`	`X`	No	If configured, every listed control column must exist in `X`.
`dims`	`X`	No	One column per configured panel dimension, such as `geo` or `brand`.

`X` and `y`

When you call fit(X, y) or build_model(X, y):

Keep the target out of X.
Keep X and y row-aligned.
If both are pandas objects, keep the same index on both. The shared regression builder checks index equality before fitting.
If you pass y as a NumPy array, its length must match len(X).
For panel models, each date_column + dims combination must appear exactly once. Duplicate rows are rejected.

Abacus uses target_column as the target name throughout the panel reshape. If y is a Series, its name must match target_column.
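
The rules above can be checked before you call fit. The following is a minimal sketch using only pandas and NumPy; check_xy_alignment is a hypothetical helper written for this page, not part of the Abacus API, and it only mirrors the rules listed here rather than the library's own validation.

```python
import numpy as np
import pandas as pd


def check_xy_alignment(X, y, target_column):
    """Hypothetical pre-fit check; returns a list of problem descriptions."""
    problems = []
    if isinstance(y, pd.Series):
        # A Series target must carry the configured name and share X's index.
        if y.name != target_column:
            problems.append(f"y.name is {y.name!r}, expected {target_column!r}")
        if not X.index.equals(y.index):
            problems.append("X and y have different indexes")
    elif len(np.asarray(y)) != len(X):
        # A NumPy target only needs to match X in length.
        problems.append("y length does not match len(X)")
    if target_column in X.columns:
        problems.append(f"target column {target_column!r} is still in X")
    return problems


X = pd.DataFrame(
    {"date": pd.to_datetime(["2025-01-06", "2025-01-13"]), "tv": [120.0, 125.0]}
)
y_good = pd.Series([820.0, 835.0], name="sales")
y_bad = pd.Series([820.0, 835.0], index=[10, 11], name="revenue")

print(check_xy_alignment(X, y_good, "sales"))  # no problems reported
print(check_xy_alignment(X, y_bad, "sales"))   # wrong name and wrong index
```

An empty list means the inputs satisfy the alignment rules on this page; it does not guarantee that every Abacus validation will pass.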

Date column

date_column is required in X.

Abacus expects calendar dates, not integer date codes. In practice:

Use datetime64[ns] where possible.
Parse string dates with pd.to_datetime(...) before fitting when you use the Python API.
Do not rely on numeric date values such as 0, 1, 2. By default, pd.to_datetime(...) interprets integers as nanosecond offsets from the Unix epoch, so they all become timestamps on 1970-01-01 rather than the periods you intended.

The YAML builder normalises X[date_column] with pd.to_datetime(...) after loading the dataset. Direct Python use does not add an equivalent preprocessing step for you.
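
To illustrate the difference between integer codes and real calendar dates, here is a short pandas-only sketch. The column names and values are made up for the example.

```python
import pandas as pd

# Integer codes are read as nanoseconds since 1970-01-01, not as weeks.
codes = pd.Series([0, 1, 2])
as_epoch = pd.to_datetime(codes)
print(as_epoch.dt.year.unique())  # every value lands in 1970

# ISO date strings parse to the calendar dates you intended.
raw = pd.DataFrame({"date": ["2025-01-06", "2025-01-13"], "tv": [120.0, 125.0]})
raw["date"] = pd.to_datetime(raw["date"], format="%Y-%m-%d")
print(raw["date"].dtype)
```

Passing an explicit format makes parsing fail loudly on unexpected strings instead of guessing.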

Channel columns

channel_columns is a required constructor argument and must be a non-empty list.

Each listed channel:

must be present in X
must be fully observed for every row you pass into fit or posterior prediction; Abacus does not silently convert missing channel values to zero
should represent the raw media variable that you want the adstock and saturation transformations to consume
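
A quick way to confirm the first two points is to look for absent columns and for missing values before fitting. This is a sketch only; find_channel_problems is a hypothetical helper, and the same check works for any list of column names, including controls.

```python
import numpy as np
import pandas as pd


def find_channel_problems(X, channel_columns):
    """Hypothetical helper: list absent channels and channels containing NaN."""
    absent = [c for c in channel_columns if c not in X.columns]
    present = [c for c in channel_columns if c in X.columns]
    with_nans = [c for c in present if X[c].isna().any()]
    return absent, with_nans


X = pd.DataFrame({"tv": [120.0, np.nan], "search": [40.0, 42.0]})
absent, with_nans = find_channel_problems(X, ["tv", "search", "radio"])

print(absent)     # channels that are not columns of X
print(with_nans)  # channels with at least one missing value
```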

Target column

target_column names the dependent variable. It defaults to "y", but you can set a different name such as "sales" or "conversions".

For direct Python use:

pass the target as y
name the Series with target_column
keep the target fully observed; missing target values are rejected rather than zero-filled

For combined-file YAML or pipeline flows:

keep the target column in the source dataset
Abacus splits it out of the combined dataset before fitting

Control columns

control_columns is optional.

If you configure it, every listed control column must be present in X. Controls stay in the design matrix as separate regressors; they are not part of y.

Like channels, configured controls must be fully observed for every row passed into fit or posterior prediction.

Abacus does not automatically scale controls. See Scaling and Preprocessing.

Panel dimensions with `dims`

dims is optional. Use it when you want a panel model, for example by geo, brand, or market.

If you set dims=("geo", "brand"):

X must contain geo and brand columns
each row in X represents one date + geo + brand observation
each new date must include every fitted panel slice when you later call posterior-predictive methods with new data

Do not use reserved internal names in dims:

date
channel
control
fourier_mode
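
If your source table already uses one of these names for a panel column, rename the column before you configure dims. A minimal sketch, assuming a hypothetical helper reserved_names_in and a made-up replacement name sales_channel:

```python
import pandas as pd

# Names listed on this page as reserved for internal coordinates.
RESERVED_DIM_NAMES = {"date", "channel", "control", "fourier_mode"}


def reserved_names_in(dims):
    """Hypothetical helper: return the requested dims that clash with reserved names."""
    return sorted(set(dims) & RESERVED_DIM_NAMES)


X = pd.DataFrame({"geo": ["UK", "US"], "channel": ["online", "retail"]})

clashes = reserved_names_in(("geo", "channel"))
print(clashes)

# Rename the clashing column, then use the new name in dims.
X = X.rename(columns={"channel": "sales_channel"})
dims = ("geo", "sales_channel")
```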

For row layout and rectangularity guidance, see Panel Data Layout.

Supported shapes and alignment

Workflow	Supported shape
Direct `PanelMMM.fit()` / `build_model()`	`X`: `DataFrame`; `y`: `Series` or 1D `ndarray`
YAML builder with `data.dataset_path`	One tabular file containing both predictors and the target column
Pipeline runner with `dataset_path`	Same as above
Pipeline runner with `x_path` and `y_path`	Separate feature and target files; the runner extracts `target_column` from the target file

Abacus also has an internal alignment helper that can work with a MultiIndex target Series indexed by [date_column, *dims], but that is mainly used in fit-data rebuild and load flows. For normal fitting, keep y row-aligned with X.

Python example

import pandas as pd

from abacus.mmm import GeometricAdstock, LogisticSaturation
from abacus.mmm.panel import PanelMMM

dataset = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2025-01-06", "2025-01-06", "2025-01-13", "2025-01-13"]
        ),
        "geo": ["UK", "US", "UK", "US"],
        "tv": [120.0, 150.0, 125.0, 152.0],
        "search": [40.0, 55.0, 42.0, 58.0],
        "price_index": [1.02, 0.99, 1.01, 1.00],
        "sales": [820.0, 910.0, 835.0, 925.0],
    }
)

X = dataset.drop(columns=["sales"])
y = dataset["sales"].rename("sales")

mmm = PanelMMM(
    date_column="date",
    channel_columns=["tv", "search"],
    target_column="sales",
    control_columns=["price_index"],
    dims=("geo",),
    adstock=GeometricAdstock(l_max=8),
    saturation=LogisticSaturation(),
)

mmm.fit(X, y)

YAML note

If you use a combined dataset in YAML, the file at data.dataset_path must contain every configured column:

date_column
every entry in channel_columns
every entry in control_columns, if any
every entry in dims, if any
target_column

Example:

data:
  dataset_path: panel_dataset.csv
  date_column: date

target:
  column: sales
  type: revenue

dimensions:
  panel: [geo]

media:
  channels: [tv, search]
  controls: [price_index]
  adstock:
    type: geometric
    l_max: 8
  saturation:
    type: logistic

Common pitfalls

Missing date_column, channel, control, or dimension columns in X
Passing a y Series whose name does not match target_column
Passing pandas X and y with different indexes
Passing a NumPy y with a different length from X
Passing duplicate panel rows or incomplete panel slices for a given date
Passing missing observed channel, control, or target values and expecting Abacus to treat them as structural zeroes
Expecting the YAML builder or pipeline to find a target column that is not present in the combined dataset
Leaving date values as numeric codes instead of normalising them first

Panel Data Layout

This page explains how PanelMMM expects panel rows to be organised in X. For the column-level contract, see Input Data Requirements.

What “panel” means in Abacus

In Abacus, a panel dataset repeats the same time axis across one or more categorical dimensions in dims.

Each row represents:

one date_column value
one combination of dims values, if any
one set of channel and optional control values for that slice

With no extra panel dims, each date appears once. With dims=("geo",), each date appears once per geo. With dims=("geo", "brand"), each date appears once per geo + brand combination.

How `dims` work

Pass panel dimensions when you construct the model:

from abacus.mmm import GeometricAdstock, LogisticSaturation
from abacus.mmm.panel import PanelMMM

mmm = PanelMMM(
    date_column="date",
    channel_columns=["tv", "search"],
    target_column="sales",
    dims=("geo", "brand"),
    adstock=GeometricAdstock(l_max=8),
    saturation=LogisticSaturation(),
)

dims columns stay in X. They are not moved into y.

Abacus reserves these names for internal coordinates, so do not use them in dims:

date
channel
control
fourier_mode

No extra panel dims

If dims=(), X should have one row per date.

date	tv	search	sales
2025-01-06	120	40	820
2025-01-13	125	42	835
2025-01-20	130	45	850

Internally, Abacus reshapes this into:

channels: (date, channel)
target: (date,)
controls, if present: (date, control)

Single panel dim example: `geo`

If dims=("geo",), each date should appear once for each geo value.

date	geo	tv	search	sales
2025-01-06	UK	120	40	820
2025-01-06	US	150	55	910
2025-01-13	UK	125	42	835
2025-01-13	US	152	58	925

Internally, Abacus reshapes this into:

channels: (date, geo, channel)
target: (date, geo)
controls, if present: (date, geo, control)

Multiple panel dims example: `geo` and `brand`

If dims=("geo", "brand"), each row identifies one date, one geo, and one brand.

import pandas as pd

X = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2025-01-06",
                "2025-01-06",
                "2025-01-06",
                "2025-01-06",
                "2025-01-13",
                "2025-01-13",
                "2025-01-13",
                "2025-01-13",
            ]
        ),
        "geo": ["UK", "UK", "US", "US", "UK", "UK", "US", "US"],
        "brand": ["A", "B", "A", "B", "A", "B", "A", "B"],
        "tv": [80.0, 55.0, 92.0, 60.0, 82.0, 58.0, 95.0, 63.0],
        "search": [20.0, 18.0, 24.0, 19.0, 21.0, 18.5, 25.0, 20.0],
    }
)

y = pd.Series(
    [510.0, 370.0, 590.0, 405.0, 520.0, 380.0, 605.0, 418.0],
    name="sales",
)

For a rectangular panel, the row count is:

n_dates * n_geo * n_brand
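
The row-count formula can be turned into a quick check. The sketch below builds a small rectangular key table with pandas and compares its length with the product of the unique counts; the data is illustrative only.

```python
import pandas as pd

dates = pd.to_datetime(["2025-01-06", "2025-01-13"])
keys = ["date", "geo", "brand"]

# Build every date x geo x brand combination once.
X = pd.MultiIndex.from_product(
    [dates, ["UK", "US"], ["A", "B"]], names=keys
).to_frame(index=False)

expected_rows = X["date"].nunique() * X["geo"].nunique() * X["brand"].nunique()

# With no duplicate keys, a matching row count means every cell is present.
is_rectangular = len(X) == expected_rows and not X.duplicated(keys).any()
print(expected_rows, is_rectangular)
```

The row count alone is not enough: a table with one duplicate and one missing cell has the right length, which is why the duplicate check is part of the condition.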

Internal reshape

Abacus converts the pandas inputs into xarray datasets before building the PyMC model.

Input role	Internal variable	xarray dims
`X[channel_columns]`	`_channel`	`(date, *dims, channel)`
`X[control_columns]`	`_control`	`(date, *dims, control)`
`y`	`_target`	`(date, *dims)`

The channel and control dimensions come from the configured column names, not from row values.
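
To build intuition for the (date, *dims, channel) layout, the sketch below reproduces the same shape with pandas and NumPy for a small rectangular geo panel. It is an illustration of the target layout only, not the Abacus implementation, which works with xarray.

```python
import pandas as pd

X = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2025-01-06", "2025-01-06", "2025-01-13", "2025-01-13"]
        ),
        "geo": ["UK", "US", "UK", "US"],
        "tv": [120.0, 150.0, 125.0, 152.0],
        "search": [40.0, 55.0, 42.0, 58.0],
    }
)
channel_columns = ["tv", "search"]

# Sorting by date then geo makes the flat rows line up with the cube axes.
sorted_X = X.sort_values(["date", "geo"])
n_dates = sorted_X["date"].nunique()
n_geo = sorted_X["geo"].nunique()

# Axis order: (date, geo, channel). Only valid for a rectangular panel.
channel_cube = (
    sorted_X[channel_columns]
    .to_numpy()
    .reshape(n_dates, n_geo, len(channel_columns))
)

print(channel_cube.shape)
print(channel_cube[0, 1, 0])  # first date, US, tv
```

The channel axis is indexed by position in channel_columns, which matches the statement that channel coordinates come from column names rather than row values.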

Rectangularity, duplicates, and missing rows

Abacus builds xarray coordinates from the unique values it sees in:

date_column
each configured dimension column
the configured channel or control names

That has four practical consequences:

Keep the panel rectangular. Provide one row for every expected date_column + dims combination.
Use explicit zeroes for structural no-spend or no-activity rows.
Keep declared channel, control, and target values observed within those rows. Abacus rejects missing metric cells instead of silently converting them to zeroes.
Do not use missing rows to mean “unknown”. Abacus validates panel shape before reshape and raises an error if panel cells are missing.

Abacus also requires each date_column + dims combination to appear exactly once. It does not aggregate duplicates for you. If you have duplicate rows, deduplicate or aggregate them before fitting or posterior prediction.
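
Before fitting, you can locate missing panel cells and duplicate rows yourself. This is a pandas-only sketch; panel_gaps_and_duplicates is a hypothetical helper and does not reproduce the exact checks or error messages that Abacus uses.

```python
import pandas as pd


def panel_gaps_and_duplicates(X, date_column, dims):
    """Hypothetical helper: return missing panel cells and duplicated rows."""
    keys = [date_column, *dims]
    # Rows that share a date + dims key with at least one other row.
    duplicates = X[X.duplicated(keys, keep=False)]
    # Every combination of the unique values seen in each key column.
    full_grid = pd.MultiIndex.from_product(
        [X[k].unique() for k in keys], names=keys
    )
    observed = pd.MultiIndex.from_frame(X[keys].drop_duplicates())
    missing = full_grid.difference(observed)
    return missing, duplicates


X = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2025-01-06", "2025-01-06", "2025-01-13", "2025-01-13"]
        ),
        "geo": ["UK", "US", "UK", "UK"],
        "tv": [120.0, 150.0, 125.0, 126.0],
    }
)

missing, duplicates = panel_gaps_and_duplicates(X, "date", ("geo",))
print(list(missing))    # the US cell for 2025-01-13 is absent
print(len(duplicates))  # two rows share the 2025-01-13 UK cell
```

How you resolve each finding is a modelling decision: aggregate or drop duplicates, and add explicit rows for cells that are structurally zero.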

Sorting and uniqueness

Sort your data before fitting:

first by date_column
then by each entry in dims

Abacus keeps dates in the order they appear in X, and time-varying features infer time resolution from adjacent rows. A sorted dataset makes the time axis deterministic and easier to reason about.

Also make sure that each date_column + dims combination appears once in the prepared table, and that every expected panel slice is present for every date.
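
When you sort X, reorder y with the same permutation so that the two stay row-aligned. A minimal pandas sketch with made-up data:

```python
import pandas as pd

X = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2025-01-13", "2025-01-06", "2025-01-13", "2025-01-06"]
        ),
        "geo": ["US", "US", "UK", "UK"],
        "tv": [152.0, 150.0, 125.0, 120.0],
    }
)
y = pd.Series([925.0, 910.0, 835.0, 820.0], name="sales")

# Sort by date first, then by each panel dimension.
X_sorted = X.sort_values(["date", "geo"])

# Reorder y using the sorted index so each target stays with its row.
y_sorted = y.loc[X_sorted.index]

# Give both objects the same fresh index.
X_sorted = X_sorted.reset_index(drop=True)
y_sorted = y_sorted.reset_index(drop=True)

print(X_sorted["geo"].tolist())
print(y_sorted.tolist())
```

This relies on X and y sharing an index before sorting, which is the same alignment requirement that applies at fit time.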

DataFrame versus MultiIndex handling

For normal fitting:

use a regular DataFrame for X
keep date_column and any dims as columns in that DataFrame
use a row-aligned Series for y

Abacus does have internal helpers that can align a MultiIndex target Series indexed by [date_column, *dims], but that is not the main user-facing data preparation pattern for fit().

Practical checklist

One row per date_column + dims combination
No duplicate rows for the same panel cell
Same set of dates for every panel slice
Explicit zeroes for true zero activity
No missing observed channel, control, or target values
Sorted rows before fitting

For scaling choices once the layout is correct, see Scaling and Preprocessing.

Scaling and Preprocessing

Abacus scales channels and the target automatically before it builds the PyMC graph for PanelMMM. This page explains what is scaled, how the Scaling configuration works, and what you still need to preprocess yourself.

What Abacus scales automatically

Abacus computes scales from the reshaped xarray dataset immediately before model construction.

Variable role	Automatic scaling	Notes
Target (`y`)	Yes	Divided by `target_scale` before the likelihood is built.
Channels (`channel_columns`)	Yes	Divided by `channel_scale` before adstock and saturation.
Controls (`control_columns`)	No	Controls enter the model on their original scale.
Date and `dims` columns	No	These define coordinates, not modelled numeric inputs.

Abacus stores the resulting scalers in the model as xarray data:

_target scaler data in model.scalers["_target"]
_channel scaler data in model.scalers["_channel"]

Default behaviour

If you do not pass scaling, PanelMMM uses:

Scaling(
    target=VariableScaling(method="max", dims=dims),
    channel=VariableScaling(method="max", dims=dims),
)

This means:

the target is divided by the maximum over date and all configured dims
each channel is divided by its maximum over date and all configured dims

With no extra panel dims:

target_scale is a scalar
channel_scale has dimension channel

With dims=("geo",) and the default scaling:

target_scale is still a scalar, because scaling reduces over both date and geo
channel_scale still has dimension channel, so each channel is pooled across all geos

If you want per-panel scales instead of pooled scales, set dims=() inside the relevant VariableScaling. See Dimension semantics.
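
The difference between pooled and per-panel scales can be seen with plain pandas. This sketch mimics the max reduction on a tiny geo panel; it illustrates the arithmetic only and does not call Abacus.

```python
import pandas as pd

X = pd.DataFrame(
    {
        "geo": ["UK", "US", "UK", "US"],
        "tv": [120.0, 150.0, 125.0, 152.0],
        "search": [40.0, 55.0, 42.0, 58.0],
    }
)
channels = ["tv", "search"]

# Pooled: one maximum per channel across all dates and geos.
pooled_scale = X[channels].max()

# Per geo: one maximum per geo-channel pair, reducing over dates only.
per_geo_scale = X.groupby("geo")[channels].max()

# Channels divided by the pooled scale peak at exactly 1.0.
pooled_scaled = X[channels] / pooled_scale

print(pooled_scale.to_dict())
print(per_geo_scale.loc["UK"].to_dict())
```

With pooled scaling, the UK series never reaches 1.0 because the US values set the maximum; with per-geo scaling, each geo peaks at 1.0 on its own.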

`Scaling` and `VariableScaling`

Use abacus.mmm.scaling.Scaling and abacus.mmm.scaling.VariableScaling to control automatic scaling.

Setting	Purpose	Allowed values
`VariableScaling.method`	Reduction used to compute the scale	`"max"` or `"mean"`
`VariableScaling.dims`	Extra dimensions to reduce across, in addition to `date`	String or tuple of strings
`Scaling.target`	Scaling rule for the target	`VariableScaling`
`Scaling.channel`	Scaling rule for channels	`VariableScaling`

Rules enforced by the implementation:

date is always assumed in the reduction and must not be listed in VariableScaling.dims.
Duplicate scaling dims are not allowed.
Target scaling dims must come from the model dims.
Channel scaling dims must come from the model dims, with optional inclusion of channel.

You can pass either:

a Scaling object
a plain dictionary with target and channel keys

If the dictionary omits one side, Abacus fills the missing target or channel rule with the default method="max", dims=dims configuration.

Dimension semantics

VariableScaling.dims tells Abacus which dimensions to reduce across in addition to date. It does not tell Abacus which dimensions to keep.

Assume a model with dims=("geo",) so channel data has dimensions (date, geo, channel) and target data has dimensions (date, geo).

Configuration	Reduction performed	Resulting scale dims	Meaning
`target.dims=()`	over `date`	`(geo,)`	One target scale per geo
`target.dims=("geo",)`	over `date`, `geo`	`()`	One pooled target scale
`channel.dims=()`	over `date`	`(geo, channel)`	One scale per geo-channel pair
`channel.dims=("geo",)`	over `date`, `geo`	`(channel,)`	One pooled scale per channel
`channel.dims=("geo", "channel")`	over `date`, `geo`, `channel`	`()`	One pooled scale for all channels
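
The table can be reproduced with NumPy by treating each configured dimension as an axis to reduce over. The arrays below are random placeholders with 52 dates, 3 geos, and 2 channels; only the resulting shapes matter.

```python
import numpy as np

rng = np.random.default_rng(0)
channel_data = rng.uniform(1.0, 100.0, size=(52, 3, 2))  # (date, geo, channel)
target_data = rng.uniform(100.0, 1000.0, size=(52, 3))   # (date, geo)

DATE, GEO, CHANNEL = 0, 1, 2  # axis positions in the arrays above

# date is always reduced; the listed dims are reduced in addition.
scale_shapes = {
    "target.dims=()": target_data.max(axis=DATE).shape,
    'target.dims=("geo",)': target_data.max(axis=(DATE, GEO)).shape,
    "channel.dims=()": channel_data.max(axis=DATE).shape,
    'channel.dims=("geo",)': channel_data.max(axis=(DATE, GEO)).shape,
    'channel.dims=("geo", "channel")': channel_data.max(
        axis=(DATE, GEO, CHANNEL)
    ).shape,
}

for config, shape in scale_shapes.items():
    print(config, shape)
```

A shape of (3,) corresponds to scale dims (geo,), (3, 2) to (geo, channel), (2,) to (channel,), and () to a single pooled scalar.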

Python example

This example keeps separate scales for each geo by reducing only over date:

from abacus.mmm import GeometricAdstock, LogisticSaturation
from abacus.mmm.panel import PanelMMM
from abacus.mmm.scaling import Scaling, VariableScaling

mmm = PanelMMM(
    date_column="date",
    channel_columns=["tv", "search"],
    target_column="sales",
    dims=("geo",),
    scaling=Scaling(
        target=VariableScaling(method="mean", dims=()),
        channel=VariableScaling(method="max", dims=()),
    ),
    adstock=GeometricAdstock(l_max=8),
    saturation=LogisticSaturation(),
)

In that configuration:

the target is divided by the per-geo mean over time
each channel is divided by the per-geo, per-channel maximum over time

YAML example

The YAML builder accepts the same structure through a top-level scaling block:

data:
  date_column: date

target:
  column: y
  type: revenue

dimensions:
  panel: [market]

media:
  channels: [channel_1, channel_2]
  adstock:
    type: geometric
    l_max: 8
  saturation:
    type: logistic

scaling:
  target:
    method: max
    dims: []
  channels:
    method: max
    dims: [market]

In this example:

target is scaled separately for each market
channel is scaled across date and market, leaving one scale per channel

Original units versus model scale

The model is fit on scaled target and channel data.

That affects downstream interpretation:

posterior likelihood and many contribution variables live in scaled target space
channel inputs are transformed after scaling, not in raw units

If you want stored deterministics in original target units, add them explicitly after build_model(...):

mmm.add_original_scale_contribution_variable(
    var=["channel_contribution", "y"]
)

The YAML builder supports the same workflow through original_scale_vars:

original_scale_vars:
  - channel_contribution
  - y

original_scale_vars adds extra original-scale deterministic variables. It does not change how the model is fit.
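
Conceptually, moving a quantity from scaled target space back to original units means multiplying by the target scale, because the target was divided by that scale before fitting. The NumPy sketch below shows the arithmetic with made-up numbers; it assumes a single pooled max scale and does not show how Abacus computes its original-scale variables internally.

```python
import numpy as np

sales = np.array([820.0, 910.0, 835.0, 925.0])

# Pooled max scaling: one scalar for the whole target.
target_scale = sales.max()
scaled_sales = sales / target_scale

# Made-up contribution values expressed in scaled target space.
scaled_contribution = np.array([0.10, 0.12, 0.11, 0.13])

# Multiplying by the scale returns the contribution in sales units.
original_contribution = scaled_contribution * target_scale

print(target_scale)
print(original_contribution)
```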

What Abacus does not preprocess for you

Abacus does not automatically:

scale controls
impute missing data in a domain-aware way
reinterpret missing observed channel, control, or target values as zeroes
sort the dataset for you
repair non-rectangular panel layouts
tolerate duplicate panel rows or incomplete panel slices
coerce Python-API dates to datetimes before fitting

Practical preprocessing advice

Before fitting:

normalise date_column with pd.to_datetime(...)
sort by date_column and then by dims
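make panel gaps explicit by adding a row for every date_column + dims cell, with explicit zeroes where there was truly no activity, instead of leaving rows out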
ensure every date_column + dims panel cell appears exactly once
impute missing observed channel, control, and target values before fitting or posterior prediction instead of relying on implicit zero-fill
decide whether controls should be centred, standardised, log-transformed, or otherwise prepared before they go into control_columns
choose scaling dims deliberately instead of relying on the default when you use panel data
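
Because controls are not scaled automatically, you may want to standardise them yourself. The sketch below z-scores a single control with pandas; whether to centre, standardise, or log-transform is a modelling choice, and the column name price_index_z is made up for the example.

```python
import pandas as pd

X = pd.DataFrame(
    {
        "geo": ["UK", "US", "UK", "US"],
        "price_index": [1.02, 0.99, 1.01, 1.00],
    }
)

# Keep the statistics so the same transform can be applied to new data later.
control_mean = X["price_index"].mean()
control_std = X["price_index"].std(ddof=0)

X["price_index_z"] = (X["price_index"] - control_mean) / control_std

print(X["price_index_z"].round(3).tolist())
```

You would then list price_index_z in control_columns instead of the raw column, and reuse control_mean and control_std when you prepare data for posterior prediction.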

Common pitfalls

Expecting the default scaling to be per-group when it actually pools across the configured panel dims
Adding date to VariableScaling.dims; Abacus rejects this
Forgetting that controls are left on their original scale
Treating VariableScaling.dims as dimensions to keep rather than dimensions to reduce across
Assuming original_scale_vars changes fitting scale rather than adding extra outputs

For the input table shape that scaling operates on, see Panel Data Layout.
