This section explains how to prepare X and y for PanelMMM.
It covers the required columns, how panel rows are organised when you use
dims, and how Abacus scales channels and the target before fitting.
Pages
Input Data Requirements — Required X and y
inputs, column roles, alignment rules, and common input errors.
Panel Data Layout — How to structure rows for no
panel dims, one dim such as geo, or multiple dims such as geo and
brand.
Scaling and Preprocessing — What Abacus
scales automatically, how Scaling works, and what to preprocess yourself.
y as a pandas.Series named target_column, or a one-dimensional NumPy
array of the same length as X
X must contain the date column, all media columns, and any configured
control_columns or dims columns. y carries only the target values.
| Role | Where it must be present | Required | Notes |
|---|---|---|---|
| date_column | X | Yes | Normalise to datetimes or parseable date strings. |
| channel_columns | X | Yes | Every listed channel column must exist in X. |
| target_column | y | Yes | y.name should match target_column. |
| control_columns | X | No | If configured, every listed control column must exist in X. |
| dims | X | No | One column per configured panel dimension, such as geo or brand. |
X and y
When you call fit(X, y) or build_model(X, y):
Keep the target out of X.
Keep X and y row-aligned.
If both are pandas objects, keep the same index on both. The shared
regression builder checks index equality before fitting.
If you pass y as a NumPy array, its length must match len(X).
For panel models, each date_column + dims combination must appear
exactly once. Duplicate rows are rejected.
Abacus uses target_column as the target name throughout the panel reshape.
If y is a Series, its name must match target_column.
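The alignment rules above can be checked before calling fit. The sketch below is a hypothetical pre-flight helper written in plain pandas; check_xy_alignment is not part of the Abacus API, and the column names are illustrative.

```python
import numpy as np
import pandas as pd


def check_xy_alignment(X, y, target_column="y"):
    """Hypothetical pre-flight check; not an Abacus function."""
    problems = []
    if target_column in X.columns:
        problems.append(f"target column {target_column!r} should not be in X")
    if len(y) != len(X):
        problems.append(f"length mismatch: len(X)={len(X)}, len(y)={len(y)}")
    if isinstance(y, pd.Series):
        if y.name != target_column:
            problems.append(f"y.name is {y.name!r}, expected {target_column!r}")
        if not y.index.equals(X.index):
            problems.append("X and y have different indexes")
    return problems


X = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-01-01", "2024-01-08"]),
        "tv": [100.0, 120.0],
    }
)
y_good = pd.Series([10.0, 12.0], name="sales")
y_bad = pd.Series([10.0, 12.0], index=[5, 6], name="revenue")

good_problems = check_xy_alignment(X, y_good, target_column="sales")
bad_problems = check_xy_alignment(X, y_bad, target_column="sales")
array_problems = check_xy_alignment(X, np.array([10.0]), target_column="sales")
```

An empty list means the pair passes these particular checks; Abacus still runs its own validation when you fit.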
Date column
date_column is required in X.
Abacus expects calendar dates, not integer date codes. In practice:
Use datetime64[ns] where possible.
Parse string dates with pd.to_datetime(...) before fitting when you use the
Python API.
Do not rely on numeric date values such as 0, 1, 2. Pandas interprets integers
as offsets from the Unix epoch (in nanoseconds by default), so they all land at
the start of 1970, which is usually not what you want.
The YAML builder normalises X[date_column] with pd.to_datetime(...) after
loading the dataset. Direct Python use does not add an equivalent preprocessing
step for you.
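For direct Python use, the date handling described above might look like the following sketch. The column names are illustrative; pd.to_datetime is the only library call involved.

```python
import pandas as pd

X = pd.DataFrame(
    {
        "date": ["2024-01-01", "2024-01-08", "2024-01-15"],
        "tv": [100.0, 120.0, 90.0],
    }
)

# Parse string dates explicitly before fitting through the Python API.
X["date"] = pd.to_datetime(X["date"], format="%Y-%m-%d")

# Pitfall: integer codes are read as offsets from 1970-01-01,
# so every value ends up in 1970 instead of on a real calendar date.
codes = pd.Series([0, 1, 2])
parsed_codes = pd.to_datetime(codes)
```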
Channel columns
channel_columns is a required constructor argument and must be a non-empty
list.
Each listed channel:
must be present in X
must be fully observed for every row you pass into fit or posterior
prediction; Abacus does not silently convert missing channel values to zero
should represent the raw media variable that you want the adstock and
saturation transformations to consume
Target column
target_column names the dependent variable. It defaults to "y", but you can
set a different name such as "sales" or "conversions".
For direct Python use:
pass the target as y
name the Series with target_column
keep the target fully observed; missing target values are rejected rather
than zero-filled
For combined-file YAML or pipeline flows:
keep the target column in the source dataset
Abacus splits it out of the combined dataset before fitting
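If you start from a combined table but call the Python API directly, you can do the split yourself. The sketch below uses plain pandas with illustrative column names; it is not the code the YAML builder or pipeline runs.

```python
import pandas as pd

target_column = "sales"  # illustrative name

combined = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-01-01", "2024-01-08"]),
        "tv": [100.0, 120.0],
        "sales": [10.0, 12.0],
    }
)

# Selecting one column returns a Series already named after that column,
# so y.name matches target_column without an explicit rename.
y = combined[target_column]

# Drop the target so it does not leak into the predictors.
X = combined.drop(columns=[target_column])
```

Because both objects come from the same DataFrame, they share an index and stay row-aligned.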
Control columns
control_columns is optional.
If you configure it, every listed control column must be present in X.
Controls stay in the design matrix as separate regressors; they are not part of
y.
Like channels, configured controls must be fully observed for every row passed
into fit or posterior prediction.
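One way to confirm that channels and controls are present and fully observed is sketched below. The column names are illustrative, and the check is ordinary pandas rather than an Abacus helper.

```python
import numpy as np
import pandas as pd

channel_columns = ["tv", "search"]   # illustrative names
control_columns = ["price_index"]    # illustrative name

X = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-15"]),
        "tv": [100.0, np.nan, 90.0],   # one missing value that must be resolved
        "search": [40.0, 35.0, 0.0],   # explicit zero: no spend that week
        "price_index": [1.00, 1.02, 1.01],
    }
)

required = channel_columns + control_columns

# Columns that are configured but absent from X.
absent = [col for col in required if col not in X.columns]

# Count missing values per configured column and keep only the offenders.
missing_counts = X[required].isna().sum()
columns_with_gaps = missing_counts[missing_counts > 0].to_dict()
```

Here columns_with_gaps flags tv. Whether to fill a gap with zero is a modelling decision: do it only when the value is a true structural zero.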
One tabular file containing both predictors and the target column
Pipeline runner with dataset_path: same as above.
Pipeline runner with x_path and y_path: separate feature and target files; the
runner extracts target_column from the target file.
Abacus also has an internal alignment helper that can work with a MultiIndex
target Series indexed by [date_column, *dims], but that is mainly used in
fit-data rebuild and load flows. For normal fitting, keep y row-aligned with
X.
Missing date_column, channel, control, or dimension columns in X
Passing a y Series whose name does not match target_column
Passing pandas X and y with different indexes
Passing a NumPy y with a different length from X
Passing duplicate panel rows or incomplete panel slices for a given date
Passing missing observed channel, control, or target values and expecting
Abacus to treat them as structural zeroes
Expecting the YAML builder or pipeline to find a target column that is not
present in the combined dataset
Leaving date values as numeric codes instead of normalising them first
Panel Data Layout
This page explains how PanelMMM expects panel rows to be organised in X.
For the column-level contract, see
Input Data Requirements.
What “panel” means in Abacus
In Abacus, a panel dataset repeats the same time axis across one or more
categorical dimensions in dims.
Each row represents:
one date_column value
one combination of dims values, if any
one set of channel and optional control values for that slice
With no extra panel dims, each date appears once. With dims=("geo",), each
date appears once per geo. With dims=("geo", "brand"), each date appears
once per geo + brand combination.
How dims work
Pass panel dimensions when you construct the model:
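A minimal sketch of the arguments involved follows. The argument names come from this documentation and the values are illustrative. The import path for PanelMMM and any other arguments the constructor requires are not covered on this page, so the call itself is left as a comment.

```python
# Argument names are taken from this page; the values are illustrative.
panel_kwargs = {
    "date_column": "date",
    "channel_columns": ["tv", "search"],
    "target_column": "sales",
    "control_columns": ["price_index"],
    "dims": ("geo", "brand"),
}

# Assumed usage: import PanelMMM from wherever your Abacus version exposes it
# and pass any further arguments it requires alongside these.
# mmm = PanelMMM(**panel_kwargs, ...)

# Columns that X must contain for this configuration (the target stays in y).
required_in_X = [
    panel_kwargs["date_column"],
    *panel_kwargs["dims"],
    *panel_kwargs["channel_columns"],
    *panel_kwargs["control_columns"],
]
```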
Abacus converts the pandas inputs into xarray datasets before building the
PyMC model.
| Input role | Internal variable | xarray dims |
|---|---|---|
| X[channel_columns] | _channel | (date, *dims, channel) |
| X[control_columns] | _control | (date, *dims, control) |
| y | _target | (date, *dims) |
The channel and control dimensions come from the configured column names,
not from row values.
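To make the (date, *dims, channel) layout concrete, the sketch below reshapes a small long-format table into a three-dimensional array with pandas and NumPy. It illustrates the target shape only; Abacus performs its own conversion to xarray, and the column names here are illustrative.

```python
import pandas as pd

channel_columns = ["tv", "search"]
dims = ("geo",)

X = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2024-01-01", "2024-01-01", "2024-01-08", "2024-01-08"]
        ),
        "geo": ["north", "south", "north", "south"],
        "tv": [100.0, 80.0, 120.0, 70.0],
        "search": [40.0, 30.0, 35.0, 0.0],
    }
)

# Index by date and dims, keep only channel columns, and sort the rows.
indexed = X.set_index(["date", *dims])[channel_columns].sort_index()
n_dates = indexed.index.get_level_values("date").nunique()
n_geos = indexed.index.get_level_values("geo").nunique()

# This plain reshape is only valid for a rectangular panel with no duplicates.
# Axes are (date, geo, channel); the channel axis follows the column names.
channel_cube = indexed.to_numpy().reshape(n_dates, n_geos, len(channel_columns))
```

The last axis has one entry per name in channel_columns, which is why the channel dimension comes from configuration rather than from row values.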
Rectangularity, duplicates, and missing rows
Abacus builds xarray coordinates from the unique values it sees in:
date_column
each configured dimension column
the configured channel or control names
That has three practical consequences:
Keep the panel rectangular. Provide one row for every expected
date_column + dims combination.
Use explicit zeroes for structural no-spend or no-activity rows.
Keep declared channel, control, and target values observed within those rows.
Abacus rejects missing metric cells instead of silently converting them to
zeroes.
Do not use missing rows to mean “unknown”. Abacus validates panel shape
before reshape and raises an error if panel cells are missing.
Abacus also requires each date_column + dims combination to appear exactly
once. It does not aggregate duplicates for you. If you have duplicate rows,
deduplicate or aggregate them before fitting or posterior prediction.
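Both conditions can be checked up front. The sketch below looks for duplicate panel cells and for cells missing from the full date-by-dims grid; it uses plain pandas with illustrative column names and is not Abacus's own validation code.

```python
import pandas as pd

date_column = "date"
dims = ["geo"]
keys = [date_column, *dims]

X = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2024-01-01", "2024-01-01", "2024-01-08", "2024-01-08", "2024-01-15"]
        ),
        "geo": ["north", "south", "north", "north", "north"],
        "tv": [100.0, 80.0, 120.0, 5.0, 90.0],
    }
)

# Rows that share a date + dims combination with at least one other row.
duplicate_rows = X[X.duplicated(subset=keys, keep=False)]

# The full rectangular grid implied by the unique dates and dim values.
expected = pd.MultiIndex.from_product(
    [X[col].unique() for col in keys], names=keys
)
observed = pd.MultiIndex.from_frame(X[keys].drop_duplicates())

# Panel cells that the grid expects but the table does not provide.
missing_cells = expected.difference(observed)
```

In this example the north row for 2024-01-08 appears twice, and south is missing for two dates. How to resolve each case (summing duplicates, adding explicit zero rows) depends on what the data means, so that step is left to you.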
Sorting and uniqueness
Sort your data before fitting:
first by date_column
then by each entry in dims
Abacus keeps dates in the order they appear in X, and time-varying features
infer time resolution from adjacent rows. A sorted dataset makes the time axis
deterministic and easier to reason about.
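When you sort X, reorder y with it so that the rows stay paired. A sketch in plain pandas, with illustrative column names:

```python
import pandas as pd

date_column = "date"
dims = ["geo"]

X = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2024-01-08", "2024-01-01", "2024-01-08", "2024-01-01"]
        ),
        "geo": ["south", "south", "north", "north"],
        "tv": [70.0, 80.0, 120.0, 100.0],
    }
)
y = pd.Series([7.0, 8.0, 12.0, 10.0], name="sales")

# Sort by date first, then by each panel dimension.
X_sorted = X.sort_values([date_column, *dims])

# Reorder y using X's original index labels so each target follows its row.
y_sorted = y.loc[X_sorted.index]

# Give both objects the same fresh index.
X_sorted = X_sorted.reset_index(drop=True)
y_sorted = y_sorted.reset_index(drop=True)
```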
Also make sure that each date_column + dims combination appears once in the
prepared table, and that every expected panel slice is present for every date.
DataFrame versus MultiIndex handling
For normal fitting:
use a regular DataFrame for X
keep date_column and any dims as columns in that DataFrame
use a row-aligned Series for y
Abacus does have internal helpers that can align a MultiIndex target Series
indexed by [date_column, *dims], but that is not the main user-facing data
preparation pattern for fit().
Practical checklist
One row per date_column + dims combination
No duplicate rows for the same panel cell
Same set of dates for every panel slice
Explicit zeroes for true zero activity
No missing observed channel, control, or target values
Abacus scales channels and the target automatically before it builds the PyMC
graph for PanelMMM. This page explains what is scaled, how the Scaling
configuration works, and what you still need to preprocess yourself.
What Abacus scales automatically
Abacus computes scales from the reshaped xarray dataset immediately before model
construction.
| Variable role | Automatic scaling | Notes |
|---|---|---|
| Target (y) | Yes | Divided by target_scale before the likelihood is built. |
| Channels (channel_columns) | Yes | Divided by channel_scale before adstock and saturation. |
| Controls (control_columns) | No | Controls enter the model on their original scale. |
| Date and dims columns | No | These define coordinates, not modelled numeric inputs. |
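The division step can be pictured with a small NumPy sketch. It assumes, purely for illustration, that each scale is the maximum absolute value over the date axis within each panel slice; the statistic and the dimensions Abacus actually uses depend on the Scaling configuration, so treat the numbers below as an example of the arithmetic only.

```python
import numpy as np

# Channel values shaped (date, geo, channel), as in the reshaped dataset.
channels = np.array(
    [
        [[100.0, 40.0], [80.0, 30.0]],
        [[200.0, 20.0], [40.0, 60.0]],
    ]
)
# Target values shaped (date, geo).
target = np.array([[10.0, 8.0], [20.0, 4.0]])

# ASSUMPTION: scale = maximum absolute value over the date axis (axis 0).
channel_scale = np.abs(channels).max(axis=0)  # shape (geo, channel)
target_scale = np.abs(target).max(axis=0)     # shape (geo,)

# Scaling is a plain division, broadcast along the date axis.
scaled_channels = channels / channel_scale
scaled_target = target / target_scale
```

Under this assumption every scaled series peaks at 1 within its slice. Controls are left untouched, so standardise them yourself if their raw magnitudes differ widely.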
Abacus stores the resulting scalers in the model as xarray data: