Scaling and Preprocessing

Abacus scales channels and the target automatically before it builds the PyMC graph for PanelMMM. This page explains what is scaled, how the Scaling configuration works, and what you still need to preprocess yourself.

What Abacus scales automatically

Abacus computes scales from the reshaped xarray dataset immediately before model construction.

Variable role | Automatic scaling | Notes
Target (y) | Yes | Divided by target_scale before the likelihood is built.
Channels (channel_columns) | Yes | Divided by channel_scale before adstock and saturation.
Controls (control_columns) | No | Controls enter the model on their original scale.
Date and dims columns | No | These define coordinates, not modelled numeric inputs.

Abacus stores the resulting scalers in the model as xarray data:

  • the target scaler in model.scalers["_target"]
  • the channel scaler in model.scalers["_channel"]

Default behaviour

If you do not pass scaling, PanelMMM uses:

Scaling(
    target=VariableScaling(method="max", dims=dims),
    channel=VariableScaling(method="max", dims=dims),
)

This means:

  • the target is divided by the maximum over date and all configured dims
  • each channel is divided by its maximum over date and all configured dims

With no extra panel dims:

  • target_scale is a scalar
  • channel_scale has dimension channel

With dims=("geo",) and the default scaling:

  • target_scale is still a scalar, because scaling reduces over both date and geo
  • channel_scale still has dimension channel, so each channel is pooled across all geos

If you want per-panel scales instead of pooled scales, set dims=() inside the relevant VariableScaling. See Dimension semantics.
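
The effect of the pooled default can be seen on a small array. The following is a minimal NumPy sketch, not Abacus code: the toy values and the axis order (date, geo, channel) are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy data with axes (date, geo, channel): 3 dates, 2 geos, 2 channels.
channel_data = np.array(
    [
        [[10.0, 1.0], [100.0, 4.0]],
        [[20.0, 2.0], [200.0, 8.0]],
        [[40.0, 3.0], [400.0, 6.0]],
    ]
)
# Hypothetical target with axes (date, geo).
target_data = np.array([[5.0, 50.0], [8.0, 80.0], [10.0, 100.0]])

# Default method="max", dims=("geo",): reduce over date and geo.
target_scale = target_data.max(axis=(0, 1))    # scalar
channel_scale = channel_data.max(axis=(0, 1))  # one value per channel

scaled_channels = channel_data / channel_scale

# The small geo (index 0) never gets close to 1 under pooled scaling.
small_geo_peak = scaled_channels[:, 0, 0].max()
print(target_scale, channel_scale, small_geo_peak)
```

With pooled scales, the smaller geo occupies only a narrow part of the scaled range, which is one reason to consider per-panel scales for heterogeneous panels.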

Scaling and VariableScaling

Use abacus.mmm.scaling.Scaling and abacus.mmm.scaling.VariableScaling to control automatic scaling.

Setting | Purpose | Allowed values
VariableScaling.method | Reduction used to compute the scale | "max" or "mean"
VariableScaling.dims | Extra dimensions to reduce across, in addition to date | String or tuple of strings
Scaling.target | Scaling rule for the target | VariableScaling
Scaling.channel | Scaling rule for channels | VariableScaling

Rules enforced by the implementation:

  • date is always included in the reduction and must not be listed in VariableScaling.dims.
  • Duplicate scaling dims are not allowed.
  • Target scaling dims must come from the model dims.
  • Channel scaling dims must come from the model dims, with optional inclusion of channel.
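
These rules can be restated as a small standalone check. The function below is a hypothetical illustration written for this page. It is not the validation code that Abacus runs, and the name validate_scaling_dims and its allow_channel parameter are assumptions.

```python
def validate_scaling_dims(scaling_dims, model_dims, allow_channel=False):
    """Hypothetical restatement of the documented rules; not Abacus's own code."""
    # A single string is treated as a one-element tuple.
    if isinstance(scaling_dims, str):
        scaling_dims = (scaling_dims,)
    scaling_dims = tuple(scaling_dims)

    # Rule 1: date is always reduced, so it must not be listed.
    if "date" in scaling_dims:
        raise ValueError("'date' is always reduced and must not be listed in dims")

    # Rule 2: no duplicates.
    if len(set(scaling_dims)) != len(scaling_dims):
        raise ValueError("duplicate scaling dims are not allowed")

    # Rules 3 and 4: dims must come from the model dims (plus channel for channels).
    allowed = set(model_dims) | ({"channel"} if allow_channel else set())
    unknown = [d for d in scaling_dims if d not in allowed]
    if unknown:
        raise ValueError(f"scaling dims not in model dims: {unknown}")

    return scaling_dims


# Target rule: only model dims are allowed.
target_dims = validate_scaling_dims("geo", model_dims=("geo",))
# Channel rule: model dims plus the optional channel dimension.
channel_dims = validate_scaling_dims(
    ("geo", "channel"), model_dims=("geo",), allow_channel=True
)
```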

You can pass either:

  • a Scaling object
  • a plain dictionary with target and channel keys

If the dictionary omits one side, Abacus fills the missing target or channel rule with the default method="max", dims=dims configuration.
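
The fill-in behaviour can be pictured with plain dictionaries. This is a hypothetical sketch only: the helper fill_scaling_defaults and the inner {"method": ..., "dims": ...} layout are assumptions for illustration, not the structure Abacus is confirmed to use internally.

```python
def fill_scaling_defaults(scaling, model_dims):
    """Hypothetical helper: fill a missing target or channel rule with the default."""
    default = {"method": "max", "dims": tuple(model_dims)}
    scaling = dict(scaling or {})
    return {
        # Keep the user's rule when present, otherwise copy the default rule.
        "target": scaling.get("target", dict(default)),
        "channel": scaling.get("channel", dict(default)),
    }


# Only the target rule is given; the channel rule falls back to the default.
partial = {"target": {"method": "mean", "dims": ()}}
resolved = fill_scaling_defaults(partial, model_dims=("geo",))
print(resolved)
```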

Dimension semantics

VariableScaling.dims tells Abacus which dimensions to reduce across in addition to date. It does not tell Abacus which dimensions to keep.

Assume a model with dims=("geo",) so channel data has dimensions (date, geo, channel) and target data has dimensions (date, geo).

Configuration | Reduction performed | Resulting scale dims | Meaning
target.dims=() | over date | (geo,) | One target scale per geo
target.dims=("geo",) | over date, geo | () | One pooled target scale
channel.dims=() | over date | (geo, channel) | One scale per geo-channel pair
channel.dims=("geo",) | over date, geo | (channel,) | One pooled scale per channel
channel.dims=("geo", "channel") | over date, geo, channel | () | One pooled scale for all channels
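
The same reduce-across semantics can be reproduced with NumPy to check the resulting shapes. This is a minimal sketch under assumed array layouts. The helper compute_scale is hypothetical and only mimics the documented behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed layouts: channel data is (date, geo, channel), target data is (date, geo).
channel_axes = {"date": 0, "geo": 1, "channel": 2}
target_axes = {"date": 0, "geo": 1}
channel_data = rng.uniform(1.0, 10.0, size=(12, 3, 2))
target_data = rng.uniform(1.0, 10.0, size=(12, 3))


def compute_scale(data, axes, dims, method="max"):
    """Hypothetical helper: reduce over date plus the listed dims."""
    reduce_axes = tuple(axes[d] for d in ("date", *dims))
    reducer = np.max if method == "max" else np.mean
    return reducer(data, axis=reduce_axes)


target_per_geo = compute_scale(target_data, target_axes, dims=())
target_pooled = compute_scale(target_data, target_axes, dims=("geo",))
channel_per_pair = compute_scale(channel_data, channel_axes, dims=())
channel_per_channel = compute_scale(channel_data, channel_axes, dims=("geo",))
channel_pooled = compute_scale(channel_data, channel_axes, dims=("geo", "channel"))

for name, scale in [
    ("target.dims=()", target_per_geo),
    ("target.dims=('geo',)", target_pooled),
    ("channel.dims=()", channel_per_pair),
    ("channel.dims=('geo',)", channel_per_channel),
    ("channel.dims=('geo', 'channel')", channel_pooled),
]:
    print(name, np.shape(scale))
```

Every dimension listed in dims disappears from the scale, which is why an empty tuple gives the most granular result.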

Python example

This example keeps separate scales for each geo by reducing only over date:

from abacus.mmm import GeometricAdstock, LogisticSaturation
from abacus.mmm.panel import PanelMMM
from abacus.mmm.scaling import Scaling, VariableScaling

mmm = PanelMMM(
    date_column="date",
    channel_columns=["tv", "search"],
    target_column="sales",
    dims=("geo",),
    scaling=Scaling(
        target=VariableScaling(method="mean", dims=()),
        channel=VariableScaling(method="max", dims=()),
    ),
    adstock=GeometricAdstock(l_max=8),
    saturation=LogisticSaturation(),
)

In that configuration:

  • the target is divided by the per-geo mean over time
  • each channel is divided by the per-geo, per-channel maximum over time

YAML example

The YAML builder accepts the same structure through a top-level scaling block:

data:
  date_column: date

target:
  column: y
  type: revenue

dimensions:
  panel: [market]

media:
  channels: [channel_1, channel_2]
  adstock:
    type: geometric
    l_max: 8
  saturation:
    type: logistic

scaling:
  target:
    method: max
    dims: []
  channels:
    method: max
    dims: [market]

In this example:

  • the target is scaled separately for each market
  • each channel is scaled across date and market, leaving one scale per channel

Original units versus model scale

The model is fit on scaled target and channel data.

That affects downstream interpretation:

  • the likelihood and many contribution variables live in scaled target space
  • adstock and saturation are applied to scaled channel inputs, not to raw units

If you want stored deterministics in original target units, add them explicitly after build_model(...):

mmm.add_original_scale_contribution_variable(
    var=["channel_contribution", "y"]
)

The YAML builder supports the same workflow through original_scale_vars:

original_scale_vars:
  - channel_contribution
  - y

original_scale_vars adds extra original-scale deterministic variables. It does not change how the model is fit.
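
Because the target is divided by target_scale before fitting, a quantity in scaled target space maps back to original units by multiplying by the same scale. The sketch below shows that arithmetic with NumPy. The numbers are hypothetical, and it illustrates the relationship only, not how Abacus builds its deterministics.

```python
import numpy as np

# Hypothetical target in original units (for example, revenue per period).
target = np.array([200.0, 500.0, 1000.0])

# method="max" with no extra dims: a single scalar scale.
target_scale = target.max()
scaled_target = target / target_scale

# Hypothetical contribution values as they would appear in scaled target space.
scaled_contribution = np.array([0.125, 0.25, 0.5])

# Back to original target units.
original_contribution = scaled_contribution * target_scale
print(original_contribution)
```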

What Abacus does not preprocess for you

Abacus does not automatically:

  • scale controls
  • impute missing data in a domain-aware way
  • reinterpret missing observed channel, control, or target values as zeroes
  • sort the dataset for you
  • repair non-rectangular panel layouts
  • tolerate duplicate panel rows or incomplete panel slices
  • coerce Python-API dates to datetimes before fitting

Practical preprocessing advice

Before fitting:

  • normalise date_column with pd.to_datetime(...)
  • sort by date_column and then by dims
  • make panel gaps explicit instead of leaving missing rows
  • ensure every date_column + dims panel cell appears exactly once
  • impute missing observed channel, control, and target values before fitting or posterior prediction instead of relying on implicit zero-fill
  • decide whether controls should be centred, standardised, log-transformed, or otherwise prepared before they go into control_columns
  • choose scaling dims deliberately instead of relying on the default when you use panel data
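
Several of these steps can be sketched with pandas. The column names and values below are hypothetical, and the snippet only surfaces gaps as explicit missing rows. How to impute them remains a domain decision.

```python
import pandas as pd

# Hypothetical raw panel: string dates, unsorted, and no row for (2024-01-08, south).
raw = pd.DataFrame(
    {
        "date": ["2024-01-08", "2024-01-01", "2024-01-01", "2024-01-15", "2024-01-15"],
        "geo": ["north", "south", "north", "north", "south"],
        "tv": [120.0, 80.0, 100.0, 90.0, 70.0],
        "sales": [1000.0, 700.0, 900.0, 950.0, 650.0],
    }
)

df = raw.copy()
df["date"] = pd.to_datetime(df["date"])  # normalise the date column

# Every date + dims cell must appear exactly once.
if df.duplicated(subset=["date", "geo"]).any():
    raise ValueError("duplicate (date, geo) rows found")

# Build the full rectangular grid so gaps become explicit NaN rows.
full_index = pd.MultiIndex.from_product(
    [df["date"].drop_duplicates().sort_values(), df["geo"].drop_duplicates().sort_values()],
    names=["date", "geo"],
)
df = df.set_index(["date", "geo"]).reindex(full_index).reset_index()

# Rows that still need a deliberate imputation decision before fitting.
missing = df[df[["tv", "sales"]].isna().any(axis=1)]

# Sort by date and then by dims.
df = df.sort_values(["date", "geo"]).reset_index(drop=True)
print(missing)
```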

Common pitfalls

  • Expecting the default scaling to be per-group when it actually pools across the configured panel dims
  • Adding date to VariableScaling.dims; Abacus rejects this
  • Forgetting that controls are left on their original scale
  • Treating VariableScaling.dims as dimensions to keep rather than dimensions to reduce across
  • Assuming original_scale_vars changes fitting scale rather than adding extra outputs

For the input table shape that scaling operates on, see Panel Data Layout.