Causal Inference Without an RCT: The Playbook Big Companies Use

Most analysts cannot defend a policy-change number, and CFOs know it. Four methods, one decomposition discipline, one sensitivity table, applied in the order that matches the data and the stakes.

In This Article

  1. Most analysts cannot defend a policy-change number, and CFOs know it
  2. The fix is a four-tier ladder, not a fancier dashboard
  3. Policy changes are the modal “big number” in modern boardrooms
  4. Tier 1: Box-Tiao ARIMA solves 80 percent of policy-impact questions cleanly
  5. Tier 2: When the effect ramps and fades, switch to piecewise-linear intervention
  6. Tier 3: For overlapping shocks, isolate intervention neurons inside the network
  7. Tier 4: When time-series alone won’t defend the number, reach for a quasi-experiment atlas
  8. The decomposition discipline is what separates the playbook from a textbook
  9. Sequencing the four-tier ladder inside your business

Most analysts cannot defend a policy-change number, and CFOs know it

A few years ago a leadership team asked the analytics group a simple question. The pricing committee had reorganized the way regional discounts were approved six months earlier. What was the impact on margin?

The team came back with a confident answer. They had compared the three months after the change to the three months before, controlled for seasonality with a year-over-year index, and reported a margin lift in the high single digits. The deck was clean. The chart was clean. The CFO killed it in two questions.

“What would margin have been if you’d done nothing?”

“How do you know it wasn’t the category mix shift you mentioned on slide four?”

The team did not have answers. The number went into a footnote. Six months of work, retired into a footnote.

That scene plays out in most $50M-revenue-and-up businesses I work with as a fractional CDO. The analytics team produces a pre/post number. Finance kills it. Leadership stops trusting the analytics team for the next high-stakes question. The cycle repeats.

The fix is not more dashboards. The fix is a small set of methods that produce defensible counterfactuals for events you cannot A/B test, paired with a discipline for decomposing concurrent shocks, paired with a sensitivity table that survives the CFO’s first round of challenges.

The fix is a four-tier ladder, not a fancier dashboard

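Each tier exists because the simpler tier before it fails in a specific, diagnosable way. Tier 1 is Box-Tiao ARIMA with a step dummy. Tier 2 swaps the step for a piecewise-linear intervention function. Tier 3 is a neural network with isolated intervention neurons. Tier 4 steps outside time-series modeling into quasi-experimental designs. The discipline is not picking the most sophisticated model. The discipline is picking the simplest model that passes residual diagnostics and sensitivity tests for the question on the table.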

Policy changes are the modal “big number” in modern boardrooms

Five or six years ago a finance leader could ignore most policy-impact questions because the policy lever was small. Pricing tiers shifted by single-digit percent. Marketing reallocations happened inside an annual budget envelope. The cost of a wrong estimate was bounded.

That is no longer the shape of the question. Pricing committees reorganize on quarterly cycles. Regulatory events arrive without warning and disrupt unit economics for two to four quarters. Marketing platforms shift channel attribution rules in ways that move CAC by 20 to 40 percent without any change in actual marketing spend. None of these can be tested with an RCT. Every one of them produces a “what was the impact” question that has to be answered with a number leadership can defend in a board meeting.

The default tool most analytics teams reach for is the pre/post comparison with a year-over-year control. In a typical business with secular growth, that approach inflates the headline number by 30 to 60 percent. The inflation comes from treating the pre-event level as the baseline instead of extrapolating the pre-event trend forward, which credits the policy change for what the trend would have delivered anyway. You can replicate the failure on any synthetic dataset in under an hour. The reason the failure persists in practice is that the alternative methods are not in the average analytics curriculum.

fig. 1 Same data, two baselines. Holding the pre-event mean flat (coral) assigns the underlying trend’s growth to the policy change. Extrapolating the pre-event trend (blue) is what the Box-Tiao model does and what survives audit.

That is the gap this playbook closes.
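The inflation is easy to reproduce. A minimal sketch in base R, on synthetic data with no seasonality and no serial correlation; the slope, noise level, and effect size are invented for illustration and are not figures from any client series:

```r
# Synthetic monthly series: linear growth plus a known step effect.
set.seed(42)
n_pre  <- 36
n_post <- 6
month  <- seq_len(n_pre + n_post)
step   <- as.numeric(month > n_pre)          # 0 before the event, 1 after
true_effect <- 5
y <- 100 + 0.3 * month + true_effect * step + rnorm(length(month), sd = 0.3)

# Naive pre/post: last 6 pre-event months vs. the 6 post-event months.
pre_window <- month > n_pre - n_post & month <= n_pre
naive <- mean(y[step == 1]) - mean(y[pre_window])

# Trend-aware: estimate the step jointly with a linear trend.
fit            <- lm(y ~ month + step)
trend_adjusted <- unname(coef(fit)["step"])

print(round(c(true = true_effect, naive = naive, trend_adjusted = trend_adjusted), 2))
```

The naive figure overshoots the true effect by roughly the slope times the six months separating the two window midpoints. The trend-adjusted coefficient does not carry that extra growth.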

Tier 1
Box-Tiao ARIMA solves 80 percent of policy-impact questions cleanly

The Box-Tiao formulation has been in print since the 1970s. Fit an ARIMA model on the pre-event series, extend it with a step-function dummy variable (zero before the event date, one after), refit on the full series with the dummy as an external regressor. The coefficient on the dummy is the level-shift magnitude, with a standard error that survives audit because it accounts for the serial dependence in the residuals.

In R the setup is short:

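library(forecast)

# Identify the ARIMA structure on the pre-event series only
pre_model <- auto.arima(y_pre)
# Step dummy: 0 before the event date, 1 from the event date onward
xreg      <- c(rep(0, n_pre), rep(1, n_post))
# Refit on the full series; $arma[c(1, 6, 2)] is the non-seasonal (p, d, q).
# Seasonal and drift terms chosen by auto.arima are not carried over here.
model     <- Arima(y_full, order = pre_model$arma[c(1, 6, 2)], xreg = xreg)
effect    <- coef(model)["xreg"]                 # level-shift estimate
effect_se <- sqrt(diag(vcov(model)))["xreg"]     # its standard error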
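To see the estimator recover a known effect, here is a self-contained sketch on simulated data. It uses base R's `stats::arima` rather than the forecast package so it runs without extra installs, fixes the order at AR(1) instead of selecting it, and adds an explicit linear trend regressor. All of those are simplifying assumptions of mine, and every number in the simulation is invented.

```r
# Simulated series: trend + AR(1) noise + a level shift of 4 at the event.
set.seed(1)
n_pre  <- 48
n_post <- 12
n      <- n_pre + n_post
month  <- seq_len(n)
step   <- as.numeric(month > n_pre)
noise  <- as.numeric(arima.sim(model = list(ar = 0.5), n = n, sd = 0.5))
y      <- 50 + 0.2 * month + 4 * step + noise

# Step dummy and trend enter as external regressors of an AR(1) model.
X   <- cbind(trend = month, step = step)
fit <- arima(y, order = c(1, 0, 0), xreg = X, method = "ML")

effect    <- unname(coef(fit)["step"])
effect_se <- unname(sqrt(diag(vcov(fit)))["step"])
ci95      <- effect + c(-1.96, 1.96) * effect_se   # normal-approximation interval

print(round(c(effect = effect, se = effect_se, lower = ci95[1], upper = ci95[2]), 2))
```

The standard error here comes from the model's covariance matrix, so it reflects the AR(1) dependence rather than assuming independent residuals.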

One anonymized case for context. A top-three brewer in an emerging consumer-goods market needed to quantify the impact of a marketing-platform shift on weekly volume for one of its mid-tier SKUs. The team had run a regression of weekly volume on a binary post-period flag with calendar fixed effects. The headline read as a high single-digit lift. After refitting as Box-Tiao with an auto.arima base and the dummy as xreg, the level-shift coefficient came in materially smaller than the original estimate, with a confidence interval that included the original number only at its upper bound. The simple regression had absorbed pre-existing trend acceleration into the policy effect.

Four effect shapes show up often enough to be worth naming explicitly. A step (the level shifts and stays shifted) fits most reorganizations and regulatory changes. A temporary impulse (one period of disruption, return to baseline) fits one-off events like a logistics outage. A permanent ramp (gradual move to a new level) fits behavioral adjustments to price changes. A gradual decay (shock that fades over time) fits crisis responses. The choice is not aesthetic. You look at the residuals of the simplest step specification; if they show a systematic post-event pattern, that pattern tells you which alternative shape to refit.

fig. 2 The four canonical intervention shapes. The diagnostic is the same across all of them: fit a step first, read the residuals, refit with the shape the residuals point to.
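As a rough illustration, the four shapes can be generated as regressor vectors and passed as `xreg` in place of the step dummy. The helper name `make_shape`, the linear form of the ramp, and the geometric form of the decay are my assumptions; the article does not fix a functional form for any of them.

```r
# Build an intervention regressor of length n for an event at index t0.
# ramp_len: periods the ramp needs to reach its full level (assumed linear).
# decay:    per-period retention of the shock (assumed geometric).
make_shape <- function(n, t0, shape = c("step", "pulse", "ramp", "decay"),
                       ramp_len = 6, decay = 0.7) {
  shape <- match.arg(shape)
  k <- seq_len(n) - t0                     # periods since the event (0 at the event)
  switch(shape,
    step  = as.numeric(k >= 0),            # shifts and stays shifted
    pulse = as.numeric(k == 0),            # one period, then back to baseline
    ramp  = pmin(pmax((k + 1) / ramp_len, 0), 1),  # gradual move to a new level
    decay = ifelse(k >= 0, decay^k, 0)     # shock that fades over time
  )
}

# One column per shape, for a 12-period series with the event at period 5.
shapes <- sapply(c("step", "pulse", "ramp", "decay"),
                 function(s) make_shape(12, t0 = 5, shape = s))
print(round(shapes, 3))
```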

Tier 1 fails in three diagnosable ways. The pre-event series is too short, under 24 months, and the ARIMA fit is unstable. Outliers contaminate the fit; you run tsoutliers detection and refit. The residual ACF stays significant after the dummy is added, which means the shape is wrong and you move to Tier 2.

When none of those fail, stop at Tier 1. The model has a long publication record, a clear coefficient interpretation, and a confidence interval that survives CFO challenge. Anything beyond Tier 1 is engineering overhead you should buy only when you must.
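The third failure mode, residual structure that survives the step dummy, can be checked along these lines. This is a sketch on simulated data where the true effect is a twelve-month ramp; the AR(1) order, the Ljung-Box lag of 12, and all simulation values are my assumptions.

```r
# Simulate a series whose true effect phases in over 12 months.
set.seed(7)
n  <- 72
t0 <- 49
k  <- seq_len(n) - t0                          # months since the event
ramp <- pmin(pmax((k + 1) / 12, 0), 1)
step <- as.numeric(k >= 0)
y <- 20 + 6 * ramp + as.numeric(arima.sim(list(ar = 0.3), n = n, sd = 0.4))

# Fit the simplest shape first, then the shape the residuals point to.
fit_step <- arima(y, order = c(1, 0, 0), xreg = cbind(step = step), method = "ML")
fit_ramp <- arima(y, order = c(1, 0, 0), xreg = cbind(ramp = ramp), method = "ML")

# Ljung-Box p-value on the residuals, plus residual variance and AIC.
diagnose <- function(fit) {
  r <- residuals(fit)
  c(ljung_box_p = Box.test(r, lag = 12, type = "Ljung-Box", fitdf = 1)$p.value,
    sigma2      = fit$sigma2,
    aic         = fit$aic)
}
print(round(rbind(step = diagnose(fit_step), ramp = diagnose(fit_ramp)), 3))

# Post-event residuals of the step fit: plot these to look for systematic drift.
post_resid <- as.numeric(residuals(fit_step))[k >= 0]
```

Whether Ljung-Box flags the step fit depends on how much of the misfit the AR term absorbs, which is why the post-event residuals are worth reading alongside the test rather than instead of it.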

Tier 2
When the effect ramps and fades, switch to piecewise-linear intervention

Some events do not produce a clean level shift. A consumer-goods price increase rolls through the supply chain over six to twelve weeks. A regulatory change triggers a behavioral spike followed by adjustment toward a new equilibrium. A crisis produces a sharp drop, partial recovery over six months, and a new lower level. A step dummy fits these series badly, and the residuals of the Tier 1 fit will show it.

The fix is a parametric intervention function I(t) with five parameters: event start t0, peak time t1, ramp rate a1, decay rate a2, total duration T. Before t0 the function is zero. Between t0 and t1 it ramps up at rate a1. After t1 it decays at rate a2 toward a new steady-state level over duration T. The fitted I(t) vector becomes your xreg in the ARIMA model.

Parameters are not free. You fit them by grid search, minimizing residual variance on the focal series or RMSE on an auxiliary control series that did not experience the event. The grid is small: t1 takes integer values within a plausible window, a1 and a2 take values on a coarse log scale, T takes integer values within the period you expect the effect to persist. The grid search runs in seconds for a monthly series.
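A minimal sketch of that grid search follows. The exact functional form of I(t) is my assumption (linear ramp to the peak, linear decay until t0 + T, flat afterwards), as are the grid values and the simulated series. One more simplification: because the `xreg` coefficient rescales I(t), only the ratio of a2 to a1 is identified once a coefficient is estimated, so this sketch fixes a1 at 1 and searches over the rest.

```r
# Hypothetical piecewise-linear intervention shape (an assumed form):
# zero before t0, linear ramp at rate a1 up to the peak at t1,
# linear decay at rate a2 until t0 + dur, then flat at the new level.
intervention <- function(n, t0, t1, a1, a2, dur) {
  tt   <- seq_len(n)
  peak <- a1 * (t1 - t0)
  end  <- t0 + dur
  ifelse(tt < t0, 0,
         ifelse(tt <= t1, a1 * (tt - t0),
                peak - a2 * (pmin(tt, end) - t1)))
}

# Simulated monthly series with a known ramp-and-fade effect (invented numbers).
set.seed(123)
n  <- 60
t0 <- 37
true_shape <- intervention(n, t0, t1 = 41, a1 = 1, a2 = 0.2, dur = 12)
y <- 50 + 3 * true_shape + as.numeric(arima.sim(list(ar = 0.4), n = n, sd = 0.5))

# Small grid: integer peak times, coarse log-scale decay rates, a few durations.
grid <- expand.grid(t1 = t0 + 2:6, a2 = c(0.05, 0.1, 0.2, 0.4), dur = c(8, 12, 16))

# Residual variance of an AR(1) model with I(t) as the external regressor.
fit_one <- function(t1, a2, dur) {
  x   <- intervention(n, t0, t1, a1 = 1, a2 = a2, dur = dur)
  fit <- tryCatch(arima(y, order = c(1, 0, 0), xreg = cbind(I_t = x), method = "ML"),
                  error = function(e) NULL)
  if (is.null(fit)) NA_real_ else fit$sigma2
}
grid$sigma2 <- mapply(fit_one, grid$t1, grid$a2, grid$dur)

best <- grid[which.min(grid$sigma2), ]
print(best)
```

Sorting `grid` by `sigma2` and looking at the top few rows, not only the winner, gives a first read on how sharply the data pin down the peak time and decay rate.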

A case that earned its keep. A national hypermarket chain had a store, call it unit A, hit by two concurrent events: a macro shock affecting the broader consumer environment, and a piece of local infrastructure disruption that knocked out one of the access roads for an extended period. Comparing pre-event to post-event sales gave a confidently wrong number, because the two effects pointed in the same direction and the team could not say which deserved how much credit. The decomposition worked as follows. A second store, unit B, in a comparable market, was hit by the macro shock but not the infrastructure event. The team fit a Tier 2 piecewise-linear intervention to unit B’s series, isolating the macro-shock effect with its decay shape. They then subtracted the fitted macro-effect series from unit A’s actual series, and fit a Tier 2 model again on the residual to isolate the infrastructure effect. The two effects came out cleanly separated, with sensitivity bands that survived the CFO’s pushback.

fig. 3 Decomposing two concurrent events with an auxiliary unit. The trick is finding a unit B that saw one of the events but not the other. The infrastructure effect becomes visible only after the macro fit from unit B is subtracted from unit A.
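The subtract-and-refit step can be sketched on simulated data. Two strong assumptions are baked in and labeled in the code: the macro shock has the same absolute size in both units, and the shape of each effect is already known (in practice the Tier 2 grid search would supply it). Unit names, dates, and magnitudes are invented.

```r
set.seed(11)
n       <- 60
month   <- seq_len(n)
t_macro <- 37                                   # macro shock hits both units
t_infra <- 40                                   # infrastructure event hits unit A only

# Assumed shapes: macro shock decays toward half its initial size; infra is a step.
macro_shape <- ifelse(month >= t_macro, 0.5 + 0.5 * 0.8^(month - t_macro), 0)
infra_step  <- as.numeric(month >= t_infra)

ar_noise <- function() as.numeric(arima.sim(list(ar = 0.3), n = n, sd = 0.5))
y_b <- 80  - 10 * macro_shape + ar_noise()                     # auxiliary unit B
y_a <- 100 - 10 * macro_shape - 6 * infra_step + ar_noise()    # focal unit A

# Step 1: isolate the macro effect on unit B.
fit_b     <- arima(y_b, order = c(1, 0, 0), xreg = cbind(macro = macro_shape),
                   method = "ML")
macro_hat <- unname(coef(fit_b)["macro"]) * macro_shape

# Step 2: remove the fitted macro effect from unit A (assumes equal absolute impact).
y_a_adj <- y_a - macro_hat

# Step 3: fit the infrastructure effect on what remains.
fit_a     <- arima(y_a_adj, order = c(1, 0, 0), xreg = cbind(infra = infra_step),
                   method = "ML")
infra_hat <- unname(coef(fit_a)["infra"])

# For contrast: the naive pre/post difference on unit A mixes both effects.
naive <- mean(y_a[month >= t_infra]) - mean(y_a[month < t_macro])
print(round(c(true_infra = -6, decomposed = infra_hat, naive_pre_post = naive), 2))
```

If the two units differ in scale, the subtraction in step 2 would need a rescaling (for example, working in percent of baseline), which is a judgment call this sketch leaves out.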

Tier 2 is also where most teams should stop. The vocabulary is reachable for analysts comfortable with ARIMA. The implementation is a grid search you can read in twenty lines of R. The output is a coefficient and a sensitivity table. Tier 3 exists because some situations break Tier 2’s grid-search assumption, but those situations are rarer than the methods literature suggests.

Tier 3
For overlapping shocks, isolate intervention neurons inside the network

A small minority of cases will resist Tier 2. The residuals after grid-search piecewise-linear fitting still show structure. The series is non-stationary in a way ARIMA cannot fully absorb. Or the event interacts with seasonal patterns in a way that breaks the linearity assumption underneath piecewise-linear I(t).

The methodology that earned its place in this playbook is a special-architecture neural network for these cases. The architecture works like this. Inputs are lagged values of the series, exogenous calendar features, and a parameterized intervention vector I(t). The hidden layer is split into two regions. The first region is a normal feedforward layer that learns trend and seasonal structure. The second region contains a small number of intervention neurons with activation functions matching the parametric intervention shapes. The intervention neurons receive only the I(t) input. The two regions feed into a linear output layer. Parameters of I(t) and weights of both regions are fit jointly by backpropagation.

The key trick is that the intervention neurons are isolated from the trend-and-seasonal region in connectivity. Without that isolation, the network learns to absorb the intervention signal into the general hidden layer, the intervention coefficient becomes meaningless, and you lose the interpretability that justified moving past ARIMA in the first place. Isolation preserves a coefficient you can quote.
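The connectivity constraint is easier to see in code than in prose. Below is a forward pass only, in base R, with no training loop; the layer sizes, the tanh and ReLU activations, and the absence of a bias on the intervention neurons are all my assumptions for illustration, not a description of the architecture used in the case that follows.

```r
# Forward pass of a two-region hidden layer.
# x_general: n x p matrix of lagged values and calendar features.
# i_t:       length-n intervention vector I(t).
forward <- function(x_general, i_t, p) {
  # General region: tanh units that see lags and calendar features, never I(t).
  h_gen <- tanh(sweep(x_general %*% p$W_gen, 2, p$b_gen, "+"))
  # Intervention region: piecewise-linear (ReLU) units that see I(t) only.
  # No bias term, so the region contributes exactly zero whenever I(t) = 0.
  h_int <- outer(i_t, p$w_int)
  h_int[h_int < 0] <- 0
  general      <- as.numeric(h_gen %*% p$v_gen)
  intervention <- as.numeric(h_int %*% p$v_int)
  list(y_hat = p$b_out + general + intervention,
       general = general, intervention = intervention)
}

set.seed(9)
n <- 8; n_inputs <- 4; n_gen <- 5
p <- list(
  W_gen = matrix(rnorm(n_inputs * n_gen), n_inputs, n_gen),
  b_gen = rnorm(n_gen),
  v_gen = rnorm(n_gen),
  w_int = c(1, 0.5),     # input weights of two intervention neurons
  v_int = c(-2, -1),     # their output weights
  b_out = 10
)
i_t <- c(0, 0, 0, 1, 2, 1.5, 1, 1)               # example I(t) path
x1  <- matrix(rnorm(n * n_inputs), n, n_inputs)  # one history of lags and features
x2  <- matrix(rnorm(n * n_inputs), n, n_inputs)  # a different history

out1 <- forward(x1, i_t, p)
out2 <- forward(x2, i_t, p)
print(cbind(I_t = i_t, effect_hist1 = out1$intervention, effect_hist2 = out2$intervention))
```

Because the intervention neurons never see the general inputs, the quoted intervention effect is identical under both histories. In a fully connected layer it would vary with the lags and calendar features.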

An anonymized used-vehicle case justified the method. A national used-vehicle market faced two macro events in the late 2000s. One was a broad demand contraction tied to a global financial event. The other was a regulatory tariff that lifted the cost of imported used vehicles by a significant margin. Twelve ARIMA variants were fit, including auto.arima and several handcrafted specifications, each with four intervention-function shapes. The best ARIMA specifications by MSE produced sign-inconsistent intervention coefficients: the tariff coefficient came out positive in models that should have shown a negative effect, and vice versa. The piecewise-linear neural network with isolated intervention neurons produced economically sensible signs on both interventions, with magnitudes that matched the qualitative judgement of industry analysts who were not shown the model output.

That case is also a warning. The cost of Tier 3 is high. You need an environment that supports neural-network training and persistence of fitted weights. You need a colleague who can debug a non-converging fit. You need monitoring on the intervention coefficients to catch the case where the network slowly absorbs the intervention signal into the general layer. None of those costs are worth paying if Tier 2 was working. The fact that Tier 3 exists in the literature is not a reason to use it.

Tier 4
When time-series alone won’t defend the number, reach for a quasi-experiment atlas

Some events are too entangled with concurrent shocks for any time-series method to give a defensible answer. The CFO’s confounder challenge has merit, residuals will not stabilize regardless of intervention shape, and the auxiliary-unit decomposition cannot find a clean comparison series. At that point the right move is to step back from forecasting entirely and pick an explicit experimental design.

The Campbell-Stanley framework catalogues 16 designs across three categories: pre-experimental (weakest), true experimental (RCT family, usually not feasible for the kinds of events we are discussing), and quasi-experimental. Four of these designs cover roughly 90 percent of the practical cases in business analytics.

A non-equivalent control group design uses a similar untreated unit (different region, different store cluster, different product line) as a counterfactual. Pre/post differences are compared between treated and control. Validity hinges on the assumption that the two units would have moved in parallel absent the treatment, which you test by checking pre-period parallel trends.
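A sketch of this design on simulated data, with the pre-period parallel-trends check included. The unit labels, the shared trend, and the effect size of 3 are invented for illustration.

```r
set.seed(3)
d <- expand.grid(month = 1:24, unit = c("control", "treated"),
                 stringsAsFactors = FALSE)
d$treated <- as.numeric(d$unit == "treated")
d$post    <- as.numeric(d$month > 18)            # treatment starts in month 19
d$y <- 50 + 5 * d$treated + 0.4 * d$month +      # level gap and shared trend
       3 * d$treated * d$post +                  # true treatment effect
       rnorm(nrow(d), sd = 0.3)

# Estimate from the four cell means: (treated post - pre) - (control post - pre).
m <- tapply(d$y, list(d$treated, d$post), mean)
did_means <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

# The same estimate as the interaction term of a regression.
fit    <- lm(y ~ treated * post, data = d)
did_lm <- unname(coef(fit)["treated:post"])

# Parallel-trends check: pre-period slopes should not differ between the units.
pre_fit      <- lm(y ~ month * treated, data = d[d$post == 0, ])
pretrend_gap <- unname(coef(pre_fit)["month:treated"])

print(round(c(did_means = did_means, did_lm = did_lm, pretrend_gap = pretrend_gap), 3))
```

A `pretrend_gap` clearly different from zero is the signal that the control unit is not a credible counterfactual and a different comparison unit is needed.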

A regression-discontinuity design exploits a sharp cutoff rule for treatment assignment. Customers above a threshold get the new policy, customers below do not. Near the cutoff, the two groups are nearly identical except for treatment status, which gives a local causal estimate. The design produces remarkably clean inference when a real cutoff exists.
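A minimal local-linear version of this design, on simulated data. The running variable `score`, the cutoff of 60, the bandwidth of 10, and the true jump of 2 are all hypothetical; bandwidth choice in particular is a judgment call that this sketch does not address.

```r
set.seed(5)
n      <- 2000
cutoff <- 60
d <- data.frame(score = runif(n, 0, 100))        # hypothetical running variable
d$treated  <- as.numeric(d$score >= cutoff)      # sharp assignment rule
d$centered <- d$score - cutoff
d$y <- 10 + 0.05 * d$score + 2 * d$treated + rnorm(n, sd = 0.5)

# Keep only observations near the cutoff and fit a separate line on each side.
bandwidth   <- 10
near_cutoff <- d[abs(d$centered) <= bandwidth, ]
fit <- lm(y ~ treated * centered, data = near_cutoff)

# The coefficient on `treated` is the estimated jump at the cutoff.
rd_estimate <- unname(coef(fit)["treated"])
rd_se       <- unname(sqrt(diag(vcov(fit)))["treated"])
print(round(c(estimate = rd_estimate, se = rd_se), 3))
```

Rerunning with a narrower and a wider bandwidth is a cheap way to show the estimate is not an artifact of one window.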

A multiple time-series design adds a control series to the basic interrupted time series. You fit Tier 1 or Tier 2 to both the treated and the control series, and the causal estimate is the difference-in-differences: the fitted effect on the treated series minus the fitted effect on the control series. This is the design most often suitable when one Tier 1 model alone would be insufficient.

An interrupted time series with non-equivalent control combines all of the above and is the design I most often deploy when stakes are high enough to justify the analytical investment.

The atlas is not a substitute for time-series methods. It is a complement. You reach for Tier 4 when leadership requires explicit defense against confounders that Tiers 1 to 3 cannot provide. The escalation criterion is institutional, not technical: if the audience of the analysis includes finance, regulators, or external auditors, the cost of a confounder challenge surviving in public is high enough to justify the extra design work upfront.

The decomposition discipline is what separates the playbook from a textbook

The four tiers are the skeleton. The discipline that makes them work in production is two practices that 80 percent of analytics teams skip.

The first is auxiliary-unit decomposition for overlapping events. Almost every real business event is overlapping. A marketing platform shift coincides with a category seasonality break. A pricing change happens in the same quarter as a supplier outage. A regulatory event arrives in the middle of an internal reorganization. Time-series methods fit to one focal series cannot distinguish between effects of concurrent events. The only path through is to find an auxiliary unit that experienced one event but not the others, fit the model to that auxiliary, subtract the effect, and refit. The discipline of doing this carefully, and of running the decomposition with multiple auxiliary candidates to bound the sensitivity, is what separates a number the CFO accepts from a number the CFO retires into a footnote.

The second is sensitivity reporting on every quoted magnitude. The default deliverable for a Tier 1 to 3 result is not a point estimate. It is a point estimate plus a sensitivity table covering four checks: alternative ARIMA specifications, the event date shifted by plus or minus one month, plus or minus ten percent perturbations of key intervention parameters, and a refit on a held-out segment of the pre-event series. A point estimate without that table will be challenged in the second meeting. With the table, the number survives the second meeting and the third.
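A sketch of how such a table can be assembled for a Tier 1 step estimate. It covers only the first two checks (alternative specifications and shifted event dates); the parameter perturbations and the held-out refit would add rows in the same way. The helper `estimate_step`, the three candidate orders, and the simulated series are assumptions for illustration.

```r
# Simulated series with a known level shift of 3 at month 49 (invented numbers).
set.seed(21)
n     <- 72
t0    <- 49
month <- seq_len(n)
y <- 30 + 0.1 * month + 3 * (month >= t0) +
     as.numeric(arima.sim(list(ar = 0.4), n = n, sd = 0.3))

# Fit a step intervention with a linear trend; return estimate and standard error.
estimate_step <- function(y, event, order) {
  X   <- cbind(trend = seq_along(y), step = as.numeric(seq_along(y) >= event))
  fit <- tryCatch(arima(y, order = order, xreg = X, method = "ML"),
                  error = function(e) NULL)
  if (is.null(fit)) return(c(estimate = NA_real_, se = NA_real_))
  c(estimate = unname(coef(fit)["step"]),
    se       = unname(sqrt(diag(vcov(fit)))["step"]))
}

specs  <- list("ARIMA(1,0,0)" = c(1, 0, 0),
               "ARIMA(0,0,1)" = c(0, 0, 1),
               "ARIMA(1,0,1)" = c(1, 0, 1))
shifts <- c(-1, 0, 1)                      # event date moved by one month either way

rows <- list()
for (s in names(specs)) for (k in shifts) {
  est <- estimate_step(y, t0 + k, specs[[s]])
  rows[[length(rows) + 1]] <- data.frame(spec = s, event_shift = k,
                                         estimate = est[["estimate"]],
                                         se = est[["se"]])
}
sens <- do.call(rbind, rows)
print(sens, digits = 3)
```

The range of `estimate` across rows, reported next to the headline number, is the part of the deliverable that answers the challenge before it is raised.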

These two practices look procedural. They are the practical content of the work. Teams that adopt them stop losing arguments. Teams that skip them keep losing.

Sequencing the four-tier ladder inside your business

A working sequencing plan for an analytics function that wants to install this capability:

First, catalog the events your business has been asked to defend numbers on in the last 18 months. Most leadership teams find five to twelve such events when they actually count. Rank them by stakes: a finance audience or an external audience pushes the event up the list.

Second, for each top-ranked event, verify the four preconditions of intervention analysis: a pre-event series of at least 24 months exists, the event has a clean date, the event applied to all units in the focal population, and leadership needs a defensible magnitude estimate, not a directional read.

Third, start at Tier 1. Fit the Box-Tiao model. Run residual diagnostics. If diagnostics pass, write up the result with a sensitivity table and stop. The temptation to climb the tier ladder because Tier 3 sounds more sophisticated is the most reliable way to waste analytical investment.

Fourth, when Tier 1 fails diagnostics, escalate to Tier 2 with piecewise-linear intervention and grid search. When Tier 2 fails, audit the auxiliary-unit decomposition before moving to Tier 3. Most apparent Tier 3 cases are actually decomposition failures that Tier 2 with the right auxiliary will solve.

Fifth, reserve Tier 4 for events with finance or regulatory audiences where the cost of a confounder challenge is asymmetric.

Sixth, every quoted magnitude pairs with a sensitivity table. No exceptions.

This is the workflow I install when an engagement involves rebuilding the analytics function’s credibility with finance. The work is not glamorous. It is two weeks of methodology audit, two weeks of refitting historical estimates, and a permanent change in how the team delivers numbers to leadership. The payoff is that the next high-stakes question gets a defensible answer, the analytics group keeps its seat at the table, and the cycle of footnoted analyses stops.

Pillar A of the causal inference series. Companion pieces in this pillar: “Why your A/B test won’t measure that policy change”; “Box-Tiao for product managers”; “Confounders, SUTVA, and four ways quasi-experiments fail”. References and code for the R snippet are in the appendix of the methodology brief, available on request.
