Counterfactual Policy Evaluation¶

What do you do when the new policy looks promising but you cannot risk running it against real patients, customers, or high-stakes infrastructure? The hospital, the recommender system, and the e-commerce platform all share the same dilemma: the candidate policy has not acted yet, but the training data were collected under an older logging policy with different incentives, different instrumentation, and unobserved confounders. Counterfactual policy evaluation is the science of reconstructing the unobserved worlds that might have unfolded under the new policy, quantifying how biased the old logs are, and returning a confidence interval tight enough for regulators to say “go” or “stop” without ever deploying the candidate.

By the end of this page you will understand precisely how to cast logged trajectories as a structural causal model, how doubly robust estimators reuse different pieces of that model, how research tightens the remaining ambiguity with representation learning and parametric uncertainty sets, and finally how you can build a concrete offline evaluation pipeline that produces the same kind of worst-case bound that production teams use today.

The territory¶

Counterfactual policy evaluation, studied in the RL literature as off-policy evaluation (OPE), sits where offline reinforcement learning meets causal inference. In online RL an agent interacts with the environment, lets the candidate policy play out, and averages the returns. Here the environment is off-limits: the data consist of trajectories collected under a logging policy \(\pi_b\), and every decision \(a\) taken at state \(s\) was sampled from \(\pi_b(a \mid s)\) alongside latent noise. Each logged tuple \((s, a, r, s')\) is a realization of a structural causal model (SCM) that maps latent variables \(u\) (capturing confounders, instrument drift, and patient physiology) and the history to the next reward and state. The policy evaluation problem becomes “over all SCMs that could have generated the logs, what are the extreme values of the expected return of the policy \(\pi_e\) we care about?” So we are not synthesizing a simulator from scratch; we are identifying the subset of plausible causal worlds consistent with the data and scoring our target policy in each of them.

Key primitives anchor this reading. The logging policy is the policy that generated the dataset; behavior policy is a synonym, used when the modeler needs to distinguish between different logged conditions. The propensity score is the conditional probability \(\pi_b(a \mid s)\) and appears inside every importance-sampling ratio. Overlap is the assumption that \(\pi_b(a \mid s)>0\) whenever \(\pi_e(a \mid s)>0\); without it, the value of \(\pi_e\) cannot be identified from the logs, because some actions it would take were never observed. Identification refers to whether the expected reward of \(\pi_e\) can be expressed purely in terms of observed quantities. These concepts appear in hospital triage audits, recommendation-system gradebooks, and any offline RL safety pipeline where new decision rules are vetted before deployment.
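The overlap condition can be checked mechanically when both policies are available as probability tables. The sketch below assumes a small tabular setting; the function name `overlap_violations` and the example tables are hypothetical and only illustrate the definition above.

```python
import numpy as np

def overlap_violations(pi_b, pi_e, eps=0.0):
    """List (state, action) pairs where pi_e puts mass but pi_b has (almost) none."""
    pi_b = np.asarray(pi_b, dtype=float)
    pi_e = np.asarray(pi_e, dtype=float)
    mask = (pi_e > 0) & (pi_b <= eps)
    return [tuple(int(i) for i in idx) for idx in np.argwhere(mask)]

# Hypothetical tables: rows are states, columns are actions.
pi_b = np.array([[0.5, 0.5, 0.0],
                 [0.9, 0.1, 0.0]])
pi_e = np.array([[0.2, 0.8, 0.0],
                 [0.0, 0.5, 0.5]])

violations = overlap_violations(pi_b, pi_e)

# Largest importance ratio over the pairs where the ratio is defined.
safe_pi_b = np.where(pi_b > 0, pi_b, 1.0)
max_weight = float(np.max(np.where(pi_b > 0, pi_e / safe_pi_b, 0.0)))
```

Here the only violation is state 1, action 2: the candidate policy takes that action half the time, but the logs never contain it. The largest defined ratio is a useful early warning for the variance problems discussed later.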

This territory borrows estimators from RL—importance sampling, direct models, doubly robust combinations—and vocabulary from causal inference—potential outcomes, SCMs, and integral probability metrics (IPMs). Representation learning, score-based disentanglement, and Gumbel-Max counterfactual techniques are the levers that shrink the set of compatible causal worlds to something the logged data can actually speak to. The consequence is that policy evaluation becomes the disciplined project of rebuilding counterfactual worlds from biased historical data; understanding that mechanism starts from the SCM structure and the estimators it gives rise to.

How it works¶

We start with the generative SCM of the logged trajectories. Each next state and reward is a deterministic function of the current state, the chosen action, and latent noise,

\[ s_{t+1} = f_s(s_t, a_t, u_t), \quad r_t = f_r(s_t, a_t, u_t), \]

where \(s_t\) is the state at time \(t\), \(a_t\) is the action taken, \(r_t\) is the resulting reward, and \(u_t\) is latent noise sampled from an unknown density \(p(u_t)\) that encodes confounders, instrumentation shifts, and any unobserved physiology.

Every logged tuple \((s_i, a_i, r_i, s'_i)\) is therefore generated by first sampling the noise \(u_t\), then selecting \(a_t \sim \pi_b(\cdot \mid s_t)\) under the SCM, and finally producing \(r_t\) and \(s_{t+1}\) through \(f_r\) and \(f_s\). The unknown triple \((f_s, f_r, p(u_t))\) defines a family of SCMs compatible with the logs. Counterfactual evaluation means scoring \(\pi_e(a \mid s)\) across this family without ever drawing fresh noise from the real environment. The estimators we build reuse the same \(\{u_t\}\) by reweighting or modeling their consequences.
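To make the generative story concrete, here is a minimal sketch of a tabular SCM that logs one trajectory. It assumes uniform latent noise, an inverse-CDF transition mechanism, and small hand-written tables; all names (`f_s`, `f_r`, `log_trajectory`, `P`, `R`) are illustrative choices, not part of any referenced method.

```python
import numpy as np

def f_s(s, a, u, P):
    """Next state as a deterministic function of (s, a, u) via inverse-CDF lookup."""
    cdf = np.cumsum(P[s, a])
    return min(int(np.searchsorted(cdf, u, side="right")), P.shape[2] - 1)

def f_r(s, a, u, R, noise_scale=0.1):
    """Reward as a deterministic function of (s, a, u)."""
    return float(R[s, a] + noise_scale * (u - 0.5))

def log_trajectory(pi_b, P, R, horizon, seed, s0=0):
    """Roll out the logging policy and return the logged tuples plus the hidden noise."""
    rng = np.random.default_rng(seed)
    s, tuples, noise = s0, [], []
    for _ in range(horizon):
        u = rng.random()                               # latent u_t, assumed Uniform(0, 1)
        a = int(rng.choice(len(pi_b[s]), p=pi_b[s]))   # a_t ~ pi_b(. | s_t)
        r, s_next = f_r(s, a, u, R), f_s(s, a, u, P)
        tuples.append((s, a, r, s_next))
        noise.append(u)
        s = s_next
    return tuples, noise

# Hypothetical 2-state, 2-action world.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s']
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # mean reward R[s, a]
pi_b = np.array([[0.7, 0.3],
                 [0.4, 0.6]])

logged, noise = log_trajectory(pi_b, P, R, horizon=10, seed=0)
```

In real logs only `logged` is available; `noise` is exactly the part the estimators below have to work around.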

Importance sampling, direct models, and the doubly robust blend¶

Let \(D = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^N\) be the logged data. A simple importance-sampling (IS) estimator takes each reward and scales it by how much more (or less) probable the logged action would be under \(\pi_e\) versus \(\pi_b\):

\[ \hat{V}_{\text{IS}}(\pi_e) = \frac{1}{N} \sum_{i=1}^N \frac{\pi_e(a_i \mid s_i)}{\pi_b(a_i \mid s_i)} r_i, \]

where \(\pi_b\) is estimated from the logs (or read directly from logged propensities when they were recorded) and \(\pi_e\) is the policy under evaluation. The ratio acts like a reweighting of the latent noise \(u_i\): we keep the noise that produced \(a_i\) but assign the action its probability under \(\pi_e\). However, if \(\pi_b\) rarely took an action that \(\pi_e\) strongly prefers, then \(\pi_e/\pi_b\) explodes, a few samples dominate the average, and the estimator’s variance inflates.
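A minimal sketch of the IS estimator follows, assuming the propensities of the logged actions under both policies are already available as arrays. The effective sample size is a common diagnostic for weight blow-up; it is added here for illustration and is not part of the estimator itself.

```python
import numpy as np

def is_estimate(pi_e_probs, pi_b_probs, rewards):
    """Importance-sampling value estimate and the per-sample weights."""
    weights = np.asarray(pi_e_probs, float) / np.asarray(pi_b_probs, float)
    return float(np.mean(weights * np.asarray(rewards, float))), weights

def effective_sample_size(weights):
    """Kish effective sample size: equals N when all weights are equal."""
    return float(weights.sum() ** 2 / np.sum(weights ** 2))

# Hypothetical log of four decisions.
pi_e_probs = np.array([0.8, 0.2, 0.8, 0.2])   # pi_e(a_i | s_i)
pi_b_probs = np.array([0.5, 0.5, 0.5, 0.5])   # pi_b(a_i | s_i)
rewards = np.array([1.0, 0.0, 1.0, 0.0])

v_is, weights = is_estimate(pi_e_probs, pi_b_probs, rewards)
ess = effective_sample_size(weights)
```

With these numbers the weights are 1.6 and 0.4, the estimate is 0.8, and the effective sample size drops below the nominal four samples because the weights are unequal.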

A complementary idea is to train a direct method—a reward model \(\widehat{Q}(s,a)\) that approximates the conditional expectation \(\mathbb{E}[r \mid s,a]\). The direct estimator is

\[ \hat{V}_{\text{DM}}(\pi_e) = \frac{1}{N} \sum_{i=1}^N \sum_{a} \pi_e(a \mid s_i)\, \widehat{Q}(s_i, a), \]

effectively using the model to “simulate” what would happen if the action at each logged state were drawn from \(\pi_e(\cdot \mid s_i)\); for a deterministic policy the inner sum collapses to \(\widehat{Q}(s_i, \pi_e(s_i))\). This removes the variance coming from large importance weights but introduces bias whenever \(\widehat{Q}\) misestimates regions that \(\pi_e\) visits but \(\pi_b\) did not.
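The direct method can be sketched in a tabular setting, where the reward model is just the per-cell average of logged rewards and a stochastic \(\pi_e\) is handled by averaging \(\widehat{Q}\) over its action probabilities. The helper names and the default of zero for unobserved cells are assumptions chosen to make the bias visible.

```python
import numpy as np

def fit_q_table(states, actions, rewards, n_states, n_actions):
    """Tabular reward model: mean logged reward per (state, action); 0 where unobserved."""
    sums = np.zeros((n_states, n_actions))
    counts = np.zeros((n_states, n_actions))
    np.add.at(sums, (states, actions), rewards)
    np.add.at(counts, (states, actions), 1.0)
    q_hat = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return q_hat, counts

def dm_estimate(q_hat, pi_e, states):
    """Average over logged states of the model's expected reward under pi_e."""
    return float(np.mean(np.sum(pi_e[states] * q_hat[states], axis=1)))

states = np.array([0, 0, 1, 1])
actions = np.array([0, 1, 0, 0])
rewards = np.array([1.0, 0.0, 0.5, 1.5])
pi_e = np.array([[0.5, 0.5],
                 [0.0, 1.0]])   # pi_e always picks action 1 in state 1

q_hat, counts = fit_q_table(states, actions, rewards, n_states=2, n_actions=2)
v_dm = dm_estimate(q_hat, pi_e, states)
```

Action 1 in state 1 was never logged, so the model's value there is whatever default it falls back on. The estimate of 0.25 inherits that guess, which is the extrapolation bias described above.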

Doubly robust (DR) estimators combine the best of both: use the direct model where it is strong and correct it with importance ratios where bias arises. One widely used form is

\[ \hat{V}_{\text{DR}}(\pi_e) = \frac{1}{N} \sum_{i=1}^N \left[ \sum_{a} \pi_e(a \mid s_i)\, \widehat{Q}(s_i, a) + \frac{\pi_e(a_i \mid s_i)}{\pi_b(a_i \mid s_i)} \big( r_i - \widehat{Q}(s_i, a_i) \big) \right], \]

where the first term is the direct-method prediction under \(\pi_e\) and the importance-weighted residual corrects it using the action that was actually logged. Kallus & Uehara (2021) arxiv:2102.11107 showed that cross-fitting (training the reward model \(\widehat{Q}\) and the logged-policy estimator \(\pi_b\) on folds that exclude the current sample \(i\)) removes the bias that arises when a sample is used both to fit the nuisance models and to evaluate them, and that the resulting estimator is asymptotically efficient when both models converge fast enough. The squared bias is then bounded by the product of the mean squared errors of the two components, so if either the reward model or the importance ratios are consistent, the entire estimator remains consistent. For the theory student, this gives a precise error decomposition:

\[ \mathrm{Bias}^2 \leq \mathbb{E}\left[ \left(\widehat{Q}(s,a) - Q^\star(s,a)\right)^2 \right] \cdot \mathbb{E}\left[ \left(\frac{\pi_e(a \mid s)}{\pi_b(a \mid s)} - \frac{\pi_e(a \mid s)}{\pi_b^\star(a \mid s)}\right)^2 \right], \]

where \(Q^\star\) and \(\pi_b^\star\) are the true reward function and logging policy, respectively. The direct modeling bias multiplies with importance weight error, capturing the trade-off formally.
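The cross-fitted doubly robust estimator can be sketched on synthetic tabular bandit data. The version below uses the standard form in which the baseline is the reward model averaged over \(\pi_e\), and it fits both nuisance models (a count-based reward table and a Laplace-smoothed propensity table) on the folds that exclude the samples being scored. The data-generating numbers, the two-fold split, and the smoothing are illustrative assumptions.

```python
import numpy as np

def dr_crossfit(states, actions, rewards, pi_e, n_states, n_actions, n_folds=2, seed=0):
    """Cross-fitted doubly robust value estimate for tabular bandit logs."""
    rng = np.random.default_rng(seed)
    n = len(rewards)
    folds = rng.permutation(n) % n_folds
    terms = np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        # Nuisance models fitted only on samples outside fold k.
        counts = np.zeros((n_states, n_actions))
        sums = np.zeros((n_states, n_actions))
        np.add.at(counts, (states[train], actions[train]), 1.0)
        np.add.at(sums, (states[train], actions[train]), rewards[train])
        q_hat = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
        pi_b_hat = (counts + 1.0) / (counts + 1.0).sum(axis=1, keepdims=True)
        # Score the held-out fold: model baseline plus weighted residual.
        s, a, r = states[test], actions[test], rewards[test]
        baseline = np.sum(pi_e[s] * q_hat[s], axis=1)
        weight = pi_e[s, a] / pi_b_hat[s, a]
        terms[test] = baseline + weight * (r - q_hat[s, a])
    return float(terms.mean())

# Synthetic logs with a known answer (all numbers are made up for the demo).
rng = np.random.default_rng(1)
n = 20000
R = np.array([[1.0, 0.0], [0.5, 2.0]])
pi_b = np.array([[0.7, 0.3], [0.4, 0.6]])
pi_e = np.array([[0.2, 0.8], [0.9, 0.1]])
states = rng.integers(0, 2, size=n)
actions = (rng.random(n) < pi_b[states, 1]).astype(int)
rewards = R[states, actions] + 0.1 * rng.standard_normal(n)

v_dr = dr_crossfit(states, actions, rewards, pi_e, n_states=2, n_actions=2)
v_true = float(np.mean(np.sum(pi_e * R, axis=1)))   # states are uniform here
```

On this toy problem both nuisance models are well specified, so the estimate lands close to the true value of 0.425. The more interesting experiment is to break one of them deliberately (for example, replace `q_hat` with zeros) and observe that the estimate stays close as long as the other remains accurate.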

Representations, score-based disentanglement, and IPMs¶

Shalit, Johansson, and Sontag (2016) arxiv:1605.03661 framed counterfactual inference as representation learning. They introduce \(\Phi: \mathcal{X} \to \mathbb{R}^d\) to map raw states into a shared space where the treated and control distributions overlap, optimizing

\[ \min_{\Phi, h} \frac{1}{n} \sum_{i=1}^n \mathcal{L}\big(h(\Phi(s_i), a_i), r_i\big) + \lambda \cdot \mathrm{IPM}(\Phi(\mathcal{D}_0), \Phi(\mathcal{D}_1)), \]

where \(\mathcal{L}\) is the predictive loss, \(h\) is the outcome model in representation space, and \(\mathrm{IPM}\) measures the mismatch between the two groups' representation distributions (e.g., Wasserstein distance or maximum mean discrepancy). By balancing \(\Phi(\mathcal{D}_0)\) and \(\Phi(\mathcal{D}_1)\), the model makes the two groups look alike in representation space, so \(h\) has to extrapolate less when it predicts the outcome of the action that was not taken; in the terms used on this page, this shrinks the set of SCMs consistent with the logs. Varıcı et al. (2023) arxiv:2301.08230 extend this idea by learning score-based representations under interventions, showing identifiability even when the latent factors change across environments. For practitioners this means the reward model and ratio estimator operate on a representation where the overlap assumption is more plausible, directly improving the doubly robust term above.
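One concrete IPM is the maximum mean discrepancy with an RBF kernel. The sketch below computes the biased (V-statistic) estimate of squared MMD between two batches of representations; the bandwidth and the synthetic batches are assumptions, and in a training loop the same quantity would be computed on \(\Phi(\mathcal{D}_0)\) and \(\Phi(\mathcal{D}_1)\) and multiplied by \(\lambda\).

```python
import numpy as np

def rbf_kernel(x, y, bandwidth):
    """Gaussian kernel matrix between rows of x and rows of y."""
    sq_dist = (np.sum(x ** 2, axis=1)[:, None]
               + np.sum(y ** 2, axis=1)[None, :]
               - 2.0 * x @ y.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))

def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)

# Synthetic "representations": two batches from the same law, one shifted batch.
rng = np.random.default_rng(0)
phi_control = rng.standard_normal((200, 2))
phi_treated_balanced = rng.standard_normal((200, 2))
phi_treated_shifted = rng.standard_normal((200, 2)) + 2.0

penalty_balanced = mmd2(phi_control, phi_treated_balanced)
penalty_shifted = mmd2(phi_control, phi_treated_shifted)
```

The shifted batch produces a much larger penalty than the balanced one, which is the signal the regularizer uses to pull the two groups together.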

Sequential counterfactual resampling¶

Offline RL’s sequential nature adds another challenge: replaying a trajectory under \(\pi_e\) means reusing the latent noise \(u_t\) that actually occurred while changing the actions at each timestep. Oberst & Sontag (2019) arxiv:1905.05824 leverage the Gumbel-Max trick for discrete variables. If the logging policy selects action \(a\) by drawing a Gumbel vector \(g_t\) and computing \(a = \arg\max_{a'} \theta(s, a') + g_{t,a'}\), then the counterfactual choice under \(\pi_e\) is obtained by holding \(g_t\) fixed and recomputing the argmax with \(\pi_e\)'s logits. Because \(g_t\) is never logged, it has to be sampled from its posterior given the observed choice. The resulting counterfactual trajectory \((s_t', r_t')\) shares the same latent noise but may follow different actions. This construction pinpoints what is learnable from the logs: the latent randomness is fixed, and we either reuse it via ratios or reconstruct it via models.
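The Gumbel-Max counterfactual for a single categorical choice can be sketched as follows. The posterior sampler uses the standard top-down construction: draw the maximum first, then draw the remaining perturbed logits as Gumbels truncated below that maximum. The function names and example logits are hypothetical, and this handles one decision only, not a full trajectory.

```python
import numpy as np

def posterior_gumbels(logits, observed, rng):
    """Sample noise g with argmax(logits + g) == observed, from the Gumbel posterior."""
    logits = np.asarray(logits, dtype=float)
    # The maximum perturbed logit is Gumbel with location logsumexp(logits).
    top = rng.gumbel(loc=np.logaddexp.reduce(logits))
    # Remaining perturbed logits: Gumbel(logit_j) truncated to lie below `top`.
    fresh = rng.gumbel(loc=logits)
    perturbed = -np.logaddexp(-top, -fresh)
    perturbed[observed] = top
    return perturbed - logits

def counterfactual_choice(new_logits, noise):
    """Recompute the argmax under new logits while holding the noise fixed."""
    return int(np.argmax(np.asarray(new_logits, dtype=float) + noise))

rng = np.random.default_rng(0)
logits_b = np.log([0.7, 0.2, 0.1])   # logging distribution
observed = 0                         # what the logs recorded

noise_samples = [posterior_gumbels(logits_b, observed, rng) for _ in range(200)]

# Sanity check: under the logging logits, the noise reproduces the logged choice.
replayed = [counterfactual_choice(logits_b, g) for g in noise_samples]

# Counterfactual: a target distribution that overwhelmingly prefers option 1.
logits_e = np.array([-50.0, 50.0, -50.0])
switched = [counterfactual_choice(logits_e, g) for g in noise_samples]
```

When the new logits equal the old ones the counterfactual equals the factual choice by construction, which is the consistency property that makes these counterfactuals meaningful.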

Bounding counterfactual returns with parametric uncertainty¶

Even after doubly robust estimation and representation balancing, the true SCM remains ambiguous. Bura et al. (2022) arxiv:2207.05259 close the gap by computing tight parametric bounds on transition probabilities. They describe each transition probability vector \(P(s' \mid s,a)\) as belonging to an uncertainty set \(\mathcal{P}_{s,a}\) parameterized by smooth features (e.g., a softmax over linear weights with bounded norm), leading to a worst-case value function

\[ V_{\text{worst}}(s) = \sum_{a} \pi_e(a \mid s) \min_{P \in \mathcal{P}_{s,a}} \sum_{s'} P(s' \mid s,a) \left[ r(s,a,s') + \gamma V_{\text{worst}}(s') \right], \]

where \(V_{\text{worst}}\) is the fixed point of this robust Bellman operator under \(\pi_e\), \(\gamma\) is the discount factor, and \(V_{\text{worst}}(\pi_e)\) is its average over the initial-state distribution. Because the uncertainty set is a product \(\prod_{s,a} \mathcal{P}_{s,a}\), the minimization can be carried out separately for each state-action pair. When each \(\mathcal{P}_{s,a}\) is defined by linear constraints, the inner minimization is a small linear program per state-action pair (for simple sets such as intervals it can be solved greedily), turning the nebulous “set of SCMs” into a computable interval. Regulators therefore obtain a lower bound on expected return that accounts for both biased logs and structural ambiguity; this is the quantitative guarantee they require before green-lighting a new policy.
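A worst-case evaluation of this kind can be sketched with a deliberately simple uncertainty set: elementwise intervals around a nominal transition table. This is a stand-in chosen for brevity, not the parametric sets of Bura et al. (2022). With interval sets the inner minimization is solved greedily by pushing spare probability mass toward the lowest-value successors. All names and numbers are illustrative.

```python
import numpy as np

def worst_case_expectation(values, lo, hi):
    """Minimize p @ values subject to lo <= p <= hi and sum(p) == 1."""
    p = lo.copy()
    budget = 1.0 - p.sum()
    for j in np.argsort(values):          # cheapest successors get mass first
        add = max(min(hi[j] - lo[j], budget), 0.0)
        p[j] += add
        budget -= add
    return float(p @ values)

def robust_policy_evaluation(pi_e, R, lo, hi, gamma=0.9, iters=500):
    """Iterate the robust Bellman operator for a fixed policy pi_e."""
    n_s, n_a = pi_e.shape
    v = np.zeros(n_s)
    for _ in range(iters):
        v_new = np.zeros(n_s)
        for s in range(n_s):
            for a in range(n_a):
                target = R[s, a] + gamma * v      # r(s, a, s') + gamma * V(s')
                v_new[s] += pi_e[s, a] * worst_case_expectation(target, lo[s, a], hi[s, a])
        v = v_new
    return v

# Hypothetical 3-state, 2-action problem.
rng = np.random.default_rng(0)
n_s, n_a, gamma = 3, 2, 0.9
P_nom = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P_nom[s, a, s']
R = rng.random((n_s, n_a, n_s))                        # r(s, a, s')
pi_e = np.full((n_s, n_a), 0.5)

v_nominal = robust_policy_evaluation(pi_e, R, P_nom, P_nom, gamma)
lo = np.clip(P_nom - 0.1, 0.0, 1.0)
hi = np.clip(P_nom + 0.1, 0.0, 1.0)
v_worst = robust_policy_evaluation(pi_e, R, lo, hi, gamma)
```

Collapsing the intervals to the nominal model recovers ordinary policy evaluation, and widening them can only lower the value. The gap between `v_nominal` and `v_worst` is the price of the assumed ambiguity.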

The result is a layered pipeline: the logging policy and SCM determine what data are observed; representation learning reshapes the inputs to the doubly robust estimator so the predictive model generalizes to new actions; importance weighting, cross-fitting, and doubly robust combination cancel bias wherever possible; and finally parametric uncertainty sets convert the residual ambiguity into a worst-case band. That layered view is what you need to evaluate, compare, and certify candidate policies before they are deployed.

Where the field is now¶

Now that we understand the estimators and uncertainty quantification, we can see how research and engineering have advanced them. On the research side, Varıcı et al. (2023) arxiv:2301.08230 tightened the identifiability of representation learning by exploiting score-based generative models under interventions, making it easier to learn representations that peel apart hidden confounders even when the latent graph changes. Concurrently, Bura et al. (2022) arxiv:2207.05259 delivered the first practical algorithms to compute tight parametric worst-case bounds on transition probabilities in sequential environments, opening the door to optimizing a policy’s worst-case return directly within policy search. These advances have renewed interest in constructing sets of SCMs with guarantees, bridging structural causality and offline policy optimization.

On the engineering side, Amazon’s AWS Machine Learning blog (2021) [https://aws.amazon.com/blogs/machine-learning/off-policy-evaluation/] details how their advertising and recommendation teams run conservative offline evaluations before any policy reaches production. Their pipeline ingests logged bandit data, fits a behavior-policy model, trains a reward model with doubly robust corrections, and only ships a new policy when the lower bound exceeds the incumbent’s performance even in the worst plausible counterfactual. This is the engineering frontier: industry-grade stacks now chain the SCM-based uncertainty quantification onto the doubly robust estimator, producing auditable worst-case guarantees that align with regulators’ expectations.

Together these trajectories illustrate a bifurcated frontier. The research frontier is tightening counterfactual bounds through more identifiable representations and more expressive uncertainty sets. The engineering frontier is embedding those bounds inside large-scale logging systems so that every candidate policy automatically earns a safety certificate before it touches real users. Understanding both fronts lets you ask “What guarantees does my policy need?” and “How do I compute those guarantees efficiently?” before deploying anything.

What's still open¶

Non-parametric, high-dimensional bounds. Can we maintain tight counterfactual intervals in high-dimensional, continuous state-action spaces without relying on restrictive parametric structural causal models? Current bounds either assume small discrete SCMs or require known smoothness structures; a general estimator that delivers practicality and tightness simultaneously is missing.
Sequential representation balancing. How do we learn the minimal representation \(\Phi\) that preserves both reward structure and latent noise across logs generated by multiple policies over time? Existing balance metrics like those in Shalit et al. (2016) focus on binary treatments; a sequence-level balance that works across several logging policies is still undefined.
Scalable worst-case SCM search. What is the algorithmic cost of locating the worst-case SCM within an uncertainty set defined by a generative model (e.g., a diffusion-based causal model) rather than a convex polytope? Without such search, we cannot compute the tight bounds that Bura et al. (2022) inspire.
Uncertainty propagation across hierarchies. How can we propagate counterfactual uncertainty through hierarchical decision structures (multi-stage treatment plans, recommendation hierarchies with latent user segments) without the intervals collapsing to triviality or requiring impractical Monte Carlo sampling?

These questions define the immediate research frontier for anyone who wants to push policy evaluation beyond today’s assumptions.

Where to read next¶

If you want the causal scaffolding that underlies the SCM stories here, → [[structural-causal-models]] provides the interventions, counterfactuals, and do-calculus vocabulary. If you care about the estimators in detail, → [[counterfactual-inference]] walks through potential outcomes, doubly robust bounds, and cross-fitting theory. If your next interest is in scaling safe decision-making, → [[offline-reinforcement-learning]] explains how these ideas plug into documented evaluation stacks.

Build it¶

What you're building: An offline counterfactual evaluation pipeline that loads logged policy data, fits a doubly robust estimator with representation regularization, and reports a lower bound on the expected return your candidate policy would achieve.

Why this is valuable: You will reproduce the core components of a production safety certificate: a logging-policy model, a reward model with representation-based regularization, doubly robust scoring, and a worst-case bound you can present to stakeholders before deploying any policy change.

Stack:
- Model: PyTorch MLPs for the reward model and the behavior policy estimator.
- Dataset: akaburia/policy-evaluations, reported bandit feedback logs with context features, action probabilities, and rewards.
- Framework: PyTorch (2.0+) + Hydra for configuration management.
- Compute: Runs on a single RTX 4060 (8 GB) or free Colab T4; full pipeline executes in under 60 minutes.

The recipe:
1. Install the stack with pip install "torch>=2.0" hydra-core numpy pandas scikit-learn.
2. Load akaburia/policy-evaluations, normalize context features, and split into train/val folds for cross-fitting; keep the logged action probabilities as the propensity estimates.
3. Train the behavior policy \(\pi_b\) with a 2-layer MLP using cross-entropy loss on logged actions, and simultaneously train the reward model \(\widehat{Q}(s,a)\) with mean-squared error, adding an IPM penalty (e.g., MMD) between representations of frequently and infrequently taken actions. Use early stopping on validation MSE.
4. Compute the doubly robust estimate \(\hat{V}_{\text{DR}}(\pi_e)\) for a candidate policy (e.g., a softmax over \(\widehat{Q}\)) plus the uncertainty-aware lower bound induced by fitting perturbed transition models (projected via linear constraints like in Bura et al. 2022).
5. Report the lower bound as your policy’s guaranteed performance and visualize how it compares to the logged policy’s baseline and the unconstrained direct method estimate.

Expected outcome: A notebook that loads the HuggingFace dataset, trains the models, computes the doubly robust estimate, and plots the guaranteed lower bound, delivering the same kind of safety certificate practitioners use before shipping policies.

Variants per persona:
- Applied AI/ML engineer (forward-deployed): Replay the pipeline with your own logged data, set a firm requirement that the lower bound must exceed the incumbent’s reward by 2%, and deploy the model through a simple API that returns “Safe to deploy” only when the bound is met.
- Research engineer: Reproduce Figure 2 of Kallus & Uehara (2021), hitting their reported coverage of cross-fitted doubly robust estimators within ±3% on off-policy evaluation datasets such as the Open Bandit Dataset.
- Applied researcher: Hypothesis: injecting an IPM penalty into the reward model reduces the worst-case bound variance by 15%. Falsification criterion: the variance of the lower bound computed on 5 held-out folds should remain within 5% if the penalty is inactive.
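For the forward-deployed variant, the gating logic might look like the sketch below. It assumes you already have per-sample doubly robust terms and uses a percentile bootstrap to get a statistical lower confidence bound. That bound covers sampling noise only and is a simplification of the worst-case SCM bound described earlier. The function names, the 5% level, and the synthetic terms are hypothetical.

```python
import numpy as np

def bootstrap_lower_bound(terms, alpha=0.05, n_boot=2000, seed=0):
    """Percentile-bootstrap lower confidence bound on the mean of per-sample terms."""
    rng = np.random.default_rng(seed)
    terms = np.asarray(terms, dtype=float)
    idx = rng.integers(0, len(terms), size=(n_boot, len(terms)))
    boot_means = terms[idx].mean(axis=1)
    return float(np.quantile(boot_means, alpha))

def deployment_gate(lower_bound, incumbent_value, margin=0.02):
    """Approve only if the bound beats the incumbent by the required relative margin."""
    required = incumbent_value * (1.0 + margin)
    return "Safe to deploy" if lower_bound >= required else "Hold"

# Synthetic stand-in for per-sample doubly robust terms.
rng = np.random.default_rng(42)
dr_terms = rng.normal(loc=1.0, scale=0.5, size=1000)

lower = bootstrap_lower_bound(dr_terms)
decision_vs_weak_incumbent = deployment_gate(lower, incumbent_value=0.5)
decision_vs_strong_incumbent = deployment_gate(lower, incumbent_value=1.0)
```

The gate deliberately compares the lower bound, not the point estimate, against the incumbent, so a promising but noisy candidate is held back until more data narrows the interval.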
