
Understanding Causal Inference: Models, Estimation, and Prediction
Explore the world of causal models for counterfactual prediction and causal effect estimation. Unravel the complexities of causal inference with detailed insights and implementations in causallib. Learn about the crucial aspects of identification, estimation, and the difference between supervised learning and counterfactual prediction.
Presentation Transcript
The Zoo of Causal Models: an overview of counterfactual prediction and causal effect estimation (and some implementations in causallib)
Causal Roadmap
Causality requires identification + estimation.
Identification:
- Specify a causal question
- Specify the observed data
- Translate the causal question to a statistical problem: define an estimand in the observed data
Estimation:
- Estimand + statistical model = statistical estimation problem
- Causal-specific estimators might be more efficient
- Different causal estimators have different properties
Causal Inference by counterfactual prediction Conceptually, to quantify the causal effect: Compare what would have happened if we did something to what would have happened if we did nothing
Causal Inference by counterfactual prediction
Ideally: the multiverse! The universe splits just before the intervention is administered, and the rest is identical. The effect is then the contrast between the two worlds (their difference or ratio).
Causal Inference is not supervised learning
Supervised learning prediction: X: features, Y: target.
Counterfactual prediction: X: features, Y: target, A: intervention/action.
[Slide figure: an observed-data table (X, A, Y) alongside the potential-outcome columns Y^{A=0} and Y^{A=1}. For each unit, only the potential outcome matching its observed treatment is filled in; the missing counterfactual column plays the role of an unseen test set.]
Causal Inference is not supervised learning
Supervised learning prediction: X: features, Y: target.
Counterfactual prediction: X: features, Y: target, A: intervention/action.
Interpolation: predict Y for unseen X.
Extrapolation: predict Y for unseen A (what would've happened if we acted differently).
Causal Inference is not supervised learning
A prediction task captures the current way a treatment is administered; it might capture differences in population, rather than the treatment.
An intervention task asks what would have happened if we gave patients a drug that they didn't actually get, which can be outside the distribution of the observed past patterns.
Causal inference methods
Basic principle: take the outcomes from one group and extrapolate them to the other group, filling in the missing cells of the potential-outcome table (Y^{A=0}, Y^{A=1}).
Two* families of methods:
- Weight models: model A given X to balance between treatment groups
- Direct outcome models: model Y directly using X and A
*Doubly robust models combine weight and outcome models together.
~Weight Models: Matching
A very popular and intuitive method for adjustment: for each treated unit, find the control most similar to it. [Slide figure: treated and control units, with outcomes labeled Sick/Healthy.]
~Weight Models: Matching
Using matching as a potential outcome prediction model: the matched control's observed outcome ("my outcome is sick!") is imputed as the treated unit's counterfactual outcome ("my counterfactual outcome is sick!").
~Weight Models: Matching
Using matching as a transformer/preprocessor: gather up the matched samples, we're going into regression! *Match by confounders, but use outcome-predicting features for sample efficiency.
~Weight Models: Matching
Advantages:
- Epistemology: very intuitive / convincing
- Statistical: avoids explicit modeling of the response surface
Disadvantages:
- Estimates the effect on the treated, not in the population
- Sample inefficient
~Weight Models: Matching
Solution: use all units. Match 1:many, and give each unit a proportional weight.
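The matching idea above can be sketched in a few lines. Below is a minimal, hypothetical 1-nearest-neighbor version on a single confounder, estimating the effect on the treated; the toy data and the `att_by_matching` helper are illustrative, not causallib's implementation (which would typically match on a propensity score or a multivariate distance):

```python
def att_by_matching(data):
    """Estimate the effect on the treated: match each treated unit
    to the control closest to it on the confounder x, and use the
    matched control's outcome as the imputed counterfactual."""
    treated = [d for d in data if d["a"] == 1]
    controls = [d for d in data if d["a"] == 0]
    effects = []
    for t in treated:
        # nearest control in confounder space
        match = min(controls, key=lambda c: abs(c["x"] - t["x"]))
        # matched outcome imputes Y^{A=0} for this treated unit
        effects.append(t["y"] - match["y"])
    return sum(effects) / len(effects)

# toy data: x = confounder, a = treatment, y = outcome
data = [
    {"x": 1.0, "a": 1, "y": 3.0},
    {"x": 1.1, "a": 0, "y": 1.0},
    {"x": 4.0, "a": 1, "y": 6.0},
    {"x": 3.9, "a": 0, "y": 4.0},
    {"x": 8.0, "a": 0, "y": 9.0},
]
print(att_by_matching(data))  # ((3-1) + (6-4)) / 2 = 2.0
```

Note the distant control (x=8.0) is simply never matched, which is exactly the sample inefficiency the slide mentions.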
Weight Models
Assign a weight to each sample so that the distribution of covariates is similar between treatment groups. This generates an RCT-like (balanced) pseudo-population.
Example: Inverse Probability Weighting (IPW)
- Use machine learning to model Pr(A|X)
- Assign each sample w_i = 1 / Pr(A = a_i | x_i)
- Take the weighted average of the outcome in each group
For instance, if Pr(A=1|x) = 4/5 and Pr(A=0|x) = 1/5, a treated unit gets weight 5/4 and an untreated unit gets weight 5.
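As a sketch of this recipe, here is a dependency-free IPW estimate where stratum frequencies stand in for the fitted model of Pr(A|X); the toy numbers are chosen to reproduce the 4/5 and 1/5 propensities from the slide (this is an illustration, not causallib's `IPW` estimator):

```python
from collections import defaultdict

def ipw_outcomes(data):
    """Weighted average outcome under A=1 and A=0, with w_i = 1/Pr(A=a_i|x_i)
    and propensities estimated by per-stratum frequencies."""
    n_x = defaultdict(int)
    n_treated_x = defaultdict(int)
    for d in data:
        n_x[d["x"]] += 1
        n_treated_x[d["x"]] += d["a"]
    means = {}
    for a in (0, 1):
        num = den = 0.0
        for d in data:
            if d["a"] != a:
                continue
            p1 = n_treated_x[d["x"]] / n_x[d["x"]]     # Pr(A=1 | x)
            w = 1 / (p1 if a == 1 else 1 - p1)         # inverse probability
            num += w * d["y"]
            den += w
        means[a] = num / den
    return means

# stratum x=0: Pr(A=1|x)=4/5; stratum x=1: Pr(A=1|x)=1/5
data = (
    [{"x": 0, "a": 1, "y": 1.0}] * 4 + [{"x": 0, "a": 0, "y": 0.0}] * 1
    + [{"x": 1, "a": 1, "y": 3.0}] * 1 + [{"x": 1, "a": 0, "y": 2.0}] * 4
)
m = ipw_outcomes(data)
print(m[1] - m[0])  # estimated average effect: 2.0 - 1.0 = 1.0
```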
Outcome Models: meta-learners
Predict the counterfactual directly from carefully-selected confounders and the treatment assignment:
- Fit E[Y|X,A]
- Predict E[Y|X, A=1] and E[Y|X, A=0]: force a synthetic assignment of A=1 and A=0 on everyone and predict
Many models are meta-learners: they leverage a machine-learning base-estimator, using ML to cleverly summarize high-dimensional data.
Outcome Models: meta-learners+
Incorporating treatment information:
- S-Learner: treatment as an additional feature (single model); limited flexibility of the treatment effect
- T-Learner: treatment as a task indicator (one model per treatment); lower statistical power
- Hierarchical Bayesian: multilevel hyper-prior sharing between per-treatment models
- Neural network: multi-task network with a shared base and a head per treatment, predicting Y^0 and Y^1
Outcome Models: meta-learners
S-learners can be as simple as linear regression:
- Fit E[Y|X,A]: a regression model with the features and the treatment assignment as input
- Predict E[Y|X, A=1] and E[Y|X, A=0]: force a synthetic assignment of A=1 and A=0 on everyone and predict
What makes it causal? How does it differ from supervised learning? Only a specific choice of X (a set of confounders satisfying the identification assumptions) will result in a valid estimate.
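The fit-then-force-assignment loop above can be sketched with plain linear regression as the base estimator. This is a hypothetical, dependency-free illustration (normal equations solved by hand on toy data with an exactly linear truth), not causallib's `StandardizedModel`/S-learner API:

```python
def solve(mat, vec):
    """Gauss-Jordan elimination with partial pivoting for a small system."""
    n = len(vec)
    aug = [row[:] + [vec[i]] for i, row in enumerate(mat)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [u - f * v for u, v in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def s_learner_ate(xs, a, y):
    """Fit E[Y|X,A] on features [1, x, a], then predict with A forced
    to 1 and to 0 for everyone and average the difference."""
    feats = [[1.0, x_i, float(a_i)] for x_i, a_i in zip(xs, a)]
    # normal equations: (F^T F) b = F^T y
    ftf = [[sum(f[i] * f[j] for f in feats) for j in range(3)] for i in range(3)]
    fty = [sum(f[i] * y_i for f, y_i in zip(feats, y)) for i in range(3)]
    b0, b1, b2 = solve(ftf, fty)
    # E[Y|x,A=1] - E[Y|x,A=0]; with a linear model this is just b2
    return sum((b0 + b1 * x + b2) - (b0 + b1 * x) for x in xs) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
a = [0, 1, 0, 1]
y = [1 + 2 * x_i + 3 * a_i for x_i, a_i in zip(xs, a)]  # simulated effect = 3
print(s_learner_ate(xs, a, y))  # recovers approximately 3.0
```

The linear S-learner also makes the slide's "limited flexibility" point concrete: the effect is a single coefficient, identical for every x.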
Outcome Models with deep learning
- Multi-task network design
- Replace weighting with representation learning: use DL to obtain a balanced representation between groups
- Minimize the distribution distance between group representations (auxiliary loss)
- Use reverse gradients to create a representation that cannot predict the treatment, forcing ignorability
Outcome Models: non-meta-learners
Causal trees and causal forests: a regression/classification tree that uses features to split the data into strata.
- Split criterion: maximize the group difference, max |E[Y | X in leaf, A=1] - E[Y | X in leaf, A=0]|, i.e., search for the split that produces the biggest treatment difference across leaves (heterogeneous effects)
- Use the treatment prevalence at the nodes, or IPW, to balance the difference (weighted average)
- Honest tree: separate samples for tree generation and node estimation, avoiding double-dipping in an overfit-prone model
Künzel et al. https://www.pnas.org/content/116/10/4156
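The split criterion can be illustrated with a single binary split on one feature: for each candidate threshold, compute the within-leaf treatment effect and keep the split with the largest effect heterogeneity. This is a hypothetical one-split sketch of the criterion only, not the full honest causal-tree/forest algorithm:

```python
def leaf_effect(rows):
    """Naive within-leaf treatment effect: mean(Y|A=1) - mean(Y|A=0)."""
    t = [r["y"] for r in rows if r["a"] == 1]
    c = [r["y"] for r in rows if r["a"] == 0]
    return sum(t) / len(t) - sum(c) / len(c)

def best_split(rows):
    """Search thresholds on x; maximize |effect(left) - effect(right)|."""
    best = None
    for thr in sorted({r["x"] for r in rows})[1:]:
        left = [r for r in rows if r["x"] < thr]
        right = [r for r in rows if r["x"] >= thr]
        # both leaves need both treatment groups (positivity inside leaves)
        if not all(any(r["a"] == g for r in side)
                   for side in (left, right) for g in (0, 1)):
            continue
        het = abs(leaf_effect(left) - leaf_effect(right))
        if best is None or het > best[0]:
            best = (het, thr, leaf_effect(left), leaf_effect(right))
    return best  # (heterogeneity, threshold, left effect, right effect)

rows = [
    {"x": 0, "a": 0, "y": 0.0}, {"x": 0, "a": 1, "y": 0.0},  # no effect at x=0
    {"x": 1, "a": 0, "y": 0.0}, {"x": 1, "a": 1, "y": 2.0},  # effect 2 at x=1
]
print(best_split(rows))  # splits at x >= 1; leaf effects 0.0 and 2.0
```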
Model type comparison
- Flexibility: response surface models are flexible; weight models are limited
- Inference: response surface models can estimate heterogeneous effects; weight models estimate average effects
- Fitting: response surface models are harder to fit; weight models are easier to fit
- Trustworthiness: weight models reproduce the treatment decision (A~X) and get the outcome "for free"; response surface models must estimate the entire response profile (Y~X+A)
Doubly robust models
- Build a treatment model g: A~X
- Build an outcome model Q: Y~X,A
- Combine g and Q into a single model
Pros:
- May benefit from both worlds
- Can be more data efficient
- Some combinations ensure a consistent estimation if either model is consistent
Doubly robust models: TMLE motivation
Optimizing E[Y|X,A] is not the same as optimizing E[Y|X,A=1] - E[Y|X,A=0]; prediction and causal inference target different parameters of interest, so supervised learning will not directly result in causal effect estimation.
If you fit a dog classifier and a cat classifier, you don't expect them to tell you the differences between dogs and cats (floppy ears, pointed face, etc.): that is not their primary objective when optimizing.
Doubly robust models: TMLE motivation
Optimizing E[Y|X,A] is not the same as optimizing E[Y|X,A=1] - E[Y|X,A=0]; prediction and causal inference target different parameters of interest, so supervised learning will not directly result in causal effect estimation.
Extreme case: if the treatment assignment is not predictive of the outcome, it will be ignored. For example, a tree-based S-learner might never split on the treatment at all. That doesn't mean there is no treatment effect, just that its predictiveness (association) with the outcome is smaller than that of other features.
Doubly robust models: TMLE
Targeted Maximum Likelihood Estimation: a framework to combine flexible machine-learning estimation into causal estimation.
- Retarget the prediction-optimized parameter to the causal parameter
- Nudge the estimation from prediction to counterfactual prediction
- Correct the outcome-model estimation using IPW
Doubly robust models: TMLE
1. Fit a flexible outcome model Q(A,X) ≈ E[Y|X,A]
2. Fit a propensity model g(X) ≈ Pr(A|X)
3. Obtain inverse probability features*: H(0,X) = 1(A=0)/Pr(A=0|X) and H(1,X) = 1(A=1)/Pr(A=1|X)
4. Fit a logistic regression** model: Y ~ 1 + H(0,X) + H(1,X) + offset(Q(A,X)), i.e., regress the outcome on the IP features while holding Q(A,X) fixed as an offset
5. Obtain counterfactual predictions by setting A to 0 and to 1 for everyone: plug in H(0,X), H(1,X), Q(0,X), Q(1,X)
* There are also single-covariate and weighted-regression flavors
** For a continuous outcome, bound it to 0-1
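Written out for the matrix clever-covariate flavor, the targeting step (step 4) fluctuates the initial fit on the logit scale, with the fitted IP-feature coefficients ε̂₀, ε̂₁ doing the nudging; this is a sketch assuming a binary (or 0-1 bounded) outcome:

```latex
\operatorname{logit} Q^{*}(A,X)
  = \operatorname{logit} Q(A,X)
  + \hat{\varepsilon}_0 \,\frac{\mathbb{1}(A=0)}{\Pr(A=0 \mid X)}
  + \hat{\varepsilon}_1 \,\frac{\mathbb{1}(A=1)}{\Pr(A=1 \mid X)},
\qquad
\hat{\psi} = \frac{1}{n}\sum_{i=1}^{n}\bigl[\, Q^{*}(1,X_i) - Q^{*}(0,X_i) \,\bigr]
```

In step 5, setting A=1 (resp. A=0) turns the corresponding indicator on, so each counterfactual prediction receives its own IPW correction term.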
Doubly robust models: TMLE
Different clever-covariate flavors:
a. A matrix of IP features: H = [ 1(A=0)/Pr(A=0|X), 1(A=1)/Pr(A=1|X) ]
b. A vector IP feature: H = (2A-1)/Pr(A=a_i|X), so the untreated get a negative IPW feature
c. Move the IP into a weighted regression (weights 1/Pr(A_i=a_i|X_i)) rather than a feature:
   a. Matrix flavor: the feature becomes a one-hot encoding of A
   b. Vector flavor: the feature becomes +1 or -1 (depending on a_i)
Doubly robust models: AIPW
Augmented IPW corrects the IPW average effect with the outcome model's effect:
Ŷ^a = (1/n) Σ_i [ 1(A_i=a) Y_i / Pr(A_i=a|X_i) + (1 - 1(A_i=a)/Pr(A_i=a|X_i)) Q(a,X_i) ]
The first term is the IPW predicted outcome with surprisal weighting 1/Pr(A_i=a|X_i); the second is the augmentation by the outcome model Q.
Surprisal weighting: a good treatment prediction means less surprise that unit i is in its group, and hence a smaller outcome-model correction.
Can only estimate the average effect.
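To show the correction mechanics of the estimator above, here is a sketch with known propensities and a fixed outcome model Q; the numbers are hypothetical, and in practice both Pr(A|X) and Q would be fitted models:

```python
def aipw_mean(data, a):
    """Doubly robust (AIPW) estimate of E[Y^a]: IPW term plus
    outcome-model augmentation, averaged over all units."""
    total = 0.0
    for d in data:
        p = d["p"] if a == 1 else 1 - d["p"]   # Pr(A_i = a | X_i)
        ind = 1.0 if d["a"] == a else 0.0      # 1(A_i = a)
        # IPW term + (1 - ind/p) * Q(a, X_i) augmentation
        total += ind * d["y"] / p + (1 - ind / p) * d["q"][a]
    return total / len(data)

data = [
    # p = Pr(A=1|X); q = (Q(0,X), Q(1,X)) from a hypothetical outcome model
    {"a": 1, "y": 3.0, "p": 0.5, "q": (1.0, 3.0)},
    {"a": 0, "y": 1.0, "p": 0.5, "q": (1.0, 3.0)},
]
print(aipw_mean(data, 1) - aipw_mean(data, 0))  # 3.0 - 1.0 = 2.0
```

Here the outcome model agrees with the data, so the augmentation terms cancel the IPW noise exactly; with a misspecified Q, the IPW term would still keep the estimate consistent (and vice versa).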
Causal Inference assumptions
To make the leap from hypothetical to measurable quantities:
- Consistency (well-defined interventions): lets us translate from the hypothetical world to the real world; the observed outcome is the potential outcome under the received treatment, so half of the potential-outcomes table is already full
- Exchangeability: no unmeasured confounding (causal sufficiency); disentangles spurious correlations from causal contributions, and holds only for a specific choice of X
- Positivity: everyone has some chance to be in both treatment groups; IPW breaks if there are strata with zero data points in one group
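The positivity assumption is the one most directly checkable from data. As a crude sketch (a hypothetical helper, with a discrete confounder; real diagnostics would inspect the propensity-score distributions), flag strata where one treatment group is empty, since w = 1/Pr(A=a|X) is undefined there and IPW breaks:

```python
from collections import defaultdict

def positivity_violations(data):
    """Return strata of the discrete confounder x in which
    not both treatment groups (A=0 and A=1) are observed."""
    groups = defaultdict(set)
    for d in data:
        groups[d["x"]].add(d["a"])
    return sorted(x for x, seen in groups.items() if seen != {0, 1})

data = [
    {"x": "young", "a": 0}, {"x": "young", "a": 1},
    {"x": "old", "a": 1},  # no untreated old patients: positivity violated
]
print(positivity_violations(data))  # ['old']
```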