External validity
In research methodology, External Validity is crucial for generalizing findings from Randomized Controlled Trials (RCTs) to target populations. Dr. John Jerrim from UCL Institute of Education delves into the significance of External Validity, exploring the difference between Sample Average Treatment Effect (SATE) and Population Average Treatment Effect (PATE). Discover methods to investigate and correct estimates to better reflect PATE, aiming to grasp the true impact of interventions on the desired population. The challenges of non-random recruitment and loosely defined populations in RCTs are also discussed, emphasizing the importance of sound sampling frameworks in evaluating the transferability of results. By aiming to estimate PATE, researchers can enhance the applicability of study outcomes and guide evidence-based policies effectively.
Presentation Transcript
External validity Dr. John Jerrim UCL Institute of Education
Aims
- To understand what external validity is and why it is important.
- The difference between sample and population average treatment effects (SATE and PATE).
- The assumptions under which estimates of SATE = PATE.
- How you may investigate the external validity of your RCT further.
- Methods of correcting SATE estimates to get closer to PATE.
- Gain experience of considering the external validity of trial data using Stata.
Name of the game = PATE
Why do we do evaluations (RCTs)?
- To work out whether it is a good idea to roll out a policy/intervention more widely.
Therefore, what do we want to know?
- The likely effect in the population we want to roll out to.
- Hence we want an estimate of PATE: the average treatment effect in the population of interest.
External validity
- The extent to which we can generalise results from the RCT to the population of interest.
- The extent to which we believe we have estimated PATE, i.e. got what we really want.
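In standard potential-outcomes notation (an addition for clarity, not from the slides), with $Y_i(1)$ and $Y_i(0)$ the outcomes for individual $i$ with and without the intervention, the two estimands are:

$$\text{SATE} = \frac{1}{n}\sum_{i \in S}\bigl[Y_i(1) - Y_i(0)\bigr], \qquad \text{PATE} = \mathrm{E}\bigl[Y(1) - Y(0)\bigr],$$

where $S$ is the trial sample of size $n$ and the expectation is taken over the population of interest.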
Recall: the best way to estimate PATE
- Step 1: Define the population of interest.
- Step 2: Construct a sampling frame for that population.
- Step 3: Recruit a random sample.
- Step 4: Randomly assign to treatment and control groups.
- Step 5: Achieve 100% follow-up in both groups.
The problem
- RCTs don't typically randomly recruit into the study (step 3 doesn't happen).
- Often there is no good sampling frame (step 2 doesn't happen).
- The population of interest is often loosely defined (step 1 doesn't happen).
The result
- Non-random convenience samples, different from the population in observed (and unobserved) ways.
- Testing the treatment on a strange group? E.g. particularly adventurous? Enthusiastic?
Concern: will our results really generalise?
The problem: A. Bradford Hill (1966), "Reflections on the Controlled Trial" (The Heberden Oration), Annals of the Rheumatic Diseases.
RCTs = SATE
- What most RCTs really give you is SATE (Sample Average Treatment Effect): the effectiveness of the treatment for your sample.
- SATE is a useful piece of information: does the treatment work even when people are willing / enthusiastic about it? If not, it would seem even less likely to work in the population, where some individuals are less willing / enthusiastic about change. Likely to be important in the context of social interventions.
- But, at the end of the day, SATE isn't what we really want. SATE is likely to give an upper bound for PATE?
Other issues
- Standard errors, p-values, confidence intervals and power calculations are fundamental to RCT analysis, but they rely upon an assumption of random / probabilistic sampling.
- How do we estimate sampling variation with a convenience sample? Not clear: such statistics do not technically exist, so it is hard to judge the uncertainty in estimates that is due to having a sample. A hard-line view is that we should not even report them.
- Big limitation: many of our standard tools are no longer technically appropriate / valid.
Why does this matter? Case study: The Polio (Salk) vaccine RCT
Note: the polio rate is much lower in the no-consent group than in the control group, despite neither group getting the vaccine. Why? Non-random selection into the trial! Children of wealthier, better-educated parents were more at risk of paralytic polio (less early exposure), and their parents were also more likely to consent to take part; poorer families were less at risk and less likely to take part. (Rate refers to the polio rate per 100,000 population.)
How much is external validity considered in social science RCTs? [Chart showing how often trials claim external validity, by category: nothing; less than a paragraph; one clear paragraph; attempt to assess; correction]
When will our estimate of SATE = an estimate of PATE?
When will SATE = PATE?
1. Random recruitment into the trial (as noted): this ensures, in expectation, that the characteristics of the sample = the characteristics of the population.
2. A homogeneous treatment effect: you may recruit more of one type of individual than another, but if this characteristic does not interact with the treatment, then so what! It won't result in any difference between SATE and PATE.
If either condition holds, that is enough to mean SATE = PATE.
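A one-line sketch of why homogeneity is enough (standard notation, added here for clarity): if every individual has the same treatment effect, $Y_i(1) - Y_i(0) = \tau$ for all $i$, then

$$\text{SATE} = \frac{1}{n}\sum_{i \in S}\tau = \tau = \mathrm{E}\bigl[Y(1) - Y(0)\bigr] = \text{PATE},$$

regardless of who happens to be recruited into the sample.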
When will SATE ≠ PATE?
1. When treatment effects are heterogeneous, e.g. the intervention is more effective for those enthusiastic about it, or for motivated individuals.
AND
2. When we disproportionately recruit such groups into the RCT, e.g. people who believe the treatment is effective, or highly motivated individuals, are more likely to take part.
Both conditions have to hold for SATE ≠ PATE!
Think about this in the context of social science vs medicine.
Medicine (e.g. a new oral drug): those who believe it will be effective are probably more likely to enter the RCT, but as long as the person takes the tablet when they are meant to, it is hard to see the treatment (a biological reaction) varying greatly by motivation. Hence SATE approximating PATE may be credible?
Social science (e.g. teaching children how to play chess): those who believe it will be effective are probably more likely to enter the RCT, and it seems very likely that effectiveness will depend upon motivation / willingness to try new things / belief that it will work. Hence it is highly unlikely that SATE = PATE.
How can you further consider the external validity of your RCT (assuming random sampling is not possible)?
1. Compare sample to population (in terms of observables)
- The RCT sample and population must differ for SATE ≠ PATE, so compare the sample and the population in terms of observables.
- The closer the correspondence between the population and the sample, the more credible the argument that SATE = PATE.
- Why? If the sample looks like the population in terms of observables, then any heterogeneous effect of treatment by these variables will not matter.
- Limitation: only as credible as the characteristics we can observe in both sample and population; important things (e.g. motivation) cannot be observed in population data.
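A minimal Stata sketch of such a comparison, assuming a hypothetical pupil-level file covering the whole population of interest with a 0/1 indicator in_trial and hypothetical variables fsm and ks1_points:

    * Cross-tabulate FSM eligibility for trial vs non-trial pupils (column %)
    tabulate fsm in_trial, column
    * Compare mean KS1 points between trial and non-trial pupils
    ttest ks1_points, by(in_trial)

Given the non-random sample, the size of any differences matters more than their statistical significance.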
Example: Maths Mastery
- Not a random sample of schools.
- Compare pupils in the trial to the England state-school population using the NPD (National Pupil Database).
- The trial has: more FSM pupils; fewer white pupils; more black and Asian pupils; more low achievers (figures not shown).
2. Investigate possible heterogeneity (observables)
- SATE and PATE will only differ if the treatment effect is heterogeneous, i.e. has more impact on some sub-groups than others.
- As part of RCTs, we typically collect additional baseline information: baseline test scores, demographics (gender, ethnicity, a measure of poverty).
- We can do sub-group analysis by these variables, or include an interaction term in our statistical model.
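A minimal Stata sketch of such an interaction test, with hypothetical variable names (ks2 outcome, treat 0/1, fsm 0/1, ks1_points baseline score):

    * Interaction between treatment and FSM eligibility, controlling for baseline score
    regress ks2 i.treat##i.fsm ks1_points
    * The coefficient on 1.treat#1.fsm is the difference in the estimated
    * treatment effect between FSM and non-FSM pupils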
2. Investigate possible heterogeneity (observables): limitations
- Observable characteristics only; unobservable heterogeneous treatment effects are likely to be important.
- Statistical power: we are often limited in our ability to detect even main effects, and we have a lot less power to detect interactions / sub-group effects. Most investigations of interactions will probably be statistically insignificant, but this doesn't mean they don't exist!
3. Model selection into the RCT
- We can think of non-random participation in the RCT as a selection problem, just as we think about survey non-response.
- We can therefore model the selection process (in terms of observables) and create Inverse Probability Weights (IPW) to apply in the analysis.
- If we can accurately model the selection process (in terms of observables), we can correct our SATE estimates into PATE estimates.
- Limitations: requires rich population-level data; the correction is in terms of observables only.
Creating and applying IPW in RCTs
- Stage 1: Estimate a selection model by probit/logit. Every observation in the population of interest is included in the model; the response is 0 = not in trial, 1 = in trial.
- Stage 2: Create the weights. Create the predicted probability of being in the trial for each observation, and create the IPW as the reciprocal of this probability.
- Stage 3: Estimate the adjusted SATE. Use the standard methods covered in previous lectures, but now apply the IPW in the analysis.
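A sketch of the three stages in Stata, assuming (hypothetically) that the trial sample has been appended to the population file with in_trial = 1/0 and the same hypothetical covariates as above:

    * Stage 1: selection model over everyone in the population of interest
    probit in_trial fsm ks1_points i.ethnicity
    * Stage 2: predicted probability of being in the trial, and its reciprocal as the weight
    predict p_trial, pr
    generate ipw = 1 / p_trial
    * Stage 3: re-estimate the treatment effect within the trial, applying the weight
    regress ks2 i.treat ks1_points [pweight = ipw] if in_trial == 1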
4. Consider an observational study as well?
- RCTs: high internal validity, low external validity. Observational data: low internal validity, high external validity.
- RCTs and observational studies have different positives and negatives; use both to complement each other.
- Observational study: make sure it covers your population of interest (plus a high response rate), with as plentiful controls as possible (longitudinal data = even better).
- Consistent evidence across the two: you are probably in business!
Imai (2008): Pros and cons of different research designs http://gking.harvard.edu/files/matchse.pdf
5. If all else fails, be honest!
- RCTs are often made out to be the gold standard. They have many benefits, but also limitations, and these limitations (external validity in particular) need to be more widely recognised.
- It is common to say "generalisability / external validity is limited", but maybe we should do more, e.g. recognise that an observational study may help overcome some weaknesses.
The intervention
- Children receive 30 hours of chess lessons during one academic year (Year 5).
- Follows a fully developed curriculum by the Chess in Schools and Communities (CSC) team.
- Chess lessons are likely to be accompanied by an after-school chess club.
- RQ: Does teaching primary school children how to play chess lead to an improvement in their educational attainment?
Step 1: Define the population using administrative data
- 11 LEAs (geographic areas) in England, purposefully selected.
- Year 5 (age 9/10) children in the 2013/14 academic year (born September 2003 to August 2004).
- Disadvantaged schools: more than 37% of KS2 pupils eligible for FSM in the last six years.
- A total of 442 schools on the population list (sampling frame).
Step 2: Randomly sample from these 442 schools
- Could not do / achieve this. Ended up recruiting 100 out of the 442 schools; in other words, like having a 22% response rate to a survey. Not great! (Though better than what most people do!)
- Attempt to get some sense of external validity by comparing the characteristics of pupils in the trial to the population as a whole.
How did the sample compare to the study population? (trial participants / population of interest)
- Key Stage 1 maths: Level 1 12% / 12%; Level 2A 24% / 24%; Level 2B 31% / 30%; Level 2C 19% / 20%; Level 3 12% / 11%; Missing 2% / 3%
- Eligible for FSM: No 66% / 65%; Yes 35% / 35%
- Gender: Female 50% / 50%; Male 50% / 51%
- Language group: English 65% / 63%; Other 34% / 37%
- Ethnic group: White 52% / 54%; Black 22% / 19%; Asian 12% / 14%; Mixed 8% / 7%; Other 4% / 4%; Unclassified 1% / 1%; Chinese 0% / 1%
- KS1 average points: -0.280 / -0.289
- School n: 100 / 442; Pupil n: 3,775 / 16,397 (4,003 / 16,397 for the ethnic-group panel)
Representativity? Pretty good!!
Sample compared to England as a whole? (trial participants / England)
- Key Stage 1 maths: Level 1 12% / 8%; Level 2A 24% / 27%; Level 2B 31% / 57%; Level 2C 19% / 15%; Level 3 12% / 20%; Missing 2% / 2%
- Eligible for FSM: No 66% / 82%; Yes 35% / 18%
- Gender: Female 50% / 49%; Male 50% / 51%
- Language group: English 65% / 82%; Other 34% / 18%
- Ethnic group: White 52% / 77%; Black 22% / 5%; Asian 12% / 10%; Mixed 8% / 5%; Other 4% / 2%; Unclassified 1% / 1%; Chinese 0% / 0%
- KS1 average points: -0.280 / 0.00
- School n (trial): 100; Pupil n: 3,775 / 570,344 (4,003 for the trial ethnic-group panel)
Representative? NO! We can't generalise the results to the country as a whole.
External validity vs internal validity for some other evaluation methods
Before and after: the example of seatbelts
- Terrible internal validity.
- Perfect external validity: this is actually what happened in our population of interest!
- From this evidence, are we convinced that the introduction of seatbelts saved lives?
- Lesson: let's not abandon common sense!
Before and after: estimated counterfactual. [Chart comparing the observed values with the estimated counterfactual]
Question: think about this example of seatbelts. If an RCT had been run instead, would the evidence that this is a good policy be more or less convincing? (Opinion! There is no correct answer!)
RDD. Example: a tuition programme
- Very good internal validity.
- External validity = a very narrow population: only those within the space of the discontinuity.
[Chart: KS2 score against rank, with pupils who do and do not receive the treatment shown either side of the discontinuity]
Extending the region around the discontinuity: a trade-off!
- Bad for internal validity, good for external validity.
[Chart: KS2 score against rank, widening the window of ranks used around the discontinuity]
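A minimal Stata sketch of the corresponding local regression, under a hypothetical setup in which pupils ranked 8 or below receive the tuition and ks2 is the outcome; widening the window used around the cut-off is exactly the trade-off on this slide:

    * Treatment assigned to pupils at or below the cut-off rank
    generate treat   = (rank <= 8)
    generate centred = rank - 8
    * Compare pupils close to the cut-off, allowing separate slopes either side
    regress ks2 i.treat##c.centred if abs(centred) <= 3
    * Increasing the window (e.g. <= 6) adds observations (better external validity)
    * but leans more heavily on the functional form (weaker internal validity)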
Propensity Score Matching
- Match treated individuals to controls who look similar: create a propensity score, match individuals with a similar score, and throw out any observation that cannot be matched.
- A narrow caliper: better matches = better internal validity; more observations thrown out = worse external validity.
- Altering the caliper = trading off internal and external validity.
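A minimal sketch using the user-written psmatch2 command (installed with ssc install psmatch2), with the same hypothetical variable names; the caliper() option controls the trade-off described above:

    * Match treated pupils to controls on the propensity score, within a caliper,
    * and estimate the effect on ks2 among the matched (common-support) sample
    psmatch2 treat fsm ks1_points, outcome(ks2) caliper(0.1) common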
Instrumental variables (LATE interpretation)
- LATE = the effect for those whose treatment was shifted by the instrument, i.e. the individuals who changed behaviour because of the IV.
- If the IV assumptions are met, high internal validity. But what about external validity?
- The IV estimate is instrument-specific: it would potentially differ if you were to use a different IV.
- A "weird" population to whom the results generalise: not really chosen by the researcher a priori, but determined by the data and by who responds to the IV.
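A minimal Stata sketch with hypothetical names, in which offer (e.g. being offered a place on the programme) instruments for actually receiving the treatment:

    * 2SLS: the estimate is a LATE for the compliers, i.e. the pupils whose
    * take-up was shifted by the instrument
    ivregress 2sls ks2 ks1_points (treat = offer)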
Conclusions
- External validity is important!
- Most RCTs give SATE, not PATE.
- SATE ≠ PATE if there are heterogeneous treatment effects and non-random samples.
- Methods to look into / account for external validity: compare the sample to the population; look for heterogeneous treatment effects; IPW; Heckman selection models; an observational study to complement the RCT.