
Resampling for Testing Educational Outcomes in Paired Cohorts Over Years
Explore how resampling can test for differences in educational outcomes for paired cohorts observed over several years. The method provides both a significance test and an effect-size measure, and is compared with ANOVA. Robustness, power, and other considerations are addressed using a motivating case in educational evaluation. Discover how to analyze aggregated data when the raw data needed for an ANOVA are unavailable.
Presentation Transcript
JSM 2015, Session #256. Resampling as a Way to Test for Differences in Educational Outcomes for Paired Cohorts Observed over Several Years. By William M. Goodman, Ph.D., University of Ontario Institute of Technology.
Outline: Introduce the motivating case. (It's in education evaluation, but other applications are possible.) Introduce the proposed method. (It provides a significance test and an effect-size measure.) Explore whether ANOVA could work instead. Discuss robustness, power, and other considerations. Concluding example.
The Motivating Case1: Studies compared Pathway students' academic performance with that of Traditional students. Traditional students enter their program in Year 1. Pathway students join a program "mid-stream", with a prior (not necessarily related) 2-yr. diploma from a community college. (Sometimes with bridge courses.) [Table shows average marks on a grade-point scale.] 1. Table originally presented at the Student Pathways in Higher Education Conference, Ontario Council for Articulation and Transfer, Feb. 2013.
Suppose these data are representative for similar contexts in the future (so we can make inferences).
Consider the Business program: "by eye", the Pathway cohort had higher GPAs for every year in which comparisons could be made. But is that apparent pattern significant? And what would be a good measure of the effect size? The displayed numbers appear "paired", yet something like a paired t-test would not apply: the displayed values are averages, and allowance must be made for the different n's and standard deviations in the cells.
If we had the raw data underlying the cells, ANOVA would be applicable: we're looking for a main effect for the row variable (Traditional versus Pathway). But that is often not the case. For example: working from already-aggregated data (e.g., a table in a paper), or facing privacy/permissions issues (requesting data from the Registrar).
[Table of summary data: for each year i (yr2 through yr5) and cohort j (T = Traditional, P = Pathway), the cell holds n(yeari, cohortj), s(yeari, cohortj), and the cell's mean GPA(yeari, cohortj).] What if we could get the summary data, but not the raw data? Can we run an ANOVA given only the summary data? Yes, if One-Way, or if Two-Way and balanced (n/a here). Two-way unbalanced? Well, in theory:
One 2x2 case has been worked out and posted* at: www.stat.ufl.edu/~winner/cases/ethicgen.ppt {Thanks to Dr. Larry Winner, University of Florida}. I've not found an expanded example or a computerized version, and Dr. Winner is not (yet, to my knowledge) working on these.
Introduction to Template (with Winner's example2): Input the known summary data here. (Values are scores from an "ethics test".) Treatments 1, 0 are Genders: Female, Male. Control for Ranks: F1, F2 are Officers, Enlisted. 2. Summary data (only) provided as Exhibit 3 in the paper White, R.D., "Are women more ethical?" Journal of Public Administration Research and Theory, 1999. URL: www.jstor.org/stable/1181652
Introduction to Template (with Winner's example): Ranks. Calculated values for each column: total n's (F, M); difference in means for each column. Sample Statistic: the n-Weighted Mean of the Differences.
Introduction to Template (with Winner's example): Ranks. Calculated values (cont'd): weighted means for columns; pooled standard deviations for columns (F, M). For a "No Difference" test for scores based on Treatment level, the assumption is that, for each column (Fi), both cells' values are generated by the same, common mean and standard deviation.
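The per-column pooling just described can be sketched in Python. This is a minimal sketch, not the author's spreadsheet template; the function name and the example numbers are invented for illustration:

```python
import math

def pooled_column_stats(n1, mean1, sd1, n2, mean2, sd2):
    """Estimate the common mean and pooled standard deviation that, under
    the no-difference null, are assumed to generate both cells of one
    column. (Hypothetical helper name; the slides state only the
    assumption and the standard formulas are used here.)"""
    common_mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)  # n-weighted mean
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return common_mean, math.sqrt(pooled_var)

# Invented summary data for one year-column: (n, mean GPA, sd) per cohort
m, s = pooled_column_stats(40, 3.1, 0.6, 25, 3.3, 0.5)
```

These two pooled values are what the resampling steps below draw from for every cell in that column.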
Resampling Steps 1. Simulate the null assumption (re-stated below) in action: randomly generate* values for each cell in the source table, based on assumed common values for the mean and std. dev. for cells in the same column. Each column names a cell position in the source table. For a "No Difference" test for scores based on Treatment level, the assumption is that, for each column (Fi), both cells' values are generated by the same, common mean and standard deviation. *(Presumed: distributions are normal.)
Resampling Steps 2. For each TiFj column at lower left, interpret the first n(Treatment, Factor) random entries in the column as the new (re)sample for that specific, corresponding cell in the table. Observe, empirically, and record the means for each cell in the just-resampled table.
Resampling Steps 3. The output from Step 2 (re-displayed) represents one sample that might be generated from the underlying population if the null hypothesis is true. For this resample, calculate and record the n-weighted mean of the in-column differences of the means.
Resampling Steps 4. Add this resample's outcome to the bottom of the list of all outcomes. Resampling Steps 1-3 are repeated 5000 times. The resulting list of outcomes from the loop described approximates the expected sampling distribution for the Wt'dMeanOfDiffs statistic, under the null assumptions.
Resampling Steps 5. The p-value for the resample test (two-tailed) is the proportion of outputs for the weighted mean of differences that have a magnitude at least the size of the Sample Statistic being tested. This is the resampling-based estimate for the p-value.
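Steps 1-5 can be sketched end-to-end in Python. This is a minimal sketch under the slides' stated assumptions (normal resamples, pooled per-column null), not the author's actual template; the table values in the usage example are invented:

```python
import math
import random

def weighted_mean_of_diffs(columns):
    """Sample statistic: the n-weighted mean of the per-column differences
    of cell means (Treatment minus Comparison), weighted by column totals.
    Each column is a pair of cells, each cell a (n, mean, sd) tuple."""
    num = sum((nT + nC) * (mT - mC)
              for (nT, mT, _), (nC, mC, _) in columns)
    den = sum(nT + nC for (nT, _, _), (nC, _, _) in columns)
    return num / den

def resample_p_value(columns, reps=5000, seed=1):
    """Steps 1-5: per column, draw nT + nC normal values from the pooled
    null (common mean and sd), record the resampled statistic, and report
    the two-tailed proportion at least as large in magnitude as the
    observed statistic."""
    rng = random.Random(seed)
    observed = weighted_mean_of_diffs(columns)
    hits = 0
    for _ in range(reps):                        # Step 4: repeat Steps 1-3
        sim = []
        for (nT, mT, sT), (nC, mC, sC) in columns:
            common = (nT * mT + nC * mC) / (nT + nC)           # null mean
            pooled = math.sqrt(((nT - 1) * sT ** 2 + (nC - 1) * sC ** 2)
                               / (nT + nC - 2))                # null sd
            meanT = sum(rng.gauss(common, pooled) for _ in range(nT)) / nT
            meanC = sum(rng.gauss(common, pooled) for _ in range(nC)) / nC
            sim.append(((nT, meanT, sT), (nC, meanC, sC)))     # Steps 1-2
        if abs(weighted_mean_of_diffs(sim)) >= abs(observed):  # Steps 3, 5
            hits += 1
    return hits / reps

# Invented GPA summary table: one (n, mean, sd) tuple per cell,
# one (Pathway cell, Traditional cell) pair per year-column.
table = [((35, 3.2, 0.55), (28, 3.0, 0.60)),
         ((30, 3.3, 0.50), (26, 3.1, 0.58))]
p = resample_p_value(table, reps=2000)
```

As the slides note, the exact p-value varies slightly from run to run; fixing the seed, as above, makes a given run reproducible.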
Comparison of Results to Winner's ANOVA-based* *(as developed for the 2x2 case). *(Exact p-values produced by the method will vary slightly if the program is re-run.) **(Not formally tested by this method, but a graph can convey sufficient information.)
Comparison of Results to Winner's ANOVA-based* *(as developed for the 2x2 case). ***(For Resampling, you could reverse which variable is interpreted as the Treatment (rows), and re-run.)
Comparison of Results to Winner's ANOVA-based* *(as developed for the 2x2 case). **** NOTE: This method does not control for the effect of the column variable (Rank) in the ordinary-least-squares sense of parsing up the error terms. Yet the method does control for Rank, analogously to how a paired t-test controls for a second variable: calculations of differences based on treatment are not made all at once, but by columns, relative to each column's specific Treatment0 values for levels of the Factor variable.
Effect Size Measures: An intuitive, sample-based estimate for how large the likely "true", pair-wise difference in outcomes is, when compared by treatment levels. It does not depend on the resampling (it can be estimated directly).
Effect Size Measures: Estimate a standard error for each column's variation: (pooled s) / sqrt(nColumnFi / 2). Aggregate the standard-error estimates by an n-weighted mean of the column estimates.* Conventionally, estimate the magnitude of the effect as: z* = (Sample Stat, compared to zero) / (standard error estimate). *{Repeated testing confirms this estimate closely matches a resampling-based estimate for the sample stat's standard error.}
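The effect-size recipe on this slide can be sketched directly from the summary data, with no resampling. A minimal sketch following the slide's formulas; the function name and example numbers are invented:

```python
import math

def effect_size_z(columns):
    """z* effect-size measure per the slide: each column's SE is
    approximated as pooled_s / sqrt(n_column / 2); the column SEs are
    aggregated by an n-weighted mean; then z* = sample statistic /
    aggregated SE. Columns are pairs of (n, mean, sd) cells."""
    diff_sum = se_sum = n_total = 0.0
    for (nT, mT, sT), (nC, mC, sC) in columns:
        n_col = nT + nC
        pooled = math.sqrt(((nT - 1) * sT ** 2 + (nC - 1) * sC ** 2)
                           / (n_col - 2))
        diff_sum += n_col * (mT - mC)
        se_sum += n_col * (pooled / math.sqrt(n_col / 2))
        n_total += n_col
    sample_stat = diff_sum / n_total  # n-weighted mean of differences
    agg_se = se_sum / n_total         # n-weighted mean of column SEs
    return sample_stat / agg_se

# Invented single-column example: (n, mean GPA, sd) per cohort
z_star = effect_size_z([((40, 3.3, 0.6), (25, 3.1, 0.6))])
```

Note the caution on the next slide: z* is meant as an effect-size magnitude, not as a parametric route to the p-value.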
Effect Size Measures, Caution: It is not recommended to use the z* estimate parametrically to generate the p-value for the hypothesis test. The sampling distributions of (differences of the sample stat from zero) appear to be peaked, veering from normal in the tails, precisely where conventional p-value estimates are focused. The resampling method does not require that the sampling distribution be normal (or t, etc.) in the tails. [Slide shows a normal probability plot for the distribution of resampled sample stats.]
Robustness and Power {Fuller details will be provided in the written, Proceedings version of this paper.} Robustness: the resamples generated for the hypothesis test presume normal distributions for data in the columns. What if they are not? Upon testing, the actual Type 1 error rates for the new method appeared to equal or stay close to the nominal α presumed by the method, for these underlying distributions: (a) censored normal (e.g., if raw grades are percents and many students fail, but all grades below 50% are classified as GPA = 0); (b) uniform; (c) skewed.
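A robustness probe of this kind can be sketched in miniature: generate raw data from a non-normal (here, uniform) distribution with the null true, run a reduced one-column version of the resampling test, and see how often it rejects at the nominal level. This is an illustrative sketch, not the paper's actual study; all names and numbers are invented:

```python
import math
import random

def resample_p_one_column(cellA, cellB, reps=200, rng=random):
    """Two-tailed resampling p-value for a single column, drawing normal
    resamples under the pooled 'no difference' null. A reduced, one-column
    sketch of the method, used here only to probe its Type 1 error rate."""
    (nA, mA, sA), (nB, mB, sB) = cellA, cellB
    observed = mA - mB
    common = (nA * mA + nB * mB) / (nA + nB)
    pooled = math.sqrt(((nA - 1) * sA ** 2 + (nB - 1) * sB ** 2)
                       / (nA + nB - 2))
    hits = 0
    for _ in range(reps):
        meanA = sum(rng.gauss(common, pooled) for _ in range(nA)) / nA
        meanB = sum(rng.gauss(common, pooled) for _ in range(nB)) / nB
        if abs(meanA - meanB) >= abs(observed):
            hits += 1
    return hits / reps

def summarize(xs):
    """Reduce raw data to the (n, mean, sd) summary the method consumes."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return n, m, s

# Probe: raw data are actually uniform (non-normal) and the null is true.
# How often does the normality-presuming test reject at alpha = 0.05?
rng = random.Random(7)
runs, rejections = 100, 0
for _ in range(runs):
    raw_pathway = [rng.uniform(0.0, 4.0) for _ in range(20)]      # GPA-like
    raw_traditional = [rng.uniform(0.0, 4.0) for _ in range(15)]
    p = resample_p_one_column(summarize(raw_pathway),
                              summarize(raw_traditional),
                              reps=200, rng=rng)
    rejections += (p < 0.05)
type1_rate = rejections / runs  # should sit near the nominal 0.05
```

Swapping the `rng.uniform` draws for censored-normal or skewed generators would probe the other two cases the slide mentions.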
Robustness and Power, Power: Simplifying assumptions: source data are normally distributed; a common standard deviation σ for all cells in each run; in-column "true" differences were of the same size (in units of the common σ).
Robustness and Power: For realistic sample sizes for such problems, the average cell n has little impact on power. If (right graph) the true average column difference of means approaches 1/2 a standard deviation (relative to the common σ in the table's cells), it will most likely be picked up. Expressed in terms of aggregate standard errors for the column differences: a difference will likely be picked up if the z* to be detected is at least 1.5 standard errors.
Other Issues: The usual caveats apply about observation-based studies. For the educational application (depending on the study and data access), the same individual students may contribute multiple marks across the different years of the study, or even within single years (violating independence assumptions). This limitation would apply for a variety of methods.
Thank You! If anyone would like a copy of these slides or of the template used for running the model, please write me at: bill.goodman@uoit.ca