
Linear Regression Analysis in Intervention Study: Kenya PRIMR Case
Explore the application of linear regression in the Kenya PRIMR case, including replicating T-test analysis, DiD analysis, controlling for variables, and interpreting estimates. See examples and interpretations to grasp the significance of linear regression in modeling continuous data.
Presentation Transcript

Intervention Study: Kenya PRIMR Case
Regression Analysis
March 2017
Susan Edwards, RTI International
Sarrynna Sou, RTI International
www.rti.org
RTI International is a registered trademark and a trade name of Research Triangle Institute.

Overview
Linear Regression: Replicating T-Test Analysis; Replicating DiD Analysis; Controlling for Other Variables; Interpreting Estimates
Logistic Regression: Stata Code (Similar to Linear Regression); Controlling for Other Variables; Interpreting Odds Ratios

Linear Regression Analysis
When would you want to use linear regression?
Used to: model continuous data. Examples: super summary variables such as CLPM, ORF, etc.
Estimates: averages.
General form: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$
Model assumptions: linearity; observed data are fixed constants; errors are IID and $N(0, \sigma^2)$.

Linear Regression Analysis: Example
Recall: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$
orf = 75 + 2.7 I(female) - 3.7 (age)
Interpretations:
This model suggests that girls read on average 2.7 words per minute more than boys when controlling for student age.
Assuming age is a continuous variable, for every additional year of age students read on average 3.7 fewer words per minute when controlling for gender.
A 10-year-old male student will read on average 75 - 3.7*10 = 38 words per minute.
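
The interpretations of the fitted equation can be checked with a few lines of arithmetic. The sketch below is a minimal Python illustration, not part of the original Stata workflow; the function name `predict_orf` is hypothetical, and the coefficients are simply copied from the example model.

```python
# Coefficients copied from the example model:
#   orf = 75 + 2.7*I(female) - 3.7*age
INTERCEPT = 75.0
B_FEMALE = 2.7
B_AGE = -3.7


def predict_orf(female: int, age: float) -> float:
    """Predicted oral reading fluency in wpm; female is 1 for girls, 0 for boys."""
    return INTERCEPT + B_FEMALE * female + B_AGE * age


boy_10 = predict_orf(female=0, age=10)
girl_10 = predict_orf(female=1, age=10)

print(round(boy_10, 1))            # 38.0
print(round(girl_10, 1))           # 40.7
print(round(girl_10 - boy_10, 1))  # 2.7, the gender gap at any fixed age
```

Because the model has no interaction term, the 2.7 wpm gap between girls and boys is the same at every age.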

Linear Regression Analysis: Reference Cell Coding
Recall: orf = 75 + 2.7 I(female) - 3.7 (age)
Interpretation: This model suggests that girls read on average 2.7 words per minute more than boys when controlling for student age.
Reference cell coding: One level of a categorical variable is chosen as the reference. All other estimates are presented in comparison to the reference.
Example: Female has 2 levels: 0 = Male (the reference), 1 = Female. The term 2.7 I(female) is the difference in wpm between females and males.

Linear Regression Analysis: Reference Cell Coding
Example with more than 2 levels: Age Category (3 levels)
0 = Younger than Grade Level (below 7)
1 = At Grade Level (7 or 8)
2 = Older than Grade Level (above 8)
Model for ORF: ORF = 50 + 0.2 I(At Grade Level) - 13 I(Older than Grade Level)
Questions:
What is the average fluency for students in public schools?
What is the average fluency for students in private schools?
Do students in public schools perform better on average than students in religious schools?
Do students in private schools perform better on average than students in religious schools?
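
With reference cell coding, the intercept is the mean of the reference level and each coefficient is an offset from it. The following Python sketch applies that reading to the age-category model; the dictionary layout and the function name `mean_orf` are illustrative assumptions, while the numbers come from the model shown.

```python
# Coefficients from the age-category model:
#   ORF = 50 + 0.2*I(At Grade Level) - 13*I(Older than Grade Level)
INTERCEPT = 50.0
OFFSETS = {
    "younger than grade level": 0.0,   # reference level, absorbed by the intercept
    "at grade level": 0.2,
    "older than grade level": -13.0,
}


def mean_orf(level: str) -> float:
    """Model-implied average fluency (wpm) for one age category."""
    return INTERCEPT + OFFSETS[level]


for level in OFFSETS:
    print(f"{level}: {mean_orf(level):.1f} wpm")
# younger than grade level: 50.0 wpm
# at grade level: 50.2 wpm
# older than grade level: 37.0 wpm
```

The same pattern answers any "what is the average for level X" question for a categorical predictor: add that level's coefficient to the intercept.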

Categorical vs. Continuous Independent Variables
Why do we care? Stata cares.

Categorical
Definition: A variable that can be divided into distinct categories.
Examples: gender, age category
Stata code: Start the variable with i. followed by the variable name: i.<variable name>. This uses reference cell coding.

Continuous
Definition: A variable that theoretically could go on forever.
Examples: orf, age
Stata code: List the variable name in the equation line.

Reading comprehension score? Generally ranges from 0 to 5.

Linear Regression Analysis: Stata Example
Recall: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$
orf = 75 + 2.7 I(female) - 3.7 (age)
Stata code:
    svy: reg eq_orf i.female age

Linear Regression Analysis: Stata Activity
Recall: Stata code to fit a model for gender and age:
    svy: reg eq_orf i.female age
Fit a linear model for English fluency (eq_orf) that accounts for the following school factors: nonformal and enrolment.
    svy: reg eq_orf i.nonformal enrolment
Why does nonformal have an i. in front of the variable name?
What type of variable is enrolment in this model?
How would we change enrolment to be a categorical variable?
Would the model work if we typed the following?
    svy: reg enrolment i.nonformal eq_orf

T-Test Results with Linear Regression in Stata
Recall: T-tests compare the means of two groups.
Example:
    ttest eq_orf, by(treat_phase)
Is there a difference between baseline and endline scores?

                     October 2012    October 2013
Mean (N)             48 wpm (913)    53 wpm (922)
Difference (S.E.)    4.4 wpm (1.7)
T-Stat (DOF)         2.59 (1833)

$H_0: \mu_{2012} = \mu_{2013}$; $H_a: \mu_{2012} \neq \mu_{2013}$
P-value = 0.0095; reject $H_0$.
How can we use linear regression to duplicate these results?
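
The reported test statistic and degrees of freedom follow directly from the summary numbers. The Python sketch below redoes that arithmetic from the rounded values on the slide; the p-value line uses a normal approximation (an assumption made here to stay within the standard library), so it will be close to, but not exactly, the reported 0.0095.

```python
import math

n_baseline, n_endline = 913, 922   # sample sizes from the table
diff, se = 4.4, 1.7                # rounded difference and standard error

dof = n_baseline + n_endline - 2   # pooled two-sample degrees of freedom
t_stat = diff / se

# Two-sided p-value under a normal approximation; with about 1,800 degrees
# of freedom the t distribution is very close to the normal.
p_approx = math.erfc(abs(t_stat) / math.sqrt(2))

print(dof)               # 1833
print(round(t_stat, 2))  # 2.59
print(p_approx)          # slightly under 0.01
```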

T-Test Results with Linear Regression in Stata
Recall:
    ttest eq_orf, by(treat_phase)
How can we use linear regression to duplicate these results?
How many variables are used in the ttest command? Two: eq_orf and treat_phase.
Use a linear regression model that contains only the two variables of interest.
What would the Stata code for the model look like?
    reg eq_orf i.treat_phase

T-Test Results with Linear Regression in Stata
Recall:
    ttest eq_orf, by(treat_phase)
    reg eq_orf i.treat_phase

                     October 2012    October 2013
Mean (N)             48 wpm (913)    53 wpm (922)
Difference (S.E.)    4.4 wpm (1.7)
T-Stat (DOF)         2.59 (1833)

$H_0: \mu_{2012} = \mu_{2013}$; $H_a: \mu_{2012} \neq \mu_{2013}$
P-value = 0.0095; reject $H_0$.
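
The reason ttest and reg give the same answer is that a regression on a single 0/1 indicator estimates the difference in group means, with a standard error equal to the pooled-variance standard error of the two-sample t-test. The Python sketch below demonstrates this on a small data set that is invented purely for illustration (it is not PRIMR data) and ignores survey weights.

```python
import math
from statistics import mean

# Hypothetical scores, invented for illustration only.
baseline = [40.0, 52.0, 47.0, 55.0, 46.0]
endline = [50.0, 58.0, 49.0, 61.0, 57.0]

# Two-sample t-test with pooled (equal) variance.
n0, n1 = len(baseline), len(endline)
m0, m1 = mean(baseline), mean(endline)
ss0 = sum((y - m0) ** 2 for y in baseline)
ss1 = sum((y - m1) ** 2 for y in endline)
pooled_var = (ss0 + ss1) / (n0 + n1 - 2)
t_ttest = (m1 - m0) / math.sqrt(pooled_var * (1 / n0 + 1 / n1))

# Ordinary least squares of y on a 0/1 indicator x (0 = baseline, 1 = endline).
x = [0.0] * n0 + [1.0] * n1
y = baseline + endline
xbar, ybar = mean(x), mean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
intercept = ybar - slope * xbar
rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
se_slope = math.sqrt(rss / (len(y) - 2) / sxx)
t_reg = slope / se_slope

print(intercept, slope)   # baseline mean and endline-minus-baseline difference
print(t_ttest, t_reg)     # the two t statistics agree
```

The intercept is the baseline mean and the slope is the endline-minus-baseline difference. Stata's ttest reports the difference as the first group minus the second, so its sign may be flipped relative to the regression coefficient.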

Linear Regression in Stata: Controlling for Other Variables
Want to know: the effect of certain variables when other variables we know to be influential are controlled.
Recall: orf = 75 + 2.7 I(female) - 3.7 (age)
In this model, we may already know that older students are less fluent readers because they are repeating the grade or have taken a long break between school years. But we want to know if gender influences fluency once age is controlled.
When do we use models with multiple variables? To determine demographic and SSME impact.
What variables must be in these models? Variables that we know strongly influence the outcome, and sample design variables: treatment, gender, time.

Linear Regression in Stata: Controlling for Other Variables - Example
Fit a model for English fluency that accounts for treatment, time, gender, and formal/nonformal school type.
Question of interest: Once design variables are controlled for, is there a difference between students in formal and nonformal schools?
Stata code:
    svy: reg eq_orf i.treatment i.treat_phase i.treatment#i.treat_phase i.female i.nonformal
Interpretation: Students in nonformal schools read on average 29 wpm more than students in formal schools when study design is controlled.
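
The i.treatment#i.treat_phase term in the model is the difference-in-differences (DiD) piece of the study design: its coefficient is the change over time in the treatment group minus the change over time in the control group. The Python sketch below shows that relationship using four cell means that are hypothetical (invented for illustration, not PRIMR estimates) and that ignore the other covariates and the survey weights.

```python
# Hypothetical cell means in wpm, invented for illustration only.
cell_means = {
    ("control", "baseline"): 45.0,
    ("control", "endline"): 48.0,
    ("treatment", "baseline"): 46.0,
    ("treatment", "endline"): 56.0,
}

# Coefficients of the saturated model y ~ treatment + phase + treatment#phase,
# with control and baseline as the reference levels.
intercept = cell_means[("control", "baseline")]
b_treatment = cell_means[("treatment", "baseline")] - intercept
b_phase = cell_means[("control", "endline")] - intercept

change_treatment = cell_means[("treatment", "endline")] - cell_means[("treatment", "baseline")]
change_control = cell_means[("control", "endline")] - cell_means[("control", "baseline")]
b_interaction = change_treatment - change_control  # the DiD estimate

print(b_interaction)  # 7.0

# The four coefficients reproduce the treatment-endline cell mean.
print(intercept + b_treatment + b_phase + b_interaction)  # 56.0
```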

Linear Regression in Stata: Controlling for Other Variables - Activity
Activity: Determine whether any of the other SSME variables make a difference in student English reading fluency (eq_orf).

Linear vs Logistic Regression
When would you want to use logistic regression?
Used to: model binomial (two-category) data.
Examples:
Zero scores: 0 = Score above zero on task; 1 = Score equal to zero on task
Reading comprehension of 80% or better: 0 = Reading comprehension score < 80%; 1 = Reading comprehension score >= 80%
Estimates: probabilities and odds ratios.

Linear vs Logistic Regression
When would you want to use logistic regression?
Used to: model binomial (two-category) data.
Estimates: probabilities and odds ratios.
General form: $\operatorname{logit}(p) = \log\frac{p}{1-p} = \beta_0 + \beta_1 X_1$, where $p$ is the probability of success and $p = \frac{\exp(\operatorname{logit}(p))}{1 + \exp(\operatorname{logit}(p))}$
Model assumptions:
Data are from a stratified SRS.
Responses are independent between respondents.
Sample size is large: 80% of predicted counts are at or above 5, and all expected counts are larger than 2.
Model is specified correctly.

Linear vs Logistic Regression: Covariates & Odds Ratios
Covariates are connected to odds ratios.
Example: logit(Pr(Reading Comprehension 80%+)) = -3 + 0.76 I(Has English Book)
Odds ratio, English book vs. no English book: exp(0.76) = 2.14
Interpretation: On average, the odds of comprehending at least 80% of a connected text are about 2 times higher for students with English books than for students without English books.
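
The odds ratio is the exponentiated coefficient, and it equals the ratio of the odds in the two groups. The Python sketch below checks both routes using the coefficients from the example; the variable names are illustrative.

```python
import math

B0, B1 = -3.0, 0.76   # intercept and English-book coefficient from the example

# Route 1: exponentiate the coefficient.
odds_ratio = math.exp(B1)

# Route 2: compute the odds in each group and take their ratio.
odds_no_book = math.exp(B0)
odds_book = math.exp(B0 + B1)

print(round(odds_ratio, 2))                # 2.14
print(round(odds_book / odds_no_book, 2))  # 2.14
```

Note that this is a ratio of odds, not of probabilities; the two are close only when the outcome is uncommon in both groups, as it is here.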

Linear vs Logistic Regression: Covariates & Probabilities
Covariates are connected to probabilities.
Example: logit(Pr(Reading Comprehension 80%+)) = -3 + 0.76 I(Has English Book)
Probability: $\Pr(80\%+ \mid \text{English book}) = \frac{\exp(-3 + 0.76)}{1 + \exp(-3 + 0.76)} = 0.098$ as reported (evaluating the formula with the rounded coefficients shown gives approximately 0.096).
Interpretation: On average, a student with an English book has a 9.8% probability of comprehending at least 80% of a passage.
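
The probability calculation is the inverse-logit transform applied to the linear predictor. The Python sketch below evaluates it with the rounded coefficients from the example; the helper name `inv_logit` is an assumption of this sketch. With these rounded inputs the result is about 0.096, slightly below the 0.098 reported on the slide, which may reflect unrounded estimates in the original output.

```python
import math


def inv_logit(z: float) -> float:
    """Convert a log-odds value into a probability."""
    return math.exp(z) / (1.0 + math.exp(z))


B0, B1 = -3.0, 0.76   # rounded coefficients from the example

p_book = inv_logit(B0 + B1)   # student has an English book
p_no_book = inv_logit(B0)     # student has no English book

print(round(p_book, 3))     # 0.096
print(round(p_no_book, 3))  # 0.047
```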

Logistic Regression Analysis: Stata Example
Recall: logit(Pr(Reading Comprehension 80%+)) = -3 + 0.76 I(Has English Book)
Note: The code is very similar to linear regression.
Stata code:
    svy: logistic eq_read_comp_score_pcnt80 i.e_book
    svy: logistic eq_read_comp_score_pcnt80 i.e_book, coef
Why does e_book have an i. in front of the variable name?
Why doesn't eq_read_comp_score_pcnt80 have an i.?
What is the difference between the two lines of code?

More Information
Susan Edwards, Research Statistician, 919.316.3541, SEdwards@RTI.org
Sarrynna Sou, Statistician, 919.485.2722, SSou@RTI.org