
Logistic Regression for Generalized Linear Models
Explore the relationship between temperature and O-ring failure in the context of the tragic Challenger space shuttle launch. Learn about logistic regression, parameter estimation, and likelihood methods in predictive modeling.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Generalized Linear Models Xuhua Xia xxia@uottawa.ca http://dambe.bio.uottawa.ca
Ex1: Spaceship launch Jan. 28, 1986, the tenth flight of Space Shuttle Challenger (OV-99): failure at 73 seconds into its flight, killing all seven crew members. The spacecraft disintegrated over the Atlantic Ocean, off the coast of Cape Canaveral, Florida, at 11:39 EST, due to failure of an O-ring seal. Forecasts for January 28 predicted an unusually cold morning, with temperatures close to 1 C (30 F), the minimum temperature permitted for launch. Relationship between O-ring failure and temperature Xuhua Xia
Data In the past 24 launches, the failure of at least one O- ring is recorded (1: failure, 0: success) with temperature in Fahrenheit. Objective: What is probability of failure (p) at a given temperature (T)? Such a relationship should allow us to predict p given T and decide whether we should take the chance or not. This is a problem for logistic regression which is a special case of generalized linear model. However, we can quickly demonstrate the relationship between T and p by observing that the mean T is 64.4 for failure and 72.2 for sucess, which differ significantly (p = 0.0156, two-tailed). Thus, low temperature is associated with O-ring failure. However, what is the probability of failure at 30 F? LaunchT Failure 53 56 57 63 66 67 67 67 68 69 70 70 70 70 72 73 75 75 76 76 78 79 80 81 1 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 Xuhua Xia
Logistic regression p: probability of failure logit: ln(p/q), where q = 1-p linear model specifying the relationship between p and T: ln(p/q) = a+b*T, i.e., p is a function of T with parameters a and b How to obtain the best estimate of a and b? There are various approaches to estimate a and b: The simplest is just to guess: Failure is less likely in high temperature, i.e., p decreases with increasing T, so b should be negative. If T = 0, then p must be close to one (O-rign is not designed to work under such low temperature), so ln(p/q) must be very large, i.e., a >>0.
Consequences of various guesses 0.7 1 0.9 0.6 a=10.875 b=-0.171 0.8 a = 7 b = -0.12 b=-0.12 a=7 0.5 0.7 0.6 0.4 E(p) E(p) 0.5 0.3 0.4 0.3 0.2 0.2 0.1 0.1 0 0 0 10 20 30 40 50 60 70 80 90 0 10 20 30 40 50 60 70 80 90 T T 1 Who has the best guess? 0.98 a=10 a = 10 b = -0.1 b=-0.1 0.96 Model and parameter estimation 0.94 E(p) 0.92 We want to find a and b that maximizes the likelihood or log- likelihood (lnL) 0.9 0.88 0.86 0 10 20 30 40 50 60 70 80 90 T Xuhua Xia
The likelihood method LaunchT Failure 53 56 57 63 66 67 67 67 68 69 70 70 70 70 72 73 75 75 76 76 78 79 80 81 a = 10.875 b = -0.171 E(P) lnL ? ln 1 0.857584287 -0.15364 1 0.782688677 -0.24502 1 0.752144458 -0.28483 0 0.520526483 -0.73507 0 0.393693486 -0.50037 0 0.35362685 -0.43638 0 0.35362685 -0.43638 0 0.35362685 -0.43638 0 0.315515885 -0.37909 0 0.27973467 -0.32814 0 0.246549634 -0.28309 1 0.246549634 -1.40019 1 0.246549634 -1.40019 1 0.246549634 -1.40019 0 0.188506602 -0.20888 0 0.16368453 -0.17875 0 0.121991133 1 0.121991133 -2.10381 0 0.104796562 0 0.104796562 0 0.076726849 -0.07983 0 0.065436759 -0.06768 0 0.055707728 -0.05732 0 0.047351904 -0.04851 lnL = -11.5152 = ? + ?? 1 ? ??+?? 1 + ??+?? E(p) ? = -0.1301 To find a and b that maximizes lnL, double-click the EXCEL insert, copy/paste to an EXCEL sheet and use SOLVER to solve for a and b. -0.1107 -0.1107 Xuhua Xia
Analytically: 24 24 LaunchT Failure 53 56 57 63 66 67 67 67 68 69 70 70 70 70 72 73 75 75 76 76 78 79 80 81 a = 10.875 b = -0.171 E(P) lnL 1 0.857584287 -0.15364 1 0.782688677 -0.24502 1 0.752144458 -0.28483 0 0.520526483 -0.73507 0 0.393693486 -0.50037 0 0.35362685 -0.43638 0 0.35362685 -0.43638 0 0.35362685 -0.43638 0 0.315515885 -0.37909 0 0.27973467 -0.32814 0 0.246549634 -0.28309 1 0.246549634 -1.40019 1 0.246549634 -1.40019 1 0.246549634 -1.40019 0 0.188506602 -0.20888 0 0.16368453 -0.17875 0 0.121991133 1 0.121991133 -2.10381 0 0.104796562 0 0.104796562 0 0.076726849 -0.07983 0 0.065436759 -0.06768 0 0.055707728 -0.05732 0 0.047351904 -0.04851 lnL = -11.5152 ? = ?? ??? = ?? ?? ?=1 ?=1 ??+??? 1 + ??+??? for Failure =1 ??+??? 1 + ??+??? for Failure =0 ??= 1 -0.1301 Take partial derivatives of lnL with respect to a and b, set the two partial derivatives to 0 and solve the two simultaneous equations for a and b. -0.1107 -0.1107 Xuhua Xia
Or use R: Copy the first two columns of data (LaunchT and Failure) md <- read.table("clipboard",header=T) fit <- glm(Failure~LaunchT,binomial,data=md) summary(fit) summary(fit)$coefficients[,2] # get s_a and s_b confint(fit) # 95% CI for the coefficients predict(fit, type="response") # predicted values Call: glm(formula = Failure ~ LaunchT, family = binomial, data = md) Deviance Residuals: Min 1Q Median 3Q Max -1.2125 -0.8253 -0.4706 0.5907 2.0512 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 10.87535 5.70291 1.907 0.0565 . LaunchT -0.17132 0.08344 -2.053 0.0400 *
Fitted curve and extrapolated p ? ln = ? + ?? = 10.87542 0.17132? 1 ? 1 0.9 0.8 Probability of failure (p) 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 30 40 50 60 70 80 Temperature (T) Xuhua Xia
Still birth & toxicity N_sb 15 17 22 38 144 N_fetus 297 242 312 299 285 Concentration 0 62.5 125 250 500 Fit GLM in EXCEL Compute null and residual deviance and perform LRT Compute deviance residual md<-read.table("clipboard",header=T) attach(md) N_good<-N_fetus-N_sb Y<-cbind(N_sb,N_good) fit<-glm(Y~Concentration, family=binomial) summary(fit) summary(fit)$coefficients[,1] summary(fit)$coefficients[,2] # resid(fit, type='response')*fit$prior.weights confint(fit) # 95% CI for the coefficients predict(fit, type="response") # predicted values Xuhua Xia
GLM vs conventional transformation Take logistic regression for example (probit or logit transformation versus GLM 1. probit or logit transformation of p cannot handle proportions of 0 and 1, but GLM can. 2. using proportions obscure differences in sample size, i.e., 0.5 can be from 1/2 or from 100/200, but GLM incorporate the numbers in the likelihood function. Xuhua Xia
Survival data of the Titanic available in R: Titanic Class Crew does not have Child category, so: df = 2 for Class:Age, but df = 3 for Class:Sex df = 2 for 3-way interaction, e.g. if background is 1st_Female_Adult, then 3-way differences include: 2nd_male_child 3rd_male_child Class 1st 1st 1st 1st 2nd 2nd 2nd 2nd 3rd 3rd 3rd 3rd Crew Crew Sex Female Female Male Male Female Female Male Male Female Female Male Male Female Male Age Adult Child Adult Child Adult Child Adult Child Adult Child Adult Child Adult Adult Survive Yes 140 1 57 5 80 13 14 11 76 14 75 13 20 192 No 4 0 118 0 13 0 154 0 89 17 387 35 3 670 Copy the red part and read the clipboard in R
Logistic regression: Survival md<-read.table("Titanic.txt",header=T) attach(md) # fullModel <-glm(Survived~Class*Sex*Age,family=binomial,weights=Freq) Survived<-cbind(Yes,No) fullModel <-glm(Survived~Class*Sex*Age,family=binomial) confint(fullModel) # 95% CI for the coefficients predict(fullModel, type="response") # predicted p_survival values fit<-step(fullModel,direction="both") # will find the best model summary(fit) summary(fit)$coefficients[,2] # get SE confint(fit) # 95% CI for the coefficients nd<-data.frame(md,Survival=predict(fit, type="response")) # predicted values Coefficients: (1 not defined because of singularities) Estimate Std. Error z value Pr(>|z|) (Intercept) 3.55535 0.50709 7.011 2.36e-12 *** Class2nd -1.73827 0.58870 -2.953 0.00315 ** Class3rd -3.77275 0.52878 -7.135 9.69e-13 *** ClassCrew -1.65823 0.80030 -2.072 0.03826 * SexMale -4.28298 0.53213 -8.049 8.36e-16 *** AgeChild 15.28493 392.50617 0.039 0.96894 Class2nd:SexMale 0.06801 0.67120 0.101 0.91929 Class3rd:SexMale 2.89768 0.56364 5.141 2.73e-07 *** ClassCrew:SexMale 1.13608 0.82048 1.385 0.16616 Class2nd:AgeChild 2.19967 520.81278 0.004 0.99663 Class3rd:AgeChild -14.94702 392.50626 -0.038 0.96962
Predicted survival > nd<-data.frame(md,Survival=predict(fit, type="response")) > nd Class Sex Age Yes No Survival 1 1st Female Adult 140 4 0.97222222 2 1st Female Child 1 0 1.00000000 3 1st Male Adult 57 118 0.32571429 4 1st Male Child 5 0 1.00000000 5 2nd Female Adult 80 13 0.86021505 6 2nd Female Child 13 0 1.00000000 7 2nd Male Adult 14 154 0.08333333 8 2nd Male Child 11 0 1.00000000 9 3rd Female Adult 76 89 0.44586174 10 3rd Female Child 14 17 0.53009073 11 3rd Male Adult 75 387 0.16760349 12 3rd Male Child 13 35 0.22014973 13 Crew Female Adult 20 3 0.86956522 14 Crew Male Adult 192 670 0.22273782 Xuhua Xia