Addressing Endogeneity Concerns in Non-linear Models: Econometric Methods

Learn about addressing endogeneity concerns in non-linear models in econometrics, focusing on projects involving employee departure and charitable giving. Discover the challenges faced with non-linear models like Hazard and Hurdle models and explore ways to handle endogeneity. Gain insights into constructing the endogeneity issue and understanding correlations in your models for effective analysis.

  • Econometrics
  • Non-linear Models
  • Endogeneity
  • Hazard Model
  • Hurdle Model

Presentation Transcript


  1. Econometrics I Professor William Greene Stern School of Business Department of Economics 12-1/54 Part 12: Endogeneity

  2. Econometrics I, Part 12: Endogeneity and IV Estimation

  3. Sources of Endogeneity: Omitted Variables; Ignored Heterogeneity; Measurement Error; Endogenous Treatment Effects; Nonrandom Sampling (or Attrition)

  4. In two of my projects, I was asked by reviewers to address endogeneity concerns. In one project, I regress employee departure on project termination. Arguably, project termination is not exogenous. In the other, I regress firms' charitable giving in specific countries on their business activities in the local community. Again, business presence in a country is not exogenous. The problem is that both papers use non-linear models (a hazard model in one and a hurdle model in the other), which are required by the data I have. Are you aware of any econometric methods to deal with endogeneity in non-linear models? My search online did not go anywhere.
     Hazard Model: Not a linear model. Prob[event happens in time interval $t$ to $t+\Delta$ | event happens after time $t$] = a function of $(x'\beta)$

  5. I have been asked this question (or ones like it) dozens of times. I think the issue is getting way overplayed. But I'm not the majority voice, so you are going to have to deal with this. Step 1: you or the referee need to figure out (make a case for) by what construction "project termination" is endogenous. What is correlated with what **in your hazard model** that makes the variable endogenous? There must be a second equation that implies that project termination is endogenous. What is it? What unobservable in that equation is correlated with what unobservable in the hazard model that makes it endogenous? Same questions for your hurdle model.
     By what construction is SNAP endogenous in the HEALTH equation?
     $SNAP = X\beta_{SNAP} + Z\delta + \varepsilon$
     $HEALTH = X\beta_{HEALTH} + \gamma\,SNAP + v$

  6. Source of Endogeneity: Omitted Variable. Aggregate Data and Multinomial Choice: The Model of Berry, Levinsohn and Pakes

  7. Theoretical Foundation
     Consumer market for J differentiated brands of a good:
     $j = 1, \ldots, J_t$ brands or types
     $i = 1, \ldots, N$ consumers
     $t = 1, \ldots, T$ markets (like panel data)
     Consumer i's utility for brand j (in market t) depends on
     p = price
     x = observable attributes
     f = unobserved attributes
     w = unobserved heterogeneity across consumers
     $\varepsilon$ = idiosyncratic aspects of consumer preferences
     Observed data consist of aggregate choices, prices and features of the brands.

  8. BLP Automobile Market [slide table not recovered; surviving labels: $J_t$, N, P, X, t]

  9. Random Utility Model
     Utility: $U_{ijt} = U(w_i, p_{jt}, x_{jt}, f_{jt}, \varepsilon_{ijt} \mid \theta)$, $i = 1, \ldots, N$ (large), $j = 1, \ldots, J$
     $w_i$ = individual heterogeneity; time (market) invariant. $w$ has a continuous distribution across the population.
     $p_{jt}, x_{jt}, f_{jt}$ = price, observed attributes, unobserved features of brand j; all may vary through time (across markets).
     Revealed Preference: Choice j provides maximum utility.
     Across the population, given market t, set of prices $p_t$ and features $(X_t, f_t)$, there is a set of values of $w_i$ that induces choice j, for each $j = 1, \ldots, J_t$; then $s_j(p_t, X_t, f_t \mid \theta)$ is the market share of brand j in market t.
     There is an outside good that attracts a nonnegligible market share, $j = 0$. Therefore, $\sum_{j=1}^{J_t} s_j(p_t, X_t, f_t \mid \theta) < 1$.

  10. Endogenous Prices: Demand Side
      $U_{ijt} = U(w_i, p_{jt}, x_{jt}, f_{jt}, \varepsilon_{ijt} \mid \theta) = x_{jt}'\beta_i - \alpha p_{jt} + f_{jt} + \varepsilon_{ijt}$
      $f_{jt}$ is the unobserved features of model j. Utility responds to the unobserved $f_{jt}$.
      Price $p_{jt}$ is partly determined by features $f_{jt}$.
      In a choice model based on observables, price is correlated with the unobservables that determine the observed choices.

  11. An Early Study of an Endogeneity Problem (Snow, J., On the Mode of Communication of Cholera, 1855) http://www.ph.ucla.edu/epi/snow/snowbook3.html
      London cholera epidemic, ca. 1853-4.
      Cholera = f(Water Purity, u) + $\varepsilon$. What is the causal effect of water purity on cholera?
      Purity = f(cholera-prone environment (poor, garbage in streets, rodents, etc.)). Regression does not work.
      [Slide graphic not recovered; surviving labels: two London water companies, Lambeth and Southwark & Vauxhall; main sewage discharge; River Thames]
      Paul Grootendorst: A Review of Instrumental Variables Estimation of Treatment Effects, http://individual.utoronto.ca/grootendorst/pdf/IV_Paper_Sept6_2007.pdf
      A review of instrumental variables estimation in the applied health sciences. Health Services and Outcomes Research Methodology 2007; 7(3-4):159-179.

  12. Investigation Using an Instrumental Variable
      Theory: Cholera = f(BadWater, Other Factors)
      (Stylized) Model: $C = \beta_0 + \beta_1 B + \varepsilon$ (C = 0/1 = no/yes; B = 0/1 = good/bad; $\varepsilon$ = other factors)
      Interesting measure of causal effect of bad water: $\beta_1$
      Endogeneity Problem: Cholera-prone environment u affects B and $\varepsilon$. $B(u)$ and $\varepsilon(u)$ are correlated because of u.
      Interpret this to say $E[C \mid B] \neq \beta_0 + \beta_1 B$ because $E[\varepsilon \mid B] \neq 0$.
      Confounding Effect:
      $E[C \mid B=1] = \beta_0 + \beta_1 + E[\varepsilon \mid B=1]$
      $E[C \mid B=0] = \beta_0 + E[\varepsilon \mid B=0]$
      $E[C \mid B=1] - E[C \mid B=0] = \beta_1 + \{E[\varepsilon \mid B=1] - E[\varepsilon \mid B=0]\}$
      Conclusion: Comparing cholera rates of those with bad water (measurable) to those with good water, $P(C \mid B=1) - P(C \mid B=0)$, does not reveal the water effect.

  13. Instrumental Variable:
      L = 1 if water supplied by Lambeth
      L = 0 if water supplied by Southwark/Vauxhall
      Relevant? Is $E[B \mid L=1] \neq E[B \mid L=0]$? That is Snow's theory, that the water supply is partly the culprit, and because of their location, Lambeth provided purer water than Southwark.
      Exogenous? Is $E[\varepsilon \mid L=1] - E[\varepsilon \mid L=0] = 0$? Water supply is randomly supplied to houses. Homeowners do not even know which supplier is providing their water. "Assignment is random."
      Using the IV in $E[C \mid L] = \beta_0 + \beta_1 E[B \mid L] + E[\varepsilon \mid L]$:
      $E[C \mid L=1] = \beta_0 + \beta_1 E[B \mid L=1] + E[\varepsilon \mid L=1]$
      $E[C \mid L=0] = \beta_0 + \beta_1 E[B \mid L=0] + E[\varepsilon \mid L=0]$
      Estimating Equation: $E[C \mid L=1] - E[C \mid L=0] = \beta_1\{E[B \mid L=1] - E[B \mid L=0]\} + \{E[\varepsilon \mid L=1] - E[\varepsilon \mid L=0]\}$, where the last term is zero because L is exogenous.

  14. IV Estimator:
      $\beta_1 = \dfrac{E[C \mid L=1] - E[C \mid L=0]}{E[B \mid L=1] - E[B \mid L=0]}$
      (Note: the nonzero denominator is the relevance condition.)
      Operational:
      P(C|L=1) = proportion of observations supplied by Lambeth that have cholera
      P(C|L=0) = proportion of observations supplied by Southwark that have cholera
      P(B|L=1) = proportion of observations supplied by Lambeth with bad water
      P(B|L=0) = proportion of observations supplied by Southwark with bad water
      Estimate: $b_1 = \dfrac{P(C \mid L=1) - P(C \mid L=0)}{P(B \mid L=1) - P(B \mid L=0)}$, which is (broadly) $\dfrac{Cov(C, L)}{Cov(B, L)}$ (the Wald estimator).
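The Wald estimator on this slide can be illustrated with a small simulation. This is only a sketch under assumed values: the data-generating process, the coefficients (`beta0`, `beta1`), and the sample size are all hypothetical and are not Snow's data. The outcome is also kept continuous for simplicity, in the spirit of the stylized model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta0, beta1 = 0.1, 0.5            # assumed "true" values for the simulation

u = rng.normal(size=n)             # unobserved cholera-prone environment
L = rng.integers(0, 2, size=n)     # instrument: 1 = Lambeth, assigned at random
# Bad water is more likely in a bad environment and less likely with Lambeth.
B = (0.8 * u - 1.0 * L + rng.normal(size=n) > 0).astype(float)
eps = 0.3 * u + 0.1 * rng.normal(size=n)   # other factors, correlated with B through u
C = beta0 + beta1 * B + eps        # stylized (continuous) outcome

# Naive comparison by water quality: contaminated by E[eps|B=1] - E[eps|B=0].
naive = C[B == 1].mean() - C[B == 0].mean()

# Wald estimator: difference in outcomes over difference in exposure, by supplier.
wald = (C[L == 1].mean() - C[L == 0].mean()) / (B[L == 1].mean() - B[L == 0].mean())

# Covariance-ratio form of the same estimator.
cov_ratio = np.cov(C, L)[0, 1] / np.cov(B, L)[0, 1]

print(f"true beta1 = {beta1}, naive = {naive:.3f}, "
      f"Wald = {wald:.3f}, Cov ratio = {cov_ratio:.3f}")
```

With a binary instrument the ratio of group-mean differences and the covariance ratio coincide in the sample, which is why the slide can write the estimator either way.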

  15. Cornwell and Rupert Data
      Cornwell and Rupert Returns to Schooling Data, 595 Individuals, 7 Years
      Variables in the file are
      EXP   = work experience
      WKS   = weeks worked
      OCC   = occupation, 1 if blue collar
      IND   = 1 if manufacturing industry
      SOUTH = 1 if resides in south
      SMSA  = 1 if resides in a city (SMSA)
      MS    = 1 if married
      FEM   = 1 if female
      UNION = 1 if wage set by union contract
      ED    = years of education
      LWAGE = log of wage = dependent variable in regressions
      These data were analyzed in Cornwell, C. and Rupert, P., "Efficient Estimation with Panel Data: An Empirical Comparison of Instrumental Variable Estimators," Journal of Applied Econometrics, 3, 1988, pp. 149-155. See Baltagi, page 122 for further analysis. The data were downloaded from the website for Baltagi's text.

  16. Specification: Quadratic Effect of Experience

  17. The Effect of Education on LWAGE
      $LWAGE = \beta_1 + \beta_2 EDUC + \beta_3 EXP + \beta_4 EXP^2 + \ldots + \varepsilon$
      What is $\varepsilon$? Ability, ... + everything else.
      $EDUC = f(GENDER, SMSA, SOUTH, Ability, \ldots, u)$

  18. What Influences LWAGE?
      $LWAGE = \beta_1 + \beta_2 EDUC(X, Ability, \ldots) + \beta_3 EXP + \beta_4 EXP^2 + \ldots + \varepsilon(Ability)$
      Increased Ability is associated with increases in $EDUC(X, Ability, \ldots, u)$ and $\varepsilon(Ability)$.
      What looks like an effect due to an increase in EDUC may be an increase in Ability. The estimate of $\beta_2$ picks up the effect of EDUC and the hidden effect of Ability.

  19. An Exogenous Influence
      $LWAGE = \beta_1 + \beta_2 EDUC(X, Z, Ability, \ldots) + \beta_3 EXP + \beta_4 EXP^2 + \ldots + \varepsilon(Ability)$
      Increased Z is associated with increases in $EDUC(X, Z, Ability, \ldots, u)$ and not $\varepsilon(Ability)$.
      An effect due to the effect of an increase in Z on EDUC will only be an increase in EDUC. The estimate of $\beta_2$ picks up the effect of EDUC only.
      Z is an Instrumental Variable.

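  20. How to use the instrument (the exogenous information)?
      $y = \beta x + \varepsilon$
      Regression: $\dfrac{x'y/n}{x'x/n} = \beta\,\dfrac{x'x/n}{x'x/n} + \dfrac{x'\varepsilon/n}{x'x/n} = \beta + \dfrac{x'\varepsilon/n}{x'x/n}$, which does not reveal $\beta$ because $x'\varepsilon/n$ does not vanish.
      Instrumental variable z: $\dfrac{z'y/n}{z'x/n} = \beta\,\dfrac{z'x/n}{z'x/n} + \dfrac{z'\varepsilon/n}{z'x/n} \to \beta$, because $z'\varepsilon/n \to 0$ while $z'x/n$ does not.
      y is cholera prevalence (the outcome); x is presence of bad water (the cause); z is "Lambeth is the water supplier" (the instrument).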

  21. I have been asked this question (or ones like it) dozens of times. I think the issue is getting way overplayed. But I'm not the majority voice, so you are going to have to deal with this. Step 1: you or the referee need to figure out (make a case for) by what construction "project termination" is endogenous. What is correlated with what **in your hazard model** that makes the variable endogenous? There must be a second equation that implies that project termination is endogenous. What is it? What unobservable in that equation is correlated with what unobservable in the hazard model that makes it endogenous? Same questions for your hurdle model.
      By what construction is SNAP endogenous in the HEALTH equation?
      $SNAP = X\beta_{SNAP} + Z\delta + \varepsilon$
      $HEALTH = X\beta_{HEALTH} + \gamma\,SNAP + v$

  22. [Slide graphic not recovered; surviving labels: X, Z]

  23. Instrumental Variables
      Structure: LWAGE(ED, EXP, EXPSQ, WKS, OCC, SOUTH, SMSA, UNION); ED(MS, FEM)
      Reduced Form: LWAGE[ED(MS, FEM), EXP, EXPSQ, WKS, OCC, SOUTH, SMSA, UNION]

  24. Two Stage Least Squares Strategy
      Reduced Form: LWAGE[ED(MS, FEM, X), EXP, EXPSQ, WKS, OCC, SOUTH, SMSA, UNION]
      Strategy:
      (1) Purge ED of the influence of everything but MS, FEM (and the other variables). Predict ED using all exogenous information in the sample (X and Z).
      (2) Regress LWAGE on this prediction of ED and everything else.
      Standard errors must be adjusted for the predicted ED.
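The two-step strategy can be sketched with NumPy on simulated data. Everything here is an assumption made for illustration: the variable names (`ability`, `exper`, `z1`, `z2`), the coefficients, and the data-generating process are hypothetical and are not the Cornwell and Rupert data. The standard errors are computed from residuals that use the actual `ed`, not the predicted one, which is one common way to make the adjustment the slide calls for.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical data-generating process (not the Cornwell and Rupert data).
ability = rng.normal(size=n)                 # unobserved; drives both ed and lwage
exper = rng.normal(size=n)                   # exogenous regressor
z1 = rng.normal(size=n)                      # instrument 1
z2 = rng.normal(size=n)                      # instrument 2
ed = 0.5 * z1 + 0.5 * z2 + 0.3 * exper + ability + rng.normal(size=n)
lwage = 1.0 + 0.10 * ed + 0.05 * exper + 0.5 * ability + rng.normal(size=n)

const = np.ones(n)
X = np.column_stack([const, ed, exper])      # regressors; ed is endogenous
Z = np.column_stack([const, z1, z2, exper])  # all exogenous information

def ols(A, y):
    """Least squares coefficients of y on the columns of A."""
    return np.linalg.lstsq(A, y, rcond=None)[0]

# Stage 1: predict ed from all exogenous variables.
ed_hat = Z @ ols(Z, ed)

# Stage 2: regress lwage on the prediction of ed and everything else.
X_hat = np.column_stack([const, ed_hat, exper])
b_2sls = ols(X_hat, lwage)
b_ols = ols(X, lwage)                        # for comparison; picks up ability

# Standard errors: residuals are computed with the actual ed, not ed_hat.
resid = lwage - X @ b_2sls
s2 = resid @ resid / (n - X.shape[1])
se_2sls = np.sqrt(s2 * np.diag(np.linalg.inv(X_hat.T @ X_hat)))

print("OLS  coefficient on ed:", round(b_ols[1], 3))
print("2SLS coefficient on ed:", round(b_2sls[1], 3), "+/-", round(se_2sls[1], 3))
```

In this setup the OLS coefficient on `ed` drifts away from the assumed value of 0.10, while the two-stage estimate stays close to it.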

  25. OLS [regression output not recovered]

  26. The weird results for the coefficient on ED happened because the instruments, MS and FEM, are dummy variables. There is not enough variation in these variables.

  27. The Ultimate Source of Endogeneity
      $LWAGE = f(ED, EXP, EXPSQ, WKS, OCC, SOUTH, SMSA, UNION) + \varepsilon$
      $ED = f(MS, FEM, EXP, EXPSQ, WKS, OCC, SOUTH, SMSA, UNION) + u$

  28. Remove the Endogeneity
      $LWAGE = f(ED, EXP, EXPSQ, WKS, OCC, SOUTH, SMSA, UNION) + u + \varepsilon$
      Strategy: Estimate u. Add u to the equation. ED is uncorrelated with $\varepsilon$ when u is in the equation.
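The "estimate u, then add it to the equation" strategy can be checked numerically against two stage least squares. The sketch below uses a hypothetical simulated data set (all names and coefficients are assumptions, not the course data) and relies on the first stage including every exogenous variable.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Hypothetical simulated data; ability is the unobservable behind the endogeneity.
ability = rng.normal(size=n)
exper = rng.normal(size=n)
z1, z2 = rng.normal(size=n), rng.normal(size=n)
ed = 0.5 * z1 + 0.5 * z2 + 0.3 * exper + ability + rng.normal(size=n)
lwage = 1.0 + 0.10 * ed + 0.05 * exper + 0.5 * ability + rng.normal(size=n)

const = np.ones(n)
X = np.column_stack([const, ed, exper])
Z = np.column_stack([const, z1, z2, exper])

def ols(A, y):
    """Least squares coefficients of y on the columns of A."""
    return np.linalg.lstsq(A, y, rcond=None)[0]

# Auxiliary regression: ed on all exogenous variables, keep the residual.
ed_hat = Z @ ols(Z, ed)
u_hat = ed - ed_hat

# 2SLS: replace ed by its prediction.
b_2sls = ols(np.column_stack([const, ed_hat, exper]), lwage)

# Control function: keep the actual ed and add the residual as a regressor.
b_cf = ols(np.column_stack([X, u_hat]), lwage)

print("2SLS:            ", np.round(b_2sls, 4))
print("control function:", np.round(b_cf[:3], 4), " coefficient on u_hat:", round(b_cf[3], 4))
```

The coefficients on the original regressors agree to numerical precision; only the reported standard errors differ between the two routes.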

  29. Auxiliary Regression for ED to Obtain Residuals [regression output not recovered; surviving labels: IVs, Exog. Vars]

  30. OLS with Residual (Control Function) Added, shown alongside 2SLS [regression output not recovered]

  31. A Warning About Control Function Estimators: The standard errors must be adjusted.
      $0.35343 = 0.38395 \times \dfrac{1.29053}{1.40197}$
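One way to see where such an adjustment can come from, in the linear case, is to rescale the control function standard errors by the ratio of the two residual standard deviations. The sketch below is an illustration on hypothetical simulated data. It assumes the slide's numbers are standard errors and residual standard deviations, which the transcript does not confirm, and all variable names are invented.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000

# Hypothetical simulated data (not the course data).
ability = rng.normal(size=n)
exper = rng.normal(size=n)
z1, z2 = rng.normal(size=n), rng.normal(size=n)
ed = 0.5 * z1 + 0.5 * z2 + 0.3 * exper + ability + rng.normal(size=n)
lwage = 1.0 + 0.10 * ed + 0.05 * exper + 0.5 * ability + rng.normal(size=n)

const = np.ones(n)
X = np.column_stack([const, ed, exper])
Z = np.column_stack([const, z1, z2, exper])

def ols(A, y):
    """Least squares coefficients of y on the columns of A."""
    return np.linalg.lstsq(A, y, rcond=None)[0]

ed_hat = Z @ ols(Z, ed)
u_hat = ed - ed_hat
X_hat = np.column_stack([const, ed_hat, exper])

# Control function regression and its unadjusted standard errors.
W = np.column_stack([X, u_hat])
b_cf = ols(W, lwage)
e_cf = lwage - W @ b_cf
s_cf = np.sqrt(e_cf @ e_cf / (n - W.shape[1]))
se_cf = s_cf * np.sqrt(np.diag(np.linalg.inv(W.T @ W)))[:3]

# 2SLS standard errors, with residuals built from the actual ed.
b_2sls = ols(X_hat, lwage)
e_2sls = lwage - X @ b_2sls
s_2sls = np.sqrt(e_2sls @ e_2sls / (n - X.shape[1]))
se_2sls = s_2sls * np.sqrt(np.diag(np.linalg.inv(X_hat.T @ X_hat)))

# Rescaling by the ratio of residual standard deviations recovers the 2SLS values.
se_cf_adjusted = se_cf * (s_2sls / s_cf)

print("unadjusted CF s.e.:", np.round(se_cf, 5))
print("adjusted CF s.e.:  ", np.round(se_cf_adjusted, 5))
print("2SLS s.e.:         ", np.round(se_2sls, 5))
```

The rescaling works here because adding the first-stage residual leaves the relevant block of the inverse moment matrix equal to the 2SLS one, so only the estimate of the residual variance differs.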

  32. I am here to ask for a little help with endogeneity. I have a main regression in which the independent variables are lagged 1 year (this is an unbalanced panel dataset); I use fixed effects, xtreg. Main Regression: Yt = Xt-1 + Qt-1 + Z3t-1. I suspect endogeneity: variable X may itself be determined by prior-year Y. As a solution, I read about this strategy: regress the endogenous variable Xt-1 on the dependent variable (Yt-2) and the other independent variables (i.e., Qt-2 and Zt-2); these Y, Q and Z are all in year t-2, while X is in t-1. Then, from this regression, calculate the predicted values for X, and include them as a control for endogeneity (e.g., a variable named "Endogeneity-control") in the main regression above. Question 1: in the Main Regression above, when including the control for endogeneity (i.e., the variable "Endogeneity-control"), do I have to lag its value? That is, do I have to include Endogeneity-control in t-1, or just the predicted values, without lagging?

  33. The two stage LS strategy: (The two stage button in your software.) The software regresses EDUC on all independent variables plus the two instrumental variables (stage 1), then takes the predicted value of education and regresses LWAGE on that predicted value plus the original independent variables (stage 2). Is this correct? Then the second method you showed is the same except the predicted residuals are included in the second stage OLS. Is one method preferred over another? They produce the same results.

  34. The General Problem
      $y = X_1\beta_1 + X_2\beta_2 + \varepsilon$
      $Cov(X_1, \varepsilon) = 0$, $K_1$ variables
      $Cov(X_2, \varepsilon) \neq 0$, $K_2$ variables; $X_2$ is endogenous.
      OLS regression of y on $(X_1, X_2)$ cannot estimate $(\beta_1, \beta_2)$ consistently. Some other estimator is needed.
      Additional structure: $X_2 = Z\Pi + V$, where $Cov(Z, \varepsilon) = 0$.
      An instrumental variable (IV) estimator based on $(X_1, Z)$ may be able to estimate $(\beta_1, \beta_2)$ consistently.

  35. Instrumental Variables
      Fully General Framework: $y = X\beta + \varepsilon$, K variables in X.
      There exists a set of M = K variables, Z, such that $plim(Z'X/n) \neq 0$ but $plim(Z'\varepsilon/n) = 0$.
      The variables in Z are called instrumental variables.
      An alternative (to least squares) estimator of $\beta$ is $b_{IV} = (Z'X)^{-1}Z'y$.
      We consider the following: Why use this estimator? What are its properties compared to least squares?
      We will also examine an important application.

  36. IV Estimators
      Consistent: $b_{IV} = (Z'X)^{-1}Z'y = (Z'X/n)^{-1}(Z'X/n)\beta + (Z'X/n)^{-1}Z'\varepsilon/n = \beta + (Z'X/n)^{-1}Z'\varepsilon/n \to \beta$
      Asymptotically normal (same approach to proof as for OLS).
      Inefficient, to be shown.

  37. The General Result: By construction, the IV estimator is consistent. So, we have an estimator that is consistent when least squares is not.
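The consistency claim can be illustrated by letting the sample size grow. The sketch below uses a hypothetical one-regressor model with an assumed slope of 2.0. The shared shock `common` is what makes the regressor endogenous, and `z` is an instrument by construction.

```python
import numpy as np

rng = np.random.default_rng(3)
beta = np.array([1.0, 2.0])          # assumed true intercept and slope

def simulate(n):
    """Draw a hypothetical sample in which x is correlated with the disturbance."""
    z = rng.normal(size=n)                       # instrument
    common = rng.normal(size=n)                  # shared shock: source of endogeneity
    x = 0.8 * z + common + rng.normal(size=n)
    eps = common + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    Z = np.column_stack([np.ones(n), z])
    y = X @ beta + eps
    return X, Z, y

results = {}
for n in (100, 10_000, 1_000_000):
    X, Z, y = simulate(n)
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)    # (X'X)^{-1} X'y
    b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)     # (Z'X)^{-1} Z'y
    results[n] = (b_ols[1], b_iv[1])
    print(f"n = {n:>9,}: OLS slope = {b_ols[1]:.3f}, IV slope = {b_iv[1]:.3f}")
```

As n grows the IV slope settles near the assumed value, while the least squares slope settles on a different number; more data does not remove the least squares error.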

  38. LS as an IV Estimator
      The least squares estimator is $(X'X)^{-1}X'y = (X'X)^{-1}\sum_i x_i y_i = \beta + (X'X)^{-1}\sum_i x_i \varepsilon_i$.
      If $plim(X'X/n) = Q$ (nonzero) and $plim(X'\varepsilon/n) = 0$, then under the usual assumptions LS is an IV estimator: X is its own instrument.

  39. IV Estimation
      Why use an IV estimator? Suppose that X and $\varepsilon$ are not uncorrelated. Then least squares is neither unbiased nor consistent.
      Recall the proof of consistency of least squares: $b = \beta + (X'X/n)^{-1}(X'\varepsilon/n)$.
      $plim\ b = \beta$ requires $plim(X'\varepsilon/n) = 0$. If this does not hold, the estimator is inconsistent.

  40. A Popular Misconception
      A popular misconception: if only one variable in X is correlated with $\varepsilon$, the other coefficients are consistently estimated. False.
      Suppose only the first variable is correlated with $\varepsilon$. Under the assumptions, $plim(X'\varepsilon/n) = (\sigma_1, 0, \ldots, 0)'$.
      Then, $plim\ b - \beta = plim(X'X/n)^{-1}(\sigma_1, 0, \ldots, 0)' = \sigma_1 (q^{11}, q^{21}, \ldots, q^{K1})'$ = $\sigma_1$ times the first column of $Q^{-1}$.
      The problem is smeared over the other coefficients.
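The smearing effect can be reproduced in a small simulation. The sketch below is built on assumptions chosen for illustration: two regressors with correlation 0.6, of which only `x1` is correlated with the disturbance (covariance 0.4), and all coefficients equal to 1. The predicted inconsistency is the inverse of the population moment matrix times the moment vector.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Hypothetical design: x1 and x2 are correlated, but only x1 is correlated with eps.
v, w = rng.normal(size=n), rng.normal(size=n)
x2 = v
x1 = 0.6 * v + 0.8 * w                 # Var(x1) = 1, Corr(x1, x2) = 0.6
eps = 0.5 * w + rng.normal(size=n)     # Cov(x1, eps) = 0.4, Cov(x2, eps) = 0

beta = np.array([1.0, 1.0, 1.0])       # assumed true coefficients
X = np.column_stack([np.ones(n), x1, x2])
y = X @ beta + eps
b = np.linalg.solve(X.T @ X, X.T @ y)  # least squares

# Population counterparts: Q = plim(X'X/n) and plim(X'eps/n).
Q = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.6],
              [0.0, 0.6, 1.0]])
moment = np.array([0.0, 0.4, 0.0])
predicted_bias = np.linalg.solve(Q, moment)   # Q^{-1} times the moment vector

print("b - beta:      ", np.round(b - beta, 3))
print("predicted plim:", np.round(predicted_bias, 3))
```

Although `x2` is uncorrelated with the disturbance in this design, its coefficient is still estimated inconsistently, because the relevant column of the inverse moment matrix has a nonzero entry in its row.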

  41. Asymptotic Covariance Matrix of $b_{IV}$
      $b_{IV} - \beta = (Z'X)^{-1}Z'\varepsilon$
      $(b_{IV} - \beta)(b_{IV} - \beta)' = (Z'X)^{-1}Z'\varepsilon\varepsilon'Z(X'Z)^{-1}$
      $E[(b_{IV} - \beta)(b_{IV} - \beta)' \mid X, Z] = \sigma^2 (Z'X)^{-1}Z'Z(X'Z)^{-1}$

  42. Asymptotic Efficiency
      Asymptotic efficiency of the IV estimator: the variance is larger than that of LS. (A large-sample type of Gauss-Markov result is at work.)
      (1) It's a moot point. LS is inconsistent.
      (2) Mean squared error is uncertain: MSE[estimator | $\beta$] = Variance + square of bias. IV may be better or worse. It depends on the data.

  43. Two Stage Least Squares: How to use an excess of instrumental variables
      (1) X is K variables. Some (at least one) of the K variables in X are correlated with $\varepsilon$.
      (2) Z is now M > K variables. Some of the variables in Z are also in X, some are not. None of the variables in Z are correlated with $\varepsilon$.
      (3) Which K variables to use to compute $Z'X$ and $Z'y$?

  44. [Slide graphic not recovered; surviving labels: X, Z]

  45. Choosing the Instruments
      Choose K randomly? Choose the included Xs and the remainder randomly? Use all of them? How?
      A theorem (Brundy and Jorgenson, ca. 1972): There is a most efficient way to construct the IV estimator from this subset:
      (1) For each column (variable) in X, compute the predictions of that variable using all the columns of Z.
      (2) Linearly regress y on these K predictions.
      This is two stage least squares.

  46. Algebraic Equivalence
      Two stage least squares is equivalent to:
      (1) Each variable in X that is also in Z is replaced by itself.
      (2) Variables in X that are not in Z are replaced by predictions of that X with all the variables in Z.
      Coefficients in augmented regression are added to match 2SLS. (They match if residuals are used instead of predictions.)

  47. [Regression output not recovered; surviving label: Sum = 2SLS]

  48. 2SLS Algebra
      $\hat{X} = Z(Z'Z)^{-1}Z'X$
      $b_{2SLS} = (\hat{X}'\hat{X})^{-1}\hat{X}'y$
      But $Z(Z'Z)^{-1}Z'X = (I - M_Z)X$ and $(I - M_Z)$ is idempotent, so
      $\hat{X}'\hat{X} = X'(I - M_Z)(I - M_Z)X = X'(I - M_Z)X = \hat{X}'X$
      and $b_{2SLS} = (\hat{X}'X)^{-1}\hat{X}'y$ = a real IV estimator by the definition.
      Note, $plim(\hat{X}'\varepsilon/n) = 0$ since the columns of $\hat{X}$ are linear combinations of the columns of Z, all of which are uncorrelated with $\varepsilon$.
      $b_{2SLS} = [X'(I - M_Z)X]^{-1}X'(I - M_Z)y$
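The three expressions for the 2SLS estimator on this slide can be checked numerically. The sketch below uses a small hypothetical sample (names and coefficients are assumptions) so that the n-by-n projection matrix can be formed explicitly. In practice one would avoid building it.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500

# Hypothetical data: one endogenous regressor, one exogenous regressor, two instruments.
z1, z2, x_exog = rng.normal(size=(3, n))
common = rng.normal(size=n)
x_endog = z1 + z2 + 0.5 * x_exog + common + rng.normal(size=n)
y = 1.0 + 0.5 * x_endog - 0.25 * x_exog + common + rng.normal(size=n)

X = np.column_stack([np.ones(n), x_endog, x_exog])
Z = np.column_stack([np.ones(n), z1, z2, x_exog])

# P = Z (Z'Z)^{-1} Z' = I - M_Z, the projection onto the columns of Z.
P = Z @ np.linalg.solve(Z.T @ Z, Z.T)
X_hat = P @ X

# (1) Regress y on the predictions.
b_a = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
# (2) IV form, with X_hat used as the instrument matrix.
b_b = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)
# (3) Projection form.
b_c = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)

print("idempotent:", np.allclose(P @ P, P))
print(np.round(b_a, 4), np.round(b_b, 4), np.round(b_c, 4))
```

All three agree because the projection is symmetric and idempotent, which is the step the slide uses to rewrite the cross-product of the predictions.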

  49. Asymptotic Covariance Matrix for 2SLS
      General Result for Instrumental Variable Estimation:
      $E[(b_{IV} - \beta)(b_{IV} - \beta)' \mid X, Z] = \sigma^2 (Z'X)^{-1}Z'Z(X'Z)^{-1}$
      Specialize for 2SLS, using $\hat{X} = (I - M_Z)X$ in place of Z:
      $E[(b_{2SLS} - \beta)(b_{2SLS} - \beta)' \mid X, Z] = \sigma^2 (\hat{X}'X)^{-1}\hat{X}'\hat{X}(X'\hat{X})^{-1}$
      $= \sigma^2 (\hat{X}'\hat{X})^{-1}\hat{X}'\hat{X}(\hat{X}'\hat{X})^{-1}$
      $= \sigma^2 (\hat{X}'\hat{X})^{-1}$

  50. 2SLS has larger variance (around its mean) than LS has around its mean.
      A comparison to OLS:
      Asy.Var[2SLS] = $\sigma^2 (\hat{X}'\hat{X})^{-1}$
      Neglecting the inconsistency, Asy.Var[LS] = $\sigma^2 (X'X)^{-1}$ (This is the variance of LS around its mean, not $\beta$.)
      Asy.Var[2SLS] $\geq$ Asy.Var[LS] in the matrix sense. To prove, compare the inverses:
      $\{Asy.Var[LS]\}^{-1} - \{Asy.Var[2SLS]\}^{-1} = (1/\sigma^2)[X'X - \hat{X}'\hat{X}] = (1/\sigma^2)[X'X - X'(I - M_Z)X] = (1/\sigma^2)[X'M_Z X]$
      This matrix is nonnegative definite. (Not positive definite as it might have some rows and columns which are zero.)
      Implication for "precision" of 2SLS: Possibly very large variances. The problem of "Weak Instruments".
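The matrix comparison can be checked numerically. The sketch below uses hypothetical simulated regressors with deliberately modest instrument strength. It ignores the common variance factor, since it cancels in the comparison, and looks only at the moment matrices.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000

# Hypothetical regressors: x_endog depends only modestly on the instruments z1, z2.
z1, z2, x_exog = rng.normal(size=(3, n))
x_endog = 0.3 * z1 + 0.3 * z2 + 0.5 * x_exog + rng.normal(size=n)

X = np.column_stack([np.ones(n), x_endog, x_exog])
Z = np.column_stack([np.ones(n), z1, z2, x_exog])

# Predictions of every column of X from Z (columns already in Z predict themselves).
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]

# D = X'M_Z X = X'X - X_hat'X_hat; it should have no negative eigenvalues.
D = X.T @ (X - X_hat)
eigenvalues = np.linalg.eigvalsh((D + D.T) / 2)

# Diagonals of the two variance matrices, up to the common factor sigma^2.
var_ls = np.diag(np.linalg.inv(X.T @ X))
var_2sls = np.diag(np.linalg.inv(X_hat.T @ X_hat))
ratio = var_2sls / var_ls

print("eigenvalues of X'M_Z X:", np.round(eigenvalues, 6))
print("variance ratio 2SLS / LS:", np.round(ratio, 2))
```

The rows and columns of the difference matrix that belong to the constant and to `x_exog` are zero, which is why it is nonnegative definite and not positive definite. Shrinking the instrument coefficients in this sketch drives the variance ratio for the endogenous regressor up sharply, which is the weak instruments problem in miniature.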
