Effective Techniques for Data Splitting and Model Validation

An overview of data splitting and cross-validation for validating a relatively simple logistic regression model. The slides cover why overfitting arises, how variable selection leads to the final model, and how to partition data into training and test sets in PROC HPLOGISTIC.

  • Data Splitting
  • Model Validation
  • Overfitting Prevention
  • PROC HPLOGISTIC
  • Partitioning Strategies

Presentation Transcript


  1. Data splitting and cross-validation

  2. In a relatively simple model like the one we fit to large amounts of data, overfitting is probably not a problem (Hand 1997). The chance of overfitting increases with variable selection methods that add complexity.

  3. The final model, again.

         %let screened= MIPhone MICCBal Dep MM ILS MTGBal Income POS CD IRA
                        brclus1 Sav NSF Age SavBal LOCBal NSFAmt Inv MIHMVal
                        CRScore MIAcctAg InvBal DirDep CCPurc SDB CashBk
                        AcctAge InArea ATMAmt b_DDABal DDA brclus2 CC HMOwn
                        DepAmt Phone ATM LORes brclus4;

         /* Best-subsets search: keep the best model of each size */
         ods output NObs=NObs bestsubsets=score;
         proc logistic data=d.develop_a;
            model ins(event="1")=&screened resr resu
                  / selection=score best=1;
         run;

         /* Schwarz Bayes criterion */
         /* Store the number of observations used in macro variable obs */
         data _NULL_;
            set NObs;
            where label = 'Number of Observations Used';
            call symput('obs',n);
         run;

         /* Compute the criterion for each candidate subset */
         data subset;
            set score;
            sbc=-scorechisq+log(&obs)*(numberofvariables+1);
         run;

         /* Keep the variable list of the subset with the smallest sbc */
         proc sql;
            select VariablesInModel into :selected
            from subset
            having sbc=min(sbc);
         quit;

         /* Refit the selected model */
         proc logistic data=d.develop_a descending plots=roc;
            model ins=&selected;
         run;
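The sbc assignment in the DATA step above can be written as a formula. This is a reading of what the code computes, not a statement from the slides; the symbols k, n, and the subscripted chi-square are notation introduced here, and the interpretation of the extra 1 as the intercept assumes each candidate input contributes one parameter.

```latex
\mathrm{SBC}_k \;=\; -\chi^2_{\mathrm{score},k} \;+\; (k + 1)\,\ln n
```

Here k corresponds to numberofvariables, n to &obs, and the chi-square term to scorechisq for the best subset of size k. The subset with the smallest value is the one PROC SQL keeps. Because the score chi-square stands in for the likelihood-ratio statistic, this quantity should track the usual SBC (minus twice the log likelihood plus the number of parameters times ln n) only approximately and up to an additive constant, which does not change which subset attains the minimum.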

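  4. The primary purpose of data splitting is to correct for overoptimism: fit statistics computed on the same data used to fit the model tend to look better than they will on new data.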

  5. Split the data into training and test data sets. The training set is used for model development. The test set is used for assessment. Most input-preparation steps can be done before the data is split.
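The split described above can also be done by hand, outside PROC HPLOGISTIC. The following is a minimal sketch, not the presenter's code: it assumes the &screened macro variable from the earlier slide is defined, and the data set names work.split, work.train, work.test, work.scored_train, and work.scored_test are hypothetical. The 25% test fraction and the seed simply mirror the values used on the HPLOGISTIC slides.

```sas
/* Sketch only: all WORK data set names below are hypothetical. */

/* Draw a 25% simple random sample to serve as the test set.
   OUTALL keeps every row and adds Selected (1 = sampled, 0 = not). */
proc surveyselect data=d.develop_a out=work.split
                  method=srs samprate=0.25 seed=54321
                  outall noprint;
run;

/* Route sampled rows to the test set and the rest to training. */
data work.train work.test;
   set work.split;
   if Selected then output work.test;
   else output work.train;
   drop Selected;
run;

/* Develop the model on the training set only, then score both
   sets. FITSTAT prints fit statistics for each scored data set. */
proc logistic data=work.train;
   model ins(event="1")=&screened resr resu / selection=backward;
   score data=work.train out=work.scored_train fitstat;
   score data=work.test  out=work.scored_test  fitstat;
run;
```

If the training-set statistics look clearly better than the test-set statistics, that gap is the overoptimism that data splitting is meant to expose.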

  6. Data splitting in PROC HPLOGISTIC.

         %let screened= MIPhone MICCBal Dep MM ILS MTGBal Income POS CD IRA
                        brclus1 Sav NSF Age SavBal LOCBal NSFAmt Inv MIHMVal
                        CRScore MIAcctAg InvBal DirDep CCPurc SDB CashBk
                        AcctAge InArea ATMAmt b_DDABal DDA brclus2 CC HMOwn
                        DepAmt Phone ATM LORes brclus4;

         proc hplogistic data=d.develop_a;
            model ins(descending)=&screened resr resu;
            selection method=backward;
            partition fraction(test=.25 seed=54321);
         run;

  7. A different partitioning

         proc hplogistic data=d.develop_a;
            model ins(descending)=&screened resr resu;
            selection method=backward;
            partition fraction(test=.25 seed=3377551);
         run;
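Since the last two slides differ only in the seed, the comparison can be wrapped in a small macro. This is a hedged sketch rather than anything from the presentation: the macro name split_runs and its seeds parameter are hypothetical, and it assumes &screened is already defined as on the earlier slide.

```sas
/* Sketch only: split_runs and its seeds= parameter are hypothetical. */
%macro split_runs(seeds=54321 3377551);
   %local i seed;
   /* Run the same backward selection once per seed in the list. */
   %do i=1 %to %sysfunc(countw(&seeds));
      %let seed=%scan(&seeds,&i);
      title "Test fraction 0.25, seed=&seed";
      proc hplogistic data=d.develop_a;
         model ins(descending)=&screened resr resu;
         selection method=backward;
         partition fraction(test=.25 seed=&seed);
      run;
   %end;
   title;  /* Clear the title after the last run */
%mend split_runs;

%split_runs()
```

Comparing the selected effects and test-set fit statistics across runs gives a rough sense of how much the results depend on one particular partition.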
