
Mastering Variable Selection with Random Forest in SAS
Explore the power of Random Forest in SAS for variable selection, as explained by statistical research analyst Denis Nyongesa. Learn how Random Forest techniques improve prediction models by selecting important variables and discover the significance of variable importance measurement. Enhance your predictor performance with essential insights on variable selection methods and criteria.
Presentation Transcript
VARIABLE SELECTION USING RANDOM FOREST IN SAS By Denis Nyongesa
Biography Denis Nyongesa is a Statistical Research Analyst at the Kaiser Permanente Center for Health Research, Portland, Oregon. His work centers on data management and statistical modeling, and his main interests are statistical computation, machine learning, and pattern recognition. Denis enjoys hiking and road trips.
Random Forest (RF) Breiman (2001) defined an RF as a classifier consisting of a collection of tree-structured classifiers grown from independent and identically distributed (iid) random vectors, in which each tree casts a unit vote for the most popular class at input x. It is an increasingly used statistical method for classification and regression, introduced by Leo Breiman in 2001. A good prediction model begins with a good feature selection process. How do we select important variables using random forest? The next slides show how.
RF Contd. An RF is a combination of tree predictors. The generalization error for forests converges almost surely to a limit as the number of trees in the forest becomes large, and RFs are robust with respect to noise. Because each split considers only a random subset of m < p predictors, the trees of the forest tend to have low correlation with one another, if they are not uncorrelated altogether. RFs do not overfit the data, even for a very large number of trees (Breiman, 2001). The correlation between trees is the correlation of their predictions on the out-of-bag (OOB) samples (the set of observations not used in building the current tree).
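For reference, Breiman (2001) makes the link between tree correlation and accuracy precise. Writing $\bar{\rho}$ for the mean correlation between trees and $s$ for the strength of the individual tree classifiers, he bounds the generalization error by

$$PE^{*} \le \frac{\bar{\rho}\,(1 - s^{2})}{s^{2}},$$

so the random selection of predictor subsets, which lowers $\bar{\rho}$, tightens the bound.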
Variable Importance Most statistical procedures measure variable importance with criteria such as the Schwarz Bayesian Criterion (SBC) or Akaike's Information Criterion (AIC). The RF approach is different: RF randomly permutes the values of the variable for the OOB observations, then passes the modified OOB data down the tree to get new predictions. The difference between the misclassification rates for the modified and the original OOB data, divided by the standard error, is a measure of the importance of the variable (Cutler et al. 2007). RF then ranks the variables according to each variable's importance score.
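One common formalization of this permutation measure (a sketch following the description above; the notation is assumed, not from the slides): for predictor $x_j$, let $ME_t$ be the misclassification rate of tree $t$ on its OOB sample and $ME_t^{\pi_j}$ the rate after randomly permuting $x_j$ within that sample. With $T$ trees,

$$VI(x_j) = \frac{\frac{1}{T}\sum_{t=1}^{T}\left(ME_t^{\pi_j} - ME_t\right)}{SE},$$

where $SE$ is the standard error of the per-tree differences.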
Variable Selection Keeping the important variables in the model improves predictive performance. Many variable selection methods incorporate a measure of feature importance, for example support vector machine (SVM) scores, stepwise, forward, and backward selection, p-values, and the score chi-square statistic. The cut-off point at which to select variables using RF is left to the analyst and depends on the purpose the model is intended for.
Data The Sashelp.JunkMail data set was used. It was collected at Hewlett-Packard (HP) Labs and donated by George Forman, and it comes from a study that classifies whether an email is junk (1 = junk/spam, 0 = not junk/not spam). The data set contains 4,601 observations and 59 variables: 57 continuous predictor variables that record the frequencies of common words and characters and the lengths of uninterrupted sequences of capital letters in emails, and the categorical response variable Class. NOTE: The categorical variable Test was dropped.
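As a minimal sketch (not from the original slides), the Test indicator can be dropped and the response distribution checked before modeling:

/* Copy the data without the Test indicator and tabulate the response. */
data work.junkmail;
    set sashelp.junkmail(drop=Test);
run;

proc freq data=work.junkmail;
    tables Class;
run;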
Sashelp.junkmail Information Displaying information about Sashelp.junkmail with the CONTENTS procedure, showing the first and last three variables:

proc contents data = sashelp.junkmail varnum;
ods select position;
run;

#    Variable   Type   Length   Label
1    Test       Num    8        0 - Training, 1 - Test
2    Make       Num    8
3    Address    Num    8
...
57   CapLong    Num    8        Capital Run Length Longest
58   CapTotal   Num    8        Capital Run Length Total
59   Class      Num    8        0 - Not Junk, 1 - Junk
RANDOM FOREST: THE HIGH-PERFORMANCE PROCEDURE SAS code using options of the SAS Enterprise Miner 14.3 high-performance procedures:

ods trace on;
proc hpforest data=sashelp.junkmail maxtrees=1000 vars_to_try=10
        seed=1985 trainfraction=0.7 maxdepth=50 leafsize=6 alpha=0.5;
    target class / level=nominal;
    input Make Address All _3D Our Over Remove Internet Order Mail
          Receive Will People Report Addresses Free Business Email You
          Credit Your Font _000 Money HP HPL George _650 Lab Labs
          Telnet _857 Data _415 _85 Technology _1999 Parts PM Direct
          CS Meeting Original Project RE Edu Table Conference Semicolon
          Paren Bracket Exclamation Dollar Pound CapAvg CapLong CapTotal
          / level=interval;
    ods output FitStatistics = fit_at_runtime;
    ods output VariableImportance = Variable_Importance;
    ods output Baseline = Baseline;
run;
ods trace off;

Option glossary:
maxtrees specifies the maximum number of trees.
vars_to_try specifies the number of randomly selected inputs to consider at each node.
seed sets the randomization seed for bootstrapping and feature selection.
trainfraction specifies the fraction of the original observations used for bootstrapping each tree.
maxdepth specifies the maximum number of splitting rules along the path to a node.
leafsize indicates the minimum number of observations allowed in each branch.
alpha specifies the p-value threshold a candidate variable must meet for a node to be split.
preselect indicates the method of selecting a splitting feature (not used above).
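As a follow-up sketch (column names such as Variable and MarginOOB are assumed from typical HPFOREST output and may vary by SAS version), the Variable_Importance output can be ranked and the top 16 predictors captured for reuse:

/* Rank variables by OOB margin importance and keep the top 16 names. */
proc sort data=Variable_Importance out=vi_sorted;
    by descending MarginOOB;
run;

proc sql noprint;
    select Variable into :top16 separated by ' '
    from vi_sorted(obs=16);
quit;

%put Top 16 RF variables: &top16;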
Fit Statistics [Plots: the Average Square Error (ASE) and the Misclassification Error (ME).]
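A minimal sketch for reproducing such a plot from the fit_at_runtime table (column names such as NTrees and MiscOob are assumed and may vary by SAS version):

/* Plot the OOB misclassification rate against the number of trees. */
proc sgplot data=fit_at_runtime;
    series x=NTrees y=MiscOob;
    xaxis label="Number of Trees";
    yaxis label="OOB Misclassification Rate";
run;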
VARIABLE IMPORTANCE [Plot: variables ranked in order of importance.]
MODEL COMPARISON Average square and misclassification errors obtained with variables selected by the RF and STEPWISE techniques. Criteria: the top 16 variables obtained by RF variable importance and the top 16 obtained by LOGISTIC regression's STEPWISE method were selected, and two separate RF models were fit. The variables selected by RF consistently perform better on both the training and OOB data samples.
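For the STEPWISE benchmark, a minimal sketch (not the author's exact code; the predictor list is abbreviated to the one shown in the HPFOREST step, and the entry/stay significance levels are assumptions):

/* Stepwise logistic selection on the same predictors for comparison. */
proc logistic data=sashelp.junkmail(drop=Test);
    model Class(event='1') = Make Address All _3D Our Over Remove
          Internet Order Mail /* ... remaining predictors as above ... */
          / selection=stepwise slentry=0.05 slstay=0.05;
run;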
Contact Information
Name: Denis Nyongesa
Company: Kaiser Permanente Center for Health Research (KPCHR)
City/State: Portland, Oregon
Phone: 503-335-6606
Email: denis.b.nyongesa@kpchr.org