Effective SAS Data Cleaning Strategies

sas data cleaning tips n.w
1 / 12
Embed
Share

Discover essential tips for SAS data cleaning, including handling variable names, missing values, and more. Learn how to efficiently clean and prepare your data for analysis in SAS.

  • SAS
  • Data Cleaning
  • Tips
  • Data Analysis

Uploaded on | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. SAS: Data Cleaning Tips PICC Meeting - October 2022 Djhenne Dalmacy Biostatistician I Djhenne.Dalmacy@osumc.edu

  2. Outline 1- Variable Names: Enter first excel row as data label 2- Missing Values: Identify missing values & calculate % missing 3- Invalid Values: Identify Invalid values - Fix (special case): Numeric variable with character entry 4- Other Features: Handy procedures, functions, & options to modify data 5- Useful Links: Some pretty good resources 2

  3. Variable Names /*1strow=name/2ndrow=label*/ proc import datafile='L:\...\Tyler2v3.xlsx' out = temp1 dbms = xlsx replace; getnames = no; run; proc transpose data = my_labels_ out = my_labels (drop= _NAME_ _LABEL_); var _all_; run; proc datasets noprint; modify my_labels; rename COL1 = _NAME_; quit; /*Split into 2 datasets*/ data my_labels_ char_data; set temp1; if _n_ in (1 2) then output my_labels_; else output char_data; run; 3

  4. Variable Names /*Import data without labels*/ proc import datafile='L:\...\Tyler2v2.xlsx' out = raw_data dbms = xlsx replace; getnames = yes; run; /*Assign the labels to the raw_data*/ data _null_; set my_labels end = last; if _n_= 1 then call execute('proc datasets library=work nolist nodetails; modify raw_data;'); call execute(cat('label ',_name_,'="',col2,'";')); if last then call execute('quit;'); run; proc contents data = raw_data out = my_contents varnum noprint; run; 4

  5. Missing Values: /*Subset all obs with missing (use delete instead of output to subset complete cases)*/ data crlm_miss; set crlm_; if cmiss(of _ALL_) ~= 0 then output; *delete; run; Replace of _ALL_ with the list or a subset of variables of interest as needed Comparison: cmiss() will work on both character and numeric variables nmiss() will change everything to numeric missing() will only take one argument 5

  6. Missing Values: Categorical Variables varname = scan(Table, -1); %let p = 1; %let nxt_var = %scan(&cvars., &p); %do %while(&nxt_var. ne); if varname = &nxt_var. and cmiss(&nxt_var.) ~= 0; %let p = %eval(&p. + 1); %let nxt_var = %scan(&cvars., &p); %end; keep varname Frequency Percent; run; %mend; %let cvars = MALE ASA_CLASS TUMOR_LOCATION AJCC_T_STAGE MORTALITY; ods output OneWayFreqs = freq_tab_; proc freq data = crlm_; tables &cvars. /missing;* missprint; run; %macro cvars_mis; data cvars_mis; retain varname Frequency Percent; set cvars_mis_; option mprint symbolgen; %cvars_mis 6

  7. Missing Values: Continuous Variables %let nvars = AGE CEA_AT_META_DX FOLLOW_UP_DAYS; data nvars_miss; retain Variable NMISS N Percent; set nvars_miss_; Percent = round((nmiss/sum(n, nmiss))*100, 0.01); keep Variable NMISS N Percent MIN MAX; run; ods output summary = nvars_miss_; proc means data = crlm_ nmiss n min max maxdec = 2 fw = 20 stackodsoutput; var &nvars.; run; The nvars_miss dataset can be used to identify values that are out of range. Add mean variance median, as well as P25 and P75 as needed to quickly check on variation and skewness. 7

  8. Invalid values /*Use formats if you have some ideas of what to expect*/ proc format; value cs 300-850, . = 'OK'; value dti 0-100, . = 'OK'; run; data _null_; set exp_cred; file print; /*view in output/results window*/ title "Listing Invalid Input"; if put(vantage_v4_score, cs.) ne 'OK' then put PatientID = vantage_v4_score = ; if put(dti1_score_total_debt, dti.) ne 'OK' then put PatientID = dti1_score_total_debt = ; if verify(reh7110,' 0123456789') ne 0 then put PatientID = reh7110 =; if Gender not in ('F','M',' ') then put PatientID = Gender =; run; verify() returns position of first character that is not ' ' or 1-9; 8

  9. Invalid values-Fix Special Case %macro fix_invalids_num; %let i = 1; %do %while(%scan(&vars., &i) ne); %let next_var = %scan(&vars., &i); option mprint symbolgen; %fix_invalids_num The macro can be used to set obs. to missing when the variable is numeric and some obs. are characters (letters or symbol). First save all variables to be updated in a macro variable using %let vars= ; proc format; value &next_var.chk low-high, . = 'OK ; run; proc sql; update cl2.full_dm_cohort set &next_var. = . where put(&next_var.,&next_var.chk.) ne 'OK'; quit; %let i = %eval(&i. + 1); %end; %mend fix_invalids_num; 9

  10. Other Features: Modify existing datasets proc sql; create table bp_2017_qtr1 as select *, scan(VITAL_VALUE,1,'/') as SBP_, substr(VITAL_VALUE, index(VITAL_VALUE, '/') +1) as DBP_ from bp_2017_qtr1_ group by PatientID; alter table cl2.cl_dm_cohort_v1_101422 drop ALCOHOL_H_QTR, DEPRESS_HIST_QTR; alter table cl2.cl_dm_cohort_v1_101422 add DM_A1C_GRP num(8) 'Diabetes classification based on A1C ; alter table cl2.cl_dm_cohort_v1_101422 modify URBANICITY format = URBANICITY., DM_A1C_GRP format = comma5.1; update cl2.cl_dm_cohort_v1_101422 set URBANICITY = 'Metro County where County = 'Franklin City' or URBANICITY = '&Gren_Urb.'; quit; 10

  11. Other Features: modify/subset/clear/inspect proc datasets lib=cl2; /*use to modify dataset*/ modify full_dm_cohort_v2; rename CONSUMER_SEQUENCE = consumersequence; label aua5020 = 'Auto loan or lease - total balance reported in the last 3 months'; run; quit; /*use to remove duplicates and store them in a separate dataset*/ proc sort data = CRED DUPOUT = CRED_DUPOUT nodupkey; by DMRN archive; run; /*use to add unique ID*/ ID = _n_; /*in data step*/ monotonic()as n, 1 + calculated n - min(calculated n) as FLAG_ID 'Order of Observations' /*in proc sql*/ /*use to find obs that starts ('XX%'), ends ('%XX'), or contains ('%XX%'), certain characters*/ where lowcase(drug_name) like ("%scan(&x, &i.)%"); Put(); Input(); /*use to change variable formats*/ proc datasets noprint library=work kill; run; quit; /*use to clear a library*/ 11

  12. Useful links 1. https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/statug/titlepage.htm 2. https://communities.sas.com/ 3. https://v8doc.sas.com/sashtml/macro/index.htm 4. https://www.9to5sas.com/ 5. https://www.listendata.com/ 6. https://users.phhp.ufl.edu/rlp176/Courses/PHC6089/SAS_notes/intro.html 7. https://stats.oarc.ucla.edu/other/mult-pkg/whatstat/ 8. https://people.stat.sc.edu/hitchcock/stat541spring2018.html 9. https://www.lexjansen.com/nesug/ 10. https://www.lexjansen.com/sesug/ 12

Related


More Related Content