Mastering SAS Essentials for Data Analytics - Chapter 11: Analyzing Counts and Tables
In this chapter from "SAS Essentials: Mastering SAS for Data Analytics, 3rd Edition" by Alan Elliott and Wayne Woodward, the focus is on using PROC FREQ in SAS to create one-way and two-way frequency tables, calculate relative risk measures, and assess inter-rater reliability using Cohen's kappa. The chapter provides learning objectives and guidance on obtaining hands-on data files. Key concepts include using PROC FREQ for analyzing count data and understanding various options available for table creation. Explore the practical application and customization of PROC FREQ for data analysis tasks.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
SAS Essentials: Mastering SAS for Data Analytics, 3rdEdition Alan Elliott and Wayne Woodward 1 SAS ESSENTIALS -- Elliott & Woodward
Chapter 11: ANALYZING COUNTS AND TABLES These slides are provided to help you use to teach SAS using SAS Essentials, 3rd Edition (Elliott and Woodward). Feel free to modify them for your own needs. Please send comments about errors in the slides (or suggestions for improvements) to sas@alanelliott.com. Files available at www.alanelliott.com/sas. Thanks 2 SAS ESSENTIALS -- Elliott & Woodward
LEARNING OBJECTIVES To be able to use PROC FREQ to create one-way frequency tables To be able to use PROC FREQ to create two-way (cross- tabulation) tables To be able to use two-by-two contingency tables to calculate relative risk measures To be able to use Cohen's kappa to calculate inter-rater reliability 3 SAS ESSENTIALS -- Elliott & Woodward
Where to Get Hands-On Data Files As discussed in Chapter 1, the files used in the examples are located on the web at http//:www.alanelliott.com/sas Refer to the information in the first tutorial, or in the SAS ESSENTIALS text for additional information on how to download these files from the web, and copy them to you computer. 4 SAS Essentials 3 - Elliott & Woodward
11.1 USING PROC FREQ PROC FREQ is a multipurpose SAS procedure for analyzing count data. It can be used to obtain frequency counts for one or more individual variables or to create two-way tables (cross-tabulations) from two variables. A simplified syntax is PROC FREQ <Options(s)>; <Statements> TABLES requests </options>; 5 SAS ESSENTIALS -- Elliott & Woodward
Table 11.1. Option DATA=dataname Specify which dataset to use ORDER=option Specifies the order in which results are listed in the output table Options are DATA, FORMATTED, FREQ and ORDER. This is illustrated in an upcoming example. PAGE Specifies that only one table will appear per page (not applicable to HTML output.) ALPHA=n Sets the level for confidence limits (default 0.05) COMPRESS Begins the next table on the same page when possible (not applicable to HTML output.) NOPRINT Used when you want to capture output but not display tables. Common Options for PROC FREQ Meaning 6 SAS ESSENTIALS -- Elliott & Woodward
Table 11.2 Common Statements for PROC FREQ Option Meaning EXACT Produces exact p-values for tests. Fisher s Exact Test automatically calculated for a 2x2 table. OUTPUT= dataname Creates an output dataset containing statistics from an analysis WEIGHT variable Identifies a weight variable that contains summarized counts TABLES <variable- combinations/opti ons>; TEST Specifies which tables will be displayed. More information about this statement is given below. Specifies which statistical tests will be performed (Requires a TABLES statement) BY, FORMAT, LABEL, WHERE These statements are common to most procedures, and may be used here. 7 SAS ESSENTIALS -- Elliott & Woodward
11.1.1 The TABLES Statement The TABLES statement is required for all of the examples in this chapter. Its format is: TABLES<variable-combinations/options>; where variable-combinations specifies frequency or cross-tabulation tables. Options for the TABLE statement follow a slash (/). For example, TABLES A*B / CHISQ; requests that the chi-square and related statistics will be reported for the cross-tabulation A*B. 8 SAS ESSENTIALS -- Elliott & Woodward
More about TABLES To obtain counts of the number of subjects observed in each category of group (GP), use the following: PROCFREQ; TABLES GP; RUN; To produce a cross-tabulation of GENDER by treatment GP: PROCFREQ; TABLES GENDER*GP; RUN; The variables specified in the TABLES statement can be either categorical/character or numeric. To request chi-square statistics for a table, include the option /CHISQ at the end of the TABLES statement. For example, PROCFREQ; TABLES GENDER*GP/CHISQ; RUN; 9 SAS ESSENTIALS -- Elliott & Woodward
Table 11.3. Sample TABLES statements Table Specification Description TABLES A; Specifies frequencies for a single variable TABLES A*B; Specifies a crosstabulation between two variables TABLES A*B B*C X*Y; Also, TABLES A*(B C D); Specify several cross-tabulation tables is the same as TABLES A*B A*C A*D; TABLES (A -- C)*X; Use a range of variables in a TABLES statement is the same as TABLES A*X B*X C*X; 10 SAS ESSENTIALS -- Elliott & Woodward
TABLE 11.4 Options for the TABLE Statement Option Description AGREE Request Kappa statistic (inter-rater reliability). RELRISK Requests relative risk calculations FISHER Requests Fisher s Exact test for tables greater than 2 x 2 SPARSE Requests all possible combinations of levels MISSING Requests missing values treated as nonmissing CELLCHI2 Displays the contribution to chi-square NOCOL Suppresses column percentages for each cell NOCUM Suppresses cumulative frequencies NOFREQ Suppresses frequency count for each cell NOPERCENT Suppresses row percentage and column percentage in cross-tabulation tables NOPRINT Suppresses tables but displays statistics NOROW Suppresses row percentage for each cell TESTP=(list) Specifies a test based on percentages (Goodness-of-Fit test.) 11 SAS ESSENTIALS -- Elliott & Woodward
11.2 ANALYZING ONE-WAY FREQUENCY TABLES When count data are collected, you can use PROC FREQ to produce tables of the counts by category as well as to perform statistical analyses on the counts. The number and type of tables produced by the PROC FREQ procedure are specified with a TABLES statement. This section describes how to create tables of counts by category and how to perform a goodness-of-fit test. 12 SAS ESSENTIALS -- Elliott & Woodward
DO HANDS-ON EXAMPLE 11.1 Open the program file AFREQ1.SAS PROCFREQ DATA='C:\SASDATA\SOMEDATA'; TABLES STATUS; TITLE 'Simple Example of PROC FREQ'; RUN; PROCFREQ DATA='C:\SASDATA\SOMEDATA' ORDER=FREQ; TABLES STATUS; TITLE 'Example of PROC FREQ Using ORDER= Option'; RUN; 13 SAS ESSENTIALS -- Elliott & Woodward
Results TABLES STATUS; ORDER=FREQ; TABLES STATUS; 14 SAS ESSENTIALS -- Elliott & Woodward
The ORDER Option Following are some of the PROC FREQ ORDER= options. DATA Orders values by their order in the dataset. FORMATTED Orders values by ascending values determined by a FORMAT statement (see PROC FORMAT in Chapter 4) and next slide. FREQ Orders values by descending frequency count. ORDER Orders values by order in dataset. Exercise: Change the ORDER= to the options above, run the program and observe the output. Complete this example using steps in the text. 15 SAS ESSENTIALS -- Elliott & Woodward
More about the ORDER= Option The ORDER=FORMATTED option for PROC FREQ specifies the order in which the categories are displayed in the table. You must first create a custom format in a PROC FORMAT command to define the order that you want to be used in your output table. For Example: PROCFORMAT; VALUE $FMTRACE "AA"="African American" "H"="Hispanic" "OTH"="Other " "C"="White"; RUN; 16 SAS ESSENTIALS -- Elliott & Woodward
Apply Your Created Format To cause PROC FREQ to display categories in the Formatted order, apply your created FORMAT: The ORDER= option species the order in which the categories are displayed. In this case, they are displayed in FORMATTED order. PROCFREQORDER=FORMATTED DATA=" C: \SASDATA \SURVEY"; TABLES RACE; FORMAT RACE $FMTRACE.; RUN; You must also apply the format to the variable for it to be correctly used. 17 SAS ESSENTIALS -- Elliott & Woodward
DO HANDS-ON EXAMPLE 11.2 The PROC FORMAT defines the format you want to use the PROC FREQ Open the program file AFREQ2.SAS PROCFORMAT; VALUE $FMTRACE "AA"="African American" "H"="Hispanic" "OTH"="Other" "C"="White"; RUN; PROCFREQORDER=FORMATTED DATA="C:\SASDATA\SURVEY"; TABLES RACE; TITLE 'Example of PROC FREQ using OPTION=Formatted'; FORMAT RACE $FMTRACE.; RUN; The ORDER=FORMATTED tells SAS to sort the ouput The FORMAT statement applies the creted format to a variable 18 SAS ESSENTIALS -- Elliott & Woodward
Output for Exercise Complete this example using steps in the text. Note ordered by RACE (alpha) 19 SAS ESSENTIALS -- Elliott & Woodward
11.3 CREATING ONE-WAY FREQUENCY TABLES FROM SUMMARIZED DATA The following example illustrates how to summarize counts from a dataset into a frequency table Suppose your data is in this summarized form: CENTS 152 CENTS 100 NICKELS 49 DIMES 59 QUARTERS 21 HALF 44 DOLLARS 21 This means there are 49 nickels 20 SAS ESSENTIALS -- Elliott & Woodward
DO HANDS-ON EXAMPLE 11.3 Open program file AFREQ3.SAS DATA COINS; INPUT @1 CATEGORY $9. @11NUMBER3.; DATALINES; CENTS 152 CENTS 100 NICKELS 49 etc ; PROCFREQ; WEIGHT NUMBER; TITLE 'Reading Summarized Count data'; TABLES CATEGORY; RUN; WEIGHT NUMBER; tells PROC FREQ that the data for the variable NUMBER are counts. 21 SAS ESSENTIALS -- Elliott & Woodward
Exercise Results Complete this example using steps in the text. 22 SAS ESSENTIALS -- Elliott & Woodward
11.3.1 Testing Goodness of Fit in a One-Way Table A goodness-of-fit test of a single population is a test to determine if the distribution of observed frequencies in the sample data closely matches with the expected number of occurrences under a hypothetical distribution for the population. The hypotheses being tested are as follows: H0: The population follows the hypothesized distribution. Ha : The population does not follow the hypothesized distribution. 23 SAS ESSENTIALS -- Elliott & Woodward
Goodness-of-fit using PROC FREQ A chi-square statistic is calculated, and a decision can be made based on the p-value associated with that statistic. A low p-value indicates that the data do not follow the hypothesized, or theoretical, distribution. If the p-value is sufficiently low (usually <0.05), you will reject the null hypothesis. The syntax to perform a goodness-of-fit test is as follows: PROC FREQ; TABLES variable/ CHISQ TESTP=(list of ratios); 24 SAS ESSENTIALS -- Elliott & Woodward
Goodness-of-fit Example As an example, we ll use data from an experiment by the nineteenth-century monk Gregor Mendel. He began with two different pea features, green and yellow & smooth and wrinkled. He then cross bred them to observe how the offspring inherited the traits. According to a genetic theory, crossbred pea plants show a 9:3:3:1 ratio. From 556 plants, you expect (9/16) x 556 = 312.75 yellow smooth peas (56.25%) (3/16) x 556 = 104.25 yellow wrinkled peas (18.75%) (3/16) x 556 = 104.25 green smooth peas (18.75%) (1/16) x 556 = 34.75 green wrinkled peas (6.25%) 25 SAS ESSENTIALS -- Elliott & Woodward
Actual Observed Data After growing 556 of these pea plants, Mendel observed the following: 315 have yellow smooth peas 108 have yellow wrinkled peas 101 have green smooth peas 32 have green wrinkled peas Is this observation reasonable for the hypothesized ratio? 26 SAS ESSENTIALS -- Elliott & Woodward
DATA GENE; INPUT @1 COLORTYPE $13. @15 NUMBER 3.; DATALINES; YELLOWSMOOTH 315 YELLOWWRINKLE 108 GREENSMOOTH 101 GREENWRINKLE 32 ; DO HANDS-ON EXAMPLE 11.4 Open the program file AFREQ4.SAS Hypothesizing a 9:3:3:1 Ratio: * After entering data GENE; PROCFREQ ORDER=DATA ; WEIGHT NUMBER; TITLE 'GOODNESS OF FIT ANALYSIS'; TABLES COLORTYPE / NOCUM CHISQ TESTP=(0.5625 0.1875 0.1875 0.0625); RUN; Note these proportions are in a 9:3:3:1 ratio 27 SAS ESSENTIALS -- Elliott & Woodward
Note the following components of this SAS program: The data are summarized as indicated by the WEIGHT NUMBER statement in PROC FREQ. The ORDER=DATA option causes the output frequencies to be ordered as they were input into the dataset. Frequencies are based on the variable COLORTYPE. The /NOCUM CHISQ and TESTP= statements request the goodness-of-fit test. The test ratios are based on the percent progeny expected from each of the four categories. The NOCUM option requests a table without the cumulative column. The NOCUM option instructs SAS not to include the Cumulative Frequency column in the table. 28 SAS ESSENTIALS -- Elliott & Woodward
Results of Goodness-of-Fit Note that in this case, the p-value for the chi-square test (p = 0.9254), leads us not to reject the null hypothesis and conclude that the evidence indicates support for the theory that the population exhibits the 9:3:3:1 phenotypic ratios. 29 SAS ESSENTIALS -- Elliott & Woodward
11.4 ANALYZING TWO-WAY TABLES To create a cross-tabulation table using PROC FREQ for relating two variables, use the TABLES statement with both variables listed and separated by an asterisk (*), (e.g., A*B). A cross-tabulation table is formed by counting the number of occurrences in a sample across two grouping variables. The number of columns in a table is usually denoted by c and the number of rows by r. Thus, a table is said to be an r x c table, that is, it has r x c cells. 30 SAS ESSENTIALS -- Elliott & Woodward
Test of Independence The hypotheses associated with a test of independence are as follows: H0: The variables are independent (no association between them). Ha : The variables are not independent. For example, a null hypothesis could be that there is no association between handedness (left and right handed) to hair color. 31 SAS ESSENTIALS -- Elliott & Woodward
Test of Homogeneity The null hypothesis is that the populations have the same distribution (they are homogeneous). In this case, the hypotheses are as follows: H0: The populations are homogeneous. Ha : The populations are not homogeneous. For example, a null hypothesis could be that from two populations, (male and female) the distribution of handedness the same. 32 SAS ESSENTIALS -- Elliott & Woodward
Testing These Hypotheses The chi-square test of independence or homogeneity is reported by PROC FREQ (the tests are mathematically equivalent) by the use of the I CHISQ option in the TABLES statement. For example, PROCFREQ; TABLES GENDER*GP/ CHISQ; Use the same code to test for either independence or homogeneity. 33 SAS ESSENTIALS -- Elliott & Woodward
Summarized Data for the Crime Case CRIME DRINKER COUNT Arson 1 50 Arson 0 43 Rape 1 88 Rape 0 62 Violence 1 155 Violence 0 110 Stealing 1 379 Stealing 0 300 Coining 1 18 Coining 0 14 Fraud 1 63 Fraud 0 144 Data are in summarized for. For Arson, there were 50 Drinkers (DRINKER=1) and 43 Non-Drinkers) (DRINKER=0) Data comes from a study performed by Karl Pearson in 1909 involving the relationship between criminal behaviors and drinking alcoholic beverages. The category Coining refers to counterfeiting. 34 SAS ESSENTIALS -- Elliott & Woodward
DO HANDS-ON EXAMPLE 11.5 Open the program file AFREQ5.SAS Note that WEIGHT COUNT; is needed since the data are in summary form. PROCFREQ DATA=DRINKERS; WEIGHT COUNT; TABLES CRIME*DRINKER/CHISQ; TITLE 'Chi Square Analysis of a Contingency Table'; RUN; 35 SAS ESSENTIALS -- Elliott & Woodward
Example Output This part of the output shows the data. The four numbers in each cell are Count Total Percent Row Percent Column Percent 36 SAS ESSENTIALS -- Elliott & Woodward
Results of Chi Square Analysis The Chi-Square value is 49.73 and the p-value is p < 0.0001. Thus, you reject the null hypothesis of no association (independence) and conclude that there is evidence of a relationship between drinking status and type of crime committed. Complete this example using steps in the text. 37 SAS ESSENTIALS -- Elliott & Woodward
11.4.1 Creating a Contingency Table from Raw Data, the 2 x 2 Case In the previous example (CRIME) the data were in summary form, and you needed to use the WEIGHT COUNT; statement to reflect that. If your data are in raw form one record per observation, you do not need the WEIGHT statement. For this data, each subject has one record thus you have one record per observation. 38 SAS ESSENTIALS -- Elliott & Woodward
DO HANDS-ON EXAMPLE 11.6 Open the program file AFREQ6.SAS Data are in the file RASH. PROC FREQ DATA="C:\SASDATA\RASH"; TABLES CLEANER*RASH /CHISQ; TITLE 'CHI-SQUARE ANALYSIS FOR A 2X2 TABLE'; RUN; 39 SAS ESSENTIALS -- Elliott & Woodward
Output from 2x2 Chi-Square Analysis Statistical Results note in particular the Chi-Square and Fisher Two-Sided Pr<=p values. The Resulting Table of counts: Chi-square = 8.29, p = 0.004, indicates an association between CLEANER and RASH (rejects the null hypothesis). The two- sided Fisher results p = 0.0095 provides the same decision. 40 SAS ESSENTIALS -- Elliott & Woodward
11.4.2 Tables with Small Counts in Cells When you summarize counts in tables, and there are small numbers in one or more cells, a typical chi-square statistical analysis may not be valid. For example, suppose you have sampled freshmen and sophomores at a college to see if they took advantage of the Learning Enhancement Center (LEC) tutorials and you want to know if the proportion using the center differs by class. 41 SAS ESSENTIALS -- Elliott & Woodward
DO HANDS-ON EXAMPLE 11.7 Open program file AFREQ7.SAS DATA LEARN; INPUT CLASS $ LEC $ COUNT; DATALINES; S Y 4 S N 11 F Y 5 F N 3 ; PROCFREQ DATA=LEARN;WEIGHT COUNT; TABLES CLASS*LEC/CHISQ ; TITLE 'Chi Square Analysis of a Contingency Table'; RUN; WEIGHT specifies data are summarized The CHISQ option requests the CHI Square test 42 SAS ESSENTIALS -- Elliott & Woodward
Results Note the small numbers in the cells: 3 5 and 4 43 SAS ESSENTIALS -- Elliott & Woodward
Run the code Observe the warning message "WARNlNG: 50% of the cells have expected counts <5. Chi- square may not be a valid test. In this case, the Fisher's Exact test is the more reliable test and should be used instead of the Chi-Square test. Either way, there s not enough evidence to reject the null hypothesis. 44 SAS ESSENTIALS -- Elliott & Woodward
11.5 GOING DEEPER: CALCULATING RELATIVE RISK MEASURES Two-by-two contingency tables are often used when examining a measure of risk. A measure of this risk in a retrospective (case-control) study is called the odds ratio (OR). In a case- control study, a researcher takes a sample of subjects and looks back in time for exposure (or nonexposure). If the data are collected prospectively, where subjects are selected by presence or absence of a risk and then observed over time to see if they develop an outcome, the measure of risk is called relative risk (RR). Either way RR=1 or OR=1 means no risk observed. 45 SAS ESSENTIALS -- Elliott & Woodward
Testing Relative Risk in PROC FREQ In PROC FREQ, the option to calculate the values for OR or RR is RELRISK and appears as an option to the TABLES statement as shown here (for the RASH data): TABLES CLEANER*RASH /RELRISK; In the results, a risk measure > 1 indicates that exposure is harmful and a risk measure <1 implies that exposure is a benefit. 46 SAS ESSENTIALS -- Elliott & Woodward
DO HANDS-ON EXAMPLE 11.8 Open the program file AFREQ6.SAS (as used in a previous example.) Change the TABLES statement to TABLES CLEANER*RASH /RELRISK; Run the revised program. 47 SAS ESSENTIALS -- Elliott & Woodward
Results of Risk Analysis The OR= 0.1346 specifies the odds of Row1/Row2 - that is, for cleaner 1 versus cleaner 2. Because OR is <1, this indicates that the odds of a person's having a rash who is using cleaner 1 is less than they are when the person is using cleaner 2. Typically, the Odds Ratio is the statistic of interest 48 SAS ESSENTIALS -- Elliott & Woodward
11.6 GOING DEEPER: INTER-RATER RELIABILITY (KAPPA) A method for assessing the degree of agreement between two raters is Cohen's kappa coefficient. For example, kappa is useful for analyzing the consistency of two raters who evaluate subjects on the basis of a categorical measurement. Two Raters A and B compared 49 SAS ESSENTIALS -- Elliott & Woodward
DO HANDS-ON EXAMPLE 11.9 Open the program file AKAPPA1.SAS DATA KAPPA; INPUT RATER1 RATER2 WT; DATALINES; 1 1 1 2 1 3 2 1 2 2 2 3 3 1 3 2 3 3 ; First, read in the data and name the dataset KAPPA. 75 1 4 5 4 1 0 0 10 * Continues next slide ; 50 SAS ESSENTIALS -- Elliott & Woodward