Recoding II: Numerical & Graphical Descriptives in Epidemiology

In this lecture, we delve into the process of recoding key variables in a dataset related to births. The focus is on understanding the functions for numeric descriptive statistics and base graphics. The session includes a review of key variables such as Exposure (Early prenatal care - mdif), Outcome (Preterm birth - wksgest), and Covariates for HW #1. Practical examples and guidance are provided on how to recode variables effectively. Through illustrations and analyses, participants learn to manipulate and interpret numerical and graphical descriptives. Techniques for creating numeric and factor variables are demonstrated to enhance data analysis skills.

  • Epidemiology
  • Recoding
  • Numerical Statistics
  • Graphical Descriptives
  • Data Analysis

Uploaded on Mar 13, 2025


Presentation Transcript


  1. Recoding II: Numerical & Graphical Descriptives
     EPID 701, Lecture 6
     Tuesday, 28 January 2020
     Thank you Sarah Levintow! (EPID799C, Fall 2018)

  2. Today's Overview: Review!
     • Recoding review: Key variables in births dataset
     • Functions for numeric descriptive statistics
     • Functions for base graphics

  3. Getting Started (HW1 Q2)
     # Set your path
     setwd("./../R for epi 2020 data pack")
     # Read in data
     births <- read.csv("births2012.csv", stringsAsFactors = FALSE, header = TRUE) # or read_csv
     # Variable names in lowercase
     names(births) <- tolower(names(births))
     # Load packages

  4. Recoding Review: Key variables
     • Exposure: Early prenatal care using mdif
     • Outcome: Preterm birth using wksgest
     • Covariates for HW #1:
         • Maternal age using mage
         • Cigarette use using cigdur
         • Date of birth using dob
         • Infant sex using sex
         • Maternal race using mrace and methnic

  5. Exposure: Early prenatal care (HW1 Q3)
     What do we know about the variable mdif? How do we want to recode it?
     > summary(births$mdif)
        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       1.000   2.000   3.000   6.183   4.000  99.000
     > table(births$mdif, useNA = "always")
         1     2     3     4     5     6     7     8     9    88    99  <NA>
      9292 36987 41252 14783  7142  3984  2336  1556   971  2055  2155     0

  6. Exposure: Early prenatal care (HW1 Q5a,b)
     One option for creating numeric variable pnc5 from mdif:
     births$mdif[births$mdif==99] <- NA
     births$pnc5 <- ifelse(births$mdif<=5, 1, 0)
     # Check your work!
     > table(births$mdif, births$pnc5, useNA = "always")
                  0     1  <NA>
       1          0  9292     0
       2          0 36987     0
       3          0 41252     0
       4          0 14783     0
       5          0  7142     0
       6       3984     0     0
       7       2336     0     0
       8       1556     0     0
       9        971     0     0
       88      2055     0     0
       <NA>       0     0  2155
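
The two-step recode on this slide can be tried on a small vector before running it on the full dataset. The sketch below uses invented values standing in for births$mdif; only the coding scheme (99 as the missing code, months 1 to 5 counted as early care) comes from the slides.

```r
# Toy stand-in for births$mdif (values invented for illustration only)
mdif <- c(1, 3, 5, 6, 9, 88, 99)

# Step 1: treat the sentinel code 99 as missing
mdif[mdif == 99] <- NA

# Step 2: 1 if mdif is 5 or lower, 0 otherwise; NA stays NA
pnc5 <- ifelse(mdif <= 5, 1, 0)

# Step 3: check the recode against the source variable
table(mdif, pnc5, useNA = "always")
```

The order of the steps matters: if 99 were still in the vector at step 2, `99 <= 5` would be FALSE and unknown values would be silently counted as 0 rather than as missing.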

  7. Exposure: Early prenatal care (HW1 Q5a,b)
     One option for creating factor variable pnc5_f from pnc5:
     births$pnc5_f <- factor(births$pnc5, levels = c(0, 1), labels = c("No Early PNC", "Early PNC"))
     > table(births$mdif, births$pnc5_f, useNA = "always")
             No Early PNC Early PNC  <NA>
       1                0      9292     0
       2                0     36987     0
       3                0     41252     0
       4                0     14783     0
       5                0      7142     0
       6             3984         0     0
       7             2336         0     0
       8             1556         0     0
       9              971         0     0
       88            2055         0     0
       <NA>             0         0  2155
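
As a small illustration of how factor() pairs levels with labels, the sketch below uses an invented four-value vector in place of births$pnc5. The levels and labels are the ones on the slide.

```r
# Toy stand-in for births$pnc5 (values invented for illustration only)
pnc5 <- c(1, 0, NA, 1)

# levels = the values present in the numeric variable, in the order wanted;
# labels = the text shown for each level, matched by position
pnc5_f <- factor(pnc5, levels = c(0, 1),
                 labels = c("No Early PNC", "Early PNC"))

levels(pnc5_f)                           # reference level comes first
table(pnc5, pnc5_f, useNA = "always")    # numeric and factor versions should line up
```

Missing values in the numeric variable remain missing in the factor; they do not become a level unless you ask for that explicitly.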

  8. Outcome: Preterm birth (HW1 Q3)
     What do we know about the variable wksgest? How do we want to recode it?
     > summary(births$wksgest)
        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       18.00   38.00   39.00   38.83   40.00   99.00
     > table(births$wksgest, useNA = "always")
        18    19    20    21    22    23    24    25    26    27    28    29    30    31    32    33
        13    26    52    57    94   106   134   154   162   178   221   272   366   560   793  1154
        34    35    36    37    38    39    40    41    42    43    44    45    99  <NA>
      1902  2930  5020  9855 18306 33716 24316 12642  4492  2506  1505   844   137     0

  9. Outcome: Preterm birth (HW1 Q5a,b)
     One option for creating numeric variable preterm from wksgest:
     births$wksgest[births$wksgest==99] <- NA
     births$preterm <- ifelse(births$wksgest<37, 1, 0)
     table(births$wksgest, births$preterm, useNA = "always")

  11. Outcome: Preterm birth (HW1 Q5a,b)
      One option for creating numeric variable preterm from wksgest:
      births$wksgest[births$wksgest==99] <- NA
      births$preterm <- ifelse(births$wksgest<37, 1, 0)
      > table(births$wksgest, births$preterm, useNA = "always")
                   0     1  <NA>
        18         0    13     0
        19         0    26     0
        20         0    52     0
        21         0    57     0
        22         0    94     0
        23         0   106     0
        24         0   134     0
        25         0   154     0
        26         0   162     0
        27         0   178     0
        28         0   221     0
        29         0   272     0
        30         0   366     0
        31         0   560     0
        32         0   793     0
        33         0  1154     0
        34         0  1902     0
        35         0  2930     0
        36         0  5020     0
        37      9855     0     0
        38     18306     0     0
        39     33716     0     0
        40     24316     0     0
        41     12642     0     0
        42      4492     0     0
        43      2506     0     0
        44      1505     0     0
        45       844     0     0
        <NA>       0     0   137

  12. Outcome: Preterm birth (HW1 Q5a,b)
      One option for creating factor variable preterm_f from preterm:
      births$preterm_f <- factor(births$preterm, levels = c(0, 1), labels = c("Term", "Preterm"))
      table(births$wksgest, births$preterm_f, useNA = "always")

  13. Outcome: Preterm birth (HW1 Q5a,b)
      One option for creating factor variable preterm_f from preterm:
      births$preterm_f <- factor(births$preterm, levels = c(0, 1), labels = c("Term", "Preterm"))
      table(births$wksgest, births$preterm_f, useNA = "always")
      The output shown on this slide repeats the cross-tabulation of wksgest by preterm from slide 11: weeks 18 to 36 are all coded 1, weeks 37 to 45 are all coded 0, and 137 records are missing on both variables.

  14. Covariate: Maternal age
      Your turn: Check out the distribution of mage.
      Question for the Google Doc: What is the mean maternal age in our dataset?
      Think about how you might recode this variable for any analysis.

  15. Covariate: Maternal age
      Your turn: Check out the distribution of mage. How might you recode this variable?
      # For now: recode 99 as missing
      births$mage[births$mage==99] <- NA
      # For later: think about specifying mage in models
      Numeric? Factor? Higher order terms or splines?
      > summary(births$mage)
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        10.00   23.00   27.00   27.57   32.00   99.00
      > births$mage[births$mage==99] <- NA
      > summary(births$mage)
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
        10.00   23.00   27.00   27.56   32.00   55.00      22
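
If maternal age is later wanted as a factor, one common base-R route is cut(). The sketch below is only an illustration: the toy ages, the cut points (20, 30, 40), and the name mage_cat are assumptions, not choices made in the lecture.

```r
# Toy stand-in for births$mage (values invented for illustration only)
mage <- c(17, 22, 29, 35, 41, 99)

# Recode the sentinel value first, as on the slide
mage[mage == 99] <- NA

# Hypothetical age bands; right = FALSE makes intervals closed on the left,
# so 20 falls in "20-29" rather than "<20"
mage_cat <- cut(mage,
                breaks = c(-Inf, 20, 30, 40, Inf),
                right  = FALSE,
                labels = c("<20", "20-29", "30-39", "40+"))

# Check the recode against the numeric variable
table(mage, mage_cat, useNA = "always")
```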

  16. Functions in R
      • Functions are R objects (just like everything else!).
      • Nearly everything in R is done through functions.
      • There are many built-in functions that you are already familiar with.
      • A strength of R is the user's ability to write their own functions.
      Resources for getting started:
      https://www.statmethods.net/management/functions.html
      https://www.statmethods.net/management/userfunctions.html
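
As a sketch of what a user-written function can look like, the recode-to-missing step used throughout this lecture could be wrapped up as below. The function name recode_missing and its arguments are hypothetical and do not come from the lecture or from any package.

```r
# Hypothetical helper: replace any of the given sentinel codes with NA
recode_missing <- function(x, codes) {
  x[x %in% codes] <- NA   # %in% never returns NA, so existing NAs are left alone
  x                       # the last evaluated expression is the return value
}

# Usage on toy vectors (values invented for illustration only)
recode_missing(c(1, 99, 3), codes = 99)
recode_missing(c(2, 88, 99, NA), codes = c(88, 99))
```

Because functions are ordinary objects, typing recode_missing without parentheses prints its definition.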

  17. Functions for Numeric Descriptive Statistics
      Our focus today is on built-in functions for summarizing continuous or categorical variables:
      • A variable is continuous if it can take any of an infinite set of ordered values. Represented as vectors of numbers or date-times.
      • A variable is categorical if it can only take one of a small set of values. Represented as factors or vectors of characters.
      (From R for Data Science, http://r4ds.had.co.nz/)

  18. Functions for Continuous Variables
      Measures of Centrality: mean(), median(), mode()
      Measures of Spread: min(), max(), quantile(), range(), sd(), var()
      Note: in base R, mode() returns an object's storage mode (for example "numeric"), not its most frequent value.

  19. Functions for Continuous Variables
      Measures of Centrality: mean(), median(), mode()
      Measures of Spread: min(), max(), quantile(), range(), sd(), var()
      General syntax: function(dataframe$variable)
      Example: mean(births$mage)
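
Base R has no built-in function for the statistical mode, and mode() reports how a vector is stored instead. A minimal sketch of a helper that returns the most frequent value(s) follows; the name stat_mode is hypothetical and the data are invented.

```r
# Toy ages (values invented for illustration only)
x <- c(23, 27, 27, 31, NA)

mode(x)   # storage mode of the vector, not the most frequent value

# Hypothetical helper: most frequent non-missing value(s) of a numeric vector
stat_mode <- function(x) {
  x <- x[!is.na(x)]                                  # drop missing values
  counts <- table(x)                                 # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])   # keep all ties
}

stat_mode(x)
```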

  20. Functions for Continuous Variables
      Measures of Centrality: mean(), median(), mode()
      Measures of Spread: min(), max(), quantile(), range(), sd(), var()
      What happened?
      > mean(births$mage)
      [1] NA

  21. Functions for Continuous Variables
      Measures of Centrality: mean(), median(), mode()
      Measures of Spread: min(), max(), quantile(), range(), sd(), var()
      General syntax: function(dataframe$variable)
      If there is missing data: function(dataframe$variable, na.rm = TRUE)

  22. Functions for Continuous Variables
      Measures of Centrality: mean(), median(), mode()
      Measures of Spread: min(), max(), quantile(), range(), sd(), var()
      > mean(births$mage)
      [1] NA
      > mean(births$mage, na.rm=T)
      [1] 27.55683
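
The same na.rm pattern applies to the other summary functions listed on these slides. A short sketch on an invented vector:

```r
# Toy stand-in for births$mage with one missing value (invented data)
mage <- c(19, 24, 31, NA)

mean(mage)                   # NA: a single missing value propagates
mean(mage, na.rm = TRUE)     # mean of the three observed values
median(mage, na.rm = TRUE)
sd(mage, na.rm = TRUE)
range(mage, na.rm = TRUE)

# Always worth knowing how many values were dropped
sum(is.na(mage))
```

Dropping missing values is a complete-case summary, so it helps to report the number of NAs alongside the statistic.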

  23. Functions for Continuous or Categorical Variables
      • table()
      • summary()
      • Hmisc::describe()
      • tableone::CreateTableOne()

  24. table()
      • Produces a contingency table of counts.
      • If given >1 argument, gives the counts at each combination of levels.
      • useNA = "always" (always!)
      > table(births$pnc5_f, births$preterm_f, useNA="always")
                       Term Preterm  <NA>
        No Early PNC   9326    1514    62
        Early PNC     97113   12297    46
        <NA>           1743     383    29
      > table(births$pnc5_f, useNA="always")
      No Early PNC    Early PNC         <NA>
             10902       109456         2155
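
Counts are often easier to read as proportions. The sketch below pairs table() with prop.table() on invented data; the variable names and values are illustrative only and are not taken from the births file.

```r
# Toy exposure and outcome (values invented for illustration only)
pnc     <- factor(c("Early", "Early", "Late", "Late", NA, "Early"))
preterm <- factor(c("Term", "Preterm", "Term", "Term", "Term", NA))

# Counts including missing values, as recommended on the slide
tab <- table(pnc, preterm, useNA = "always")
tab

# Row proportions among complete cases: share preterm within each exposure group
tab_cc    <- table(pnc, preterm)
row_props <- prop.table(tab_cc, margin = 1)
row_props
```

Here margin = 1 divides each cell by its row total and margin = 2 by its column total. Omitting margin divides by the grand total.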

  25. summary()
      Generic function that produces result summaries depending on the class of the first argument.
      > summary(births$mage)
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
        10.00   23.00   27.00   27.56   32.00   55.00      22
      > summary(births$mage[births$preterm_f=="Preterm"])
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
        13.00   23.00   27.00   27.71   32.00   50.00     144
      > summary(births$mage[births$preterm_f=="Term"])
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
        10.00   23.00   27.00   27.54   32.00   55.00     151

  26. summary()
      Generic function that produces result summaries depending on the class of the first argument.
      The three summaries from slide 25 are shown again: NA's are 22 for births$mage overall, 144 for the "Preterm" subset, and 151 for the "Term" subset.
      Wait... I thought there were only 22 missing ages. What's going on?
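
One explanation for the extra NAs is how R handles missing values inside a logical index: where the condition itself is NA (here, where preterm_f is missing), the subset gains an NA element regardless of the age value. A small sketch with invented data shows the effect and one way to avoid it.

```r
# Toy data (values invented for illustration only)
mage    <- c(20, 25, NA, 30, 35)
preterm <- c("Preterm", "Term", "Preterm", NA, NA)

sum(is.na(mage))                      # one truly missing age

# Logical index containing NA: each NA in the condition yields an NA element
sub_bad <- mage[preterm == "Preterm"]
sub_bad
sum(is.na(sub_bad))                   # inflated by rows where preterm is NA

# which() keeps only positions where the condition is TRUE
sub_ok <- mage[which(preterm == "Preterm")]
sub_ok
sum(is.na(sub_ok))                    # only the missing age within the subset
```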

  27. summary()
      Results summary when the argument is a dataframe.
      > summary(births)
             x             deltype        bedcode             stocc               coocc
       Min.   :     1   Min.   :1.000   Length:122513      Length:122513      Min.   :  1.0
       1st Qu.: 30629   1st Qu.:1.000   Class :character   Class :character   1st Qu.: 63.0
       Median : 61257   Median :1.000   Mode  :character   Mode  :character   Median :119.0
       Mean   : 61257   Mean   :1.016                                         Mean   :115.5
       3rd Qu.: 91885   3rd Qu.:1.000                                         3rd Qu.:147.0
       Max.   :122513   Max.   :7.000                                         Max.   :999.0
          cityocc         stres               cores          cityresf         dob
       Min.   :    0   Length:122513      Min.   :  1.0   Min.   :    0   Length:122513
       1st Qu.:14100   Class :character   1st Qu.: 63.0   1st Qu.:    0   Class :character
       Median :28080   Mode  :character   Median :119.0   Median :15320   Mode  :character
       Mean   :35263                      Mean   :125.9   Mean   :25655
       3rd Qu.:55000                      3rd Qu.:155.0   3rd Qu.:43920
       Max.   :99999                      Max.   :999.0   Max.   :99999
       NA's   :52 (the column this count belongs to is not recoverable from the transcript)

  28. summary() is flexible & takes different classes of objects
      Looking ahead: summary() for model output
      > model1 <- glm(pnc5_f ~ marital, family=binomial("identity"), data=births)
      > summary(model1)
      Call:
      glm(formula = pnc5_f ~ marital, family = binomial("identity"), data = births)
      Deviance Residuals:
          Min       1Q   Median       3Q      Max
      -2.3579   0.3579   0.3579   0.5225   1.3286
      Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
      (Intercept)  1.003484   0.002193  457.57   <2e-16 ***
      marital     -0.065531   0.001617  -40.52   <2e-16 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      (Dispersion parameter for binomial family taken to be 1)
          Null deviance: 73148  on 120357  degrees of freedom
      Residual deviance: 71387  on 120356  degrees of freedom
        (2155 observations deleted due to missingness)
      AIC: 71391
      Number of Fisher Scoring iterations: 5

  29. tableone package
      Create an object summarizing characteristics of study population (continuous and categorical variables).
      Key arguments for CreateTableOne() function:
      • vars = variable names to be summarized
      • strata = stratifying variable name (optional)
      • data = dataframe where all variables exist
      Other arguments and more details here: https://cran.r-project.org/web/packages/tableone/tableone.pdf
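
A minimal sketch of the three key arguments in use, assuming the tableone package is installed (the call is skipped otherwise). The toy data frame and its values are invented; test = FALSE is passed only because hypothesis tests are not meaningful on eight made-up rows.

```r
# Toy data frame (values invented for illustration only)
toy <- data.frame(
  mage      = c(19, 24, 31, 28, 35, 22, 40, 27),
  preterm_f = factor(c("Term", "Preterm", "Term", "Term",
                       "Preterm", "Term", "Term", "Term")),
  smoker_f  = factor(c("Smoker", "Non-smoker", "Non-smoker", "Smoker",
                       "Non-smoker", "Smoker", "Non-smoker", "Non-smoker"))
)

# Run only if the package is available
if (requireNamespace("tableone", quietly = TRUE)) {
  tab1 <- tableone::CreateTableOne(
    vars   = c("mage", "preterm_f"),   # variables to summarize
    strata = "smoker_f",               # stratifying variable (optional)
    data   = toy,                      # data frame holding all of the above
    test   = FALSE                     # skip p-values for this toy example
  )
  print(tab1)
}
```

Factor columns are summarized as counts and percentages and numeric columns as mean (SD), which is why creating the _f variables earlier in the lecture pays off here.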

  30. Graphics in R
      • In R, there are a lot of ways to do the same thing, and this is especially true for visualization.
      • Base graphics are R's built-in functionality for charts and graphs.
      • R packages extend this functionality, with ggplot being the most popular.
      • Today is a brief tour of base graphics, but most of the semester, we'll use ggplot.

  31. Functions for Base Graphics
      Main Functions: plot(), hist(), barplot(), boxplot(), dotchart()
      Key Arguments:
      • Data to plot (x=, y=)
      • Add text (xlab=, ylab=, main=)
      • Add color, symbol, line (col=, pch=, lty=, lwd=)
      Helpers: jitter(), density(), abline()
      Base Graphics Cheat Sheet: http://publish.illinois.edu/johnrgallagher/files/2015/10/BaseGraphicsCheatsheet.pdf

  32. plot() examples
      b <- births[1:10000,] # smaller dataset for faster plotting
      plot(b$mage, b$wksgest) # basic plot
      # add axis labels
      plot(b$mage, b$wksgest, xlab="Maternal age in years", ylab = "Gestational age in weeks")
      # add jitter to both mage and wksgest to reduce overplotting
      plot(jitter(b$mage), jitter(b$wksgest), pch=".", xlab="Maternal age in years", ylab = "Gestational age in weeks")
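
To see why jitter() helps, note that age in years and gestation in weeks are whole numbers, so many births share exactly the same plotting position. The sketch below uses simulated integers (not the births file) to show how few distinct positions exist before jittering.

```r
set.seed(701)  # arbitrary seed so the simulated example is reproducible

# Simulated whole-number data standing in for mage and wksgest
age   <- sample(18:40, 500, replace = TRUE)
weeks <- sample(34:42, 500, replace = TRUE)

# 500 points, but far fewer distinct (age, weeks) positions on the plot
nrow(unique(data.frame(age, weeks)))

# jitter() adds a small amount of random noise to each value
age_j   <- jitter(age)
weeks_j <- jitter(weeks)

plot(age_j, weeks_j, pch = ".",
     xlab = "Maternal age in years", ylab = "Gestational age in weeks")
```

The noise is for display only; summaries and models should still use the original variables.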

  33. [Figure: plot output; image not included in the transcript]

  34. # color points by cigarette use and add reference lines at the means
      color <- rep(NA, nrow(b))
      color[b$cigdur == "Y"] <- "red"
      color[b$cigdur == "N"] <- "blue"
      plot(jitter(b$mage), jitter(b$wksgest), pch=".", col=color, xlab="Maternal age in years", ylab = "Gestational age in weeks")
      abline(v=mean(b$mage, na.rm=TRUE))
      abline(h=mean(b$wksgest, na.rm=TRUE))
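
The color vector above can also be built with a named lookup vector, which scales more easily when there are several categories. This is an alternative sketch, not the approach used on the slide; the toy values, including the "U" code, are assumptions for illustration.

```r
# Toy stand-in for b$cigdur (values invented; "U" is an assumed third code)
cigdur <- c("Y", "N", "U", NA, "N")

# Named vector: names are the category codes, values are the colors
pal <- c(Y = "red", N = "blue")

# Index the palette by code; codes without an entry (and NA) come back as NA
color <- unname(pal[cigdur])
color
```

As with the slide's version, observations with an NA color are left out of the plot, so it is worth tabulating how many there are.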

  35. [Figure: plot output; image not included in the transcript]

  36. boxplot()
      boxplot(births$mage) # mage overall
      boxplot(births$mage ~ births$pnc5_f) # mage by pnc

  37. hist()
      hist(births$mage)
      hist(births$mage, breaks = 40, main = "Histogram of Maternal Age", xlab = "Age in years")

  38. plot() and barplot() with table object
      table(births$cigdur, births$pnc5_f) # 3x2 table
      smoke_by_pnc <- table(births$cigdur, births$pnc5_f) # saved as table object
      smoke_by_pnc
      plot(smoke_by_pnc)
      barplot(smoke_by_pnc, legend.text=TRUE)
      Very ugly!!! Just to give you a sense of the most basic plots!
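
One way to make the bar chart easier to read is to plot proportions instead of counts and place the bars side by side. The sketch below uses invented data and is only one of many possible refinements.

```r
# Toy smoking and prenatal care variables (values invented for illustration only)
smoke <- factor(c("N", "N", "Y", "N", "Y", "N"))
pnc   <- factor(c("Early", "Early", "Late", "Late", "Early", "Early"))

tab <- table(smoke, pnc)

# Column proportions: distribution of smoking within each PNC group
col_props <- prop.table(tab, margin = 2)
col_props

# beside = TRUE draws grouped rather than stacked bars
barplot(col_props, beside = TRUE, legend.text = TRUE,
        ylab = "Proportion within PNC group",
        main = "Smoking by prenatal care (toy data)")
```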

  39. plot() with density()
      plot(density(births$mage, na.rm=TRUE), main = "Density Plot of Maternal Age")

  40. plot() with a dataframe
      births_sample <- births[sample(nrow(births), 1000), c("mage", "mdif", "wksgest")]
      plot(births_sample)

  41. Practice with Recoding
      1. Today, we recoded the main exposure and outcome variables and the covariate maternal age. Now, let's recode the covariate cigarette use. Take a look at the existing variable cigdur using the table() or describe() function.
      2. Recode the existing character variable cigdur to a new integer variable smoke, so that it is coded as 1 for "Y", 0 for "N", and missing (NA) otherwise.
      3. Convert that integer variable smoke to a factor variable smoker_f with levels "Smoker" and "Non-smoker".

  42. Practice with Functions
      4. What is the mean maternal age among smokers? Among non-smokers?
      5. What proportion of births were preterm among smokers? Among non-smokers?
      6. Use the CreateTableOne() function to summarize early prenatal care, preterm births, and maternal age, stratified by smoking status.
      7. With a plotting function of your choice, visualize the relationship between smoking status and the original weeks of gestation variable (wksgest).

  43. Plan for Thursday
      Keep recoding!
