Managing Data Sets in R - Learn, Analyze, Visualize

Dive into managing data sets in R with Nicole Lama. Discover how to open, manipulate, and analyze large data sets, understand big data vs. large data sets, and overcome challenges when working with R. Explore the basics of visualizing data and utilizing R for statistical analysis using various file types.

  • R Programming
  • Data Analysis
  • Data Visualization
  • Statistical Analysis

Presentation Transcript


  1. Training Up: Managing Data Sets in R Nicole Lama

  2. Goals
  • Open large data sets of various types in R
  • Manipulate and access data in R
  • Learn how to do basic statistical analysis
  • Basics of visualizing the data

  3. Big Data vs. Large Data Set
  • Big data: a data set that will not easily fit into the available RAM of a system
  • Large data set: data that is generally cumbersome to work with because of its size

  4. Data Size
  • Medium data sets: < 2 GB
  • Large data sets: ~2-10 GB
  • Big data: > 10 GB; requires distributed, large-scale computing

  5. R Cons
  • Uses only one core
  • Reads all data into memory rather than reading it on demand
  • Can slow down your computer

  6. Why R for Big Data?
  • R has libraries for just about any statistical analysis possible
  • R is open source (woo hoo! Free software!)

  7. File Types
  • .tsv: tab-separated values
  • .csv: comma-separated values
  • .txt: generic text file, often separated by spaces
  • .json: JavaScript Object Notation

  8. Opening File Types
  • .tsv: read.table("myFile.tsv", sep = "\t")
  • .csv: read.csv("myFile.csv")
  • .txt: read.table("myFile.txt", sep = "")  # sep = "" means any whitespace
  • .json: fromJSON(file = "myFile.json")  # from the rjson library

  9. Opening File Types: UH OH! My data is...
  • Throwing an "out of workspace" error
  • Taking forever to load!
  • Taking forever to run an analysis on

  10. Opening File Types: UH OH! My data is...
  • Throwing an "out of workspace" error*
  • Taking forever to load!
  • Taking forever to run an analysis on
  • Issue: you have too much data
  • *"Out of workspace" can also mean that the values your script/function is working with are too large. Use significant values, use a different function, or transform your values before proceeding.

  11. Too Much Data: some easy fixes
  • Down-sample your data with sample(), as in the sketch below
  • Select only the data that is relevant to your analysis; maybe you only need two columns of the data
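
  A minimal sketch of both fixes, assuming a data frame df is already loaded and using hypothetical column names school and score:

    set.seed(1)                                   # make the down-sample reproducible
    keep <- sample(nrow(df), size = floor(0.1 * nrow(df)))
    df_small <- df[keep, ]                        # keep a random 10% of the rows
    df_small <- df_small[, c("school", "score")]  # keep only the columns you need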

  12. Too Much Data: some easy fixes
  • Use the data.table package: like data.frame, but more efficient
  • Split your data into chunks: just read in ~200 MB of data at a time (see the sketch below)
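
  A sketch of both ideas, assuming a hypothetical file myFile.csv; fread() from data.table is much faster than read.csv(), and skip/nrows let base R pull in one chunk at a time:

    library(data.table)
    dt <- fread("myFile.csv")        # efficient data.table instead of a plain data.frame

    # Chunked reading with base R: 200,000 rows per pass (tune this toward ~200 MB)
    hdr   <- names(read.csv("myFile.csv", nrows = 1))
    chunk <- read.csv("myFile.csv", skip = 1, nrows = 200000,
                      header = FALSE, col.names = hdr)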

  13. Working with Data Sets < 2 GB
  • Pre-define the column classes and specify the number of rows:
    bigfile.sample <- read.csv("data/SAT_Results2014.csv", stringsAsFactors = FALSE, header = TRUE, nrows = 20)
    bigfile.colclass <- sapply(bigfile.sample, class)
    bigfile.raw <- tbl_df(read.csv("data/SAT_Results2014.csv", stringsAsFactors = FALSE, header = TRUE, nrows = 10000, colClasses = bigfile.colclass, comment.char = ""))  # tbl_df() comes from dplyr
  https://rpubs.com/msundar/large_data_analysis

  14. Working with Data Sets ~2-10 GB
  • Use the bigmemory library (with biganalytics and bigtabulate for manipulation)
  • Find more computational power, especially if the file size is approaching 10 GB or more
  https://rpubs.com/msundar/large_data_analysis

  15. Still Too Slow?
  • Profile (time your functions) with system.time(); manipulate the function or its arguments until the run time is faster
  • Byte-compile a function with cmpfun() from the compiler package, so the function is compiled only once (see the sketch below)
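
  A sketch of both ideas, using a hypothetical slow_mean() function:

    slow_mean <- function(x) { s <- 0; for (v in x) s <- s + v; s / length(x) }
    system.time(slow_mean(runif(1e6)))       # time one call, tweak, then re-time

    library(compiler)
    fast_mean <- cmpfun(slow_mean)           # byte-compile the function once
    system.time(fast_mean(runif(1e6)))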

  16. Nope, Still Slow
  • Use parallelism: it's not worth using if you have fewer than 4 processors (IMO); see the doMC package and the sketch below
  • Use a supercomputer: Longleaf, Dogwood, or other private clusters through Amazon, IBM, Google, NASA
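
  A minimal doMC sketch (doMC works on Linux/macOS; on Windows, doParallel is the usual substitute), squaring a few numbers across 4 cores just to show the pattern:

    library(doMC)        # also attaches foreach
    registerDoMC(cores = 4)
    res <- foreach(i = 1:8, .combine = c) %dopar% i^2
    res                  # 1 4 9 16 25 36 49 64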

  17. Bottom Line
  • You should always try to read in your data dynamically if possible
  • If you use R and need to check or change memory allocation, use memory.size() (Windows only)
  • If you are on 32-bit R, the maximum is 2-3 GB
  • You can never have more than (2^31)-1 (2,147,483,647) rows or columns
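
  You can confirm that (2^31)-1 limit directly in any R session:

    .Machine$integer.max    # 2147483647: the hard cap on rows or columns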

  18. But Most Importantly
  • We still don't know the proper way to handle large data sets, and it is a hot topic in research
  https://underthecblog.org/2014/09/16/big-data-big-problems/

  19. Back to Opening Files
  • After opening your file (.tsv, .csv, .txt, etc.) in R, it is usually stored as a data frame
  • A data frame is a special object in R that organizes data by row and column and is easily manipulated
  • Many functions in R require data to be stored as a data frame
  • *Check whether something is a data frame with is.data.frame(<obj>)
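
  A quick sketch of building a data frame by hand and checking an object's class (the column names here are hypothetical):

    df <- data.frame(school = c("A", "B"), score = c(1450, 1390))
    is.data.frame(df)    # TRUE
    class(df)            # "data.frame"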

  20. Cleaning Up Your Data
  • Remove missing values: na.omit()
  • Exclude NAs/NaNs from an analysis with the na.rm = TRUE argument
  • Check for NaNs with is.nan() (and for any missing value with is.na())
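
  A small sketch with a toy vector; note that na.rm = TRUE drops both NA and NaN, and is.na() catches both while is.nan() catches only NaN:

    x <- c(1, 2, NA, NaN, 5)
    mean(x, na.rm = TRUE)              # 2.666667: missing values excluded
    is.nan(x)                          # FALSE FALSE FALSE  TRUE FALSE
    na.omit(data.frame(value = x))     # drops the rows containing NA/NaN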

  21. Adding to a Data Frame in R
  • rbind(df, list(information))  # add a row
  • cbind(df, colName = c(information))  # add a column
  • *df = data frame
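
  A concrete sketch, assuming a small data frame df with hypothetical columns school and score:

    df <- data.frame(school = c("A", "B"), score = c(1450, 1390))
    df <- rbind(df, list(school = "C", score = 1500))   # add a row
    df <- cbind(df, passed = c(TRUE, TRUE, FALSE))      # add a column named "passed"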

  22. Removing from a Data Frame in R
  • df$colName <- NULL  # remove a column
  • df <- df[-1, ]  # delete rows by reassignment
  • *df = data frame

  23. Accessing a Data Frame in R
  • df["colName"]  # pulls out the column called colName
  • df$colName
  • df[[3]]  # pulls out the 3rd column
  • df[1, ]  # access a row
  • *df = data frame

  24. Filtering Data in R
  • filt_df <- df[c(1:15), c(3, 4, 7:9)]  # filter with row/column indices
  • filt_df <- df[-c(1:15), -c(3, 4, 7:9)]  # the inverse of the above, using -
  • # Use subset(df, <property you want to filter by>, select = <columns to keep>)
  • subset_df <- subset(df, property == 2, select = c("colName1", "colName2"))  # see the sketch below
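
  A sketch of subset() with a toy df that has the columns used above (property, colName1, colName2):

    df <- data.frame(property = c(1, 2, 2),
                     colName1 = c("x", "y", "z"),
                     colName2 = c(10, 20, 30))
    subset_df <- subset(df, property == 2, select = c(colName1, colName2))
    filt_df   <- df[1:2, c("property", "colName2")]     # same idea with row/column indexing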

  25. Statistical Analysis in R
  • Summary statistics (min/max, quartiles, median, mean): summary(<vector or data frame>)
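
  For example, on a small numeric vector:

    scores <- c(1200, 1350, 1350, 1500)
    summary(scores)      # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
    mean(scores)         # 1350
    median(scores)       # 1350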

  26. Statistical Analysis in R
  • Covariance: cov(x, y)

  27. Statistical Analysis in R
  • Linear models: lm(y ~ x)
  • Scatter plot: visualize the linear relationship between the predictor and the response
  • Box plot: spot any outlier observations in the variable; outliers in your predictor can drastically affect the predictions, since they can easily change the direction/slope of the line of best fit
  • Density plot: see the distribution of the predictor variable (X); ideally a close-to-normal distribution (a bell-shaped curve) without skew to the left or right is preferred
  • Let us see how to make each one of them (sketch below)
  http://r-statistics.co
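
  A small sketch of fitting and inspecting a linear model on simulated data:

    x <- runif(100)
    y <- 2 * x + rnorm(100, sd = 0.2)
    fit <- lm(y ~ x)
    summary(fit)                  # coefficients, R-squared, p-values
    plot(x, y); abline(fit)       # scatter plot with the fitted line
    boxplot(x)                    # spot outliers in the predictor
    plot(density(x))              # distribution of the predictor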

  28. Visualizing Statistical Data
  • Histograms: hist(data)
  • Scatter plots: plot()
  • Bar plots: barplot()
  • Line graphs: plot(type = "l") / lines()
  • Box plots: boxplot()

  29. Visualizing Statistical Data
  • All graphing functions in R take aesthetic arguments such as col (color); look at the documentation for all options
  • For even more aesthetically pleasing plots, I recommend using the ggplot2 library

  30. ggplot2
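
  The original slide is an image; a minimal ggplot2 example along the same lines (a scatter plot from the built-in mtcars data) might look like:

    library(ggplot2)
    ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
      geom_point() +
      labs(x = "Weight", y = "Miles per gallon", colour = "Cylinders")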

  31. Writing Out Data
  • To a text file: write.table(mydata, "c:/mydata.txt", sep = "\t")
  • To a .tsv: write.table(myData, "c:/myFile.tsv", sep = "\t")
  • To an Excel spreadsheet: library(xlsx); write.xlsx(mydata, "c:/mydata.xlsx")

  32. Exporting Graphs
  • In RStudio, you can generate a graph and simply click the Export button to save it to your local drive (see the sketch below for a script-based alternative)
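
  Outside of the Export button, one script-based sketch is to open a graphics device, draw, and close it:

    png("myPlot.png", width = 800, height = 600)   # or pdf("myPlot.pdf")
    hist(rnorm(1000))
    dev.off()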

  33. References
  • Big Data: http://www.columbia.edu/~sjm2186/EPIC_R/EPIC_R_BigData.pdf
  • Large Scale Data Analysis in R: https://rpubs.com/msundar/large_data_analysis
  • Taking R to the Limit: https://www.slideshare.net/bytemining/r-hpc
  • Statistics in R: http://r-statistics.co
  • Missing Data: https://www.statmethods.net/input/missingdata.html

  34. Supplementary Information

  35. Bigmemory
  • Points to a location in memory rather than reading in a large file (uses a pointer to the data)
  https://rpubs.com/msundar/large_data_analysis

  36. Bigmemory
    library(bigmemory)
    library(biganalytics)
    library(bigtabulate)

    # Create a big.matrix
    setwd("/Users/sundar/dev")
    school.matrix <- read.big.matrix(
      "./numeric_matrix_SAT__College_Board__2010_School_Level_Results.csv",
      type = "integer", header = TRUE, backingfile = "school.bin",
      descriptorfile = "school.desc", extraCols = NULL)

    # Get the location of the pointer to school.matrix
    desc <- describe(school.matrix)
    str(school.matrix)
    ## Formal class 'big.matrix' [package "bigmemory"] with 1 slot
    ## ..@ address:<externalptr>

    # Process the big matrix in the active session
    colsums.session1 <- sum(as.numeric(school.matrix[, 3]))
    colsums.session1
  https://rpubs.com/msundar/large_data_analysis

  37. Parallelism in R
  • Use fread() from the data.table package
  • Parallel processing with the doMC package:
    library(doMC)
    registerDoMC(cores = 4)
  https://rpubs.com/msundar/large_data_analysis

  38. Memory/Data Type
  • Char: 24 MB
  • Int: 96 MB
  • Double: 192 MB
  • Short: 48 MB

  39. R Syntax
  • Variable: Var <- 3
  • Function: Func <- function(a, b) { return(a) }
  • Argument: Func(1, 2)
  • Library: library(ggplot2)

  40. Programming Terms
  • For loops: repeat something a certain number of times
  • While loops: repeat something as long as a condition is met
  • If statements: do something only if a condition is met

  41. R Syntax: For Loop
    numbers <- c(1, 2, 3, 4)
    for (num in numbers) {
      print(num * 2)
    }

  42. R Syntax: While Loop
    i <- 0
    while (i < 6) {
      print(i)
      i <- i + 1
    }

  43. R Syntax: If Statement
    x <- 0
    if (x < 0) {
      print("Negative number")
    }
    # Nothing is printed here, because the condition x < 0 is FALSE
