Managing Data Sets in R - Learn, Analyze, Visualize

Dive into managing data sets in R with Nicole Lama. Discover how to open, manipulate, and analyze large data sets, understand big data vs. large data sets, and overcome challenges when working with R. Explore the basics of visualizing data and utilizing R for statistical analysis using various file types.

  • R Programming
  • Data Analysis
  • Data Visualization
  • Statistical Analysis

Presentation Transcript


  1. Training Up: Managing Data Sets in R Nicole Lama

  2. Goals
  • Open large data sets of various types in R
  • Manipulate and access data in R
  • Learn how to do basic statistical analysis
  • Basics of visualizing the data

  3. Big Data vs. Large Data Set
  • Big data: a data set that will not easily fit into the available RAM of a system
  • Large data set: data that is generally cumbersome to work with because of its size

  4. Data Size
  • Medium data sets: < 2 GB
  • Large data sets: ~2-10 GB
  • Big data: > 10 GB; requires distributed, large-scale computing

  5. R Cons
  • Uses only one core
  • Reads all data into memory rather than reading it on demand
  • Can slow down your computer

  6. Why R for Big Data?
  • R has libraries for just about any statistical analysis possible
  • R is open source (woo hoo! Free software!)

  7. File Types
  • .tsv: tab-separated values
  • .csv: comma-separated values
  • .txt: generic text file, often separated by spaces
  • .json: JavaScript Object Notation

  8. Opening File Types
  • .tsv: read.table("myFile.tsv", sep = "\t")
  • .csv: read.csv("myFile.csv")
  • .txt: read.table("myFile.txt", sep = "")  # sep = "" means any whitespace
  • .json: fromJSON(file = "myFile.json")  # from the rjson library

  9. Opening File Types: UH OH! My data is...
  • Throwing an "out of workspace" error
  • Taking forever to load!
  • Taking forever to run an analysis on

  10. Opening File Types: UH OH! My data is...
  • Throwing an "out of workspace" error*
  • Taking forever to load!
  • Taking forever to run an analysis on
  • Issue: you have too much data
  • *"Out of workspace" can also mean that the values your script/function is working with are too large. Use significant values, use a different function, or transform your values before proceeding.

  11. Too Much Data: some easy fixes
  • Down-sample your data with sample(), as in the sketch below
  • Select only the data that is relevant to your analysis; maybe you only need two columns of the data
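
  A minimal sketch of both fixes, assuming a data frame df is already loaded and using hypothetical column names school and score:

    set.seed(1)                                   # make the down-sample reproducible
    keep <- sample(nrow(df), size = floor(0.1 * nrow(df)))
    df_small <- df[keep, ]                        # keep a random 10% of the rows
    df_small <- df_small[, c("school", "score")]  # keep only the columns you need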

  12. Too Much Data: some easy fixes
  • Use the data.table package: like data.frame, but more efficient
  • Split your data into chunks: just read in ~200 MB of data at a time (see the sketch below)
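
  A sketch of both ideas, assuming a hypothetical file myFile.csv; fread() from data.table is much faster than read.csv(), and skip/nrows let base R pull in one chunk at a time:

    library(data.table)
    dt <- fread("myFile.csv")        # efficient data.table instead of a plain data.frame

    # Chunked reading with base R: 200,000 rows per pass (tune this toward ~200 MB)
    hdr   <- names(read.csv("myFile.csv", nrows = 1))
    chunk <- read.csv("myFile.csv", skip = 1, nrows = 200000,
                      header = FALSE, col.names = hdr)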

  13. Working with Data Sets < 2 GB
  • Pre-define the column classes and specify the number of rows:
    bigfile.sample <- read.csv("data/SAT_Results2014.csv", stringsAsFactors = FALSE, header = TRUE, nrows = 20)
    bigfile.colclass <- sapply(bigfile.sample, class)
    bigfile.raw <- tbl_df(read.csv("data/SAT_Results2014.csv", stringsAsFactors = FALSE, header = TRUE, nrows = 10000, colClasses = bigfile.colclass, comment.char = ""))  # tbl_df() comes from dplyr
  https://rpubs.com/msundar/large_data_analysis

  14. Working with Data Sets ~2-10 GB
  • Use the bigmemory library (with biganalytics and bigtabulate for manipulation)
  • Find more computational power, especially if the file size is approaching 10 GB or more
  https://rpubs.com/msundar/large_data_analysis

  15. Still Too Slow?
  • Profile (time your functions) with system.time(); manipulate the function or its arguments until the run time is faster
  • Byte-compile a function with cmpfun() from the compiler package, so the function is compiled only once (see the sketch below)
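
  A sketch of both ideas, using a hypothetical slow_mean() function:

    slow_mean <- function(x) { s <- 0; for (v in x) s <- s + v; s / length(x) }
    system.time(slow_mean(runif(1e6)))       # time one call, tweak, then re-time

    library(compiler)
    fast_mean <- cmpfun(slow_mean)           # byte-compile the function once
    system.time(fast_mean(runif(1e6)))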

  16. Nope, Still Slow
  • Use parallelism: it's not worth using if you have fewer than 4 processors (IMO); see the doMC package and the sketch below
  • Use a supercomputer: Longleaf, Dogwood, or other private clusters through Amazon, IBM, Google, NASA
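
  A minimal doMC sketch (doMC works on Linux/macOS; on Windows, doParallel is the usual substitute), squaring a few numbers across 4 cores just to show the pattern:

    library(doMC)        # also attaches foreach
    registerDoMC(cores = 4)
    res <- foreach(i = 1:8, .combine = c) %dopar% i^2
    res                  # 1 4 9 16 25 36 49 64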

  17. Bottom Line
  • You should always try to read in your data dynamically if possible
  • If you use R and need to check or change memory allocation, use memory.size() (Windows only)
  • If you are on 32-bit R, the maximum is 2-3 GB
  • You can never have more than (2^31)-1 (2,147,483,647) rows or columns
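
  You can confirm that (2^31)-1 limit directly in any R session:

    .Machine$integer.max    # 2147483647: the hard cap on rows or columns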

  18. But Most Importantly
  • We still don't know the proper way to handle large data sets, and it is a hot topic in research
  https://underthecblog.org/2014/09/16/big-data-big-problems/

  19. Back to Opening Files
  • After opening your file (.tsv, .csv, .txt, etc.) in R, it is usually stored as a data frame
  • A data frame is a special object in R that organizes data by row and column and is easily manipulated
  • Many functions in R require data to be stored as a data frame
  • *Check whether something is a data frame with is.data.frame(<obj>)
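
  A quick sketch of building a data frame by hand and checking an object's class (the column names here are hypothetical):

    df <- data.frame(school = c("A", "B"), score = c(1450, 1390))
    is.data.frame(df)    # TRUE
    class(df)            # "data.frame"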

  20. Cleaning Up Your Data
  • Remove missing values: na.omit()
  • Exclude NAs/NaNs from an analysis with the na.rm = TRUE argument
  • Check for NaNs with is.nan() (and for any missing value with is.na())
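
  A small sketch with a toy vector; note that na.rm = TRUE drops both NA and NaN, and is.na() catches both while is.nan() catches only NaN:

    x <- c(1, 2, NA, NaN, 5)
    mean(x, na.rm = TRUE)              # 2.666667: missing values excluded
    is.nan(x)                          # FALSE FALSE FALSE  TRUE FALSE
    na.omit(data.frame(value = x))     # drops the rows containing NA/NaN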

  21. Adding to a Data Frame in R
  • rbind(df, list(information))  # add a row
  • cbind(df, colName = c(information))  # add a column
  • *df = data frame
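
  A concrete sketch, assuming a small data frame df with hypothetical columns school and score:

    df <- data.frame(school = c("A", "B"), score = c(1450, 1390))
    df <- rbind(df, list(school = "C", score = 1500))   # add a row
    df <- cbind(df, passed = c(TRUE, TRUE, FALSE))      # add a column named "passed"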

  22. Removing from a Data Frame in R
  • df$colName <- NULL  # remove a column
  • df <- df[-1, ]  # delete rows by reassignment
  • *df = data frame

  23. Accessing a Data Frame in R
  • df["colName"]  # pulls out the column called colName
  • df$colName
  • df[[3]]  # pulls out the 3rd column
  • df[1, ]  # access a row
  • *df = data frame

  24. Filtering Data in R
  • filt_df <- df[c(1:15), c(3, 4, 7:9)]  # filter with row/column indices
  • filt_df <- df[-c(1:15), -c(3, 4, 7:9)]  # the inverse of the above, using -
  • # Use subset(df, <property you want to filter by>, select = <columns to keep>)
  • subset_df <- subset(df, property == 2, select = c("colName1", "colName2"))  # see the sketch below
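
  A sketch of subset() with a toy df that has the columns used above (property, colName1, colName2):

    df <- data.frame(property = c(1, 2, 2),
                     colName1 = c("x", "y", "z"),
                     colName2 = c(10, 20, 30))
    subset_df <- subset(df, property == 2, select = c(colName1, colName2))
    filt_df   <- df[1:2, c("property", "colName2")]     # same idea with row/column indexing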

  25. Statistical Analysis in R
  • Summary statistics (min/max, quartiles, median, mean): summary(<vector or data frame>)
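
  For example, on a small numeric vector:

    scores <- c(1200, 1350, 1350, 1500)
    summary(scores)      # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
    mean(scores)         # 1350
    median(scores)       # 1350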

  26. Statistical Analysis in R
  • Covariance: cov(x, y)

  27. Statistical Analysis in R
  • Linear models: lm(y ~ x)
  • Scatter plot: visualize the linear relationship between the predictor and the response
  • Box plot: spot any outlier observations in the variable; outliers in your predictor can drastically affect the predictions, since they can easily change the direction/slope of the line of best fit
  • Density plot: see the distribution of the predictor variable (X); ideally a close-to-normal distribution (a bell-shaped curve) without skew to the left or right is preferred
  • Let us see how to make each one of them (sketch below)
  http://r-statistics.co
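
  A small sketch of fitting and inspecting a linear model on simulated data:

    x <- runif(100)
    y <- 2 * x + rnorm(100, sd = 0.2)
    fit <- lm(y ~ x)
    summary(fit)                  # coefficients, R-squared, p-values
    plot(x, y); abline(fit)       # scatter plot with the fitted line
    boxplot(x)                    # spot outliers in the predictor
    plot(density(x))              # distribution of the predictor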

  28. Visualizing Statistical Data
  • Histograms: hist(data)
  • Scatter plots: plot()
  • Bar plots: barplot()
  • Line graphs: plot(type = "l") / lines()
  • Box plots: boxplot()

  29. Visualizing Statistical Data
  • All graphing functions in R take aesthetic arguments such as col (color); look at the documentation for all options
  • For even more aesthetically pleasing plots, I recommend using the ggplot2 library

  30. ggplot2
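
  The original slide is an image; a minimal ggplot2 example along the same lines (a scatter plot from the built-in mtcars data) might look like:

    library(ggplot2)
    ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
      geom_point() +
      labs(x = "Weight", y = "Miles per gallon", colour = "Cylinders")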

  31. Writing Out Data
  • To a text file: write.table(mydata, "c:/mydata.txt", sep = "\t")
  • To a .tsv: write.table(myData, "c:/myFile.tsv", sep = "\t")
  • To an Excel spreadsheet: library(xlsx); write.xlsx(mydata, "c:/mydata.xlsx")

  32. Exporting Graphs
  • In RStudio, you can generate a graph and simply click the Export button to save it to your local drive (see the sketch below for a script-based alternative)
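
  Outside of the Export button, one script-based sketch is to open a graphics device, draw, and close it:

    png("myPlot.png", width = 800, height = 600)   # or pdf("myPlot.pdf")
    hist(rnorm(1000))
    dev.off()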

  33. References
  • Big Data: http://www.columbia.edu/~sjm2186/EPIC_R/EPIC_R_BigData.pdf
  • Large Scale Data Analysis in R: https://rpubs.com/msundar/large_data_analysis
  • Taking R to the Limit: https://www.slideshare.net/bytemining/r-hpc
  • Statistics in R: http://r-statistics.co
  • Missing Data: https://www.statmethods.net/input/missingdata.html

  34. Supplementary Information

  35. Bigmemory
  • Points to a location in memory rather than reading in a large file (uses a pointer to the data)
  https://rpubs.com/msundar/large_data_analysis

  36. Bigmemory
    library(bigmemory)
    library(biganalytics)
    library(bigtabulate)

    # Create a big.matrix
    setwd("/Users/sundar/dev")
    school.matrix <- read.big.matrix(
      "./numeric_matrix_SAT__College_Board__2010_School_Level_Results.csv",
      type = "integer", header = TRUE, backingfile = "school.bin",
      descriptorfile = "school.desc", extraCols = NULL)

    # Get the location of the pointer to school.matrix
    desc <- describe(school.matrix)
    str(school.matrix)
    ## Formal class 'big.matrix' [package "bigmemory"] with 1 slot
    ## ..@ address:<externalptr>

    # Process the big matrix in the active session
    colsums.session1 <- sum(as.numeric(school.matrix[, 3]))
    colsums.session1
  https://rpubs.com/msundar/large_data_analysis

  37. Parallelism in R
  • Use fread() from the data.table package
  • Parallel processing with the doMC package:
    library(doMC)
    registerDoMC(cores = 4)
  https://rpubs.com/msundar/large_data_analysis

  38. Memory/Data Type
  • Char: 24 MB
  • Int: 96 MB
  • Double: 192 MB
  • Short: 48 MB

  39. R Syntax
  • Variable: Var <- 3
  • Function: Func <- function(a, b) { return(a) }
  • Argument: Func(1, 2)
  • Library: library(ggplot2)

  40. Programming Terms
  • For loops: repeat something a certain number of times
  • While loops: repeat something as long as a condition is met
  • If statements: do something only if a condition is met

  41. R Syntax: For Loop
    numbers <- c(1, 2, 3, 4)
    for (num in numbers) {
      print(num * 2)
    }

  42. R Syntax: While Loop
    i <- 0
    while (i < 6) {
      print(i)
      i <- i + 1
    }

  43. R Syntax: If Statement
    x <- 0
    if (x < 0) {
      print("Negative number")
    }
    # Nothing is printed here, because the condition x < 0 is FALSE
