Using R in Statistics - Practical Insights and Examples


Discover the benefits and ease of using R in introductory statistics courses with insights and examples presented in this virtual poster. Explore how R simplifies data analysis tasks and enhances visualization capability for statistical analysis. Learn about R functions for descriptive statistics, normal distribution, hypothesis testing, regression, and more.

  • R Statistics
  • Data Analysis
  • Statistical Software
  • Data Visualization
  • Introductory Statistics


Presentation Transcript


  1. Using R in Statistics, by Faruk Guder and Mary Malliaris, Quinlan School of Business, Loyola University Chicago.

  2. Using R in Statistics. This virtual poster presents our experience in using R in our introductory statistics course.
     Purpose of this presentation:
     • It is not to argue that R is the best tool for data analysis.
     • It is to share our experience with instructors who are considering using R in their statistics courses.
     After using R in the course, we observed the following:
     • R is relatively easy and very intuitive.
     • Students need to know only a small number of R functions to generate the outputs needed in the course.
     Examples showing the use of these functions in an introductory statistics course are presented on the following slides, using the house data given on the next slide.

  3. Data: 70 houses, 5 variables (file: house.csv). A Random Sample of 70 Houses in a City.

  4. R and RStudio
     • Free and open source.
     • Available for Windows, Mac, and Linux systems.
     • Can be downloaded from https://cran.r-project.org (R) and https://www.rstudio.com (RStudio).
     • A programming language (it might look challenging, but it is relatively simple and very intuitive).
     • A leading programming language in statistics and data analytics.
     • Professional-quality visualization and graphics capability.
     • There are thousands of contributed packages (over 17,000) for R, written by many different authors. Some of these packages implement specialized statistical methods such as ARIMA time series analysis; others provide techniques for various data analytics tasks such as professional-quality graphs, data mining, linear programming, and simulation.

  5. Subjects in Introductory Statistics
     1. Descriptive Statistics
        (a) Numerical Measures
        (b) Visual Displays
     2. Normal Distribution and Central Limit Theorem
     3. Interval Estimation
     4. Hypothesis Testing
        (a) One-Sample Tests
        (b) Two-Sample Tests
        (c) Multi-Sample Tests (ANOVA)
     5. Regression
        (a) Correlation
        (b) Simple Regression
        (c) Multiple Regression
     R functions used for these subjects are presented on the following slides.

  6. R Functions 1. Descriptive Statistics
     Initial Work - Read Data:
         house <- read.csv("house.csv")
     (a) Numerical Measures:
         mean(house$value)
         [1] 215.2
         median(house$value)
         [1] 214.5
         sd(house$value)
         [1] 30.62281
     (b) Visual Displays:
         hist(house$value)
         boxplot(value~location, data=house)
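Because house.csv itself is not reproduced in this transcript, the sketch below builds a stand-in data frame so the descriptive functions above can be tried directly. Everything about `house_sim` is an assumption made for illustration: the values are simulated, only the four column names that appear in the slides are included, and the results will not match the authors' output.

```r
# Hypothetical stand-in for house.csv: simulated values, NOT the authors' data.
set.seed(1)
n <- 70
house_sim <- data.frame(
  area     = round(runif(n, 900, 2600)),
  basement = sample(c("no", "yes"), n, replace = TRUE),
  location = sample(c("A", "B", "C"), n, replace = TRUE)
)
# Made-up relationship between area and value, plus random noise.
house_sim$value <- round(85 + 0.08 * house_sim$area + rnorm(n, sd = 15), 1)

# The same numerical measures used on the slide.
mean(house_sim$value)
median(house_sim$value)
sd(house_sim$value)

# Two additional base R summaries: quartiles, and group means by location.
summary(house_sim$value)
location_means <- tapply(house_sim$value, house_sim$location, mean)
print(location_means)
```

The same `hist()` and `boxplot()` calls from the slide also work on `house_sim` once it exists.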

  7. R Functions 2. Normal Distribution and Central Limit Theorem
     The house values in a city are known to be normally distributed with a mean (μ) of 200 and a standard deviation (σ) of 20.
     (a) What is the probability that a randomly selected house in this city will have a value less than 220?
         pnorm(220,200,20)
         [1] 0.8413447
     (b) What is the probability that a randomly selected house in this city will have a value more than 220?
         1-pnorm(220,200,20)
         [1] 0.1586553
     (c) If a random sample of 25 houses is selected in this city, what is the probability that the mean value of these houses will be between 196 and 204?
         SE = 20/sqrt(25)
         pnorm(204,200,SE)-pnorm(196,200,SE)
         [1] 0.6826895
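The three probabilities on this slide can also be reached by slightly different routes, which may help students connect `pnorm` to z-scores. The following is a minimal sketch using only the parameters stated on the slide; the variable names are illustrative.

```r
# Parameters stated on the slide: mean 200, standard deviation 20.
mu <- 200
sigma <- 20

# (a) P(X < 220) via the z-score and the standard normal CDF.
z <- (220 - mu) / sigma
p_less <- pnorm(z)

# (b) P(X > 220) using the upper tail directly instead of 1 - pnorm(...).
p_more <- pnorm(220, mu, sigma, lower.tail = FALSE)

# (c) The mean of n = 25 houses has standard error sigma / sqrt(n).
n <- 25
se <- sigma / sqrt(n)
p_between <- pnorm(204, mu, se) - pnorm(196, mu, se)

print(c(p_less = p_less, p_more = p_more, p_between = p_between))
```

In part (c) the standard error is 4, so the interval from 196 to 204 is one standard error on either side of the mean.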

  8. R Functions 3. Interval Estimation
     Based on the observed values in the house.csv file, what is the 95% confidence interval for the mean value of all houses?
         t.test(house$value, conf.level = 0.95)
             One Sample t-test
         data: house$value
         t = 58.796, df = 69, p-value < 2.2e-16
         alternative hypothesis: true mean is not equal to 0
         95 percent confidence interval:
          207.8982 222.5018
         sample estimates:
         mean of x
             215.2
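For readers who want to see where the interval comes from, the sketch below rebuilds it from the summary statistics reported on the earlier slides (n = 70, mean 215.2, standard deviation 30.62281) using the textbook formula, mean ± t × s/√n. It is an illustration of the formula, not a replacement for `t.test`; the variable names are assumptions.

```r
# Summary statistics reported on the slides (n = 70 houses).
n <- 70
xbar <- 215.2
s <- 30.62281

se <- s / sqrt(n)                 # standard error of the mean
t_crit <- qt(0.975, df = n - 1)   # two-sided 95% critical value, 69 df
ci <- c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)
print(ci)
```

Up to rounding of the inputs, this should agree with the interval printed by `t.test` above.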

  9. R Functions 4(a). One-Sample Testing
     One-Sample Test: Past data indicated that the mean house value in this city was 205. Does the sample data provide evidence that the mean house value has changed, at a significance level of 5%?
         t.test(house$value, mu=205, alternative="two.sided", conf.level = 0.95)
             One Sample t-test
         data: house$value
         t = 2.7868, df = 69, p-value = 0.00687
         alternative hypothesis: true mean is not equal to 205
         95 percent confidence interval:
          207.8982 222.5018
         sample estimates:
         mean of x
             215.2
     Answer: p-value = 0.00687 < 0.05. Therefore, there is evidence that the mean house value has changed.
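The test statistic and p-value in this output can be reproduced by hand from the summary statistics reported earlier, which makes the link between the formula and the `t.test` output explicit. This is a sketch under the assumption that the rounded values on the slides are used as inputs.

```r
# Summary statistics from the slides and the hypothesized mean.
n <- 70
xbar <- 215.2
s <- 30.62281
mu0 <- 205

# t = (sample mean - hypothesized mean) / standard error
t_stat <- (xbar - mu0) / (s / sqrt(n))

# Two-sided p-value: twice the upper-tail area beyond |t| with n - 1 df.
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

print(c(t = t_stat, p = p_value))
```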

  10. R Functions 4(b). Two-Sample Testing
      Two-Sample Test: Does the sample data provide evidence that the mean value of the houses with a finished basement is different from the mean value of the houses with no finished basement, at a significance level of 5%?
          t.test(value~basement, data=house, conf.level = 0.95)
              Welch Two Sample t-test
          data: value by basement
          t = -6.9071, df = 38.786, p-value = 0.00000002957
          alternative hypothesis: true difference in means between group no and group yes is not equal to 0
          95 percent confidence interval:
           -55.05575 -30.11092
          sample estimates:
           mean in group no mean in group yes
                   186.0000          228.5833
      Answer: p-value = 0.00000002957 < 0.05. Therefore, the mean values are different.
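The pieces of the Welch output fit together in a way that can be checked with a few lines of R: the interval is centered on the difference of the two group means, and its half-width is the critical t value times the standard error implied by the t statistic. The sketch below uses only numbers printed on the slide; it does not recompute the Welch degrees of freedom, which would require the group standard deviations.

```r
# Values reported in the Welch t-test output on the slide.
mean_no  <- 186.0000
mean_yes <- 228.5833
t_stat   <- -6.9071
df       <- 38.786

diff <- mean_no - mean_yes        # estimated difference (no minus yes)
se <- diff / t_stat               # standard error implied by t = diff / se
half_width <- qt(0.975, df) * se  # qt() accepts fractional degrees of freedom
ci <- c(lower = diff - half_width, upper = diff + half_width)
print(ci)
```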

  11. R Functions 4(c). Multi-Sample Testing (ANOVA)
      Multi-Sample Test: Does the sample data provide evidence that the mean house values differ across the locations (A, B, C), at a significance level of 5%?
          ABC <- aov(value~location, data=house)
          summary(ABC)
                      Df Sum Sq Mean Sq F value  Pr(>F)
          location     2  10839    5419   6.741 0.00215 **
          Residuals   67  53866     804
      (The Pr(>F) column is the p-value.)
      Answer: The mean house values are not the same in all three locations.
      Question: Which locations have different mean house values?
          TukeyHSD(ABC)
          $location
               diff        lwr      upr     p adj
          B-A 27.96   7.571322 48.34868 0.0045448
          C-A 24.00   4.777370 43.22263 0.0106707
          C-B -3.96 -24.348678 16.42868 0.8876148
      The mean values are different
      • between locations B and A
      • between locations C and A
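The F value and p-value in the ANOVA table follow directly from the sums of squares and degrees of freedom shown in it. A minimal sketch of that arithmetic, using the rounded table entries as inputs (so the result matches the table only up to rounding):

```r
# Entries from the ANOVA table on the slide.
ss_between <- 10839   # Sum Sq for location
df_between <- 2
ss_within  <- 53866   # Sum Sq for residuals
df_within  <- 67

# Mean squares are sums of squares divided by their degrees of freedom.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of mean squares; the p-value is the upper tail of F(2, 67).
f_stat <- ms_between / ms_within
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```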

  12. R Functions 5(a). Regression: Correlation
      What is the correlation coefficient between the house values and the square footage areas?
          cor(house$value, house$area)
          [1] 0.8616099
      Scatterplot of house values vs. square footage areas:
          plot(value~area, data=house)

  13. R Functions 5(b). Linear Regression
      Simple regression to predict house value (Y) using the square footage area (X).
          R1 <- lm(value~area, data=house)
          summary(R1)
          Coefficients:
                       Estimate Std. Error t value             Pr(>|t|)
          (Intercept) 83.844469   9.568613   8.762    0.000000000000907 ***
          area         0.079467   0.005677  13.998 < 0.0000000000000002 ***
          ---
          Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
          Residual standard error: 15.66 on 68 degrees of freedom
          Multiple R-squared: 0.7424, Adjusted R-squared: 0.7386
          F-statistic: 195.9 on 1 and 68 DF, p-value: < 0.00000000000000022
      Regression equation: Y = 83.844469 + 0.079467*area
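Several numbers in this output are tied together by simple identities, and checking them is a useful classroom exercise. The sketch below, which uses only values printed on the slides, confirms that R-squared in a simple regression is the squared correlation coefficient, that the slope's t value is the estimate divided by its standard error, and shows how the fitted equation produces a prediction. The helper `predict_value` and the example area of 1653 square feet (borrowed from the later prediction slide) are illustrative choices.

```r
# Values printed on the correlation and simple regression slides.
r     <- 0.8616099   # cor(house$value, house$area)
b0    <- 83.844469   # intercept estimate
b1    <- 0.079467    # slope estimate for area
se_b1 <- 0.005677    # standard error of the slope

# In simple regression, Multiple R-squared equals the squared correlation.
r_squared <- r^2

# Each t value in the coefficient table is Estimate / Std. Error.
t_slope <- b1 / se_b1

# Hypothetical helper: predicted value from the fitted equation.
predict_value <- function(area) b0 + b1 * area
predicted <- predict_value(1653)

print(c(r_squared = r_squared, t_slope = t_slope, predicted = predicted))
```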

  14. R Functions 5(c). Multiple Regression
      Multiple regression to predict house values (Y) using the square footage area (X1), basement (X2), and location (X3).
          R3 <- lm(value~area+basement+location, data=house)
          summary(R3)
          Coefficients:
                       Estimate Std. Error t value             Pr(>|t|)
          (Intercept) 92.123377   8.046769  11.448 < 0.0000000000000002 ***
          area         0.063305   0.005411  11.699 < 0.0000000000000002 ***
          basementyes 18.475014   3.831494   4.822           0.00000893 ***
          locationB    9.831023   4.033990   2.437               0.0176 *
          locationC    8.285070   3.778324   2.193               0.0319 *
          ---
          Residual standard error: 12.88 on 65 degrees of freedom
          Multiple R-squared: 0.8333, Adjusted R-squared: 0.823
          F-statistic: 81.21 on 4 and 65 DF, p-value: < 0.00000000000000022
      Regression equation: Y = 92.123377 + 0.063305*area + 18.475014*basementyes + 9.831023*locationB + 8.285070*locationC

  15. R Functions 5(c). Multiple Regression - Confidence Interval Estimate (CIE)
      What is the 95% confidence interval estimate for the mean value of all houses with the following characteristics: area=1653, basement="yes", location="B"?
          predict(R3, data.frame(area=1653, basement="yes", location="B"), interval="confidence", level=0.95)
                 fit     lwr      upr
          1 225.0724 218.954 231.1907
      95% Confidence Interval Estimate: 218.954 < mean < 231.1907
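The fitted value in this output can be traced back to the regression equation on the previous slide. R codes `basement` and `location` as 0/1 indicator variables, with "no" and "A" as the baseline levels, so a house with a finished basement in location B gets basementyes = 1, locationB = 1, and locationC = 0. The sketch below does that arithmetic with the rounded coefficients from the slide; the vector names are illustrative, and the result matches the `fit` column only up to rounding.

```r
# Coefficient estimates printed on the multiple regression slide.
coefs <- c(intercept   = 92.123377,
           area        = 0.063305,
           basementyes = 18.475014,
           locationB   = 9.831023,
           locationC   = 8.285070)

# House of interest: area 1653, finished basement, location B.
# Indicator variables are 1 when the level applies and 0 otherwise.
x <- c(intercept   = 1,
       area        = 1653,
       basementyes = 1,
       locationB   = 1,
       locationC   = 0)

# Point estimate: sum of coefficient times predictor value.
fit <- sum(coefs * x)
print(fit)
```

The interval bounds themselves depend on the full covariance matrix of the coefficient estimates, which is why `predict` is needed for them.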

  16. Summary - List of All R Functions
