
Relationships Between Quantitative Data Variables in Statistics
Explore how scatterplots help describe relationships between quantitative variables, focusing on direction, form, strength, and outliers. Learn how to interpret the patterns and deviations in data points to gain insights into linear and curvilinear relationships.
Presentation Transcript
Examining Relationships: Quantitative Data Concepts in Statistics
Scatterplots How do we describe the relationship between two quantitative variables using a scatterplot? We describe the overall pattern and deviations from that pattern.
Scatterplots (cont. 1) In a scatterplot, we describe the overall pattern with descriptions of direction, form, and strength. The direction of the relationship can be positive, negative, or neither.
Scatterplots (cont. 2) A positive (or increasing) relationship means that an increase in one of the variables is associated with an increase in the other. A negative (or decreasing) relationship means that an increase in one of the variables is associated with a decrease in the other.
Scatterplots (cont. 3) The form of the relationship is its general shape. To identify the form, describe the shape of the data in the scatterplot. Linear form: The data points appear scattered about a line. We use a line to summarize the pattern in the data. Curvilinear form: The data points appear scattered about a smooth curve. We use a curve to summarize the pattern in the data.
Scatterplots (cont. 4) The strength of the relationship is a description of how closely the data follow the form of the relationship. Let's look, for example, at the following two scatterplots displaying positive, linear relationships:
Scatterplots (cont. 5) Outliers are points that deviate from the pattern of the relationship. In the scatterplot below, there is one outlier.
Example: Average Length of Pregnancy What can we learn about the relationship from the scatterplot? The direction of the relationship is positive. An increase in lifespan is associated with an increase in pregnancy length. In other words, animals that live longer tend to have longer pregnancies. The form of the relationship is linear. The relationship is moderately strong.
Linear Relationships The correlation coefficient ($r$) is a numeric measure of the strength and direction of a linear relationship between two quantitative variables. Calculation: $r$ is calculated using the following formula: $r = \frac{1}{n-1} \sum \left(\frac{x - \bar{x}}{s_x}\right)\left(\frac{y - \bar{y}}{s_y}\right)$, where $n$ is the sample size; $x$ is a data value for the explanatory variable; $\bar{x}$ is the mean of the $x$-values; $s_x$ is the standard deviation of the $x$-values; and similarly for the terms involving $y$.
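As a minimal sketch of this calculation (the sum of the products of the standardized x- and y-values, divided by n - 1), the Python function below computes r. The function name `correlation` and the data are hypothetical, invented only for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation coefficient r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical data, not taken from the presentation
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 3))  # a positive value: y tends to increase with x
```

The result is always between -1 and 1; its sign gives the direction and its distance from 0 gives the strength of the linear relationship.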
Linear Relationships: Example Because the relationship is linear, we can use the correlation coefficient as a measure of the direction and strength of the linear relationship. The r-value is -0.793. The r-value is negative, which means that the linear relationship has a negative direction. Because r is somewhat close to -1, the relationship is moderately strong.
Properties of r The correlation does not change when the units of measurement of either one of the variables change. In other words, if we change the units of measurement of the explanatory variable and/or the response variable, it has no effect on the correlation (r). The correlation by itself is not enough to determine whether a relationship is linear. The correlation is heavily influenced by outliers.
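Two of these properties can be checked numerically. The sketch below uses made-up data: it converts the x-values from inches to centimeters (an assumed unit change) and then adds a single outlier. The helper `correlation` is a hypothetical name that follows the usual z-score formula for r.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation coefficient r via the z-score formula."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data with a strong positive linear pattern
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 3, 5, 4, 6, 7]
r_original = correlation(xs, ys)

# Property 1: changing units (here inches -> centimeters) leaves r unchanged
xs_cm = [x * 2.54 for x in xs]
r_rescaled = correlation(xs_cm, ys)

# Property 3: one outlier far from the pattern changes r dramatically
r_with_outlier = correlation(xs + [20], ys + [0])

print(round(r_original, 3), round(r_rescaled, 3), round(r_with_outlier, 3))
```

In this example the single added point is enough to flip the sign of r, which is why a scatterplot should be inspected alongside the number.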
Linear Relationships and the Correlation Coefficient The correlation measures only the strength of a linear relationship. In this example, the data have a smooth curvilinear form and the relationship is very strong, yet the correlation r = 0.172 indicates a weak linear relationship. So the correlation coefficient gives information only about the strength of a linear relationship.
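A small sketch can make this concrete. The data below are hypothetical (they are not the data from the slide): every point lies exactly on the curve y = x^2, so the curvilinear relationship is as strong as possible, yet r is essentially zero.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation coefficient r via the z-score formula."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data lying exactly on a parabola, symmetric about x = 0
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = correlation(xs, ys)
print(abs(r) < 1e-9)  # r is (numerically) zero despite the perfect curve
```

A value of r near 0 therefore means "no linear relationship", not "no relationship".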
Causation and Lurking Variables A common mistake made when describing the relationship between two quantitative variables is confusion between association and causation. In a linear relationship, people mistakenly interpret an r-value that is close to 1 or -1 as evidence that the explanatory variable causes changes in the response variable. The correct interpretation is that there is a statistical relationship between the variables. The explanatory variable and the response variable vary together in a predictable way. There is an association between the variables.
Linear Regression The line that best summarizes a linear relationship is the least-squares regression line. The most common measurement of overall error is the sum of the squares of the errors (SSE). The least-squares line is the line with the smallest SSE. We use the least-squares regression line to predict the value of the response variable from a value of the explanatory variable. Prediction for values of the explanatory variable that fall outside the range of the data is called extrapolation. The slope of the least-squares regression line is the average change in the predicted values of the response variable when the explanatory variable increases by 1 unit.
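The ideas of SSE, the least-squares line, and extrapolation can be sketched in a few lines of Python. The data and the function names (`sse`, `least_squares_line`, `is_extrapolation`) are hypothetical. The slope and intercept use the standard closed-form least-squares result, which is an addition here and is not stated on the slide.

```python
from statistics import mean

def sse(xs, ys, a, b):
    """Sum of squared errors for the line: predicted y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Intercept a and slope b that minimize SSE (standard closed form)."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

def is_extrapolation(x_new, xs):
    """True when x_new lies outside the range of the observed x-values."""
    return x_new < min(xs) or x_new > max(xs)

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
best = sse(xs, ys, a, b)
other = sse(xs, ys, 2.0, 0.7)  # a different candidate line

print(best < other)              # the least-squares line has the smaller SSE
print(is_extrapolation(10, xs))  # x = 10 is outside the data range 1..5
```

Here the slope b is the average change in the predicted response per 1-unit increase in x, and predictions flagged by `is_extrapolation` should be treated with caution.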
Linear Regression (cont.) We have two methods for finding the equation of the least-squares regression line. Method 1: We use technology to find the equation of the least-squares regression line: Predicted $y = a + bx$. Method 2: We use summary statistics for $x$ and $y$ and the correlation. In this method we can calculate the slope $b$ and the $y$-intercept $a$ using the following: $b = \frac{r \cdot s_y}{s_x}$, $a = \bar{y} - b\bar{x}$.
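Method 2 can be sketched as follows: the slope is the correlation times the ratio of the standard deviations (s_y over s_x), and the intercept is the mean of y minus the slope times the mean of x. The function name `line_from_summary` and all summary statistics below are hypothetical values chosen for easy arithmetic.

```python
def line_from_summary(r, x_bar, y_bar, s_x, s_y):
    """Least-squares intercept a and slope b from summary statistics."""
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b

# Hypothetical summary statistics (not from the presentation)
r, x_bar, y_bar, s_x, s_y = 0.8, 10.0, 50.0, 2.0, 5.0

a, b = line_from_summary(r, x_bar, y_bar, s_x, s_y)
predicted = a + b * 12  # predicted y for x = 12

print(a, b, predicted)
```

With these numbers the slope is 0.8 * 5 / 2 = 2 and the intercept is 50 - 2 * 10 = 30, so the prediction at x = 12 is 54.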
Assessing the Fit of a Line In general, we have Observed data value - Predicted value = Error. If we use $(x, y)$ to represent a typical data point and $\hat{y}$ to represent the predicted value (obtained by using the regression equation), then we have observed $y$ - predicted $y$ = error, that is, $y - \hat{y} = \text{error}$. We can think of the error as the amount that we have to add to the prediction to get the observed value. If the error is positive, it means the prediction is too small. If the error is negative, it means the prediction is too large. The prediction error is also called a residual. So another way to express the previous equation is: $y = \hat{y} + \text{residual}$.
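The sign convention for residuals can be illustrated with a short sketch. The regression line (intercept 2.2, slope 0.6), the data points, and the helper names are all hypothetical.

```python
def predict(x, a, b):
    """Predicted value from the regression line: a + b*x."""
    return a + b * x

def residual(observed_y, predicted_y):
    """Residual (prediction error) = observed value - predicted value."""
    return observed_y - predicted_y

def describe_error(res):
    """Interpret the sign of a residual."""
    if res > 0:
        return "prediction too small"
    if res < 0:
        return "prediction too large"
    return "prediction exact"

a, b = 2.2, 0.6                      # hypothetical regression line
points = [(1, 2), (3, 5), (5, 5)]    # hypothetical observed (x, y) pairs

for x, y in points:
    y_hat = predict(x, a, b)
    res = residual(y, y_hat)
    # The observed value is the prediction plus the residual
    print(x, y, round(y_hat, 2), round(res, 2), describe_error(res))
```

Adding each residual back to its prediction recovers the observed value, which is the identity stated above.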
Residual Plots We use residual plots to determine if a linear model is appropriate. We look for any unexpected patterns in the residuals that suggest that the data is not linear. Start by looking at what we expect to see in a residual plot when the form is linear. The graph below shows a scatterplot and the regression line. The blue points represent our original data set (our observed values). The red points, lying directly on the regression line, are the predicted values.
Assessing the Fit of the Line What proportion of variation does our linear model explain? The value of $r^2$ is the proportion of the variation in the response variable that is explained by the least-squares regression line. Recall $r^2 = \frac{\text{explained variation}}{\text{total variation}}$. Example: $r = 0.73$, so $r^2 = 0.73^2 \approx 0.53$. We can interpret $r^2 \approx 0.53$ as follows: our linear regression model explains 53% of the total variation in the response variable. This also means that 100% - 53% = 47% of the total variation is unexplained by the model.
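As a rough sketch, the proportion of explained variation can be computed directly. The code assumes the line passed in is the least-squares line, so that explained variation equals total variation minus SSE. The data, the line (intercept 2.2, slope 0.6), and the name `r_squared` are hypothetical.

```python
from statistics import mean

def r_squared(xs, ys, a, b):
    """Explained variation / total variation for the line y = a + b*x.

    Assumes (a, b) is the least-squares line, so that
    explained variation = total variation - SSE.
    """
    y_bar = mean(ys)
    total = sum((y - y_bar) ** 2 for y in ys)
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return (total - unexplained) / total

# Hypothetical data and its least-squares line
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(r_squared(xs, ys, 2.2, 0.6), 2))

# The slide's arithmetic: r = 0.73 gives r^2 of about 0.53
print(round(0.73 ** 2, 2))
```

For the hypothetical data, the model explains 60% of the variation in y and leaves 40% unexplained.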
Assessing the Fit of the Line (cont.) Calculate the error in a single prediction by: Residual = Observed value - Predicted value. Use the regression line to make predictions even when we do not have an observed value. How do we measure the typical amount of error for predictions from the regression line? The most common measure of the size of the typical error is the standard error of the regression, which is represented by $s_e$. It is calculated using the following formula: $s_e = \sqrt{\frac{\text{SSE}}{n - 2}}$, where SSE stands for the sum of the squared errors and $n$ is the sample size.
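The standard error of the regression can be sketched as below. This assumes the usual n - 2 divisor for a line with two fitted coefficients (intercept and slope). The data, the line (intercept 2.2, slope 0.6), and the function name are hypothetical.

```python
import math

def standard_error_of_regression(xs, ys, a, b):
    """Typical prediction error: sqrt(SSE / (n - 2)) for the line y = a + b*x."""
    n = len(xs)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))  # assumption: n - 2 divisor

# Hypothetical data and its least-squares line
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
s_e = standard_error_of_regression(xs, ys, 2.2, 0.6)
print(round(s_e, 3))
```

The result is in the units of the response variable, so it can be read as the typical size of a residual.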
Quick Review Does association imply causation? What is used to determine whether a linear model is a good summary of the relationship between the explanatory and response variables? When the form of a relationship is linear, what is used to measure the strength and direction of the relationship? What is the line that best summarizes a linear relationship? What is extrapolation? What two numeric measures help us judge how well the regression line models the data? What should be done before interpreting the correlation coefficient, r?