Correlation measures the strength and direction of a linear relationship between two quantitative variables. The correlation coefficient, denoted by r, quantifies how closely two variables move together.
The correlation coefficient is calculated as:
r = SSxy / sqrt(SSxx × SSyy)
where:
SSxy = Σ(x - x-bar)(y - y-bar), SSxx = Σ(x - x-bar)², and SSyy = Σ(y - y-bar)²
x-bar and y-bar are the sample means of x and y
Key concept: Correlation measures linear association only. A U-shaped quadratic relationship or another non-linear pattern can have a correlation near zero, even though the variables are clearly related.
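A quick R sketch of this pitfall: a perfectly U-shaped relationship produces a correlation of essentially zero.

    # y is completely determined by x, yet the linear correlation is ~0
    x <- seq(-3, 3, by = 0.1)
    y <- x^2
    cor(x, y)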
Correlation helps us describe relationships:
r is always between -1 and +1
r > 0 indicates a positive association; r < 0 indicates a negative association
|r| near 1 indicates a strong linear relationship; |r| near 0 indicates a weak or no linear relationship
A strong correlation does NOT imply causation. Just because two variables are correlated does not mean one causes the other. For example, ice cream sales and drowning deaths are positively correlated, but neither causes the other; both rise in warm weather, a confounding variable.
Correlation is useful for identifying potential relationships, but establishing causation requires careful experimental design or causal reasoning about the mechanism.
In R, calculate correlation using the cor() function:
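A minimal sketch, assuming the lake-freeze data live in a data frame named lakes with columns year and duration (hypothetical names):

    # correlation between calendar year and days the lake stayed frozen
    cor(lakes$year, lakes$duration)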
This negative correlation indicates that as the years progress, the duration of lake freezing tends to decrease, a pattern consistent with climate change.
Simple linear regression models the relationship between two quantitative variables using a straight line. We use it to:
1. Describe how the response variable changes, on average, with the explanatory variable
2. Predict the response for new values of the explanatory variable
Simple linear regression assumes that the mean of y is a linear function of x, giving the equation of the fitted line:
y-hat = b₀ + b₁×x
where:
y-hat is the predicted value of the response variable
b₀ is the intercept (the predicted y when x = 0)
b₁ is the slope (the predicted change in y for a one-unit increase in x)
We find the "best" line by minimizing the sum of squared residuals (SSE). A residual is the difference between an observed value and its predicted value:
e = y - y-hat
The least squares line is the line that minimizes SSE = Σ e² = Σ (y - y-hat)².
Once we fit the line, we get the least squares estimates:
b₁ = SSxy / SSxx
b₀ = y-bar - b₁ × x-bar
where y-bar and x-bar are the means of y and x.
For example, suppose the fitted line for children's heights (inches) versus age (months) is height-hat = 30.25 + 0.25 × age. Interpretation: for each additional month of age, predicted height increases by approximately 0.25 inches. When age = 0, the predicted height is 30.25 inches (though this extrapolation is not meaningful for newborns).
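In R, lm() computes these least squares estimates. A minimal sketch, assuming a data frame kids with columns age (months) and height (inches); the names are hypothetical:

    # fit the least squares line for height versus age
    fit <- lm(height ~ age, data = kids)
    coef(fit)   # b0 (intercept) and b1 (slope)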
R-squared measures the proportion of variation in the response variable that is explained by the explanatory variable. It ranges from 0 to 1 and is calculated as:
R² = 1 - SSE/SSyy
where SSE is the sum of squared residuals and SSyy = Σ(y - y-bar)² is the total sum of squares.
Alternatively, R² = r²: the square of the correlation coefficient equals R-squared.
Key concept: R-squared is a measure of model fit. Higher R-squared means the line fits the data better, but even high R-squared does not imply causation.
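Continuing the hypothetical kids example, both routes to R-squared give the same number:

    summary(fit)$r.squared           # R-squared from the fitted model
    cor(kids$age, kids$height)^2     # the squared correlation r^2: same value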
Once we fit a regression model, we can predict the response for new values of the explanatory variable.
Simply plug the new x value into the regression equation:
y-hat = b₀ + b₁ × x_new
For a 78-month-old child: height-hat = 30.25 + 0.25 × 78 ≈ 49.75, so we predict a height of approximately 49.75 inches.
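In R, predict() does the plug-in arithmetic, continuing the hypothetical fit from above:

    # predicted height for a 78-month-old child
    predict(fit, newdata = data.frame(age = 78))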
Extrapolation - predicting outside the range of the observed data - is dangerous because:
1. The relationship may not hold outside the observed range
2. We have no data to verify our assumptions
3. Errors in prediction tend to increase the further we go from the data
Do not predict far outside the range of x values in your dataset.
Linear regression relies on four key assumptions. We check these primarily through residual plots.
1. Linearity: The relationship between x and y is linear. Violation: curved pattern in residual plot.
2. Normality: The residuals are normally distributed. Violation: residuals not symmetric around 0, heavy tails.
3. Constant Variance (Homoscedasticity): The spread of residuals is constant across all values of x. Violation: "megaphone" pattern (widening or narrowing spread).
4. Independence: Observations are independent of each other. This is checked through study design, not the residual plot.
Interpret the plot: a good residual plot shows points scattered randomly around the horizontal line at zero, with no curvature and roughly constant spread across the range of fitted values.
Key concept: Always plot your residuals. They reveal whether your regression assumptions are reasonable.
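A sketch of the standard residual checks in R, using the hypothetical fit object from above:

    # residuals vs. fitted values: look for random scatter around zero
    plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, lty = 2)
    # normal quantile plot: residuals should follow the reference line
    qqnorm(resid(fit))
    qqline(resid(fit))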
We not only estimate the slope b₁ but also quantify the uncertainty in that estimate using a confidence interval.
The standard error estimates how much b₁ varies across different samples:
SE(b₁) = s / sqrt(SSxx)
where s = sqrt(SSE/(n-2)) is the residual standard error (the square root of the mean squared error).
For a 95% confidence interval with df = n - 2 degrees of freedom:
b₁ ± t* × SE(b₁)
where t* is the critical value from the t-distribution with n - 2 degrees of freedom.
We are 95% confident that the true slope is between 0.224 and 0.276 inches per month.
R provides a convenient function:
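A sketch using confint() on the hypothetical fit object from above:

    confint(fit, level = 0.95)   # confidence intervals for the intercept and the slope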
We often test whether there is a significant linear relationship between x and y. The hypotheses are:
H0: β₁ = 0 (no linear relationship)
Ha: β₁ ≠ 0 (a linear relationship exists)
Alternatively, for one-sided tests, Ha: β₁ > 0 or Ha: β₁ < 0.
The test statistic is:
t = b₁ / SE(b₁)
with degrees of freedom = n - 2. Under H0, this statistic follows a t-distribution.
Interpretation: The test statistic is -8.63. The two-sided p-value is < 0.001, providing very strong evidence that the slope is not zero. There is a significant negative linear relationship between year and freeze duration.
Key concept: When p-value < 0.05, we reject H0 and conclude there is a significant linear relationship. The slope is statistically different from zero.
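In R, summary() on a fitted lm object reports this test; a sketch with the hypothetical fit from above:

    # the Coefficients table lists b1, SE(b1), the t statistic, and the two-sided p-value Pr(>|t|)
    summary(fit)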
Often we want a confidence interval for the mean response (average y) at a given value x_new, not a prediction for an individual:
y-hat ± t* × s × sqrt(1/n + (x_new - x-bar)²/SSxx)
We are 95% confident that the mean nose proportion for 5-year-old lions is between 0.522 and 0.570.
Note: This interval is for the average response at x = x_new, not for an individual observation.
The confidence interval tells us where the regression line is - it's narrower near the center of the data (where we have the most information) and wider at the extremes (where we're more uncertain).
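In R, predict() with interval = "confidence" produces this interval. A sketch assuming a data frame lions with columns age (years) and prop.black (nose proportion); the names are hypothetical:

    lion.fit <- lm(prop.black ~ age, data = lions)
    # 95% confidence interval for the mean nose proportion at age = 5
    predict(lion.fit, newdata = data.frame(age = 5), interval = "confidence")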
When we predict for an individual new observation, we need a wider interval to account for individual variation around the line:
y-hat ± t* × s × sqrt(1 + 1/n + (x_new - x-bar)²/SSxx)
We predict with 95% confidence that a new 5-year-old lion will have nose proportion between 0.412 and 0.680.
The prediction interval accounts for two sources of uncertainty:
1. Uncertainty about where the regression line is (same as CI)
2. Uncertainty about individual variation around the line (additional)
Notice the extra "1" in the formula - that's the individual variation.
The prediction interval is much wider because it accounts for how individual observations vary around the average trend.
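The same predict() call with interval = "prediction" gives the wider individual-level interval, continuing the hypothetical lion.fit from above:

    # 95% prediction interval for a single 5-year-old lion
    predict(lion.fit, newdata = data.frame(age = 5), interval = "prediction")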
Key concept: Confidence intervals estimate where the average/mean response is. Prediction intervals estimate where an individual observation will fall. Always use a prediction interval when predicting for individuals.