MODULE 1510 QUESTIONS

Linear Regression

What is Correlation?

Correlation measures the strength and direction of a linear relationship between two quantitative variables. The correlation coefficient, denoted by r, quantifies how closely two variables move together.

Definition and Formula

The correlation coefficient is calculated as:

r = Sxy / √(Sxx × Syy)

where:

  • Sxy = sum((xi - x̄)(yi - ȳ))
  • Sxx = sum((xi - x̄)²)
  • Syy = sum((yi - ȳ)²)

and x̄, ȳ are the sample means of x and y.
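
To see the formula in action, here is a small sketch that computes Sxy, Sxx, and Syy directly for a made-up data set (the x and y vectors below are hypothetical) and checks the result against R's built-in cor():

R
# hypothetical example vectors, just to illustrate the formula
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
Syy <- sum((y - mean(y))^2)

Sxy / sqrt(Sxx * Syy)   # correlation computed from the formula
cor(x, y)               # should give the same value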

Key Properties of Correlation

  • Range: r is always between -1 and 1
  • Perfect positive: r = 1 indicates a perfect positive linear relationship
  • Perfect negative: r = -1 indicates a perfect negative linear relationship
  • No relationship: r = 0 indicates no linear relationship
  • Symmetric: cor(x,y) = cor(y,x) - the order doesn't matter
  • Unit-free: correlation has no units; it's a pure number
  • Linear only: correlation only measures linear associations; non-linear relationships (like quadratic) can have r near 0 even when variables are strongly related

Key concept: Correlation is a measure of linear association only. A quadratic relationship or other non-linear pattern will have a correlation near zero, even though the variables are clearly related.
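
As a quick illustration, here is a sketch in R where y is completely determined by x through a quadratic, yet the correlation is essentially zero:

R
x <- seq(-3, 3, by = 0.5)
y <- x^2                # perfect quadratic relationship
cor(x, y)               # approximately 0: no *linear* association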

Correlation Examples

Correlation helps us describe relationships:

  • r = 0.92: Very strong positive relationship. As one variable increases, the other strongly tends to increase.
  • r = -0.78: Strong negative relationship. As one variable increases, the other tends to decrease.
  • r = 0.15: Weak positive relationship. There is a slight tendency for both to increase together, but the pattern is scattered.
  • r = 0: No linear relationship. The scatter plot appears as a cloud with no clear pattern.

Correlation vs Causation

A strong correlation does NOT imply causation. Just because two variables are correlated does not mean one causes the other. For example:

  • Ice cream sales and drowning deaths are positively correlated (both increase in summer), but neither causes the other. A third variable (warm weather) causes both.
  • The number of firefighters at a fire is positively correlated with fire damage, but more firefighters don't cause more damage - larger fires require more firefighters and cause more damage.

Correlation is useful for identifying potential relationships, but establishing causation requires careful experimental design or causal reasoning about the mechanism.

R Code for Correlation

In R, calculate correlation using the cor() function:

R
library(tidyverse)  # provides read_csv(), %>%, ggplot(), tibble()

# Lake Monona data: year vs freeze duration
monona <- read_csv("lake-monona-winters-2025.csv")
cor(monona$year1, monona$duration)
OUTPUT
[1] -0.574

This negative correlation indicates that as years progress (time increases), the duration of lake freezing tends to decrease - a pattern consistent with climate change.

What is Simple Linear Regression?

Simple linear regression models the relationship between two quantitative variables using a straight line. We use it to:

  • Describe the relationship between variables
  • Predict future values of one variable based on another
  • Quantify the strength of the association

The Model

Simple linear regression assumes:

y-hat = b₀ + b₁×x

where:

  • y-hat is the predicted value of the response variable
  • x is the explanatory (predictor) variable
  • b₀ is the y-intercept (value of y when x = 0)
  • b₁ is the slope (change in y for each 1-unit increase in x)

Interpreting the Parameters

  • Intercept (b₀): The predicted value of y when x = 0. Not always meaningful in context.
  • Slope (b₁): For every 1-unit increase in x, the predicted y increases (or decreases, if b₁ is negative) by b₁ units on average. This is the most important interpretation.

Fitting the Line: Least Squares

We find the "best" line by minimizing the sum of squared residuals (SSE). A residual is the difference between an observed value and its predicted value:

e = y - y-hat

The least squares line minimizes the sum of the squared residuals.

Formulas for Slope and Intercept

The least squares estimates are:

b₁ = Sxy / Sxx
b₀ = ȳ - b₁ × x̄

where x̄ and ȳ are the sample means of x and y.
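
Here is a sketch checking these formulas against lm() on a small made-up data set (the x and y vectors are hypothetical):

R
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.9)

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)                                      # ȳ - b₁ × x̄
c(b0, b1)
coef(lm(y ~ x))   # should match the hand calculation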

Example in R

R
# Riley height data: child's height vs age
riley <- read_table("riley.txt")
riley_2_8 <- riley %>% filter(age >= 24 & age <= 96)

# Fit the linear model
height_mod <- lm(height ~ age, data = riley_2_8)
summary(height_mod)
OUTPUT
Call:
lm(formula = height ~ age, data = riley_2_8)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   30.250      0.823   36.76  < 2e-16
age            0.250      0.012   20.83  < 2e-16

Residual standard error: 1.24 on 18 degrees of freedom
Multiple R-squared:  0.9603

Interpretation: For each additional month of age, height increases by approximately 0.25 inches. When age = 0, the predicted height is 30.25 inches (though this extrapolation is not meaningful for newborns).

R-squared

R-squared measures the proportion of variation in the response variable that is explained by the explanatory variable. It ranges from 0 to 1.

Definition

R² = 1 - (SSE / Syy)

where SSE is the sum of squared residuals and Syy is the total sum of squares, sum((yi - ȳ)²).

Alternatively: R² = r²

The squared correlation coefficient equals R-squared.
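
As a check, here is a sketch that computes R² three ways for the height model from the earlier example (assuming riley_2_8 and height_mod are still in the workspace):

R
SSE <- sum(resid(height_mod)^2)
Syy <- sum((riley_2_8$height - mean(riley_2_8$height))^2)

1 - SSE / Syy                               # from the definition
cor(riley_2_8$age, riley_2_8$height)^2      # squared correlation
summary(height_mod)$r.squared               # value reported by lm()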

Interpretation

  • R² = 0.96: The model explains 96% of the variation in the response variable. The remaining 4% is due to other factors or random variation.
  • R² = 0.50: The model explains 50% of the variation.
  • R² = 0.10: The model explains only 10% of the variation; the explanatory variable is a weak predictor.

Key concept: R-squared is a measure of model fit. Higher R-squared means the line fits the data better, but even high R-squared does not imply causation.

Making Predictions

Once we fit a regression model, we can predict the response for new values of the explanatory variable.

Point Predictions

Simply plug the new x value into the regression equation:

y-hat = b₀ + b₁×x

R
# Predict height for a child aged 78 months
predict(height_mod, newdata = tibble(age = 78))
OUTPUT
[1] 49.79

For a 78-month-old child, we predict a height of approximately 49.79 inches.
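
Equivalently, we can plug x = 78 into the fitted equation by hand using the stored coefficients (a sketch, assuming height_mod from above):

R
b <- coef(height_mod)
unname(b[1] + b[2] * 78)   # same value as predict()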

Extrapolation Warning

Extrapolation - predicting outside the range of the observed data - is dangerous because:

1. The relationship may not hold outside the observed range

2. We have no data to verify our assumptions

3. Errors in prediction tend to increase the further we go from the data

Do not predict far outside the range of x values in your dataset.

Checking Regression Assumptions

Linear regression relies on four key assumptions. We check these primarily through residual plots.

The Four Assumptions

1. Linearity: The relationship between x and y is linear. Violation: curved pattern in residual plot.

2. Normality: The residuals are normally distributed. Violation: residuals not symmetric around 0, heavy tails.

3. Constant Variance (Homoscedasticity): The spread of residuals is constant across all values of x. Violation: "megaphone" pattern (widening or narrowing spread).

4. Independence: Observations are independent of each other. This is checked through study design, not the residual plot.

Creating and Reading Residual Plots

R
# Lake Monona example: fit the model, then add residuals to the data
lake_mod <- lm(duration ~ year1, data = monona)

monona_resids <- monona %>%
  mutate(residuals = resid(lake_mod))

ggplot(monona_resids, aes(x = year1, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0)

Interpret the plot:

  • Good: Points scattered randomly around 0 with no clear pattern
  • Curved pattern: Suggests non-linearity; a straight line isn't appropriate
  • Megaphone shape: Suggests non-constant variance; spread increases or decreases
  • Systematic patterns: Suggest the model is missing important structure

Key concept: Always plot your residuals. They reveal whether your regression assumptions are reasonable.
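
To check the normality assumption specifically, a normal quantile-quantile plot of the residuals is a common companion to the residual plot; here is a sketch using the same monona_resids data:

R
# points close to the reference line suggest roughly normal residuals
ggplot(monona_resids, aes(sample = residuals)) +
  geom_qq() +
  geom_qq_line()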

Regression Inference: Confidence Interval for the Slope

We not only estimate the slope b₁ but also quantify the uncertainty in that estimate using a confidence interval.

Standard Error of the Slope

The standard error estimates how much b₁ varies across different samples:

SE(b₁) = s / √(Sxx)

where s = √(SSE / (n - 2)) is the residual standard error (the square root of the mean squared error).
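
A sketch of computing s and SE(b₁) directly from the residuals, using the height model from earlier (assuming riley_2_8 and height_mod are loaded):

R
n   <- nrow(riley_2_8)
SSE <- sum(resid(height_mod)^2)
s   <- sqrt(SSE / (n - 2))                          # residual standard error
Sxx <- sum((riley_2_8$age - mean(riley_2_8$age))^2)

s / sqrt(Sxx)                                       # SE(b₁) by hand
summary(height_mod)$coefficients[2, 2]              # SE reported by lm()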

Constructing the Confidence Interval

For a 95% confidence interval with n - 2 degrees of freedom:

CI: b₁ ± t* × SE(b₁)

where t* is the critical value from the t-distribution with n-2 degrees of freedom.

Example in R

R
n <- nrow(riley_2_8)
height_mod <- lm(height ~ age, data = riley_2_8)

s <- summary(height_mod)$sigma
pt_est <- summary(height_mod)$coefficients[2, 1]  # slope estimate
se <- summary(height_mod)$coefficients[2, 2]      # SE of slope

cv <- qt(0.975, df = n - 2)
c(pt_est - cv * se, pt_est + cv * se)
OUTPUT
[1] 0.224 0.276

We are 95% confident that the true slope is between 0.224 and 0.276 inches per month.

Using confint()

R provides a convenient function:

R
confint(height_mod)
OUTPUT
              2.5 %  97.5 %
(Intercept)  28.530  32.000
age           0.224   0.276

Regression Inference: Hypothesis Test for the Slope

We often test whether there is a significant linear relationship between x and y.

Hypotheses

  • H₀: β₁ = 0 (no linear relationship; the slope is zero)
  • Hₐ: β₁ ≠ 0 (there is a linear relationship)

Alternatively, for one-sided tests:

  • Hₐ: β₁ > 0 (positive relationship)
  • Hₐ: β₁ < 0 (negative relationship)

Test Statistic

t = b₁ / SE(b₁)

with n - 2 degrees of freedom.

Under H₀, this statistic follows a t-distribution.

Example: Lake Monona

R
monona <- read_csv("lake-monona-winters-2025.csv")
lake_mod <- lm(duration ~ year1, data = monona)
summary(lake_mod)
OUTPUT
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)     48.5       25.3    1.92    0.058
year1         -0.087      0.010   -8.63  < 0.001

Interpretation: The test statistic is -8.63. The two-sided p-value is < 0.001, providing very strong evidence that the slope is not zero. There is a significant negative linear relationship between year and freeze duration.

Key concept: When the p-value is less than 0.05, we reject H₀ and conclude there is a significant linear relationship: the slope is statistically different from zero.
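
As a sketch, we can reproduce the test statistic and two-sided p-value by hand from the stored model (assuming lake_mod and monona from the example above):

R
co     <- summary(lake_mod)$coefficients
t_stat <- co[2, 1] / co[2, 2]                          # b₁ / SE(b₁)
p_val  <- 2 * pt(-abs(t_stat), df = nrow(monona) - 2)  # df = n - 2 (assumes no missing rows)
c(t_stat, p_val)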

Confidence Interval for Mean Response

Often we want a confidence interval for the mean response (average y) at a given x value, not a prediction for an individual.

Confidence Interval for the Mean

R
# lion_mod: a previously fitted model of lion nose proportion vs age
predict(lion_mod, newdata = tibble(age = 5), interval = "confidence")
OUTPUT
    fit   lwr   upr
1 0.546 0.522 0.570

We are 95% confident that the mean nose proportion for 5-year-old lions is between 0.522 and 0.570.

Standard Error Formula

SE_fit = s × √( 1/n + (x_new - x̄)² / Sxx )

Note: This interval is for the average response at x = x_new, not for an individual observation.
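
The same calculation can be done by hand from the formula; here is a sketch using the earlier height model (assuming riley_2_8 and height_mod are available), which should agree with predict(..., interval = "confidence"):

R
s    <- summary(height_mod)$sigma
n    <- nrow(riley_2_8)
xbar <- mean(riley_2_8$age)
Sxx  <- sum((riley_2_8$age - xbar)^2)

x_new  <- 78
se_fit <- s * sqrt(1/n + (x_new - xbar)^2 / Sxx)
fit    <- unname(predict(height_mod, newdata = tibble(age = x_new)))

fit + c(-1, 1) * qt(0.975, df = n - 2) * se_fit   # by-hand 95% CI for the mean response
predict(height_mod, newdata = tibble(age = x_new), interval = "confidence")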

Why This Matters

The confidence interval tells us where the regression line is - it's narrower near the center of the data (where we have the most information) and wider at the extremes (where we're more uncertain).

Prediction Interval for a New Observation

When we predict for an individual new observation, we need a wider interval to account for individual variation around the line.

Prediction Interval

R
predict(lion_mod, newdata = tibble(age = 5), interval = "prediction")
OUTPUT
    fit   lwr   upr
1 0.546 0.412 0.680

We predict with 95% confidence that a new 5-year-old lion will have a nose proportion between 0.412 and 0.680.

Why PI is Always Wider Than CI

The prediction interval accounts for two sources of uncertainty:

1. Uncertainty about where the regression line is (same as CI)

2. Uncertainty about individual variation around the line (additional)

Standard Error Formula

SE_pred = s × √( 1 + 1/n + (x_new - x̄)² / Sxx )

Notice the extra "1" in the formula - that's the individual variation.

Comparison

R
age_5 <- tibble(age = 5)
ci <- predict(lion_mod, newdata = age_5, interval = "confidence")
pi <- predict(lion_mod, newdata = age_5, interval = "prediction")

ci[, "upr"] - ci[, "lwr"]   # CI width: 0.570 - 0.522 = 0.048
pi[, "upr"] - pi[, "lwr"]   # PI width: 0.680 - 0.412 = 0.268

The prediction interval is much wider because it accounts for how individual observations vary around the average trend.

Key concept: Confidence intervals estimate where the average/mean response is. Prediction intervals estimate where an individual observation will fall. Always use a prediction interval when predicting for individuals.