Two-Sample and Paired Inference
Introduction: Comparing Two Groups
Often in statistics, we want to compare the means of two different groups. The key distinction is whether the samples are independent or paired.
Independent samples occur when:
- Two completely different groups of subjects are measured
- Observations in one group are unrelated to observations in the other
- Example: comparing sleep times between cats that are fixed vs intact
Paired samples occur when:
- The same subjects are measured twice
- Subjects are matched on relevant characteristics
- Example: height measured at age 13 and age 14 for the same individuals
This distinction is critical because it changes how we analyze the data.
Equal Variance Two-Sample t-Test
When comparing two independent samples, we want to test whether the population means are equal.
Null hypothesis: H₀: μ₁ = μ₂, or equivalently, μ₁ - μ₂ = 0
Alternative hypothesis: Hₐ: μ₁ ≠ μ₂ (two-tailed), or μ₁ > μ₂ (one-tailed), or μ₁ < μ₂ (one-tailed)
The Pooled Standard Deviation
When we assume equal variances in the two populations, we pool the sample variances to get a better estimate:
Key concept: The pooled variance is a weighted average of the two sample variances, weighted by their respective degrees of freedom; the pooled SD is its square root.
sₚ = √( ((n₁ - 1)s₁² + (n₂ - 1)s₂²) / (n₁ + n₂ - 2) )
The numerator combines the squared deviations from both groups. The denominator, n₁ + n₂ - 2, is the total degrees of freedom available from both samples.
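The weighted-average claim is easy to verify numerically. A minimal sketch, using the same summary statistics as the cat example later in this section:

```r
# Pooled variance as a weighted average of the two sample variances,
# with weights equal to each sample's degrees of freedom (n - 1)
s1 <- 2.3; s2 <- 2.1
n1 <- 25;  n2 <- 23
pooled_var <- ((n1 - 1)*s1^2 + (n2 - 1)*s2^2) / (n1 + n2 - 2)
weighted   <- weighted.mean(c(s1^2, s2^2), w = c(n1 - 1, n2 - 1))
all.equal(pooled_var, weighted)   # TRUE
sqrt(pooled_var)                  # the pooled SD
```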
Standard Error and Test Statistic
The standard error of the difference in means is:
SE = sₚ √( 1/n₁ + 1/n₂ )
The test statistic follows a t-distribution with df = n₁ + n₂ - 2:
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{SE}$$
Confidence Interval
A confidence interval for the difference in means (μ₁ - μ₂) is:
$$(\bar{x}_1 - \bar{x}_2) \pm t^* \cdot SE$$
where t* is the critical value from the t-distribution with df = n₁ + n₂ - 2.
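As a quick sketch, the critical value comes from qt(); for a 90% interval we need the 95th percentile, since 5% of the probability sits in each tail:

```r
# Critical value t* for a 90% CI with n1 = 25, n2 = 23, so df = 46
qt(0.95, df = 46)
```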
Example: Cat Sleep Times
Let's compare sleep times between fixed and intact male ragdoll cats.
# Sample data
xbar1 <- 12.5  # mean sleep time for fixed cats (hours)
xbar2 <- 11.8  # mean sleep time for intact cats (hours)
s1 <- 2.3      # SD for fixed cats
s2 <- 2.1      # SD for intact cats
n1 <- 25       # sample size for fixed cats
n2 <- 23       # sample size for intact cats

# Pooled SD
s_p <- sqrt(((n1 - 1)*s1^2 + (n2 - 1)*s2^2) / (n1 + n2 - 2))
cat("Pooled SD:", s_p, "\n")

# Point estimate and SE
pt_est <- xbar1 - xbar2
se <- s_p * sqrt(1/n1 + 1/n2)
cat("Point estimate of difference:", pt_est, "\n")
cat("Standard error:", se, "\n")

# 90% confidence interval
cv <- qt(0.95, df = n1 + n2 - 2)
ci_lower <- pt_est - cv * se
ci_upper <- pt_est + cv * se
cat("90% CI:", ci_lower, "to", ci_upper, "\n")

# Hypothesis test
test_stat <- (xbar1 - xbar2 - 0) / se
p_value <- 2 * pt(abs(test_stat), df = n1 + n2 - 2, lower.tail = FALSE)
cat("Test statistic:", test_stat, "\n")
cat("Two-tailed p-value:", p_value, "\n")
Pooled SD: 2.2066
Point estimate of difference: 0.7
Standard error: 0.6375
90% CI: -0.3702 to 1.7702
Test statistic: 1.098
Two-tailed p-value: 0.2779
Using t.test() in R
R makes this easy with the t.test() function:
# Assuming df1 and df2 are data frames with Sleep_time_hours column
t.test(df1$Sleep_time_hours, df2$Sleep_time_hours,
       mu = 0, conf.level = 0.90, var.equal = TRUE)
        Two Sample t-test

data:  df1$Sleep_time_hours and df2$Sleep_time_hours
t = 1.098, df = 46, p-value = 0.2779
alternative hypothesis: true difference in means is not equal to 0
90 percent confidence interval:
 -0.3702  1.7702
sample estimates:
mean of x mean of y
     12.5      11.8
Why Pooling Works
Pooling is valid when we assume the populations have equal variances. The key insight is that we're combining information from both samples to estimate a common population standard deviation.
Key concept: Pooling gives us more information (higher degrees of freedom) and thus more power to detect differences, IF the equal variance assumption is reasonable.
The weights in the pooled SD formula, (n₁ - 1) and (n₂ - 1), reflect how much information each sample contributes. Larger samples get more weight because their variances are more stable estimates.
However, if the variances truly are different, pooling can be misleading. This is where Welch's t-test comes in.
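A small simulation sketch illustrates the problem; the sample sizes, SDs, and seed here are arbitrary choices for illustration. Both groups have the same mean, so every rejection is a Type I error:

```r
# With unequal variances AND unequal sample sizes, the pooled test's
# Type I error rate can drift far from the nominal 5%; Welch's stays close.
set.seed(1)
reps <- 5000
p_pooled <- p_welch <- numeric(reps)
for (i in seq_len(reps)) {
  x <- rnorm(10, mean = 0, sd = 4)   # small sample, large SD
  y <- rnorm(40, mean = 0, sd = 1)   # large sample, small SD
  p_pooled[i] <- t.test(x, y, var.equal = TRUE)$p.value
  p_welch[i]  <- t.test(x, y, var.equal = FALSE)$p.value
}
mean(p_pooled < 0.05)   # well above 0.05
mean(p_welch < 0.05)    # close to 0.05
```

The direction of the distortion depends on which group is larger: here the big, low-variance group dominates the pooled SD, which understates the true standard error and inflates the test statistic.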
Welch's Unequal Variance t-Test
When sample standard deviations are substantially different, or when we are unsure whether the variances are equal, Welch's t-test is safer. The key differences:
1. Do NOT pool the standard deviations
2. Compute the standard error directly from the two sample variances:
SE = √( s₁²/n₁ + s₂²/n₂ )
3. Use the Welch-Satterthwaite degrees of freedom (more complex, typically reported by software)
Welch Degrees of Freedom
The Welch-Satterthwaite formula for degrees of freedom is:
$$df_W = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{s_1^4}{n_1^2(n_1-1)} + \frac{s_2^4}{n_2^2(n_2-1)}}$$
This looks complex, but the interpretation is straightforward: it reduces the degrees of freedom when variances are unequal, reflecting the loss of information from having to estimate two different population standard deviations.
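One useful fact worth checking numerically: df_W always lies between min(n₁, n₂) - 1 and the pooled df, n₁ + n₂ - 2. A sketch using the same values as the example below:

```r
# The Welch-Satterthwaite df is bounded by min(n1, n2) - 1 and n1 + n2 - 2
s1 <- 3.2; s2 <- 1.5
n1 <- 25;  n2 <- 23
df_w <- (s1^2/n1 + s2^2/n2)^2 /
        (s1^4/(n1^2*(n1 - 1)) + s2^4/(n2^2*(n2 - 1)))
df_w              # falls between the two bounds below
min(n1, n2) - 1   # 22
n1 + n2 - 2       # 46
```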
R Implementation
In R, var.equal=FALSE (the default) uses Welch's method:
# Welch's t-test with unequal variances
xbar1 <- 12.5
xbar2 <- 11.8
s1 <- 3.2   # larger SD for group 1
s2 <- 1.5   # smaller SD for group 2
n1 <- 25
n2 <- 23

# Manual calculation
se <- sqrt(s1^2/n1 + s2^2/n2)
pt_est <- xbar1 - xbar2

# Welch degrees of freedom
w_numer <- (s1^2/n1 + s2^2/n2)^2
w_denom <- s1^4/(n1^2*(n1-1)) + s2^4/(n2^2*(n2-1))
df_welch <- w_numer / w_denom

cat("Welch SE:", se, "\n")
cat("Welch DF:", df_welch, "\n")

# 95% CI
cv <- qt(0.975, df = df_welch)
ci_lower <- pt_est - cv * se
ci_upper <- pt_est + cv * se
cat("95% CI:", ci_lower, "to", ci_upper, "\n")
Welch SE: 0.7123
Welch DF: 34.675
95% CI: -0.7466 to 2.1466
Using t.test() directly:
t.test(df1$Sleep_time_hours, df2$Sleep_time_hours,
       conf.level = 0.95, var.equal = FALSE)
        Welch Two Sample t-test

data:  df1$Sleep_time_hours and df2$Sleep_time_hours
t = 0.98268, df = 34.675, p-value = 0.3326
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.7466  2.1466
sample estimates:
mean of x mean of y
     12.5      11.8
Which Test Should You Use?
Guidance for choosing between equal-variance and Welch's t-tests:
Use Equal Variance t-test when:
- Sample standard deviations are similar (rule of thumb: ratio < 1.5)
- Sample sizes are similar
- You have strong prior knowledge that population variances are equal
Use Welch's t-test when:
- Sample standard deviations differ noticeably
- Sample sizes are very different
- You're unsure about variance equality
- As a general default choice (Welch is safer and controls Type I error better)
Key concept: Welch's t-test is more conservative and loses very little power even when the variances actually are equal. Most statisticians recommend Welch as the default choice unless there is good reason to assume equal variances.
R's default is var.equal=FALSE (Welch), which reflects modern statistical practice.
Paired t-Test
When data is paired (same subjects measured twice, or matched subjects), we have a different situation. The key is that measurements are not independent across groups.
When Data is Paired
- Before and after measurements on the same subject
- Measurements on matched subjects (twins, spouse pairs, etc.)
- Repeat measurements under different conditions
The Paired Analysis Approach
The genius of paired testing is that we convert a two-sample problem into a one-sample problem:
1. Compute the differences: d_i = x_i1 - x_i2 for each pair
2. Treat the differences as a single sample
3. Test whether the mean difference is zero
This is a one-sample t-test on the differences, with df = n - 1 (where n is the number of pairs).
Why Pairing Matters: A Critical Example
Consider height growth from age 13 to age 14 in 5 individuals:
# Age 13 and 14 heights (in cm) for 5 individuals
thirteen <- c(44.1, 59.0, 65.9, 58.7, 49.3)
fourteen <- c(46.3, 60.5, 68.2, 59.4, 50.6)

# Differences (proper paired analysis)
growth <- fourteen - thirteen
print(growth)

# One-sample t-test on differences
n <- length(growth)
xbar <- mean(growth)
s <- sd(growth)
test_stat <- (xbar - 0) / (s / sqrt(n))
p_value <- 2 * pt(abs(test_stat), df = n - 1, lower.tail = FALSE)

cat("Mean growth:", xbar, "cm\n")
cat("SD of growth:", s, "cm\n")
cat("Test statistic:", test_stat, "\n")
cat("p-value (paired test):", p_value, "\n")
[1] 2.2 1.5 2.3 0.7 1.3
Mean growth: 1.6 cm
SD of growth: 0.6633 cm
Test statistic: 5.3936
p-value (paired test): 0.005717
Now compare to an INCORRECT analysis that ignores pairing:
# WRONG: treating as independent samples
t.test(fourteen, thirteen, var.equal = TRUE)
        Two Sample t-test

data:  fourteen and thirteen
t = 0.29263, df = 8, p-value = 0.7773
alternative hypothesis: true difference in means is not equal to 0
sample estimates:
mean of x mean of y
     57.0      55.4
Compare results:
Key concept: The paired test gives t = 5.394, p = 0.0057 (highly significant). The unpaired test gives t = 0.293, p = 0.777 (not even close to significant). This dramatic difference shows why pairing is crucial. When data is paired and we fail to pair in the analysis, we throw away important information and lose power to detect real effects.
The paired analysis is much more powerful because it controls for individual differences in height. By looking at changes within individuals, we reduce noise.
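This noise reduction can be made concrete with the variance identity var(x - y) = var(x) + var(y) - 2 cov(x, y): when the two measurements are strongly positively correlated, the covariance term cancels most of the variance. A quick check with the height data:

```r
# Pairing works because the two measurements are strongly correlated,
# so the differences have far less variance than either measurement
thirteen <- c(44.1, 59.0, 65.9, 58.7, 49.3)
fourteen <- c(46.3, 60.5, 68.2, 59.4, 50.6)
cor(fourteen, thirteen)    # very close to 1
var(fourteen - thirteen)   # small
var(fourteen) + var(thirteen) - 2*cov(fourteen, thirteen)  # same value
```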
Using t.test() for Paired Data
# Correct paired analysis
t.test(growth, alternative = "greater")
t.test(fourteen, thirteen, paired = TRUE, alternative = "greater")
        One Sample t-test

data:  growth
t = 5.3936, df = 4, p-value = 0.002858
alternative hypothesis: true mean is greater than 0

        Paired t-test

data:  fourteen and thirteen
t = 5.3936, df = 4, p-value = 0.002858
alternative hypothesis: true difference in means is greater than 0
Both produce identical results. The key difference in the function call: paired = TRUE tells R to compute differences first.
Confidence Interval for Difference in Means
Interpretation
A 95% confidence interval for the difference in population means (μ₁ - μ₂) tells us:
Key concept: If we repeated the sampling procedure many times and computed a confidence interval each time, approximately 95% of those intervals would contain the true population difference.
Practical interpretation:
- If the CI includes 0, the difference is not significant at the 0.05 level
- If the CI is entirely positive, group 1 has a significantly higher mean
- If the CI is entirely negative, group 1 has a significantly lower mean
- The width of the CI reflects precision (narrower = more precise)
Examples from Previous Sections
For the equal variance test: 90% CI: [-0.3702, 1.7702]
Interpretation: We're 90% confident the true difference in mean sleep times is between -0.37 and 1.77 hours. Because this interval includes 0, we do not have convincing evidence that fixed and intact cats differ.
For the Welch test with unequal variances: 95% CI: [-0.7466, 2.1466]
Interpretation: This wide interval reflects the larger SD in group 1 and the higher confidence level. It includes 0, so we don't have strong evidence of a difference.
Paired Data CI
For the height growth data, a 95% CI for mean growth:
xbar <- mean(growth)
s <- sd(growth)
n <- length(growth)
se <- s / sqrt(n)
cv <- qt(0.975, df = n - 1)
ci_lower <- xbar - cv * se
ci_upper <- xbar + cv * se
cat("95% CI for mean growth:", ci_lower, "to", ci_upper, "\n")
95% CI for mean growth: 0.7764 to 2.4236
We're 95% confident the true mean height growth from age 13 to 14 is between 0.78 and 2.42 cm.
Connecting Confidence Intervals to Hypothesis Tests
There's a beautiful connection between confidence intervals and hypothesis tests.
The Relationship
For a two-tailed hypothesis test with significance level α:
- If the (1 - α) confidence interval includes 0, we fail to reject H₀
- If the (1 - α) confidence interval does NOT include 0, we reject H₀
Example
Looking back at our paired growth test:
- We got a 95% CI: [0.7764, 2.4236]
- This CI does NOT include 0
- Therefore, we reject H₀: μ = 0 at the 0.05 level
- This matches our two-tailed p-value of 0.0057 (< 0.05)
Key concept: The confidence interval and hypothesis test are two views of the same underlying question. The CI tells us not just whether a difference exists, but also the range of plausible values.
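The duality is easy to check in R by pulling both pieces out of the same t.test() result, here using the growth differences from earlier:

```r
# The 95% CI excludes 0 exactly when the two-sided p-value is below 0.05
growth <- c(2.2, 1.5, 2.3, 0.7, 1.3)
res <- t.test(growth, mu = 0, conf.level = 0.95)
res$p.value < 0.05                          # TRUE
res$conf.int[1] > 0 || res$conf.int[2] < 0  # TRUE: 0 is outside the CI
```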
R Code Summary: t.test() Parameters
Independent Samples - Equal Variances
t.test(group1, group2,
       mu = 0,              # null hypothesis difference
       conf.level = 0.95,   # confidence level
       var.equal = TRUE)    # assume equal variances
Independent Samples - Welch (Unequal Variances)
t.test(group1, group2,
       mu = 0,
       conf.level = 0.95,
       var.equal = FALSE)   # do NOT assume equal variances (default)
Paired Samples
# Method 1: Test on differences
t.test(differences,
       mu = 0,
       conf.level = 0.95)

# Method 2: Specify pairing directly
t.test(group1, group2,
       paired = TRUE,
       mu = 0,
       conf.level = 0.95)
One-Tailed Tests
# Test if group1 mean > group2 mean
t.test(group1, group2,
       alternative = "greater")   # or "less" for opposite

# Test if paired differences > 0
t.test(differences,
       alternative = "greater")
Summary Table
| Scenario | df | SE Formula | Assumption | R Code |
|---|---|---|---|---|
| Independent, equal var | n₁ + n₂ - 2 | sₚ√(1/n₁ + 1/n₂) | σ₁ = σ₂ | var.equal = TRUE |
| Independent, unequal var | Welch-Satterthwaite | √(s₁²/n₁ + s₂²/n₂) | None (safer) | var.equal = FALSE (default) |
| Paired | n - 1 | s_d/√n | Differences approximately normal | paired = TRUE |
Key Takeaways
1. Always distinguish between independent and paired data structures
2. When data is paired, compute differences and treat as one-sample problem
3. Welch's t-test is safer as a default for independent samples
4. Equal variance t-test assumes (and requires) similar population variances
5. Confidence intervals and hypothesis tests tell complementary stories
6. The df and SE change based on the test choice
7. Failing to recognize and properly analyze paired data can lead to missing real effects