Introduction to data science with R: visualization, data manipulation, probability, and statistical inference. UW-Madison, Spring 2026.

# Introduction to R & RStudio

## What is R?

R is a programming language built specifically for statistics and data analysis. Unlike Python, which is a general-purpose language, R was designed from the ground up for working with data — every feature, from its vector math to its plotting system, reflects that purpose.

## Loading Packages

R has a huge ecosystem of add-on packages. The tidyverse bundle — which includes ggplot2, dplyr, tidyr, and more — is used throughout this course.

```r
install.packages("tidyverse")  # run once to download
library(tidyverse)             # run every session to load
```
```output
-- Attaching packages --- tidyverse 2.0.0 --
v ggplot2 3.4.0  v dplyr   1.1.0
v tidyr   1.3.0  v readr   2.1.4
```

> **Note:** You only need to run `install.packages()` once. Run `library()` at the top of every R script or R Markdown file.

RStudio is the IDE (integrated development environment) we use to write and run R code. Our actual files are **R Markdown** (`.Rmd`) documents — a combination of explanatory text and runnable code chunks.

> **Why R?** R is one of the most widely used languages in statistics and data science. The `tidyverse` ecosystem (ggplot2, dplyr, tidyr) gives you powerful, readable tools for data manipulation and visualization that often take more code in other languages.

## Variables and Assignment

A variable is a named container for a value. In R, the assignment operator is `<-`:

```r
x <- 4
my_name <- "Miranda"
is_raining <- TRUE
```
```output
# (no output — assignment is silent)
```

Notice that R doesn't print anything when you assign. To see the value, just type the variable name:

```r
x
x + 10    # evaluates but does NOT change x
x <- x + 1
x
```
```output
[1] 4
[1] 14
[1] 5
```

> **Important:** `x + 10` shows 14 but x is still 4. You need `<-` to actually change a variable. This is a common source of confusion.

## Data Types

Every object in R has a **class** that determines how R treats it. The three you'll use constantly:

- **Numeric** — any number (`42`, `3.14`, `-7.5`)
- **Character** — text, always wrapped in quotes (`"hello"`, `"TRUE"`)
- **Logical** — exactly `TRUE` or `FALSE` (no quotes)

```r
class(42)
class("hello")
class(TRUE)
```
```output
[1] "numeric"
[1] "character"
[1] "logical"
```

The type matters because operations only work with compatible types. Adding a number to a character string causes an error, not an automatic conversion.
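
A minimal sketch of that rule, wrapped in `tryCatch()` so the error is captured rather than stopping the script. The names `attempt` and `fixed` are illustrative only, and `as.numeric()` is the conversion function covered later in these notes. The message shown in the comment is the one R gives in an English locale.

```r
# Adding a number to a character string raises an error.
# tryCatch() captures the error message instead of halting the script.
attempt <- tryCatch(
  1 + "2",
  error = function(e) conditionMessage(e)
)
attempt   # "non-numeric argument to binary operator"

# Converting the string explicitly makes the arithmetic work.
fixed <- 1 + as.numeric("2")
fixed     # 3
```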

## Useful Built-in Functions

Functions in R take inputs (arguments) inside parentheses and return a result:

```r
sqrt(16)
abs(-7)
nchar("statistics")
toupper("hello r")
seq(from = 0, to = 10, by = 2)
```
```output
[1] 4
[1] 7
[1] 10
[1] "HELLO R"
[1]  0  2  4  6  8 10
```

> **Reading function documentation:** If you're not sure what a function does, type `?function_name` in the R console. For example, `?seq` shows all the arguments seq() accepts.

## Special Values

R has three special values worth knowing:

```r
0 / 0      # Not a number — undefined math
1 / 0      # Positive infinity
NA         # Missing data — absence of a value
```
```output
[1] NaN
[1] Inf
[1] NA
```

`NA` is especially important in statistics: real datasets almost always have missing values, and R's treatment of `NA` is deliberate. Most operations involving `NA` return `NA` unless you explicitly tell R to ignore the missing values.

## R Markdown & Code Chunks

R Markdown (`.Rmd`) files blend text and executable R code. Code goes inside **code chunks**: a chunk opens with three backticks followed by `{r}` and closes with three backticks. The contents of a chunk look like this:

```r
# This is a code chunk
x <- 4
x + 1
```
```output
[1] 5
```

Inside a chunk, lines starting with `#` are comments. When you **knit** the document (Ctrl+Shift+K), R runs all chunks and outputs a polished HTML or PDF report.

> **R Markdown workflow:** Write explanatory text → insert code chunks → knit → instant report. This is how actual data scientists document their work.

## Case Sensitivity & Working Directory

R is case-sensitive: `x` and `X` are different variables. `TRUE` and `true` are not the same (R only recognizes `TRUE`, `FALSE`, `NA`).

```r
x <- 5
X <- 10
x
X
```
```output
[1] 5
[1] 10
```

To see your current working directory, use `getwd()`:

```r
getwd()
```
```output
[1] "/Users/miranda/Desktop/STAT240"
```

## More Useful Functions

### `paste()` for combining strings

```r
paste("Hello", "world")
paste("Name:", "Alice")
paste("x =", 42)
```
```output
[1] "Hello world"
[1] "Name: Alice"
[1] "x = 42"
```



## Vectors with c()

The `c()` function combines values into a vector (a sequence of items of the same type):

```r
x <- c(3, 7, 2, 9, 1)   # numeric vector
y <- c("a", "b", "c")    # character vector
z <- c(TRUE, FALSE, TRUE) # logical vector
```

Vectors are the fundamental data structure in R — almost everything is a vector.

## Variable Naming Rules

R variable names must follow these rules:
- Must start with a **letter** or a **dot** (`.`); a leading dot may not be followed by a digit
- Can contain letters, digits, underscores (`_`), and dots (`.`)
- Cannot start with a digit (`2ndvar` is invalid)
- Cannot be reserved words such as `TRUE`, `FALSE`, `NULL`, `NA`, `Inf`, `NaN`, `if`, `else`, `for`, or `function`

```r
my_var2 <- 10     # valid: starts with letter
.hidden <- 5      # valid: starts with dot
MyResult <- 3.14  # valid: mixed case works too

# 2ndvar <- 1     # INVALID: starts with digit
# TRUE <- 1       # INVALID: reserved word
```

> **Exam tip:** `.hidden_val` is a valid name. `2ndvar` is not. `TRUE` is always a reserved keyword.

### `sum()` with multiple arguments

```r
sum(1:10)         # sum of 1, 2, 3, ... 10
sum(c(3, 7, 2))   # sum of 3, 7, 2
```
```output
[1] 55
[1] 12
```
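
As a small sketch of the multiple-argument form named in the heading: `sum()` also accepts several separate arguments and adds everything it receives. The variable names here are illustrative only.

```r
# sum() adds all of its arguments together, whether they are
# single numbers, vectors, or a mix of both.
total_separate <- sum(3, 7, 2)   # 12, same result as sum(c(3, 7, 2))
total_mixed    <- sum(1:3, 10)   # 1 + 2 + 3 + 10 = 16

total_separate
total_mixed
```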

## The `%%` Modulo Operator

The modulo operator `%%` returns the **remainder** after division:

```r
10 %% 3   # remainder of 10 ÷ 3
7 %% 2    # check if odd (remainder 1)
```
```output
[1] 1
[1] 1
```

Common use: `x %% 2 == 1` checks if x is odd.
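
Because `%%` works on whole vectors, the odd check can be applied to many values at once. A minimal sketch, with `x` and `is_odd` as illustrative names:

```r
x <- 1:6

# %% is applied to every element, then == compares each remainder to 1
is_odd <- x %% 2 == 1
is_odd        # TRUE FALSE TRUE FALSE TRUE FALSE

x[is_odd]     # keeps only the odd values: 1 3 5
sum(is_odd)   # counts the odd values: 3
```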

## Type Conversion Functions

R can convert between types using `as.numeric()`, `as.character()`, and `as.logical()`:

```r
as.numeric("3.14")      # Convert string to number
as.character(42)        # Convert number to string
as.numeric(TRUE)        # Convert logical to number
```
```output
[1] 3.14
[1] "42"
[1] 1
```

When conversion fails, R returns `NA` with a warning:

```r
as.numeric("hello")
```
```output
[1] NA
Warning: NAs introduced by coercion
```
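
`as.logical()` is mentioned above but not shown. A short sketch of how it behaves on a few inputs; the variable names are illustrative only:

```r
from_text   <- as.logical("TRUE")   # TRUE
from_abbrev <- as.logical("T")      # TRUE: "T" and "F" are accepted spellings
from_zero   <- as.logical(0)        # FALSE: zero converts to FALSE
from_five   <- as.logical(5)        # TRUE: any non-zero number converts to TRUE
from_other  <- as.logical("yes")    # NA: "yes" is not a recognized spelling

c(from_text, from_abbrev, from_zero, from_five, from_other)
```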

## Built-in R Vectors

R comes with several pre-loaded vectors of useful values:

```r
letters        # lowercase a–z
LETTERS        # uppercase A–Z
month.name     # "January" through "December"
letters[10]    # "j" — R vectors are 1-indexed
```
```output
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" ...
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" ...
 [1] "January"   "February"  "March"     "April" ...
[1] "j"
```

## rep() — Replicate Values

The `rep()` function repeats values to create longer vectors:

```r
rep(0, 5)              # [1] 0 0 0 0 0
rep(c(1, 2), 3)        # [1] 1 2 1 2 1 2
rep(c("A","B"), each=2) # [1] "A" "A" "B" "B"
```
```output
[1] 0 0 0 0 0
[1] 1 2 1 2 1 2
[1] "A" "A" "B" "B"
```

The `each=` argument repeats each element before moving to the next, while the default repeats the entire vector.
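
The two behaviors can also be combined. A minimal sketch using the `times` and `each` arguments of `rep()`; `both` is an illustrative name:

```r
rep(c(1, 2), times = 3)   # 1 2 1 2 1 2  (whole vector repeated)
rep(c(1, 2), each = 3)    # 1 1 1 2 2 2  (each element repeated)

# each is applied first, then the result is repeated times times
both <- rep(c("A", "B"), each = 2, times = 2)
both                      # "A" "A" "B" "B" "A" "A" "B" "B"
length(both)              # 8
```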


# Data Types & Structures

## Logical Operators

Comparison operators compare two values and return `TRUE` or `FALSE`. You'll use these constantly inside `filter()` and `if` statements:

```r
5 > 3    # greater than
5 < 3    # less than
5 >= 5   # greater than or equal
5 == 5   # equality — note double equals!
5 != 3   # not equal
```
```output
[1] TRUE
[1] FALSE
[1] TRUE
[1] TRUE
[1] TRUE
```

> **Common mistake:** `=` assigns a value, `==` checks equality. Writing `filter(species = "Adelie")` will give an error; you need `filter(species == "Adelie")`.

You can combine conditions with `&` (AND) and `|` (OR):

```r
(5 > 3) & (2 < 1)   # TRUE AND FALSE = FALSE
(5 > 3) | (2 < 1)   # TRUE OR FALSE = TRUE
```
```output
[1] FALSE
[1] TRUE
```

AND requires **both** conditions to be true. OR requires **at least one**.
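
Like arithmetic, `&` and `|` work element by element on vectors. A small sketch with a hypothetical `ages` vector:

```r
ages <- c(15, 22, 67, 40)   # hypothetical ages

# AND: both conditions must hold for an element to be TRUE
working_age <- (ages >= 18) & (ages < 65)
working_age                 # FALSE TRUE FALSE TRUE

# OR: at least one condition must hold
discount <- (ages < 18) | (ages >= 65)
discount                    # TRUE FALSE TRUE FALSE
```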

## Missing Values: NA

`NA` means a value is absent — it's not zero, not empty string, it's genuinely unknown. R is strict about this: any arithmetic or comparison with NA propagates the NA:

```r
NA + 5
NA == NA    # even this is NA, not TRUE!
```
```output
[1] NA
[1] NA
```

> **Why does `NA == NA` return NA?** Because you don't know what the missing value is. If person A's age is unknown and person B's age is unknown, you can't say they're equal — they might be different ages.

The correct way to test for NA is always `is.na()`:

```r
x <- NA
is.na(x)
```
```output
[1] TRUE
```
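
`is.na()` also works on whole vectors, which makes it easy to count or drop missing values. A minimal sketch with a hypothetical `ages` vector; `!` means NOT:

```r
ages <- c(21, NA, 35, NA)   # hypothetical data with two missing values

ages == NA                  # NA NA NA NA: == cannot find missing values

missing <- is.na(ages)
missing                     # FALSE TRUE FALSE TRUE
sum(missing)                # 2 missing values
ages[!missing]              # 21 35: only the known values
```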

## Vectors

A vector is R's most fundamental data structure — an ordered sequence of values of the **same type**. Create one with `c()`:

```r
scores <- c(88, 92, 75, 95, 61)
names  <- c("Alice", "Bob", "Carol")
scores[2]       # 1-indexed — first element is [1]
length(scores)
```
```output
[1] 92
[1] 5
```

**Type coercion** happens silently when you mix types. R converts everything to the most flexible type (character > numeric > logical):

```r
c(2, TRUE, "banana")  # all become character
c(2, TRUE)            # TRUE becomes 1
```
```output
[1] "2"      "TRUE"   "banana"
[1] 2 1
```

## Operations Work Element-Wise

One of R's most useful features: math and comparisons apply to every element:

```r
scores - 70           # subtract 70 from each
scores > 80           # TRUE/FALSE for each
sum(scores > 80)      # count how many passed
mean(scores)
```
```output
[1] 18 22  5 25 -9
[1]  TRUE  TRUE FALSE  TRUE FALSE
[1] 3
[1] 82.2
```

> **Key insight:** `sum()` treats `TRUE` as 1 and `FALSE` as 0. So `sum(scores > 80)` is a clean way to count how many scores are above 80 — no loops needed.
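
The same idea gives proportions: since `TRUE` counts as 1 and `FALSE` as 0, the mean of a logical vector is the fraction of `TRUE` values. A short sketch reusing the `scores` vector from above (`above_80` is an illustrative name):

```r
scores <- c(88, 92, 75, 95, 61)
above_80 <- scores > 80

sum(above_80)    # 3 scores are above 80
mean(above_80)   # 0.6, so 60% of the scores are above 80
```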

## Dataframes and Tibbles

A dataframe organizes multiple vectors as columns in a table — like a spreadsheet in R:

```r
library(tidyverse)
students <- tibble(
  name  = c("Alice", "Bob", "Carol"),
  score = c(88, 92, 75),
  pass  = c(TRUE, TRUE, FALSE)
)
students
```
```output
# A tibble: 3 × 3
  name  score pass
  <chr> <dbl> <lgl>
1 Alice    88 TRUE
2 Bob      92 TRUE
3 Carol    75 FALSE
```

Access a column with `$`, or use `[row, col]` indexing:

```r
students$score       # extract as vector
students[1, ]        # entire first row
students[, 2]        # entire second column
```
```output
[1] 88 92 75
# A tibble: 1 × 3
  name  score pass
  <chr> <dbl> <lgl>
1 Alice    88 TRUE
# A tibble: 3 × 1
  score
  <dbl>
1    88
2    92
3    75
```

> **tibble vs data.frame:** `tibble()` is the tidyverse version of a dataframe. It prints more nicely, never converts strings to factors by default, and gives more helpful error messages. In STAT 240, we always use tibbles.

## Mathematical Operators

R supports the standard arithmetic operators:

```r
5 + 3      # addition
5 - 3      # subtraction
5 * 3      # multiplication
5 / 3      # division
2 ^ 10     # exponentiation (2 to the power of 10)
2 ** 10    # also exponentiation (equivalent to ^)
```
```output
[1] 8
[1] 2
[1] 15
[1] 1.666667
[1] 1024
[1] 1024
```

## The %in% Operator

Check if a value exists in a set:

```r
prime_numbers <- c(2, 3, 5, 7, 11, 13)
2 %in% prime_numbers
4 %in% prime_numbers
```
```output
[1] TRUE
[1] FALSE
```
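
The left-hand side of `%in%` can be a vector, in which case each element is checked separately. A minimal sketch; `candidates` and `is_prime` are illustrative names:

```r
prime_numbers <- c(2, 3, 5, 7, 11, 13)
candidates <- c(2, 4, 7, 9)

# One TRUE/FALSE per element of candidates
is_prime <- candidates %in% prime_numbers
is_prime               # TRUE FALSE TRUE FALSE
candidates[is_prime]   # 2 7
```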

## Numeric Shortcuts

The colon `:` creates a sequence:

```r
1:10
5:1
```
```output
[1]  1  2  3  4  5  6  7  8  9 10
[1]  5  4  3  2  1
```

## Common Vector Functions

```r
scores <- c(88, 92, 75, 95, 61)
min(scores)
max(scores)
mean(scores)
median(scores)
sum(scores)
log(scores)        # natural logarithm
```
```output
[1] 61
[1] 95
[1] 82.2
[1] 88
[1] 411
[1] 4.477337 4.521789 4.317488 4.553877 4.110874
```

## Exploring Dataframes

```r
head(students)       # first 6 rows
glimpse(students)    # compact overview
colnames(students)   # column names
dim(students)        # dimensions (rows, cols)
nrow(students)       # number of rows
ncol(students)       # number of columns
```
```output
# A tibble: 3 × 3
  name  score pass
  <chr> <dbl> <lgl>
1 Alice    88 TRUE
2 Bob      92 TRUE
3 Carol    75 FALSE
Rows: 3
Columns: 3
$ name  <chr> "Alice", "Bob", "Carol"
$ score <dbl> 88, 92, 75
$ pass  <lgl> TRUE, TRUE, FALSE
[1] "name"  "score" "pass"
[1] 3 3
[1] 3
[1] 3
```

## Subsetting with [row, col]

```r
students[1, ]        # first row, all columns
students[, 2]        # all rows, second column
students$name        # column by name
students[1, 3]       # first row, third column
```
```output
# A tibble: 1 × 3
  name  score pass
  <chr> <dbl> <lgl>
1 Alice    88 TRUE
# A tibble: 3 × 1
  score
  <dbl>
1    88
2    92
3    75
[1] "Alice" "Bob"   "Carol"
# A tibble: 1 × 1
  pass
  <lgl>
1 TRUE
```

## Coercion Hierarchy

When you mix different types in a single vector, R **coerces** everything to the most flexible type:

**logical → numeric → character**

Each arrow means "gets coerced to". Examples:

```r
c(TRUE, 1, "a")    # all become character
c(TRUE, 1)         # TRUE becomes 1
c(TRUE, FALSE, NA) # stays logical
```
```output
[1] "TRUE" "1"    "a"
[1] 1 1
[1]  TRUE FALSE    NA
```

> **Important:** `NA` is special — it exists in all types (logical, numeric, character). All three are just displayed as `NA`.
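
One way to check the hierarchy yourself is to ask for the class of each mixed vector. A short sketch; `NA_real_` and `NA_character_` are R's built-in typed missing values:

```r
class(c(TRUE, FALSE))    # "logical"
class(c(TRUE, 1))        # "numeric": the logical was promoted
class(c(TRUE, 1, "a"))   # "character": everything was promoted

# NA takes on the type of the vector it sits in
class(NA)                # "logical" (the default)
class(NA_real_)          # "numeric"
class(NA_character_)     # "character"
class(c(NA, 1))          # "numeric"
```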

## Factors

A **factor** stores categorical data, like species, color, or treatment group. Under the hood, it's an integer vector with a set of category labels (its levels):

```r
sizes <- factor(c("small","large","medium","large","small"))
levels(sizes)    # "large" "medium" "small" (alphabetical by default)
nlevels(sizes)   # 3
table(sizes)     # count per level
```
```output
[1] "large"  "medium" "small"
[1] 3
sizes
 large medium  small
     2      1      2
```

Factors matter in `ggplot()` (they determine the order of bars and axes) and in regression models (they're encoded as dummy variables).
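
Since alphabetical order is rarely the order you want for sizes, here is a minimal sketch of setting the level order yourself with the `levels` argument of `factor()`. The integer codes shown by `as.integer()` are the "under the hood" values.

```r
sizes <- factor(
  c("small", "large", "medium", "large", "small"),
  levels = c("small", "medium", "large")   # explicit, meaningful order
)

levels(sizes)       # "small" "medium" "large"
as.integer(sizes)   # 1 3 2 3 1: position of each value in levels
table(sizes)        # counts in level order: small 2, medium 1, large 2
```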

## Accessing and Modifying Vectors

R uses **1-based indexing** (not 0-based like Python). Access elements with square brackets:

```r
x <- c(10, 20, 30, 40, 50)
x[3]           # 30 — single element at position 3
x[c(1,3,5)]    # 10 30 50 — multiple positions
x[-2]          # all except index 2: 10 30 40 50
x[x > 25]      # logical subsetting: 30 40 50
x[2] <- 99     # replace element 2
x
```
```output
[1] 30
[1] 10 30 50
[1] 10 30 40 50
[1] 30 40 50
[1] 10 99 30 40 50
```

Negative indexing (`x[-2]`) means "all EXCEPT index 2".

## Checking and Replacing NAs

The `is.na()` function tests for missing values.

There are two different approaches to handling NAs:

**Option 1: Replace NAs with a value first, then compute**
```r
scores <- c(88, NA, 75, NA, 61)
scores[is.na(scores)] <- 0     # NAs become 0
mean(scores)                    # mean of c(88, 0, 75, 0, 61)
```
```output
[1] 44.8
```

**Option 2: Keep NAs, but tell the function to skip them**
```r
scores <- c(88, NA, 75, NA, 61)
mean(scores, na.rm = TRUE)     # ignores the two NAs
```
```output
[1] 74.66667
```

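> **Key difference:** Option 1 treats NAs as 0, so they still count toward the denominator: (88 + 0 + 75 + 0 + 61) / 5 = 44.8. Option 2 computes the mean of only the non-NA values: (88 + 75 + 61) / 3 ≈ 74.67.

Most summary functions, including `mean()`, `sum()`, `min()`, `max()`, and `median()`, accept `na.rm = TRUE` to ignore missing values.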


# Data Visualization with ggplot2

## The Grammar of Graphics

ggplot2 is built on a concept called the **grammar of graphics** — the idea that every plot can be described by a small set of components assembled together. Once you understand the grammar, you can build any chart.

The three required components:
- **Data** — the dataframe
- **Aesthetics** (`aes()`) — which columns map to which visual properties
- **Geoms** — what shape to draw (points, bars, lines, etc.)

> **Why ggplot2?** Most plotting tools require you to specify low-level details (this pixel, that color). ggplot2 lets you think in terms of your data — "map species to color" — and handles the rendering automatically.

## Building a Plot Layer by Layer

Every ggplot starts the same way and grows with `+`:

```r
library(tidyverse)
library(palmerpenguins)  # penguins dataset (assumed to come from the palmerpenguins package)

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
```
```output
# Scatter plot of flipper length vs body mass: 342 points
# (the 2 rows with missing values are dropped with a warning)
```

Add layers to make it richer:

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g,
                     color = species)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Penguin Flipper Length vs Body Mass",
    x     = "Flipper Length (mm)",
    y     = "Body Mass (g)"
  )
```
```output
# Scatter with colored points by species + linear trend lines
```

## Variable vs Constant Aesthetics

This distinction trips up almost everyone at first.

**Variable aesthetic** — inside `aes()`, maps a data column to a visual property. Each unique value gets a different appearance:

```r
geom_point(aes(color = species))   # different color per species
```

**Constant aesthetic** — outside `aes()`, applies the same value to every element:

```r
geom_point(color = "red")          # ALL points are red
geom_point(size = 3)               # ALL points size 3
```

> **The rule:** If the value comes from your data, put it inside `aes()`. If it's a fixed visual setting, put it outside `aes()` directly in the geom.

Common mistake — putting a fixed color inside `aes()`:

```r
geom_point(aes(color = "blue"))  # WRONG: treats "blue" as a data value
geom_point(color = "blue")       # CORRECT: sets all points blue
```

## Global vs Local Aesthetics

**Global** aesthetics go in `ggplot(aes(...))` and are inherited by all geom layers:

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +       # inherits x and y
  geom_smooth()        # also inherits x and y — no error
```

**Local** aesthetics go inside a specific geom and only apply there:

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species)) +  # color only for points
  geom_smooth()                        # no color grouping on smooth
```

## Choosing the Right Geom

| Goal | Geom | Notes |
|------|------|-------|
| Distribution of continuous var | `geom_histogram()` | Use binwidth to control bin size |
| Smooth density curve | `geom_density()` | Better for comparing groups |
| Count of categories | `geom_bar()` | Counts rows automatically |
| You have y values already | `geom_col()` | You supply both x and y |
| Relationship between two vars | `geom_point()` | Add `geom_smooth()` for trend |
| Change over time | `geom_line()` | Connect points in order |
| Reference line | `geom_hline()` / `geom_vline()` | Constant line on plot |

> **geom_bar vs geom_col:** This is a frequent exam question. `geom_bar()` counts rows for you — you only provide x. `geom_col()` plots y values you already have — you provide both x and y.
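
To make the distinction concrete, here is a sketch that draws the same bar chart both ways. The `animals` and `animal_counts` data frames are hypothetical, made up only for this illustration; the first has one row per observation, the second already contains the counts.

```r
library(ggplot2)

# Hypothetical raw data: one row per observed animal
animals <- data.frame(kind = c("cat", "dog", "dog", "bird", "dog", "cat"))

# geom_bar(): supply only x; ggplot2 counts the rows in each category
p_bar <- ggplot(animals, aes(x = kind)) +
  geom_bar()

# Hypothetical pre-summarized data: the counts are already a column
animal_counts <- data.frame(
  kind = c("bird", "cat", "dog"),
  n    = c(1, 2, 3)
)

# geom_col(): supply both x and y; bar heights come straight from n
p_col <- ggplot(animal_counts, aes(x = kind, y = n)) +
  geom_col()

p_bar   # three bars with heights 1, 2, 3
p_col   # the same three bars
```

If your data is already summarized (for example, the result of `count()`), `geom_col()` is the natural fit; with raw rows, `geom_bar()` saves you the counting step.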

## Facets: Small Multiples

`facet_wrap()` splits your plot into separate panels by a variable. This is often easier to read than separating groups by color alone:

```r
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200) +
  facet_wrap(vars(species)) +
  labs(title = "Body Mass Distribution by Species")
```
```output
# Three side-by-side histograms, one per species
```

> **When to facet:** When you have 3+ groups and they overlap so much that a single plot is unreadable. Facets trade space for clarity.

## One-Variable Distributions: Histograms & Density

For continuous variables, plot the distribution with a histogram:

```r
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)
```

Control bin edges with `boundary`:

```r
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200, boundary = 0)
```

Overlay a smooth density curve:

```r
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200, aes(y = after_stat(density))) +
  geom_density(color = "blue", linewidth = 1)
```

## Boxplots for Distributions

```r
ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot(aes(fill = species), alpha = 0.7)
```

## Adding Trend Lines

Use `geom_smooth()` to add a trend line:

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # se=FALSE removes confidence band
```

## Transparency & Shapes

Control transparency with `alpha` (0 = fully transparent, 1 = fully opaque):

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.5, size = 3)  # semi-transparent points
```

Control point shape with `shape`:

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(shape = species), size = 3)
```

## Labels & Titles

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  labs(
    title = "Penguin Flipper Length vs Body Mass",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )
```

## color vs fill

For plots with bars or filled shapes:
- `color` colors the **border** or outline
- `fill` colors the **interior**

```r
ggplot(penguins, aes(x = species)) +
  geom_bar(aes(fill = species))  # fill bars by species

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species))  # color the points
```

## Scales and Themes

Control plot colors and appearance with `scale_*()` functions and `theme()`:

```r
ggplot(penguins, aes(x = species, fill = species)) +
  geom_bar() +
  scale_fill_manual(values = c("darkorange","purple","cyan4")) +
  theme_minimal() +
  theme(legend.position = "none")
```

Common complete themes:
- `theme_minimal()` — clean, minimal background
- `theme_classic()` — classic x/y axes, no grid
- `theme_bw()` — white background with gridlines

## Reordering Categories with fct_reorder()

By default, factors are ordered alphabetically. Use `fct_reorder()` from the forcats package to reorder by another variable:

```r
ggplot(penguins, aes(x = fct_reorder(species, body_mass_g, .fun=mean),
                     y = body_mass_g)) +
  geom_boxplot()
```

`fct_reorder(x, y)` orders the levels of x by the **median** of y (the default). Pass `.fun = mean` to sort by mean instead. This is essential for ranked bar charts and ordered boxplots.
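
The reordering is easier to see on a tiny example than on a plot. A minimal sketch with made-up `group` and `value` vectors, inspecting the levels directly; `.desc = TRUE` is the `fct_reorder()` argument for descending order:

```r
library(forcats)

# Hypothetical data: three groups with two values each
group <- factor(c("a", "a", "b", "b", "c", "c"))
value <- c(5, 7, 1, 3, 9, 11)   # medians: a = 6, b = 2, c = 10

levels(group)   # "a" "b" "c": alphabetical default

# Reorder levels by the median of value within each group
by_median <- fct_reorder(group, value)
levels(by_median)        # "b" "a" "c"

# Same, but largest median first
by_median_desc <- fct_reorder(group, value, .desc = TRUE)
levels(by_median_desc)   # "c" "a" "b"
```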

## Common Exam Plot Mistakes

Watch out for these:

- **Wrong geom:** `geom_bar()` **counts** rows; `geom_col()` uses your y values. Know which you need.
- **Aesthetic placement:** Fixed values (like colors) go OUTSIDE `aes()`. Data-driven aesthetics go INSIDE.
- **Confidence bands:** `geom_smooth()` adds a band by default. Use `se=FALSE` to remove it.
- **Facet syntax:** Both `facet_wrap(vars(col))` and `facet_wrap(~col)` work in ggplot2. The `vars()` style is preferred in tidyverse code, but `~col` is not deprecated and won't cause errors.
- **Missing na.rm:** Summary geoms warn if data has NAs. Add `na.rm=TRUE` to suppress warnings.


# dplyr: Data Manipulation

## The Pipe: `%>%`

The pipe passes the left-hand result to the next function as its first argument. This makes chains of operations readable:

```r
# Without pipe — hard to read
filter(select(penguins, species, body_mass_g), species == "Adelie")

# With pipe — clean and sequential
penguins %>%
  select(species, body_mass_g) %>%
  filter(species == "Adelie")
```
```output
# A tibble: 152 × 2
   species body_mass_g
   <fct>         <int>
 1 Adelie         3750
 2 Adelie         3800
 3 Adelie         3250
 # … with 149 more rows
```


> **Note:** R 4.1+ introduced the native pipe `|>` which works identically to `%>%` for most purposes. You may see both in course materials and online resources. `x |> f()` is equivalent to `x %>% f()`.
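
A small sketch showing that the nested call, the `%>%` pipe, and the native pipe give the same answer on a plain numeric vector. It assumes R 4.1 or later for `|>`; the variable names are illustrative only.

```r
library(dplyr)   # provides %>%

values <- c(1, 4, 9)

nested <- sum(sqrt(values))              # read inside-out
piped  <- values %>% sqrt() %>% sum()    # read left to right
native <- values |> sqrt() |> sum()      # same, with the native pipe

c(nested, piped, native)   # 6 6 6
```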

## Column Operations

### `select()` — keep or drop columns

```r
penguins %>% select(species, body_mass_g, sex)
penguins %>% select(-island)   # drop island
```
```output
# A tibble: 344 x 3
   species body_mass_g sex   
   <fct>         <int> <fct> 
 1 Adelie         3750 male  
 2 Adelie         3800 male  
 3 Adelie         3250 female
 4 Adelie           NA NA    
 5 Adelie         3450 female
 6 Adelie         3650 male  
 7 Adelie         3625 female
 8 Adelie         4675 male  
 9 Adelie         3475 NA    
10 Adelie         4250 NA    
# i 334 more rows
# A tibble: 344 x 7
   species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex   
   <fct>            <dbl>         <dbl>             <int>       <int> <fct> 
 1 Adelie            39.1          18.7               181        3750 male  
 2 Adelie            39.5          17.4               186        3800 male  
 3 Adelie            40.3          18                 195        3250 female
# i 341 more rows
```

### `mutate()` — create or modify columns

```r
penguins %>%
  mutate(mass_kg = body_mass_g / 1000) %>%
  select(species, mass_kg)
```
```output
# A tibble: 344 × 2
   species mass_kg
   <fct>     <dbl>
 1 Adelie     3.75
 2 Adelie     3.8
 3 Adelie     3.25
```

## Row Operations

### `filter()` — keep rows matching a condition

```r
penguins %>%
  filter(species == "Chinstrap") %>%
  nrow()
```
```output
[1] 68
```

```r
penguins %>%
  filter(body_mass_g > 5000 & sex == "male") %>%
  select(species, body_mass_g)
```
```output
# A tibble: 27 × 2
   species body_mass_g
   <fct>         <int>
 1 Gentoo         5500
 2 Gentoo         5700
```

### `arrange()` — sort rows

```r
penguins %>%
  arrange(desc(body_mass_g)) %>%
  select(species, body_mass_g) %>%
  head(3)
```
```output
# A tibble: 3 × 2
  species body_mass_g
  <fct>         <int>
1 Gentoo         6300
2 Gentoo         6050
3 Gentoo         6000
```

`filter()` also pairs well with `%in%` when you want rows that match any of several values:


```r
# Filter for multiple values using %in%
penguins %>%
  filter(species %in% c("Adelie", "Chinstrap"))
# Returns only rows where species is Adelie OR Chinstrap
```
```output
# A tibble: 220 x 8
   species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex    year
   <fct>     <fct>              <dbl>         <dbl>             <int>       <int> <fct> <int>
 1 Adelie    Torgersen           39.1          18.7               181        3750 male   2007
 2 Adelie    Torgersen           39.5          17.4               186        3800 male   2007
 3 Adelie    Torgersen           40.3          18                 195        3250 female 2007
# i 217 more rows
```

## Summarizing Data

### `group_by()` + `summarize()`

```r
penguins %>%
  group_by(species) %>%
  summarize(
    avg_mass = mean(body_mass_g, na.rm = TRUE),
    n        = n()
  )
```
```output
# A tibble: 3 × 3
  species   avg_mass     n
  <fct>        <dbl> <int>
1 Adelie       3701.   152
2 Chinstrap    3733.    68
3 Gentoo       5076.   124
```

### `count()` — quick row counts per group

```r
penguins %>% count(species)
```
```output
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124
```


> **Note:** When you group by more than one variable, the result of `group_by() %>% summarize()` is still grouped by all but the last grouping variable. Use `.groups = "drop"` inside `summarize()`, or chain `ungroup()` afterward, to remove all grouping and avoid unexpected behavior in downstream operations.

```r
penguins %>%
  group_by(species, island) %>%
  summarize(avg_mass = mean(body_mass_g, na.rm = TRUE),
            .groups = "drop")   # removes all grouping from result
```
```output
# A tibble: 5 x 3
  species   island    avg_mass
  <fct>     <fct>        <dbl>
1 Adelie    Biscoe       3710.
2 Adelie    Dream        3688.
3 Adelie    Torgersen    3706.
4 Chinstrap Dream        3733.
5 Gentoo    Biscoe       5076.
```

## `case_when()` — conditional values

```r
penguins %>%
  mutate(size = case_when(
    body_mass_g > 5000 ~ "large",
    body_mass_g > 3500 ~ "medium",
    .default           = "small"
  )) %>%
  count(size)
```
```output
# A tibble: 3 × 2
  size       n
  <chr>  <int>
1 large     61
2 medium   219
3 small     64
```

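Conditions evaluate in order, and the **first** TRUE match wins. Rows where no condition is TRUE, including rows where `body_mass_g` is `NA`, receive the `.default` value, so the two penguins with missing body mass are counted as "small" here.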
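
The effect of ordering is easiest to see by swapping two conditions. A minimal sketch on a made-up `mass` vector, assuming dplyr 1.1.0 or later for the `.default` argument:

```r
library(dplyr)

mass <- c(5600, 4200, 3000, NA)   # hypothetical body masses in grams

# Most specific condition first: 5600 matches "large" before "medium"
in_order <- case_when(
  mass > 5000 ~ "large",
  mass > 3500 ~ "medium",
  .default    = "small"
)
in_order   # "large" "medium" "small" "small"

# Broad condition first: 5600 is caught by mass > 3500 and never
# reaches the "large" test
reversed <- case_when(
  mass > 3500 ~ "medium",
  mass > 5000 ~ "large",
  .default    = "small"
)
reversed   # "medium" "medium" "small" "small"
```

In both versions the `NA` ends up as "small", because a missing comparison never counts as a match.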

## Relocating Columns

Move a column to a different position:

```r
penguins %>%
  relocate(sex, .after = species) %>%
  head()
```
```output
# A tibble: 6 x 8
  species sex    island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
  <fct>   <fct>  <fct>              <dbl>         <dbl>             <int>       <int> <int>
1 Adelie  male   Torgersen           39.1          18.7               181        3750  2007
2 Adelie  male   Torgersen           39.5          17.4               186        3800  2007
3 Adelie  female Torgersen           40.3          18                 195        3250  2007
4 Adelie  NA     Torgersen           NA            NA                  NA          NA  2007
5 Adelie  female Torgersen           36.7          19.3               193        3450  2007
6 Adelie  male   Torgersen           39.3          20.6               190        3650  2007
```

## Renaming Columns

```r
penguins %>%
  rename(flipper = flipper_length_mm, mass = body_mass_g) %>%
  head()
```
```output
# A tibble: 6 x 8
  species island    bill_length_mm bill_depth_mm flipper  mass sex    year
  <fct>   <fct>              <dbl>         <dbl>   <int> <int> <fct> <int>
1 Adelie  Torgersen           39.1          18.7     181  3750 male   2007
2 Adelie  Torgersen           39.5          17.4     186  3800 male   2007
3 Adelie  Torgersen           40.3          18       195  3250 female 2007
4 Adelie  Torgersen           NA            NA        NA    NA NA     2007
5 Adelie  Torgersen           36.7          19.3     193  3450 female 2007
6 Adelie  Torgersen           39.3          20.6     190  3650 male   2007
```


To sort in **descending** order, wrap the column in `desc()` inside `arrange()`:

```r
penguins %>%
  arrange(desc(body_mass_g)) %>%
  select(species, body_mass_g) %>%
  head(3)
```
```output
# A tibble: 3 x 2
  species body_mass_g
  <fct>         <int>
1 Gentoo         6300
2 Gentoo         6050
3 Gentoo         6000
```

## group_by() + mutate()

Unlike `summarize()`, `mutate()` adds a column while keeping **all original rows**. Each row gets the per-group value:

```r
penguins %>%
  group_by(species) %>%
  mutate(
    species_mean_mass = mean(body_mass_g, na.rm = TRUE)
  ) %>%
  select(species, body_mass_g, species_mean_mass) %>%
  head()
```
```output
# A tibble: 6 x 3
# Groups:   species [1]
  species body_mass_g species_mean_mass
  <fct>         <int>             <dbl>
1 Adelie         3750             3701.
2 Adelie         3800             3701.
3 Adelie         3250             3701.
4 Adelie           NA             3701.
5 Adelie         3450             3701.
6 Adelie         3650             3701.
```

Then use `ungroup()` to remove grouping:

```r
penguins %>%
  group_by(species) %>%
  mutate(species_count = n()) %>%
  ungroup()
```
```output
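# A tibble: 344 x 9
# All 344 rows are kept. The new species_count column holds 152 on
# every Adelie row, 68 on every Chinstrap row, and 124 on every Gentoo row.
# There is no "# Groups:" line because ungroup() removed the grouping.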
```
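
To check grouped `mutate()` and `ungroup()` by hand, here is a sketch on a tiny made-up `games` tibble. The column names are hypothetical; `is_grouped_df()` is dplyr's test for whether a tibble still carries grouping.

```r
library(dplyr)

# Hypothetical data: two teams, two games each
games <- tibble(
  team   = c("red", "red", "blue", "blue"),
  points = c(10, 20, 30, 50)
)

with_means <- games %>%
  group_by(team) %>%
  mutate(
    team_mean = mean(points),        # computed separately per team
    diff      = points - team_mean   # each row compared to its own team
  ) %>%
  ungroup()

with_means$team_mean        # 15 15 40 40
with_means$diff             # -5 5 -10 10
nrow(with_means)            # 4: every original row is kept
is_grouped_df(with_means)   # FALSE: grouping was removed
```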

## Multiple Summaries at Once

```r
penguins %>%
  group_by(species) %>%
  summarize(
    avg = mean(body_mass_g, na.rm = TRUE),
    min_mass = min(body_mass_g, na.rm = TRUE),
    max_mass = max(body_mass_g, na.rm = TRUE),
    count = n()
  )
```
```output
# A tibble: 3 x 5
  species     avg min_mass max_mass count
  <fct>     <dbl>    <int>    <int> <int>
1 Adelie    3701.     2850     4775   152
2 Chinstrap 3733.     2700     4800    68
3 Gentoo    5076.     3950     6300   124
```

## slice_min() and slice_max()

Keep the top/bottom k rows per group using `slice_max()` and `slice_min()`:

```r
penguins %>%
  group_by(species) %>%
  slice_max(body_mass_g, n = 1)   # heaviest penguin per species
```
```output
# A tibble: 3 x 8
# Groups:   species [3]
  species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex    year
  <fct>     <fct>           <dbl>         <dbl>             <int>       <int> <fct> <int>
1 Adelie    Biscoe           43.2          19                 197        4775 male   2009
2 Chinstrap Dream            52            20.7               210        4800 male   2008
3 Gentoo    Biscoe           49.2          15.2               221        6300 male   2007
```

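- `slice_max(col, n = k)` keeps the k rows with the **largest** values.
- `slice_min(col, n = k)` keeps the k rows with the **smallest** values.

By default both keep ties (`with_ties = TRUE`), so a group can return more than k rows when several rows share the cutoff value.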

## drop_na()

Remove rows with missing values:

```r
penguins %>% drop_na()              # remove any row with ANY NA
penguins %>% drop_na(body_mass_g)   # remove rows where body_mass_g is NA
```
```output
# A tibble: 333 x 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex    year
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct> <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750 male   2007
 2 Adelie  Torgersen           39.5          17.4               186        3800 male   2007
 3 Adelie  Torgersen           40.3          18                 195        3250 female 2007
# i 330 more rows
# A tibble: 342 x 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex    year
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct> <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750 male   2007
 2 Adelie  Torgersen           39.5          17.4               186        3800 male   2007
 3 Adelie  Torgersen           40.3          18                 195        3250 female 2007
# i 339 more rows
```

## mutate() vs summarize() — Key Distinction

These two verbs handle grouping differently:

- `mutate()` → returns **same number of rows** as input
- `summarize()` → returns **one row per group**

```r
# mutate: 344 rows → 344 rows
penguins %>%
  group_by(species) %>%
  mutate(species_avg = mean(body_mass_g, na.rm=TRUE)) %>%
  nrow()
```
```output
[1] 344
```

```r
# summarize: 344 rows → 3 rows (one per species)
penguins %>%
  group_by(species) %>%
  summarize(avg = mean(body_mass_g, na.rm=TRUE)) %>%
  nrow()
```
```output
[1] 3
```

Use `mutate()` to add new columns; use `summarize()` to collapse groups into summaries.

## Full Pipeline Example

Combining filter, group_by, summarize, and arrange:

```r
penguins %>%
  filter(!is.na(body_mass_g), !is.na(sex)) %>%
  group_by(species, sex) %>%
  summarize(
    avg_mass = mean(body_mass_g),
    n = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_mass))
```
```output
# A tibble: 6 x 4
  species   sex    avg_mass     n
  <fct>     <fct>     <dbl> <int>
1 Gentoo    male      5485.    61
2 Gentoo    female    4680.    58
3 Adelie    male      4043.    73
4 Chinstrap male      3939.    34
5 Chinstrap female    3527.    34
6 Adelie    female    3369.    73
```

This removes rows with missing data, groups by species and sex, computes mean mass and count per group, then sorts by descending average mass.


# Joining & Pivoting Data

## Why Multiple Tables?

Real data rarely lives in a single table. A school database might have one table for students, another for courses, another for grades. Keeping them separate avoids repetition — you don't need to re-type "Introduction to Statistics" for every student who takes that course.

**Joins** let you combine tables using a shared **key column** — a value that appears in both tables and links the rows.

> **Key concept:** When you join tables, you're asking: "For each row in table A, find the matching row(s) in table B." The answer depends on which join type you use.

## semi_join() — Filter by Match (No Duplication)

`semi_join(x, y)` returns all rows from x that **have a match** in y — but keeps only x's columns. It's the "positive filter" complement to `anti_join()`:

```r
# Keep only students who have a grade recorded
semi_join(students, grades, by = "student_id")
```
```output
# A tibble: 4 x 3
  student_id name  score
       <int> <chr> <dbl>
1          1 Alice    88
2          2 Bob      92
3          3 Carol    75
4          4 Dan      95
```

Unlike `inner_join()`, `semi_join()` never duplicates rows even if there are multiple matches in y.

| Function | Keeps rows from x... | Adds y's columns? |
|---|---|---|
| inner_join | with a match in y | Yes |
| left_join | all rows | Yes (NA if no match) |
| anti_join | with NO match in y | No |
| semi_join | with a match in y | No |
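
As a minimal sketch of the duplication point, assume two small made-up tables (`pupils` and `marks` are illustrative names, not the course data) in which one student has two recorded grades:

```r
library(dplyr)

# Hypothetical example tables (not the course data)
pupils <- tibble(student_id = 1:3, name = c("Ana", "Ben", "Cy"))
marks  <- tibble(student_id = c(1L, 1L, 2L), grade = c("A", "B", "C"))

inner_rows <- inner_join(pupils, marks, by = "student_id")
semi_rows  <- semi_join(pupils, marks, by = "student_id")

nrow(inner_rows)   # 3: Ana appears twice, once per matching grade
nrow(semi_rows)    # 2: Ana and Ben, each once
names(semi_rows)   # only the columns of pupils
```

`inner_join()` repeats Ana once per matching grade, while `semi_join()` only asks whether a match exists.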

## When Keys Have Different Names

When the two tables store the same key under different column names, use `join_by()` to say which columns correspond:

```r
# Major table uses "Student_Name", Language table uses "Name"
inner_join(Major, Language,
           by = join_by(Student_Name == Name))
```
```output
# 15 rows — students present in both tables
```
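
A runnable version of this idea might look like the sketch below. The tables `majors` and `languages` are hypothetical stand-ins for `Major` and `Language`, and `join_by()` assumes dplyr 1.1.0 or later:

```r
library(dplyr)

# Hypothetical stand-ins for the Major and Language tables
majors    <- tibble(Student_Name = c("Ana", "Ben", "Cy"),
                    major = c("Stats", "Math", "CS"))
languages <- tibble(Name = c("Ana", "Cy", "Dee"),
                    language = c("R", "Python", "Julia"))

joined <- inner_join(majors, languages,
                     by = join_by(Student_Name == Name))
joined          # 2 rows: Ana and Cy appear in both tables
names(joined)   # the key keeps the first table's name, Student_Name
```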

## anti_join: Finding Non-Matches

`anti_join` is underused but powerful. It returns rows from A that have **no match** in B:

```r
# Find players who have NOT previously won an award
anti_join(all_players, past_winners, by = "player_id")
```
```output
# All players NOT in past_winners
```

> **When to use anti_join:** When you want to find "what's in A but not in B" — unmatched students, products never ordered, etc.
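
Here is a small self-contained sketch of the same call. The contents of `all_players` and `past_winners` are invented for illustration:

```r
library(dplyr)

# Hypothetical tables for illustration
all_players  <- tibble(player_id = 1:5,
                       player = c("Ava", "Bo", "Cal", "Di", "Eli"))
past_winners <- tibble(player_id = c(2L, 4L), award_year = c(2021, 2023))

never_won <- anti_join(all_players, past_winners, by = "player_id")
never_won$player   # "Ava" "Cal" "Eli"

# semi_join and anti_join split the rows of the first table into two parts
n_matched <- nrow(semi_join(all_players, past_winners, by = "player_id"))
n_matched + nrow(never_won) == nrow(all_players)   # TRUE
```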

## right_join() and full_join()

`left_join()` keeps all rows from the **left** table:

```r
left_join(students, grades)  # all students, NAs for those without grades
```
```output
Joining with `by = join_by(student_id)`
# A tibble: 5 x 4
  student_id name  score grade
       <int> <chr> <dbl> <chr>
1          1 Alice    88 B    
2          2 Bob      92 A    
3          3 Carol    75 C    
4          4 Dan      95 A    
5          5 Eve      83 NA   
```

`right_join()` keeps all rows from the **right** table:

```r
right_join(students, grades)  # all rows from grades; students without a grade are dropped
```
```output
Joining with `by = join_by(student_id)`
# A tibble: 4 x 4
  student_id name  score grade
       <int> <chr> <dbl> <chr>
1          1 Alice    88 B    
2          2 Bob      92 A    
3          3 Carol    75 C    
4          4 Dan      95 A    
```

> **Equivalence:** `right_join(x, y)` returns the same rows as `left_join(y, x)`; only the column order differs.

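`full_join()` keeps **all** rows from both tables. In this example every row of `grades` matches a student, so the result is the same as the `left_join()` above: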

```r
full_join(students, grades)  # all students AND all grades, NAs everywhere they don't match
```
```output
Joining with `by = join_by(student_id)`
# A tibble: 5 x 4
  student_id name  score grade
       <int> <chr> <dbl> <chr>
1          1 Alice    88 B    
2          2 Bob      92 A    
3          3 Carol    75 C    
4          4 Dan      95 A    
5          5 Eve      83 NA   
```
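
The three joins only differ when each table has rows the other lacks. The sketch below uses made-up tables (`students_demo` and `grades_demo` are hypothetical names) where student 5 has no grade and grade row 6 has no student:

```r
library(dplyr)

# Hypothetical tables: id 5 is only in students_demo, id 6 only in grades_demo
students_demo <- tibble(student_id = 1:5,
                        name = c("Alice", "Bob", "Carol", "Dan", "Eve"))
grades_demo   <- tibble(student_id = c(1L, 2L, 3L, 4L, 6L),
                        grade = c("B", "A", "C", "A", "B"))

left_res  <- left_join(students_demo, grades_demo, by = "student_id")
right_res <- right_join(students_demo, grades_demo, by = "student_id")
full_res  <- full_join(students_demo, grades_demo, by = "student_id")

# left keeps ids 1-5, right keeps ids 1-4 and 6, full keeps all six ids
c(left = nrow(left_res), right = nrow(right_res), full = nrow(full_res))
```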

## Pivoting: Reshaping Data

**Tidy data** has one observation per row and one variable per column. Sometimes data arrives in a "wide" format that needs reshaping.

Wide format — each time point is a separate column:

```r
# quiz_scores: student | q1 | q2 | q3
```

Long (tidy) format — one row per student-quiz combination:

```r
# quiz_scores: student | quiz | score
```

### `pivot_longer()` — wide to long

```r
quiz_scores %>%
  pivot_longer(
    cols      = c(q1, q2, q3),
    names_to  = "quiz",
    values_to = "score"
  )
```
```output
# A tibble: (3 × original_rows) rows
  student quiz  score
  <chr>   <chr> <dbl>
1 Alice   q1       88
2 Alice   q2       91
3 Alice   q3       79
4 Bob     q1       72
# ... etc
```

### `pivot_wider()` — long to wide

```r
# Long format: name | measurement | value
children %>%
  pivot_wider(
    names_from  = measurement,
    values_from = value
  )
```
```output
# A tibble: 3 × 3
  name  height weight
  <chr>  <dbl>  <dbl>
1 Ross      43   40.2
2 Terry     46   43.6
3 Ellie     44   39.0
```

> **Why tidy data matters:** ggplot2 and dplyr are designed for long/tidy format. A dataset with columns jan, feb, mar can't be plotted easily — you'd need three separate geom calls. After pivot_longer, you have one `month` column and can use `color = month` or `facet_wrap(vars(month))` automatically.
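
One way to convince yourself that pivoting only reshapes the data is a round trip. This sketch assumes a hypothetical `quiz_wide` table (Alice's scores follow the example above; Bob's `q2` and `q3` are invented):

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table of quiz scores
quiz_wide <- tibble(student = c("Alice", "Bob"),
                    q1 = c(88, 72), q2 = c(91, 85), q3 = c(79, 90))

# Wide to long: one row per student-quiz combination
quiz_long <- quiz_wide %>%
  pivot_longer(cols = q1:q3, names_to = "quiz", values_to = "score")

# Long back to wide: one column per quiz again
quiz_back <- quiz_long %>%
  pivot_wider(names_from = quiz, values_from = score)

nrow(quiz_long)    # 6 = 2 students x 3 quizzes
names(quiz_back)   # "student" "q1" "q2" "q3"
```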

## bind_rows(): Stacking Vertically

When tables have the same columns but different rows, stack them:

```r
group1 <- tibble(id = 1:3, score = c(85, 90, 78))
group2 <- tibble(id = 4:6, score = c(92, 88, 81))

bind_rows(group1, group2)  # append rows vertically
```
```output
# A tibble: 6 x 2
     id score
  <int> <dbl>
1     1    85
2     2    90
3     3    78
4     4    92
5     5    88
6     6    81
```

> **Difference from joins:** `bind_rows()` stacks rows directly. `full_join()` matches on keys and fills NAs — it's for combining related data from different sources. `bind_rows()` is for appending separate datasets.

## Checking for Duplicate Keys

Before joining, verify that the join key is **unique** in at least one table. Duplicates create unexpected row multiplication:

```r
# Check before joining
students %>% count(student_id) %>% filter(n > 1)
```
```output
# A tibble: 0 x 2
# i 2 variables: student_id <int>, n <int>
```

If this returns rows, you have duplicate keys — decide whether to aggregate, filter, or use a different join type.
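
To see what row multiplication looks like, here is a sketch with invented tables (`roster` and `enrollments` are hypothetical) where one key appears twice on the right-hand side:

```r
library(dplyr)

# Hypothetical tables: student_id 2 appears twice in enrollments
roster      <- tibble(student_id = 1:3, name = c("Ana", "Ben", "Cy"))
enrollments <- tibble(student_id = c(1L, 2L, 2L, 3L),
                      course = c("Statistics", "Statistics", "Calculus", "Statistics"))

# The duplicate check flags student_id 2
dupes <- enrollments %>% count(student_id) %>% filter(n > 1)
dupes

# The join returns 4 rows even though roster has only 3
joined <- left_join(roster, enrollments, by = "student_id")
nrow(joined)
```

Here the extra row is legitimate (Ben takes two courses), which is why the check is a prompt to think rather than an automatic error.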

## Pivot Longer — Column Selection Options

Select columns to pivot using helper functions:

```r
# Use starts_with() to select columns
df %>% pivot_longer(cols = starts_with("month_"),
                    names_to = "month", values_to = "value")

# Use a column range
df %>% pivot_longer(cols = jan:dec,
                    names_to = "month", values_to = "sales")

# Strip a prefix from column names
df %>% pivot_longer(cols = wk1:wk3,
                    names_to = "week",
                    names_prefix = "wk",
                    values_to = "rank")
```
```output
# A tibble: 9 x 3
  id    month   value
  <int> <chr>   <dbl>
1     1 month_1   100
2     1 month_2   120
3     1 month_3   110
4     2 month_1    95
5     2 month_2   130
6     2 month_3   105
7     3 month_1   108
8     3 month_2   115
9     3 month_3   112
```

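`names_prefix` is useful when column names share a prefix you want to strip from the values in the `names_to` column: here `wk1`, `wk2`, `wk3` become `1`, `2`, `3`. The output above corresponds to the first call (`starts_with("month_")`); each call assumes a `df` that contains the columns it names.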
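
A self-contained sketch of the `names_prefix` call, using an invented `ranks` table (the song names and values are hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical weekly chart ranks
ranks <- tibble(song = c("A", "B"),
                wk1 = c(3, 10), wk2 = c(2, 8), wk3 = c(1, 9))

long_ranks <- ranks %>%
  pivot_longer(cols = wk1:wk3,
               names_to = "week",
               names_prefix = "wk",   # strip "wk" from the new week values
               values_to = "rank")

long_ranks$week   # "1" "2" "3" "1" "2" "3" (still character, not numeric)
```

Because the stripped values are still text, a step such as `mutate(week = as.integer(week))` would be needed before treating `week` as a number.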


This checkpoint covers all Midterm 1 material: Base R, ggplot2, dplyr, joins, and pivoting. Questions are formatted to match the actual SP26 exam — mix of select-one, select-all, and fill-in-the-blank.

Midterm 1 Checkpoint

# Probability Foundations

## What is Probability?

Probability is a number between 0 and 1 that measures how likely an event is to occur. P(A) = 0 means A never happens; P(A) = 1 means A always happens.

Before we can compute probabilities, we need to define:
- **Sample space (S)** — the set of ALL possible outcomes
- **Event** — any subset of the sample space

For a fair six-sided die: S = {1, 2, 3, 4, 5, 6}. The event "roll an even number" = {2, 4, 6}.

> **Equally likely outcomes:** When all outcomes are equally likely, P(A) = (# outcomes in A) / (# outcomes in S). Rolling an even number: P = 3/6 = 0.5.

## The Complement Rule

Every event A has a complement A' ("not A"). Together they cover everything:

P(A) + P(A') = 1
P(A') = 1 - P(A)

> **Why this matters:** Often it's easier to find P(not A) and subtract from 1. "P(at least one success in 10 trials)" is hard directly, but "P(zero successes)" is easy with the binomial — then subtract.
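
As a quick numeric illustration of the complement trick, the sketch below assumes independent trials with a success probability of 0.2 (a value chosen only for this example):

```r
p <- 0.2   # assumed success probability per trial
n <- 10    # number of independent trials

p_none <- (1 - p)^n            # P(zero successes)
p_at_least_one <- 1 - p_none   # complement rule
p_at_least_one

# Same answer from the binomial PMF at 0
1 - dbinom(0, size = n, prob = p)
```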

## Union and Intersection

- **Intersection** (A intersect B): "A AND B both occur"
- **Union** (A union B): "A OR B (or both) occur"

The **Addition Rule** connects them:

P(A union B) = P(A) + P(B) - P(A intersect B)

We subtract the intersection to avoid double-counting it.

If A and B are **mutually exclusive** (can't both happen), then P(A intersect B) = 0 and the formula simplifies:

P(A union B) = P(A) + P(B)

> **Example:** A standard deck of cards. P(King) = 4/52. P(Heart) = 13/52. P(King of Hearts) = 1/52. So P(King OR Heart) = 4/52 + 13/52 - 1/52 = 16/52.
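
The card example can be checked by brute force. This sketch builds a 52-card deck in base R (the object names are illustrative) and compares the addition rule with a direct count:

```r
# Build a 52-card deck: 13 ranks x 4 suits
deck <- expand.grid(rank = c(2:10, "J", "Q", "K", "A"),
                    suit = c("Hearts", "Diamonds", "Clubs", "Spades"),
                    stringsAsFactors = FALSE)

is_king  <- deck$rank == "K"
is_heart <- deck$suit == "Hearts"

p_king  <- mean(is_king)              # 4/52
p_heart <- mean(is_heart)             # 13/52
p_both  <- mean(is_king & is_heart)   # 1/52

p_king + p_heart - p_both             # addition rule: 16/52
mean(is_king | is_heart)              # direct count gives the same value
```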

## Conditional Probability

P(A | B) means "the probability of A, **given** that B has already occurred." You're restricting your attention to the world where B happened:

P(A|B) = P(A intersect B) / P(B)

This is different from P(A) unless A and B are independent.

> **Intuition:** If it's cloudy (B), the probability of rain (A) is higher than the unconditional P(rain). Knowing B happened updates your belief about A.

## Independence

Two events are **independent** if knowing one occurred tells you nothing about the other:

P(A|B) = P(A)  if and only if  P(A intersect B) = P(A) × P(B)

The multiplication rule for independent events is used constantly in probability. For example, when flipping two fair coins: P(both heads) = P(H) × P(H) = 0.5 × 0.5 = 0.25.

Be careful, though: assuming independence when it doesn't hold is one of the most common errors in statistics.

> **Mutually exclusive ≠ Independent.** This confuses many students. If A and B are mutually exclusive with positive probability, they CANNOT be independent, because knowing A occurred means B definitely didn't — that's information.
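
Both ideas can be checked on a deck of cards. In this base R sketch (object names are illustrative), "King" and "Heart" turn out to be independent, while "King" and "Queen" are mutually exclusive and therefore not independent:

```r
deck <- expand.grid(rank = c(2:10, "J", "Q", "K", "A"),
                    suit = c("Hearts", "Diamonds", "Clubs", "Spades"),
                    stringsAsFactors = FALSE)
is_king  <- deck$rank == "K"
is_queen <- deck$rank == "Q"
is_heart <- deck$suit == "Hearts"

# Conditional probability: P(King | Heart) = P(King and Heart) / P(Heart)
p_king_given_heart <- mean(is_king & is_heart) / mean(is_heart)
p_king_given_heart                 # 1/13
mean(is_king)                      # also 1/13, so King and Heart are independent

# Mutually exclusive: no card is both a King and a Queen
mean(is_king & is_queen)           # 0
mean(is_king) * mean(is_queen)     # 1/169, not 0, so NOT independent
```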

## Bayes' Theorem

Bayes' theorem reverses conditional probability. You know P(B|A) but want P(A|B):

P(A|B) = [ P(B|A) × P(A) ] / P(B)

**Example:** A test for a disease is 99% accurate, meaning it correctly flags 99% of people who have the disease and correctly clears 99% of people who don't. The disease affects 1% of the population. If you test positive, what's the probability you actually have the disease?

- P(positive | disease) = 0.99
- P(positive | no disease) = 0.01
- P(disease) = 0.01
- P(positive) = P(pos|disease)*P(disease) + P(pos|no disease)*P(no disease) = 0.99*0.01 + 0.01*0.99 = 0.0198
- P(disease | positive) = (0.99 * 0.01) / 0.0198 = 0.5

> **The surprising result:** Even a 99% accurate test gives only ~50% probability of actually having the disease when it's rare. This is why Bayes matters — intuition fails here.
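
The calculation is easy to wrap in a small helper so you can see how the answer moves with prevalence. `posterior()` is a hypothetical function written for this note, and it assumes "99% accurate" means both sensitivity and specificity are 0.99:

```r
# P(disease | positive) via Bayes' theorem
# Assumption: sensitivity = specificity = 0.99 unless stated otherwise
posterior <- function(prev, sens = 0.99, spec = 0.99) {
  p_pos <- sens * prev + (1 - spec) * (1 - prev)   # law of total probability
  sens * prev / p_pos
}

posterior(prev = 0.01)   # 0.5, matching the hand calculation
posterior(prev = 0.10)   # a more common disease gives a much higher value
```

The test itself is unchanged between the two calls; only the base rate differs, which is the point of the example.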

## Random Variables and Probability Distributions

A **random variable** X is a function that assigns a number to each outcome in a sample space. For example:
- X = result of rolling a die (values 1-6)
- X = number of heads in 3 coin flips (values 0-3)
- X = height of a randomly chosen student (values 58-80 inches)

A **probability distribution** specifies the probabilities of all possible values. For a **discrete** random variable, we list:
- **Support:** all possible values
- **Probability for each value:** P(X = x)
- **Constraint:** all probabilities sum to 1

## Expected Value E[X]

The expected value is the long-run average — the mean of the distribution:

E[X] = sum of x * P(X = x)

In R:

```r
vals <- 0:10          # support for Apgar scores
probs <- c(0.001, 0.006, 0.007, 0.008, 0.012, 0.02, 0.038, 
           0.099, 0.319, 0.437, 0.053)
E_X <- sum(vals * probs)
E_X
```
```output
[1] 8.128
```

## Variance and Standard Deviation

**Variance** measures spread around the mean:

Var(X) = sum of (x - mu)^2 * P(X = x)

In R:

```r
mu <- E_X
Var_X <- sum((vals - mu)^2 * probs)
SD_X  <- sqrt(Var_X)
Var_X
SD_X
```
```output
[1] 2.066
[1] 1.437
```

## Visualizing a Discrete Distribution

```r
apgar_data <- tibble(
  score = 0:10,
  prob = c(0.001, 0.006, 0.007, 0.008, 0.012, 0.02, 0.038, 
           0.099, 0.319, 0.437, 0.053)
)

ggplot(apgar_data, aes(x = score, y = prob)) +
  geom_col(fill = "steelblue", alpha = 0.8) +
  labs(title = "Apgar Score Distribution", x = "Score", y = "Probability")
```

> **Why this matters:** The shape of the distribution tells you about typical and unlikely outcomes. High concentration around 8-9 means most newborns have very good Apgar scores.

## Population vs Sample — Definitions

**Key distinction for all inference:**

- **Population:** entire group of interest; parameters (mu, sigma, p) describe it — **fixed but unknown**
- **Sample:** subset we observe; statistics (x-bar, s, p-hat) describe it — **vary from sample to sample**
- **Inference:** using sample statistics to estimate population parameters

| Quantity | Population Parameter | Sample Statistic |
|---|---|---|
| Mean | mu | x-bar |
| Standard Deviation | sigma | s |
| Proportion | p | p-hat |

Every hypothesis test and confidence interval is about using the sample statistic to learn about the population parameter.

## Discrete vs Continuous Random Variables

**Discrete RV:** takes countable values (0, 1, 2, 3, ...). P(X = k) can be nonzero. Examples: number of successes, count of defects.

**Continuous RV:** takes any value in an interval. P(X = exactly 3.14159...) = 0. Probability is always computed as an **area**. Examples: height, time, weight.

> **Critical rule:** For discrete (like Binomial): P(X >= k) ≠ P(X > k). For continuous (like Normal): P(X >= k) = P(X > k) since P(X=k)=0.

## Law of Total Probability

If A and A' partition the sample space:

**P(B) = P(B|A)*P(A) + P(B|A')*P(A')**

This formula lets you compute the total probability of B by conditioning on whether A occurs. Used extensively in Bayes' theorem and diagnostic testing problems.

## Preview: Cumulative Probabilities

In the next module, we'll work extensively with P(X <= k) — the probability that a random variable is at most k. This is called the **cumulative distribution function (CDF)**.

In R: `pbinom(k, size, prob)` gives P(X <= k) for a binomial random variable. We'll use this to answer questions like "what's the probability of getting at most 3 successes?"


# Binomial Distribution

## When to Use the Binomial

The binomial distribution models the number of successes in a fixed number of **Bernoulli trials** — independent experiments that can only end in success or failure.

Four conditions must ALL hold:
1. **Fixed n** — you know how many trials before you start
2. **Independence** — each trial's outcome doesn't affect others
3. **Constant p** — same probability of success on every trial
4. **Binary** — only two possible outcomes per trial

> **Classic examples:** Number of heads in 10 coin flips. Number of patients who recover out of 20 given a treatment. Number of correct guesses on a 10-question T/F test.

## The PMF Formula

For X ~ Binomial(n, p), the probability of exactly k successes:

P(X = k) = C(n,k) * p^k * (1-p)^(n-k)

The **C(n,k)** term counts the number of ways to arrange k successes among n trials. The p^k and (1-p)^(n-k) give the probability of any specific arrangement.

> **Example:** P(exactly 3 heads in 5 flips) = C(5,3) * 0.5^3 * 0.5^2 = 10 * 0.125 * 0.25 = 0.3125.
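
As a check on the formula, this sketch computes the same example by hand with `choose()` (which returns C(n, k)) and compares it with R's built-in binomial PMF:

```r
n <- 5; k <- 3; p <- 0.5

by_hand <- choose(n, k) * p^k * (1 - p)^(n - k)
by_hand                                   # 0.3125
dbinom(k, size = n, prob = p)             # same value from the built-in PMF

# The PMF over the whole support 0..n sums to 1
sum(dbinom(0:n, size = n, prob = p))
```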

## R Functions

In R, you'll never compute the PMF formula by hand — use these functions:

```r
# P(X = 3) when X ~ Binomial(10, 0.4)
dbinom(3, size = 10, prob = 0.4)

# P(X <= 5) — cumulative probability
pbinom(5, size = 10, prob = 0.4)

# P(X >= 6) — upper tail: 1 - P(X <= 5)
1 - pbinom(5, size = 10, prob = 0.4)

# P(X >= 6) directly with lower.tail = FALSE
pbinom(5, size = 10, prob = 0.4, lower.tail = FALSE)
```
```output
[1] 0.2149908
[1] 0.8337614
[1] 0.1662386
[1] 0.1662386
```

> **dbinom vs pbinom:** `d` = density (exact probability at one value). `p` = probability (cumulative, all values up to and including k). For "at least", remember that P(X >= k) = 1 - P(X <= k-1), not 1 - P(X <= k).

## Mean and Variance

For X ~ Binomial(n, p):
- **Mean:** E[X] = np
- **Variance:** Var(X) = np(1-p)
- **Standard deviation:** SD(X) = sqrt(np(1-p))

```r
n <- 20
p <- 0.3
mean_X <- n * p
var_X  <- n * p * (1 - p)
sd_X   <- sqrt(var_X)
mean_X
var_X
sd_X
```
```output
[1] 6
[1] 4.2
[1] 2.04939
```

> **Intuition for the mean:** If you flip a coin 20 times (p = 0.5), you expect 10 heads. If p = 0.3, you expect 20 * 0.3 = 6 successes. The mean is just n times the probability of a single success.

## Visualizing the Binomial

```r
tibble(k = 0:15) %>%
  mutate(prob = dbinom(k, size = 15, prob = 0.4)) %>%
  ggplot(aes(x = k, y = prob)) +
  geom_col(fill = "#00D4FF", alpha = 0.8) +
  labs(title = "Binomial(15, 0.4)", x = "Number of successes k", y = "P(X = k)")
```
```output
# Bar chart: bell-shaped, centered around k=6 (the mean), range 0-15
```

The distribution is symmetric only when p = 0.5. For p < 0.5 it's right-skewed; for p > 0.5 it's left-skewed.

## Combinations and Factorials

How many ways can you arrange k successes in n trials?

C(n,k) = n! / (k! * (n-k)!)

In R:

```r
choose(5, 3)       # C(5,3) = 10 ways to choose 3 items from 5
factorial(5)       # 5! = 120
```
```output
[1] 10
[1] 120
```

## The Four Binomial Functions

**dbinom(k, n, p)** — exact probability P(X = k):
```r
dbinom(5, size = 10, prob = 0.4)  # P(X = 5) for Binom(10, 0.4)
```
```output
[1] 0.2006581
```

**pbinom(k, n, p)** — cumulative probability P(X <= k):
```r
pbinom(5, size = 10, prob = 0.4)  # P(X <= 5)
```
```output
[1] 0.8337614
```

**qbinom(q, n, p)** — quantile (inverse CDF). Find the smallest x where P(X <= x) >= q:
```r
qbinom(0.7, size = 10, prob = 0.5)  # returns 6, since P(X <= 5)=0.623 < 0.7 but P(X <= 6)=0.828 >= 0.7
```
```output
[1] 6
```

**rbinom(n_samples, size, prob)** — simulate random samples:
```r
rbinom(5, size = 10, prob = 0.5)  # generate 5 random Binomial(10, 0.5) values
```
```output
[1] 5 6 4 6 5
```

> **Quick reference:** d = density (exact), p = probability (cumulative), q = quantile, r = random sample.

## BINS Mnemonic for Binomial

The **four required conditions** for Binomial(n, p):

- **B**inary — each trial has exactly two outcomes (success/failure, yes/no)
- **I**ndependent — outcome of one trial doesn't affect others
- **N**umber fixed — n is determined before the experiment
- **S**ame p — probability of success is identical for every trial

> **Common violation:** Sampling **without replacement** from a small population violates Independence. **Rule of thumb:** If population size >= 20n, independence is approximately satisfied.

## P(X >= k) vs P(X > k) — Critical Distinction

For a **DISCRETE** distribution these are **NOT the same**:

- **P(X >= k)** = "at least k" = 1 - P(X <= k-1) → use `1 - pbinom(k-1, n, p)`
- **P(X > k)** = "more than k" = 1 - P(X <= k) → use `1 - pbinom(k, n, p)`

Example: X ~ Binomial(10, 0.4)

```r
# P(X >= 4) — at least 4 successes
1 - pbinom(3, 10, 0.4)
```
```output
[1] 0.6177
```

```r
# P(X > 4) — more than 4 (i.e., at least 5)
1 - pbinom(4, 10, 0.4)
```
```output
[1] 0.3669
```

> **This is the #1 source of exam errors.** "At least 4" means >= 4, so subtract P(X <= 3), NOT P(X <= 4).

## Simulation with rbinom()

You can simulate binomial experiments using `rbinom()`:

```r
# Simulate 100,000 experiments: n=10 trials, p=0.4
sims <- rbinom(100000, size = 10, prob = 0.4)
mean(sims)   # should be close to n*p = 4
```
```output
[1] 3.998
```

The simulated mean (approximately 4) confirms our formula E[X] = n*p = 10 * 0.4 = 4. Simulation is a powerful way to verify theoretical results.


# Normal Distribution

## The Most Important Distribution in Statistics

The normal distribution is a continuous, bell-shaped distribution that appears everywhere in nature and statistics. Heights, test scores, measurement errors — many real-world phenomena follow an approximately normal distribution.

The distribution is completely described by two parameters:
- **mu (mean):** determines where the bell is centered
- **sigma (standard deviation):** determines how wide the bell is

Notation: X ~ N(mu, sigma^2) — note sigma^2 is the variance, so sigma is the standard deviation.

> **Normal vs other distributions:** Unlike the binomial (which counts discrete successes), the normal distribution is continuous — it applies to measurements that can take any value on a number line.

## The 68-95-99.7 Rule (Empirical Rule)

This rule lets you do quick probability calculations in your head:

```r
# For any normal distribution N(mu, sigma^2):
# P(mu - sigma  < X < mu + sigma)  approximately 0.68  (68%)
# P(mu - 2*sigma < X < mu + 2*sigma) approximately 0.95  (95%)
# P(mu - 3*sigma < X < mu + 3*sigma) approximately 0.997 (99.7%)
```

> **Example:** SAT scores ~ N(1060, 195^2). About 68% of students score between 865 and 1255. About 95% score between 670 and 1450. A score above 1645 (3 SDs above mean) is in the top 0.15%.
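
The three percentages are rounded values of exact normal areas. A short check with `pnorm()` (covered later in this module), reusing the SAT numbers from the example:

```r
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)   # P(-k < Z < k) for the standard normal
round(coverage, 4)                 # 0.6827 0.9545 0.9973

# SAT example: N(1060, 195^2)
sat_within_1sd <- pnorm(1255, mean = 1060, sd = 195) -
                  pnorm(865,  mean = 1060, sd = 195)
sat_within_1sd                                           # about 0.68
pnorm(1645, mean = 1060, sd = 195, lower.tail = FALSE)   # about 0.00135
```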

## Z-Scores: Standardization

A **z-score** measures how many standard deviations a value is from the mean:

z = (x - mu) / sigma

```r
# Score of 85, mean = 75, sd = 10
z <- (85 - 75) / 10
z
```
```output
[1] 1
```

A z-score of +1 means the value is 1 standard deviation above the mean. Standardizing converts any normal distribution to the **standard normal** Z ~ N(0, 1).

> **Why z-scores matter:** They let you compare values from different scales. A z-score of 2.0 on a math test and a z-score of 2.0 on a physics test are equally impressive, even if the raw scores were very different.

## R Functions for Normal Probabilities

```r
# P(X <= 85) for X ~ N(75, 100)
pnorm(85, mean = 75, sd = 10)

# P(X > 80)
pnorm(80, mean = 75, sd = 10, lower.tail = FALSE)
# or: 1 - pnorm(80, mean = 75, sd = 10)

# P(65 <= X <= 85)
pnorm(85, mean = 75, sd = 10) - pnorm(65, mean = 75, sd = 10)

# Find the 90th percentile
qnorm(0.90, mean = 75, sd = 10)
```
```output
[1] 0.8413447
[1] 0.3085375
[1] 0.6826895
[1] 87.81552
```

> **pnorm vs qnorm:** `pnorm` goes from a value to a probability (area to the left). `qnorm` goes from a probability to a value — it's the inverse. Use `qnorm` when questions ask "what score corresponds to the top 10%?"

## The Central Limit Theorem (CLT)

The CLT is arguably the most important theorem in statistics. It says:

> If you take random samples of size n from **any** population with mean mu and standard deviation sigma, the distribution of **sample means** will be approximately normal with mean mu and standard error sigma/sqrt(n), as long as n is large enough (usually n >= 30).

```r
# Even if the population is skewed, sample means are normal
# SE = population_sd / sqrt(n)
SE <- 15 / sqrt(100)  # pop sd = 15, n = 100
SE
```
```output
[1] 1.5
```

> **Why the CLT is so powerful:** It lets us use normal distribution math even when we don't know the shape of the population distribution. This is the foundation for confidence intervals and hypothesis tests in the next two modules.
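
A simulation makes the claim concrete. This sketch assumes a right-skewed exponential population with mean 15 and standard deviation 15 (an arbitrary choice for illustration) and draws many samples of size 100:

```r
set.seed(1)   # arbitrary seed so the run is reproducible

n <- 100
# 5,000 sample means, each from n draws of an exponential population
# (rate = 1/15 gives population mean 15 and sd 15)
xbars <- replicate(5000, mean(rexp(n, rate = 1/15)))

mean(xbars)   # should be close to mu = 15
sd(xbars)     # should be close to sigma / sqrt(n) = 1.5
```

Running `hist(xbars)` afterwards should show a roughly symmetric bell shape, even though the individual exponential draws are strongly skewed.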

## dnorm(), pnorm(), qnorm()

**dnorm(x, mean, sd)** — height of the bell curve at x (density, not probability):
```r
dnorm(0, mean = 0, sd = 1)        # height at z = 0
dnorm(2, mean = 0, sd = 1)        # height at z = 2
```
```output
[1] 0.3989423
[1] 0.05399097
```

> **Important:** dnorm gives height, not area. Probability is always zero for any single point in a continuous distribution.

**pnorm(q, lower.tail = FALSE)** — find P(X > q):
```r
pnorm(2, mean = 0, sd = 1, lower.tail = FALSE)  # P(Z > 2)
```
```output
[1] 0.02275013
```

## Normal Approximation to the Binomial

When n is large enough, Binomial(n, p) is approximately N(np, np(1-p)), a normal distribution with mean np and variance np(1-p).

**Conditions:** np(1-p) >= 10

```r
# Check if approximation is valid
n <- 100
p <- 0.5
np_1_minus_p <- n * p * (1 - p)
np_1_minus_p >= 10
```
```output
[1] TRUE
```

- With n = 100, p = 0.5: np(1-p) = 25 (valid).
- With n = 100, p = 0.01: np(1-p) = 0.99 (invalid; the binomial is too skewed).
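
To see how much the condition matters, the sketch below compares exact binomial probabilities with the normal approximation for both cases. `compare_approx()` is a hypothetical helper, and the `+ 0.5` continuity correction is an addition not discussed above:

```r
# Compare exact and approximate values of P(X <= k)
compare_approx <- function(k, n, p) {
  exact  <- pbinom(k, size = n, prob = p)
  # continuity correction: use k + 0.5 when approximating a discrete count
  approx <- pnorm(k + 0.5, mean = n * p, sd = sqrt(n * p * (1 - p)))
  c(exact = exact, approx = approx)
}

good <- compare_approx(k = 55, n = 100, p = 0.5)    # np(1-p) = 25
bad  <- compare_approx(k = 0,  n = 100, p = 0.01)   # np(1-p) = 0.99
good   # the two values differ by less than 0.01
bad    # the two values differ noticeably
```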

## Standardization for Comparison

Z-scores let you compare values across different distributions:

```r
# Alice scores 85 on a test with mean 75, sd 10
# Bob scores 38 on a test with mean 30, sd 8
alice_z <- (85 - 75) / 10
bob_z <- (38 - 30) / 8
alice_z
bob_z
```
```output
[1] 1
[1] 1
```

Both have z = 1, so they performed equally well relative to their peers.

## Four Normal Probability Cases

**Every normal probability question is one of four forms:**

1. P(X < a): `pnorm(a, mean, sd)` — left tail
2. P(X > a): `pnorm(a, mean, sd, lower.tail=FALSE)` — right tail
3. P(a < X < b): `pnorm(b, mean, sd) - pnorm(a, mean, sd)` — between
4. Find value: `qnorm(p, mean, sd)` — given probability, find x

Example: X ~ N(72, 8^2) — heights in inches

```r
pnorm(80, 72, 8)                        # P(X < 80)
pnorm(60, 72, 8, lower.tail = FALSE)    # P(X > 60)
pnorm(80, 72, 8) - pnorm(64, 72, 8)     # P(64 < X < 80)
qnorm(0.95, 72, 8)                      # 95th percentile
```
```output
[1] 0.8413
[1] 0.9332
[1] 0.6827
[1] 85.16
```

Note: P(64 < X < 80) = 0.6827 because 64 = mu-sigma and 80 = mu+sigma, so this is the ±1 sigma range (68-95-99.7 rule).

## CLT and Sampling Distribution of x-bar

When you take repeated samples of size n from a population with mean mu and sd sigma:

**x-bar ~ N(mu, sigma^2/n)**, so the standard deviation of x-bar, called the **standard error**, is SE = sigma/sqrt(n).

The sample mean is approximately normally distributed regardless of the shape of the population distribution, provided n is large enough.

```r
# Heights ~ N(68, 3^2), taking samples of n=36
SE <- 3 / sqrt(36)
SE
pnorm(69, mean = 68, sd = SE, lower.tail = FALSE)
```
```output
[1] 0.5
[1] 0.02275
```

> **Key insight:** Doubling n shrinks SE by a factor of sqrt(2), approximately 1.41. Quadrupling n cuts SE in half. **Larger samples give more precise estimates of mu.**


This checkpoint covers Midterm 2 material: probability foundations, binomial distribution, and normal distribution. Questions are formatted to match exam style — mix of select-one, select-all, and fill-in-the-blank. All R function questions require exact syntax.

Midterm 2 Checkpoint

# Confidence Intervals & Hypothesis Testing

## The Problem of Inference

We almost never have access to the entire population -- we work with a **sample**. A sample mean x̄ gives us our best guess at the true population mean μ, but how precise is that guess?

A **confidence interval** answers this by giving a range of plausible values rather than a single point:

$$\bar{x} \pm t^{*} \times \frac{s}{\sqrt{n}}$$

- **x̄** -- sample mean (our point estimate)
- **t\*** -- critical value from the t-distribution (depends on confidence level and sample size)
- **s/sqrt(n)** -- standard error (measures how much x̄ varies across samples)

> **Interpreting a 95% CI:** If we repeated our study many times and built a 95% CI each time, about 95% of those intervals would contain the true μ. The specific interval we built either does or doesn't contain μ, and we can't know which; "95% confident" describes the long-run success rate of the method, not a 95% chance for this one interval.
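
The "repeated study" interpretation can be simulated. This sketch assumes a normal population with known mean 68 and standard deviation 3 (values picked only for the simulation), builds 2,000 intervals, and counts how many contain the true mean:

```r
set.seed(42)   # arbitrary seed for reproducibility

mu <- 68; sigma <- 3; n <- 25          # assumed "true" population values
covers <- replicate(2000, {
  x  <- rnorm(n, mean = mu, sd = sigma)
  ci <- t.test(x)$conf.int             # 95% CI by default
  ci[1] <= mu && mu <= ci[2]           # TRUE if this interval contains mu
})

mean(covers)   # proportion of intervals containing mu; should be near 0.95
```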

## Why the t-Distribution (Not Normal)?

When sigma is unknown -- which is almost always the case in practice -- we estimate it with the sample standard deviation s. This introduces additional uncertainty that the normal distribution doesn't account for. The t-distribution has heavier tails to compensate.

The t-distribution has a parameter called **degrees of freedom** (df = n - 1). As n gets larger, the t-distribution approaches the normal distribution.

### How Degrees of Freedom Change the Shape

The degrees of freedom determine how heavy the tails are:

- **df = 1:** Very heavy tails (much heavier than normal) -- critical values are far from 0
- **df = 10:** Getting closer to normal -- tails still noticeably heavier
- **df = infinity:** Exactly equals the standard normal distribution

**Practical example:**
```r
qt(0.975, df = 5)      # 2.571  (wider interval needed)
qt(0.975, df = 30)     # 2.042   (closer to normal)
qnorm(0.975)           # 1.960   (standard normal for reference)
```
```output
[1] 2.570582
[1] 2.042272
[1] 1.959964
```

For **df = 5** (small sample, n = 6), the critical value is further from 0 than for **df = 30** (n = 31). This wider interval reflects our uncertainty when working with small samples.

> **Rule of thumb:** For large n (> 30), the difference between t and z is negligible. For small n, the heavier tails matter -- they make intervals wider, reflecting genuine uncertainty.

## Confidence Intervals in R

```r
# Height data for 25 students
heights <- c(65, 67, 70, 68, 72, 64, 69, 71, 66, 68,
             70, 73, 65, 67, 69, 68, 72, 71, 66, 64,
             68, 70, 67, 69, 71)

t.test(heights)
```
```output
    One Sample t-test

data:  heights
t = 134.14, df = 24, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 67.35  69.45
sample estimates:
mean of x
    68.4
```

> **Reading t.test output:** The 95% CI is (67.35, 69.45). Our best estimate is 68.4 inches, and we're 95% confident the true mean height is between 67.35 and 69.45 inches.
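
To connect the output back to the formula, here is a sketch that rebuilds the same interval by hand from the sample mean, the sample standard deviation, and `qt()`:

```r
heights <- c(65, 67, 70, 68, 72, 64, 69, 71, 66, 68,
             70, 73, 65, 67, 69, 68, 72, 71, 66, 64,
             68, 70, 67, 69, 71)

n     <- length(heights)
xbar  <- mean(heights)            # 68.4
se    <- sd(heights) / sqrt(n)    # standard error
tstar <- qt(0.975, df = n - 1)    # critical value for 95% confidence

ci <- xbar + c(-1, 1) * tstar * se
round(ci, 2)                      # 67.35 69.45, matching t.test()
```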

## Hypothesis Testing

A hypothesis test formally evaluates a specific claim about a population parameter.

**Step 1: State hypotheses**
- H0 (null): μ = μ₀ -- the "no effect" / status quo claim
- Ha (alternative): μ != μ₀, or μ > μ₀, or μ < μ₀

**Step 2: Compute the test statistic**

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

**Step 3: Find the p-value** -- the probability of seeing a result at least as extreme as ours, assuming H0 is true.

**Step 4: Decision** -- if p-value < alpha (usually 0.05), reject H0.

```r
# Test if mean height equals 70 inches
t.test(heights, mu = 70)
```
```output
    One Sample t-test

t = -3.14, df = 24, p-value = 0.0044
alternative hypothesis: true mean is not equal to 70
95 percent confidence interval:
 67.35  69.45
```

Since p-value (0.004) < alpha (0.05), we reject H₀: μ = 70. The data provide significant evidence that the true mean height is not 70 inches.
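
Steps 2 and 3 can also be carried out by hand. The sketch below recomputes the test statistic and two-sided p-value for H₀: μ = 70 from the `heights` data; it should match what `t.test()` reports.

```r
heights <- c(65, 67, 70, 68, 72, 64, 69, 71, 66, 68,
             70, 73, 65, 67, 69, 68, 72, 71, 66, 64,
             68, 70, 67, 69, 71)

mu0 <- 70
n   <- length(heights)

# Step 2: test statistic
t_obs <- (mean(heights) - mu0) / (sd(heights) / sqrt(n))

# Step 3: two-sided p-value from the t-distribution with n - 1 df
p_two_sided <- 2 * pt(-abs(t_obs), df = n - 1)

c(t = t_obs, p_value = p_two_sided)
```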

## One-Sided vs Two-Sided Tests

A **two-sided test** checks if a parameter differs in either direction from the null value:
- H₀: μ = μ₀
- Hₐ: μ != μ₀
- Default in most software

A **one-sided test** checks if a parameter is specifically larger or smaller:
- **Right-tailed:** Hₐ: μ > μ₀ (looking for evidence the mean is greater)
- **Left-tailed:** Hₐ: μ < μ₀ (looking for evidence the mean is less)

### When to Use One-Sided vs Two-Sided

**Use two-sided when:**
- You have no prior directional hypothesis
- A difference in either direction matters equally
- This is the safer, more conservative default

**Use one-sided when:**
- Theory or context suggests a specific direction
- You only care about one direction
- Example: Does a new drug improve (not worsen) a condition?

### One-Sided Tests in R

```r
# Two-sided test (default)
t.test(heights, mu = 70)
# Ha: mu != 70

# One-sided test: left tail (testing if mu < 70)
t.test(heights, mu = 70, alternative = "less")

# One-sided test: right tail (testing if mu > 70)
t.test(heights, mu = 70, alternative = "greater")
```
```output

	One Sample t-test

data:  heights
t = -3.14, df = 24, p-value = 0.0044
alternative hypothesis: true mean is not equal to 70
95 percent confidence interval:
 67.35 69.45
sample estimates:
mean of x 
     68.4 


	One Sample t-test

data:  heights
t = -3.14, df = 24, p-value = 0.0022
alternative hypothesis: true mean is less than 70
95 percent confidence interval:
  -Inf 69.27
sample estimates:
mean of x 
     68.4 


	One Sample t-test

data:  heights
t = -3.14, df = 24, p-value = 0.9978
alternative hypothesis: true mean is greater than 70
95 percent confidence interval:
 67.53   Inf
sample estimates:
mean of x 
     68.4 
```

**Key point:** A one-sided p-value is **half** the corresponding two-sided p-value (when the test statistic points in the expected direction).

## Type I and Type II Errors

No test is perfect -- we can make two kinds of mistakes:

| | H0 is True | H0 is False |
|--|--|--|
| **Reject H0** | Type I Error (rate = alpha) | Correct! (Power) |
| **Don't Reject** | Correct! | Type II Error (rate = beta) |

> **The tradeoff:** Making alpha smaller (e.g., 0.01 instead of 0.05) reduces Type I errors but increases Type II errors. In high-stakes settings (like medical testing), you choose alpha carefully based on the cost of each error type.

**Power** = 1 - beta = the probability of correctly detecting a real effect. Power increases with larger samples, bigger true effects, and higher alpha.
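
The three drivers of power can be explored with base R's `power.t.test()`. This is only a sketch: the true shift (`delta = 1`) and standard deviation (`sd = 2.5`) are hypothetical values, and `power_at()` is a helper defined here for illustration.

```r
# Power of a one-sample, two-sided t-test under assumed (hypothetical) values
power_at <- function(n, delta = 1, sd = 2.5, alpha = 0.05) {
  power.t.test(n = n, delta = delta, sd = sd, sig.level = alpha,
               type = "one.sample", alternative = "two.sided")$power
}

power_at(n = 25)                 # baseline
power_at(n = 100)                # larger sample
power_at(n = 25, delta = 2)      # bigger true effect
power_at(n = 25, alpha = 0.10)   # higher alpha
```

Each of the last three calls should return a larger power than the baseline.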

## Constructing Confidence Intervals with qnorm()

To build a CI, find the critical z-values using `qnorm()`:

```r
# For 95% CI: alpha = 0.05, so alpha/2 = 0.025
qnorm(0.025)       # lower tail
qnorm(0.975)       # upper tail
```
```output
[1] -1.959964
[1] 1.959964
```

For 90% CI: alpha/2 = 0.05
```r
qnorm(0.05)
qnorm(0.95)
```
```output
[1] -1.644854
[1] 1.644854
```

> **Pattern:** For a 100(1 - alpha)% CI, use `qnorm(1 - alpha/2)` for the upper critical value.

## The Lady Tasting Tea: A Hypothesis Test Example

**Scenario:** Lady Bristol claims she can taste whether tea or milk was added first. She tastes 8 cups and guesses correctly on 6. Can we conclude she has ability, or is she just guessing?

- H₀: p = 0.5 (guessing randomly)
- Hₐ: p > 0.5 (has ability)
- X ~ Binomial(8, 0.5), and she got X = 6

**P-value:** P(X >= 6) assuming H0 is true:

```r
1 - pbinom(5, size = 8, prob = 0.5)
```
```output
[1] 0.1445313
```

> **Interpretation:** Even by random chance, she has a 14.5% probability of guessing >= 6 correct. This is not unusual, so we do NOT reject H0. The data don't provide strong evidence she has tasting ability.
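
The same upper-tail probability P(X >= 6) can be obtained from base R's exact binomial test, `binom.test()`. A brief sketch (the object name `tea` is just for this example):

```r
# Exact binomial test: 6 correct out of 8, null probability 0.5,
# upper-tail alternative
tea <- binom.test(x = 6, n = 8, p = 0.5, alternative = "greater")
tea$p.value

# Same quantity computed directly from the binomial CDF
1 - pbinom(5, size = 8, prob = 0.5)
```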

## P-Value Definition

The **p-value** is the probability of observing a result as extreme as (or more extreme than) what we got, assuming the null hypothesis is true.

- Small p-value (< alpha) -> Result is surprising under H0 -> Reject H0
- Large p-value (>= alpha) -> Result is consistent with H0 -> Fail to reject

## When to Use z vs t

| Situation | Test Statistic | R Function |
|---|---|---|
| sigma known | z = (x̄ - μ₀)/(sigma/sqrt(n)) | `pnorm()` |
| sigma unknown, any n | t = (x̄ - μ₀)/(s/sqrt(n)), df=n-1 | `t.test()` |

In STAT 240, sigma is almost never known -- we use **t-tests**.

## t Critical Values from qt()

Compute the t* value for confidence intervals:

```r
qt(0.975, df = 24)    # 95% CI, n=25
qt(0.995, df = 24)    # 99% CI, n=25
qt(0.95,  df = 29)    # 90% CI, n=30
```
```output
[1] 2.0639
[1] 2.7969
[1] 1.6991
```

For a 95% CI with df=n-1, use `qt(0.975, df)` (the upper 2.5% tail).

## CI Width

The width of a confidence interval is:

**width = 2 x t* x (s/sqrt(n))**

To get a **narrower** CI:
- Increase sample size n (reduces s/sqrt(n))
- Lower confidence level (smaller t*)
- Reduce variability s

> **Key exam concept:** A 99% CI is always **WIDER** than a 95% CI for the same data. Higher confidence = wider interval.
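
As a quick check of how confidence level affects width, the sketch below computes the width of the 90%, 95%, and 99% intervals for the `heights` data.

```r
heights <- c(65, 67, 70, 68, 72, 64, 69, 71, 66, 68,
             70, 73, 65, 67, 69, 68, 72, 71, 66, 64,
             68, 70, 67, 69, 71)

conf_levels <- c(0.90, 0.95, 0.99)
widths <- sapply(conf_levels, function(cl) {
  ci <- t.test(heights, conf.level = cl)$conf.int
  diff(ci)                       # upper bound minus lower bound
})
names(widths) <- paste0(100 * conf_levels, "%")
round(widths, 2)
```

The widths increase from left to right, since only t* changes while s and n stay fixed.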

## Practical vs Statistical Significance

A result can be **statistically significant** (small p-value, reject H0) but still lack **practical significance** (the effect is too small to matter in real life).

**Example:** Suppose you test if a new teaching method increases exam scores, and find:
- Old method: mean = 75.0
- New method: mean = 75.2
- The difference is statistically significant (p = 0.04)

But a 0.2-point difference is meaningless in practice. The effect size is negligible even though it's statistically significant.

**Effect size** measures the magnitude of a real difference. Common effect size measures:
- **Cohen's d** = (mean1 - mean2) / pooled_SD
  - d = 0.2: small effect
  - d = 0.5: medium effect
  - d = 0.8: large effect

When reporting results, always include both:
1. p-value (is there a real effect?)
2. Effect size (how big is it?)

> **Important:** A large sample can detect tiny effects, making them statistically significant even when practically irrelevant. A small effect size is a red flag that practical importance may be limited.
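
Cohen's d is straightforward to compute by hand. The sketch below uses made-up exam scores (not real data), and `cohens_d()` is a helper written for this illustration rather than a built-in function.

```r
# Hypothetical exam scores for two small groups (made-up numbers)
old_method <- c(72, 75, 78, 70, 74, 77, 73, 76)
new_method <- c(74, 79, 80, 73, 78, 81, 75, 77)

cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  # Pooled SD: square root of the df-weighted average of the variances
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

d <- cohens_d(new_method, old_method)
d
```

Reporting `d` alongside the p-value tells the reader how large the difference is in standard-deviation units.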

## CI and Hypothesis Test Equivalence

A 95% confidence interval and a two-sided test at alpha=0.05 **always agree**:

- If μ₀ is **inside** the 95% CI -> fail to reject H₀: μ = μ₀ at alpha=0.05
- If μ₀ is **outside** the 95% CI -> reject H₀: μ = μ₀ at alpha=0.05

```r
# 95% CI is (67.35, 69.45)
# H0: mu = 70 -> 70 is outside -> reject H0
# H0: mu = 68 -> 68 is inside -> fail to reject H0
```
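
The agreement can be verified directly. This sketch checks both null values against the 95% CI and the two-sided p-value for the `heights` data.

```r
heights <- c(65, 67, 70, 68, 72, 64, 69, 71, 66, 68,
             70, 73, 65, 67, 69, 68, 72, 71, 66, 64,
             68, 70, 67, 69, 71)

ci <- t.test(heights)$conf.int   # 95% CI for the mean

for (mu0 in c(70, 68)) {
  p <- t.test(heights, mu = mu0)$p.value
  inside <- mu0 >= ci[1] && mu0 <= ci[2]
  cat("mu0 =", mu0,
      "| inside CI:", inside,
      "| p-value:", round(p, 4),
      "| reject at 0.05:", p < 0.05, "\n")
}
```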

## Correct p-value Interpretation

**Correct:** "If H0 were true, there is a p% chance of observing a result as extreme as ours (or more so)."

**WRONG** (common mistakes):
- "The probability that H0 is true is p" -- NO, p is **not** the probability the null is true
- "The result happened by chance with probability p" -- NO, p assumes the null is true, not that chance caused the result


# Inference for Proportions

## From Means to Proportions

The previous module covered inference for a continuous mean. Now we handle **proportions** -- when the response variable is binary (yes/no, success/failure, heads/tails).

- **Population proportion:** p -- the true fraction of the population with the characteristic
- **Sample proportion:** p̂ = (number of successes) / n

The logic is the same as for means: use p̂ to estimate p, quantify uncertainty with a confidence interval, and test claims with a hypothesis test.

> **When to use proportion inference:** When your response variable is categorical and binary. "What fraction of UW students prefer R over Python?" is a proportion question. "What is the average GPA of UW students?" is a mean question.

## Sampling Distribution of p̂

By the Central Limit Theorem, for large n:

$$\hat{p} \approx N\left(p,\ \frac{p(1-p)}{n}\right)$$

$$SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}$$

**Conditions for this to hold:**
- np >= 10 (at least 10 expected successes)
- n(1-p) >= 10 (at least 10 expected failures)

> **Why these conditions?** The normal approximation breaks down when the distribution is too skewed. If p = 0.01 and n = 50, you'd expect 0.5 successes -- the distribution can't look bell-shaped there.
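
A small simulation can confirm the center and spread of the sampling distribution. The values p = 0.3 and n = 100 below are hypothetical, chosen so that both conditions hold, and the seed is arbitrary.

```r
set.seed(240)                     # arbitrary seed, for reproducibility
p <- 0.3                          # hypothetical true proportion
n <- 100                          # np = 30 and n(1-p) = 70, both >= 10

# 10,000 simulated sample proportions
p_hats <- rbinom(10000, size = n, prob = p) / n

c(simulated_mean = mean(p_hats), theoretical_mean = p)
c(simulated_se = sd(p_hats), theoretical_se = sqrt(p * (1 - p) / n))
```

Adding `hist(p_hats)` would show the roughly bell-shaped distribution; rerunning with p = 0.01 and n = 50 shows how the shape breaks down when the conditions fail.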

## Confidence Interval for p

$$\hat{p} \pm z^* \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Note: we use **z*** (not t*) because we're using the normal approximation and SE is fully determined by p̂.

```r
# 43 successes out of 100 trials
prop.test(x = 43, n = 100)
```
```output
    1-sample proportions test with continuity correction

data:  43 out of 100
X-squared = 1.69, df = 1, p-value = 0.1936
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.3327  0.5328
sample estimates:
   p
0.43
```

> **Reading prop.test output:** Our estimate is p̂ = 0.43, and we're 95% confident the true proportion is between 0.333 and 0.533. Since 0.5 is inside the interval, we don't have evidence to reject p = 0.5. The bounds differ slightly from the formula above because `prop.test()` uses a score interval with a continuity correction.

## Hypothesis Test for p

To test a specific claim H₀: p = p₀:

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$$

Notice: we use p₀ (the null value) in the denominator, not p̂. Under H0, we assume p = p₀ is true.
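
The z statistic can be computed by hand for the 43-out-of-100 example. This sketch also compares it with `prop.test()` run with `correct = FALSE`, which turns off the continuity correction so that its X-squared equals z².

```r
x  <- 43
n  <- 100
p0 <- 0.5

p_hat   <- x / n
z       <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # null value in the SE
p_value <- 2 * pnorm(-abs(z))                       # two-sided
c(z = z, p_value = p_value)

# Without the continuity correction, prop.test() reports z^2 as X-squared
prop.test(x, n, p = p0, correct = FALSE)$statistic
```

With the default `correct = TRUE`, `prop.test()` shrinks the statistic slightly, so its p-value is a little larger than the hand-computed one.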

```r
# Test if proportion equals 0.5
prop.test(x = 43, n = 100, p = 0.5)

# One-sided test: Ha: p < 0.5
prop.test(x = 43, n = 100, p = 0.5, alternative = "less")
```
```output

	1-sample proportions test with continuity correction

data:  43 out of 100, null probability 0.5
X-squared = 1.69, df = 1, p-value = 0.1936
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.3327 0.5328
sample estimates:
   p 
0.43 


	1-sample proportions test with continuity correction

data:  43 out of 100, null probability 0.5
X-squared = 1.69, df = 1, p-value = 0.0968
alternative hypothesis: true p is less than 0.5
95 percent confidence interval:
 0.0000 0.5172
sample estimates:
   p 
0.43 
```

## Sample Size Planning

Before collecting data, how large a sample do you need for your desired margin of error m?

Using p = 0.5 (the most conservative choice -- maximizes variance):

$$n = \left(\frac{z^*}{2m}\right)^2$$

```r
# 95% CI with margin of error <= 0.03
z_star <- qnorm(0.975)   # 1.96
m      <- 0.03
n_needed <- (z_star / (2 * m))^2
ceiling(n_needed)  # round up
```
```output
[1] 1068
```

> **Why p = 0.5 is conservative:** p(1-p) is maximized at p = 0.5, giving p(1-p) = 0.25. Using this gives you the largest sample size, ensuring your interval will be narrow enough no matter what p turns out to be.
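
The calculation generalizes to other confidence levels and planning values of p. The helper below, `sample_size_prop()`, is a hypothetical function written for this sketch; it uses n = z*² p(1-p) / m², which reduces to the formula above when p = 0.5.

```r
# Required n for a proportion CI with margin of error m
sample_size_prop <- function(m, conf = 0.95, p = 0.5) {
  z_star <- qnorm(1 - (1 - conf) / 2)
  ceiling(z_star^2 * p * (1 - p) / m^2)   # always round up
}

sample_size_prop(m = 0.03)              # conservative choice, p = 0.5
sample_size_prop(m = 0.03, p = 0.2)     # fewer needed if p is believed near 0.2
sample_size_prop(m = 0.03, conf = 0.99) # more needed for higher confidence
```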

## Testing Proportions with prop.test()

**Two-sided test:**
```r
prop.test(x = 60, n = 90, p = 0.5)
```
```output

	1-sample proportions test with continuity correction

data:  60 out of 90, null probability 0.5
X-squared = 9.3444, df = 1, p-value = 0.002237
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5585 0.7604
sample estimates:
        p 
0.6666667 
```

**One-sided test:**
```r
prop.test(x = 60, n = 90, p = 0.5, alternative = "greater")
```
```output

	1-sample proportions test with continuity correction

data:  60 out of 90, null probability 0.5
X-squared = 9.3444, df = 1, p-value = 0.001118
alternative hypothesis: true p is greater than 0.5
95 percent confidence interval:
 0.5754 1.0000
sample estimates:
        p 
0.6666667 
```

Output gives the 95% CI and p-value.

## Simulation: Building a Sampling Distribution

To test a proportion using simulation:

```r
# Assuming H0: p = 0.6, generate the sampling distribution of p-hat
n_sims <- 10000
n <- 90
p_null <- 0.6

# Simulate n_sims samples of size n, each with success probability p_null
x_star <- rbinom(n_sims, size = n, prob = p_null)
p_hat <- x_star / n

# Our observed p-hat = 60/90 ≈ 0.667
# Two-sided p-value = proportion of simulated p-hats at least as far from
# p_null as the observed one (the small tolerance guards against rounding)
obs_p_hat <- 60 / 90
p_value <- mean(abs(p_hat - p_null) >= abs(obs_p_hat - p_null) - 1e-12)
p_value
```

This gives you the empirical p-value from the simulation.

## Conditions for Normal Approximation

For the normal approximation to p̂ to be valid:

np >= 10  AND  n(1-p) >= 10

If either condition fails, use exact binomial methods or simulation instead.

## Chimpanzee Example -- Full Analysis

**Setup:** Chimp A made prosocial choices (helping a partner) 60 out of 90 trials.

**Hypotheses:**
- H₀: p = 0.5 (choosing randomly)
- Hₐ: p > 0.5 (genuinely prosocial)

**Check conditions:** np₀ = 90 x 0.5 = 45 ✓, n(1-p₀) = 45 ✓ (both >= 10)

```r
prop.test(x = 60, n = 90, p = 0.5, alternative = "greater")
```
```output

	1-sample proportions test with continuity correction

data:  60 out of 90, null probability 0.5
X-squared = 9.3444, df = 1, p-value = 0.001118
alternative hypothesis: true p is greater than 0.5
95 percent confidence interval:
 0.5754 1.0000
sample estimates:
        p 
0.6666667 
```

**Interpretation:** p̂ = 0.667. p-value ≈ 0.001 < 0.05 -> **reject H0**. Strong evidence that Chimp A makes prosocial choices at above-chance rates.

## CI for a Proportion -- Step by Step

For x successes out of n trials:

1. p̂ = x/n
2. SE = sqrt(p̂(1-p̂)/n)
3. z* = 1.96 (for 95% confidence)
4. CI: p̂ +/- z* x SE

**Example:** x=60, n=90

```r
# Manual calculation
p_hat <- 60/90                           # 0.667
SE <- sqrt(p_hat * (1 - p_hat) / 90)     # 0.0497
z_star <- 1.96
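lower <- p_hat - z_star * SE             # 0.569
upper <- p_hat + z_star * SE             # 0.764

# Or use R (prop.test() uses a score interval with continuity correction,
# so its bounds differ slightly from the manual ones):
prop.test(x = 60, n = 90)
```

95% CI: (0.569, 0.764)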

## Two-Sample Proportion Inference

When comparing proportions between two groups, we conduct inference on the **difference** p1 - p2.

### Confidence Interval for Difference in Proportions

**Point estimate:**
p̂₁ - p̂₂

**Standard Error (using individual proportions, NOT pooled):**

$$SE = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

**Confidence interval:**

$$(\hat{p}_1 - \hat{p}_2) \pm z^* \times SE$$

**Example:** Chimp A with vs without partner
- With partner: 60 successes out of 90 trials -> p̂₁ = 0.667
- Without partner: 16 successes out of 30 trials -> p̂₂ = 0.533

```r
# Manual calculation
p_hat1 <- 60/90
p_hat2 <- 16/30
difference <- p_hat1 - p_hat2               # 0.133

SE <- sqrt((p_hat1*(1-p_hat1)/90) + (p_hat2*(1-p_hat2)/30))
z_star <- 1.96

lower <- difference - z_star * SE
upper <- difference + z_star * SE
cat("95% CI for (p1 - p2):", lower, "to", upper)
```

95% CI: approximately (-0.070, 0.337)

**Interpretation:** The CI includes 0, so we have **no strong evidence** that the proportions differ significantly between the two conditions.

### Hypothesis Test for Two Proportions

To test H₀: p1 = p2, we use a **pooled proportion** in the standard error:

**Pooled proportion:**
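$$\hat{p}_{\text{pool}} = \frac{x_1 + x_2}{n_1 + n_2}$$

**Standard Error (pooled for hypothesis test):**

$$SE_{\text{pool}} = \sqrt{\hat{p}_{\text{pool}}(1-\hat{p}_{\text{pool}}) \times \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

**Test statistic:**

$$z = \frac{\hat{p}_1 - \hat{p}_2}{SE_{\text{pool}}}$$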

**Key difference:** For the **CI** we use individual proportions in SE. For the **hypothesis test** we use the pooled proportion.

**Example:** Comparing chimp's prosocial choices with vs without partner

```r
# Two-proportion hypothesis test
prop.test(x = c(60, 16), n = c(90, 30))
```
```output
    2-sample test for equality of proportions with continuity correction

data:  c(60, 16) out of c(90, 30)
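X-squared = 1.1962, df = 1, p-value = 0.2741
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.0922  0.3589
sample estimates:
prop 1 prop 2
 0.6667 0.5333
```

**Manual calculation:**
```r
p_hat1 <- 60/90
p_hat2 <- 16/30
p_pool <- (60 + 16) / (90 + 30)  # 0.633

SE_pool <- sqrt(p_pool * (1 - p_pool) * (1/90 + 1/30))  # 0.102
z <- (p_hat1 - p_hat2) / SE_pool                        # 1.31
p_value <- 2 * (1 - pnorm(abs(z)))  # two-sided, about 0.19
```

The manual z-test applies no continuity correction, so its p-value (about 0.19) is smaller than the one from `prop.test()` (0.2741), and the `prop.test()` interval is correspondingly wider than the manual one. Running `prop.test()` with `correct = FALSE` reproduces the manual p-value.

**Interpretation:** Both p-values are well above 0.05, so we **fail to reject H0**. There is no significant evidence that Chimp A's prosocial choice rates differ between the with-partner and without-partner conditions.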

### Key Point: Why Different SE for CI vs Test?

For **confidence intervals:** We use p̂₁ and p̂₂ (observed proportions) to reflect the uncertainty in each sample.

For **hypothesis tests:** We assume H0 is true (p1 = p2), so we estimate the single common proportion with p-pool and use it in the SE.

## Two-Proportion Inference Checklist

1. **State hypotheses:** H₀: p1 = p2 vs Hₐ: p1 != p2 (or > or <)
2. **Check conditions:** n₁ x p-pool >= 10, n₁ x (1 - p-pool) >= 10, and similarly for group 2 (for a CI, check with p̂₁ and p̂₂ instead)
3. **For CI:** Use individual proportions in SE
4. **For hypothesis test:** Use pooled proportion in SE
5. **Interpretation:** If CI for (p1-p2) contains 0, no significant difference

## Exam Checklist for Proportion Tests

1. **State hypotheses:** H₀: p = p₀ vs Hₐ: p != p₀ (or > or <)
2. **Check conditions:** np₀ >= 10 **AND** n(1-p₀) >= 10
3. **Compute test statistic:** z = (p̂ - p₀) / sqrt(p₀(1-p₀)/n)
4. **Get p-value:** Use `prop.test()` or `pnorm()`
5. **Conclude:** Compare p-value to alpha, state conclusion in context


# Single Mean Inference

## Introduction

When we have a quantitative variable and want to make inferences about the population mean, we use the methods in this module. Whether we're estimating a confidence interval or testing a hypothesis about a population mean, the process relies on understanding the sampling distribution of the sample mean and the t-distribution.

## 1. The Sampling Distribution of the Sample Mean

### Why the Sample Mean Follows a Normal Distribution

When we repeatedly sample from a population and calculate the sample mean (x̄) for each sample, those sample means follow a distribution. This sampling distribution has special properties:

If we take all possible samples of size n from a population:
- The center of the sampling distribution is the true population mean (μ)
- The spread of the sampling distribution is measured by the standard error
- The sampling distribution is approximately normal

The third point follows from the Central Limit Theorem (CLT); the first two hold for any sample size. The CLT states:

> **Key concept:** If we take samples of size n from any population (with finite mean and standard deviation), the sampling distribution of x̄ is approximately normal when n is large enough (typically n >= 30). If the population itself is normal, then the sampling distribution of x̄ is normal regardless of sample size.

### Standard Error: The Standard Deviation of x̄

The standard error (SE) measures how much sample means vary from sample to sample. It depends on two things:

1. The population standard deviation (σ): Larger σ means more variability in the data
2. The sample size (n): Larger samples give less variable sample means

The relationship is: SE = σ / √n

Notice that SE decreases as n increases. This is why larger samples give more precise estimates of the population mean.

In practice, we never know σ (the true population standard deviation), so we estimate it using the sample standard deviation s:

Estimated SE = s / √n
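
The relationship SE = σ / √n can be checked by simulation. In this sketch the population (normal, mean 50, σ = 10) and the seed are hypothetical choices made only for illustration.

```r
set.seed(1)                       # arbitrary seed, for reproducibility
sigma <- 10                       # hypothetical population SD
sizes <- c(25, 100)

# For each sample size, draw 5000 samples and record the SD of their means
se_sim <- sapply(sizes, function(n) {
  sd(replicate(5000, mean(rnorm(n, mean = 50, sd = sigma))))
})

rbind(n = sizes,
      simulated = round(se_sim, 3),
      theoretical = sigma / sqrt(sizes))
```

Quadrupling n from 25 to 100 roughly halves the standard error, as the √n in the denominator predicts.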

## 2. From Z-Distribution to T-Distribution: Why T When Sigma Is Unknown

### The Problem: Estimation Introduces Uncertainty

If we knew the true population standard deviation σ, we could use the z-distribution (standard normal). The test statistic would be:

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

However, in real life, we never know σ. We must estimate it using the sample standard deviation s. This introduces extra uncertainty.

### The Solution: The T-Distribution

When we substitute s for σ, we no longer follow the standard normal (z) distribution. Instead, we follow the t-distribution:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

The t-distribution has the following properties:
- It is symmetric and bell-shaped, like the normal distribution
- It has heavier tails than the normal distribution (more area in the tails)
- The extra tail weight reflects the extra uncertainty from estimating σ
- As degrees of freedom increase, the t-distribution approaches the normal distribution

### Degrees of Freedom

The t-distribution is not a single distribution. Instead, it is a family of distributions, each determined by the degrees of freedom (df).

> **Key concept:** For inference about a single population mean, df = n - 1. We lose one degree of freedom because computing s requires the sample mean x̄ in place of the unknown μ.

As df increases, the t-distribution has lighter tails and looks more like the standard normal. When df > 30, the t-distribution is very close to the normal distribution.

### When to Use T vs Z

- **Use z-distribution:** Only when σ is known (rare in practice)
- **Use t-distribution:** When σ is unknown (almost always in practice)

## 3. Confidence Intervals for a Single Mean

### The Formula

A confidence interval for the population mean μ is:

$$\bar{x} \pm t^* \times SE$$

Where:
- x̄ is the sample mean
- t* is the critical value from the t-distribution with df = n - 1
- SE = s / √n is the standard error
- The margin of error is t* × SE

For a 95% confidence interval, t* is the value such that 95% of the t-distribution lies between -t* and t*. This means 2.5% is in each tail. We use qt(0.975, df = n - 1) to find t*.

### Step-by-Step Example: Cat Sleep Time

Suppose we have data on adult male fixed Ragdoll cats:
- Sample size: n = 135
- Sample mean sleep time: x̄ = 16.02 hours
- Sample standard deviation: s = 2.87 hours

We want a 95% confidence interval for the mean sleep time.

Step 1: Calculate the standard error.
SE = s / √n = 2.87 / √135 = 2.87 / 11.62 = 0.247 hours

Step 2: Find the critical value t*.
With df = 135 - 1 = 134, we look up qt(0.975, df = 134). This gives t* = 1.978.

Step 3: Calculate the margin of error.
Margin of error = t* × SE = 1.978 * 0.247 = 0.489 hours

Step 4: Calculate the confidence interval.
CI = 16.02 ± 0.489 = [15.531, 16.509] hours

### Interpretation of a 95% Confidence Interval

> **Key concept:** If we repeated our sampling procedure infinitely many times and calculated a 95% CI each time, approximately 95% of those intervals would contain the true population mean.

This does NOT mean:
- The probability that μ is in this specific interval is 0.95 (once we calculate an interval, either μ is in it or it isn't)
- There is a 95% probability that the population mean is in the interval

It DOES mean:
- We used a method that captures the true mean 95% of the time in the long run
- We have confidence in our procedure, not in a particular interval
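
The long-run interpretation can be illustrated with a simulation in which the true mean is known. Everything about the population below (normal, μ = 16, σ = 3, n = 30) is a hypothetical setup chosen for the sketch.

```r
set.seed(42)                      # arbitrary seed, for reproducibility
mu    <- 16                       # hypothetical true mean
sigma <- 3                        # hypothetical true SD
n     <- 30

# Draw 2000 samples; for each, record whether the 95% CI contains mu
covered <- replicate(2000, {
  ci <- t.test(rnorm(n, mean = mu, sd = sigma))$conf.int
  ci[1] <= mu && mu <= ci[2]
})

mean(covered)   # proportion of intervals that captured the true mean
```

The proportion should land close to 0.95, even though any single interval either contains μ or does not.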

### R Code for Confidence Intervals

Calculating by hand:

```r
cats <- read_csv("cat_breeds_clean.csv")
cats_small <- cats %>%
  filter(Age_in_years >= 1, Sex == "male", Breed == "Ragdoll", Fixed == TRUE)

# Calculate summary statistics
cats_summary <- cats_small %>%
  summarize(xbar = mean(Sleep_time_hours),
            s = sd(Sleep_time_hours),
            n = n())

xbar <- cats_summary$xbar
s <- cats_summary$s
n <- cats_summary$n
se <- s / sqrt(n)
t_star <- qt(0.975, df = n - 1)
margin_error <- t_star * se

c(xbar - margin_error, xbar + margin_error)
```
```output
[1] 15.531 16.509
```

Using the t.test() function (much simpler):

```r
t.test(cats_small$Sleep_time_hours, conf.level = 0.95)
```
```output
	One Sample t-test

data:  cats_small$Sleep_time_hours
t = 64.84, df = 134, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 15.531 16.509
sample estimates:
mean of x 
    16.02 
```

### Confidence Level and Critical Values

Different confidence levels use different critical values:

- 90% CI: t* = qt(0.95, df = n - 1)
- 95% CI: t* = qt(0.975, df = n - 1)
- 99% CI: t* = qt(0.995, df = n - 1)

Higher confidence levels result in wider intervals.

## 4. Hypothesis Tests for a Single Mean

### Setting Up the Hypothesis Test

A hypothesis test for a population mean has the form:
- H₀: μ = μ₀ (null hypothesis - the claim we are testing)
- Hₐ: The alternative hypothesis, which can be:
  - Hₐ: μ != μ₀ (two-sided test)
  - Hₐ: μ > μ₀ (right-tailed test)
  - Hₐ: μ < μ₀ (left-tailed test)

We also set a significance level, usually α = 0.05.

### The Test Statistic

The test statistic for testing a single mean is:

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

with df = n - 1.

The test statistic measures how many standard errors the sample mean is from the null hypothesis value.

### P-Value Calculation

The p-value is the probability of observing a test statistic as extreme as (or more extreme than) the one we calculated, assuming the null hypothesis is true.

For a **two-sided test** (Hₐ: μ != μ₀):
p-value = 2 * P(t < -|t_obs|)

Using R: `2 * pt(-abs(t_obs), df = n - 1)`

For a **right-tailed test** (Hₐ: μ > μ₀):
p-value = P(t > t_obs)

Using R: `1 - pt(t_obs, df = n - 1)`

For a **left-tailed test** (Hₐ: μ < μ₀):
p-value = P(t < t_obs)

Using R: `pt(t_obs, df = n - 1)`
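
The three cases can be wrapped in one small helper. `t_p_value()` is a hypothetical function written for this sketch, not part of base R; it simply dispatches to the `pt()` expressions given above.

```r
# p-value for a t statistic under each type of alternative
t_p_value <- function(t_obs, df,
                      alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  switch(alternative,
         two.sided = 2 * pt(-abs(t_obs), df),
         greater   = pt(t_obs, df, lower.tail = FALSE),  # same as 1 - pt()
         less      = pt(t_obs, df))
}

t_p_value(-3.96, df = 134)                          # two-sided
t_p_value(2.11,  df = 134, alternative = "greater") # right-tailed
t_p_value(2.11,  df = 134, alternative = "less")    # left-tailed
```

For any statistic, the "greater" and "less" p-values add to 1, which is a handy sanity check.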

### Decision Rule

- If p-value < α, reject H0 (conclude that Ha is supported by the data)
- If p-value >= α, fail to reject H0 (we do not have sufficient evidence to reject H0)

### Example: Two-Sided Test

Question: Is the mean sleep time of adult male fixed Ragdoll cats different from 17 hours?

H₀: μ = 17
Hₐ: μ != 17
α = 0.05

From our data: x̄ = 16.02, s = 2.87, n = 135, SE = 0.247

Test statistic:
t = (16.02 - 17) / 0.247 = -0.98 / 0.247 = -3.96

P-value (two-sided):
p-value = 2 * P(t < -3.96) with df = 134
p-value = 2 * pt(-3.96, df = 134) = 2 * 0.000065 = 0.00013

Conclusion:
Since p-value (0.00013) < α (0.05), we reject H0. We have strong evidence that the mean sleep time of adult male fixed Ragdoll cats is different from 17 hours.

R code:

```r
t_stat <- (16.02 - 17) / (2.87 / sqrt(135))
p_value <- 2 * pt(-abs(t_stat), df = 134)
cat("Test statistic:", t_stat, "\n")
cat("P-value:", p_value, "\n")
```
```output
Test statistic: -3.96
P-value: 0.000127
```

Or using the t.test() function:

```r
t.test(cats_small$Sleep_time_hours, mu = 17, conf.level = 0.95)
```
```output
	One Sample t-test

data:  cats_small$Sleep_time_hours
t = -3.96, df = 134, p-value = 0.000127
alternative hypothesis: true mean is not equal to 17
95 percent confidence interval:
 15.531 16.509
sample estimates:
mean of x 
    16.02 
```

In t.test(), the `mu` argument specifies the null hypothesis value.

### Example: One-Sided (Right-Tailed) Test

Question: Is the mean sleep time greater than 15.5 hours?

H₀: μ = 15.5
Hₐ: μ > 15.5
α = 0.05

Test statistic:
t = (16.02 - 15.5) / 0.247 = 0.52 / 0.247 = 2.11

P-value (right-tailed):
p-value = P(t > 2.11) = 1 - pt(2.11, df = 134) = 0.0184

Conclusion:
Since p-value (0.0184) < α (0.05), we reject H0. We have evidence that the mean sleep time is greater than 15.5 hours.

## 5. Rejection Region Approach

An alternative to the p-value approach is the rejection region (or critical value) approach.

In this approach:
1. Calculate the critical value(s) from the t-distribution
2. Reject H0 if the test statistic falls in the rejection region

### Two-Sided Test Rejection Region

For a two-sided test with α = 0.05 and df = 134:

Reject H0 if t < qt(0.025, df = 134) or t > qt(0.975, df = 134)

R code:

```r
lower_crit <- qt(0.025, df = 134)
upper_crit <- qt(0.975, df = 134)
cat("Lower critical value:", lower_crit, "\n")
cat("Upper critical value:", upper_crit, "\n")
```
```output
Lower critical value: -1.978
Upper critical value: 1.978
```

Since our test statistic t = -3.96 is less than -1.978, we reject H0.

### One-Sided Test Rejection Region

For a right-tailed test (Hₐ: μ > μ₀) with α = 0.05:
Reject H0 if t > qt(0.95, df = 134) = 1.656

For a left-tailed test (Hₐ: μ < μ₀) with α = 0.05:
Reject H0 if t < qt(0.05, df = 134) = -1.656

## 6. Connection Between Confidence Intervals and Hypothesis Tests

There is a direct relationship between a (1 - α) * 100% confidence interval and a hypothesis test with significance level α.

> **Key concept:** For a two-sided hypothesis test with significance level α, we reject H₀: μ = μ₀ if and only if μ₀ lies outside the (1 - α) * 100% confidence interval.

### Example

We calculated a 95% CI for mean sleep time: [15.531, 16.509]

For the test H₀: μ = 17 vs Hₐ: μ != 17 with α = 0.05:
Since 17 is outside the 95% CI, we reject H0.
This matches our p-value result.

For the test H₀: μ = 16 vs Hₐ: μ != 16 with α = 0.05:
Since 16 is inside the 95% CI, we fail to reject H0.

This provides an intuitive way to understand hypothesis tests: if the hypothesized mean is outside the confidence interval, it's an implausible value for the true mean.

## 7. Checking Conditions for Valid T-Tests

Before conducting a t-test, we should verify certain conditions:

### Condition 1: Quantitative Data
The variable should be quantitative (numerical), not categorical.

### Condition 2: Random Sample
The data should come from a random sample of the population. If it doesn't, our inference may be biased.

### Condition 3: Normality or Large Sample Size
One of the following should be true:
- The population distribution is approximately normal (check by making a histogram or Q-Q plot of the sample), OR
- The sample size is large (n >= 30)

The t-test is robust to moderate violations of normality when n is large.

### Example: Checking Conditions

For the cat sleep data:
- Quantitative: Yes, sleep time in hours is numerical
- Random sample: The data should come from a random sample of the population of interest
- Sample size: n = 135, which is much larger than 30, so normality is not a major concern

R code to check normality with a histogram:

```r
histogram <- ggplot(cats_small, aes(x = Sleep_time_hours)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Distribution of Sleep Time",
       x = "Sleep time (hours)",
       y = "Frequency")
print(histogram)
```

If the histogram is roughly symmetric and unimodal, the normality condition is reasonably satisfied.
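
A Q-Q plot is another way to check normality. Since the cat data file is not reproduced here, this sketch uses simulated stand-in values (`sleep_hours` is made up, not the real data) with base R's `qqnorm()` and `qqline()`.

```r
set.seed(7)                                      # arbitrary seed
sleep_hours <- rnorm(135, mean = 16, sd = 2.9)   # simulated stand-in data

qqnorm(sleep_hours, main = "Normal Q-Q Plot of Sleep Time")
qqline(sleep_hours)   # points close to this line suggest approximate normality
```

With the real data, replace `sleep_hours` with `cats_small$Sleep_time_hours`. Strong curvature or points far from the line in the tails would signal skewness or outliers.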

## 8. Common Interpretation Pitfalls

### Pitfall 1: Misinterpreting the Confidence Interval

Incorrect: "There is a 95% probability that the true mean is in the interval [15.531, 16.509]."

Correct: "If we repeated our sampling procedure many times and calculated a 95% CI each time, about 95% of those intervals would contain the true mean."

Once we compute an interval, the true mean either is or is not in it. The probability is either 0 or 1, not 0.95.

### Pitfall 2: Confusing P-Value with the Probability H0 Is True

Incorrect: "The p-value is the probability that H0 is true."

Correct: "The p-value is the probability of observing data as extreme as (or more extreme than) what we observed, assuming H0 is true."

A small p-value suggests the data is incompatible with H0, but it does not directly tell us the probability that H0 is true.

### Pitfall 3: Failing to Reject Doesn't Mean Accept

Incorrect: "We fail to reject H0, so H0 is true."

Correct: "We fail to reject H0, so we don't have sufficient evidence to reject it. This doesn't mean H0 is true; it means the data don't provide strong evidence against it."

### Pitfall 4: Ignoring Practical Significance

A small p-value indicates statistical significance (a result this extreme would be unlikely if H0 were true), but the effect might still be small in practical terms.

Example: Suppose we test H₀: μ = 16 hours vs Hₐ: μ != 16 hours, and our sample mean is 16.02 hours with p-value = 0.01.

We reject H0 (statistically significant), but the difference of 0.02 hours (about 1 minute) is negligible in practical terms.

### Pitfall 5: Type I and Type II Errors

A Type I error occurs when we reject H0 when it is actually true. The probability of a Type I error is α (the significance level).

A Type II error occurs when we fail to reject H0 when it is actually false. The probability of a Type II error is β.

We control α by setting it before the analysis, but β depends on the true mean, the sample size, and the variability of the data. Larger sample sizes reduce β.

## Summary

To conduct inference about a single population mean:

1. Check the conditions (quantitative data, random sample, normality or n ≥ 30)
2. Calculate the sample mean, sample standard deviation, and standard error
3. For a confidence interval, use **x̄ ± t* × SE**
4. For a hypothesis test, calculate **t = (x̄ - μ₀) / SE** and find the p-value
5. Interpret results carefully, keeping practical significance in mind
6. Remember that confidence intervals and p-values describe how the procedure behaves under repeated sampling; they are not probability statements about the true parameters

# Two-Sample and Paired Inference

## Introduction: Comparing Two Groups

Often in statistics, we want to compare the means of two different groups. The key distinction is whether the samples are **independent** or **paired**.

**Independent samples** occur when:
- Two completely different groups of subjects are measured
- Which subjects are in one group has no bearing on which subjects are in the other
- Example: comparing sleep times between cats that are fixed vs intact

**Paired samples** occur when:
- The same subjects are measured twice
- Subjects are matched on relevant characteristics
- Example: height measured at age 13 and age 14 for the same individuals

This distinction is critical because it changes how we analyze the data.

## Equal Variance Two-Sample t-Test

When comparing two independent samples, we want to test whether the population means are equal.

Null hypothesis: H₀: μ₁ = μ₂, or equivalently, μ₁ - μ₂ = 0

Alternative hypothesis: Hₐ: μ₁ ≠ μ₂ (two-tailed), or μ₁ > μ₂ (one-tailed), or μ₁ < μ₂ (one-tailed)

### The Pooled Standard Deviation

When we assume equal variances in the two populations, we pool the sample variances to get a better estimate:

> **Key concept:** The pooled variance sₚ² is a weighted average of the two sample variances, weighted by their respective degrees of freedom. The pooled SD sₚ is its square root.

$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

The numerator combines the squared deviations from both groups. The denominator is n₁ + n₂ - 2, which is the total degrees of freedom available from both samples.

### Standard Error and Test Statistic

The standard error of the difference in means is:

$$SE = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

The test statistic follows a t-distribution with df = n₁ + n₂ - 2:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{SE}$$

### Confidence Interval

A confidence interval for the difference in means (μ₁ - μ₂) is:

$$(\bar{x}_1 - \bar{x}_2) \pm t^* \times SE$$

where t* is the critical value from the t-distribution with df = n₁ + n₂ - 2.

### Example: Cat Sleep Times

Let's compare sleep times between fixed and intact male ragdoll cats.

```r
# Sample data
xbar1 <- 12.5  # mean sleep time for fixed cats (hours)
xbar2 <- 11.8  # mean sleep time for intact cats (hours)
s1 <- 2.3      # SD for fixed cats
s2 <- 2.1      # SD for intact cats
n1 <- 25       # sample size for fixed cats
n2 <- 23       # sample size for intact cats

# Pooled SD
s_p <- sqrt(((n1 - 1)*s1^2 + (n2 - 1)*s2^2) / (n1 + n2 - 2))
cat("Pooled SD:", s_p, "\n")

# Point estimate and SE
pt_est <- xbar1 - xbar2
se <- s_p * sqrt(1/n1 + 1/n2)
cat("Point estimate of difference:", pt_est, "\n")
cat("Standard error:", se, "\n")

# 90% confidence interval
cv <- qt(0.95, df = n1 + n2 - 2)
ci_lower <- pt_est - cv * se
ci_upper <- pt_est + cv * se
cat("90% CI:", ci_lower, "to", ci_upper, "\n")

# Hypothesis test
test_stat <- (xbar1 - xbar2 - 0) / se
p_value <- 2 * pt(abs(test_stat), df = n1 + n2 - 2, lower.tail = FALSE)
cat("Test statistic:", test_stat, "\n")
cat("Two-tailed p-value:", p_value, "\n")
```
```output
Pooled SD: 2.2066
Point estimate of difference: 0.7
Standard error: 0.6375
90% CI: -0.3702 to 1.7702
Test statistic: 1.098
Two-tailed p-value: 0.2779
```

### Using t.test() in R

R makes this easy with the `t.test()` function:

```r
# Assuming df1 and df2 are data frames with a Sleep_time_hours column
# whose summary statistics match the values used above
t.test(df1$Sleep_time_hours, df2$Sleep_time_hours,
       mu = 0, conf.level = 0.90, var.equal = TRUE)
```
```output
    Two Sample t-test

data:  df1$Sleep_time_hours and df2$Sleep_time_hours
t = 1.098, df = 46, p-value = 0.2779
alternative hypothesis: true difference in means is not equal to 0
90 percent confidence interval:
 -0.3702  1.7702
sample estimates:
mean of x mean of y
     12.5     11.8
```
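
To confirm that the manual pooled calculation and `t.test()` agree on raw data, here is a minimal sketch. The vectors `fixed` and `intact` are simulated stand-ins (an assumption), not the actual cat data.

```r
set.seed(42)
fixed  <- rnorm(25, mean = 12.5, sd = 2.3)   # simulated, not real cat data
intact <- rnorm(23, mean = 11.8, sd = 2.1)   # simulated, not real cat data

n1 <- length(fixed)
n2 <- length(intact)

# Pooled SD, standard error, and test statistic by hand
s_p <- sqrt(((n1 - 1) * var(fixed) + (n2 - 1) * var(intact)) / (n1 + n2 - 2))
se  <- s_p * sqrt(1 / n1 + 1 / n2)
t_manual <- (mean(fixed) - mean(intact)) / se

# Same analysis with t.test()
res <- t.test(fixed, intact, var.equal = TRUE, conf.level = 0.90)

c(manual = t_manual, t.test = unname(res$statistic))
```

The two test statistics should be identical, and `res$parameter` reports df = n1 + n2 - 2 = 46.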

## Why Pooling Works

Pooling is valid when we assume the populations have equal variances. The key insight is that we're combining information from both samples to estimate a common population standard deviation.

> **Key concept:** Pooling gives us more information (higher degrees of freedom) and thus more power to detect differences, IF the equal variance assumption is reasonable.

The weights in the pooled SD formula, (n₁ - 1) and (n₂ - 1), reflect how much information each sample contributes. Larger samples get more weight because their variances are more stable estimates.

However, if the variances truly are different, pooling can be misleading. This is where Welch's t-test comes in.

## Welch's Unequal Variance t-Test

When sample standard deviations are substantially different, or when we're unsure about equality of variances, Welch's t-test is safer. The key differences:

1. **Do NOT pool the standard deviations**
2. Use the unpooled standard error:
$$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
3. Use the Welch-Satterthwaite degrees of freedom (more complex, typically reported by software)

### Welch Degrees of Freedom

The Welch-Satterthwaite formula for degrees of freedom is:

$$df \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$

This looks complex, but the interpretation is straightforward: it reduces the degrees of freedom when variances are unequal, reflecting the loss of information from having to estimate two different population standard deviations.

### R Implementation

In R, var.equal=FALSE (the default) uses Welch's method:

```r
# Welch's t-test with unequal variances
xbar1 <- 12.5
xbar2 <- 11.8
s1 <- 3.2   # larger SD for group 1
s2 <- 1.5   # smaller SD for group 2
n1 <- 25
n2 <- 23

# Manual calculation
se <- sqrt(s1^2/n1 + s2^2/n2)
pt_est <- xbar1 - xbar2

# Welch degrees of freedom
w_numer <- (s1^2/n1 + s2^2/n2)^2
w_denom <- (s1^4/(n1^2*(n1-1)) + s2^4/(n2^2*(n2-1)))
df_welch <- w_numer / w_denom

cat("Welch SE:", se, "\n")
cat("Welch DF:", df_welch, "\n")

# 95% CI
cv <- qt(0.975, df = df_welch)
ci_lower <- pt_est - cv * se
ci_upper <- pt_est + cv * se
cat("95% CI:", ci_lower, "to", ci_upper, "\n")
```
```output
Welch SE: 0.7123
Welch DF: 34.68
95% CI: -0.7466 to 2.1466
```

Using t.test() directly:

```r
t.test(df1$Sleep_time_hours, df2$Sleep_time_hours,
       conf.level = 0.95, var.equal = FALSE)
```
```output
    Welch Two Sample t-test

data:  df1$Sleep_time_hours and df2$Sleep_time_hours
t = 0.9827, df = 34.68, p-value = 0.3326
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.7466  2.1466
sample estimates:
mean of x mean of y
     12.5     11.8
```

## Which Test Should You Use?

Guidance for choosing between equal-variance and Welch's t-tests:

### Use Equal Variance t-test when:
- Sample standard deviations are similar (rule of thumb: ratio < 1.5)
- Sample sizes are similar
- You have strong prior knowledge that population variances are equal

### Use Welch's t-test when:
- Sample standard deviations differ noticeably
- Sample sizes are very different
- You're unsure about variance equality
- **As a general default choice** (Welch is safer and controls Type I error better)

> **Key concept:** Welch's t-test does not rely on the equal-variance assumption and loses very little power when the variances are actually equal. Most statisticians recommend Welch as the default choice unless you have good reason to assume equal variances.

R's default is var.equal=FALSE (Welch), which reflects modern statistical practice.
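
The claim that Welch controls Type I error better can be illustrated by simulation. In this sketch the null hypothesis is true (equal means), but the SDs and sample sizes differ; all settings are assumptions chosen to make the contrast visible.

```r
set.seed(123)
n_reps <- 2000

# H0 is true: both groups have mean 0, but SDs and sample sizes differ
p_vals <- replicate(n_reps, {
  x <- rnorm(10, mean = 0, sd = 4)   # small group, large SD
  y <- rnorm(40, mean = 0, sd = 1)   # large group, small SD
  c(pooled = t.test(x, y, var.equal = TRUE)$p.value,
    welch  = t.test(x, y, var.equal = FALSE)$p.value)
})

# Proportion of false rejections at alpha = 0.05 for each method
type1_rate <- rowMeans(p_vals < 0.05)
type1_rate
```

Under these settings the pooled test rejects a true H0 far more often than 5% of the time, while the Welch rate stays near the nominal 0.05.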

## Paired t-Test

When data is paired (same subjects measured twice, or matched subjects), we have a different situation. The key is that measurements are not independent across groups.

### When Data is Paired

- Before and after measurements on the same subject
- Measurements on matched subjects (twins, spouse pairs, etc.)
- Repeat measurements under different conditions

### The Paired Analysis Approach

The genius of paired testing is that we convert a two-sample problem into a one-sample problem:

1. Compute the differences: dᵢ = xᵢ₁ - xᵢ₂ for each pair
2. Treat the differences as a single sample
3. Test whether the mean difference is zero

This is a one-sample t-test on the differences, with df = n - 1 (where n is the number of pairs).

$$t = \frac{\bar{d} - 0}{s_d / \sqrt{n}}$$

> d̄ = mean of the differences, s_d = standard deviation of the differences, n = number of pairs

### Why Pairing Matters: A Critical Example

Consider height growth from age 13 to age 14 in 5 individuals:

```r
# Age 13 and 14 heights (in cm) for 5 individuals
thirteen <- c(44.1, 59.0, 65.9, 58.7, 49.3)
fourteen <- c(46.3, 60.5, 68.2, 59.4, 50.6)

# Differences (proper paired analysis)
growth <- fourteen - thirteen
print(growth)

# One-sample t-test on differences
n <- length(growth)
xbar <- mean(growth)
s <- sd(growth)
test_stat <- (xbar - 0) / (s / sqrt(n))
p_value <- 2 * pt(abs(test_stat), df = n - 1, lower.tail = FALSE)

cat("Mean growth:", xbar, "cm\n")
cat("SD of growth:", s, "cm\n")
cat("Test statistic:", test_stat, "\n")
cat("p-value (paired test):", p_value, "\n")
```
```output
[1] 2.2 1.5 2.3 0.7 1.3
Mean growth: 1.6 cm
SD of growth: 0.6633 cm
Test statistic: 5.394
p-value (paired test): 0.00572
```

Now compare to an **INCORRECT** analysis that ignores pairing:

```r
# WRONG: treating as independent samples
t.test(fourteen, thirteen, var.equal = TRUE)
```
```output
    Two Sample t-test

data:  fourteen and thirteen
t = 0.2926, df = 8, p-value = 0.7773
alternative hypothesis: true difference in means is not equal to 0
sample estimates:
mean of x mean of y
    57.00    55.40
```

Compare results:

> **Key concept:** The paired test gives t = 5.394, p = 0.0057 (highly significant). The unpaired test gives t = 0.293, p = 0.777 (not significant). This dramatic difference shows why pairing is crucial. When data is paired and we fail to pair in the analysis, we throw away important information and lose power to detect real effects.

The paired analysis is much more powerful because it controls for individual differences in height. By looking at changes within individuals, we reduce noise.
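
The noise reduction can be made concrete with the same height data. This sketch uses the identity Var(X - Y) = Var(X) + Var(Y) - 2 Cov(X, Y): when the two measurements are strongly positively correlated, the covariance term cancels most of the between-subject variation.

```r
thirteen <- c(44.1, 59.0, 65.9, 58.7, 49.3)
fourteen <- c(46.3, 60.5, 68.2, 59.4, 50.6)

# The two measurements are almost perfectly correlated
cor(thirteen, fourteen)

# Variance of the paired differences, computed two equivalent ways
var_direct  <- var(fourteen - thirteen)
var_formula <- var(fourteen) + var(thirteen) - 2 * cov(fourteen, thirteen)
c(direct = var_direct, formula = var_formula)

# Variance the unpaired analysis effectively works with (no covariance term)
var(fourteen) + var(thirteen)
```

The variance of the differences is tiny compared with the sum of the two group variances, which is why the paired test statistic is so much larger.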

### Using t.test() for Paired Data

```r
# Correct paired analysis
t.test(growth, alternative = "greater")
t.test(fourteen, thirteen, paired = TRUE, alternative = "greater")
```
```output
    One Sample t-test

data:  growth
t = 5.394, df = 4, p-value = 0.00286
alternative hypothesis: true mean is greater than 0

    Paired t-test

data:  fourteen and thirteen
t = 5.394, df = 4, p-value = 0.00286
alternative hypothesis: true mean difference is greater than 0
```

Both produce identical results. The key difference in the function call: paired = TRUE tells R to compute differences first.

## Confidence Interval for Difference in Means

### Interpretation

A 95% confidence interval for the difference in population means (μ₁ - μ₂) tells us:

> **Key concept:** If we repeated the sampling procedure many times and computed a confidence interval each time, approximately 95% of those intervals would contain the true population difference.

Practical interpretation:
- If the CI includes 0, the difference is not significant at the 0.05 level
- If the CI is entirely positive, group 1 has a significantly higher mean
- If the CI is entirely negative, group 1 has a significantly lower mean
- The width of the CI reflects precision (narrower = more precise)

### Examples from Previous Sections

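For the equal variance test, the 90% CI computed from the summary statistics is [-0.3702, 1.7702].
Interpretation: We're 90% confident the true difference in mean sleep times (fixed minus intact) is between -0.37 and 1.77 hours. The interval includes 0, so the data do not show a significant difference at the 0.10 level.

For the Welch test with unequal variances, the 95% CI computed from the summary statistics is [-0.7466, 2.1466].
Interpretation: This interval also includes 0, so we don't have strong evidence of a difference. It is wider than the first interval because of the higher confidence level and the larger standard error in that example.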

### Paired Data CI

For the height growth data, a 95% CI for mean growth:

```r
xbar <- mean(growth)
s <- sd(growth)
n <- length(growth)
se <- s / sqrt(n)
cv <- qt(0.975, df = n - 1)
ci_lower <- xbar - cv * se
ci_upper <- xbar + cv * se
cat("95% CI for mean growth:", ci_lower, "to", ci_upper, "\n")
```
```output
95% CI for mean growth: 0.7764 to 2.4236
```

We're 95% confident the true mean height growth from age 13 to 14 is between 0.78 and 2.42 cm.

## Connecting Confidence Intervals to Hypothesis Tests

There's a beautiful connection between confidence intervals and hypothesis tests.

### The Relationship

For a two-tailed hypothesis test with significance level α:
- If the (1 - α) confidence interval includes 0, we fail to reject H0
- If the (1 - α) confidence interval does NOT include 0, we reject H0

### Example

Looking back at our paired growth test:
- The 95% CI for mean growth is [0.7764, 2.4236]
- This CI does NOT include 0
- Therefore, we reject H₀: μ = 0 at the 0.05 level
- This matches the two-sided p-value of 0.00572 (< 0.05)

> **Key concept:** The confidence interval and hypothesis test are two views of the same underlying question. The CI tells us not just whether a difference exists, but also the range of plausible values.
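
The equivalence can be checked directly on the height data from the paired example. This is a small sketch; `alpha` is just the significance level chosen for illustration.

```r
thirteen <- c(44.1, 59.0, 65.9, 58.7, 49.3)
fourteen <- c(46.3, 60.5, 68.2, 59.4, 50.6)
alpha <- 0.05

# Two-sided paired test with a matching (1 - alpha) confidence interval
res <- t.test(fourteen, thirteen, paired = TRUE, conf.level = 1 - alpha)

ci_excludes_zero <- res$conf.int[1] > 0 || res$conf.int[2] < 0
rejects_h0       <- res$p.value < alpha

c(ci_excludes_zero = ci_excludes_zero, rejects_h0 = rejects_h0)
```

For a two-sided test, the two logical values always agree as long as the confidence level is 1 - α.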

## R Code Summary: t.test() Parameters

### Independent Samples - Equal Variances

```r
t.test(group1, group2,
       mu = 0,              # null hypothesis difference
       conf.level = 0.95,   # confidence level
       var.equal = TRUE)    # assume equal variances
```
```output

	Two Sample t-test

data:  group1 and group2
t = 2.8766, df = 18, p-value = 0.01006
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.456 9.344
sample estimates:
mean of x mean of y 
   73.500    68.100 
```

### Independent Samples - Welch (Unequal Variances)

```r
t.test(group1, group2,
       mu = 0,
       conf.level = 0.95,
       var.equal = FALSE)   # do NOT assume equal variances (default)
```
```output

	Welch Two Sample t-test

data:  group1 and group2
t = 2.8766, df = 17.794, p-value = 0.01017
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.453 9.347
sample estimates:
mean of x mean of y 
   73.500    68.100 
```

### Paired Samples

```r
# Method 1: Test on differences
t.test(differences,
       mu = 0,
       conf.level = 0.95)

# Method 2: Specify pairing directly
t.test(group1, group2,
       paired = TRUE,
       mu = 0,
       conf.level = 0.95)
```
```output

	One Sample t-test

data:  differences
t = 3.4641, df = 9, p-value = 0.006983
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.833 3.967
sample estimates:
mean of x 
      2.4 


	Paired t-test

data:  group1 and group2
t = 3.4641, df = 9, p-value = 0.006983
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 0.833 3.967
sample estimates:
mean difference 
            2.4 
```

### One-Tailed Tests

```r
# Test if group1 mean > group2 mean
t.test(group1, group2,
       alternative = "greater")  # or "less" for opposite

# Test if paired differences > 0
t.test(differences,
       alternative = "greater")
```
```output

	Welch Two Sample t-test

data:  group1 and group2
t = 2.8766, df = 17.794, p-value = 0.005086
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 2.143   Inf
sample estimates:
mean of x mean of y 
   73.500    68.100 


	One Sample t-test

data:  differences
t = 3.4641, df = 9, p-value = 0.003492
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
 1.130   Inf
sample estimates:
mean of x 
      2.4 
```
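
When the data are stored in long format (one row per subject, with a grouping column), `t.test()` also accepts a formula. The sketch below uses a simulated data frame; the names `scores`, `group`, and `score` are hypothetical.

```r
set.seed(7)

# Hypothetical long-format data: one row per subject
scores <- data.frame(
  group = rep(c("A", "B"), each = 10),
  score = c(rnorm(10, mean = 73, sd = 4), rnorm(10, mean = 68, sd = 4))
)

# Formula interface: response ~ grouping variable (Welch by default)
res_formula <- t.test(score ~ group, data = scores)

# Equivalent two-vector call
res_vectors <- t.test(scores$score[scores$group == "A"],
                      scores$score[scores$group == "B"])

c(formula = res_formula$p.value, vectors = res_vectors$p.value)
```

Both calls run the same Welch test. The formula version is often more convenient for tidy data because no manual subsetting is needed.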

## Summary Table

| Scenario | df | SE Formula | Assumption | R Code |
|----------|----|------------|------------|--------|
| Independent, equal var | n₁ + n₂ - 2 | sₚ × √(1/n₁ + 1/n₂) | σ₁ = σ₂ | `var.equal = TRUE` |
| Independent, unequal var | Welch-Satterthwaite | √(s₁²/n₁ + s₂²/n₂) | No equal-variance assumption (safer) | `var.equal = FALSE` (default) |
| Paired | n - 1 | s_d/√n | Differences normal | `paired = TRUE` |

## Key Takeaways

1. Always distinguish between independent and paired data structures
2. When data is paired, compute differences and treat as one-sample problem
3. Welch's t-test is safer as a default for independent samples
4. Equal variance t-test assumes (and requires) similar population variances
5. Confidence intervals and hypothesis tests tell complementary stories
6. The df and SE change based on the test choice
7. Failing to recognize and properly analyze paired data can lead to missing real effects

# Linear Regression

## What is Correlation?

Correlation measures the strength and direction of a linear relationship between two quantitative variables. The correlation coefficient, denoted by r, quantifies how closely two variables move together.

### Definition and Formula

The correlation coefficient is calculated as:

$$r = \frac{S_{xy}}{\sqrt{S_{xx} \times S_{yy}}}$$

where:
- Sxy = sum((xi - x̄)(yi - ȳ))
- Sxx = sum((xi - x̄)^2)
- Syy = sum((yi - ȳ)^2)

### Key Properties of Correlation

- **Range:** r is always between -1 and 1
- **Perfect positive:** r = 1 indicates a perfect positive linear relationship
- **Perfect negative:** r = -1 indicates a perfect negative linear relationship
- **No relationship:** r = 0 indicates no linear relationship
- **Symmetric:** cor(x,y) = cor(y,x) - the order doesn't matter
- **Unit-free:** correlation has no units; it's a pure number
- **Linear only:** correlation only measures linear associations; non-linear relationships (like quadratic) can have r near 0 even when variables are strongly related

> **Key concept:** Correlation is a measure of linear association only. A quadratic relationship or other non-linear pattern can have a correlation near zero, even though the variables are clearly related.
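
The formula and the linear-only caveat can both be checked in a few lines. The vectors below are made-up values for illustration.

```r
# Small made-up data set
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

# Correlation from the definition
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
Syy <- sum((y - mean(y))^2)
r_manual <- Sxy / sqrt(Sxx * Syy)

c(manual = r_manual, cor = cor(x, y))   # the two values agree

# A perfect quadratic relationship, symmetric about 0
x2 <- -3:3
y2 <- x2^2
cor(x2, y2)   # zero linear association despite an exact relationship
```

In the second example y2 is completely determined by x2, yet the correlation is zero because the relationship has no linear component.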

## Correlation Examples

Correlation helps us describe relationships:

- **r = 0.92:** Very strong positive relationship. As one variable increases, the other strongly tends to increase.
- **r = -0.78:** Strong negative relationship. As one variable increases, the other tends to decrease.
- **r = 0.15:** Weak positive relationship. There is a slight tendency for both to increase together, but the pattern is scattered.
- **r = 0:** No linear relationship. The scatter plot appears as a cloud with no clear pattern.

### Correlation vs Causation

A strong correlation does NOT imply causation. Just because two variables are correlated does not mean one causes the other. For example:

- Ice cream sales and drowning deaths are positively correlated (both increase in summer), but neither causes the other. A third variable (warm weather) causes both.
- The number of firefighters at a fire is positively correlated with fire damage, but more firefighters don't cause more damage - larger fires require more firefighters and cause more damage.

Correlation is useful for identifying potential relationships, but establishing causation requires careful experimental design or causal reasoning about the mechanism.

## R Code for Correlation

In R, calculate correlation using the `cor()` function:

```r
# Lake Monona data: year vs freeze duration
monona <- read_csv("lake-monona-winters-2025.csv")
cor(monona$year1, monona$duration)
```
```output
[1] -0.574
```

This negative correlation indicates that as years progress (time increases), the duration of lake freezing tends to decrease - a pattern consistent with climate change.

## What is Simple Linear Regression?

Simple linear regression models the relationship between two quantitative variables using a straight line. We use it to:

- Describe the relationship between variables
- Predict future values of one variable based on another
- Quantify the strength of the association

### The Model

Simple linear regression assumes:

y-hat = b₀ + b₁×x

where:
- y-hat is the predicted value of the response variable
- x is the explanatory (predictor) variable
- b₀ is the y-intercept (value of y when x = 0)
- b₁ is the slope (change in y for each 1-unit increase in x)

### Interpreting the Parameters

- **Intercept (b0):** The predicted value of y when x = 0. Not always meaningful in context.
- **Slope (b1):** For every 1-unit increase in x, y increases (or decreases if negative) by b₁ units on average. This is the most important interpretation.

## Fitting the Line: Least Squares

We find the "best" line by minimizing the sum of squared residuals (SSE). A residual is the difference between an observed value and its predicted value:

e = y - y-hat

The least squares line minimizes the sum of the squared residuals.

### Formulas for Slope and Intercept

Minimizing the sum of squared residuals gives closed-form estimates of the slope and intercept:

$$b_1 = \frac{S_{xy}}{S_{xx}}$$

$$b_0 = \bar{y} - b_1 \times \bar{x}$$

where ȳ and x̄ are the means of y and x.
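
As a quick check, the formulas can be applied by hand and compared with `lm()`. The ages and heights below are hypothetical values, not the Riley data.

```r
x <- c(24, 36, 48, 60, 72, 84, 96)                 # hypothetical ages (months)
y <- c(36.1, 39.4, 42.0, 45.3, 48.1, 51.2, 54.0)   # hypothetical heights (inches)

Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)

b1 <- Sxy / Sxx                  # slope
b0 <- mean(y) - b1 * mean(x)     # intercept

c(b0 = b0, b1 = b1)
coef(lm(y ~ x))                  # same estimates from lm()
```

The intercept formula also shows that the least squares line always passes through the point (x̄, ȳ).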

### Example in R

```r
# Riley height data: child's height vs age
riley <- read_table("riley.txt")
riley_2_8 <- riley %>% filter(age >= 24 & age <= 96)

# Fit the linear model
height_mod <- lm(height ~ age, data = riley_2_8)
summary(height_mod)
```
```output
Call:
lm(formula = height ~ age, data = riley_2_8)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.250      0.823  36.76  < 2e-16
age          0.250      0.012  20.83  < 2e-16

Residual standard error: 1.24 on 18 degrees of freedom
Multiple R-squared:  0.9603
```

Interpretation: For each additional month of age, predicted height increases by approximately 0.25 inches. When age = 0, the predicted height is 30.25 inches, but age 0 lies well outside the 24 to 96 month range used to fit the model, so this value should not be read as a newborn's height.

## R-squared

R-squared measures the proportion of variation in the response variable that is explained by the explanatory variable. It ranges from 0 to 1.

### Definition

$$R^2 = 1 - \frac{SSE}{S_{yy}}$$

where SSE is the sum of squared residuals and Syy = sum((yi - ȳ)^2) is the total sum of squares.

Alternatively, in simple linear regression: R^2 = r^2

The squared correlation coefficient equals R-squared.
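
The three routes to R-squared can be compared on a small example. The data are hypothetical values used only for illustration.

```r
x <- c(24, 36, 48, 60, 72, 84, 96)                 # hypothetical ages (months)
y <- c(36.1, 39.4, 42.0, 45.3, 48.1, 51.2, 54.0)   # hypothetical heights (inches)

fit <- lm(y ~ x)

SSE <- sum(resid(fit)^2)          # sum of squared residuals
Syy <- sum((y - mean(y))^2)       # total sum of squares

r2_from_sse <- 1 - SSE / Syy
r2_from_cor <- cor(x, y)^2
r2_from_lm  <- summary(fit)$r.squared

c(r2_from_sse, r2_from_cor, r2_from_lm)   # all three agree
```

The identity R² = r² holds for simple linear regression with one predictor; it does not carry over directly to models with several predictors.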

### Interpretation

- **R^2 = 0.96:** The model explains 96% of the variation in the response variable. The remaining 4% is due to other factors or random variation.
- **R^2 = 0.50:** The model explains 50% of the variation.
- **R^2 = 0.10:** The model explains only 10% of the variation; the explanatory variable is a weak predictor.

> **Key concept:** R-squared is a measure of model fit. Higher R-squared means the line fits the data better, but even high R-squared does not imply causation.

## Making Predictions

Once we fit a regression model, we can predict the response for new values of the explanatory variable.

### Point Predictions

Simply plug the new x value into the regression equation:

y-hat = b₀ + b₁×x

```r
# Predict height for a child aged 78 months
predict(height_mod, newdata = tibble(age = 78))
```
```output
[1] 49.79
```

For a 78-month-old child, we predict a height of approximately 49.79 inches.

### Extrapolation Warning

Extrapolation - predicting outside the range of the observed data - is dangerous because:

1. The relationship may not hold outside the observed range
2. We have no data to verify our assumptions
3. Errors in prediction tend to increase the further we go from the data

**Do not predict far outside the range of x values in your dataset.**

## Checking Regression Assumptions

Linear regression relies on four key assumptions. We check these primarily through residual plots.

### The Four Assumptions

1. **Linearity:** The relationship between x and y is linear. Violation: curved pattern in residual plot.
2. **Normality:** The residuals are normally distributed. Violation: residuals not symmetric around 0, heavy tails.
3. **Constant Variance (Homoscedasticity):** The spread of residuals is constant across all values of x. Violation: "megaphone" pattern (widening or narrowing spread).
4. **Independence:** Observations are independent of each other. This is checked through study design, not the residual plot.

### Creating and Reading Residual Plots

```r
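# Lake Monona example
# Fit the model first so its residuals are available in this chunk
lake_mod <- lm(duration ~ year1, data = monona)
monona_resids <- monona %>%
  mutate(residuals = resid(lake_mod))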

ggplot(monona_resids, aes(x = year1, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0)
```

Interpret the plot:
- **Good:** Points scattered randomly around 0 with no clear pattern
- **Curved pattern:** Suggests non-linearity; a straight line isn't appropriate
- **Megaphone shape:** Suggests non-constant variance; spread increases or decreases
- **Systematic patterns:** Suggest the model is missing important structure

> **Key concept:** Always plot your residuals. They reveal whether your regression assumptions are reasonable.

## Regression Inference: Confidence Interval for the Slope

We not only estimate the slope b1, but also quantify uncertainty in that estimate using a confidence interval.

### Standard Error of the Slope

The standard error estimates how much b₁ varies across different samples:

$$SE(b_1) = \frac{s}{\sqrt{S_{xx}}}$$

where s = sqrt(SSE/(n-2)) is the residual standard error (the square root of the mean squared error).

### Constructing the Confidence Interval

For a 95% confidence interval with df = n - 2 degrees of freedom:

$$b_1 \pm t^* \times SE(b_1)$$

where t* is the critical value from the t-distribution with n-2 degrees of freedom.

### Example in R

```r
n <- nrow(riley_2_8)
height_mod <- lm(height ~ age, data = riley_2_8)

s <- summary(height_mod)$sigma
pt_est <- summary(height_mod)$coefficients[2, 1]  # slope
se <- summary(height_mod)$coefficients[2, 2]       # SE of slope

cv <- qt(0.975, df = n - 2)
c(pt_est - cv*se, pt_est + cv*se)
```
```output
[1] 0.224 0.276
```

We are 95% confident that the true slope is between 0.224 and 0.276 inches per month.

### Using confint()

R provides a convenient function:

```r
confint(height_mod)
```
```output
                 2.5 %  97.5 %
(Intercept) 28.521  31.979
age          0.224   0.276
```
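
To connect `confint()` back to the formulas, the sketch below rebuilds the slope's standard error and 95% CI by hand. The data are hypothetical values, not the Riley measurements.

```r
x <- c(24, 36, 48, 60, 72, 84, 96)                 # hypothetical ages (months)
y <- c(36.1, 39.4, 42.0, 45.3, 48.1, 51.2, 54.0)   # hypothetical heights (inches)

fit <- lm(y ~ x)
n   <- length(x)

s     <- sqrt(sum(resid(fit)^2) / (n - 2))   # residual standard error
Sxx   <- sum((x - mean(x))^2)
se_b1 <- s / sqrt(Sxx)                       # SE of the slope

b1 <- unname(coef(fit)["x"])
ci_manual <- b1 + c(-1, 1) * qt(0.975, df = n - 2) * se_b1

ci_manual
confint(fit)["x", ]                          # same interval from confint()
```

The manual interval matches the `x` row of `confint(fit)`, and `se_b1` matches the "Std. Error" column of `summary(fit)`.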

## Regression Inference: Hypothesis Test for the Slope

We often test whether there is a significant linear relationship between x and y.

### Hypotheses

- H₀: β₁ = 0 (no linear relationship; the slope is zero)
- Hₐ: β₁ ≠ 0 (there is a linear relationship)

Alternatively, for one-sided tests:
- Hₐ: β₁ > 0 (positive relationship)
- Hₐ: β₁ < 0 (negative relationship)

### Test Statistic

$$t = \frac{b_1}{SE(b_1)}$$

with degrees of freedom = n - 2

Under H0, this follows a t-distribution.

### Example: Lake Monona

```r
monona <- read_csv("lake-monona-winters-2025.csv")
lake_mod <- lm(duration ~ year1, data = monona)
summary(lake_mod)
```
```output
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  48.5    25.3      1.92   0.058
year1        -0.087  0.010    -8.63  < 0.001
```

Interpretation: The test statistic is -8.63. The two-sided p-value is < 0.001, providing very strong evidence that the slope is not zero. There is a significant negative linear relationship between year and freeze duration.

> **Key concept:** When the p-value is below the chosen significance level (commonly α = 0.05), we reject H0 and conclude there is a significant linear relationship. The slope is statistically different from zero.
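
The t value and p-value in the `summary()` table can be reproduced from the estimate and its standard error. This sketch uses hypothetical data rather than the Lake Monona file.

```r
x <- c(24, 36, 48, 60, 72, 84, 96)                 # hypothetical predictor values
y <- c(36.1, 39.4, 42.0, 45.3, 48.1, 51.2, 54.0)   # hypothetical responses

fit   <- lm(y ~ x)
coefs <- summary(fit)$coefficients

# Test statistic: estimate divided by its standard error
t_stat <- coefs["x", "Estimate"] / coefs["x", "Std. Error"]
df     <- df.residual(fit)        # n - 2

# Two-sided p-value for Ha: beta1 != 0
p_two_sided <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# One-sided p-value for Ha: beta1 > 0
p_greater <- pt(t_stat, df = df, lower.tail = FALSE)

c(t = t_stat, p_two_sided = p_two_sided, p_greater = p_greater)
```

`summary()` always reports the two-sided p-value; a one-sided p-value has to be computed separately, as in the last step.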

## Confidence Interval for Mean Response

Often we want a confidence interval for the mean response (average y) at a given x value, not a prediction for an individual.

### Confidence Interval for the Mean

```r
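# lion_mod is assumed to be a fitted lm() of nose proportion on age (fit not shown in this section)
predict(lion_mod, newdata = tibble(age = 5), interval = "confidence")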
```
```output
     fit    lwr    upr
1  0.546  0.522  0.570
```

We are 95% confident that the mean nose proportion for 5-year-old lions is between 0.522 and 0.570.

### Standard Error Formula

$$SE_{fit} = s \sqrt{\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}$$

Note: This interval is for the average response at x = x_new, not for an individual observation.

### Why This Matters

The confidence interval tells us where the regression line is - it's narrower near the center of the data (where we have the most information) and wider at the extremes (where we're more uncertain).
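
The standard error formula can be verified against `predict()`. The data and the new x value below are hypothetical, and `data.frame()` is used in place of `tibble()` so the sketch runs in base R.

```r
x <- c(24, 36, 48, 60, 72, 84, 96)                 # hypothetical predictor values
y <- c(36.1, 39.4, 42.0, 45.3, 48.1, 51.2, 54.0)   # hypothetical responses

fit   <- lm(y ~ x)
n     <- length(x)
s     <- summary(fit)$sigma           # residual standard error
Sxx   <- sum((x - mean(x))^2)
x_new <- 50                           # hypothetical new x value

# Fitted value and SE of the mean response at x_new
y_hat  <- unname(coef(fit)[1] + coef(fit)[2] * x_new)
se_fit <- s * sqrt(1 / n + (x_new - mean(x))^2 / Sxx)

ci_manual <- y_hat + c(-1, 1) * qt(0.975, df = n - 2) * se_fit
ci_r <- predict(fit, newdata = data.frame(x = x_new), interval = "confidence")

ci_manual
ci_r
```

Changing `x_new` to a value further from `mean(x)` makes the (x_new - x̄)² term grow, which widens the interval.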

## Prediction Interval for a New Observation

When we predict for an individual new observation, we need a wider interval to account for individual variation around the line.

### Prediction Interval

```r
predict(lion_mod, newdata = tibble(age = 5), interval = "prediction")
```
```output
     fit    lwr    upr
1  0.546  0.412  0.680
```

We predict with 95% confidence that a new 5-year-old lion will have nose proportion between 0.412 and 0.680.

### Why PI is Always Wider Than CI

The prediction interval accounts for two sources of uncertainty:
1. Uncertainty about where the regression line is (same as CI)
2. Uncertainty about individual variation around the line (additional)

### Standard Error Formula

$$SE_{pred} = s \sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}$$

Notice the extra "1" in the formula - that's the individual variation.
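
The role of the extra "1" can be seen by computing both standard errors side by side. This is a sketch on hypothetical data, using `data.frame()` rather than `tibble()` so it runs in base R.

```r
x <- c(24, 36, 48, 60, 72, 84, 96)                 # hypothetical predictor values
y <- c(36.1, 39.4, 42.0, 45.3, 48.1, 51.2, 54.0)   # hypothetical responses

fit   <- lm(y ~ x)
n     <- length(x)
s     <- summary(fit)$sigma
Sxx   <- sum((x - mean(x))^2)
x_new <- 50                                        # hypothetical new x value

leverage_term <- 1 / n + (x_new - mean(x))^2 / Sxx
se_fit  <- s * sqrt(leverage_term)        # SE for the mean response
se_pred <- s * sqrt(1 + leverage_term)    # SE for an individual observation

y_hat <- unname(coef(fit)[1] + coef(fit)[2] * x_new)
pi_manual <- y_hat + c(-1, 1) * qt(0.975, df = n - 2) * se_pred
pi_r <- predict(fit, newdata = data.frame(x = x_new), interval = "prediction")

c(se_fit = se_fit, se_pred = se_pred)
pi_manual
pi_r
```

Because `se_pred` is always at least `s`, the prediction interval cannot shrink to zero width no matter how large the sample is, whereas `se_fit` does shrink as n grows.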

### Comparison

```r
age_5 <- tibble(age = 5)
ci <- predict(lion_mod, newdata = age_5, interval = "confidence")
# Named pred_int rather than pi to avoid masking R's built-in constant pi
pred_int <- predict(lion_mod, newdata = age_5, interval = "prediction")

# CI width: 0.570 - 0.522 = 0.048
# PI width: 0.680 - 0.412 = 0.268
```

The prediction interval is much wider because it accounts for how individual observations vary around the average trend.

> **Key concept:** Confidence intervals estimate where the average/mean response is. Prediction intervals estimate where an individual observation will fall. Always use a prediction interval when predicting for individuals.

# Inference Review Checkpoint

This checkpoint drills modules 9 through 13 -- confidence intervals, hypothesis testing, proportion inference, mean inference (single and two-sample), and linear regression.

Questions are pulled from each module's question bank alongside this deck's own questions, so the pool is larger than what you see here.

> Use Flashcard mode to surface weak spots with spaced repetition. Cards you get wrong are shown more often.

# Final Exam Review

This deck pulls questions from all modules (1-13) with heavier weight on inference and regression. The flashcard pool is larger than the quiz -- use Flashcard mode for full coverage.

> Inference (Ch 7-15) is 75% of the final exam. Spend most of your time on modules 9-13.