MODULE 0410 QUESTIONS

dplyr: Data Manipulation

The Pipe: %>%

The pipe passes the result on its left as the first argument of the function on its right, so a chain of operations reads in the order it runs:

R
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data

# Without the pipe: nested calls read inside-out
filter(select(penguins, species, body_mass_g), species == "Adelie")

# With the pipe: steps read top to bottom
penguins %>%
  select(species, body_mass_g) %>%
  filter(species == "Adelie")

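Note: R 4.1 introduced the native pipe |>, which behaves like %>% in simple chains such as the ones in this module: x |> f() is equivalent to x %>% f(). The two differ in details such as the placeholder (_ for |>, available from R 4.2; . for %>%). You may see both in course materials and online resources.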
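As a quick check of the equivalence described in the note, the sketch below runs the same step with both pipes. The toy data frame is a hypothetical stand-in for penguins, and |> assumes R 4.1 or later.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins
toy <- tibble(
  species = c("Adelie", "Gentoo", "Adelie"),
  body_mass_g = c(3700, 5200, 3900)
)

# Same operation, written once with each pipe
with_magrittr <- toy %>% filter(species == "Adelie")
with_native   <- toy |> filter(species == "Adelie")

# TRUE when both pipes produce the same result
identical(with_magrittr, with_native)
```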

Column Operations

select() — keep or drop columns

R
penguins %>% select(species, body_mass_g, sex)
penguins %>% select(-island)  # drop island
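select() can also pick columns by name pattern or type instead of listing them one by one. The following is a minimal sketch using the selection helpers ends_with(), starts_with(), and where(); the toy data frame is hypothetical and only borrows penguins-style column names.

```r
library(dplyr)

# Hypothetical toy data with penguins-style column names
toy <- tibble(
  species = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 46.1),
  flipper_length_mm = c(181, 211),
  body_mass_g = c(3750, 4500)
)

# Keep every column whose name ends in "_mm"
mm_cols <- toy %>% select(ends_with("_mm"))

# Drop every column whose name starts with "bill"
no_bill <- toy %>% select(-starts_with("bill"))

# Keep species plus all numeric columns
numeric_cols <- toy %>% select(species, where(is.numeric))

names(mm_cols)
names(no_bill)
```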

mutate() — create or modify columns

R
penguins %>%
  mutate(mass_kg = body_mass_g / 1000) %>%
  select(species, mass_kg)
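The example above creates a column. The sketch below, on hypothetical toy data, also shows the "modify" case: assigning to an existing name overwrites that column, and a column created earlier in the same mutate() call can be used right away.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species = c("Adelie", "Gentoo"),
  body_mass_g = c(3750, 5000)
)

result <- toy %>%
  mutate(
    species = toupper(species),    # overwrites the existing column
    mass_kg = body_mass_g / 1000,  # creates a new column
    mass_lb = mass_kg * 2.20462    # uses mass_kg, defined one line above
  )

result
```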

Row Operations

filter() — keep rows matching a condition

R
penguins %>%
  filter(species == "Chinstrap") %>%
  nrow()

R
penguins %>%
  filter(body_mass_g > 5000 & sex == "male") %>%
  select(species, body_mass_g)
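Two details of filter() are easy to miss: conditions separated by commas are combined with AND, just like &, and rows where the condition evaluates to NA are dropped. A small sketch on hypothetical toy data:

```r
library(dplyr)

# Hypothetical toy data with some missing values
toy <- tibble(
  species = c("Gentoo", "Gentoo", "Adelie", "Gentoo"),
  body_mass_g = c(5400, NA, 3700, 5100),
  sex = c("male", "male", "female", NA)
)

# These two calls are equivalent
with_and   <- toy %>% filter(body_mass_g > 5000 & sex == "male")
with_comma <- toy %>% filter(body_mass_g > 5000, sex == "male")

identical(with_and, with_comma)

# Only the first row survives: rows 2 and 4 give NA for the
# combined condition, and filter() keeps only TRUE
nrow(with_and)
```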

arrange() — sort rows

R
penguins %>%
  arrange(desc(body_mass_g)) %>%
  select(species, body_mass_g) %>%
  head(3)

filter() also works with %in% to match any of several values:

R
# Filter for multiple values using %in%
penguins %>%
  filter(species %in% c("Adelie", "Chinstrap"))
# Returns only rows where species is Adelie OR Chinstrap
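To make the "Adelie OR Chinstrap" reading concrete, the sketch below compares %in% with the longer == / | form and shows how to negate the test. The toy data frame is hypothetical.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species = c("Adelie", "Chinstrap", "Gentoo", "Adelie"),
  body_mass_g = c(3700, 3800, 5000, 3900)
)

# %in% is shorthand for chaining == with |
with_in <- toy %>% filter(species %in% c("Adelie", "Chinstrap"))
with_or <- toy %>% filter(species == "Adelie" | species == "Chinstrap")

identical(with_in, with_or)

# Negate with ! to keep everything except the listed values
others <- toy %>% filter(!species %in% c("Adelie", "Chinstrap"))
others$species
```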

Summarizing Data

group_by() + summarize()

R
penguins %>%
  group_by(species) %>%
  summarize(
    avg_mass = mean(body_mass_g, na.rm = TRUE),
    n = n()
  )
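The na.rm = TRUE argument matters because a single missing value turns mean() into NA for the whole group. The sketch below, on hypothetical toy data, also shows that n() counts every row in the group, including rows where the measurement is missing.

```r
library(dplyr)

# Hypothetical toy data: one Adelie row has no mass recorded
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3700, NA, 5000, 5400)
)

result <- toy %>%
  group_by(species) %>%
  summarize(
    avg_with_na = mean(body_mass_g),             # NA for Adelie
    avg_mass = mean(body_mass_g, na.rm = TRUE),  # ignores the missing value
    n = n(),                                     # all rows in the group
    n_measured = sum(!is.na(body_mass_g))        # rows with a recorded mass
  )

result
```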

count() — quick row counts per group

R
penguins %>% count(species)
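count() can be read as shorthand for group_by() followed by summarize(n = n()). The sketch below checks that on hypothetical toy data and also shows the sort = TRUE argument, which puts the largest groups first.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species = c("Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie", "Gentoo")
)

# Two ways to get the same counts
by_count <- toy %>% count(species)
by_summary <- toy %>%
  group_by(species) %>%
  summarize(n = n())

# Largest group first
sorted <- toy %>% count(species, sort = TRUE)
sorted
```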

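Note: When you group by more than one variable, summarize() drops only the last grouping variable by default, so the result is still grouped by the remaining ones (with a single grouping variable the result is ungrouped). Use .groups = "drop" inside summarize(), or chain ungroup() afterwards, to remove all grouping and avoid unexpected behavior in downstream operations.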

R
penguins %>%
  group_by(species, island) %>%
  summarize(
    avg_mass = mean(body_mass_g, na.rm = TRUE),
    .groups = "drop"  # removes all grouping from the result
  )
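One way to see which grouping is left after summarize() is to inspect the result with group_vars(). The sketch below uses hypothetical toy data and spells out .groups = "drop_last", which matches the default behavior for this kind of summary but avoids the informational message.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo"),
  island = c("Dream", "Biscoe", "Biscoe"),
  body_mass_g = c(3700, 3800, 5000)
)

# Default-style behavior: only the last grouping variable is dropped
default_result <- toy %>%
  group_by(species, island) %>%
  summarize(avg_mass = mean(body_mass_g), .groups = "drop_last")

# All grouping removed
dropped_result <- toy %>%
  group_by(species, island) %>%
  summarize(avg_mass = mean(body_mass_g), .groups = "drop")

group_vars(default_result)  # still grouped by species
group_vars(dropped_result)  # no grouping left
```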

case_when() — conditional values

R
penguins %>%
  mutate(size = case_when(
    body_mass_g > 5000 ~ "large",
    body_mass_g > 3500 ~ "medium",
    .default = "small"  # .default requires dplyr 1.1.0 or later
  )) %>%
  count(size)

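Conditions are evaluated in order, and the first one that is TRUE wins. Rows that match no condition get the .default value; that includes rows where body_mass_g is missing, which end up labeled "small" here.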
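The effect of condition order is easiest to see on a plain vector. This sketch uses a hypothetical mass vector and assumes dplyr 1.1.0 or later for .default; it also shows one way to keep missing values as NA instead of letting them fall through to the default.

```r
library(dplyr)

# Hypothetical masses, including one missing value
mass <- c(5500, 4000, 3000, NA)

# Most specific condition first: 5500 is labeled "large"
right_order <- case_when(
  mass > 5000 ~ "large",
  mass > 3500 ~ "medium",
  .default = "small"
)

# Broad condition first: 5500 already matches > 3500,
# so the "large" branch is never reached
wrong_order <- case_when(
  mass > 3500 ~ "medium",
  mass > 5000 ~ "large",
  .default = "small"
)

# Handle NA explicitly so it is not labeled "small"
na_aware <- case_when(
  is.na(mass) ~ NA_character_,
  mass > 5000 ~ "large",
  mass > 3500 ~ "medium",
  .default = "small"
)

right_order
wrong_order
na_aware
```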

Relocating Columns

Move a column to a different position:

R
penguins %>%
  relocate(sex, .after = species) %>%
  head()
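Besides .after, relocate() accepts .before, and with neither argument it moves the column to the front. A short sketch on hypothetical toy data, checking only the resulting column order:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species = "Adelie",
  island = "Dream",
  body_mass_g = 3700,
  sex = "male"
)

after_species <- toy %>% relocate(sex, .after = species)
to_front      <- toy %>% relocate(sex)
before_island <- toy %>% relocate(body_mass_g, .before = island)

names(after_species)
names(to_front)
names(before_island)
```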

Renaming Columns

R
penguins %>%
  rename(flipper = flipper_length_mm, mass = body_mass_g) %>%
  head()
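Note the argument order in rename(): new_name = old_name. When many columns need the same change, rename_with() applies a function to the names instead. The sketch below uses hypothetical toy data.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species = "Adelie",
  flipper_length_mm = 181,
  body_mass_g = 3750
)

# new_name = old_name
renamed <- toy %>%
  rename(flipper = flipper_length_mm, mass = body_mass_g)

# Apply a function to every column name
upper <- toy %>% rename_with(toupper)

# Apply a function only to selected columns: strip the "_mm" suffix
stripped <- toy %>%
  rename_with(~ sub("_mm$", "", .x), .cols = ends_with("_mm"))

names(renamed)
names(upper)
names(stripped)
```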

group_by() + mutate()

Unlike summarize(), mutate() adds a column while keeping all original rows. Each row gets the per-group value:

R
penguins %>%
  group_by(species) %>%
  mutate(
    species_mean_mass = mean(body_mass_g, na.rm = TRUE)
  ) %>%
  select(species, body_mass_g, species_mean_mass) %>%
  head()

The grouping stays attached to the data after mutate(), so call ungroup() once the per-group work is done:

R
penguins %>%
  group_by(species) %>%
  mutate(species_count = n()) %>%
  ungroup()
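The sketch below illustrates why forgetting ungroup() can give surprising results: a later mutate() that is meant to compute an overall value is still computed per group. The toy data is hypothetical.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo"),
  body_mass_g = c(3000, 4000, 5000)
)

# Grouping is still active, so mean() is computed within each species
still_grouped <- toy %>%
  group_by(species) %>%
  mutate(species_count = n()) %>%
  mutate(overall_mean = mean(body_mass_g))

# After ungroup(), mean() covers the whole table
ungrouped <- toy %>%
  group_by(species) %>%
  mutate(species_count = n()) %>%
  ungroup() %>%
  mutate(overall_mean = mean(body_mass_g))

still_grouped$overall_mean
ungrouped$overall_mean
```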

Multiple Summaries at Once

R
penguins %>%
  group_by(species) %>%
  summarize(
    avg = mean(body_mass_g, na.rm = TRUE),
    min = min(body_mass_g, na.rm = TRUE),
    max = max(body_mass_g, na.rm = TRUE),
    n = n()
  )

slice_min() and slice_max()

Keep the top/bottom k rows per group using slice_max() and slice_min():

R
penguins %>%
  group_by(species) %>%
  slice_max(body_mass_g, n = 1)  # heaviest penguin per species

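slice_max(col, n = k) keeps the k rows with the largest values.

slice_min(col, n = k) keeps the k rows with the smallest values.

Both keep ties by default (with_ties = TRUE), so more than k rows can come back when several rows share the cutoff value.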
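The following sketch shows how ties affect the row count, using hypothetical toy data in which two Adelie rows share the largest mass. Setting with_ties = FALSE returns exactly n rows per group.

```r
library(dplyr)

# Hypothetical toy data: two Adelie rows tie for heaviest
toy <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo"),
  body_mass_g = c(3900, 3900, 3500, 5400)
)

# Default: both tied Adelie rows are kept, plus one Gentoo row
with_ties <- toy %>%
  group_by(species) %>%
  slice_max(body_mass_g, n = 1)

# Exactly one row per species
no_ties <- toy %>%
  group_by(species) %>%
  slice_max(body_mass_g, n = 1, with_ties = FALSE)

# Lightest row per species
lightest <- toy %>%
  group_by(species) %>%
  slice_min(body_mass_g, n = 1)

nrow(with_ties)
nrow(no_ties)
lightest$body_mass_g
```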

drop_na()

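Remove rows with missing values. drop_na() comes from the tidyr package (loaded by library(tidyverse)), not from dplyr itself:

R
library(tidyr)  # drop_na() lives in tidyr
penguins %>% drop_na()             # remove any row with ANY NA
penguins %>% drop_na(body_mass_g)  # remove rows where body_mass_g is NA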
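For comparison, the same filtering can be sketched with dplyr alone, which may help clarify what drop_na() does. This is an illustrative equivalent on hypothetical toy data, not how drop_na() is implemented.

```r
library(dplyr)

# Hypothetical toy data with missing values in two columns
toy <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3700, NA, 3800),
  sex = c("male", "female", NA)
)

# Like drop_na(): keep rows where every column is non-missing
complete_rows <- toy %>%
  filter(if_all(everything(), ~ !is.na(.x)))

# Like drop_na(body_mass_g): only check one column
has_mass <- toy %>%
  filter(!is.na(body_mass_g))

nrow(complete_rows)
nrow(has_mass)
```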

mutate() vs summarize() — Key Distinction

These two verbs handle grouping differently:

  • mutate() → returns same number of rows as input
  • summarize() → returns one row per group

R
# mutate: 344 rows → 344 rows
penguins %>%
  group_by(species) %>%
  mutate(species_avg = mean(body_mass_g, na.rm = TRUE)) %>%
  nrow()

R
# summarize: 344 rows → 3 rows (one per species)
penguins %>%
  group_by(species) %>%
  summarize(avg = mean(body_mass_g, na.rm = TRUE)) %>%
  nrow()

Use mutate() to add new columns; use summarize() to collapse groups into summaries.
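A common reason to choose grouped mutate() over summarize() is comparing each row with its group's summary. The sketch below, on hypothetical toy data, computes each row's difference from its species mean and contrasts the row counts of the two approaches.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3500, 3900, 5000, 5400)
)

# mutate(): every row is kept and gains a per-group comparison
added <- toy %>%
  group_by(species) %>%
  mutate(diff_from_avg = body_mass_g - mean(body_mass_g)) %>%
  ungroup()

# summarize(): each species collapses to a single row
collapsed <- toy %>%
  group_by(species) %>%
  summarize(avg = mean(body_mass_g))

nrow(added)
nrow(collapsed)
added$diff_from_avg
```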

Full Pipeline Example

Combining filter, group_by, summarize, and arrange:

R
penguins %>%
  filter(!is.na(body_mass_g), !is.na(sex)) %>%
  group_by(species, sex) %>%
  summarize(
    avg_mass = mean(body_mass_g),
    n = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_mass))

This removes rows where body_mass_g or sex is missing, groups by species and sex, computes the mean mass and row count per group, then sorts the groups from highest to lowest average mass.
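To trace the steps on something small enough to check by hand, the sketch below runs the same pipeline on hypothetical toy data. Two of the five rows are dropped by the filter (one has no sex recorded, one has no mass), leaving three groups of one row each.

```r
library(dplyr)

# Hypothetical toy data with two incomplete rows
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  sex = c("male", "female", "male", NA, "female"),
  body_mass_g = c(4000, 3400, 5500, 5000, NA)
)

result <- toy %>%
  filter(!is.na(body_mass_g), !is.na(sex)) %>%  # rows 4 and 5 are removed
  group_by(species, sex) %>%
  summarize(
    avg_mass = mean(body_mass_g),
    n = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_mass))                       # heaviest group first

result
```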