dplyr: Data Manipulation
The Pipe: %>%
The pipe passes the left-hand result to the next function. This makes chains of operations readable:
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data used throughout

# Without pipe: hard to read
filter(select(penguins, species, body_mass_g), species == "Adelie")

# With pipe: clean and sequential
penguins %>%
  select(species, body_mass_g) %>%
  filter(species == "Adelie")
Note: R 4.1 introduced the native pipe |>, which behaves like %>% in most cases. You may see both in course materials and online resources. For a simple call, x |> f() is equivalent to x %>% f().
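As a quick check of that equivalence, the sketch below runs the same chain with both pipes and compares the results. It assumes R 4.1 or later with dplyr installed; `toy` and its values are made up for illustration and are not the penguins data.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins
toy <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo"),
  body_mass_g = c(3700, 3800, 5000)
)

# Same chain written with the magrittr pipe...
with_magrittr <- toy %>%
  select(species, body_mass_g) %>%
  filter(species == "Adelie")

# ...and with the native pipe
with_native <- toy |>
  select(species, body_mass_g) |>
  filter(species == "Adelie")

identical(with_magrittr, with_native)
```

One difference to keep in mind: the native pipe needs an explicit call on the right-hand side, so x |> f() works but x |> f is a syntax error, whereas x %>% f is accepted.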
Column Operations
select() — keep or drop columns
penguins %>% select(species, body_mass_g, sex)
penguins %>% select(-island)  # drop island
mutate() — create or modify columns
penguins %>%
  mutate(mass_kg = body_mass_g / 1000) %>%
  select(species, mass_kg)
Row Operations
filter() — keep rows matching a condition
penguins %>%
  filter(species == "Chinstrap") %>%
  nrow()

penguins %>%
  filter(body_mass_g > 5000 & sex == "male") %>%
  select(species, body_mass_g)
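One behavior worth knowing here: filter() keeps only rows where the condition is TRUE, so rows where it evaluates to NA are dropped without warning. The minimal sketch below uses a made-up `toy` data frame (hypothetical values, not the penguins data) and also shows that separating conditions with a comma gives the same result as joining them with &.

```r
library(dplyr)

# Hypothetical toy data with some missing values
toy <- tibble(
  species     = c("Gentoo", "Gentoo", "Adelie", "Gentoo"),
  sex         = c("male", "female", "male", NA),
  body_mass_g = c(5500, 5200, 3900, NA)
)

with_amp   <- toy %>% filter(body_mass_g > 5000 & sex == "male")
with_comma <- toy %>% filter(body_mass_g > 5000, sex == "male")

nrow(with_amp)                    # 1: only the first row meets both conditions
identical(with_amp, with_comma)

# The row with NA mass is not kept: NA > 5000 is NA, not TRUE
heavy <- toy %>% filter(body_mass_g > 5000)
nrow(heavy)                       # 2
```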
arrange() — sort rows
penguins %>%
  arrange(desc(body_mass_g)) %>%
  select(species, body_mass_g) %>%
  head(3)
# Filter for multiple values using %in%
penguins %>%
  filter(species %in% c("Adelie", "Chinstrap"))
# Returns only rows where species is Adelie OR Chinstrap
Summarizing Data
group_by() + summarize()
penguins %>%
  group_by(species) %>%
  summarize(
    avg_mass = mean(body_mass_g, na.rm = TRUE),
    n = n()
  )
count() — quick row counts per group
penguins %>% count(species)
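To make the shortcut concrete, the sketch below compares count() with the longer group_by() + summarize(n = n()) form. The `toy` data frame is hypothetical, and the example assumes only that dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo", "Adelie")
)

by_count <- toy %>% count(species)

by_summarize <- toy %>%
  group_by(species) %>%
  summarize(n = n())

# Both forms give one row per species with the same counts
all.equal(as.data.frame(by_count), as.data.frame(by_summarize))

# sort = TRUE orders the result by descending count
sorted <- toy %>% count(species, sort = TRUE)
sorted
```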
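Note: When you group by more than one variable, summarize() drops only the last grouping variable by default, so the result is still grouped by the remaining ones (dplyr prints a message saying so). With a single grouping variable the result is ungrouped. Use .groups = "drop" inside summarize(), or chain ungroup() afterwards, to remove all grouping and avoid unexpected behavior in downstream operations.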
penguins %>%
  group_by(species, island) %>%
  summarize(avg_mass = mean(body_mass_g, na.rm = TRUE),
            .groups = "drop")  # removes all grouping from result
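You can inspect what grouping remains with group_vars(). The sketch below, on a hypothetical `toy` data frame, compares the default behavior with .groups = "drop"; it assumes only that dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  island      = c("Dream", "Torgersen", "Biscoe", "Biscoe"),
  body_mass_g = c(3700, 3900, 5000, 5200)
)

grouped <- toy %>% group_by(species, island)

# Default: only the last grouping variable (island) is dropped
kept <- grouped %>% summarize(avg_mass = mean(body_mass_g))
group_vars(kept)      # "species"

# .groups = "drop" returns an ungrouped tibble
dropped <- grouped %>%
  summarize(avg_mass = mean(body_mass_g), .groups = "drop")
group_vars(dropped)   # character(0)
```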
case_when() — conditional values
penguins %>%
  mutate(size = case_when(
    body_mass_g > 5000 ~ "large",
    body_mass_g > 3500 ~ "medium",
    .default = "small"
  )) %>%
  count(size)
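Conditions are evaluated in order, and the first TRUE match wins. Rows that match no condition get the .default value. The .default argument requires dplyr 1.1.0 or later.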
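The sketch below illustrates why the order matters, using a small made-up vector `mass` (hypothetical values). It also shows that a missing value matches no condition and so falls through to .default, which means the penguins with missing body mass would be labelled "small" unless you handle NA explicitly. It assumes dplyr 1.1.0 or later.

```r
library(dplyr)

mass <- c(5600, 4200, 3000, NA)   # hypothetical values

# Narrow condition first: each value lands in the intended bin
ordered <- case_when(
  mass > 5000 ~ "large",
  mass > 3500 ~ "medium",
  .default = "small"
)
ordered    # "large" "medium" "small" "small"

# Broad condition first: it also captures 5600, so "large" is never reached
swapped <- case_when(
  mass > 3500 ~ "medium",
  mass > 5000 ~ "large",
  .default = "small"
)
swapped    # "medium" "medium" "small" "small"

# Handle missing values explicitly so they are not labelled "small"
na_aware <- case_when(
  is.na(mass) ~ NA_character_,
  mass > 5000 ~ "large",
  mass > 3500 ~ "medium",
  .default = "small"
)
na_aware   # "large" "medium" "small" NA
```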
Relocating Columns
Move a column to a different position:
penguins %>%
  relocate(sex, .after = species) %>%
  head()
Renaming Columns
penguins %>%
  rename(flipper = flipper_length_mm, mass = body_mass_g) %>%
  head()
group_by() + mutate()
Unlike summarize(), mutate() adds a column while keeping all original rows. Each row gets the per-group value:
penguins %>%
  group_by(species) %>%
  mutate(
    species_mean_mass = mean(body_mass_g, na.rm = TRUE)
  ) %>%
  select(species, body_mass_g, species_mean_mass) %>%
  head()
Call ungroup() afterwards so that later operations are not applied per group:
penguins %>%
  group_by(species) %>%
  mutate(species_count = n()) %>%
  ungroup()
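A common use of this pattern is comparing each row with its group's average. The sketch below works on a hypothetical `toy` data frame (made-up values) and assumes only that dplyr is installed; the column name `diff_from_mean` is illustrative.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3600, 4000, 5000, 5400)
)

centered <- toy %>%
  group_by(species) %>%
  mutate(
    species_mean_mass = mean(body_mass_g),               # per-group mean on every row
    diff_from_mean    = body_mass_g - species_mean_mass  # row value minus group mean
  ) %>%
  ungroup()

centered$diff_from_mean   # -200 200 -200 200
```

All four input rows are preserved, and a column created earlier in the same mutate() call can be used by a later one.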
Multiple Summaries at Once
penguins %>%
  group_by(species) %>%
  summarize(
    avg = mean(body_mass_g, na.rm = TRUE),
    min = min(body_mass_g, na.rm = TRUE),
    max = max(body_mass_g, na.rm = TRUE),
    n = n()
  )
slice_min() and slice_max()
Keep the top/bottom k rows per group using slice_max() and slice_min():
penguins %>%
  group_by(species) %>%
  slice_max(body_mass_g, n = 1)  # heaviest penguin(s) per species; ties are kept
slice_max(col, n=k) keeps the k rows with the largest values.
slice_min(col, n=k) keeps the k rows with the smallest values.
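By default both functions keep ties, so n = 1 can return more than one row per group. The sketch below shows this on a hypothetical `toy` data frame (made-up values) and uses with_ties = FALSE to force exactly k rows; it assumes only that dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data with a tie for heaviest Adelie
toy <- tibble(
  species     = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(4000, 4000, 3500, 5500, 5200)
)

# Default with_ties = TRUE: both 4000 g Adelie rows are kept
with_ties <- toy %>%
  group_by(species) %>%
  slice_max(body_mass_g, n = 1)
nrow(with_ties)   # 3

# with_ties = FALSE: exactly one row per species
no_ties <- toy %>%
  group_by(species) %>%
  slice_max(body_mass_g, n = 1, with_ties = FALSE)
nrow(no_ties)     # 2
```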
drop_na()
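Remove rows with missing values. drop_na() comes from the tidyr package rather than dplyr, so load tidyr (or the tidyverse) first:
library(tidyr)

penguins %>% drop_na()             # remove any row with ANY NA
penguins %>% drop_na(body_mass_g)  # remove rows where body_mass_g is NA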
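The difference between the two calls is easier to see on a small example. The sketch below uses a hypothetical `toy` data frame (made-up values) and assumes dplyr and tidyr are installed; it also compares drop_na(col) with the dplyr-only alternative filter(!is.na(col)).

```r
library(dplyr)
library(tidyr)   # drop_na() lives in tidyr

# Hypothetical toy data: one NA in sex, one NA in body_mass_g
toy <- tibble(
  species     = c("Adelie", "Gentoo", "Chinstrap"),
  sex         = c("male", NA, "female"),
  body_mass_g = c(3700, 5000, NA)
)

complete_rows <- toy %>% drop_na()
nrow(complete_rows)   # 1: only the Adelie row has no NA at all

mass_known <- toy %>% drop_na(body_mass_g)
nrow(mass_known)      # 2: the Gentoo row stays because only sex is missing

# Same rows using dplyr alone
mass_known_filter <- toy %>% filter(!is.na(body_mass_g))
all.equal(as.data.frame(mass_known), as.data.frame(mass_known_filter))
```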
mutate() vs summarize() — Key Distinction
These two verbs handle grouping differently:
- mutate() → returns same number of rows as input
- summarize() → returns one row per group
# mutate: 344 rows → 344 rows
penguins %>%
  group_by(species) %>%
  mutate(species_avg = mean(body_mass_g, na.rm = TRUE)) %>%
  nrow()

# summarize: 344 rows → 3 rows (one per species)
penguins %>%
  group_by(species) %>%
  summarize(avg = mean(body_mass_g, na.rm = TRUE)) %>%
  nrow()
Use mutate() to add new columns; use summarize() to collapse groups into summaries.
Full Pipeline Example
Combining filter, group_by, summarize, and arrange:
penguins %>%
  filter(!is.na(body_mass_g), !is.na(sex)) %>%
  group_by(species, sex) %>%
  summarize(
    avg_mass = mean(body_mass_g),
    n = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_mass))
This removes rows with missing data, groups by species and sex, computes mean mass and count per group, then sorts by descending average mass.
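To trace each step by hand, you can run the same pipeline on a tiny data frame. The sketch below uses a hypothetical `toy` data frame (made-up values, not the penguins data) and assumes only that dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: the last two rows have a missing sex or mass
toy <- tibble(
  species     = c("Gentoo", "Gentoo", "Adelie", "Adelie", "Adelie", "Gentoo"),
  sex         = c("male", "female", "male", "female", NA, "male"),
  body_mass_g = c(5600, 4800, 4000, 3400, 3600, NA)
)

result <- toy %>%
  filter(!is.na(body_mass_g), !is.na(sex)) %>%   # 6 rows -> 4 rows
  group_by(species, sex) %>%                     # 4 species-sex groups
  summarize(
    avg_mass = mean(body_mass_g),
    n = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_mass))                        # heaviest group first

result
```

Because the incomplete rows are filtered out first, mean() does not need na.rm = TRUE here.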