MODULE 0410 QUESTIONS

dplyr: Data Manipulation

The Pipe: `%>%`

The pipe passes the result on its left-hand side as the first argument of the function on its right. This makes chains of operations readable:

# Without the pipe: nested calls are hard to read
filter(select(penguins, species, body_mass_g), species == "Adelie")

# With the pipe: clean and sequential
penguins %>%
  select(species, body_mass_g) %>%
  filter(species == "Adelie")

Note: R 4.1.0 introduced the native pipe |>, which behaves like %>% in most everyday pipelines. You may see both in course materials and online resources. For a simple call, x |> f() is equivalent to x %>% f().
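
To make the equivalence concrete, here is a minimal sketch that runs the same filter through both pipes. The `toy` data frame is a made-up stand-in for `penguins`, and the native pipe assumes R 4.1 or later.

```r
library(dplyr)

# Toy stand-in for `penguins`; the values are made up for illustration.
toy <- data.frame(
  species     = c("Adelie", "Adelie", "Gentoo"),
  body_mass_g = c(3700, 3800, 5200)
)

# Same call written with the magrittr pipe and with the native pipe
with_magrittr <- toy %>% filter(species == "Adelie")
with_native   <- toy |> filter(species == "Adelie")

# Both pipes expand to filter(toy, species == "Adelie")
identical(with_magrittr, with_native)
```

One difference worth knowing: the native pipe needs an explicit call on the right-hand side (`x |> f()`), while `%>%` also accepts a bare function name (`x %>% f`).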

Column Operations

`select()` — keep or drop columns

penguins %>% select(species, body_mass_g, sex)  # keep only these columns
penguins %>% select(-island)                    # drop island

`mutate()` — create or modify columns

penguins %>%
  mutate(mass_kg = body_mass_g / 1000) %>%
  select(species, mass_kg)

Row Operations

`filter()` — keep rows matching a condition

# Count the Chinstrap rows
penguins %>%
  filter(species == "Chinstrap") %>%
  nrow()

# Combine conditions with &
penguins %>%
  filter(body_mass_g > 5000 & sex == "male") %>%
  select(species, body_mass_g)

`arrange()` — sort rows

penguins %>%
  arrange(desc(body_mass_g)) %>%  # desc() sorts from largest to smallest
  select(species, body_mass_g) %>%
  head(3)

# Back to filter(): match several values at once using %in%
penguins %>%
  filter(species %in% c("Adelie", "Chinstrap"))
# Returns only rows where species is Adelie OR Chinstrap
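
A common slip is to write `==` where `%in%` is needed. The sketch below, on a made-up `toy` data frame, shows how `==` recycles the comparison vector row by row and silently loses matches.

```r
library(dplyr)

# Toy data; the values are made up for illustration.
toy <- data.frame(
  species = c("Adelie", "Chinstrap", "Chinstrap", "Adelie"),
  id      = 1:4
)

# %in% tests every row against the whole set of values
both <- toy %>% filter(species %in% c("Adelie", "Chinstrap"))

# == recycles the right-hand vector along the rows, so rows 3 and 4 are
# compared with "Adelie" and "Chinstrap" in that order and do not match
recycled <- toy %>% filter(species == c("Adelie", "Chinstrap"))

nrow(both)      # 4
nrow(recycled)  # 2
```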

Summarizing Data

`group_by()` + `summarize()`

penguins %>%
  group_by(species) %>%
  summarize(
    avg_mass = mean(body_mass_g, na.rm = TRUE),
    n = n()
  )

`count()` — quick row counts per group

penguins %>% count(species)
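
As a rough illustration, `count()` can be read as shorthand for grouping and then summarizing with `n()`. The `toy` data frame below is made up for the comparison.

```r
library(dplyr)

# Toy data; the values are made up for illustration.
toy <- data.frame(species = c("Adelie", "Adelie", "Gentoo"))

via_count     <- toy %>% count(species)
via_summarize <- toy %>% group_by(species) %>% summarize(n = n())

# Same groups and the same counts either way
identical(via_count$species, via_summarize$species)
identical(via_count$n, via_summarize$n)
```

Adding `sort = TRUE` to `count()` puts the largest groups first.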

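Note: When you group by more than one variable, summarize() drops only the last grouping variable, so the result is still grouped by the remaining ones. (With a single grouping variable, the result comes back ungrouped.) Use .groups = "drop" inside summarize(), or chain ungroup() afterwards, to remove all grouping and avoid unexpected behavior in downstream operations.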

penguins %>%
  group_by(species, island) %>%
  summarize(
    avg_mass = mean(body_mass_g, na.rm = TRUE),
    .groups = "drop"  # removes all grouping from the result
  )
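
One way to see the leftover grouping is to inspect the result with `group_vars()`. This is a small sketch on a made-up `toy` data frame, not the course's own example.

```r
library(dplyr)

# Toy data; the values are made up for illustration.
toy <- data.frame(
  species     = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  island      = c("Dream", "Torgersen", "Biscoe", "Biscoe"),
  body_mass_g = c(3700, 3800, 5200, 5000)
)

# Default behavior: only the last grouping variable (island) is dropped.
# dplyr also prints a message describing the remaining grouping.
still_grouped <- toy %>%
  group_by(species, island) %>%
  summarize(avg_mass = mean(body_mass_g))

group_vars(still_grouped)  # "species"

# With .groups = "drop" the result carries no grouping at all
dropped <- toy %>%
  group_by(species, island) %>%
  summarize(avg_mass = mean(body_mass_g), .groups = "drop")

group_vars(dropped)        # character(0)
```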

`case_when()` — conditional values

penguins %>%
  mutate(size = case_when(
    body_mass_g > 5000 ~ "large",
    body_mass_g > 3500 ~ "medium",
    .default = "small"
  )) %>%
  count(size)

Conditions evaluate in order — the first TRUE match wins.
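
The sketch below shows why the order matters, using a made-up `mass` vector rather than the real data. It also shows that a missing value matches no condition and therefore falls through to `.default`, which may not be what you want. The `.default` argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Made-up masses, including one missing value
mass <- c(5200, 4000, 3000, NA)

# Most specific condition first: 5200 is caught by "> 5000"
right_order <- case_when(
  mass > 5000 ~ "large",
  mass > 3500 ~ "medium",
  .default = "small"
)

# Broader condition first: 5200 already matches "> 3500",
# so the "large" branch is never reached
wrong_order <- case_when(
  mass > 3500 ~ "medium",
  mass > 5000 ~ "large",
  .default = "small"
)

# Handle NA explicitly so missing masses are not labeled "small"
na_aware <- case_when(
  is.na(mass) ~ NA_character_,
  mass > 5000 ~ "large",
  mass > 3500 ~ "medium",
  .default = "small"
)

right_order  # "large"  "medium" "small" "small"
wrong_order  # "medium" "medium" "small" "small"
na_aware     # "large"  "medium" "small" NA
```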

Relocating Columns

Move a column to a different position:

penguins %>%
  relocate(sex, .after = species) %>%
  head()

Renaming Columns

# rename() takes new_name = old_name pairs
penguins %>%
  rename(flipper = flipper_length_mm, mass = body_mass_g) %>%
  head()

group_by() + mutate()

Unlike summarize(), mutate() adds a column while keeping all original rows. Each row gets the per-group value:

penguins %>%
  group_by(species) %>%
  mutate(
    species_mean_mass = mean(body_mass_g, na.rm = TRUE)
  ) %>%
  select(species, body_mass_g, species_mean_mass) %>%
  head()

Then use ungroup() to remove grouping:

penguins %>%
  group_by(species) %>%
  mutate(species_count = n()) %>%
  ungroup()
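
To illustrate why the `ungroup()` step matters, this sketch compares a follow-up `mutate()` with and without it. The `toy` data frame and the `avg` column are made up for the example.

```r
library(dplyr)

# Toy data; the values are made up for illustration.
toy <- data.frame(
  species     = c("Adelie", "Adelie", "Gentoo"),
  body_mass_g = c(3700, 3800, 5200)
)

grouped <- toy %>%
  group_by(species) %>%
  mutate(species_count = n())

# Still grouped: mean() is computed within each species
within_species <- grouped %>%
  mutate(avg = mean(body_mass_g))

# Ungrouped: mean() is computed over all rows
overall <- grouped %>%
  ungroup() %>%
  mutate(avg = mean(body_mass_g))

within_species$avg  # 3750 3750 5200
overall$avg         # every row gets the same overall mean
```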

Multiple Summaries at Once

penguins %>%
  group_by(species) %>%
  summarize(
    avg = mean(body_mass_g, na.rm = TRUE),
    min = min(body_mass_g, na.rm = TRUE),
    max = max(body_mass_g, na.rm = TRUE),
    n = n()
  )

slice_min() and slice_max()

Keep the top/bottom k rows per group using slice_max() and slice_min():

penguins %>%
  group_by(species) %>%
  slice_max(body_mass_g, n = 1)  # heaviest penguin per species

slice_max(col, n=k) keeps the k rows with the largest values.

slice_min(col, n=k) keeps the k rows with the smallest values.
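
One detail to watch: by default these functions keep tied values, so a group can return more than k rows. A minimal sketch on a made-up `toy` data frame, using the `with_ties` argument to control this:

```r
library(dplyr)

# Toy data with a tie for heaviest Adelie; the values are made up.
toy <- data.frame(
  species     = c("Adelie", "Adelie", "Adelie", "Gentoo"),
  body_mass_g = c(3800, 3800, 3500, 5200)
)

# Default with_ties = TRUE: both 3800 g Adelie rows are kept
with_ties <- toy %>%
  group_by(species) %>%
  slice_max(body_mass_g, n = 1)

# with_ties = FALSE returns exactly one row per group
no_ties <- toy %>%
  group_by(species) %>%
  slice_max(body_mass_g, n = 1, with_ties = FALSE)

nrow(with_ties)  # 3
nrow(no_ties)    # 2
```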

drop_na()

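Remove rows with missing values. Note that drop_na() comes from the tidyr package (loaded along with dplyr by library(tidyverse)), not from dplyr itself: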

penguins %>% drop_na()             # remove any row with ANY NA
penguins %>% drop_na(body_mass_g)  # remove rows where body_mass_g is NA
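
The difference between the two calls is easiest to see on a tiny example. The `toy` data frame below is made up so that each row is missing something different.

```r
library(tidyr)  # drop_na() lives in tidyr
library(dplyr)

# Toy data; the values are made up for illustration.
toy <- data.frame(
  species     = c("Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3700, NA, 3600),
  sex         = c("male", "female", NA)
)

# No columns named: a row is dropped if ANY column is NA
complete_rows <- toy %>% drop_na()

# Column named: only NAs in body_mass_g cause a row to be dropped
mass_known <- toy %>% drop_na(body_mass_g)

nrow(complete_rows)  # 1
nrow(mass_known)     # 2
```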

mutate() vs summarize() — Key Distinction

These two verbs handle grouping differently:

- mutate() → returns the same number of rows as the input
- summarize() → returns one row per group

# mutate: 344 rows → 344 rows
penguins %>%
  group_by(species) %>%
  mutate(species_avg = mean(body_mass_g, na.rm = TRUE)) %>%
  nrow()

# summarize: 344 rows → 3 rows (one per species)
penguins %>%
  group_by(species) %>%
  summarize(avg = mean(body_mass_g, na.rm = TRUE)) %>%
  nrow()

Use mutate() to add new columns; use summarize() to collapse groups into summaries.

Full Pipeline Example

Combining filter, group_by, summarize, and arrange:

penguins %>%
  filter(!is.na(body_mass_g), !is.na(sex)) %>%
  group_by(species, sex) %>%
  summarize(
    avg_mass = mean(body_mass_g),
    n = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_mass))

This removes rows with missing data, groups by species and sex, computes mean mass and count per group, then sorts by descending average mass.
