MODULE 0510 QUESTIONS

Joining & Pivoting Data

ADAPTIVE FLASHCARDS

Flashcard Study Mode

Study this module with spaced repetition. Wrong answers come back weighted heavier.

Joining & Pivoting Data

Why Multiple Tables?

Real data rarely lives in a single table. A school database might have one table for students, another for courses, another for grades. Keeping them separate avoids repetition — you don't need to re-type "Introduction to Statistics" for every student who takes that course.

Joins let you combine tables using a shared key column — a value that appears in both tables and links the rows.

Key concept: When you join tables, you're asking: "For each row in table A, find the matching row(s) in table B." The answer depends on which join type you use.

semi_join() — Filter by Match (No Duplication)

semi_join(x, y) returns all rows from x that have a match in y — but keeps only x's columns. It's the "positive filter" complement to anti_join():

1# Keep only students who have a grade recorded

2semi_join(students, grades, by = "student_id")

Unlike inner_join(), semi_join() never duplicates rows even if there are multiple matches in y.

Function	Keeps rows from x...	Adds y's columns?
inner_join	with a match in y	Yes
left_join	all rows	Yes (NA if no match)
anti_join	with NO match in y	No
semi_join	with a match in y	No

When Keys Have Different Names

Both tables have the same concept but different column names? Use join_by():

1# Major table uses "Student_Name", Language table uses "Name"

2inner_join(Major, Language,

3 by = join_by(Student_Name == Name))

anti_join: Finding Non-Matches

anti_join is underused but powerful. It returns rows from A that have no match in B:

1# Find players who have NOT previously won an award

2anti_join(all_players, past_winners, by = "player_id")

When to use anti_join: When you want to find "what's in A but not in B" — unmatched students, products never ordered, etc.

right_join() and full_join()

While left_join() keeps all rows from the left table:

1left_join(students, grades) # all students, NAs for those without grades

right_join() keeps all rows from the right table:

1right_join(students, grades) # all students in grades table

Equivalence: right_join(x, y) is the same as left_join(y, x).

full_join() keeps all rows from both tables:

1full_join(students, grades) # all students AND all grades, NAs everywhere they don't match

Pivoting: Reshaping Data

Tidy data has one observation per row and one variable per column. Sometimes data arrives in a "wide" format that needs reshaping.

Wide format — each time point is a separate column:

1# quiz_scores: student | q1 | q2 | q3

Long (tidy) format — one row per student-quiz combination:

1# quiz_scores: student | quiz | score

`pivot_longer()` — wide to long

1quiz_scores %>%

2 pivot_longer(

3 cols = c(q1, q2, q3),

4 names_to = "quiz",

5 values_to = "score"

6 )

`pivot_wider()` — long to wide

1# Long format: child | measurement | value

2children %>%

3 pivot_wider(

4 names_from = measurement,

5 values_from = value

6 )

Why tidy data matters: ggplot2 and dplyr are designed for long/tidy format. A dataset with columns jan, feb, mar can't be plotted easily — you'd need three separate geom calls. After pivot_longer, you have one month column and can use color = month or facet_wrap(vars(month)) automatically.

bind_rows(): Stacking Vertically

When tables have the same columns but different rows, stack them:

1group1 <- tibble(id = 1:3, score = c(85, 90, 78))

2group2 <- tibble(id = 4:6, score = c(92, 88, 81))

4bind_rows(group1, group2) # append rows vertically

Difference from joins: bind_rows() stacks rows directly. full_join() matches on keys and fills NAs — it's for combining related data from different sources. bind_rows() is for appending separate datasets.

Checking for Duplicate Keys

Before joining, verify that the join key is unique in at least one table. Duplicates create unexpected row multiplication:

1# Check before joining

2students %>% count(student_id) %>% filter(n > 1)

If this returns rows, you have duplicate keys — decide whether to aggregate, filter, or use a different join type.

Pivot Longer — Column Selection Options

Select columns to pivot using helper functions:

1# Use starts_with() to select columns

2df %>% pivot_longer(cols = starts_with("month_"),

3 names_to = "month", values_to = "value")

5# Use a column range

6df %>% pivot_longer(cols = jan:dec,

7 names_to = "month", values_to = "sales")

9# Strip a prefix from column names

10df %>% pivot_longer(cols = wk1:wk3,

11 names_to = "week",

12 names_prefix = "wk",

13 values_to = "rank")

names_prefix is useful when column names have a consistent prefix you want to remove in the names_to column.

Joining & Pivoting Data

Joining & Pivoting Data

Why Multiple Tables?

semi_join() — Filter by Match (No Duplication)

When Keys Have Different Names

anti_join: Finding Non-Matches

right_join() and full_join()

Pivoting: Reshaping Data

pivot_longer() — wide to long

pivot_wider() — long to wide

bind_rows(): Stacking Vertically

Checking for Duplicate Keys

Pivot Longer — Column Selection Options

`pivot_longer()` — wide to long

`pivot_wider()` — long to wide