MODULE 0510 QUESTIONS

Joining & Pivoting Data

ADAPTIVE FLASHCARDS
Flashcard Study Mode
Study this module with spaced repetition. Wrong answers come back weighted heavier.

Joining & Pivoting Data

Why Multiple Tables?

Real data rarely lives in a single table. A school database might have one table for students, another for courses, another for grades. Keeping them separate avoids repetition — you don't need to re-type "Introduction to Statistics" for every student who takes that course.

Joins let you combine tables using a shared key column — a value that appears in both tables and links the rows.

Key concept: When you join tables, you're asking: "For each row in table A, find the matching row(s) in table B." The answer depends on which join type you use.

semi_join() — Filter by Match (No Duplication)

semi_join(x, y) returns all rows from x that have a match in y — but keeps only x's columns. It's the "positive filter" complement to anti_join():

R
1# Keep only students who have a grade recorded
2semi_join(students, grades, by = "student_id")

Unlike inner_join(), semi_join() never duplicates rows even if there are multiple matches in y.

FunctionKeeps rows from x...Adds y's columns?
inner_joinwith a match in yYes
left_joinall rowsYes (NA if no match)
anti_joinwith NO match in yNo
semi_joinwith a match in yNo

When Keys Have Different Names

Both tables have the same concept but different column names? Use join_by():

R
1# Major table uses "Student_Name", Language table uses "Name"
2inner_join(Major, Language,
3 by = join_by(Student_Name == Name))

anti_join: Finding Non-Matches

anti_join is underused but powerful. It returns rows from A that have no match in B:

R
1# Find players who have NOT previously won an award
2anti_join(all_players, past_winners, by = "player_id")

When to use anti_join: When you want to find "what's in A but not in B" — unmatched students, products never ordered, etc.

right_join() and full_join()

While left_join() keeps all rows from the left table:

R
1left_join(students, grades) # all students, NAs for those without grades

right_join() keeps all rows from the right table:

R
1right_join(students, grades) # all students in grades table

Equivalence: right_join(x, y) is the same as left_join(y, x).

full_join() keeps all rows from both tables:

R
1full_join(students, grades) # all students AND all grades, NAs everywhere they don't match

Pivoting: Reshaping Data

Tidy data has one observation per row and one variable per column. Sometimes data arrives in a "wide" format that needs reshaping.

Wide format — each time point is a separate column:

R
1# quiz_scores: student | q1 | q2 | q3

Long (tidy) format — one row per student-quiz combination:

R
1# quiz_scores: student | quiz | score

pivot_longer() — wide to long

R
1quiz_scores %>%
2 pivot_longer(
3 cols = c(q1, q2, q3),
4 names_to = "quiz",
5 values_to = "score"
6 )

pivot_wider() — long to wide

R
1# Long format: child | measurement | value
2children %>%
3 pivot_wider(
4 names_from = measurement,
5 values_from = value
6 )

Why tidy data matters: ggplot2 and dplyr are designed for long/tidy format. A dataset with columns jan, feb, mar can't be plotted easily — you'd need three separate geom calls. After pivot_longer, you have one month column and can use color = month or facet_wrap(vars(month)) automatically.

bind_rows(): Stacking Vertically

When tables have the same columns but different rows, stack them:

R
1group1 <- tibble(id = 1:3, score = c(85, 90, 78))
2group2 <- tibble(id = 4:6, score = c(92, 88, 81))
3
4bind_rows(group1, group2) # append rows vertically

Difference from joins: bind_rows() stacks rows directly. full_join() matches on keys and fills NAs — it's for combining related data from different sources. bind_rows() is for appending separate datasets.

Checking for Duplicate Keys

Before joining, verify that the join key is unique in at least one table. Duplicates create unexpected row multiplication:

R
1# Check before joining
2students %>% count(student_id) %>% filter(n > 1)

If this returns rows, you have duplicate keys — decide whether to aggregate, filter, or use a different join type.

Pivot Longer — Column Selection Options

Select columns to pivot using helper functions:

R
1# Use starts_with() to select columns
2df %>% pivot_longer(cols = starts_with("month_"),
3 names_to = "month", values_to = "value")
4
5# Use a column range
6df %>% pivot_longer(cols = jan:dec,
7 names_to = "month", values_to = "sales")
8
9# Strip a prefix from column names
10df %>% pivot_longer(cols = wk1:wk3,
11 names_to = "week",
12 names_prefix = "wk",
13 values_to = "rank")

names_prefix is useful when column names have a consistent prefix you want to remove in the names_to column.