Real data rarely lives in a single table. A school database might have one table for students, another for courses, another for grades. Keeping them separate avoids repetition — you don't need to re-type "Introduction to Statistics" for every student who takes that course.
Joins let you combine tables using a shared key column — a value that appears in both tables and links the rows.
Key concept: When you join tables, you're asking: "For each row in table A, find the matching row(s) in table B." The answer depends on which join type you use.
semi_join(x, y) returns all rows from x that have a match in y — but keeps only x's columns. It's the "positive filter" complement to anti_join():
Unlike inner_join(), semi_join() never duplicates rows even if there are multiple matches in y.
| Function | Keeps rows from x... | Adds y's columns? |
|---|---|---|
| inner_join | with a match in y | Yes |
| left_join | all rows | Yes (NA if no match) |
| anti_join | with NO match in y | No |
| semi_join | with a match in y | No |
Both tables have the same concept but different column names? Use join_by():
anti_join is underused but powerful. It returns rows from A that have no match in B:
When to use anti_join: When you want to find "what's in A but not in B" — unmatched students, products never ordered, etc.
While left_join() keeps all rows from the left table:
right_join() keeps all rows from the right table:
Equivalence: right_join(x, y) is the same as left_join(y, x).
full_join() keeps all rows from both tables:
Tidy data has one observation per row and one variable per column. Sometimes data arrives in a "wide" format that needs reshaping.
Wide format — each time point is a separate column:
Long (tidy) format — one row per student-quiz combination:
pivot_longer() — wide to longpivot_wider() — long to wideWhy tidy data matters: ggplot2 and dplyr are designed for long/tidy format. A dataset with columns jan, feb, mar can't be plotted easily — you'd need three separate geom calls. After pivot_longer, you have one month column and can use color = month or facet_wrap(vars(month)) automatically.
When tables have the same columns but different rows, stack them:
Difference from joins: bind_rows() stacks rows directly. full_join() matches on keys and fills NAs — it's for combining related data from different sources. bind_rows() is for appending separate datasets.
Before joining, verify that the join key is unique in at least one table. Duplicates create unexpected row multiplication:
If this returns rows, you have duplicate keys — decide whether to aggregate, filter, or use a different join type.
Select columns to pivot using helper functions:
names_prefix is useful when column names have a consistent prefix you want to remove in the names_to column.