MODULE 0310 QUESTIONS

Data Visualization with ggplot2

The Grammar of Graphics

ggplot2 is built on a concept called the grammar of graphics: the idea that every plot can be described by a small set of components assembled together. Once you understand the grammar, you can assemble most charts from the same few pieces.

The three required components:

  • Data — the dataframe
  • Aesthetics (aes()) — which columns map to which visual properties
  • Geoms — what shape to draw (points, bars, lines, etc.)

Why ggplot2? Most plotting tools require you to specify low-level details (this pixel, that color). ggplot2 lets you think in terms of your data — "map species to color" — and handles the rendering automatically.

Building a Plot Layer by Layer

Every ggplot starts the same way and grows with +:

R
library(tidyverse)
library(palmerpenguins)  # provides the penguins data set

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

Add layers to make it richer:

R
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g,
                     color = species)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Penguin Flipper Length vs Body Mass",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )
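
Because each + returns a new plot object, a plot can be stored in a variable and extended later. The following is a minimal sketch, assuming penguins comes from the palmerpenguins package; the names base and full are illustrative only.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# Start with data + aesthetics + one geom
base <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE)

# Adding another layer returns a new plot; base itself is unchanged
full <- base +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, na.rm = TRUE)

length(base$layers)  # 1
length(full$layers)  # 2
```

Keeping a base plot in a variable makes it easy to try different layers without retyping the data and aesthetics.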

Variable vs Constant Aesthetics

This distinction trips up almost everyone at first.

Variable aesthetic — inside aes(), maps a data column to a visual property. Each unique value gets a different appearance:

R
geom_point(aes(color = species))  # different color per species

Constant aesthetic — outside aes(), applies the same value to every element:

R
geom_point(color = "red")  # ALL points are red
geom_point(size = 3)       # ALL points size 3

The rule: If the value comes from your data, put it inside aes(). If it's a fixed visual setting, put it outside aes() directly in the geom.

Common mistake — putting a fixed color inside aes():

R
geom_point(aes(color = "blue"))  # WRONG: treats "blue" as a data value
geom_point(color = "blue")       # CORRECT: sets all points blue
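
To see what the mistake actually does, it can help to inspect the colors ggplot2 ends up drawing with. This is a small sketch using layer_data() on a made-up data frame (df and its values are hypothetical); the exact default color depends on the active palette, so only the comparison with "blue" matters here.

```r
library(ggplot2)

# Small illustrative data frame (hypothetical values)
df <- data.frame(x = 1:3, y = c(2, 5, 3))

wrong <- ggplot(df, aes(x, y)) + geom_point(aes(color = "blue"))
right <- ggplot(df, aes(x, y)) + geom_point(color = "blue")

# layer_data() returns the values ggplot2 actually draws with
unique(layer_data(wrong)$colour)  # a default palette color, not "blue"
unique(layer_data(right)$colour)  # "blue"
```

In the wrong version, "blue" is treated as a one-level category, so the points get the first default palette color and a legend entry labeled "blue".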

Global vs Local Aesthetics

Global aesthetics go in ggplot(aes(...)) and are inherited by all geom layers:

R
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +   # inherits x and y
  geom_smooth()    # also inherits x and y, so no error

Local aesthetics go inside a specific geom and only apply there:

R
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species)) +  # color only for points
  geom_smooth()                       # no color grouping on smooth
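
The practical effect of global versus local placement shows up in how many trend lines are drawn. The sketch below assumes penguins comes from the palmerpenguins package and uses a linear fit to keep the comparison simple; global_color and local_color are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# color mapped globally: both layers are grouped by species
global_color <- ggplot(penguins,
                       aes(x = flipper_length_mm, y = body_mass_g,
                           color = species)) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, na.rm = TRUE)

# color mapped locally: only the points are grouped by species
local_color <- ggplot(penguins,
                      aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species), na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, na.rm = TRUE)

# Number of separate trend lines in the smooth layer (layer 2)
length(unique(layer_data(global_color, 2)$group))  # 3
length(unique(layer_data(local_color, 2)$group))   # 1
```

Map color globally when every layer should be split by group, and locally when only one layer should be.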

Choosing the Right Geom

Goal                                  | Geom                        | Notes
--------------------------------------|-----------------------------|---------------------------------
Distribution of a continuous variable | geom_histogram()            | Use binwidth to control bin size
Smooth density curve                  | geom_density()              | Better for comparing groups
Count of categories                   | geom_bar()                  | Counts rows automatically
You have y values already             | geom_col()                  | You supply both x and y
Relationship between two variables    | geom_point()                | Add geom_smooth() for trend
Change over time                      | geom_line()                 | Connects points in order of x
Reference line                        | geom_hline() / geom_vline() | Constant line on plot

geom_bar vs geom_col: This is a frequent exam question. geom_bar() counts rows for you — you only provide x. geom_col() plots y values you already have — you provide both x and y.
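
The two geoms can produce the same chart from different starting points. The sketch below assumes penguins comes from the palmerpenguins package and uses dplyr::count() to build the pre-summarized table; species_counts is an illustrative name.

```r
library(ggplot2)
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# geom_bar(): raw rows in, ggplot2 does the counting
bar_plot <- ggplot(penguins, aes(x = species)) +
  geom_bar()

# geom_col(): count first, then plot the precomputed column n
species_counts <- count(penguins, species)
col_plot <- ggplot(species_counts, aes(x = species, y = n)) +
  geom_col()

# Both plots end up with the same bar heights
layer_data(bar_plot)$y
layer_data(col_plot)$y
```

If the data is already summarized (one row per category), geom_col() is the one to reach for; geom_bar() on that table would draw every bar with height 1.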

Facets: Small Multiples

facet_wrap() splits your plot into separate panels by a variable. Giving each group its own panel is often easier to read than relying on color alone, because overlapping groups no longer hide each other:

R
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200) +
  facet_wrap(vars(species)) +
  labs(title = "Body Mass Distribution by Species")

When to facet: When you have 3+ groups and they overlap so much that a single plot is unreadable. Facets trade space for clarity.
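
Panel layout is adjustable. As one possible variation, the ncol argument of facet_wrap() can stack the panels in a single column so the histograms share one x axis and are easy to compare vertically. This sketch assumes penguins comes from the palmerpenguins package; stacked is an illustrative name.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

stacked <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200, na.rm = TRUE) +
  facet_wrap(vars(species), ncol = 1)  # one column, one panel per species

# One panel per species
length(unique(layer_data(stacked)$PANEL))  # 3
```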

One-Variable Distributions: Histograms & Density

For continuous variables, plot the distribution with a histogram:

R
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)

Control where the bin edges fall with boundary:

R
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200, boundary = 0)
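
One way to check what boundary does is to look at the computed bin edges. In this sketch (assuming penguins comes from the palmerpenguins package), boundary = 0 with binwidth = 200 should place every edge on a multiple of 200; aligned and edges are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

aligned <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200, boundary = 0, na.rm = TRUE)

# Left edge of every bin; with boundary = 0 each is a multiple of 200
edges <- layer_data(aligned)$xmin
head(edges)
```

Aligning edges to round numbers makes the bars easier to read against the axis labels.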

Overlay a smooth density curve:

R
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 200) +
  geom_density(color = "blue", linewidth = 1)
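
The reason for after_stat(density) is that a density curve integrates to 1, so the histogram has to be on the same scale for the overlay to line up. A rough check, assuming penguins comes from the palmerpenguins package (overlay and bars are illustrative names):

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

overlay <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 200, na.rm = TRUE) +
  geom_density(color = "blue", linewidth = 1, na.rm = TRUE)

# On the density scale, bar height times bin width sums to 1
bars <- layer_data(overlay, 1)
sum(bars$density * 200)
```

With the default count scale, the bars would be hundreds of times taller than the curve and the density line would sit flat along the x axis.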

Boxplots for Distributions

R
ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot(aes(fill = species), alpha = 0.7)

Adding Trend Lines

Use geom_smooth() to add a trend line:

R
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # se = FALSE removes confidence band
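
With method = "lm", the line drawn is an ordinary least-squares fit of y on x. The sketch below compares the drawn line with a model fitted directly by lm(); it assumes penguins comes from the palmerpenguins package, and trend, fit, line and from_lm are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

trend <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, na.rm = TRUE)

# The same straight line, fitted directly with lm()
fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

# Compare the drawn line (layer 2) with predictions from the model
line <- layer_data(trend, 2)
from_lm <- predict(fit, newdata = data.frame(flipper_length_mm = line$x))
all.equal(line$y, unname(from_lm))  # TRUE
```

This is useful when you need the slope or intercept as numbers: fit the model yourself and read them from coef(fit).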

Transparency & Shapes

Control transparency with alpha (0 = fully transparent, 1 = fully opaque):

R
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.5, size = 3)  # semi-transparent points

Control point shape with shape:

R
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(shape = species), size = 3)

Labels & Titles

R
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  labs(
    title = "Penguin Flipper Length vs Body Mass",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )

color vs fill

For plots with bars or filled shapes:

  • color colors the border or outline
  • fill colors the interior

R
ggplot(penguins, aes(x = species)) +
  geom_bar(aes(fill = species))     # fill bars by species

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species))  # color the points
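
Both properties can also be set as constants on the same geom. A short sketch, assuming penguins comes from the palmerpenguins package (outlined is an illustrative name):

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

outlined <- ggplot(penguins, aes(x = species)) +
  geom_bar(fill = "grey80", color = "black")  # grey interior, black outline

unique(layer_data(outlined)$fill)    # "grey80"
unique(layer_data(outlined)$colour)  # "black"
```

A common slip is writing color = "grey80" on a bar chart and getting bars with a grey outline and the default dark interior.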

Scales and Themes

Control plot colors and appearance with scale_*() functions and theme():

R
ggplot(penguins, aes(x = species, fill = species)) +
  geom_bar() +
  scale_fill_manual(values = c("darkorange", "purple", "cyan4")) +
  theme_minimal() +
  theme(legend.position = "none")
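
An unnamed values vector is matched to the factor levels by position. As a possible refinement, scale_fill_manual() also accepts a named vector, which ties each color to a level by name so the mapping survives reordering. This sketch assumes penguins comes from the palmerpenguins package; species_colors and pinned are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# Named vector: each color is tied to a species by name, not by position
species_colors <- c(Adelie = "darkorange", Chinstrap = "purple", Gentoo = "cyan4")

pinned <- ggplot(penguins, aes(x = species, fill = species)) +
  geom_bar() +
  scale_fill_manual(values = species_colors) +
  theme_minimal() +
  theme(legend.position = "none")
```

Defining the palette once and reusing it across plots keeps each species the same color throughout a report.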

Common complete themes:

  • theme_minimal() — clean, minimal background
  • theme_classic() — classic x/y axes, no grid
  • theme_bw() — white background with gridlines

Reordering Categories with fct_reorder()

By default, factor levels created from text are ordered alphabetically. Use fct_reorder() from the forcats package to reorder them by another variable:

R
# na.rm = TRUE is passed on to mean() so missing body masses are ignored
ggplot(penguins, aes(x = fct_reorder(species, body_mass_g, .fun = mean, na.rm = TRUE),
                     y = body_mass_g)) +
  geom_boxplot()

fct_reorder(x, y) orders the levels of x by the median of y (the default). Pass .fun = mean to sort by mean instead. This is essential for ranked bar charts and ordered boxplots.
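
The effect on the level order is easiest to see on a tiny example. The data here is made up for illustration (group and value are hypothetical); .desc = TRUE is the fct_reorder() argument for descending order.

```r
library(forcats)

# Small illustrative data (hypothetical values)
group <- factor(c("a", "a", "b", "b", "c", "c"))
value <- c(5, 7, 1, 3, 9, 11)

levels(group)                                    # "a" "b" "c"
levels(fct_reorder(group, value))                # "b" "a" "c" (medians 2, 6, 10)
levels(fct_reorder(group, value, .desc = TRUE))  # "c" "a" "b"
```

Only the level order changes; the underlying values stay attached to the same rows, which is why it is safe to call inside aes().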

Common Exam Plot Mistakes

Watch out for these:

  • Wrong geom: geom_bar() counts rows; geom_col() uses your y values. Know which you need.
  • Aesthetic placement: Fixed values (like colors) go OUTSIDE aes(). Data-driven aesthetics go INSIDE.
  • Confidence bands: geom_smooth() adds a band by default. Use se=FALSE to remove it.
  • Facet syntax: Both facet_wrap(vars(col)) and facet_wrap(~col) work in ggplot2. The vars() style is preferred in tidyverse code, but ~col is not deprecated and won't cause errors.
  • Missing na.rm: Geoms and stats warn when they drop rows with missing values. Add na.rm = TRUE to the layer to drop those rows silently.
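
The na.rm point can be sketched on a small made-up data frame (df and its values are hypothetical). The warning is raised when the plot is drawn, not when it is defined.

```r
library(ggplot2)

# Small illustrative data with one missing y value (hypothetical)
df <- data.frame(x = 1:4, y = c(2, NA, 3, 5))

noisy <- ggplot(df, aes(x, y)) + geom_point()              # warns when drawn
quiet <- ggplot(df, aes(x, y)) + geom_point(na.rm = TRUE)  # drops the row silently
```

Both plots show the same three points; na.rm = TRUE only changes whether ggplot2 tells you about the dropped row, so use it once you have confirmed the missing values are expected.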