Visualizing Proportions
Quick Review
Review
What are the core elements of ggplot2 grammar?
From last week: * Mappings: base layer + ggplot()
and aes()
* Scales: control and modify your mappings + e.g., scale_x_continuous()
and scale_fill_manual()
* Geoms: plot elements + e.g., geom_point()
and geom_line()
-
Facets: panel your plot
-
facet_wrap()
andfacet_grid()
-
-
Themes: style your figure
- Built-in: e.g.,
theme_classic()
- Manual:
theme()
(legend
,strip
,axis
,plot
,panel
)
- Built-in: e.g.,
Quick Review
Colorblindness and accessible plots
- Adding in a colorblind-friendly palette from Wong (2011)
Code
cbsafe_pal <- tribble(
~name, ~rgb
, "black", c(0, 0, 0)
, "sky blue", c(86, 180, 233)
, "bluish green", c(0, 158, 115)
, "yellow", c(240, 228, 66)
, "orange", c(230, 159, 0)
, "blue", c(0, 114, 178)
, "vermillion", c(213, 94, 0)
, "reddish purple", c(204, 121, 167)
) %>%
mutate(hex = map_chr(rgb, function(x) rgb(x[1], x[2], x[3], maxColorValue = 255)))
cbsafe_pal
Visualizating Proportions
- Proportions are often important in our research
- From describing sample-level differences to describing the frequency of behaviors / events / experiences, etc., we often reach toward describing amounts relative to the whole
- But the goals we are trying to achieve are varied, which necesssitates the use of different graphics
Agenda
- We will cover X kinds of ways of visualizations, all of which were covered in your readings
- We will cover both when to use them and how to create them
- Pie Charts
- Bar Charts (Stacked)
- Bar Charts (Side-by-Side)
- Bar Charts and Density Across Continuous Variables
- Mosaic Plots
- Parallel Sets
But First, Our Data
- Today, we’ll use the teaching sample from the German Socioeconomic Panel Study (GSOEP)
- GSOEP is an ongoing longitudinal panel study that began in 1984 (26 waves of data!)
- ~20,000 people are sampled each year
- Samples households in Germany
- Has additional sub-projects (e.g., innovation studies, migrant panel, etc.)
- The data are publicly available via application
Pie Charts
- You may be wondering if you should ever use a pie chart
- The answer is, of course, it depends
- Pie charts are great when:
- What you want to visualize is simple (e.g., basic fractions)
- You want to clearly emphasize proportion relative to the whole
- You have a small data set
- In our data, we have a few variables that follow this, but we’ll focus on one:
- marital status (4 groups)
-
ggplot2
doesn’t specifically support pie charts - Why? Because it’s a layered grammar of graphics and an explicit function for it would be redundant with some of the built in coordinates
- specifically,
coord_polar()
- specifically,
- So to make a pie chart, we’ll use
geom_bar() + coord_polar()
Basic Syntax
Code
gsoep %>%
filter(year == 2009 & !is.na(marital)) %>% # random
group_by(marital) %>%
tally() %>%
mutate(marital = factor(
marital
, 1:4
, c("Married", "Separated", "Widowed", "Never Married")
)) %>%
ggplot(aes(x = "", y = n, fill = marital)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0) +
theme_void()
Improvements: Slice Labels and Colors
Code
gsoep %>%
filter(year == 2009 & !is.na(marital)) %>% # random
group_by(marital) %>%
tally() %>%
mutate(marital = factor(
marital
, 1:4
, c("Married", "Separated", "Widowed", "Never Married")
)) %>%
arrange(desc(marital)) %>%
mutate(prop = n / sum(n) * 100
, ypos = cumsum(prop)- 0.5*prop) %>%
ggplot(aes(x = "", y = prop, fill = marital)) +
geom_bar(stat = "identity", width = 1, color = "white") +
geom_text(
aes(y = ypos, label = marital)
, color = "white"
, size=4
) +
scale_fill_manual(values = cbsafe_pal$hex[c(2, 8, 3, 4)]) +
coord_polar("y", start = 0) +
theme_void() +
theme(legend.position = "none")
More Improvements: Title and Story-Congruent Colors
Code
gsoep %>%
filter(year == 2009 & !is.na(marital)) %>% # random
group_by(marital) %>%
tally() %>%
mutate(marital = factor(marital, 1:4, c("Married", "Separated", "Widowed", "Never Married"))) %>%
arrange(desc(marital)) %>%
mutate(prop = n / sum(n) * 100
, ypos = cumsum(prop)- 0.5*prop) %>%
ggplot(aes(x = "", y = prop, fill = marital)) +
geom_bar(stat = "identity", width = 1, color = "black") +
geom_label(
aes(y = ypos, label = marital)
, color = "white"
, size = 6
, fontface = 2) +
scale_fill_manual(values = c(rev(brewer.pal(9,"Greens")[c(4,6,8)]), "grey60")) +
coord_polar("y", start = 0) +
labs(
title = "In 2009, the majority of GSOEP participants\nwere or had been married/partnered"
) +
theme_void() +
theme(
legend.position = "none"
, plot.title = element_text(face = "bold.italic", size = rel(1.4), hjust = .5)
)
Stacked Bar Charts
- Like pie charts, stacked bar charts have their time and place
- In particular:
- Show proportions relative to the total
- Can be used to show changes over time
- To demonstrate, let’s look at marital status across emerging adulthood (18-26)
Basic Syntax
Code
gsoep %>%
filter(age %in% 18:26 & !is.na(marital)) %>%
group_by(age, marital) %>%
tally() %>%
group_by(age) %>%
mutate(
marital = factor(
marital
, 1:4
, c("Married", "Separated", "Widowed", "Never Married")
)
, age = factor(age)
, prop = n/sum(n)
) %>%
ggplot(aes(x = age, y = prop, fill = marital)) +
geom_bar(stat = "identity", color = "black") +
theme_classic()
Improvements: Color
Code
gsoep %>%
filter(age %in% 18:26 & !is.na(marital)) %>%
group_by(age, marital) %>%
tally() %>%
group_by(age) %>%
mutate(marital = factor(marital, seq(4,1,-1), rev(c("Married", "Separated", "Widowed", "Never Married")))
, age = factor(age)
, prop = n/sum(n)) %>%
ggplot(aes(x = age, y = prop, fill = marital)) +
geom_bar(stat = "identity", color = "black") +
scale_fill_manual(values = c("grey80",brewer.pal(9,"Greens")[c(2,4,6)])) +
theme_classic()
Improvements: Label & Scales
Code
gsoep %>%
filter(age %in% 18:26 & !is.na(marital)) %>%
group_by(age, marital) %>%
tally() %>%
group_by(age) %>%
mutate(
marital = factor(
marital
, seq(4,1,-1)
, rev(c("Married", "Separated", "Widowed", "Never Married"))
)
, age = factor(age)
, prop = n/sum(n)
) %>%
ggplot(aes(x = age, y = prop, fill = marital)) +
geom_bar(stat = "identity", color = "black") +
scale_fill_manual(values = c("grey80",brewer.pal(9,"Greens")[c(2,4,6)])) +
scale_y_continuous(
limits = c(0,1)
, breaks = seq(0, 1, .25)
, labels = c("0%", "25%", "50%", "75%", "100%")
) +
labs(
x = "Age"
, y = "Percent of Sample"
, title = "Rates of relationships increase in emerging adulthood"
, subtitle = "But most remain unpartnered by 26"
) +
theme_classic()
Improvements: Legend
Code
gsoep %>%
filter(age %in% 18:26 & !is.na(marital)) %>%
group_by(age, marital) %>%
tally() %>%
group_by(age) %>%
mutate(marital = factor(marital, seq(4,1,-1), rev(c("Married", "Separated", "Widowed", "Never Married")))
, age = factor(age)
, prop = n/sum(n)) %>%
ggplot(aes(x = age, y = prop, fill = marital)) +
geom_bar(stat = "identity", color = "black") +
scale_fill_manual(values = c("grey80",brewer.pal(9,"Greens")[c(2,4,6)])) +
scale_y_continuous(
limits = c(0,1)
, breaks = seq(0, 1, .25)
, labels = c("0%", "25%", "50%", "75%", "100%")
) +
annotate("text", x = "26", y = .60, label = "Never Married", angle = 90) +
annotate("text", x = "26", y = .13, label = "Married", angle = 90, color = "white") +
labs(
x = "Age"
, y = "Percent of Sample"
, title = "Rates of relationships increase in emerging adulthood"
, subtitle = "But most remain unpartnered by 26"
, fill = NULL
) +
theme_classic() +
theme(legend.position = "bottom")
Improvements: Theme Elements Exercise
- Bold axis text and increase size
- Bold axis titles and increase size
- Bold title and subtitle and center (hint, you will also need to wrap the title text)
(Answers)
Code
gsoep %>%
filter(age %in% 18:26 & !is.na(marital)) %>%
group_by(age, marital) %>%
tally() %>%
group_by(age) %>%
mutate(marital = factor(marital, seq(4,1,-1), rev(c("Married", "Separated", "Widowed", "Never Married")))
, age = factor(age)
, prop = n/sum(n)) %>%
ggplot(aes(x = age, y = prop, fill = marital)) +
geom_bar(stat = "identity", color = "black") +
scale_fill_manual(values = c("grey80",brewer.pal(9,"Greens")[c(2,4,6)])) +
scale_y_continuous(
limits = c(0,1)
, breaks = seq(0, 1, .25)
, labels = c("0%", "25%", "50%", "75%", "100%")
) +
annotate("text", x = "26", y = .60, label = "Never Married", angle = 90) +
annotate("text", x = "26", y = .13, label = "Married", angle = 90, color = "white") +
labs(
x = "Age"
, y = "Percent of Sample"
, title = "Rates of relationships increase in\nemerging adulthood"
, subtitle = "But most remain unpartnered by 26"
, fill = NULL
) +
theme_classic() +
theme(
legend.position = "bottom"
, axis.text = element_text(face = "bold", size = rel(1.1))
, axis.title = element_text(face = "bold", size = rel(1.1))
, plot.title = element_text(face = "bold", size = rel(1.2), hjust = .5)
, plot.subtitle = element_text(face = "italic", size = rel(1.1), hjust = .5)
)
Side-by-Side Bar Charts
- Stacked bar charts are great for showing sequences but can make it difficult to compare within a stack
- Side-by-side bar charts make it much easier to compare across categories and work well when broken into many categories
- But they can be difficult to understand across sequences
- To demonstrate, let’s look at marriage rates across three waves
Basic Syntax
Code
gsoep %>%
filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
group_by(year, marital) %>%
tally() %>%
mutate(marital = factor(marital, 1:4, c("Married", "Separated", "Widowed", "Never Married"))) %>%
group_by(year) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = year, y = prop, fill = marital)) +
geom_bar(stat = "identity", color = "black", position = "dodge") +
theme_classic()
Improvements: Order
Code
gsoep %>%
filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
group_by(year, marital) %>%
tally() %>%
mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>%
group_by(year) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = year, y = prop, fill = marital)) +
geom_bar(stat = "identity", color = "black", position = "dodge") +
theme_classic()
Improvements: Labels
We could label the bars, but let’s label the axes instead
Code
gsoep %>%
filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
group_by(year, marital) %>%
tally() %>%
mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>%
group_by(year) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = marital, y = prop, fill = marital)) +
geom_bar(stat = "identity", color = "black", position = "dodge") +
scale_y_continuous(
limits = c(0,.7), breaks = seq(0,.7, .2), labels = c("0%", "20%", "40%", "60%")
) +
facet_grid(~year) +
labs(
x = NULL
, y = "Percentage of Participants"
, title = "Marital Status Has Remained Consistent Throughout the 21st Century"
) +
theme_classic() +
theme(
legend.position = "none"
, axis.text.x = element_text(angle = 45, hjust = 1)
)
Improvements: Theme Elements
Let’s label and improve the theme elements
Code
gsoep %>%
filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
group_by(year, marital) %>%
tally() %>%
mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>%
group_by(year) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = marital, y = prop, fill = marital)) +
geom_bar(stat = "identity", color = "black", position = "dodge") +
scale_y_continuous(
limits = c(0,.7), breaks = seq(0,.7, .2), labels = c("0%", "20%", "40%", "60%")
) +
facet_grid(~year) +
labs(
x = NULL
, y = "Percentage of Participants"
, title = "Marital Status Has Remained Consistent\nThroughout the 21st Century"
) +
theme_classic() +
theme(
legend.position = "none"
, axis.text = element_text(face = "bold", size = rel(1.2))
, axis.text.x = element_text(angle = 45, hjust = 1, size = rel(1))
, axis.title = element_text(face = "bold", size = rel(1.2))
, strip.background = element_rect(fill = "grey90", color = "black")
, strip.text = element_text(face = "bold", size = rel(1.2))
, plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
)
Improvements: Colors (Exercise)
Exercise: * Improve the colors by making them: + Colorblind-friendly + Match the goal of the plot (see title)
(Answers)
Code
gsoep %>%
filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
group_by(year, marital) %>%
tally() %>%
mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>%
group_by(year) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = marital, y = prop, fill = marital)) +
geom_bar(stat = "identity", color = "black", position = "dodge") +
scale_y_continuous(
limits = c(0,.7), breaks = seq(0,.7, .2), labels = c("0%", "20%", "40%", "60%")
) +
scale_fill_manual(values = cbsafe_pal$hex[2:5]) +
facet_grid(~year) +
labs(
x = NULL
, y = "Percentage of Participants"
, title = "Marital Status Has Remained Consistent\nThroughout the 21st Century"
) +
theme_classic() +
theme(
legend.position = "none"
, axis.text = element_text(face = "bold", size = rel(1.2))
, axis.text.x = element_text(angle = 45, hjust = 1, size = rel(1))
, axis.title = element_text(face = "bold", size = rel(1.2))
, strip.background = element_rect(fill = "grey90", color = "black")
, strip.text = element_text(face = "bold", size = rel(1.2))
, plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
)
Bar Charts and Density Across Continuous Variables
- One challenge with stacked bar charts is that when there are more than two categories, it can be very difficult to track the visualized trend
- Relative to side-by-side bar charts, it’s easy to see any category relative to the total but somewhat more difficult to also account for differing numbers of people in different categories or across time
- One possible solution to this is to look at densities across time and groups or relative to the total
- Let’s do both now
Stacked Area Charts
But first, remember stacked bar charts? Stacked area charts are sort of an extension of those:
Code
gsoep %>%
filter(age %in% c(20, 30, 40, 50, 60, 70, 80) & !is.na(SRhealth)) %>% # random
group_by(age, SRhealth) %>%
tally() %>%
mutate(SRhealth = factor(SRhealth, seq(5,1,-1), c("Very good", "Good", "Satisfactory", "Poor", "Bad"))) %>%
group_by(age) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = age, y = prop, fill = SRhealth)) +
geom_bar(stat = "identity", color = "black") +
scale_fill_manual(values = cbsafe_pal$hex[2:6]) +
theme_classic()
But without the bars separating them.
Code
gsoep %>%
filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
group_by(age, SRhealth) %>%
tally() %>%
mutate(SRhealth = factor(SRhealth, seq(5,1,-1), c("Very good", "Good", "Satisfactory", "Poor", "Bad"))) %>%
group_by(age) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = age, y = prop, fill = SRhealth)) +
geom_area() +
# scale_fill_manual(values = cbsafe_pal$hex[2:6]) +
theme_classic()
Improvements: Color
Code
gsoep %>%
filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
group_by(age, SRhealth) %>%
tally() %>%
mutate(SRhealth = factor(SRhealth, seq(5,1,-1), c("Very good", "Good", "Satisfactory", "Poor", "Bad"))) %>%
group_by(age) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = age, y = prop, fill = SRhealth)) +
geom_area(color = "white", alpha = .6) +
scale_fill_viridis_d() +
theme_classic()
Improvements: Color Labels
Code
gsoep %>%
filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
group_by(age, SRhealth) %>%
tally() %>%
mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
group_by(age) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = age, y = prop, fill = SRhealth)) +
geom_area(color = "white", alpha = .6) +
annotate("text", x = 85, y = .95, label = "Bad", color = "white", fontface = 2) +
annotate("text", x = 75, y = .80, label = "Poor", color = "white", fontface = 2) +
annotate("text", x = 62, y = .55, label = "Satisfactory", color = "white", fontface = 2) +
annotate("text", x = 43, y = .3, label = "Good", color = "black", fontface = 2) +
annotate("text", x = 30, y = .07, label = "Very Good", color = "black", fontface = 2) +
scale_fill_viridis_d() +
theme_classic() +
theme(legend.position = "none")
Improvements: Theme Elements
Code
gsoep %>%
filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
group_by(age, SRhealth) %>%
tally() %>%
mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
group_by(age) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = age, y = prop, fill = SRhealth)) +
geom_area(color = "white", alpha = .6) +
annotate("text", x = 85, y = .95, label = "Bad", color = "white", fontface = 2) +
annotate("text", x = 75, y = .80, label = "Poor", color = "white", fontface = 2) +
annotate("text", x = 62, y = .55, label = "Satisfactory", color = "white", fontface = 2) +
annotate("text", x = 43, y = .3, label = "Good", color = "black", fontface = 2) +
annotate("text", x = 30, y = .07, label = "Very Good", color = "black", fontface = 2) +
scale_fill_viridis_d() +
theme_classic() +
theme(legend.position = "none"
, axis.text = element_text(face = "bold", size = rel(1.1))
, axis.title = element_text(face = "bold", size = rel(1.1))
, plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
)
Improvements: Labels and Title (Exercise)
Exercise:
- Add plot title
- Change
x
andy
scale labels and titles
(Answers)
Code
gsoep %>%
filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
group_by(age, SRhealth) %>%
tally() %>%
mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
group_by(age) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = age, y = prop, fill = SRhealth)) +
geom_area(color = "white", alpha = .6) +
annotate("text", x = 85, y = .95, label = "Bad", color = "white", fontface = 2) +
annotate("text", x = 75, y = .80, label = "Poor", color = "white", fontface = 2) +
annotate("text", x = 62, y = .55, label = "Satisfactory", color = "white", fontface = 2) +
annotate("text", x = 43, y = .3, label = "Good", color = "black", fontface = 2) +
annotate("text", x = 30, y = .07, label = "Very Good", color = "black", fontface = 2) +
scale_fill_viridis_d() +
theme_classic() +
theme(legend.position = "none"
, axis.text = element_text(face = "bold", size = rel(1.1))
, axis.title = element_text(face = "bold", size = rel(1.1))
, plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
)
Improvements: Labels and Title
Code
gsoep %>%
filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
group_by(age, SRhealth) %>%
tally() %>%
mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
group_by(age) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = age, y = prop, fill = SRhealth)) +
geom_area(color = "white", alpha = .6) +
annotate("text", x = 85, y = .95, label = "Bad", color = "white", fontface = 2) +
annotate("text", x = 75, y = .80, label = "Poor", color = "white", fontface = 2) +
annotate("text", x = 62, y = .55, label = "Satisfactory", color = "white", fontface = 2) +
annotate("text", x = 43, y = .3, label = "Good", color = "black", fontface = 2) +
annotate("text", x = 30, y = .07, label = "Very Good", color = "black", fontface = 2) +
scale_x_continuous(limits = c(18, 100), breaks = seq(20, 100, 10)) +
scale_y_continuous(limits = c(0,1), breaks = seq(0,1, .25), labels = c("0%", "25%", "50%", "75%", "100%")) +
scale_fill_viridis_d() +
labs(
x = "Age (Years)"
, y = "Percentage of Participants"
, title = "Levels of Self-Rated Health Decrease Across the Lifespan"
) +
theme_classic() +
theme(legend.position = "none"
, axis.text = element_text(face = "bold", size = rel(1.1))
, axis.title = element_text(face = "bold", size = rel(1.1))
, plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
)
Total Density Plots
- Let’s revisit these data but also demonstrating how sample size changes across the lifespan
- To do this, we need two pieces of information:
- sample size in each self-rated health category at each age group
- total in each age group
Let’s start by using stat_smooth()
to get a smoothed geom_area()
of the total sample size onto the figure
Code
gsoep %>%
filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
group_by(age, SRhealth) %>%
tally() %>%
mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
group_by(age) %>%
mutate(total_n = sum(n)) %>%
ggplot(aes(x = age, y = n)) +
stat_smooth(
aes(y = total_n)
, geom = 'area'
, method = 'loess'
, span = 1/3
, alpha = .8
, fill = "grey"
) +
facet_grid(~SRhealth) +
theme_classic()
Then add the area for each ordinal level of self-rated health.
Code
gsoep %>%
filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
group_by(age, SRhealth) %>%
tally() %>%
mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
group_by(age) %>%
mutate(total_n = sum(n)) %>%
ggplot(aes(x = age, y = n)) +
stat_smooth(aes(y = total_n), geom = 'area', method = 'loess'
, span = 1/3, alpha = .8, fill = "grey") +
stat_smooth(
aes(fill = SRhealth)
, geom = 'area'
, method = 'loess'
, span = 1/3
, alpha = .8
) +
annotate("text", x = 45, y = 3000, label = "Total") +
facet_grid(~SRhealth) +
theme_classic() +
theme(legend.position = "none")
Let’s not belabor this too much.
Code
gsoep %>%
filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
group_by(age, SRhealth) %>%
tally() %>%
mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
group_by(age) %>%
mutate(total_n = sum(n)) %>%
ggplot(aes(x = age, y = n)) +
stat_smooth(aes(y = total_n), geom = 'area', method = 'loess'
, span = 1/3, alpha = .8, fill = "grey") +
stat_smooth(
aes(fill = SRhealth)
, geom = 'area'
, method = 'loess'
, span = 1/3
, alpha = .8
) +
scale_x_continuous(limits = c(18, 100), breaks = seq(20, 100, 10)) +
scale_fill_viridis_d() +
annotate("text", x = 45, y = 3000, label = "Total") +
labs(
x = "Age (Years)"
, y = "Number of People"
, title = "Good Self-Rated Health Decreases Across the Lifespan"
, subtitle = "But bad decreases less, likely because all-cause sample drop-out"
) +
facet_grid(~SRhealth) +
theme_classic() +
theme(legend.position = "none"
, axis.text = element_text(face = "bold", size = rel(1.1))
, axis.title = element_text(face = "bold", size = rel(1.1))
, plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
, plot.subtitle = element_text(face = "italic", size = rel(1), hjust = .5)
, strip.background = element_rect(fill = "grey90", color = "black")
, strip.text = element_text(face = "bold", size = rel(1.2))
)
Nested Proportions
- Sometimes, the proportions that we want to visualize are more complex and can’t just be simply binned
- In such cases, there may be hierarchical relationships among the categories
- Today, we’ll cover two core nested proportion plots:
- Mosaic plots
- Parallel Sets
- To do this, we’ll use 2-Digit NACE Industry Sector codes from participants’ last reported jobs in the SOEP, which I’ve broken down into 9 higher-order categories
- This is a lot of categories, so we’ll further eventually exclude categories that don’t have at least 2% of the share of participants
Code
library(kableExtra)
jobs <- read_csv("https://raw.githubusercontent.com/emoriebeck/psc290-data-viz-2022/main/03-week3-proportions/05-job-codes.csv")
jobs %>%
select(cat, job, old) %>%
arrange(cat, old) %>%
kable(.
, "html"
, col.names = c("Category", "Job", "Code")
, caption = "2-Digit NACE Industry Sector Codes and Categories") %>%
kable_classic(full_width = F, html_font = "Times New Roman") %>%
collapse_rows(1, valign = "top") %>%
scroll_box(height = "500px")
Category | Job | Code |
---|---|---|
Agriculture | Agriculture Hunting Rel.Serv.Activities | 1 |
Forestry Logging Rel.Service activities | 2 | |
Fishing Fish Hatcheries Fish Farms | 5 | |
Energy and Utilities | Mining Coal Lignite; Extraction Of Peat | 10 |
Extraction Crude Petroleum Natural Gas | 11 | |
Mining Of Uranium And Thorium Ores | 12 | |
Mining Of Metal Ores | 13 | |
Other Mining And Quarrying | 14 | |
Recycling | 37 | |
Electricity Gas Steam Hot Water Supply | 40 | |
Sewage Refuse Disposal Sanitationa.a.Re | 90 | |
Finance and Tech | Financ.Intermediat. Exc.Insur. Pens.Fund | 65 |
Insurance Pens.Funding Ex.Compuls.SocSe | 66 | |
Activ.Aux.To Financial Intermediation | 67 | |
Computer And Related Activities | 72 | |
Research And Development | 73 | |
Other Business Activities | 74 | |
Industry - NEC | 96 | |
Manufacturing | Manuf Food Products And Beverages | 15 |
Manuf Tobacco Products | 16 | |
Manuf Textiles | 17 | |
Manuf Wear. Apparel; Dressing Dyeing Fur | 18 | |
Tanning Dress.Leather; luggage Footwear | 19 | |
Manuf Wood Products Except Furniture | 20 | |
Manuf Pulp Paper And Paper Products | 21 | |
Manuf Coke Ref.Petroleum Nuclear Fuel | 23 | |
Manuf Chemicals And Chemical Products | 24 | |
Manuf Rubber And Plastic Products | 25 | |
Manuf Other Non-metallic Mineral Product | 26 | |
Manuf Basic Metals | 27 | |
Manuf Fabric.Metal Prod. Ex.Machin. Equi | 28 | |
Manuf Machinery And Equipment NEC | 29 | |
Manuf Office Machinery And Computers | 30 | |
Manuf Electrical Machinery Apparatus NE | 31 | |
Manuf Radio Television Communic.Equipmen | 32 | |
Manuf Medical Precision Optical Instrum. | 33 | |
Manuf Motor Vehicles Trailers Semi-tr. | 34 | |
Manuf Other Transport Equipment | 35 | |
Manuf Furniture; Manufacturing NEC | 36 | |
Collection Purification Distrib.Of Water | 41 | |
Handcraft Trade - NEC | 97 | |
Manufacturing - NEC | 100 | |
Other | Private Households With Employed Persons | 95 |
Extra-territorial Organizations.a.Bodies | 99 | |
Public Service | Publ.Administr. Defense; Compuls.SocSec | 75 |
Education | 80 | |
Health And Social Work | 85 | |
Activit.of.Membership Organizations NEC. | 91 | |
Other Service Activities | 93 | |
Sales and Service | Publishing Printing Recorded Media | 22 |
Construction | 45 | |
Sale Maint Rep.Mot.Vehicles;Ret.Sale Fue | 50 | |
Wholesale Commission Trade Exc.Mot.Vehic | 51 | |
Retail Trade Exc.Mot.Vehic;Mot.Cyc Repai | 52 | |
Hotels And Restaurants | 55 | |
Post And Telecommunications | 64 | |
Real Estate Property Activities | 70 | |
Rent.Machinery Equip Wo.Oper. P. HH Good | 71 | |
Recreational Cultural Sporting Activity | 92 | |
Services - NEC | 98 | |
Transportation | Land Transport; Transport Via Pipelines | 60 |
Water Transport | 61 | |
Air Transport | 62 | |
Supporting Aux.Transp.Activ;Trav.Agencie | 63 |
Mosaic Plots
- Unlike bar charts, mosaic plots allow us to index relative areas, sizes, proportions, etc. relative to two dimensions (so not just amount)
- So in our example, this will let us see relative differences within categories vertically and across categories horizontally
- To build this, we will finally leave the basic
ggplot2
package and use theggmosaic
package - There are other packages, but we’ll use this one because (1) it’s great and (2) it let’s us still use everything we’ve learned about ggplot
But first, the data:
Code
if(!"ggmosaic" %in% installed.packages()) install.packages("ggmosaic")
library(ggmosaic)
gsoep_jobs <- gsoep %>%
mutate(age_gr = mapvalues(age, 20:99, rep(seq(20, 90, 10), each = 10))) %>%
filter(!is.na(age_gr) & age >= 20 & age < 100) %>%
group_by(SID) %>%
filter(!is.na(job)) %>%
filter(age_gr == max(age_gr)) %>%
group_by(SID, age_gr) %>%
summarize(job = max(job)) %>%
ungroup() %>%
rename(code = job) %>%
left_join(jobs %>% rename(code = old)) %>%
group_by(code) %>%
filter(n() / nrow(.) >= .02) %>%
ungroup()
gsoep_jobs
- Let’s say, for example, that we think that some professions may restrict certain age groups due to experience (younger age groups) or functional limitations (older age groups)
- We could look at this simply as a stacked bar chart, but it wouldn’t clarify that there are different proportions of people in each job category
Code
Improvements: Let’s polish it
Code
gsoep_jobs %>%
ggplot() +
geom_mosaic(aes(x = product(age_gr), fill = cat)) +
labs(
x = "Age Group (Decades)"
, title = "There are small changes in category of professions across the lifespan"
, subtitle = "Younger adults are more likely to be be in service and sales positions"
) +
theme_classic() +
theme(
legend.position = "none"
, axis.text = element_text(face = "bold", size = rel(1.1))
, axis.title = element_text(face = "bold", size = rel(1.2))
, plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
, plot.subtitle = element_text(face = "italic", size = rel(1.1), hjust = .5)
)
Treemap
- Mosaic plots are sort of just fancy stacked bar plots that let you also index by size
- Treemaps are helpful when we have nested categorical (and sometimes, to a lesser degree continuous) variables
- We’ll use the example of our jobs data, but this could be used for lots of other types of variables
- Crossed conditions in an experiment
- Intergenerational data
- Average scores on variables within categories
- Brain activation across broader and narrower brain regions
- Political affiliation across states, demographic groups, and more
Code
gsoep_tm <- gsoep %>%
group_by(SID) %>%
filter(!is.na(job)) %>%
group_by(SID) %>%
summarize(job = max(job)) %>%
ungroup() %>%
rename(code = job) %>%
left_join(jobs %>% rename(code = old)) %>%
group_by(code, cat, job) %>%
tally() %>%
ungroup() %>%
filter(n/sum(n) > .02) %>%
mutate(job = str_wrap(job, 15))
Basic Syntax
Improvements: Remove Legend and Add Labels
Improvements: Add Subgroup Text
Code
gsoep_tm %>%
arrange(cat, code) %>%
ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
geom_treemap() +
geom_treemap_text(
colour = "white"
, place = "centre"
, size = 15
, grow = FALSE
) +
geom_treemap_subgroup_text(
place = "bottom"
, grow = TRUE
, alpha = 0.4
, colour = "white"
, fontface = "italic"
) +
scale_fill_viridis_d() +
theme(legend.position = "none")
Improvements: Color Palette
Code
gsoep_tm %>%
arrange(cat, code) %>%
ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
geom_treemap() +
geom_treemap_text(
colour = "white"
, place = "centre"
, size = 15
, grow = FALSE
) +
geom_treemap_subgroup_text(
place = "bottom"
, grow = TRUE
, alpha = 0.4
, colour = "white"
, fontface = "italic"
) +
scale_fill_viridis_d() +
theme(legend.position = "none")
Improvements: Group and Subgroup Borders + Text Color
Code
gsoep_tm %>%
arrange(cat, code) %>%
ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
geom_treemap(color = "grey", size = 3) +
geom_treemap_text(
colour = c(rep("white", 11), rep("black",4))
, place = "centre"
, size = 15
, grow = FALSE
) +
geom_treemap_subgroup_text(
place = "bottom"
, grow = TRUE
, alpha = 0.4
, colour = c(rep("white", 11), rep("black",4))
, fontface = "italic"
) +
geom_treemap_subgroup_border(
colour = "white"
, size = 5
) +
scale_fill_viridis_d() +
theme(legend.position = "none")
Improvements: Title
Code
gsoep_tm %>%
arrange(cat, code) %>%
ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
geom_treemap(color = "grey", size = 3) +
geom_treemap_text(
colour = c(rep("white", 11), rep("black",4))
, place = "centre"
, size = 15
, grow = FALSE
) +
geom_treemap_subgroup_text(
place = "bottom"
, grow = TRUE
, alpha = 0.4
, colour = c(rep("white", 11), rep("black",4))
, fontface = "italic"
) +
geom_treemap_subgroup_border(
colour = "white"
, size = 5
) +
scale_fill_viridis_d() +
labs(title = "White Collar Public Service, Sales, and\nFinance Jobs Far Outnumber Blue Collar Jobs") +
theme(legend.position = "none"
, plot.title = element_text(face = "bold", hjust = .5))
Alternative Exercise
If you don’t have your own data, you can use some open data from (infoisbeautiful.net)[https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/]. Here’s the commands you’ll need to load the data (note you will have to authenticate via Google):
Code
Choose one of the visualizations from this week and use these data to create it, also keeping in mind principles and skills from previous weeks. This is not a graded exercise but rather a challenge to create a really compelling visualization using what you’ve learned.