Visualizing Proportions

Author

Emorie D Beck

Code

library(RColorBrewer)
library(plyr)
library(tidyverse)

Quick Review

Review

What are the core elements of ggplot2 grammar?

From last week: * Mappings: base layer + ggplot() and aes() * Scales: control and modify your mappings + e.g., scale_x_continuous() and scale_fill_manual() * Geoms: plot elements + e.g., geom_point() and geom_line()

Facets: panel your plot
- facet_wrap() and facet_grid()
Themes: style your figure
- Built-in: e.g., theme_classic()
- Manual: theme() (legend, strip, axis, plot, panel)

Quick Review

Colorblindness and accessible plots

Adding in a colorblind-friendly palette from Wong (2011)

Code

cbsafe_pal <- tribble(
  ~name, ~rgb
  , "black", c(0, 0, 0)
  , "sky blue", c(86, 180, 233)
  , "bluish green", c(0, 158, 115)
  , "yellow", c(240, 228, 66)
  , "orange", c(230, 159, 0)
  , "blue", c(0, 114, 178)
  , "vermillion", c(213, 94, 0)
  , "reddish purple", c(204, 121, 167)
) %>%
  mutate(hex = map_chr(rgb, function(x) rgb(x[1], x[2], x[3], maxColorValue = 255)))
cbsafe_pal

Visualizating Proportions

Proportions are often important in our research
From describing sample-level differences to describing the frequency of behaviors / events / experiences, etc., we often reach toward describing amounts relative to the whole
But the goals we are trying to achieve are varied, which necesssitates the use of different graphics

Agenda

We will cover X kinds of ways of visualizations, all of which were covered in your readings
We will cover both when to use them and how to create them
- Pie Charts
- Bar Charts (Stacked)
- Bar Charts (Side-by-Side)
- Bar Charts and Density Across Continuous Variables
- Mosaic Plots
- Parallel Sets

But First, Our Data

Today, we’ll use the teaching sample from the German Socioeconomic Panel Study (GSOEP)
GSOEP is an ongoing longitudinal panel study that began in 1984 (26 waves of data!)
~20,000 people are sampled each year
Samples households in Germany
Has additional sub-projects (e.g., innovation studies, migrant panel, etc.)
The data are publicly available via application

Code

load(url("https://github.com/emoriebeck/psc290-data-viz-2022/raw/main/03-week3-proportions/04-data/gsoep.RData"))
gsoep

Pie Charts

You may be wondering if you should ever use a pie chart
The answer is, of course, it depends
Pie charts are great when:
- What you want to visualize is simple (e.g., basic fractions)
- You want to clearly emphasize proportion relative to the whole
- You have a small data set
In our data, we have a few variables that follow this, but we’ll focus on one:
- marital status (4 groups)
ggplot2 doesn’t specifically support pie charts
Why? Because it’s a layered grammar of graphics and an explicit function for it would be redundant with some of the built in coordinates
- specifically, coord_polar()
So to make a pie chart, we’ll use geom_bar() + coord_polar()

Basic Syntax

Code

gsoep %>%
  filter(year == 2009 & !is.na(marital)) %>% # random
  group_by(marital) %>%
  tally() %>%
  mutate(marital = factor(
    marital
    , 1:4
    , c("Married", "Separated", "Widowed", "Never Married")
    )) %>%
  ggplot(aes(x = "", y = n, fill = marital)) + 
    geom_bar(stat = "identity", width = 1, color = "white") + 
    coord_polar("y", start = 0) + 
    theme_void()

Improvements: Slice Labels and Colors

Code

gsoep %>%
  filter(year == 2009 & !is.na(marital)) %>% # random
  group_by(marital) %>%
  tally() %>%
  mutate(marital = factor(
    marital
    , 1:4
    , c("Married", "Separated", "Widowed", "Never Married")
    )) %>%
  arrange(desc(marital)) %>%
  mutate(prop = n / sum(n) * 100
         , ypos = cumsum(prop)- 0.5*prop) %>%
  ggplot(aes(x = "", y = prop, fill = marital)) + 
    geom_bar(stat = "identity", width = 1, color = "white") + 
    geom_text(
      aes(y = ypos, label = marital)
      , color = "white"
      , size=4
      ) +
    scale_fill_manual(values = cbsafe_pal$hex[c(2, 8, 3, 4)]) + 
    coord_polar("y", start = 0) + 
    theme_void() + 
    theme(legend.position = "none")

More Improvements: Title and Story-Congruent Colors

Code

gsoep %>%
  filter(year == 2009 & !is.na(marital)) %>% # random
  group_by(marital) %>%
  tally() %>%
  mutate(marital = factor(marital, 1:4, c("Married", "Separated", "Widowed", "Never Married"))) %>%
  arrange(desc(marital)) %>%
  mutate(prop = n / sum(n) * 100
         , ypos = cumsum(prop)- 0.5*prop) %>%
  ggplot(aes(x = "", y = prop, fill = marital)) + 
    geom_bar(stat = "identity", width = 1, color = "black") + 
    geom_label(
      aes(y = ypos, label = marital)
      , color = "white"
      , size = 6
      , fontface = 2) +
    scale_fill_manual(values = c(rev(brewer.pal(9,"Greens")[c(4,6,8)]), "grey60")) + 
    coord_polar("y", start = 0) + 
    labs(
      title = "In 2009, the majority of GSOEP participants\nwere or had been married/partnered"
    ) + 
    theme_void() + 
    theme(
      legend.position = "none"
      , plot.title = element_text(face = "bold.italic", size = rel(1.4), hjust = .5)
      )

Stacked Bar Charts

Like pie charts, stacked bar charts have their time and place
In particular:
- Show proportions relative to the total
- Can be used to show changes over time
To demonstrate, let’s look at marital status across emerging adulthood (18-26)

Basic Syntax

Code

gsoep %>%
  filter(age %in% 18:26 & !is.na(marital)) %>%
  group_by(age, marital) %>%
  tally() %>%
  group_by(age) %>%
  mutate(
    marital = factor(
      marital
      , 1:4
      , c("Married", "Separated", "Widowed", "Never Married")
      )
    , age = factor(age)
    , prop = n/sum(n)
    ) %>%
  ggplot(aes(x = age, y = prop, fill = marital)) + 
    geom_bar(stat = "identity", color = "black") + 
    theme_classic()

Improvements: Color

Code

gsoep %>%
  filter(age %in% 18:26 & !is.na(marital)) %>%
  group_by(age, marital) %>%
  tally() %>%
  group_by(age) %>%
  mutate(marital = factor(marital, seq(4,1,-1), rev(c("Married", "Separated", "Widowed", "Never Married")))
         , age = factor(age)
         , prop = n/sum(n)) %>%
  ggplot(aes(x = age, y = prop, fill = marital)) + 
    geom_bar(stat = "identity", color = "black") + 
    scale_fill_manual(values = c("grey80",brewer.pal(9,"Greens")[c(2,4,6)])) + 
    theme_classic()

Improvements: Label & Scales

Code

gsoep %>%
  filter(age %in% 18:26 & !is.na(marital)) %>%
  group_by(age, marital) %>%
  tally() %>%
  group_by(age) %>%
  mutate(
    marital = factor(
      marital
      , seq(4,1,-1)
      , rev(c("Married", "Separated", "Widowed", "Never Married"))
      )
    , age = factor(age)
    , prop = n/sum(n)
    ) %>%
  ggplot(aes(x = age, y = prop, fill = marital)) + 
    geom_bar(stat = "identity", color = "black") + 
    scale_fill_manual(values = c("grey80",brewer.pal(9,"Greens")[c(2,4,6)])) + 
    scale_y_continuous(
      limits = c(0,1)
      , breaks = seq(0, 1, .25)
      , labels = c("0%", "25%", "50%", "75%", "100%")
      ) + 
    labs(
      x = "Age"
      , y = "Percent of Sample"
      , title = "Rates of relationships increase in emerging adulthood"
      , subtitle = "But most remain unpartnered by 26"
      ) +
    theme_classic()

Improvements: Legend

Code

gsoep %>%
  filter(age %in% 18:26 & !is.na(marital)) %>%
  group_by(age, marital) %>%
  tally() %>%
  group_by(age) %>%
  mutate(marital = factor(marital, seq(4,1,-1), rev(c("Married", "Separated", "Widowed", "Never Married")))
         , age = factor(age)
         , prop = n/sum(n)) %>%
  ggplot(aes(x = age, y = prop, fill = marital)) + 
    geom_bar(stat = "identity", color = "black") + 
    scale_fill_manual(values = c("grey80",brewer.pal(9,"Greens")[c(2,4,6)])) + 
    scale_y_continuous(
      limits = c(0,1)
      , breaks = seq(0, 1, .25)
      , labels = c("0%", "25%", "50%", "75%", "100%")
      ) + 
    annotate("text", x = "26", y = .60, label = "Never Married", angle = 90) + 
    annotate("text", x = "26", y = .13, label = "Married", angle = 90, color = "white") + 
    labs(
      x = "Age"
      , y = "Percent of Sample"
      , title = "Rates of relationships increase in emerging adulthood"
      , subtitle = "But most remain unpartnered by 26"
      , fill = NULL
      ) +
    theme_classic() + 
    theme(legend.position = "bottom")

Improvements: Theme Elements Exercise

Bold axis text and increase size
Bold axis titles and increase size
Bold title and subtitle and center (hint, you will also need to wrap the title text)

(Answers)

Code

gsoep %>%
  filter(age %in% 18:26 & !is.na(marital)) %>%
  group_by(age, marital) %>%
  tally() %>%
  group_by(age) %>%
  mutate(marital = factor(marital, seq(4,1,-1), rev(c("Married", "Separated", "Widowed", "Never Married")))
         , age = factor(age)
         , prop = n/sum(n)) %>%
  ggplot(aes(x = age, y = prop, fill = marital)) + 
    geom_bar(stat = "identity", color = "black") + 
    scale_fill_manual(values = c("grey80",brewer.pal(9,"Greens")[c(2,4,6)])) + 
    scale_y_continuous(
      limits = c(0,1)
      , breaks = seq(0, 1, .25)
      , labels = c("0%", "25%", "50%", "75%", "100%")
      ) + 
    annotate("text", x = "26", y = .60, label = "Never Married", angle = 90) + 
    annotate("text", x = "26", y = .13, label = "Married", angle = 90, color = "white") + 
    labs(
      x = "Age"
      , y = "Percent of Sample"
      , title = "Rates of relationships increase in\nemerging adulthood"
      , subtitle = "But most remain unpartnered by 26"
      , fill = NULL
      ) +
    theme_classic() + 
    theme(
      legend.position = "bottom"
      , axis.text = element_text(face = "bold", size = rel(1.1))
      , axis.title = element_text(face = "bold", size = rel(1.1))
      , plot.title = element_text(face = "bold", size = rel(1.2), hjust = .5)
      , plot.subtitle = element_text(face = "italic", size = rel(1.1), hjust = .5)
      )

Side-by-Side Bar Charts

Stacked bar charts are great for showing sequences but can make it difficult to compare within a stack
Side-by-side bar charts make it much easier to compare across categories and work well when broken into many categories
But they can be difficult to understand across sequences
To demonstrate, let’s look at marriage rates across three waves

Basic Syntax

Code

gsoep %>%
  filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
  group_by(year, marital) %>%
  tally() %>%
  mutate(marital = factor(marital, 1:4, c("Married", "Separated", "Widowed", "Never Married"))) %>%
  group_by(year) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = year, y = prop, fill = marital)) + 
    geom_bar(stat = "identity", color = "black", position = "dodge") + 
    theme_classic()

Improvements: Order

Code

gsoep %>%
  filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
  group_by(year, marital) %>%
  tally() %>%
  mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>%
  group_by(year) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = year, y = prop, fill = marital)) + 
    geom_bar(stat = "identity", color = "black", position = "dodge") + 
    theme_classic()

Improvements: Labels

We could label the bars, but let’s label the axes instead

Code

gsoep %>%
  filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
  group_by(year, marital) %>%
  tally() %>%
  mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>%
  group_by(year) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = marital, y = prop, fill = marital)) + 
    geom_bar(stat = "identity", color = "black", position = "dodge") + 
    scale_y_continuous(
      limits = c(0,.7), breaks = seq(0,.7, .2), labels = c("0%", "20%", "40%", "60%")
    ) +
    facet_grid(~year) + 
    labs(
      x = NULL
      , y = "Percentage of Participants"
      , title = "Marital Status Has Remained Consistent Throughout the 21st Century"
      ) + 
    theme_classic() + 
    theme(
      legend.position = "none"
      , axis.text.x = element_text(angle = 45, hjust = 1)
      )

Improvements: Theme Elements

Let’s label and improve the theme elements

Code

gsoep %>%
  filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
  group_by(year, marital) %>%
  tally() %>%
  mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>%
  group_by(year) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = marital, y = prop, fill = marital)) + 
    geom_bar(stat = "identity", color = "black", position = "dodge") + 
    scale_y_continuous(
      limits = c(0,.7), breaks = seq(0,.7, .2), labels = c("0%", "20%", "40%", "60%")
    ) +
    facet_grid(~year) + 
    labs(
      x = NULL
      , y = "Percentage of Participants"
      , title = "Marital Status Has Remained Consistent\nThroughout the 21st Century"
      ) + 
    theme_classic() + 
    theme(
      legend.position = "none"
      , axis.text = element_text(face = "bold", size = rel(1.2))
      , axis.text.x = element_text(angle = 45, hjust = 1, size = rel(1))
      , axis.title = element_text(face = "bold", size = rel(1.2))
      , strip.background = element_rect(fill = "grey90", color = "black")
      , strip.text = element_text(face = "bold", size = rel(1.2))
      , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
      )

Improvements: Colors (Exercise)

Exercise: * Improve the colors by making them: + Colorblind-friendly + Match the goal of the plot (see title)

(Answers)

Code

gsoep %>%
  filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random
  group_by(year, marital) %>%
  tally() %>%
  mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>%
  group_by(year) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = marital, y = prop, fill = marital)) + 
    geom_bar(stat = "identity", color = "black", position = "dodge") + 
    scale_y_continuous(
      limits = c(0,.7), breaks = seq(0,.7, .2), labels = c("0%", "20%", "40%", "60%")
    ) +
    scale_fill_manual(values = cbsafe_pal$hex[2:5]) +
    facet_grid(~year) + 
    labs(
      x = NULL
      , y = "Percentage of Participants"
      , title = "Marital Status Has Remained Consistent\nThroughout the 21st Century"
      ) + 
    theme_classic() + 
    theme(
      legend.position = "none"
      , axis.text = element_text(face = "bold", size = rel(1.2))
      , axis.text.x = element_text(angle = 45, hjust = 1, size = rel(1))
      , axis.title = element_text(face = "bold", size = rel(1.2))
      , strip.background = element_rect(fill = "grey90", color = "black")
      , strip.text = element_text(face = "bold", size = rel(1.2))
      , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
      )

Bar Charts and Density Across Continuous Variables

One challenge with stacked bar charts is that when there are more than two categories, it can be very difficult to track the visualized trend
Relative to side-by-side bar charts, it’s easy to see any category relative to the total but somewhat more difficult to also account for differing numbers of people in different categories or across time
One possible solution to this is to look at densities across time and groups or relative to the total
Let’s do both now

Stacked Area Charts

But first, remember stacked bar charts? Stacked area charts are sort of an extension of those:

Code

gsoep %>%
  filter(age %in% c(20, 30, 40, 50, 60, 70, 80) & !is.na(SRhealth)) %>% # random
  group_by(age, SRhealth) %>%
  tally() %>%
  mutate(SRhealth = factor(SRhealth, seq(5,1,-1), c("Very good", "Good", "Satisfactory", "Poor", "Bad"))) %>%
  group_by(age) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = age, y = prop, fill = SRhealth)) + 
    geom_bar(stat = "identity", color = "black") + 
    scale_fill_manual(values = cbsafe_pal$hex[2:6]) +
    theme_classic()

But without the bars separating them.

Code

gsoep %>%
  filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
  group_by(age, SRhealth) %>%
  tally() %>%
  mutate(SRhealth = factor(SRhealth, seq(5,1,-1), c("Very good", "Good", "Satisfactory", "Poor", "Bad"))) %>%
  group_by(age) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = age, y = prop, fill = SRhealth)) + 
    geom_area() + 
    # scale_fill_manual(values = cbsafe_pal$hex[2:6]) +
    theme_classic()

Improvements: Color

Code

gsoep %>%
  filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
  group_by(age, SRhealth) %>%
  tally() %>%
  mutate(SRhealth = factor(SRhealth, seq(5,1,-1), c("Very good", "Good", "Satisfactory", "Poor", "Bad"))) %>%
  group_by(age) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = age, y = prop, fill = SRhealth)) + 
    geom_area(color = "white", alpha = .6) + 
    scale_fill_viridis_d() +
    theme_classic()

Improvements: Color Labels

Code

gsoep %>%
  filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
  group_by(age, SRhealth) %>%
  tally() %>%
  mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
  group_by(age) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = age, y = prop, fill = SRhealth)) + 
    geom_area(color = "white", alpha = .6) + 
    annotate("text", x = 85, y = .95, label = "Bad", color = "white", fontface = 2) + 
    annotate("text", x = 75, y = .80, label = "Poor", color = "white", fontface = 2) + 
    annotate("text", x = 62, y = .55, label = "Satisfactory", color = "white", fontface = 2) + 
    annotate("text", x = 43, y = .3, label = "Good", color = "black", fontface = 2) + 
    annotate("text", x = 30, y = .07, label = "Very Good", color = "black", fontface = 2) + 
    scale_fill_viridis_d() +
    theme_classic() + 
    theme(legend.position = "none")

Improvements: Theme Elements

Code

gsoep %>%
  filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
  group_by(age, SRhealth) %>%
  tally() %>%
  mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
  group_by(age) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = age, y = prop, fill = SRhealth)) + 
    geom_area(color = "white", alpha = .6) + 
    annotate("text", x = 85, y = .95, label = "Bad", color = "white", fontface = 2) + 
    annotate("text", x = 75, y = .80, label = "Poor", color = "white", fontface = 2) + 
    annotate("text", x = 62, y = .55, label = "Satisfactory", color = "white", fontface = 2) + 
    annotate("text", x = 43, y = .3, label = "Good", color = "black", fontface = 2) + 
    annotate("text", x = 30, y = .07, label = "Very Good", color = "black", fontface = 2) + 
    scale_fill_viridis_d() +
    theme_classic() + 
    theme(legend.position = "none"
          , axis.text = element_text(face = "bold", size = rel(1.1))
          , axis.title = element_text(face = "bold", size = rel(1.1))
          , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
    )

Improvements: Labels and Title (Exercise)

Exercise:

Add plot title
Change x and y scale labels and titles

(Answers)

Code

gsoep %>%
  filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
  group_by(age, SRhealth) %>%
  tally() %>%
  mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
  group_by(age) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = age, y = prop, fill = SRhealth)) + 
    geom_area(color = "white", alpha = .6) + 
    annotate("text", x = 85, y = .95, label = "Bad", color = "white", fontface = 2) + 
    annotate("text", x = 75, y = .80, label = "Poor", color = "white", fontface = 2) + 
    annotate("text", x = 62, y = .55, label = "Satisfactory", color = "white", fontface = 2) + 
    annotate("text", x = 43, y = .3, label = "Good", color = "black", fontface = 2) + 
    annotate("text", x = 30, y = .07, label = "Very Good", color = "black", fontface = 2) + 
    scale_fill_viridis_d() +
    theme_classic() + 
    theme(legend.position = "none"
          , axis.text = element_text(face = "bold", size = rel(1.1))
          , axis.title = element_text(face = "bold", size = rel(1.1))
          , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
    )

Improvements: Labels and Title

Code

gsoep %>%
  filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
  group_by(age, SRhealth) %>%
  tally() %>%
  mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
  group_by(age) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = age, y = prop, fill = SRhealth)) + 
    geom_area(color = "white", alpha = .6) + 
    annotate("text", x = 85, y = .95, label = "Bad", color = "white", fontface = 2) + 
    annotate("text", x = 75, y = .80, label = "Poor", color = "white", fontface = 2) + 
    annotate("text", x = 62, y = .55, label = "Satisfactory", color = "white", fontface = 2) + 
    annotate("text", x = 43, y = .3, label = "Good", color = "black", fontface = 2) + 
    annotate("text", x = 30, y = .07, label = "Very Good", color = "black", fontface = 2) + 
    scale_x_continuous(limits = c(18, 100), breaks = seq(20, 100, 10)) + 
    scale_y_continuous(limits = c(0,1), breaks = seq(0,1, .25), labels = c("0%", "25%", "50%", "75%", "100%")) + 
    scale_fill_viridis_d() +
    labs(
      x = "Age (Years)"
      , y = "Percentage of Participants"
      , title = "Levels of Self-Rated Health Decrease Across the Lifespan"
    ) + 
    theme_classic() + 
    theme(legend.position = "none"
          , axis.text = element_text(face = "bold", size = rel(1.1))
          , axis.title = element_text(face = "bold", size = rel(1.1))
          , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
    )

Total Density Plots

Let’s revisit these data but also demonstrating how sample size changes across the lifespan
To do this, we need two pieces of information:
- sample size in each self-rated health category at each age group
- total in each age group

Let’s start by using stat_smooth() to get a smoothed geom_area() of the total sample size onto the figure

Code

gsoep %>%
  filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
  group_by(age, SRhealth) %>%
  tally() %>%
  mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
  group_by(age) %>%
  mutate(total_n = sum(n))  %>%
  ggplot(aes(x = age, y = n)) + 
    stat_smooth(
        aes(y = total_n)
        , geom = 'area'
        , method = 'loess'
        , span = 1/3
        , alpha = .8
        , fill = "grey"
        ) + 
    facet_grid(~SRhealth) + 
    theme_classic()

Then add the area for each ordinal level of self-rated health.

Code

gsoep %>%
  filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
  group_by(age, SRhealth) %>%
  tally() %>%
  mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
  group_by(age) %>%
  mutate(total_n = sum(n))  %>%
  ggplot(aes(x = age, y = n)) + 
    stat_smooth(aes(y = total_n), geom = 'area', method = 'loess'
        , span = 1/3, alpha = .8, fill = "grey") + 
    stat_smooth(
        aes(fill = SRhealth)
        , geom = 'area'
        , method = 'loess'
        , span = 1/3
        , alpha = .8
        ) + 
    annotate("text", x = 45, y = 3000, label = "Total") + 
    facet_grid(~SRhealth) + 
    theme_classic() + 
    theme(legend.position = "none")

Let’s not belabor this too much.

Code

gsoep %>%
  filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random
  group_by(age, SRhealth) %>%
  tally() %>%
  mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>%
  group_by(age) %>%
  mutate(total_n = sum(n))  %>%
  ggplot(aes(x = age, y = n)) + 
    stat_smooth(aes(y = total_n), geom = 'area', method = 'loess'
        , span = 1/3, alpha = .8, fill = "grey") + 
    stat_smooth(
        aes(fill = SRhealth)
        , geom = 'area'
        , method = 'loess'
        , span = 1/3
        , alpha = .8
        ) + 
    scale_x_continuous(limits = c(18, 100), breaks = seq(20, 100, 10)) + 
    scale_fill_viridis_d() +
    annotate("text", x = 45, y = 3000, label = "Total") + 
    labs(
      x = "Age (Years)"
      , y = "Number of People"
      , title = "Good Self-Rated Health Decreases Across the Lifespan"
      , subtitle = "But bad decreases less, likely because all-cause sample drop-out"
      ) + 
    facet_grid(~SRhealth) + 
    theme_classic() + 
    theme(legend.position = "none"
          , axis.text = element_text(face = "bold", size = rel(1.1))
          , axis.title = element_text(face = "bold", size = rel(1.1))
          , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
          , plot.subtitle = element_text(face = "italic", size = rel(1), hjust = .5)
          , strip.background = element_rect(fill = "grey90", color = "black")
          , strip.text = element_text(face = "bold", size = rel(1.2))
          )

Nested Proportions

Sometimes, the proportions that we want to visualize are more complex and can’t just be simply binned
In such cases, there may be hierarchical relationships among the categories
Today, we’ll cover two core nested proportion plots:
- Mosaic plots
- Parallel Sets
To do this, we’ll use 2-Digit NACE Industry Sector codes from participants’ last reported jobs in the SOEP, which I’ve broken down into 9 higher-order categories
This is a lot of categories, so we’ll further eventually exclude categories that don’t have at least 2% of the share of participants

Code

library(kableExtra)
jobs <- read_csv("https://raw.githubusercontent.com/emoriebeck/psc290-data-viz-2022/main/03-week3-proportions/05-job-codes.csv")
jobs %>%
  select(cat, job, old) %>%
  arrange(cat, old) %>%
  kable(.
        , "html"
        , col.names = c("Category", "Job", "Code")
        , caption = "2-Digit NACE Industry Sector Codes and Categories") %>%
  kable_classic(full_width = F, html_font = "Times New Roman") %>%
  collapse_rows(1, valign = "top") %>%
  scroll_box(height = "500px")

2-Digit NACE Industry Sector Codes and Categories
Category	Job	Code
Agriculture	Agriculture Hunting Rel.Serv.Activities	1
	Forestry Logging Rel.Service activities	2
	Fishing Fish Hatcheries Fish Farms	5
Energy and Utilities	Mining Coal Lignite; Extraction Of Peat	10
	Extraction Crude Petroleum Natural Gas	11
	Mining Of Uranium And Thorium Ores	12
	Mining Of Metal Ores	13
	Other Mining And Quarrying	14
	Recycling	37
	Electricity Gas Steam Hot Water Supply	40
	Sewage Refuse Disposal Sanitationa.a.Re	90
Finance and Tech	Financ.Intermediat. Exc.Insur. Pens.Fund	65
	Insurance Pens.Funding Ex.Compuls.SocSe	66
	Activ.Aux.To Financial Intermediation	67
	Computer And Related Activities	72
	Research And Development	73
	Other Business Activities	74
	Industry - NEC	96
Manufacturing	Manuf Food Products And Beverages	15
	Manuf Tobacco Products	16
	Manuf Textiles	17
	Manuf Wear. Apparel; Dressing Dyeing Fur	18
	Tanning Dress.Leather; luggage Footwear	19
	Manuf Wood Products Except Furniture	20
	Manuf Pulp Paper And Paper Products	21
	Manuf Coke Ref.Petroleum Nuclear Fuel	23
	Manuf Chemicals And Chemical Products	24
	Manuf Rubber And Plastic Products	25
	Manuf Other Non-metallic Mineral Product	26
	Manuf Basic Metals	27
	Manuf Fabric.Metal Prod. Ex.Machin. Equi	28
	Manuf Machinery And Equipment NEC	29
	Manuf Office Machinery And Computers	30
	Manuf Electrical Machinery Apparatus NE	31
	Manuf Radio Television Communic.Equipmen	32
	Manuf Medical Precision Optical Instrum.	33
	Manuf Motor Vehicles Trailers Semi-tr.	34
	Manuf Other Transport Equipment	35
	Manuf Furniture; Manufacturing NEC	36
	Collection Purification Distrib.Of Water	41
	Handcraft Trade - NEC	97
	Manufacturing - NEC	100
Other	Private Households With Employed Persons	95
Other	Extra-territorial Organizations.a.Bodies	99
Public Service	Publ.Administr. Defense; Compuls.SocSec	75
	Education	80
	Health And Social Work	85
	Activit.of.Membership Organizations NEC.	91
	Other Service Activities	93
Sales and Service	Publishing Printing Recorded Media	22
	Construction	45
	Sale Maint Rep.Mot.Vehicles;Ret.Sale Fue	50
	Wholesale Commission Trade Exc.Mot.Vehic	51
	Retail Trade Exc.Mot.Vehic;Mot.Cyc Repai	52
	Hotels And Restaurants	55
	Post And Telecommunications	64
	Real Estate Property Activities	70
	Rent.Machinery Equip Wo.Oper. P. HH Good	71
	Recreational Cultural Sporting Activity	92
	Services - NEC	98
Transportation	Land Transport; Transport Via Pipelines	60
	Water Transport	61
	Air Transport	62
	Supporting Aux.Transp.Activ;Trav.Agencie	63

Mosaic Plots

Unlike bar charts, mosaic plots allow us to index relative areas, sizes, proportions, etc. relative to two dimensions (so not just amount)
So in our example, this will let us see relative differences within categories vertically and across categories horizontally
To build this, we will finally leave the basic ggplot2 package and use the ggmosaic package
There are other packages, but we’ll use this one because (1) it’s great and (2) it let’s us still use everything we’ve learned about ggplot

But first, the data:

Code

if(!"ggmosaic" %in% installed.packages()) install.packages("ggmosaic")
library(ggmosaic)

gsoep_jobs <- gsoep %>%
  mutate(age_gr = mapvalues(age, 20:99, rep(seq(20, 90, 10), each = 10))) %>%
  filter(!is.na(age_gr) & age >= 20 & age < 100) %>%
  group_by(SID) %>%
  filter(!is.na(job)) %>%
  filter(age_gr == max(age_gr)) %>%
  group_by(SID, age_gr) %>%
  summarize(job = max(job)) %>%
  ungroup() %>%
  rename(code = job) %>%
  left_join(jobs %>% rename(code = old)) %>%
  group_by(code) %>%
  filter(n() / nrow(.) >= .02) %>%
  ungroup() 
gsoep_jobs

Let’s say, for example, that we think that some professions may restrict certain age groups due to experience (younger age groups) or functional limitations (older age groups)
We could look at this simply as a stacked bar chart, but it wouldn’t clarify that there are different proportions of people in each job category

Code

gsoep_jobs %>%
  ggplot() + 
    geom_mosaic(aes(x = product(age_gr), fill = cat)) + 
    theme_classic() + 
    theme(legend.position = "none")

Improvements: Let’s polish it

Code

gsoep_jobs %>%
  ggplot() + 
    geom_mosaic(aes(x = product(age_gr), fill = cat)) + 
    labs(
      x = "Age Group (Decades)"
      , title = "There are small changes in category of professions across the lifespan"
      , subtitle = "Younger adults are more likely to be be in service and sales positions"
      ) +
    theme_classic() + 
    theme(
      legend.position = "none"
      , axis.text = element_text(face = "bold", size = rel(1.1))
      , axis.title = element_text(face = "bold", size = rel(1.2))
      , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5)
      , plot.subtitle = element_text(face = "italic", size = rel(1.1), hjust = .5)
      )

Treemap

Mosaic plots are sort of just fancy stacked bar plots that let you also index by size
Treemaps are helpful when we have nested categorical (and sometimes, to a lesser degree continuous) variables
We’ll use the example of our jobs data, but this could be used for lots of other types of variables
- Crossed conditions in an experiment
- Intergenerational data
- Average scores on variables within categories
- Brain activation across broader and narrower brain regions
- Political affiliation across states, demographic groups, and more

Code

gsoep_tm <- gsoep %>%
  group_by(SID) %>%
  filter(!is.na(job)) %>%
  group_by(SID) %>%
  summarize(job = max(job)) %>%
  ungroup() %>%
  rename(code = job) %>%
  left_join(jobs %>% rename(code = old)) %>%
  group_by(code, cat, job) %>%
  tally()  %>%
  ungroup() %>%
  filter(n/sum(n) > .02) %>%
  mutate(job = str_wrap(job, 15))

Basic Syntax

Code

if(!"treemapify" %in% installed.packages()) install.packages("treemapify")
library(treemapify)
gsoep_tm %>%
  arrange(cat, code) %>%
  ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
  geom_treemap(color = "grey", size = 3)

Improvements: Remove Legend and Add Labels

Code

gsoep_tm %>%
  arrange(cat, code) %>%
  ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
  geom_treemap() +
  geom_treemap_text(
    colour = "white"
    , place = "centre"
    , size = 15
    , grow = FALSE
    ) +
  theme(legend.position = "none")

Improvements: Add Subgroup Text

Code

gsoep_tm %>%
  arrange(cat, code) %>%
  ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
  geom_treemap() +
  geom_treemap_text(
    colour = "white"
    , place = "centre"
    , size = 15
    , grow = FALSE
    ) +
  geom_treemap_subgroup_text(
    place = "bottom"
    , grow = TRUE
    , alpha = 0.4
    , colour = "white"
    , fontface = "italic"
    ) +
  scale_fill_viridis_d()  +
  theme(legend.position = "none")

Improvements: Color Palette

Code

gsoep_tm %>%
  arrange(cat, code) %>%
  ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
  geom_treemap() +
  geom_treemap_text(
    colour = "white"
    , place = "centre"
    , size = 15
    , grow = FALSE
    ) +
  geom_treemap_subgroup_text(
    place = "bottom"
    , grow = TRUE
    , alpha = 0.4
    , colour = "white"
    , fontface = "italic"
    ) +
  scale_fill_viridis_d()  +
  theme(legend.position = "none")

Improvements: Group and Subgroup Borders + Text Color

Code

gsoep_tm %>%
  arrange(cat, code) %>%
  ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
  geom_treemap(color = "grey", size = 3) +
  geom_treemap_text(
    colour = c(rep("white", 11), rep("black",4))
    , place = "centre"
    , size = 15
    , grow = FALSE
    ) +
  geom_treemap_subgroup_text(
    place = "bottom"
    , grow = TRUE
    , alpha = 0.4
    , colour = c(rep("white", 11), rep("black",4))
    , fontface = "italic"
    ) +
  geom_treemap_subgroup_border(
    colour = "white"
    , size = 5
    ) +
  scale_fill_viridis_d()  +
  theme(legend.position = "none")

Improvements: Title

Code

gsoep_tm %>%
  arrange(cat, code) %>%
  ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) +
  geom_treemap(color = "grey", size = 3) +
  geom_treemap_text(
    colour = c(rep("white", 11), rep("black",4))
    , place = "centre"
    , size = 15
    , grow = FALSE
    ) +
  geom_treemap_subgroup_text(
    place = "bottom"
    , grow = TRUE
    , alpha = 0.4
    , colour = c(rep("white", 11), rep("black",4))
    , fontface = "italic"
    ) +
  geom_treemap_subgroup_border(
    colour = "white"
    , size = 5
    ) +
  scale_fill_viridis_d()  +
  labs(title = "White Collar Public Service, Sales, and\nFinance Jobs Far Outnumber Blue Collar Jobs") + 
  theme(legend.position = "none"
        , plot.title = element_text(face = "bold", hjust = .5))

Alternative Exercise

If you don’t have your own data, you can use some open data from (infoisbeautiful.net)[https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/]. Here’s the commands you’ll need to load the data (note you will have to authenticate via Google):

Code

if(!"googlesheets4" %in% installed.packages()) install.packages("googlesheets4")
library(googlesheets4)

breach_data <- read_sheet("https://docs.google.com/spreadsheets/d/1i0oIJJMRG-7t1GT-mr4smaTTU7988yXVz8nPlwaJ8Xk/edit#gid=2")
breach_data

Choose one of the visualizations from this week and use these data to create it, also keeping in mind principles and skills from previous weeks. This is not a graded exercise but rather a challenge to create a really compelling visualization using what you’ve learned.

--- title: "Visualizing Proportions" author: "Emorie D Beck" format: html: code-tools: true code-copy: true code-line-numbers: true code-link: true theme: united highlight-style: tango df-print: paged code-fold: show toc: true toc-float: true self-contained: true # height: 900 footer: "PSC 290 - Data Visualization" logo: "https://github.com/emoriebeck/psc290-data-viz-2022/raw/main/01-week1-intro/02-code/02-images/ucdavis_logo_blue.png" --- ```{r, echo = F} knitr::opts_chunk$set(echo = TRUE, warning = F, message = F, error = F) options(knitr.kable.NA = '') ``` ```{r, echo = T} library(RColorBrewer) library(plyr) library(tidyverse) ``` # Quick Review ## Review ### What are the core elements of ggplot2 grammar? From last week: * **Mappings**: base layer + `ggplot()` and `aes()` * **Scales**: control and modify your mappings + e.g., `scale_x_continuous()` and `scale_fill_manual()` * **Geoms**: plot elements + e.g., `geom_point()` and `geom_line()` * **Facets**: panel your plot + `facet_wrap()` and `facet_grid()` * **Themes**: style your figure + Built-in: e.g., `theme_classic()` + Manual: `theme()` (`legend`, `strip`, `axis`, `plot`, `panel`) ## Quick Review ### Colorblindness and accessible plots * Adding in a colorblind-friendly palette from [Wong (2011)](https://www.nature.com/articles/nmeth.1618) ```{r} cbsafe_pal <- tribble( ~name, ~rgb , "black", c(0, 0, 0) , "sky blue", c(86, 180, 233) , "bluish green", c(0, 158, 115) , "yellow", c(240, 228, 66) , "orange", c(230, 159, 0) , "blue", c(0, 114, 178) , "vermillion", c(213, 94, 0) , "reddish purple", c(204, 121, 167) ) %>% mutate(hex = map_chr(rgb, function(x) rgb(x[1], x[2], x[3], maxColorValue = 255))) cbsafe_pal ``` # Visualizating Proportions * Proportions are often important in our research * From describing sample-level differences to describing the frequency of behaviors / events / experiences, etc., we often reach toward describing amounts relative to the whole * But the goals we are trying to achieve are varied, which necesssitates the use of different graphics # Agenda * We will cover X kinds of ways of visualizations, all of which were covered in your readings * We will cover both when to use them and how to create them + Pie Charts + Bar Charts (Stacked) + Bar Charts (Side-by-Side) + Bar Charts and Density Across Continuous Variables + Mosaic Plots + Parallel Sets # But First, Our Data * Today, we'll use the teaching sample from the German Socioeconomic Panel Study (GSOEP) * GSOEP is an ongoing longitudinal panel study that began in 1984 (26 waves of data!) * ~20,000 people are sampled each year * Samples households in Germany * Has additional sub-projects (e.g., innovation studies, migrant panel, etc.) * The data are publicly available via application ```{r gsoep clean fun, eval = F, echo = F} gsoep_read_fun <- function(Year, WL){ print(Year) old.names <- (gsoep_codebook %>% filter(year == Year))$orig_itemname p <- sprintf("%s/gsoep/%sp.sav", data_path, WL) %>% haven::read_sav(.) %>% full_join(sprintf("%s/gsoep/%spequiv.sav", data_path, WL) %>% haven::read_sav(.)) %>% full_join(ref) %>% filter(rgroup20 > 10) %>% select(SID = persnr, HID = hhnr, one_of(old.names)) %>% pivot_longer( cols = c(-SID, -HID) , values_to = "value" , names_to = "orig_itemname" , values_drop_na = T ) } gsoep_codebook <- read_csv("https://raw.githubusercontent.com/emoriebeck/psc290-data-viz-2022/main/03-week3-proportions/01-codebook.csv") %>% mutate(orig_itemname = str_to_lower(orig_itemname)) gsoep_codebook data_path <- "/Volumes/Emorie/data" ref <- sprintf("%s/gsoep/cirdef.sav", data_path) %>% haven::read_sav(.) %>% select(cid, rgroup20) gsoep_st <- sprintf("%s/gsoep/ppath.sav", data_path) %>% haven::read_sav(.) %>% full_join(ref) %>% filter(rgroup20 > 10) %>% left_join( sprintf("%s/gsoep/biojob.sav", data_path) %>% haven::read_sav(.) %>% select(pid = persnr, job = nacelj)) %>% select(SID = pid, gender = sex, yearBrth = gebjahr, mortality = todjahr, job) %>% distinct() gsoep <- gsoep_codebook %>% select(wave, waveletter, year) %>% filter(complete.cases(.)) %>% distinct() %>% arrange(year) %>% mutate(data = map2(year, waveletter, gsoep_read_fun)) gsoep_long <- gsoep %>% unnest(data) Mode <- function(x) { ux <- unique(x) ux <- ux[!is.na(ux)] ux[which.max(tabulate(match(x, ux)))] } gsoep_recode <- gsoep_long %>% left_join(gsoep_codebook %>% select(name, itemname, wave, year, orig_itemname, reverse_code:comp_rule)) %>% group_by(name) %>% nest() %>% ungroup() # recode recode_fun <- function(rule, y, year){ x <- y$value if(!is.na(rule)){y$value <- eval(parse(text = rule))} return(y) } fun_call <- function(x, rule){ switch(rule, average = mean(x, na.rm = T), mode = Mode(x)[1], sum = sum(x, na.rm = T), skip = unique(x)[1], max = max(x, na.rm = T), min = min(x, na.rm = T)) } gsoep_recode <- gsoep_recode %>% mutate(data = map(data, ~(.) %>% group_by(recode, year) %>% nest() %>% ungroup() %>% mutate(data = pmap(list(recode, data, year), recode_fun)) %>% unnest(data) %>% mutate(value = ifelse(value < 0 | is.nan(value) | is.infinite(value), NA, value)))) gsoep <- gsoep_recode %>% unnest(data) %>% select(name, year, SID, value) %>% pivot_wider( names_from = "name" , values_from = "value" ) %>% full_join(gsoep_st) %>% mutate( age = year - yearBrth , mortality = ifelse(mortality < 0, 0, ifelse(year >= mortality, 1, 0)) , job = ifelse(job < 0, NA, job) , SRhealth = mapvalues(SRhealth, seq(1,5), seq(5,1)) ) gsoep_st <- gsoep_st %>% mutate( mortality = ifelse(mortality < 0, 0, 1) , job = ifelse(job < 0, NA, job) ) save(gsoep, gsoep_st, file = "/Volumes/Emorie/GitHub/psc290-data-viz-2022/03-week3-proportions/04-data/gsoep.RData") ``` ```{r} load(url("https://github.com/emoriebeck/psc290-data-viz-2022/raw/main/03-week3-proportions/04-data/gsoep.RData")) gsoep ``` # Pie Charts * You may be wondering if you should ever use a pie chart * The answer is, of course, it depends * Pie charts are great when: + What you want to visualize is simple (e.g., basic fractions) + You want to clearly emphasize proportion relative to the whole + You have a small data set * In our data, we have a few variables that follow this, but we'll focus on one: + marital status (4 groups) * `ggplot2` doesn't specifically support pie charts * Why? Because it's a layered grammar of graphics and an explicit function for it would be redundant with some of the built in coordinates + specifically, `coord_polar()` * So to make a pie chart, we'll use `geom_bar() + coord_polar()` ## Basic Syntax ```{r, echo = T, fig.show='show', fig.align='center', fig.dim= c(5,3), out.width="100%"} gsoep %>% filter(year == 2009 & !is.na(marital)) %>% # random group_by(marital) %>% tally() %>% mutate(marital = factor( marital , 1:4 , c("Married", "Separated", "Widowed", "Never Married") )) %>% ggplot(aes(x = "", y = n, fill = marital)) + geom_bar(stat = "identity", width = 1, color = "white") + coord_polar("y", start = 0) + theme_void() ``` ## Improvements: Slice Labels and Colors ```{r, echo = T, fig.show='show', fig.align='center', fig.dim= c(4,4), out.width="85%"} gsoep %>% filter(year == 2009 & !is.na(marital)) %>% # random group_by(marital) %>% tally() %>% mutate(marital = factor( marital , 1:4 , c("Married", "Separated", "Widowed", "Never Married") )) %>% arrange(desc(marital)) %>% mutate(prop = n / sum(n) * 100 , ypos = cumsum(prop)- 0.5*prop) %>% ggplot(aes(x = "", y = prop, fill = marital)) + geom_bar(stat = "identity", width = 1, color = "white") + geom_text( aes(y = ypos, label = marital) , color = "white" , size=4 ) + scale_fill_manual(values = cbsafe_pal$hex[c(2, 8, 3, 4)]) + coord_polar("y", start = 0) + theme_void() + theme(legend.position = "none") ``` ## More Improvements: Title and Story-Congruent Colors ```{r, echo = T, fig.show='show', fig.align='center', fig.dim= c(6,6)} gsoep %>% filter(year == 2009 & !is.na(marital)) %>% # random group_by(marital) %>% tally() %>% mutate(marital = factor(marital, 1:4, c("Married", "Separated", "Widowed", "Never Married"))) %>% arrange(desc(marital)) %>% mutate(prop = n / sum(n) * 100 , ypos = cumsum(prop)- 0.5*prop) %>% ggplot(aes(x = "", y = prop, fill = marital)) + geom_bar(stat = "identity", width = 1, color = "black") + geom_label( aes(y = ypos, label = marital) , color = "white" , size = 6 , fontface = 2) + scale_fill_manual(values = c(rev(brewer.pal(9,"Greens")[c(4,6,8)]), "grey60")) + coord_polar("y", start = 0) + labs( title = "In 2009, the majority of GSOEP participants\nwere or had been married/partnered" ) + theme_void() + theme( legend.position = "none" , plot.title = element_text(face = "bold.italic", size = rel(1.4), hjust = .5) ) ``` # Stacked Bar Charts * Like pie charts, stacked bar charts have their time and place * In particular: + Show proportions relative to the total + Can be used to show changes over time * To demonstrate, let's look at marital status across emerging adulthood (18-26) ## Basic Syntax ```{r, echo = T, fig.show='show', fig.align='center', fig.dim= c(5,3.5), out.width = "90%"} gsoep %>% filter(age %in% 18:26 & !is.na(marital)) %>% group_by(age, marital) %>% tally() %>% group_by(age) %>% mutate( marital = factor( marital , 1:4 , c("Married", "Separated", "Widowed", "Never Married") ) , age = factor(age) , prop = n/sum(n) ) %>% ggplot(aes(x = age, y = prop, fill = marital)) + geom_bar(stat = "identity", color = "black") + theme_classic() ``` ## Improvements: Color ```{r, echo = T, fig.show='show', fig.align='center', fig.dim= c(5,3.5), out.width = "90%"} gsoep %>% filter(age %in% 18:26 & !is.na(marital)) %>% group_by(age, marital) %>% tally() %>% group_by(age) %>% mutate(marital = factor(marital, seq(4,1,-1), rev(c("Married", "Separated", "Widowed", "Never Married"))) , age = factor(age) , prop = n/sum(n)) %>% ggplot(aes(x = age, y = prop, fill = marital)) + geom_bar(stat = "identity", color = "black") + scale_fill_manual(values = c("grey80",brewer.pal(9,"Greens")[c(2,4,6)])) + theme_classic() ``` ## Improvements: Label & Scales ```{r, echo = T, fig.show='show', fig.align='center', fig.dim= c(5,3.5), out.width = "90%"} gsoep %>% filter(age %in% 18:26 & !is.na(marital)) %>% group_by(age, marital) %>% tally() %>% group_by(age) %>% mutate( marital = factor( marital , seq(4,1,-1) , rev(c("Married", "Separated", "Widowed", "Never Married")) ) , age = factor(age) , prop = n/sum(n) ) %>% ggplot(aes(x = age, y = prop, fill = marital)) + geom_bar(stat = "identity", color = "black") + scale_fill_manual(values = c("grey80",brewer.pal(9,"Greens")[c(2,4,6)])) + scale_y_continuous( limits = c(0,1) , breaks = seq(0, 1, .25) , labels = c("0%", "25%", "50%", "75%", "100%") ) + labs( x = "Age" , y = "Percent of Sample" , title = "Rates of relationships increase in emerging adulthood" , subtitle = "But most remain unpartnered by 26" ) + theme_classic() ``` ## Improvements: Legend ```{r, echo = T, fig.show='show', fig.align='center', fig.dim= c(6,4), out.width = "90%"} gsoep %>% filter(age %in% 18:26 & !is.na(marital)) %>% group_by(age, marital) %>% tally() %>% group_by(age) %>% mutate(marital = factor(marital, seq(4,1,-1), rev(c("Married", "Separated", "Widowed", "Never Married"))) , age = factor(age) , prop = n/sum(n)) %>% ggplot(aes(x = age, y = prop, fill = marital)) + geom_bar(stat = "identity", color = "black") + scale_fill_manual(values = c("grey80",brewer.pal(9,"Greens")[c(2,4,6)])) + scale_y_continuous( limits = c(0,1) , breaks = seq(0, 1, .25) , labels = c("0%", "25%", "50%", "75%", "100%") ) + annotate("text", x = "26", y = .60, label = "Never Married", angle = 90) + annotate("text", x = "26", y = .13, label = "Married", angle = 90, color = "white") + labs( x = "Age" , y = "Percent of Sample" , title = "Rates of relationships increase in emerging adulthood" , subtitle = "But most remain unpartnered by 26" , fill = NULL ) + theme_classic() + theme(legend.position = "bottom") ``` ## Improvements: Theme Elements Exercise 1. Bold axis text and increase size 2. Bold axis titles and increase size 3. Bold title and subtitle and center (hint, you will also need to wrap the title text) ### (Answers) ```{r, echo = T, fig.show='show', fig.align='center', fig.dim= c(6,4), out.width = "90%"} gsoep %>% filter(age %in% 18:26 & !is.na(marital)) %>% group_by(age, marital) %>% tally() %>% group_by(age) %>% mutate(marital = factor(marital, seq(4,1,-1), rev(c("Married", "Separated", "Widowed", "Never Married"))) , age = factor(age) , prop = n/sum(n)) %>% ggplot(aes(x = age, y = prop, fill = marital)) + geom_bar(stat = "identity", color = "black") + scale_fill_manual(values = c("grey80",brewer.pal(9,"Greens")[c(2,4,6)])) + scale_y_continuous( limits = c(0,1) , breaks = seq(0, 1, .25) , labels = c("0%", "25%", "50%", "75%", "100%") ) + annotate("text", x = "26", y = .60, label = "Never Married", angle = 90) + annotate("text", x = "26", y = .13, label = "Married", angle = 90, color = "white") + labs( x = "Age" , y = "Percent of Sample" , title = "Rates of relationships increase in\nemerging adulthood" , subtitle = "But most remain unpartnered by 26" , fill = NULL ) + theme_classic() + theme( legend.position = "bottom" , axis.text = element_text(face = "bold", size = rel(1.1)) , axis.title = element_text(face = "bold", size = rel(1.1)) , plot.title = element_text(face = "bold", size = rel(1.2), hjust = .5) , plot.subtitle = element_text(face = "italic", size = rel(1.1), hjust = .5) ) ``` # Side-by-Side Bar Charts * Stacked bar charts are great for showing sequences but can make it difficult to compare *within* a stack * Side-by-side bar charts make it much easier to compare across categories and work well when broken into many categories * But they can be difficult to understand across sequences * To demonstrate, let's look at marriage rates across three waves ## Basic Syntax ```{r, echo = T, fig.show='show', fig.align='center', fig.dim= c(6,4), out.width = "90%"} gsoep %>% filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random group_by(year, marital) %>% tally() %>% mutate(marital = factor(marital, 1:4, c("Married", "Separated", "Widowed", "Never Married"))) %>% group_by(year) %>% mutate(prop = n/sum(n)) %>% ggplot(aes(x = year, y = prop, fill = marital)) + geom_bar(stat = "identity", color = "black", position = "dodge") + theme_classic() ``` ## Improvements: Order ```{r, echo = T, fig.show='show', fig.align='center', fig.dim= c(6,4), out.width = "90%"} gsoep %>% filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random group_by(year, marital) %>% tally() %>% mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>% group_by(year) %>% mutate(prop = n/sum(n)) %>% ggplot(aes(x = year, y = prop, fill = marital)) + geom_bar(stat = "identity", color = "black", position = "dodge") + theme_classic() ``` ## Improvements: Labels We could label the bars, but let's label the axes instead ```{r, echo = T, fig.show='show', fig.align='center', fig.dim= c(6,4), out.width = "90%"} gsoep %>% filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random group_by(year, marital) %>% tally() %>% mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>% group_by(year) %>% mutate(prop = n/sum(n)) %>% ggplot(aes(x = marital, y = prop, fill = marital)) + geom_bar(stat = "identity", color = "black", position = "dodge") + scale_y_continuous( limits = c(0,.7), breaks = seq(0,.7, .2), labels = c("0%", "20%", "40%", "60%") ) + facet_grid(~year) + labs( x = NULL , y = "Percentage of Participants" , title = "Marital Status Has Remained Consistent Throughout the 21st Century" ) + theme_classic() + theme( legend.position = "none" , axis.text.x = element_text(angle = 45, hjust = 1) ) ``` ## Improvements: Theme Elements Let's label and improve the theme elements ```{r} gsoep %>% filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random group_by(year, marital) %>% tally() %>% mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>% group_by(year) %>% mutate(prop = n/sum(n)) %>% ggplot(aes(x = marital, y = prop, fill = marital)) + geom_bar(stat = "identity", color = "black", position = "dodge") + scale_y_continuous( limits = c(0,.7), breaks = seq(0,.7, .2), labels = c("0%", "20%", "40%", "60%") ) + facet_grid(~year) + labs( x = NULL , y = "Percentage of Participants" , title = "Marital Status Has Remained Consistent\nThroughout the 21st Century" ) + theme_classic() + theme( legend.position = "none" , axis.text = element_text(face = "bold", size = rel(1.2)) , axis.text.x = element_text(angle = 45, hjust = 1, size = rel(1)) , axis.title = element_text(face = "bold", size = rel(1.2)) , strip.background = element_rect(fill = "grey90", color = "black") , strip.text = element_text(face = "bold", size = rel(1.2)) , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5) ) ``` ## Improvements: Colors (Exercise) Exercise: * Improve the colors by making them: + Colorblind-friendly + Match the goal of the plot (see title) ### (Answers) ```{r} gsoep %>% filter(year %in% c(2000, 2005, 2010, 2015) & !is.na(marital)) %>% # random group_by(year, marital) %>% tally() %>% mutate(marital = factor(marital, c(1,4,2,3), c("Married", "Never Married", "Separated", "Widowed"))) %>% group_by(year) %>% mutate(prop = n/sum(n)) %>% ggplot(aes(x = marital, y = prop, fill = marital)) + geom_bar(stat = "identity", color = "black", position = "dodge") + scale_y_continuous( limits = c(0,.7), breaks = seq(0,.7, .2), labels = c("0%", "20%", "40%", "60%") ) + scale_fill_manual(values = cbsafe_pal$hex[2:5]) + facet_grid(~year) + labs( x = NULL , y = "Percentage of Participants" , title = "Marital Status Has Remained Consistent\nThroughout the 21st Century" ) + theme_classic() + theme( legend.position = "none" , axis.text = element_text(face = "bold", size = rel(1.2)) , axis.text.x = element_text(angle = 45, hjust = 1, size = rel(1)) , axis.title = element_text(face = "bold", size = rel(1.2)) , strip.background = element_rect(fill = "grey90", color = "black") , strip.text = element_text(face = "bold", size = rel(1.2)) , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5) ) ``` # Bar Charts and Density Across Continuous Variables * One challenge with stacked bar charts is that when there are more than two categories, it can be very difficult to track the visualized trend * Relative to side-by-side bar charts, it's easy to see any category relative to the total but somewhat more difficult to also account for differing numbers of people in different categories or across time * One possible solution to this is to look at densities across time and groups or relative to the total * Let's do both now ## Stacked Area Charts But first, remember stacked bar charts? Stacked area charts are sort of an extension of those: ```{r} gsoep %>% filter(age %in% c(20, 30, 40, 50, 60, 70, 80) & !is.na(SRhealth)) %>% # random group_by(age, SRhealth) %>% tally() %>% mutate(SRhealth = factor(SRhealth, seq(5,1,-1), c("Very good", "Good", "Satisfactory", "Poor", "Bad"))) %>% group_by(age) %>% mutate(prop = n/sum(n)) %>% ggplot(aes(x = age, y = prop, fill = SRhealth)) + geom_bar(stat = "identity", color = "black") + scale_fill_manual(values = cbsafe_pal$hex[2:6]) + theme_classic() ``` But without the bars separating them. ```{r} gsoep %>% filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random group_by(age, SRhealth) %>% tally() %>% mutate(SRhealth = factor(SRhealth, seq(5,1,-1), c("Very good", "Good", "Satisfactory", "Poor", "Bad"))) %>% group_by(age) %>% mutate(prop = n/sum(n)) %>% ggplot(aes(x = age, y = prop, fill = SRhealth)) + geom_area() + # scale_fill_manual(values = cbsafe_pal$hex[2:6]) + theme_classic() ``` ### Improvements: Color ```{r} gsoep %>% filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random group_by(age, SRhealth) %>% tally() %>% mutate(SRhealth = factor(SRhealth, seq(5,1,-1), c("Very good", "Good", "Satisfactory", "Poor", "Bad"))) %>% group_by(age) %>% mutate(prop = n/sum(n)) %>% ggplot(aes(x = age, y = prop, fill = SRhealth)) + geom_area(color = "white", alpha = .6) + scale_fill_viridis_d() + theme_classic() ``` ### Improvements: Color Labels ```{r} gsoep %>% filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random group_by(age, SRhealth) %>% tally() %>% mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>% group_by(age) %>% mutate(prop = n/sum(n)) %>% ggplot(aes(x = age, y = prop, fill = SRhealth)) + geom_area(color = "white", alpha = .6) + annotate("text", x = 85, y = .95, label = "Bad", color = "white", fontface = 2) + annotate("text", x = 75, y = .80, label = "Poor", color = "white", fontface = 2) + annotate("text", x = 62, y = .55, label = "Satisfactory", color = "white", fontface = 2) + annotate("text", x = 43, y = .3, label = "Good", color = "black", fontface = 2) + annotate("text", x = 30, y = .07, label = "Very Good", color = "black", fontface = 2) + scale_fill_viridis_d() + theme_classic() + theme(legend.position = "none") ``` ### Improvements: Theme Elements ```{r} gsoep %>% filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random group_by(age, SRhealth) %>% tally() %>% mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>% group_by(age) %>% mutate(prop = n/sum(n)) %>% ggplot(aes(x = age, y = prop, fill = SRhealth)) + geom_area(color = "white", alpha = .6) + annotate("text", x = 85, y = .95, label = "Bad", color = "white", fontface = 2) + annotate("text", x = 75, y = .80, label = "Poor", color = "white", fontface = 2) + annotate("text", x = 62, y = .55, label = "Satisfactory", color = "white", fontface = 2) + annotate("text", x = 43, y = .3, label = "Good", color = "black", fontface = 2) + annotate("text", x = 30, y = .07, label = "Very Good", color = "black", fontface = 2) + scale_fill_viridis_d() + theme_classic() + theme(legend.position = "none" , axis.text = element_text(face = "bold", size = rel(1.1)) , axis.title = element_text(face = "bold", size = rel(1.1)) , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5) ) ``` ### Improvements: Labels and Title (Exercise) Exercise: 1. Add plot title 2. Change `x` and `y` scale labels and titles #### (Answers) ```{r} gsoep %>% filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random group_by(age, SRhealth) %>% tally() %>% mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>% group_by(age) %>% mutate(prop = n/sum(n)) %>% ggplot(aes(x = age, y = prop, fill = SRhealth)) + geom_area(color = "white", alpha = .6) + annotate("text", x = 85, y = .95, label = "Bad", color = "white", fontface = 2) + annotate("text", x = 75, y = .80, label = "Poor", color = "white", fontface = 2) + annotate("text", x = 62, y = .55, label = "Satisfactory", color = "white", fontface = 2) + annotate("text", x = 43, y = .3, label = "Good", color = "black", fontface = 2) + annotate("text", x = 30, y = .07, label = "Very Good", color = "black", fontface = 2) + scale_fill_viridis_d() + theme_classic() + theme(legend.position = "none" , axis.text = element_text(face = "bold", size = rel(1.1)) , axis.title = element_text(face = "bold", size = rel(1.1)) , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5) ) ``` ### Improvements: Labels and Title ```{r} gsoep %>% filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random group_by(age, SRhealth) %>% tally() %>% mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>% group_by(age) %>% mutate(prop = n/sum(n)) %>% ggplot(aes(x = age, y = prop, fill = SRhealth)) + geom_area(color = "white", alpha = .6) + annotate("text", x = 85, y = .95, label = "Bad", color = "white", fontface = 2) + annotate("text", x = 75, y = .80, label = "Poor", color = "white", fontface = 2) + annotate("text", x = 62, y = .55, label = "Satisfactory", color = "white", fontface = 2) + annotate("text", x = 43, y = .3, label = "Good", color = "black", fontface = 2) + annotate("text", x = 30, y = .07, label = "Very Good", color = "black", fontface = 2) + scale_x_continuous(limits = c(18, 100), breaks = seq(20, 100, 10)) + scale_y_continuous(limits = c(0,1), breaks = seq(0,1, .25), labels = c("0%", "25%", "50%", "75%", "100%")) + scale_fill_viridis_d() + labs( x = "Age (Years)" , y = "Percentage of Participants" , title = "Levels of Self-Rated Health Decrease Across the Lifespan" ) + theme_classic() + theme(legend.position = "none" , axis.text = element_text(face = "bold", size = rel(1.1)) , axis.title = element_text(face = "bold", size = rel(1.1)) , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5) ) ``` ## Total Density Plots * Let's revisit these data but also demonstrating how sample size changes across the lifespan * To do this, we need two pieces of information: + sample size in each self-rated health category at each age group + total in each age group Let's start by using `stat_smooth()` to get a smoothed `geom_area()` of the total sample size onto the figure ```{r} gsoep %>% filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random group_by(age, SRhealth) %>% tally() %>% mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>% group_by(age) %>% mutate(total_n = sum(n)) %>% ggplot(aes(x = age, y = n)) + stat_smooth( aes(y = total_n) , geom = 'area' , method = 'loess' , span = 1/3 , alpha = .8 , fill = "grey" ) + facet_grid(~SRhealth) + theme_classic() ``` Then add the area for each ordinal level of self-rated health. ```{r} gsoep %>% filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random group_by(age, SRhealth) %>% tally() %>% mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>% group_by(age) %>% mutate(total_n = sum(n)) %>% ggplot(aes(x = age, y = n)) + stat_smooth(aes(y = total_n), geom = 'area', method = 'loess' , span = 1/3, alpha = .8, fill = "grey") + stat_smooth( aes(fill = SRhealth) , geom = 'area' , method = 'loess' , span = 1/3 , alpha = .8 ) + annotate("text", x = 45, y = 3000, label = "Total") + facet_grid(~SRhealth) + theme_classic() + theme(legend.position = "none") ``` Let's not belabor this too much. ```{r} gsoep %>% filter(!is.na(SRhealth) & age >= 18 & age <= 100) %>% # random group_by(age, SRhealth) %>% tally() %>% mutate(SRhealth = factor(SRhealth, 1:5, rev(c("Very good", "Good", "Satisfactory", "Poor", "Bad")))) %>% group_by(age) %>% mutate(total_n = sum(n)) %>% ggplot(aes(x = age, y = n)) + stat_smooth(aes(y = total_n), geom = 'area', method = 'loess' , span = 1/3, alpha = .8, fill = "grey") + stat_smooth( aes(fill = SRhealth) , geom = 'area' , method = 'loess' , span = 1/3 , alpha = .8 ) + scale_x_continuous(limits = c(18, 100), breaks = seq(20, 100, 10)) + scale_fill_viridis_d() + annotate("text", x = 45, y = 3000, label = "Total") + labs( x = "Age (Years)" , y = "Number of People" , title = "Good Self-Rated Health Decreases Across the Lifespan" , subtitle = "But bad decreases less, likely because all-cause sample drop-out" ) + facet_grid(~SRhealth) + theme_classic() + theme(legend.position = "none" , axis.text = element_text(face = "bold", size = rel(1.1)) , axis.title = element_text(face = "bold", size = rel(1.1)) , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5) , plot.subtitle = element_text(face = "italic", size = rel(1), hjust = .5) , strip.background = element_rect(fill = "grey90", color = "black") , strip.text = element_text(face = "bold", size = rel(1.2)) ) ``` # Nested Proportions * Sometimes, the proportions that we want to visualize are more complex and can't just be simply binned * In such cases, there may be hierarchical relationships among the categories * Today, we'll cover two core nested proportion plots: + Mosaic plots + Parallel Sets * To do this, we'll use 2-Digit NACE Industry Sector codes from participants' last reported jobs in the SOEP, which I've broken down into 9 higher-order categories * This is a lot of categories, so we'll further eventually exclude categories that don't have at least 2% of the share of participants ```{r} library(kableExtra) jobs <- read_csv("https://raw.githubusercontent.com/emoriebeck/psc290-data-viz-2022/main/03-week3-proportions/05-job-codes.csv") jobs %>% select(cat, job, old) %>% arrange(cat, old) %>% kable(. , "html" , col.names = c("Category", "Job", "Code") , caption = "2-Digit NACE Industry Sector Codes and Categories") %>% kable_classic(full_width = F, html_font = "Times New Roman") %>% collapse_rows(1, valign = "top") %>% scroll_box(height = "500px") ``` ## Mosaic Plots * Unlike bar charts, mosaic plots allow us to index relative areas, sizes, proportions, etc. relative to two dimensions (so not just amount) * So in our example, this will let us see relative differences within categories vertically and across categories horizontally * To build this, we will finally leave the basic `ggplot2` package and use the `ggmosaic` package * There are other packages, but we'll use this one because (1) it's great and (2) it let's us still use everything we've learned about ggplot But first, the data: ```{r} if(!"ggmosaic" %in% installed.packages()) install.packages("ggmosaic") library(ggmosaic) gsoep_jobs <- gsoep %>% mutate(age_gr = mapvalues(age, 20:99, rep(seq(20, 90, 10), each = 10))) %>% filter(!is.na(age_gr) & age >= 20 & age < 100) %>% group_by(SID) %>% filter(!is.na(job)) %>% filter(age_gr == max(age_gr)) %>% group_by(SID, age_gr) %>% summarize(job = max(job)) %>% ungroup() %>% rename(code = job) %>% left_join(jobs %>% rename(code = old)) %>% group_by(code) %>% filter(n() / nrow(.) >= .02) %>% ungroup() gsoep_jobs ``` * Let's say, for example, that we think that some professions may restrict certain age groups due to experience (younger age groups) or functional limitations (older age groups) * We could look at this simply as a stacked bar chart, but it wouldn't clarify that there are different proportions of people in each job category ```{r} gsoep_jobs %>% ggplot() + geom_mosaic(aes(x = product(age_gr), fill = cat)) + theme_classic() + theme(legend.position = "none") ``` ### Improvements: Let's polish it ```{r} gsoep_jobs %>% ggplot() + geom_mosaic(aes(x = product(age_gr), fill = cat)) + labs( x = "Age Group (Decades)" , title = "There are small changes in category of professions across the lifespan" , subtitle = "Younger adults are more likely to be be in service and sales positions" ) + theme_classic() + theme( legend.position = "none" , axis.text = element_text(face = "bold", size = rel(1.1)) , axis.title = element_text(face = "bold", size = rel(1.2)) , plot.title = element_text(face = "bold", size = rel(1.1), hjust = .5) , plot.subtitle = element_text(face = "italic", size = rel(1.1), hjust = .5) ) ``` ## Treemap * Mosaic plots are sort of just fancy stacked bar plots that let you also index by size * Treemaps are helpful when we have nested categorical (and sometimes, to a lesser degree continuous) variables * We'll use the example of our jobs data, but this could be used for lots of other types of variables + Crossed conditions in an experiment + Intergenerational data + Average scores on variables within categories + Brain activation across broader and narrower brain regions + Political affiliation across states, demographic groups, and more ```{r} gsoep_tm <- gsoep %>% group_by(SID) %>% filter(!is.na(job)) %>% group_by(SID) %>% summarize(job = max(job)) %>% ungroup() %>% rename(code = job) %>% left_join(jobs %>% rename(code = old)) %>% group_by(code, cat, job) %>% tally() %>% ungroup() %>% filter(n/sum(n) > .02) %>% mutate(job = str_wrap(job, 15)) ``` ### Basic Syntax ```{r} if(!"treemapify" %in% installed.packages()) install.packages("treemapify") library(treemapify) gsoep_tm %>% arrange(cat, code) %>% ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) + geom_treemap(color = "grey", size = 3) ``` ### Improvements: Remove Legend and Add Labels ```{r} gsoep_tm %>% arrange(cat, code) %>% ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) + geom_treemap() + geom_treemap_text( colour = "white" , place = "centre" , size = 15 , grow = FALSE ) + theme(legend.position = "none") ``` ### Improvements: Add Subgroup Text ```{r} gsoep_tm %>% arrange(cat, code) %>% ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) + geom_treemap() + geom_treemap_text( colour = "white" , place = "centre" , size = 15 , grow = FALSE ) + geom_treemap_subgroup_text( place = "bottom" , grow = TRUE , alpha = 0.4 , colour = "white" , fontface = "italic" ) + scale_fill_viridis_d() + theme(legend.position = "none") ``` ### Improvements: Color Palette ```{r} gsoep_tm %>% arrange(cat, code) %>% ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) + geom_treemap() + geom_treemap_text( colour = "white" , place = "centre" , size = 15 , grow = FALSE ) + geom_treemap_subgroup_text( place = "bottom" , grow = TRUE , alpha = 0.4 , colour = "white" , fontface = "italic" ) + scale_fill_viridis_d() + theme(legend.position = "none") ``` ### Improvements: Group and Subgroup Borders + Text Color ```{r} gsoep_tm %>% arrange(cat, code) %>% ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) + geom_treemap(color = "grey", size = 3) + geom_treemap_text( colour = c(rep("white", 11), rep("black",4)) , place = "centre" , size = 15 , grow = FALSE ) + geom_treemap_subgroup_text( place = "bottom" , grow = TRUE , alpha = 0.4 , colour = c(rep("white", 11), rep("black",4)) , fontface = "italic" ) + geom_treemap_subgroup_border( colour = "white" , size = 5 ) + scale_fill_viridis_d() + theme(legend.position = "none") ``` ### Improvements: Title ```{r} gsoep_tm %>% arrange(cat, code) %>% ggplot(aes(area = n, fill = cat, label = job, subgroup = cat)) + geom_treemap(color = "grey", size = 3) + geom_treemap_text( colour = c(rep("white", 11), rep("black",4)) , place = "centre" , size = 15 , grow = FALSE ) + geom_treemap_subgroup_text( place = "bottom" , grow = TRUE , alpha = 0.4 , colour = c(rep("white", 11), rep("black",4)) , fontface = "italic" ) + geom_treemap_subgroup_border( colour = "white" , size = 5 ) + scale_fill_viridis_d() + labs(title = "White Collar Public Service, Sales, and\nFinance Jobs Far Outnumber Blue Collar Jobs") + theme(legend.position = "none" , plot.title = element_text(face = "bold", hjust = .5)) ``` # Alternative Exercise If you don't have your own data, you can use some open data from (infoisbeautiful.net)[https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/]. Here's the commands you'll need to load the data (note you will have to authenticate via Google): ```{r} if(!"googlesheets4" %in% installed.packages()) install.packages("googlesheets4") library(googlesheets4) breach_data <- read_sheet("https://docs.google.com/spreadsheets/d/1i0oIJJMRG-7t1GT-mr4smaTTU7988yXVz8nPlwaJ8Xk/edit#gid=2") breach_data ``` Choose one of the visualizations from this week and use these data to create it, also keeping in mind principles and skills from previous weeks. This is not a graded exercise but rather a challenge to create a really compelling visualization using what you've learned.