Visualizing Uncertainty
Visualizing Uncertainty
Visualizing Uncertainty
- Why is visualizing uncertainty important?
- Point estimates are over-emphasized and interval estimates are unemphasized (or ignored)
- Most people misperceive both (1) common uncertainty visualizations and (2) most common uncertainty metrics
- In other words, people make errors about error
- Probability is hard, and most aren’t taught about probability distributions (and more)
Theories of Visualizing Uncertainty
Why do people misperceive uncertainty, and how can we mitigate it?
Quick Side Note: Custom Themes
- We have a lot to cover today, so I’m going to skip over some of the usual “how to start with the basics and make it aesthetically pleasing”
- Instead, we’ll create a custom these that captures some of our usual additions
- This will save us both time and text!
- I highly recommend doing this in your own work
Code
my_theme <- function(){
theme_classic() +
theme(
legend.position = "bottom"
, legend.title = element_text(face = "bold", size = rel(1))
, legend.text = element_text(face = "italic", size = rel(1))
, axis.text = element_text(face = "bold", size = rel(1.1), color = "black")
, axis.title = element_text(face = "bold", size = rel(1.2))
, plot.title = element_text(face = "bold", size = rel(1.2), hjust = .5)
, plot.subtitle = element_text(face = "italic", size = rel(1.2), hjust = .5)
, strip.text = element_text(face = "bold", size = rel(1.1), color = "white")
, strip.background = element_rect(fill = "black")
)
}
Error Bars
- First, let’s examine the usual ways that we show uncertainty around point estimates (e.g., means, model parameter estimates, etc.) using interval estimates (+/- 1 SE/D, confidence interval)
Code
pomp <- function(x) (x - min(x, na.rm = T))/(max(x, na.rm = T) - min(x, na.rm = T))*10
gsoep %>%
filter(year == 2005) %>%
filter(SID %in% sample(unique(.$SID, 500))) %>%
mutate(SRhealth = pomp(SRhealth)) %>%
group_by(SID, age) %>%
mutate(health = rowMeans(cbind(SRhealth, satHealth))) %>%
ungroup() %>%
select(SID, age, health) %>%
drop_na()
Code
gsoep_desc <- gsoep %>%
filter(year == 2005 & age < 30) %>%
filter(SID %in% sample(unique(.$SID), 500)) %>%
mutate(SRhealth = pomp(SRhealth)) %>%
group_by(SID, age) %>%
mutate(health = rowMeans(cbind(SRhealth, satHealth), na.rm = T)) %>%
ungroup() %>%
select(SID, age, health) %>%
drop_na() %>%
mutate(
mean = mean(health)
, sd = sd(health)
, se = sd/sqrt(n())
, ci99 = se*2.576
, ci95 = se*1.96
, ci80 = se*1.282
)
gsoep_desc
Code
gsoep_desc <- gsoep_desc %>%
select(mean:ci80, health, SID) %>%
pivot_longer(
cols = c(-mean, -SID)
, names_to = "measure"
, values_to = "value"
) %>%
mutate(SID = ifelse(measure == "health" | row_number() %in% 1:5, SID, NA)) %>%
drop_na() %>%
mutate(measure = factor(measure, rev(c("health", "sd", "se", "ci99", "ci95", "ci80"))))
gsoep_desc
Code
gsoep_desc %>%
ggplot(aes(y = measure, x = mean)) +
geom_point(size = 3, color = "darkorange3") +
geom_jitter(
data = gsoep_desc %>% filter(measure == "health")
, aes(x = value), alpha = .5, height = .3, width = 0
) +
geom_errorbar(
data = gsoep_desc %>% filter(measure != "health")
, aes(xmin = mean - value, xmax = mean + value)
, width = .1
) +
geom_point(size = 3, color = "darkorange3") +
my_theme()
So What Do We Do?
- Lots of things, not all of which we have time for today. See also:
Outline for Today:
- Proportions / Probability
- icon array
- Point Estimates
- half-eye
- gradient interval
- quantile dotplot
- raincloud
- Animated (sometimes)
- hypothetical outcome plots
- ensemble display
Proportions / Probability
- We already covered proportions and probability, but this one deserves being highlighted itself
- How much of our sample was unmarried?
Code
- We have to trick
ggplot2
into making this figure with a grid
Code
Code
Let’s clean it up:
Code
tibble(
value = c(rep(1, 76), rep(2,24))
, y = rep(1:10, each = 10)
, x = rep(1:10, times = 10)
) %>%
ggplot(aes(x = x, y = y, color = factor(value))) +
geom_point(shape = "square", size = 8) +
scale_color_manual(values = c("lightgrey", "darkblue")) +
labs(title = "24% Remained Unmarried in 2012") +
theme_void() +
theme(legend.position = "none"
, plot.title = element_text(hjust = .5, face = "bold"))
Point Estimates
- Most often, we want to visualize either point esimates or other visualizations of models
- We touched on this a couple of weeks ago when we talked about
broom
andbroom.mixed
- I mentioned then that one of the challenges comes from where the interval estimate comes from, which includes:
- (Frequentist) Standard Errors
- (Frequentist) Confidence Intervals
- Bootstrapped / Profile Confidence Intervals
- Prediction Intervals
- (Bayesian) Credible Intervals
- (Bayesian) Posterior Distributions
- I’ll stay out of Bayes for now :(
Code
gsoep_ex <- gsoep %>%
filter(year == "2000") %>%
select(SID, age, marital, gender) %>%
inner_join(
gsoep %>%
filter(year == "2015") %>%
select(SID, SRhealth)
) %>%
mutate(marital = factor(
marital
, 1:4
, c("Married", "Separated", "Widowed", "Never Married")
), age = age/10
, gender = factor(gender, c(1,2), c("Male", "Female"))) %>%
drop_na()
gsoep_ex
Point Estimates From Model Predictions
Marginal Means from Model Predictions
Point Estimates and Marginal Means
-
stat_halfeye()
: Visual Boundaries -
stat_eye()
: Visual Boundaries -
stat_gradientinterval()
: Visual Semiotics -
stat_dots()
: Frequency Framing -
stat_dotsinterval()
: Frequency Framing -
stat_halfeye()
+stat_dots()
: Visual Boundaries + Frequency Framing
Core syntax
- The benefit of ggdist is that it allows you to use essentially identical syntax to produce lots of different kinds of plots
- All we have to do is swap out the
geom
stat_halfeye()
We can pull the predictions from model terms or marginal means
Model Terms:
Code
Marginal Means:
Code
Let’s do a little hack and create our whole plots except the geom
, so that we can build them with less syntax:
We can pull the predictions from model terms or marginal means
Model Terms:
Code
Marginal Means:
stat_eye()
Model Terms:
Code
Marginal Means:
stat_gradientinterval()
Model Terms:
Code
Marginal Means:
stat_dots()
Model Terms:
Code
Marginal Means:
Code
You can also change the number of dots:
Model Terms:
Code
Marginal Means:
Code
There are also three different layouts
layout = "bin"
:
Code
layout = "weave"
:
Code
layout = "swarm"
:
stat_dotsinterval()
Model Terms:
Code
Marginal Means:
Code
You can apply many of the same arguments as “regular” stat_dots()
Model Terms:
Code
Marginal Means:
stat_halfeye()
+ stat_dots()
Model Terms:
Code
p1 +
stat_dots(
aes(xdist = dist_student_t(df = df.residual(m1), mu = estimate, sigma = std.error))
, quantiles = 50
, side = "bottomleft"
, layout = "swarm"
) +
stat_halfeye(
aes(xdist = dist_student_t(df = df.residual(m1), mu = estimate, sigma = std.error))
) +
labs(subtitle = "`stat_halfeye()`+ `stat_dots()")
Marginal Means:
Code
p2 +
stat_dots(
aes(xdist = dist_student_t(df = df.residual(m1), mu = .fitted, sigma = .se.fit))
, quantiles = 50
, side = "bottomleft"
, layout = "swarm"
) +
stat_halfeye(
aes(xdist = dist_student_t(df = df.residual(m1), mu = .fitted, sigma = .se.fit))
) +
labs(subtitle = "`stat_halfeye()`+ `stat_dots()")
Simple Slopes
Let’s say we want to know if married and unmarried people differ in their self-rated health as a function of their age:
We can plot this using an extension of geom_ribbon()
, stat_lineribbon()
Code
gsoep_ex2 %>%
group_by(marital) %>%
data_grid(age= seq_range(age, n = 101)) %>%
ungroup() %>%
augment(m2, newdata = ., se_fit = T) %>%
ggplot(aes(x = age*10, fill = ordered(marital), color = ordered(marital))) +
stat_lineribbon(
aes(ydist = dist_student_t(df = df.residual(m2), mu = .fitted, sigma = .se.fit)),
alpha = 1/4) +
scale_fill_brewer(palette = "Set2") +
scale_color_brewer(palette = "Dark2") +
labs(x = "Age (Years)", y = "Predicted Self-Rated Health"
, fill = "Marital Status", color = "Marital Status") +
my_theme()
- I promised I wouldn’t go Bayes, but I never promised I wouldn’t bootstrap!
- Knowing how to get bootstrapped confidence and/or prediction intervals is important, especially if you work with any sort of multilevel / hierarchical models
- Getting bootstrapped interval estimates is easy to get point estimates, but it’s a little harder if we want to get it around prediction lines
- (i.e. there’s no great built in funcitons in
R
that get it for you, in my opinion)
- (i.e. there’s no great built in funcitons in
But first, let’s get some longitudinal data that let’s us look at the interaction between marital status and changes in self-rated health within a person as they age:
Code
set.seed(5)
gsoep_ex3 <- gsoep %>%
select(SID, age, marital, gender, SRhealth) %>%
filter(marital %in% c(1,4)) %>%
group_by(SID) %>%
mutate(marital = min(marital, na.rm = T)) %>%
ungroup() %>%
mutate(marital = factor(
marital
, c(1,4)
, c("Married", "Never Married")
), age = age/10
, gender = factor(gender, c(1,2), c("Male", "Female"))) %>%
drop_na()
gsoep_ex3 <- gsoep_ex3 %>%
filter(SID %in% sample(unique(gsoep_ex3$SID), 2000))
The critical term is the interaction between the two:
Changes in health differ across marital groups
- But how?
- Interactions can often be tricky to unpack by point estimates alone
- So we may want to plot separate trajectories for married and unmarried people
Code
predIntlme4 <- function(m, mod_frame, ref){
## get bootstrapped estimates
b <- bootMer(
m
, FUN = function(x) lme4:::predict.merMod(
x
, newdata = mod_frame
, re.form = ref
)
, nsim = 100 # do not use 100 in practice, please
, parallel = "multicore"
, ncpus = 16
)
## get long form bootstrapped draws
b_df <- bind_cols(
mod_frame
, t(b$t) %>%
data.frame()
) %>%
pivot_longer(
cols = c(-age, -marital)
, names_to = "boot"
, values_to = "pred"
)
return(list(boot = b, b_df = b_df))
}
pred_fx_fun <- function(m){
mod_frame <- crossing(
age = seq(min(m@frame$age), max(m@frame$age), .1)
, marital = levels(m@frame$marital)
)
boot <- predIntlme4(m = m, mod_frame = mod_frame, ref = NA)
}
boot3 <- pred_fx_fun(m3)
b_df3 <- boot3$b_df
b3 <- boot3$boot
geom_line()
geom_lineribbon()
: summarized
geom_lineribbon()
bands: summarized
stat_lineribbon()
bands: samples
Code
We can also use a new aesthetic
: fill_ramp
:
geom_lineribbon()
gradient: samples
Code
Let’s clean it up:
Code
ms <- b_df3 %>% filter(age == max(age)) %>% group_by(marital) %>% summarize(m = mean(pred))
b_df3 %>%
ggplot(aes(x = age*10, y = pred, fill = marital, fill_ramp = stat(.width))) +
stat_lineribbon(alpha = .25, .width = ppoints(25)) +
scale_x_continuous(limits = c(15,100), breaks = seq(15,90,15)) +
scale_fill_manual(values = c("grey", "darkorange")) +
annotate("text", label = "Married", x = max(b_df3$age)*10+1, y = ms$m[1], hjust = 0) +
annotate("text", label = "Never\nMarried", x = max(b_df3$age)*10+1, y = ms$m[2], hjust = 0) +
labs(
x = "Age (Years)"
, y = "Predicted Self Rated Health\n(Bootstrapped Interval Estimates)"
, fill = NULL
, title = "Self-Rated Health Declines More Rapidly for Unmarried People"
) +
guides(fill = "none") +
my_theme()
Animated Uncertainty
Ensemble Displays
- Ensemble displays are an alternative to putting hard boundaries around an interval estimate
- Remember that hard boundaries make people interpret categorical differences even when the underlying distribution is continuous
- We’ve already seen this:
- But the challenge with visualizing uncertainty is between inference and understanding
- We need to leverage a knowledge of perception and cognitive processes to help us leverage strengths and overcome weaknesses
- Animating visualizations can help us nudge people to process was they see and update their uncertainty estimates over time
Here’s a quick example:
Let’s break this down:
Now, let’s plot the ribbon:
Code
And add the lines (all of them, it will be ugly):
Code
b_df3 %>%
group_by(marital, age) %>%
median_qi(pred) %>%
ggplot(aes(x = age, y = pred)) +
geom_ribbon(
aes(fill = marital, ymin = .lower, ymax = .upper)
, size = .9, alpha = .5
) +
scale_fill_brewer(palette = "Set2") +
geom_line(
data = b_df3
, aes(group = interaction(marital, boot))
, size = 1
) +
facet_grid(~marital) +
my_theme() +
theme(legend.position = "none")
And finally, use transition_states()
to animate it
Code
b_df3 %>%
group_by(marital, age) %>%
median_qi(pred) %>%
ggplot(aes(x = age, y = pred)) +
geom_ribbon(
aes(fill = marital, ymin = .lower, ymax = .upper)
, size = .9, alpha = .5
) +
scale_fill_brewer(palette = "Set2") +
geom_line(
data = b_df3
, aes(group = interaction(marital, boot))
, size = 1
) +
facet_grid(~marital) +
my_theme() +
theme(legend.position = "none") +
transition_states(boot, 1, 1)
Hypothetical Outcome Plots (HOPs)
- Similarly, hypothetical outcome plots let us see plausible mean estimates among raw data
- Here’s self-rated health (1-5) across married and unmarried individuals:
Code
- Using the
ungeviz
package, we can then use thegeom_vpline()
function to sample from the data across groups and plot the mean from different samples:
Code
And finally, we can animate those samples the transition_states()
function from the gganimate
package again: