Data Manipulation: Intro to dplyr

Author

Emorie D Beck

library(psych)
library(dplyr)

dplyr

The dplyr package is a powerful set of tools from within the larger tidyverse suite of functions. dplyr functions are useful for a variety of functions, perhaps particularly data manipulation.

Although there are a large number of functions within the dplyr package, today I’m going to introduce a subset of them along with a small number of use cases. But don’t worry, these functions will be the key underpinning of code I use in all tutorials going forward, so there will be lots more examples and use cases.

For now, here’s a quick list of the functions we’ll cover today:

  1. %>%: The pipe. Read as “and then.”
  2. filter(): Pick observations (rows) by their values.
  3. select(): Pick variables (columns) by their names.
  4. arrange(): Reorder the rows.
  5. group_by(): Implicitly split the data set by grouping by names (columns).
  6. mutate(): Create new variables with functions of existing variables.
  7. summarize() / summarise(): Collapse many values down to a single summary.

Although each of these functions are powerful alone, they are incredibly powerful in conjunction with one another. So below, I’ll briefly introduce each function, then link them all together using an example of basic data cleaning and summary.

Key dplyr Functions

1. %>%

The pipe %>% is wonderful. It makes coding intuitive. Often in coding, you need to use so-called nested functions. For example, you might want to round a number after taking the square of 43.

sqrt(43)
[1] 6.557439
round(sqrt(43), 2)
[1] 6.56

The issue with this comes whenever we need to do a series of operations on a data set or other type of object. In such cases, if we run it in a single call, then we have to start in the middle and read our way out.

The pipe solves this by allowing you to read from left to right (or top to bottom). The easiest way to think of it is that each call of %>% reads and operates as “and then.” So with the rounded square root of 43, for example:

sqrt(43) %>%
  round(2)
[1] 6.56

As you can see, the two results are the same but the second is slightly easier to follow. And, as you’ll see below, this becomes even more intuitive when you start using it in conjunction with dplyr functions.

2. filter()

Often times, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don’t want to include.

Say for example, that you’re interested personality change in adolescence, but you just opened a survey up online. So when you actually download and examine your data, you realize that you have an age range of something like 3-86, not 12-18. In this case, you want to get rid of the people over 18 – that is, filter() them out.

data(bfi) # grab the bfi data from the psych package
bfi <- bfi %>% as_tibble()

summary(bfi$age) # get age descriptives
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3.00   20.00   26.00   28.78   35.00   86.00 
bfi2 <- bfi %>% # see a pipe!
  filter(age <= 18) # filter to age up to 18

summary(bfi2$age) # summary of the new data 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    3.0    16.0    17.0    16.3    18.0    18.0 

But this isn’t quite right. We still have folks below 12. But, the beauty of filter() is that you can do sequence of OR and AND statements when there is more than one condition, such as up to 18 AND at least 12.

bfi2 <- bfi %>%
  filter(age <= 18 & age >= 12) # filter to age up to 18 and at least 12

summary(bfi2$age) # summary of the new data 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   12.0    16.0    17.0    16.4    18.0    18.0 

Got it!

But filter works for more use cases than just conditional <, >, <=, and >=. It can also be used for cases where we want a single values to match cases with text. Before I demonstrate that, though, I need to convert one of the variables in the bfi data frame to a string. So let’s change gender (1 = male, 2 = female) to text (we’ll get into factors later).

bfi$education <- plyr::mapvalues(bfi$education, 1:5, c("Below HS", "HS", "Some College", "College", "Higher Degree"))

Now let’s try a few things:

1. Create a data set with only individuals with some college (==).

bfi2 <- bfi %>% 
  filter(education == "Some College")
unique(bfi2$education)
[1] "Some College"

2. Create a data set with only people age 18 (==).

bfi2 <- bfi %>%
  filter(age == 18)
summary(bfi2$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     18      18      18      18      18      18 

3. Create a data set with individuals with some college or above (%in%).

bfi2 <- bfi %>%
  filter(education %in% c("Some College", "College", "Higher Degree"))
unique(bfi2$education)
[1] "Some College"  "Higher Degree" "College"      

The %in% operator is wonderful. Instead of comparing a column to a single value, you can compare it to several. So above, when we wanted ages between 12 and 18, we could have done:

bfi2 <- bfi %>%
  filter(age %in% 12:18)
summary(bfi2$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   12.0    16.0    17.0    16.4    18.0    18.0 

I’ve been using dplyr for nearly five years, and I still have to remind myself that when you want to remove rows, you use filter().

3. select()

If filter() is for pulling certain observations (rows), then select() is for pulling certain variables (columns). Almost without fail, any data that are received for collected are going to have some variables that are not used, not useful, extraneous, etc. In such cases, it’s good practice to remove these columns to stop your environment from becoming cluttered and eating up your RAM.

In our bfi data, most of these have been pre-removed, so instead, we’ll imagine we don’t want to use any indicators of Agreeableness (A1-A5) and that we aren’t interested in gender.

With select(), there are few ways choose variables. We can bare quote name the ones we want to keep, bare quote names we want to remove, or use any of a number of select() helper functions.

1. Bare quote columns we want to keep:

bfi %>%
  select(C1, C2, C3, C4, C5)
# A tibble: 2,800 × 5
      C1    C2    C3    C4    C5
   <int> <int> <int> <int> <int>
 1     2     3     3     4     4
 2     5     4     4     3     4
 3     4     5     4     2     5
 4     4     4     3     5     5
 5     4     4     5     3     2
 6     6     6     6     1     3
 7     5     4     4     2     3
 8     3     2     4     2     4
 9     6     6     3     4     5
10     6     5     6     2     1
# ℹ 2,790 more rows

I’m going to stop there because I don’t want to name the additional 17 columns we want to keep. Instead we’ll use : to grab a range of columns.

bfi %>%
  select(C1:O5, education, age)
# A tibble: 2,800 × 22
      C1    C2    C3    C4    C5    E1    E2    E3    E4    E5    N1    N2    N3
   <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1     2     3     3     4     4     3     3     3     4     4     3     4     2
 2     5     4     4     3     4     1     1     6     4     3     3     3     3
 3     4     5     4     2     5     2     4     4     4     5     4     5     4
 4     4     4     3     5     5     5     3     4     4     4     2     5     2
 5     4     4     5     3     2     2     2     5     4     5     2     3     4
 6     6     6     6     1     3     2     1     6     5     6     3     5     2
 7     5     4     4     2     3     4     3     4     5     5     1     2     2
 8     3     2     4     2     4     3     6     4     2     1     6     3     2
 9     6     6     3     4     5     5     3    NA     4     3     5     5     2
10     6     5     6     2     1     2     2     4     5     5     5     5     5
# ℹ 2,790 more rows
# ℹ 9 more variables: N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>,
#   O4 <int>, O5 <int>, education <chr>, age <int>

2. Bare quote columns we don’t want to keep:

bfi %>% 
  select(-(A1:A5), -gender)
# A tibble: 2,800 × 22
      C1    C2    C3    C4    C5    E1    E2    E3    E4    E5    N1    N2    N3
   <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1     2     3     3     4     4     3     3     3     4     4     3     4     2
 2     5     4     4     3     4     1     1     6     4     3     3     3     3
 3     4     5     4     2     5     2     4     4     4     5     4     5     4
 4     4     4     3     5     5     5     3     4     4     4     2     5     2
 5     4     4     5     3     2     2     2     5     4     5     2     3     4
 6     6     6     6     1     3     2     1     6     5     6     3     5     2
 7     5     4     4     2     3     4     3     4     5     5     1     2     2
 8     3     2     4     2     4     3     6     4     2     1     6     3     2
 9     6     6     3     4     5     5     3    NA     4     3     5     5     2
10     6     5     6     2     1     2     2     4     5     5     5     5     5
# ℹ 2,790 more rows
# ℹ 9 more variables: N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>,
#   O4 <int>, O5 <int>, education <chr>, age <int>

Note the () around the columns. That is necessary when you want to remove a range of columns.

3. Add or remove using select() helper functions.

  • starts_with(): matches names that begin with quoted argument. For example, if we wanted all the Conscientiousness items, we could call the following:
bfi %>%
  select(starts_with("C"))
# A tibble: 2,800 × 5
      C1    C2    C3    C4    C5
   <int> <int> <int> <int> <int>
 1     2     3     3     4     4
 2     5     4     4     3     4
 3     4     5     4     2     5
 4     4     4     3     5     5
 5     4     4     5     3     2
 6     6     6     6     1     3
 7     5     4     4     2     3
 8     3     2     4     2     4
 9     6     6     3     4     5
10     6     5     6     2     1
# ℹ 2,790 more rows
  • ends_with(): matches names that end with quoted argument. For example, if we wanted the first item in each Big Five scale, we could call:
bfi %>% 
  select(ends_with("1"))
# A tibble: 2,800 × 5
      A1    C1    E1    N1    O1
   <int> <int> <int> <int> <int>
 1     2     2     3     3     3
 2     2     5     1     3     4
 3     5     4     2     4     4
 4     4     4     5     2     3
 5     2     4     2     2     3
 6     6     6     2     3     4
 7     2     5     4     1     5
 8     4     3     3     6     3
 9     4     6     5     5     6
10     2     6     2     5     5
# ℹ 2,790 more rows
  • contains(): matches names that contain quote material. This can be any subset of a string, which makes it very useful for a number of contexts we’ll see later. But for now, if I wanted to be lazy or couldn’t remember the name of th education variable, I could call:
bfi %>% 
  select(contains("edu"))
# A tibble: 2,800 × 1
   education   
   <chr>       
 1 <NA>        
 2 <NA>        
 3 <NA>        
 4 <NA>        
 5 <NA>        
 6 Some College
 7 <NA>        
 8 HS          
 9 Below HS    
10 <NA>        
# ℹ 2,790 more rows
  • matches(): selects variables that match a regular expression (regex). Regex is tricky. I tend to end up referencing online documentation when I need to use this beyond a few basic expressions that I use very regularly. We’ll start with a simple one, keeping only those variables that either have or do not have numbers:
# contains numbers
bfi %>%
  select(matches("[0-9]")) 
# A tibble: 2,800 × 25
      A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2    E3
   <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1     2     4     3     4     4     2     3     3     4     4     3     3     3
 2     2     4     5     2     5     5     4     4     3     4     1     1     6
 3     5     4     5     4     4     4     5     4     2     5     2     4     4
 4     4     4     6     5     5     4     4     3     5     5     5     3     4
 5     2     3     3     4     5     4     4     5     3     2     2     2     5
 6     6     6     5     6     5     6     6     6     1     3     2     1     6
 7     2     5     5     3     5     5     4     4     2     3     4     3     4
 8     4     3     1     5     1     3     2     4     2     4     3     6     4
 9     4     3     6     3     3     6     6     3     4     5     5     3    NA
10     2     5     6     6     5     6     5     6     2     1     2     2     4
# ℹ 2,790 more rows
# ℹ 12 more variables: E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>,
#   N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>
# does not contain numbers
bfi %>%
  select(!matches("[0-9]")) 
# A tibble: 2,800 × 3
   gender education      age
    <int> <chr>        <int>
 1      1 <NA>            16
 2      2 <NA>            18
 3      2 <NA>            17
 4      2 <NA>            17
 5      1 <NA>            17
 6      2 Some College    21
 7      1 <NA>            18
 8      1 HS              19
 9      1 Below HS        19
10      2 <NA>            17
# ℹ 2,790 more rows
  • num_range(): Given a stem and a range of numbers, this selects items in a sequence. This is especially useful when variables of your data set may not be in order.
# select first 2 Extraversion items
bfi %>%
  select(num_range("E", 1:2))
# A tibble: 2,800 × 2
      E1    E2
   <int> <int>
 1     3     3
 2     1     1
 3     2     4
 4     5     3
 5     2     2
 6     2     1
 7     4     3
 8     3     6
 9     5     3
10     2     2
# ℹ 2,790 more rows
  • one_of(): select any of a subset of items from a vector. This is one of my favorites, for reasons we’ll see in my tutorial on workflow and data documentation. But for now, let’s say I thought there were six items in each personality when there are actually five. So when I call the following, one_of() will be forgiving and ignore the fact that I messed up.
bfi %>% 
  select(one_of(paste0("E", 1:6)))
# A tibble: 2,800 × 5
      E1    E2    E3    E4    E5
   <int> <int> <int> <int> <int>
 1     3     3     3     4     4
 2     1     1     6     4     3
 3     2     4     4     4     5
 4     5     3     4     4     4
 5     2     2     5     4     5
 6     2     1     6     5     6
 7     4     3     4     5     5
 8     3     6     4     2     1
 9     5     3    NA     4     3
10     2     2     4     5     5
# ℹ 2,790 more rows
  • all_of(): select all of a subset of items from a vector. Unlike one_of(), all_of() is less forgiving and will throw an error if we try to call for 6 Extraversion items.
bfi %>%
  select(all_of(paste0("E", 1:6)))
Error in `all_of()`:
! Can't subset columns that don't exist.
✖ Column `E6` doesn't exist.

Oops. In this case, we’d then need to modify the code to reflect the correct number of items.

bfi %>%
  select(all_of(paste0("E", 1:5)))
# A tibble: 2,800 × 5
      E1    E2    E3    E4    E5
   <int> <int> <int> <int> <int>
 1     3     3     3     4     4
 2     1     1     6     4     3
 3     2     4     4     4     5
 4     5     3     4     4     4
 5     2     2     5     4     5
 6     2     1     6     5     6
 7     4     3     4     5     5
 8     3     6     4     2     1
 9     5     3    NA     4     3
10     2     2     4     5     5
# ℹ 2,790 more rows

4. arrange()

Sometimes, either in order to get a better sense of our data or in order to well, order our data, we want to sort it. Although there is a base R sort() function, the arrange() function is tidyverse version that plays nicely with other tidyverse functions.

So in our previous examples, we could also arrange() our data by age or education, rather than simply filtering. (Or as we’ll see later, we can do both!)

# sort by age
bfi %>% 
  select(gender:age) %>%
  arrange(age)
# A tibble: 2,800 × 3
   gender education       age
    <int> <chr>         <int>
 1      1 Higher Degree     3
 2      2 <NA>              9
 3      2 Some College     11
 4      2 <NA>             11
 5      2 <NA>             11
 6      2 <NA>             12
 7      2 <NA>             12
 8      2 <NA>             12
 9      2 <NA>             12
10      1 <NA>             12
# ℹ 2,790 more rows
# sort by education
bfi %>%
  select(gender:age) %>%
  arrange(education)
# A tibble: 2,800 × 3
   gender education   age
    <int> <chr>     <int>
 1      1 Below HS     19
 2      1 Below HS     21
 3      1 Below HS     17
 4      1 Below HS     18
 5      1 Below HS     18
 6      2 Below HS     18
 7      2 Below HS     43
 8      1 Below HS     32
 9      1 Below HS     18
10      2 Below HS     18
# ℹ 2,790 more rows

We can also arrange by multiple columns, like if we wanted to sort by gender then education:

bfi %>%
  select(gender:age) %>%
  arrange(gender, education)
# A tibble: 2,800 × 3
   gender education   age
    <int> <chr>     <int>
 1      1 Below HS     19
 2      1 Below HS     21
 3      1 Below HS     17
 4      1 Below HS     18
 5      1 Below HS     18
 6      1 Below HS     32
 7      1 Below HS     18
 8      1 Below HS     18
 9      1 Below HS     53
10      1 Below HS     18
# ℹ 2,790 more rows

Bringing it all together: Split-Apply-Combine

Much of the power of dplyr functions lay in the split-apply-combine method. The method is kind of what it sounds like. A given set of of data are split into smaller chunks, then a function or series of functions are applied to each chunk, and then the chunks are combined back together.

Although all of the dplyr functions can be used in conjunction with one another, I’m going to highlight the group_by(), mutate(), and summarize() / summarise() functions to highlight the core of the split-apply-combine method.

5. group_by()

The group_by() function is the “split” of the method. It basically implicitly breaks the data set into chunks by whatever bare quoted column(s)/variable(s) are supplied as arguments.

So imagine that we wanted to group_by() education levels to get average ages at each level. We would simply call:

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education)
# A tibble: 2,800 × 8
# Groups:   education [6]
      C1    C2    C3    C4    C5   age gender education   
   <int> <int> <int> <int> <int> <int>  <int> <chr>       
 1     2     3     3     4     4    16      1 <NA>        
 2     5     4     4     3     4    18      2 <NA>        
 3     4     5     4     2     5    17      2 <NA>        
 4     4     4     3     5     5    17      2 <NA>        
 5     4     4     5     3     2    17      1 <NA>        
 6     6     6     6     1     3    21      2 Some College
 7     5     4     4     2     3    18      1 <NA>        
 8     3     2     4     2     4    19      1 HS          
 9     6     6     3     4     5    19      1 Below HS    
10     6     5     6     2     1    17      2 <NA>        
# ℹ 2,790 more rows

We can now see that it tells us that we have a tibble with 2,800 rows and 8 columns as well as Groups: education [6]

Importantly, once you group, you must ungroup() or your data frame will remain “split” and cause you problems. In other words, you must “combine” your data frame back together. This is super easy with the ungroup() function:

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education) %>%
  ungroup()
# A tibble: 2,800 × 8
      C1    C2    C3    C4    C5   age gender education   
   <int> <int> <int> <int> <int> <int>  <int> <chr>       
 1     2     3     3     4     4    16      1 <NA>        
 2     5     4     4     3     4    18      2 <NA>        
 3     4     5     4     2     5    17      2 <NA>        
 4     4     4     3     5     5    17      2 <NA>        
 5     4     4     5     3     2    17      1 <NA>        
 6     6     6     6     1     3    21      2 Some College
 7     5     4     4     2     3    18      1 <NA>        
 8     3     2     4     2     4    19      1 HS          
 9     6     6     3     4     5    19      1 Below HS    
10     6     5     6     2     1    17      2 <NA>        
# ℹ 2,790 more rows

You can also overwrite groups by calling group_by() more than once. We’ll touch more on that in future tutorials, but for now, notice what happens when I call group_by() twice sequentially:

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education) %>%
  group_by(gender, age)
# A tibble: 2,800 × 8
# Groups:   gender, age [115]
      C1    C2    C3    C4    C5   age gender education   
   <int> <int> <int> <int> <int> <int>  <int> <chr>       
 1     2     3     3     4     4    16      1 <NA>        
 2     5     4     4     3     4    18      2 <NA>        
 3     4     5     4     2     5    17      2 <NA>        
 4     4     4     3     5     5    17      2 <NA>        
 5     4     4     5     3     2    17      1 <NA>        
 6     6     6     6     1     3    21      2 Some College
 7     5     4     4     2     3    18      1 <NA>        
 8     3     2     4     2     4    19      1 HS          
 9     6     6     3     4     5    19      1 Below HS    
10     6     5     6     2     1    17      2 <NA>        
# ℹ 2,790 more rows

Note that the resulting data frame is not grouped by education at all, but by gender and age (i.e. it is not cumulative).

6. mutate()

The mutate() function is one of a few options for how to “apply” (a) function(s) to your split (i.e. group_by()) data frame. When you use mutate(), the resulting data frame will have the same number of rows you started with (which will not be true with summarize() / summarise()). One way to remember is this is that you are directly mutating the existing data frame, either modifying existing columns or creating new ones.

So to continue with the example above, if we were to add a column that indicated average age levels within each age group, we would call:

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  group_by(education) %>% 
  mutate(age_by_edu = mean(age, na.rm = T))
# A tibble: 2,800 × 9
# Groups:   education [6]
      C1    C2    C3    C4    C5   age gender education age_by_edu
   <int> <int> <int> <int> <int> <int>  <int> <chr>          <dbl>
 1     6     6     3     4     5    19      1 Below HS        25.1
 2     4     3     5     3     2    21      1 Below HS        25.1
 3     5     5     5     2     2    17      1 Below HS        25.1
 4     5     5     4     1     1    18      1 Below HS        25.1
 5     4     5     4     3     3    18      1 Below HS        25.1
 6     3     2     3     4     6    18      2 Below HS        25.1
 7     3     6     3     1     3    43      2 Below HS        25.1
 8     5     3     2     4     6    32      1 Below HS        25.1
 9     5     5     4     3     4    18      1 Below HS        25.1
10     4     5     5     2     3    18      2 Below HS        25.1
# ℹ 2,790 more rows

As you can see in the resulting data frame, each person (row) with the same education level has the same value in the new age_by_edu column I just added.

mutate() is also super useful even when you aren’t grouping. For example, if I wanted to recode gender so that 1 = “male” and 2 = “female,” we could do that like:

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  mutate(gender_cat = plyr::mapvalues(gender, c(1,2), c("Male", "Female")))
# A tibble: 2,800 × 9
      C1    C2    C3    C4    C5   age gender education    gender_cat
   <int> <int> <int> <int> <int> <int>  <int> <chr>        <chr>     
 1     2     3     3     4     4    16      1 <NA>         Male      
 2     5     4     4     3     4    18      2 <NA>         Female    
 3     4     5     4     2     5    17      2 <NA>         Female    
 4     4     4     3     5     5    17      2 <NA>         Female    
 5     4     4     5     3     2    17      1 <NA>         Male      
 6     6     6     6     1     3    21      2 Some College Female    
 7     5     4     4     2     3    18      1 <NA>         Male      
 8     3     2     4     2     4    19      1 HS           Male      
 9     6     6     3     4     5    19      1 Below HS     Male      
10     6     5     6     2     1    17      2 <NA>         Female    
# ℹ 2,790 more rows

We could also just write over the original gender category like:

bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  mutate(gender = plyr::mapvalues(gender, c(1,2), c("Male", "Female")))
# A tibble: 2,800 × 8
      C1    C2    C3    C4    C5   age gender education   
   <int> <int> <int> <int> <int> <int> <chr>  <chr>       
 1     2     3     3     4     4    16 Male   <NA>        
 2     5     4     4     3     4    18 Female <NA>        
 3     4     5     4     2     5    17 Female <NA>        
 4     4     4     3     5     5    17 Female <NA>        
 5     4     4     5     3     2    17 Male   <NA>        
 6     6     6     6     1     3    21 Female Some College
 7     5     4     4     2     3    18 Male   <NA>        
 8     3     2     4     2     4    19 Male   HS          
 9     6     6     3     4     5    19 Male   Below HS    
10     6     5     6     2     1    17 Female <NA>        
# ℹ 2,790 more rows

7. summarize() / summarise()

The summarize() / summarise() functions (choose your spelling as you will) is another of the options for how to “apply” (a) function(s) to your split (i.e. group_by()) data frame. When you use summarize() (I made by choice), the resulting data frame will have the number of rows equal to the number of group_by() categories you provide. So if you provided education, that will be 6, and if you provided none, it would be one.

# group_by() education
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  group_by(education) %>% 
  summarize(age_by_edu = mean(age, na.rm = T))  
# A tibble: 6 × 2
  education     age_by_edu
  <chr>              <dbl>
1 Below HS            25.1
2 College             33.0
3 HS                  31.5
4 Higher Degree       35.3
5 Some College        27.2
6 <NA>                18.0
# no groups  
bfi %>% 
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  summarize(age_by_edu = mean(age, na.rm = T))  
# A tibble: 1 × 1
  age_by_edu
       <dbl>
1       28.8

From this, for example, it becomes clear that all the NAs in the education variable were because they reflected 18 year olds who had not yet completed high school as well as that the sample is pretty young overall.

Conclusion

This has been a quick and relatively low level introduction into some of the core functions in the dplyr package. I challenge you to try to take what you learned and apply it to some of your own data.

In the next tutorial, we will touch on data wrangling, introducing the tidyr packages. This will provide methods for changing the shape for your data, which, in turn, will open up new opportunities to use dplyr functions in useful ways. We will then take all of that and roll it into additional packages and tools.