Data Manipulation: Intro to dplyr
library(psych)
library(dplyr)
dplyr
The dplyr
package is a powerful set of tools from within the larger tidyverse
suite of functions. dplyr
functions are useful for a variety of tasks, perhaps particularly data manipulation.
Although there are a large number of functions within the dplyr
package, today I’m going to introduce a subset of them along with a small number of use cases. But don’t worry, these functions will be the key underpinning of code I use in all tutorials going forward, so there will be lots more examples and use cases.
For now, here’s a quick list of the functions we’ll cover today:
%>%
: The pipe. Read as “and then.”
filter()
: Pick observations (rows) by their values.
select()
: Pick variables (columns) by their names.
arrange()
: Reorder the rows.
group_by()
: Implicitly split the data set by grouping by names (columns).
mutate()
: Create new variables with functions of existing variables.
summarize()
/summarise()
: Collapse many values down to a single summary.
Although each of these functions are powerful alone, they are incredibly powerful in conjunction with one another. So below, I’ll briefly introduce each function, then link them all together using an example of basic data cleaning and summary.
Key dplyr
Functions
1. %>%
The pipe %>%
is wonderful. It makes coding intuitive. Often in coding, you need to use so-called nested functions. For example, you might want to round the square root of 43.
sqrt(43)
[1] 6.557439
round(sqrt(43), 2)
[1] 6.56
The issue with this comes whenever we need to do a series of operations on a data set or other type of object. In such cases, if we run it in a single call, then we have to start in the middle and read our way out.
The pipe solves this by allowing you to read from left to right (or top to bottom). The easiest way to think of it is that each call of %>%
reads and operates as “and then.” So with the rounded square root of 43, for example:
sqrt(43) %>%
round(2)
[1] 6.56
As you can see, the two results are the same but the second is slightly easier to follow. And, as you’ll see below, this becomes even more intuitive when you start using it in conjunction with dplyr
functions.
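For a data-flavored version of the same idea, here is a quick sketch using the built-in mtcars data (an extra illustration of my own, not one of the bfi examples that follow):
# nested: read from the inside out
round(mean(sqrt(mtcars$mpg)), 2)
# piped: read from left to right, "and then"
mtcars$mpg %>%
  sqrt() %>%
  mean() %>%
  round(2)
Both versions return the same value; the piped one simply reads in the order the operations happen.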
2. filter()
Oftentimes, when conducting research (experiments or otherwise), there are observations (people, specific trials, etc.) that you don’t want to include.
Say, for example, that you’re interested in personality change in adolescence, but you just opened a survey up online. So when you actually download and examine your data, you realize that you have an age range of something like 3-86, not 12-18. In this case, you want to get rid of the people over 18 – that is, filter()
them out.
data(bfi) # grab the bfi data from the psych package
bfi <- bfi %>% as_tibble()
summary(bfi$age) # get age descriptives
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.00 20.00 26.00 28.78 35.00 86.00
bfi2 <- bfi %>% # see a pipe!
  filter(age <= 18) # filter to age up to 18
summary(bfi2$age) # summary of the new data
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.0 16.0 17.0 16.3 18.0 18.0
But this isn’t quite right. We still have folks below 12. Luckily, the beauty of filter()
is that you can do a sequence of OR
and AND
statements when there is more than one condition, such as up to 18 AND
at least 12.
bfi2 <- bfi %>%
  filter(age <= 18 & age >= 12) # filter to age up to 18 and at least 12
summary(bfi2$age) # summary of the new data
Min. 1st Qu. Median Mean 3rd Qu. Max.
12.0 16.0 17.0 16.4 18.0 18.0
Got it!
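For completeness, here is a quick sketch of the OR side (output not shown in the original): the | operator keeps a row if either condition is true, so this grabs exactly the folks we just filtered out.
bfi2 <- bfi %>%
  filter(age < 12 | age > 18) # keep only those under 12 OR over 18
summary(bfi2$age)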
But filter works for more use cases than just conditional <
, >
, <=
, and >=
. It can also be used for cases where we want values to match text. Before I demonstrate that, though, I need to convert one of the variables in the bfi
data frame to a string. So let’s change education (coded 1-5) to text labels (we’ll get into factors later).
bfi$education <- plyr::mapvalues(bfi$education, 1:5, c("Below HS", "HS", "Some College", "College", "Higher Degree"))
Now let’s try a few things:
1. Create a data set with only individuals with some college (==
).
bfi2 <- bfi %>%
  filter(education == "Some College")
unique(bfi2$education)
[1] "Some College"
2. Create a data set with only people age 18 (==
).
bfi2 <- bfi %>%
  filter(age == 18)
summary(bfi2$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18 18 18 18 18 18
3. Create a data set with individuals with some college or above (%in%
).
bfi2 <- bfi %>%
  filter(education %in% c("Some College", "College", "Higher Degree"))
unique(bfi2$education)
[1] "Some College" "Higher Degree" "College"
The %in%
operator is wonderful. Instead of comparing a column to a single value, you can compare it to several. So above, when we wanted ages between 12 and 18, we could have done:
bfi2 <- bfi %>%
  filter(age %in% 12:18)
summary(bfi2$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
12.0 16.0 17.0 16.4 18.0 18.0
I’ve been using dplyr
for nearly five years, and I still have to remind myself that when you want to remove rows, you use filter()
.
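As one more small sketch of that point (an extra illustration, not from the examples above), removing rows with missing values is also a filter() job:
bfi %>%
  filter(!is.na(education)) # drop rows where education is missing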
3. select()
If filter()
is for pulling certain observations (rows), then select()
is for pulling certain variables (columns). Almost without fail, any data that you receive or collect are going to have some variables that are not used, not useful, or otherwise extraneous. In such cases, it’s good practice to remove these columns to stop your environment from becoming cluttered and eating up your RAM.
In our bfi
data, most of these have been pre-removed, so instead, we’ll imagine we don’t want to use any indicators of Agreeableness (A1-A5) and that we aren’t interested in gender.
With select()
, there are a few ways to choose variables. We can bare quote the names we want to keep, bare quote the names we want to remove, or use any of a number of select()
helper functions.
1. Bare quote columns we want to keep:
bfi %>%
  select(C1, C2, C3, C4, C5)
# A tibble: 2,800 × 5
C1 C2 C3 C4 C5
<int> <int> <int> <int> <int>
1 2 3 3 4 4
2 5 4 4 3 4
3 4 5 4 2 5
4 4 4 3 5 5
5 4 4 5 3 2
6 6 6 6 1 3
7 5 4 4 2 3
8 3 2 4 2 4
9 6 6 3 4 5
10 6 5 6 2 1
# ℹ 2,790 more rows
I’m going to stop there because I don’t want to name the additional 17 columns we want to keep. Instead we’ll use :
to grab a range of columns.
bfi %>%
  select(C1:O5, education, age)
# A tibble: 2,800 × 22
C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 2 3 3 4 4 3 3 3 4 4 3 4 2
2 5 4 4 3 4 1 1 6 4 3 3 3 3
3 4 5 4 2 5 2 4 4 4 5 4 5 4
4 4 4 3 5 5 5 3 4 4 4 2 5 2
5 4 4 5 3 2 2 2 5 4 5 2 3 4
6 6 6 6 1 3 2 1 6 5 6 3 5 2
7 5 4 4 2 3 4 3 4 5 5 1 2 2
8 3 2 4 2 4 3 6 4 2 1 6 3 2
9 6 6 3 4 5 5 3 NA 4 3 5 5 2
10 6 5 6 2 1 2 2 4 5 5 5 5 5
# ℹ 2,790 more rows
# ℹ 9 more variables: N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>,
# O4 <int>, O5 <int>, education <chr>, age <int>
2. Bare quote columns we don’t want to keep:
bfi %>%
  select(-(A1:A5), -gender)
# A tibble: 2,800 × 22
C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 2 3 3 4 4 3 3 3 4 4 3 4 2
2 5 4 4 3 4 1 1 6 4 3 3 3 3
3 4 5 4 2 5 2 4 4 4 5 4 5 4
4 4 4 3 5 5 5 3 4 4 4 2 5 2
5 4 4 5 3 2 2 2 5 4 5 2 3 4
6 6 6 6 1 3 2 1 6 5 6 3 5 2
7 5 4 4 2 3 4 3 4 5 5 1 2 2
8 3 2 4 2 4 3 6 4 2 1 6 3 2
9 6 6 3 4 5 5 3 NA 4 3 5 5 2
10 6 5 6 2 1 2 2 4 5 5 5 5 5
# ℹ 2,790 more rows
# ℹ 9 more variables: N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>,
# O4 <int>, O5 <int>, education <chr>, age <int>
Note the ()
around the columns. That is necessary when you want to remove a range of columns.
3. Add or remove using select()
helper functions.
starts_with()
: matches names that begin with the quoted argument. For example, if we wanted all the Conscientiousness items, we could call the following:
bfi %>%
  select(starts_with("C"))
# A tibble: 2,800 × 5
C1 C2 C3 C4 C5
<int> <int> <int> <int> <int>
1 2 3 3 4 4
2 5 4 4 3 4
3 4 5 4 2 5
4 4 4 3 5 5
5 4 4 5 3 2
6 6 6 6 1 3
7 5 4 4 2 3
8 3 2 4 2 4
9 6 6 3 4 5
10 6 5 6 2 1
# ℹ 2,790 more rows
ends_with()
: matches names that end with the quoted argument. For example, if we wanted the first item in each Big Five scale, we could call:
bfi %>%
  select(ends_with("1"))
# A tibble: 2,800 × 5
A1 C1 E1 N1 O1
<int> <int> <int> <int> <int>
1 2 2 3 3 3
2 2 5 1 3 4
3 5 4 2 4 4
4 4 4 5 2 3
5 2 4 2 2 3
6 6 6 2 3 4
7 2 5 4 1 5
8 4 3 3 6 3
9 4 6 5 5 6
10 2 6 2 5 5
# ℹ 2,790 more rows
contains()
: matches names that contain the quoted string. This can be any subset of a string, which makes it very useful in a number of contexts we’ll see later. But for now, if I wanted to be lazy or couldn’t remember the name of the education variable, I could call:
bfi %>%
  select(contains("edu"))
# A tibble: 2,800 × 1
education
<chr>
1 <NA>
2 <NA>
3 <NA>
4 <NA>
5 <NA>
6 Some College
7 <NA>
8 HS
9 Below HS
10 <NA>
# ℹ 2,790 more rows
matches()
: selects variables that match a regular expression (regex). Regex is tricky. I tend to end up referencing online documentation when I need to use this beyond a few basic expressions that I use very regularly. We’ll start with a simple one, keeping only those variables that either have or do not have numbers:
# contains numbers
bfi %>%
  select(matches("[0-9]"))
# A tibble: 2,800 × 25
A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 2 4 3 4 4 2 3 3 4 4 3 3 3
2 2 4 5 2 5 5 4 4 3 4 1 1 6
3 5 4 5 4 4 4 5 4 2 5 2 4 4
4 4 4 6 5 5 4 4 3 5 5 5 3 4
5 2 3 3 4 5 4 4 5 3 2 2 2 5
6 6 6 5 6 5 6 6 6 1 3 2 1 6
7 2 5 5 3 5 5 4 4 2 3 4 3 4
8 4 3 1 5 1 3 2 4 2 4 3 6 4
9 4 3 6 3 3 6 6 3 4 5 5 3 NA
10 2 5 6 6 5 6 5 6 2 1 2 2 4
# ℹ 2,790 more rows
# ℹ 12 more variables: E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>,
# N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>
# does not contain numbers
bfi %>%
  select(!matches("[0-9]"))
# A tibble: 2,800 × 3
gender education age
<int> <chr> <int>
1 1 <NA> 16
2 2 <NA> 18
3 2 <NA> 17
4 2 <NA> 17
5 1 <NA> 17
6 2 Some College 21
7 1 <NA> 18
8 1 HS 19
9 1 Below HS 19
10 2 <NA> 17
# ℹ 2,790 more rows
num_range()
: Given a stem and a range of numbers, this selects items in a sequence. This is especially useful when variables of your data set may not be in order.
# select first 2 Extraversion items
bfi %>%
  select(num_range("E", 1:2))
# A tibble: 2,800 × 2
E1 E2
<int> <int>
1 3 3
2 1 1
3 2 4
4 5 3
5 2 2
6 2 1
7 4 3
8 3 6
9 5 3
10 2 2
# ℹ 2,790 more rows
one_of()
: select any of a subset of items from a vector. This is one of my favorites, for reasons we’ll see in my tutorial on workflow and data documentation. But for now, let’s say I thought there were six items in each personality scale when there are actually five. So when I call the following, one_of()
will be forgiving and ignore the fact that I messed up.
bfi %>%
  select(one_of(paste0("E", 1:6)))
# A tibble: 2,800 × 5
E1 E2 E3 E4 E5
<int> <int> <int> <int> <int>
1 3 3 3 4 4
2 1 1 6 4 3
3 2 4 4 4 5
4 5 3 4 4 4
5 2 2 5 4 5
6 2 1 6 5 6
7 4 3 4 5 5
8 3 6 4 2 1
9 5 3 NA 4 3
10 2 2 4 5 5
# ℹ 2,790 more rows
all_of()
: select all of a subset of items from a vector. Unlike one_of()
, all_of()
is less forgiving and will throw an error if we try to call for 6 Extraversion items.
bfi %>%
  select(all_of(paste0("E", 1:6)))
Error in `all_of()`:
! Can't subset columns that don't exist.
✖ Column `E6` doesn't exist.
Oops. In this case, we’d then need to modify the code to reflect the correct number of items.
bfi %>%
  select(all_of(paste0("E", 1:5)))
# A tibble: 2,800 × 5
E1 E2 E3 E4 E5
<int> <int> <int> <int> <int>
1 3 3 3 4 4
2 1 1 6 4 3
3 2 4 4 4 5
4 5 3 4 4 4
5 2 2 5 4 5
6 2 1 6 5 6
7 4 3 4 5 5
8 3 6 4 2 1
9 5 3 NA 4 3
10 2 2 4 5 5
# ℹ 2,790 more rows
4. arrange()
Sometimes, either in order to get a better sense of our data or in order to, well, order our data, we want to sort it. Although there is a base R
sort()
function, the arrange()
function is the tidyverse version that plays nicely with other tidyverse functions.
So in our previous examples, we could also arrange()
our data by age or education, rather than simply filtering. (Or as we’ll see later, we can do both!)
# sort by age
bfi %>%
  select(gender:age) %>%
  arrange(age)
# A tibble: 2,800 × 3
gender education age
<int> <chr> <int>
1 1 Higher Degree 3
2 2 <NA> 9
3 2 Some College 11
4 2 <NA> 11
5 2 <NA> 11
6 2 <NA> 12
7 2 <NA> 12
8 2 <NA> 12
9 2 <NA> 12
10 1 <NA> 12
# ℹ 2,790 more rows
# sort by education
bfi %>%
  select(gender:age) %>%
  arrange(education)
# A tibble: 2,800 × 3
gender education age
<int> <chr> <int>
1 1 Below HS 19
2 1 Below HS 21
3 1 Below HS 17
4 1 Below HS 18
5 1 Below HS 18
6 2 Below HS 18
7 2 Below HS 43
8 1 Below HS 32
9 1 Below HS 18
10 2 Below HS 18
# ℹ 2,790 more rows
We can also arrange by multiple columns, like if we wanted to sort by gender then education:
bfi %>%
  select(gender:age) %>%
  arrange(gender, education)
# A tibble: 2,800 × 3
gender education age
<int> <chr> <int>
1 1 Below HS 19
2 1 Below HS 21
3 1 Below HS 17
4 1 Below HS 18
5 1 Below HS 18
6 1 Below HS 32
7 1 Below HS 18
8 1 Below HS 18
9 1 Below HS 53
10 1 Below HS 18
# ℹ 2,790 more rows
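By default, arrange() sorts in ascending order. Wrapping a column in desc() reverses that. As a quick sketch (output not shown in the original), we could sort by education and then from oldest to youngest within each level:
bfi %>%
  select(gender:age) %>%
  arrange(education, desc(age)) # education ascending, age descending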
Bringing it all together: Split-Apply-Combine
Much of the power of dplyr
functions lies in the split-apply-combine method. The method is pretty much what it sounds like: a data set is split into smaller chunks, then a function or series of functions is applied to each chunk, and then the chunks are combined back together.
Although all of the dplyr
functions can be used in conjunction with one another, I’m going to highlight the group_by()
, mutate()
, and summarize()
/ summarise()
functions, which together form the core of the split-apply-combine method.
5. group_by()
The group_by()
function is the “split” of the method. It implicitly breaks the data set into chunks based on whatever bare quoted column(s)/variable(s) are supplied as arguments.
So imagine that we wanted to group_by()
education levels to get average ages at each level. We would simply call:
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education)
# A tibble: 2,800 × 8
# Groups: education [6]
C1 C2 C3 C4 C5 age gender education
<int> <int> <int> <int> <int> <int> <int> <chr>
1 2 3 3 4 4 16 1 <NA>
2 5 4 4 3 4 18 2 <NA>
3 4 5 4 2 5 17 2 <NA>
4 4 4 3 5 5 17 2 <NA>
5 4 4 5 3 2 17 1 <NA>
6 6 6 6 1 3 21 2 Some College
7 5 4 4 2 3 18 1 <NA>
8 3 2 4 2 4 19 1 HS
9 6 6 3 4 5 19 1 Below HS
10 6 5 6 2 1 17 2 <NA>
# ℹ 2,790 more rows
We can now see from the header that we have a tibble with 2,800 rows and 8 columns, as well as Groups: education [6].
Importantly, once you group, you must ungroup()
or your data frame will remain “split” and cause you problems. In other words, you must “combine” your data frame back together. This is super easy with the ungroup()
function:
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education) %>%
  ungroup()
# A tibble: 2,800 × 8
C1 C2 C3 C4 C5 age gender education
<int> <int> <int> <int> <int> <int> <int> <chr>
1 2 3 3 4 4 16 1 <NA>
2 5 4 4 3 4 18 2 <NA>
3 4 5 4 2 5 17 2 <NA>
4 4 4 3 5 5 17 2 <NA>
5 4 4 5 3 2 17 1 <NA>
6 6 6 6 1 3 21 2 Some College
7 5 4 4 2 3 18 1 <NA>
8 3 2 4 2 4 19 1 HS
9 6 6 3 4 5 19 1 Below HS
10 6 5 6 2 1 17 2 <NA>
# ℹ 2,790 more rows
You can also overwrite groups by calling group_by()
more than once. We’ll touch more on that in future tutorials, but for now, notice what happens when I call group_by()
twice sequentially:
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education) %>%
  group_by(gender, age)
# A tibble: 2,800 × 8
# Groups: gender, age [115]
C1 C2 C3 C4 C5 age gender education
<int> <int> <int> <int> <int> <int> <int> <chr>
1 2 3 3 4 4 16 1 <NA>
2 5 4 4 3 4 18 2 <NA>
3 4 5 4 2 5 17 2 <NA>
4 4 4 3 5 5 17 2 <NA>
5 4 4 5 3 2 17 1 <NA>
6 6 6 6 1 3 21 2 Some College
7 5 4 4 2 3 18 1 <NA>
8 3 2 4 2 4 19 1 HS
9 6 6 3 4 5 19 1 Below HS
10 6 5 6 2 1 17 2 <NA>
# ℹ 2,790 more rows
Note that the resulting data frame is not grouped by education at all, but by gender and age (i.e. it is not cumulative).
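If you actually want the groups to accumulate rather than be overwritten, newer versions of dplyr let you pass .add = TRUE to the later group_by() call (a sketch based on current dplyr behavior; in older versions the argument was spelled add):
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  group_by(education) %>%
  group_by(gender, age, .add = TRUE) # now grouped by education, gender, AND age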
6. mutate()
The mutate()
function is one of a few options for how to “apply” (a) function(s) to your split (i.e. group_by()
) data frame. When you use mutate()
, the resulting data frame will have the same number of rows you started with (which will not be true with summarize()
/ summarise()
). One way to remember this is that you are directly mutating the existing data frame, either by modifying existing columns or by creating new ones.
So to continue with the example above, if we were to add a column that indicated the average age within each education level, we would call:
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  group_by(education) %>%
  mutate(age_by_edu = mean(age, na.rm = T))
# A tibble: 2,800 × 9
# Groups: education [6]
C1 C2 C3 C4 C5 age gender education age_by_edu
<int> <int> <int> <int> <int> <int> <int> <chr> <dbl>
1 6 6 3 4 5 19 1 Below HS 25.1
2 4 3 5 3 2 21 1 Below HS 25.1
3 5 5 5 2 2 17 1 Below HS 25.1
4 5 5 4 1 1 18 1 Below HS 25.1
5 4 5 4 3 3 18 1 Below HS 25.1
6 3 2 3 4 6 18 2 Below HS 25.1
7 3 6 3 1 3 43 2 Below HS 25.1
8 5 3 2 4 6 32 1 Below HS 25.1
9 5 5 4 3 4 18 1 Below HS 25.1
10 4 5 5 2 3 18 2 Below HS 25.1
# ℹ 2,790 more rows
As you can see in the resulting data frame, each person (row) with the same education level has the same value in the new age_by_edu
column I just added.
mutate()
is also super useful even when you aren’t grouping. For example, if I wanted to recode gender so that 1 = “male” and 2 = “female,” we could do that like:
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  mutate(gender_cat = plyr::mapvalues(gender, c(1,2), c("Male", "Female")))
# A tibble: 2,800 × 9
C1 C2 C3 C4 C5 age gender education gender_cat
<int> <int> <int> <int> <int> <int> <int> <chr> <chr>
1 2 3 3 4 4 16 1 <NA> Male
2 5 4 4 3 4 18 2 <NA> Female
3 4 5 4 2 5 17 2 <NA> Female
4 4 4 3 5 5 17 2 <NA> Female
5 4 4 5 3 2 17 1 <NA> Male
6 6 6 6 1 3 21 2 Some College Female
7 5 4 4 2 3 18 1 <NA> Male
8 3 2 4 2 4 19 1 HS Male
9 6 6 3 4 5 19 1 Below HS Male
10 6 5 6 2 1 17 2 <NA> Female
# ℹ 2,790 more rows
We could also just write over the original gender category like:
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  mutate(gender = plyr::mapvalues(gender, c(1,2), c("Male", "Female")))
# A tibble: 2,800 × 8
C1 C2 C3 C4 C5 age gender education
<int> <int> <int> <int> <int> <int> <chr> <chr>
1 2 3 3 4 4 16 Male <NA>
2 5 4 4 3 4 18 Female <NA>
3 4 5 4 2 5 17 Female <NA>
4 4 4 3 5 5 17 Female <NA>
5 4 4 5 3 2 17 Male <NA>
6 6 6 6 1 3 21 Female Some College
7 5 4 4 2 3 18 Male <NA>
8 3 2 4 2 4 19 Male HS
9 6 6 3 4 5 19 Male Below HS
10 6 5 6 2 1 17 Female <NA>
# ℹ 2,790 more rows
7. summarize()
/ summarise()
The summarize()
/ summarise()
functions (choose your spelling as you will) are another option for how to “apply” (a) function(s) to your split (i.e. group_by()
) data frame. When you use summarize()
(I’ve made my choice), the resulting data frame will have a number of rows equal to the number of group_by()
categories you provide. So if you grouped by education
, that will be 6 (five levels plus NA), and if you provided no grouping, it would be one.
# group_by() education
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  group_by(education) %>%
  summarize(age_by_edu = mean(age, na.rm = T))
# A tibble: 6 × 2
education age_by_edu
<chr> <dbl>
1 Below HS 25.1
2 College 33.0
3 HS 31.5
4 Higher Degree 35.3
5 Some College 27.2
6 <NA> 18.0
# no groups
bfi %>%
  select(starts_with("C"), age, gender, education) %>%
  arrange(education) %>%
  summarize(age_by_edu = mean(age, na.rm = T))
# A tibble: 1 × 1
age_by_edu
<dbl>
1 28.8
From this, for example, it becomes clear that all the NA
s in the education
variable came from 18 year olds who had not yet completed high school, and that the sample is pretty young overall.
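To recap the split-apply-combine idea, here is one final sketch of my own (not output shown above) that chains several of today’s verbs into a single pipeline:
bfi %>%
  filter(age >= 12 & age <= 18) %>%                    # keep adolescents only
  select(starts_with("C"), age, gender, education) %>% # keep the columns we care about
  group_by(gender) %>%                                 # split by gender
  summarize(n = n(),                                   # apply: group sizes
            mean_age = mean(age, na.rm = T)) %>%       # apply: average age per group
  ungroup() %>%                                        # combine back together
  arrange(desc(n))                                     # order by group size
Each verb hands its result to the next, which is exactly the left-to-right reading the pipe was designed for.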
Conclusion
This has been a quick and relatively basic introduction to some of the core functions in the dplyr
package. I challenge you to try to take what you learned and apply it to some of your own data.
In the next tutorial, we will touch on data wrangling, introducing the tidyr
package. This will provide methods for changing the shape of your data, which, in turn, will open up new opportunities to use dplyr
functions in useful ways. We will then take all of that and roll it into additional packages and tools.