dplyr
The dplyr
package is one of several included in
the tidyverse
collection (Wickham et al. 2019).
It provides a ton of functionality to quickly
manipulate and transform datasets. to
the tidyverse, a collection of data science tools within R for
transforming and visualizing data. This is not the only set of tools in
R, but it’s a powerful and popular approach for exploring data. At every
step, you’ll be analyzing a real dataset called
library(dplyr)
This tutorial introduces the key dplyr verbs that can be used (in various combinations) to explore and transform a dataset.
The glimpse()
verb can be used to view the first few values from each variable, along
with the data type, which is a useful first step in understanding the
data.
glimpse(counties)
The select()
verb extracts particular columns (or variables) from a
dataset. You can also select a range of columns as
column5:column10
.
The filter()
verb extracts a subset of rows (or observations) from a
dataset, based on one or multiple conditions, expressed with logical
operators such as ==
, >
, %in%
…
You can concatenate conditions separating them with a coma.
The arrange()
verb sorts the rows (or observations) in a dataset, based on one or more
variables. This is in ascending order by default, but the
desc()
function can be used to invert
this. The arrange
verb is useful, for example, when you
want to know the most extreme values in a dataset.
The mutate()
verb can be used to add new variables or
change existing ones.
Using dplyr
syntax and the pipe
operator (%>%
), you can concatenate verbs and
transform a dataset in a series of steps. A pipe is equivalent to saying
“take whatever is before it, and feed it into the next step”. Use
assignment (<-
) to store the output a new
table.
subset <- counties %>%
# Select the columns
select(state, county, population, unemployment) %>%
# Highest values first
arrange(desc(population)) %>%
# Filter data
filter(state == "New York",
unemployment > 6) %>%
# Create new variable
mutate(unemplyed_population = population * unemployment / 100)
Note
The mutate
verb can also be used to select and transform
columns in one go! Just list columns to select inside mutate, along with
the new column to create. Specifying .keep = "none"
means
that any column not included in the statement is discarded.
counties %>%
# Select and create new variable
mutate(state, county, population, unemployment,
unemplyed_population = population * unemployment / 100, .keep = "none") %>%
# Rearrange
arrange(desc(unemplyed_population))
select
is a very versatile tool for not only choosing
particular columns, but for removing and reordering them. To
remove variables from a table add a -
in
front of the column name to remove. Conveniently, dplyr
also provides a number of “select helper” functions that
specify criteria for choosing columns.
contains()
: to select all columns
containing some string.starts_with()
and
ends_with()
: to select only the columns
that start or end with a particular string / prefix.last_col()
: grabs the last columnmatches()
: selects columns that have a
specified pattern.To discover more about select helpers, check out the dplyr select
helpers documentation, using ?select_helpers
.
subset <- counties %>%
# Select the columns
select(state, county, professional:production,
contains("work"), starts_with("income")) %>%
# Arrange service in descending order
arrange(desc(service))
Often, rather than only selecting columns, you may want to rename
them. This can be done using the rename()
verb or directly inside select
.
counties %>%
# Count the number of counties in each state
count(state) %>%
# Rename the n column to num_counties
rename(num_counties = n)
counties %>%
# Select state, county, and poverty as poverty_rate
select(state, county, poverty_rate = poverty)
The relocate()
verb is used to change
column positions quickly and efficiently. The function takes the column
you wish to move as the first argument. Then use the
.before
and .after
arguments to indicate where
you want to move it.
You can also combine relocate
with the select
helpers to get, for example a column moved to the end of the
dataset (.after = last_col()
)
counties %>%
# Move the density column to the end
relocate(density, .after=last_col()) %>%
# Move the population column to before land_area
relocate(population, .before=land_area)
Aka = summarizing many observations into one row (a common strategy for making datasets manageable and interpretable).
count()
returns how many observations there are in a table. Outputs a table with
one row and one column, called n
.
counties %>%
# Total number of observations
count()
But the real value of count
is when you pass a specific
variable to count. The n
column in the resulting table now
contains the counts per each level of that variable. count
also takes a second argument sort = TRUE
, that sorts rows
from the most common observations to the least.
counties %>%
# Count observations per state and sort in descending order
count(state, sort = TRUE)
You can also add the argument wt
, which
stands for “weight”, to indicate that the n
column
should be weighted by another variable.
counties %>%
# Total population by state state and sort in descending order
count(state, wt = population, sort = TRUE)
count()
is a special case of a more general set of
verbs:
summarize()
takes many observations and turns them into one. You can define multiple
variables in a summarize call, and aggregate each in different ways,
using common summary functions: sum()
, mean()
,
median()
, min()
, max()
, and
n()
(for the size of the group). You can combine these
too.
counties %>%
# Total population and average unemployment
summarize(total_population = sum(population),
average_unemployment = mean(unemployment))
Summarizing the entire table is useful, but, ideally, we want to aggregate within groups.
You can achieve this by piping first into group_by()
,
choose the variable(s) to group on, then piping from that into
summarize
.
counties %>%
# Nest by state
group_by(state) %>%
# Total population and average unemployment
summarize(total_population = sum(population),
average_unemployment = mean(unemployment)) %>%
# Often useful to sort results
arrange(desc(average_unemployment))
To group by multiple columns, pass all the column
names to group_by
(resulting in one row for each
combination of columns). You can now see "Groups: state"
at
the top of the output table.
Note: when you use summarize
on a table that
has multiple groups, remember that this will “peel off” one of the
groups (the last one). For example, if you
group_by(X, Y) %>% summarize()
the result will still be
grouped by X
. This is useful when you want to continue
doing additional summaries or aggregations. If you don’t want to keep
state as a group, you can use ungroup()
.
counties %>%
# Nest by state and metropolitan area
group_by(state, metro) %>%
# Total population
summarize(total_population = sum(population)) %>%
# Ungroup or keep summarizing
ungroup()
Finally, slice_min()
and slice_max()
extract the most extreme observations from each group.
Like summarize
, they can operate on a grouped table, and
return the largest (or smallest) observations in each group. The
function takes two arguments: the column we want to base the ordering
on, and the number of observations n
to extract from each
group.
The slicing verbs are often used when creating visualizations, where we may want to highlight the extreme observations on the plot.
counties %>%
# Nest by state
group_by(state) %>%
# Return the county with the largest population
slice_max(population, n = 1)
Note: you can combine multiple dplyr verbs together. e.g.:
counties %>%
# Group
group_by(region, state) %>%
# Calculate average income
summarize(average_income = mean(income)) %>%
# Find the lowest income state in each region
slice_min(average_income, n=1)
counties %>%
# Find the total population for each combination of state and metro
group_by(state, metro) %>%
summarize(total_pop = sum(population)) %>%
# Extract the most populated row for each state
slice_max(total_pop, n = 1) %>%
# Count the states with more people in Metro or Nonmetro areas
ungroup(state) %>%
count(metro)
Just like group_by
and summarize
work well
together, group_by
and mutate
are a great
pair, that allows to create new columns using “grouped” values. The
pipeline is similar to the one described above. Note that you may want
to ungroup
the data after calling mutate, since this may
affect the following calls and it can make other mutates or filters
slower to run, especially if there are a lot of groups in the table.
babynames %>%
# Calculate the fraction of people born each year with the same name
group_by(year) %>%
mutate(year_total = sum(number)) %>%
ungroup() %>%
mutate(fraction = number / year_total)
# Find the year each name is most common
group_by(name) %>%
slice_max(fraction, n = 1)
babynames %>%
# Add columns name_total and name_max for each name
group_by(name) %>%
mutate(name_total = sum(number),
name_max = max(number)) %>%
# Ungroup the table
ungroup() %>%
# Add the fraction_max column containing the number by the name maximum
mutate(fraction_max = number / name_max)
A window function takes a vector, and returns another vector of the
same length. For example, the lag()
function moves each item of a vector to the right by one. So if
v <- c(1, 3, 6, 14)
, then lag(v)
will be
c(NA, 1, 3, 6)
.
Why is this useful? By lining up each item in the vector
with the item directly before it, you can compare consecutive steps and
calculate the changes. e.g. v - lag(v)
is a way to
calculate the difference between consecutive values in a vector:
c(NA, 2, 3, 8)
aka “each value once we’ve subtracted the
previous one”
For example, you can use lag
to find the changes in
popularity of one name in consecutive years.
babynames %>%
# Calculate the fraction of people born each year with the same name
group_by(year) %>%
mutate(year_total = sum(number)) %>%
ungroup() %>%
mutate(fraction = number / year_total)
# Filter one name and order by year
filter(name == "Matthew") %>%
arrange(year) %>%
# Take each fraction, and subtract the "lagged" fraction
mutate(difference = fraction - lag(fraction))
If instead of looking at one name, we wanted to look at the changes within every name, you can do this in a grouped mutate: first group by name before calculating the difference between each year. This ensures you won’t include differences between two different names.
babynames %>%
# Calculate the fraction of people born each year with the same name
group_by(year) %>%
mutate(year_total = sum(number)) %>%
ungroup() %>%
mutate(fraction = number / year_total)
# Arrange the data in order of name, then year
arrange(name, year) %>%
# Group the data by name
group_by(name) %>%
# Add a ratio column that contains the ratio of fraction between each year
mutate(ratio = fraction / lag(fraction)) %>%
ungroup()
Visualization and data wrangling are often intertwined. The
dplyr
package is very useful for exploring data, but it’s
especially useful when combined with other tidyverse
packages like ggplot2
.
There are three parts to a ggplot graph:
aes()
. An aesthetic is a visual dimension of a graph
that can be used to communicate information. For example, in a
scatterplot, two dimensions are the x- and
y-axis. Other aesthetics can be color
(good way to represent categorical data) and size (good
for representing continuous variables). Note that ggplot
automatically adds a legend to the plot.+ geom_point()
.selected_names <- babynames %>%
# Filter for the names Steven, Thomas, and Matthew
filter(name %in% c("Steven", "Thomas", "Matthew")) %>%
# Plot the names using a different color for each name
ggplot(., aes(x = year, y = number, color = name)) +
geom_line()
By adding + expand_limits(y = 0)
to the end of the
ggplot call, you can specify that you want the y-axis to start at
zero.
A log scale is a scale where each fixed
distance represents a multiplication of the value. For example each unit
on the x-axis represents a change of 10 times the variable. On this
scale, relationships may look more linear: and you can more easily
distinguish observations at the lower end of the spectrum. You can
represent this by adding one additional option to the ggplot call:
+ scale_x_log10()
or + scale_y_log10()
.
There are many types of graphs supported ggplot2
, each
of which is useful for visualizing a particular type of data.
+ geom_point()
], useful
for comparing two variables, where each point represents one
observation.+ geom_line()
], useful for
showing change over time.+ geom_col()
], good at
comparing statistics for each of several categories. There are two
aesthetics in a bar plot: x
is the categorical variable and
y
is the variable that determines the height of the bars.
Unlike scatter plots or line plots, bar plots always start at zero.+ geom_histogram()
], to
describe the shape of the distribution of a
one-dimensional numeric variable. A histogram has only one
aesthetic: the x-axis. The width of each bin can be customized using the
binwidth
option, expressed in the same
scale as the variable (e.g., geom_histogram(binwidth = 5)
),
or you can set the number of bins using
bins
instead. Note: in some
cases, you may need to put the x-axis of a histogram on a log scale for
it to be understandable.+ geom_boxplot()
], to
compare the distribution of a numeric variable among several categories.
This graph has two aesthetics: x
is the category and
y
is the values to compare. The black line in the middle of
each white box is the median of that distribution. The top and
bottom of each box represent the 75th- and 25th- percentile
(meaning half of the distribution lies within that box). The “whiskers”
cover additional values and the dots represent outliers.Faceting consists of dividing your plot into
subplots according to a variable, another powerful way
to communicate relationships within the data. You facet a plot by adding
another option to the end of the call:
+ facet_wrap(~ continent)
. In R, the ~
symbol
typically means “by”, meaning that you are splitting the plot by
continent.