Tutorial: data manipulation with dplyr

The dplyr package is one of several included in the tidyverse collection (Wickham et al. 2019). It provides a ton of functionality to quickly manipulate and transform datasets. to the tidyverse, a collection of data science tools within R for transforming and visualizing data. This is not the only set of tools in R, but it’s a powerful and popular approach for exploring data. At every step, you’ll be analyzing a real dataset called

library(dplyr)

This tutorial introduces the key dplyr verbs that can be used (in various combinations) to explore and transform a dataset.

Exploring data

The glimpse() verb can be used to view the first few values from each variable, along with the data type, which is a useful first step in understanding the data.

glimpse(counties)

Subsetting, arranging and transforming data

The select() verb extracts particular columns (or variables) from a dataset. You can also select a range of columns as column5:column10.

The filter() verb extracts a subset of rows (or observations) from a dataset, based on one or multiple conditions, expressed with logical operators such as ==, >, %in%… You can concatenate conditions separating them with a coma.

The arrange() verb sorts the rows (or observations) in a dataset, based on one or more variables. This is in ascending order by default, but the desc() function can be used to invert this. The arrange verb is useful, for example, when you want to know the most extreme values in a dataset.

The mutate() verb can be used to add new variables or change existing ones.

Using dplyr syntax and the pipe operator (%>%), you can concatenate verbs and transform a dataset in a series of steps. A pipe is equivalent to saying “take whatever is before it, and feed it into the next step”. Use assignment (<-) to store the output a new table.

subset <- counties %>%
  # Select the columns
  select(state, county, population, unemployment) %>%
  # Highest values first
  arrange(desc(population)) %>%
  # Filter data 
  filter(state == "New York", 
         unemployment > 6) %>%
  # Create new variable
  mutate(unemplyed_population = population * unemployment / 100)

Note

The mutate verb can also be used to select and transform columns in one go! Just list columns to select inside mutate, along with the new column to create. Specifying .keep = "none" means that any column not included in the statement is discarded.

counties %>%
  # Select and create new variable
  mutate(state, county, population, unemployment, 
         unemplyed_population = population * unemployment / 100, .keep = "none") %>% 
  # Rearrange
  arrange(desc(unemplyed_population))

Select helpers

select is a very versatile tool for not only choosing particular columns, but for removing and reordering them. To remove variables from a table add a - in front of the column name to remove. Conveniently, dplyr also provides a number of “select helper” functions that specify criteria for choosing columns.

contains(): to select all columns containing some string.
starts_with() and ends_with(): to select only the columns that start or end with a particular string / prefix.
last_col(): grabs the last column
matches(): selects columns that have a specified pattern.

To discover more about select helpers, check out the dplyr select helpers documentation, using ?select_helpers.

subset <- counties %>%
  # Select the columns
  select(state, county, professional:production, 
         contains("work"), starts_with("income")) %>%
  # Arrange service in descending order 
  arrange(desc(service))

Rename

Often, rather than only selecting columns, you may want to rename them. This can be done using the rename() verb or directly inside select.

counties %>%
  # Count the number of counties in each state
  count(state) %>%
  # Rename the n column to num_counties
  rename(num_counties = n)

counties %>%
  # Select state, county, and poverty as poverty_rate
  select(state, county, poverty_rate = poverty)

Relocate

The relocate() verb is used to change column positions quickly and efficiently. The function takes the column you wish to move as the first argument. Then use the .before and .after arguments to indicate where you want to move it.

You can also combine relocate with the select helpers to get, for example a column moved to the end of the dataset (.after = last_col())

counties %>%
  # Move the density column to the end
  relocate(density, .after=last_col()) %>%
  # Move the population column to before land_area
  relocate(population, .before=land_area)

Aggregating data

Aka = summarizing many observations into one row (a common strategy for making datasets manageable and interpretable).

count() returns how many observations there are in a table. Outputs a table with one row and one column, called n.

counties %>%
  # Total number of observations 
  count()

But the real value of count is when you pass a specific variable to count. The n column in the resulting table now contains the counts per each level of that variable. count also takes a second argument sort = TRUE, that sorts rows from the most common observations to the least.

counties %>%
  # Count observations per state and sort in descending order
  count(state, sort = TRUE)

You can also add the argument wt, which stands for “weight”, to indicate that the n column should be weighted by another variable.

counties %>%
  # Total population by state state and sort in descending order
  count(state, wt = population, sort = TRUE)

count() is a special case of a more general set of verbs:

summarize() takes many observations and turns them into one. You can define multiple variables in a summarize call, and aggregate each in different ways, using common summary functions: sum(), mean(), median(), min(), max(), and n() (for the size of the group). You can combine these too.

counties %>%
  # Total population and average unemployment
  summarize(total_population = sum(population),
            average_unemployment = mean(unemployment))

Summarizing the entire table is useful, but, ideally, we want to aggregate within groups.

You can achieve this by piping first into group_by(), choose the variable(s) to group on, then piping from that into summarize.

counties %>%
  # Nest by state
  group_by(state) %>%
  # Total population and average unemployment
  summarize(total_population = sum(population),
            average_unemployment = mean(unemployment)) %>% 
  # Often useful to sort results 
  arrange(desc(average_unemployment))

To group by multiple columns, pass all the column names to group_by (resulting in one row for each combination of columns). You can now see "Groups: state" at the top of the output table.

Note: when you use summarize on a table that has multiple groups, remember that this will “peel off” one of the groups (the last one). For example, if you group_by(X, Y) %>% summarize() the result will still be grouped by X. This is useful when you want to continue doing additional summaries or aggregations. If you don’t want to keep state as a group, you can use ungroup().

counties %>%
  # Nest by state and metropolitan area
  group_by(state, metro) %>%
  # Total population
  summarize(total_population = sum(population)) %>% 
  # Ungroup or keep summarizing
  ungroup()

Finally, slice_min() and slice_max() extract the most extreme observations from each group.

Like summarize, they can operate on a grouped table, and return the largest (or smallest) observations in each group. The function takes two arguments: the column we want to base the ordering on, and the number of observations n to extract from each group.

The slicing verbs are often used when creating visualizations, where we may want to highlight the extreme observations on the plot.

counties %>%
  # Nest by state
  group_by(state) %>%
  # Return the county with the largest population
  slice_max(population, n = 1)

Note: you can combine multiple dplyr verbs together. e.g.:

counties %>%
  # Group
  group_by(region, state) %>%
  # Calculate average income
  summarize(average_income = mean(income)) %>%
  # Find the lowest income state in each region
  slice_min(average_income, n=1)

counties %>%
  # Find the total population for each combination of state and metro
  group_by(state, metro) %>%
  summarize(total_pop = sum(population)) %>%
  # Extract the most populated row for each state
  slice_max(total_pop, n = 1) %>%
  # Count the states with more people in Metro or Nonmetro areas
  ungroup(state) %>%
  count(metro)

Grouped mutate

Just like group_by and summarize work well together, group_by and mutate are a great pair, that allows to create new columns using “grouped” values. The pipeline is similar to the one described above. Note that you may want to ungroup the data after calling mutate, since this may affect the following calls and it can make other mutates or filters slower to run, especially if there are a lot of groups in the table.

babynames %>%
  # Calculate the fraction of people born each year with the same name
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number / year_total)
  # Find the year each name is most common
  group_by(name) %>%
  slice_max(fraction, n = 1)

babynames %>%
  # Add columns name_total and name_max for each name
  group_by(name) %>%
  mutate(name_total = sum(number),
         name_max = max(number)) %>%
  # Ungroup the table 
  ungroup() %>%
  # Add the fraction_max column containing the number by the name maximum 
  mutate(fraction_max = number / name_max)

Window Functions

A window function takes a vector, and returns another vector of the same length. For example, the lag() function moves each item of a vector to the right by one. So if v <- c(1, 3, 6, 14), then lag(v) will be c(NA, 1, 3, 6).

Why is this useful? By lining up each item in the vector with the item directly before it, you can compare consecutive steps and calculate the changes. e.g. v - lag(v) is a way to calculate the difference between consecutive values in a vector: c(NA, 2, 3, 8) aka “each value once we’ve subtracted the previous one”

For example, you can use lag to find the changes in popularity of one name in consecutive years.

babynames %>%
  # Calculate the fraction of people born each year with the same name
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number / year_total)
  # Filter one name and order by year
  filter(name == "Matthew") %>%
  arrange(year) %>%
  # Take each fraction, and subtract the "lagged" fraction
  mutate(difference = fraction - lag(fraction))

If instead of looking at one name, we wanted to look at the changes within every name, you can do this in a grouped mutate: first group by name before calculating the difference between each year. This ensures you won’t include differences between two different names.

babynames %>%
  # Calculate the fraction of people born each year with the same name
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number / year_total)
  # Arrange the data in order of name, then year 
  arrange(name, year) %>%
  # Group the data by name
  group_by(name) %>%
  # Add a ratio column that contains the ratio of fraction between each year 
  mutate(ratio = fraction / lag(fraction)) %>%
  ungroup()

Plotting

Visualization and data wrangling are often intertwined. The dplyr package is very useful for exploring data, but it’s especially useful when combined with other tidyverse packages like ggplot2. There are three parts to a ggplot graph:

The data;
The mapping of variables to grph aesthetics, done with aes(). An aesthetic is a visual dimension of a graph that can be used to communicate information. For example, in a scatterplot, two dimensions are the x- and y-axis. Other aesthetics can be color (good way to represent categorical data) and size (good for representing continuous variables). Note that ggplot automatically adds a legend to the plot.
Specifying the type of graph, done by adding a layer to the graph, e.g. + geom_point().

selected_names <- babynames %>%
  # Filter for the names Steven, Thomas, and Matthew 
  filter(name %in% c("Steven", "Thomas", "Matthew")) %>%
  # Plot the names using a different color for each name
  ggplot(., aes(x = year, y = number, color = name)) +
    geom_line()

Scaling axes

By adding + expand_limits(y = 0) to the end of the ggplot call, you can specify that you want the y-axis to start at zero.

A log scale is a scale where each fixed distance represents a multiplication of the value. For example each unit on the x-axis represents a change of 10 times the variable. On this scale, relationships may look more linear: and you can more easily distinguish observations at the lower end of the spectrum. You can represent this by adding one additional option to the ggplot call: + scale_x_log10() or + scale_y_log10().

Plot types

There are many types of graphs supported ggplot2, each of which is useful for visualizing a particular type of data.

scatter plots [+ geom_point()], useful for comparing two variables, where each point represents one observation.
line plots [+ geom_line()], useful for showing change over time.
bar plots [+ geom_col()], good at comparing statistics for each of several categories. There are two aesthetics in a bar plot: x is the categorical variable and y is the variable that determines the height of the bars. Unlike scatter plots or line plots, bar plots always start at zero.
histograms [+ geom_histogram()], to describe the shape of the distribution of a one-dimensional numeric variable. A histogram has only one aesthetic: the x-axis. The width of each bin can be customized using the binwidth option, expressed in the same scale as the variable (e.g., geom_histogram(binwidth = 5)), or you can set the number of bins using bins instead. Note: in some cases, you may need to put the x-axis of a histogram on a log scale for it to be understandable.
box plots [+ geom_boxplot()], to compare the distribution of a numeric variable among several categories. This graph has two aesthetics: x is the category and y is the values to compare. The black line in the middle of each white box is the median of that distribution. The top and bottom of each box represent the 75th- and 25th- percentile (meaning half of the distribution lies within that box). The “whiskers” cover additional values and the dots represent outliers.

Faceting

Faceting consists of dividing your plot into subplots according to a variable, another powerful way to communicate relationships within the data. You facet a plot by adding another option to the end of the call: + facet_wrap(~ continent). In R, the ~ symbol typically means “by”, meaning that you are splitting the plot by continent.

References

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Tutorial: data manipulation with `dplyr`

Serena Defina