Data Manipulation using tidyverse and dplyr

By Kelvin Kiprono

September 9, 2024

Tidyverse is a collection of R packages designed for data science, providing tools for data manipulation, visualization, and analysis. It includes packages like ggplot2 (for plotting), dplyr (for data manipulation), tidyr (for reshaping data), and more, all built around a consistent design philosophy.

dplyr is a package within the tidyverse focused specifically on data manipulation. It provides a set of intuitive functions like filter(), select(), mutate(), summarize(), and arrange() to perform operations such as filtering, selecting, transforming, summarizing, and sorting data in a straightforward and readable way.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)  # `tidyverse` includes `dplyr` and other helpful packages
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#Create a dummy dataset
data <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David", "Eva", "Frank", "Grace", "Hannah", "Ian", "Jack"),
  age = c(23, 35, 45, 29, 33, 50, 28, 41, 36, 27),
  gender = c("F", "M", "M", "M", "F", "M", "F", "F", "M", "M"),
  income = c(50000, 60000, 80000, 55000, 65000, 90000, 45000, 70000, 75000, 48000)
)
print(head(data))
##      name age gender income
## 1   Alice  23      F  50000
## 2     Bob  35      M  60000
## 3 Charlie  45      M  80000
## 4   David  29      M  55000
## 5     Eva  33      F  65000
## 6   Frank  50      M  90000

Filtering Data

Use filter() to subset rows based on conditions.

# Filter rows where age is greater than 30
filtered_data <- data %>% 
  filter(age > 30)
filtered_data
##      name age gender income
## 1     Bob  35      M  60000
## 2 Charlie  45      M  80000
## 3     Eva  33      F  65000
## 4   Frank  50      M  90000
## 5  Hannah  41      F  70000
## 6     Ian  36      M  75000

Selecting Columns

Use select() to pick specific columns.

# Select only the 'name' and 'age' columns
selected_data <- data %>% 
  select(name, age)

selected_data
##       name age
## 1    Alice  23
## 2      Bob  35
## 3  Charlie  45
## 4    David  29
## 5      Eva  33
## 6    Frank  50
## 7    Grace  28
## 8   Hannah  41
## 9      Ian  36
## 10    Jack  27

Mutating (Adding New Columns)

Use mutate() to create new columns or modify existing ones.

# Add a new column 'age_in_5_years'
mutated_data <- data %>% 
  mutate(age_in_5_years = age + 5)
mutated_data
##       name age gender income age_in_5_years
## 1    Alice  23      F  50000             28
## 2      Bob  35      M  60000             40
## 3  Charlie  45      M  80000             50
## 4    David  29      M  55000             34
## 5      Eva  33      F  65000             38
## 6    Frank  50      M  90000             55
## 7    Grace  28      F  45000             33
## 8   Hannah  41      F  70000             46
## 9      Ian  36      M  75000             41
## 10    Jack  27      M  48000             32

Arranging (Sorting) Data

Use arrange() to sort data.

# Sort data by age in descending order
sorted_data <- data %>% 
  arrange(desc(age))
sorted_data
##       name age gender income
## 1    Frank  50      M  90000
## 2  Charlie  45      M  80000
## 3   Hannah  41      F  70000
## 4      Ian  36      M  75000
## 5      Bob  35      M  60000
## 6      Eva  33      F  65000
## 7    David  29      M  55000
## 8    Grace  28      F  45000
## 9     Jack  27      M  48000
## 10   Alice  23      F  50000

Summarizing Data

Use summarize() to calculate summary statistics, often with group_by() to aggregate data.

# Calculate the average income
average_income <- data %>% 
  summarize(avg_income = mean(income, na.rm = TRUE))
print(average_income)
##   avg_income
## 1      63800
# Group by gender and calculate average income for each gender
grouped_data <- data %>% 
  group_by(gender) %>% 
  summarize(avg_income = mean(income, na.rm = TRUE))
print(grouped_data)
## # A tibble: 2 × 2
##   gender avg_income
##   <chr>       <dbl>
## 1 F           57500
## 2 M           68000

Combining Operations

You can chain multiple operations together using the pipe operator (%>%).

# Filter for age greater than 30, select name and income, and sort by income
combined_data <- data %>%
  filter(age > 30) %>%
  select(name, income) %>%
  arrange(desc(income))
print(combined_data)
##      name income
## 1   Frank  90000
## 2 Charlie  80000
## 3     Ian  75000
## 4  Hannah  70000
## 5     Eva  65000
## 6     Bob  60000

Other Helpful Functions

  • distinct(): Removes duplicate rows.
  • rename(): Renames columns.
  • join(): Combines two data frames by a common column.

Rename a column

renamed_data <- data %>% rename(annual_income = income)
print(renamed_data)
##       name age gender annual_income
## 1    Alice  23      F         50000
## 2      Bob  35      M         60000
## 3  Charlie  45      M         80000
## 4    David  29      M         55000
## 5      Eva  33      F         65000
## 6    Frank  50      M         90000
## 7    Grace  28      F         45000
## 8   Hannah  41      F         70000
## 9      Ian  36      M         75000
## 10    Jack  27      M         48000

Remove duplicates based on specific columns

distinct_data <- data %>% distinct(name, .keep_all = TRUE)
print(distinct_data)
##       name age gender income
## 1    Alice  23      F  50000
## 2      Bob  35      M  60000
## 3  Charlie  45      M  80000
## 4    David  29      M  55000
## 5      Eva  33      F  65000
## 6    Frank  50      M  90000
## 7    Grace  28      F  45000
## 8   Hannah  41      F  70000
## 9      Ian  36      M  75000
## 10    Jack  27      M  48000

Summary

Using tidyverse and dplyr, you can efficiently manipulate data by filtering, transforming, summarizing, and sorting it. These functions are concise, and work seamlessly together, allowing for clean and readable code.