Code
data("mtcars")For the purpose of learning data transformation, lets use a dataset included in the base R. We will begin with a simple example and then dive in into more advanced data manipulation that will be useful in your research.
We can view the first few rows and structure of the dataset as follows.
Until now, we used base R commands to reorder, sort and select data. For advanced data manipulation, tidyverse library opens a whole new avenue for more advanced data transformation. We will work with data frames for this purpose.
The dplyr package in R is a subset of tidyverse package that provides a powerful and elegant grammar for data manipulation. This page introduces the most important dplyr functions using examples.
Install and load the tidyverse package (which includes dplyr) and prepare an example dataset.
The tidyverse library contains a pipe-operator %>%. We can also build our own operator in R as per our need , but we won’t be discussing that right now. Lets understand how the pipe operator works. You can use pipe as if you are connecting a chain of dplyr commands. For example, if you want to select particular columns in a dataset, you could use select(dataset, column1, column2....). Using pipe operator, you can achieve the same result as follows.
Lets say you want the data about different cars regarding just the miles/gallon, number of cylinders and gross horsepower. You donot want other columns.
In the above command, we are simply specifying the dataset, and passing it through the pipe operator and selecting our required columns.
Now, you are looking for cars with miles per gallon greater than 20 and number of cylinders that equal exacly 4. In this case, we want to filter our dataset based on this information.
Alternatively, we can use a pipe operator mtcars %>% filter(mpg > 20, cyl == 4).
dplyr commands using pipe operatorWe could just combine both of the above tasks using the pipe.
We should be aware that we can do the same thing in R in various ways.
Now, lets say you want to sort your data according to miles per gallon in a descending order.
If we want to create a new column or variable in our dataset, we can use mutate command. Suppose we want to see what is the power-to-weight ratio like for each car models.
To generate summary statistics such as mean, standard deviation, percentage et cetera, we can use summarise command. We want to know what’s the average, minimum , maximum and standard deviation of the power to weight ratio. We can calculate those as follows:
Now, lets say we want the same statistics grouped by a category, in this case, the engine. We can use group_by command.
# A tibble: 2 × 5
vs Average `Standard Deviation` Minimum Maximum
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 52.2 16.0 38.3 93.8
2 1 36.5 12.2 19.4 74.7
By now, you must be little bit familiar with how R base commands and dplyr commands work. The tidyverse package has a bundle of other packages within itself that are useful:
ggplot2 – generate nearly any quantitative plot you would find in a scientific journaldplyr – shortcuts for subsetting, summarizing, rearranging, and joining datasetstidyr – intuitive functions for changing the layout of your datastringr – necessary functions for working with character datalubridate – tools that make working with dates and times way easierIn the next section, we will be working with titanic dataset to demonstrate the use of powerful tidyverse tools.
Lets load the titanic dataset first.
To start, preview the data using the summary() function.
Survived Pclass Name Sex
Min. :0.0000 Min. :1.000 Length:887 Length:887
1st Qu.:0.0000 1st Qu.:2.000 Class :character Class :character
Median :0.0000 Median :3.000 Mode :character Mode :character
Mean :0.3856 Mean :2.306
3rd Qu.:1.0000 3rd Qu.:3.000
Max. :1.0000 Max. :3.000
Age Siblings.Spouses.Aboard Parents.Children.Aboard
Min. : 0.42 Min. :0.0000 Min. :0.0000
1st Qu.:20.25 1st Qu.:0.0000 1st Qu.:0.0000
Median :28.00 Median :0.0000 Median :0.0000
Mean :29.47 Mean :0.5254 Mean :0.3833
3rd Qu.:38.00 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :80.00 Max. :8.0000 Max. :6.0000
Fare
Min. : 0.000
1st Qu.: 7.925
Median : 14.454
Mean : 32.305
3rd Qu.: 31.137
Max. :512.329
The filter() function is used to remove or keep observations from a dataset based on one or multiple logical conditions.
Survived Pclass Name Sex Age
1 0 3 Mr. Lionel Leonard male 36
2 0 1 Mr. William Harrison male 40
3 1 3 Mr. William Henry Tornquist male 25
4 0 2 Mr. Francis Parkes male 21
5 0 3 Mr. William Cahoone Jr Johnson male 19
6 0 2 Mr. Alfred Fleming Cunningham male 22
7 0 2 Mr. William Campbell male 21
8 0 2 Mr. Anthony Wood Frost male 37
9 0 3 Mr. Alfred Johnson male 49
10 0 1 Mr. William Henry Marsh Parr male 30
11 0 2 Mr. Ennis Hastings Watson male 19
12 0 2 Mr. Robert J Knight male 41
13 0 1 Mr. Thomas Jr Andrews male 39
14 0 1 Mr. Richard Fry male 39
15 0 1 Jonkheer. John George Reuchlin male 38
Siblings.Spouses.Aboard Parents.Children.Aboard Fare
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 0 0
To remove the observations with Fare equal to zero, we will use the != logical operator, which means not equal to. The new dataset will be called titanic_clean.
mutate()mutate() function is used to create new variables. Variables are created using the = sign, and multiple variables can be created by separating statements with a ,.
In the following lines of code, several new variables are created:
age_sq: Age squaredalone: Binary variable that takes a value of 1 if the passenger was travelling alone and 0 otherwise. To do this, the ifelse() statement and | operator are used to output a value of 1 if Siblings.Spouses.Aboard or Parents.Children.Aboard is greater than zero. The OR operator, |, indicates that only one of the logical conditions needs to be true.alone2: Same variable as alone, but created using the AND, &, operator, which requires both logical conditions to be true. In this case both Siblings.Spouses.Aboard and Parents.Children.Aboard need to be equal to zero.age_group: Categorical variable indicating whether the passenger was an infant, toddler, child, teen, adult, middle age adult, or senior adult. This variable is created using the case_when() function, which allows you to “vectorize” multiple ifelse() statements. It essentially allows you to assign different outputs based on different logical conditions without having to nest several ifelse() statements. In the case below, different character outputs are assigned to different age ranges. The variable is then modified using factor() to define an ordinal ranking of the variable from youngest to oldest age group.income_class: Categorical variable indicating the income class of the passenger. For this variable, I assume that passenger class (1st, 2nd, 3rd) is a suitable proxy for upper, middle, and lower income status.Sex: A simple modification of the existing Sex variable to make the first letter capital.# take the titanic data
titanic_clean <- titanic %>% # and then...
# remove "bad" data
filter(Fare != 0) %>% # and then...
# create new variables
mutate(
age_sq = Age^2,
alone = ifelse(
Siblings.Spouses.Aboard > 0 | Parents.Children.Aboard > 0, 0, 1
),
alone2 = ifelse(
Siblings.Spouses.Aboard == 0 & Parents.Children.Aboard == 0, 1, 0
),
age_group = case_when(
Age < 1 ~ "Infant",
Age >= 1 & Age < 4 ~ "Toddler",
Age >= 4 & Age < 13 ~ "Child",
Age >= 13 & Age < 20 ~ "Teen",
Age >= 20 & Age < 40 ~ "Adult",
Age >= 40 & Age < 60 ~ "Middle Age Adult",
Age >= 60 ~ "Senior Adult"
),
age_group = factor(
age_group, levels = c("Infant", "Toddler", "Child", "Teen",
"Adult", "Middle Age Adult", "Senior Adult")
),
income_class = case_when(
Pclass == 1 ~ "Upper Class",
Pclass == 2 ~ "Middle Class",
Pclass == 3 ~ "Lower Class"
),
Sex = ifelse(
Sex == "female", "Female", "Male"
)
)Now, we use select() to keep a set of variables we are interested in, rename() to rename variables, and arrange() to order the data.
# take the titanic data
titanic_clean <- titanic %>% # and then...
# remove "bad" data
filter(Fare != 0) %>% # and then...
# create new variables
mutate(
age_sq = Age^2,
alone = ifelse(
Siblings.Spouses.Aboard > 0 | Parents.Children.Aboard > 0, 0, 1
),
alone2 = ifelse(
Siblings.Spouses.Aboard == 0 & Parents.Children.Aboard == 0, 1, 0
),
age_group = case_when(
Age < 1 ~ "Infant",
Age >= 1 & Age < 4 ~ "Toddler",
Age >= 4 & Age < 13 ~ "Child",
Age >= 13 & Age < 20 ~ "Teen",
Age >= 20 & Age < 40 ~ "Adult",
Age >= 40 & Age < 60 ~ "Middle Age Adult",
Age >= 60 ~ "Senior Adult"
),
age_group = factor(
age_group, levels = c("Infant", "Toddler", "Child", "Teen",
"Adult", "Middle Age Adult", "Senior Adult")
),
income_class = case_when(
Pclass == 1 ~ "Upper Class",
Pclass == 2 ~ "Middle Class",
Pclass == 3 ~ "Lower Class"
),
Sex = ifelse(
Sex == "female", "Female", "Male"
)
) %>% # and then...
# only keep variables of interest
select(Survived, Sex, Age, alone, age_group, income_class) %>% # and then...
# rename variables in a consistent manner (snake_case)
rename(survived = Survived,
sex = Sex,
age = Age) %>% # and then...
# order the data by age
arrange(age)
head(titanic_clean) survived sex age alone age_group income_class
1 1 Male 0.42 0 Infant Lower Class
2 1 Male 0.67 0 Infant Middle Class
3 1 Female 0.75 0 Infant Lower Class
4 1 Female 0.75 0 Infant Lower Class
5 1 Male 0.83 0 Infant Middle Class
6 1 Male 0.83 0 Infant Middle Class
From the tidyr vignette:
“It is often said that 80% of data analysis is spent on the cleaning and preparing data. And it’s not just a first step, but it must be repeated many times over the course of analysis as new problems come to light or new data is collected.”
Now that 80% of the work is done, we can do some analysis. One approach is to look at survival rates by income class, age category, or other relevant groups. To do this, we will use the following two functions:
group_by() – define groups within the data such that all following operations are performed by group rather than by observation. (ungroup() removes grouping)summarize() – performs operations by groups and creates a new data frame with one row per group and one column per summary statistic.Survival rates by income class:
# A tibble: 3 × 2
income_class survival_rate
<chr> <dbl>
1 Lower Class 0.244
2 Middle Class 0.489
3 Upper Class 0.645
There appears to be a consistent increase in survival chances as passengers moved from lower to middle to upper class. What might be the reason?
Survival rates by age group:
# A tibble: 7 × 2
age_group survival_rate
<fct> <dbl>
1 Infant 1
2 Toddler 0.56
3 Child 0.468
4 Teen 0.390
5 Adult 0.373
6 Middle Age Adult 0.401
7 Senior Adult 0.226
What might cause different age groups to have different survival rates?
Survival rates by solo travelers:
# A tibble: 2 × 2
alone survival_rate
<dbl> <dbl>
1 0 0.506
2 1 0.313
Given the variability of survival rates across different variables, it may be interesting to look at survival rates conditional on multiple variables. These next lines of code create a new data frame, titanic_grouped, with survival rates by age group, sex, and income class. It may also be the case that some groups have very few passengers that fit the description (e.g., upper class female toddlers). Since survival rates can be deceptive depending on how many passengers fall into a certain group, we will use the n() function create a count.
`summarise()` has grouped output by 'age_group', 'sex'. You can override using
the `.groups` argument.
# A tibble: 20 × 5
age_group sex income_class count survival_rate
<fct> <chr> <chr> <int> <dbl>
1 Infant Female Lower Class 2 1
2 Infant Male Lower Class 1 1
3 Infant Male Middle Class 3 1
4 Infant Male Upper Class 1 1
5 Toddler Female Middle Class 2 1
6 Toddler Male Middle Class 5 1
7 Child Female Middle Class 6 1
8 Child Male Middle Class 1 1
9 Child Male Upper Class 2 1
10 Teen Female Middle Class 8 1
11 Teen Female Upper Class 13 1
12 Senior Adult Female Upper Class 3 1
13 Adult Female Upper Class 45 0.978
14 Middle Age Adult Female Upper Class 32 0.969
15 Adult Female Middle Class 43 0.907
16 Middle Age Adult Female Middle Class 17 0.882
17 Teen Female Lower Class 30 0.6
18 Adult Female Lower Class 72 0.569
19 Adult Male Upper Class 45 0.511
20 Senior Adult Female Lower Class 2 0.5
Notice that a minus sign, -, can be used to change the ordering with arrange() from ascending (low to high) to descending (high to low).
---
title: "Data Transformation"
format:
html:
toc: true
toc-depth: 2
number-sections: true
code-fold: show
code-tools: true
smooth-scroll: true
---
For the purpose of learning data transformation, lets use a dataset included in the base R. We will begin with a simple example and then dive in into more advanced data manipulation that will be useful in your research.
# Loading and transforming the base R data
```{r}
data("mtcars")
```
## Inspecting the Data
We can view the first few rows and structure of the dataset as follows.
```r
head(mtcars)
glimpse(mtcars)
```
## Data Manipulation
Until now, we used base R commands to reorder, sort and select data. For advanced data manipulation, `tidyverse` library opens a whole new avenue for more advanced data transformation. We will work with data frames for this purpose.
The `dplyr` package in R is a subset of `tidyverse` package that provides a powerful and elegant grammar for data manipulation. This page introduces the most important `dplyr` functions using examples.
## Setup
Install and load the `tidyverse` package (which includes `dplyr`) and prepare an example dataset.
```r
# Install and load tidyverse (which includes dplyr)
install.packages("tidyverse")
library(tidyverse)
```
```{r}
#| echo: false
#| warning: false
library(tidyverse)
```
## The pipe operator
The `tidyverse` library contains a pipe-operator ` %>% `. We can also build our own operator in R as per our need , but we won't be discussing that right now. Lets understand how the pipe operator works. You can use pipe as if you are connecting a chain of `dplyr` commands. For example, if you want to select particular columns in a dataset, you could use `select(dataset, column1, column2....)`. Using pipe operator, you can achieve the same result as follows.
## Select Columns
Lets say you want the data about different cars regarding just the miles/gallon, number of cylinders and gross horsepower. You donot want other columns.
```{r}
#| echo: true
#| warning: false
#| results: false
mydata1 <- mtcars %>%
select(mpg, cyl, hp)
```
In the above command, we are simply specifying the dataset, and passing it through the pipe operator and selecting our required columns.
## Filter Rows
Now, you are looking for cars with miles per gallon greater than 20 and number of cylinders that equal exacly 4. In this case, we want to filter our dataset based on this information.
```{r}
#| echo: true
#| warning: false
#| results: false
mydata1 <- filter(mydata1, mpg > 20, cyl == 4)
```
Alternatively, we can use a pipe operator `mtcars %>% filter(mpg > 20, cyl == 4) `.
## Using multiple `dplyr` commands using pipe operator
We could just combine both of the above tasks using the pipe.
```{r}
#| echo: true
#| warning: false
#| results: false
mydata2 <- mtcars %>%
select(mpg, cyl, hp) %>%
filter(mpg > 20, cyl == 4)
```
We should be aware that we can do the same thing in R in various ways.
## Arrange Rows
Now, lets say you want to sort your data according to miles per gallon in a descending order.
```{r}
#| echo: true
#| warning: false
#| results: false
mydata3 <- mtcars %>%
arrange(desc(mpg))
```
## Mutate: Add New Variables
If we want to create a new column or variable in our dataset, we can use `mutate` command. Suppose we want to see what is the power-to-weight ratio like for each car models.
```{r}
#| echo: true
#| warning: false
#| results: false
mydata3 <- mydata3 %>%
mutate(power_to_weight = hp / wt)
```
## Summarize Data
To generate summary statistics such as mean, standard deviation, percentage et cetera, we can use `summarise` command. We want to know what's the average, minimum , maximum and standard deviation of the power to weight ratio. We can calculate those as follows:
```{r}
#| echo: true
#| warning: false
mydata3 %>%
summarise(Average = mean(power_to_weight),
`Standard Deviation` = sd(power_to_weight),
Minimum = min(power_to_weight),
Maximum = max(power_to_weight))
```
## Grouped Summary
Now, lets say we want the same statistics grouped by a category, in this case, the engine. We can use `group_by` command.
```{r}
#| echo: true
#| warning: false
mydata3 %>%
group_by(vs) %>%
summarise(Average = mean(power_to_weight),
`Standard Deviation` = sd(power_to_weight),
Minimum = min(power_to_weight),
Maximum = max(power_to_weight))
```
By now, you must be little bit familiar with how R base commands and dplyr commands work. The `tidyverse` package has a bundle of other packages within itself that are useful:
* `ggplot2` -- generate nearly any quantitative plot you would find in a scientific journal
* `dplyr` -- shortcuts for subsetting, summarizing, rearranging, and joining datasets
* `tidyr` -- intuitive functions for changing the layout of your data
* `stringr` -- necessary functions for working with character data
* `lubridate` -- tools that make working with dates and times *way* easier
In the next section, we will be working with titanic dataset to demonstrate the use of powerful `tidyverse` tools.
# Manipulating the Titanic Data
Lets load the titanic dataset first.
```{r, echo = TRUE}
# import using a url
titanic <- read.csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")
```
## Overview of the data
To start, preview the data using the `summary()` function.
```{r}
summary(titanic)
```
## Filtering Data
The `filter()` function is used to remove or keep observations from a dataset based on one or multiple logical conditions.
```{r}
titanic %>% filter(Fare == 0)
```
To remove the observations with `Fare` equal to zero, we will use the `!=` logical operator, which means *not* equal to. The new dataset will be called `titanic_clean`.
```{r}
# take the titanic data
titanic_clean <- titanic %>% # and then...
# remove "bad" data
filter(Fare != 0) # filter(Fare > 0) would also work
```
## Creating New Variables using `mutate()`
`mutate()` function is used to create new variables. Variables are created using the `=` sign, and multiple variables can be created by separating statements with a `,`.
In the following lines of code, several new variables are created:
* `age_sq`: Age squared
* `alone`: Binary variable that takes a value of 1 if the passenger was travelling alone and 0 otherwise. To do this, the `ifelse()` statement and `|` operator are used to output a value of 1 if `Siblings.Spouses.Aboard` *or* `Parents.Children.Aboard` is greater than zero. The OR operator, `|`, indicates that only one of the logical conditions needs to be true.
* `alone2`: Same variable as `alone`, but created using the AND, `&`, operator, which requires *both* logical conditions to be true. In this case *both* `Siblings.Spouses.Aboard` *and* `Parents.Children.Aboard` need to be equal to zero.
* `age_group`: Categorical variable indicating whether the passenger was an infant, toddler, child, teen, adult, middle age adult, or senior adult. This variable is created using the `case_when()` function, which allows you to "vectorize" multiple `ifelse()` statements. It essentially allows you to assign different outputs based on different logical conditions without having to nest several `ifelse()` statements. In the case below, different character outputs are assigned to different age ranges. The variable is then modified using `factor()` to define an ordinal ranking of the variable from youngest to oldest age group.
* `income_class`: Categorical variable indicating the income class of the passenger. For this variable, I assume that passenger class (1st, 2nd, 3rd) is a suitable proxy for upper, middle, and lower income status.
* `Sex`: A simple modification of the existing `Sex` variable to make the first letter capital.
```{r}
# take the titanic data
titanic_clean <- titanic %>% # and then...
# remove "bad" data
filter(Fare != 0) %>% # and then...
# create new variables
mutate(
age_sq = Age^2,
alone = ifelse(
Siblings.Spouses.Aboard > 0 | Parents.Children.Aboard > 0, 0, 1
),
alone2 = ifelse(
Siblings.Spouses.Aboard == 0 & Parents.Children.Aboard == 0, 1, 0
),
age_group = case_when(
Age < 1 ~ "Infant",
Age >= 1 & Age < 4 ~ "Toddler",
Age >= 4 & Age < 13 ~ "Child",
Age >= 13 & Age < 20 ~ "Teen",
Age >= 20 & Age < 40 ~ "Adult",
Age >= 40 & Age < 60 ~ "Middle Age Adult",
Age >= 60 ~ "Senior Adult"
),
age_group = factor(
age_group, levels = c("Infant", "Toddler", "Child", "Teen",
"Adult", "Middle Age Adult", "Senior Adult")
),
income_class = case_when(
Pclass == 1 ~ "Upper Class",
Pclass == 2 ~ "Middle Class",
Pclass == 3 ~ "Lower Class"
),
Sex = ifelse(
Sex == "female", "Female", "Male"
)
)
```
## Tidying Data
Now, we use `select()` to keep a set of variables we are interested in, `rename()` to rename variables, and `arrange()` to order the data.
```{r}
# take the titanic data
titanic_clean <- titanic %>% # and then...
# remove "bad" data
filter(Fare != 0) %>% # and then...
# create new variables
mutate(
age_sq = Age^2,
alone = ifelse(
Siblings.Spouses.Aboard > 0 | Parents.Children.Aboard > 0, 0, 1
),
alone2 = ifelse(
Siblings.Spouses.Aboard == 0 & Parents.Children.Aboard == 0, 1, 0
),
age_group = case_when(
Age < 1 ~ "Infant",
Age >= 1 & Age < 4 ~ "Toddler",
Age >= 4 & Age < 13 ~ "Child",
Age >= 13 & Age < 20 ~ "Teen",
Age >= 20 & Age < 40 ~ "Adult",
Age >= 40 & Age < 60 ~ "Middle Age Adult",
Age >= 60 ~ "Senior Adult"
),
age_group = factor(
age_group, levels = c("Infant", "Toddler", "Child", "Teen",
"Adult", "Middle Age Adult", "Senior Adult")
),
income_class = case_when(
Pclass == 1 ~ "Upper Class",
Pclass == 2 ~ "Middle Class",
Pclass == 3 ~ "Lower Class"
),
Sex = ifelse(
Sex == "female", "Female", "Male"
)
) %>% # and then...
# only keep variables of interest
select(Survived, Sex, Age, alone, age_group, income_class) %>% # and then...
# rename variables in a consistent manner (snake_case)
rename(survived = Survived,
sex = Sex,
age = Age) %>% # and then...
# order the data by age
arrange(age)
head(titanic_clean)
```
## Grouping and Summarizing Data
From the `tidyr` [vignette](https://tidyr.tidyverse.org/articles/tidy-data.html):
"It is often said that 80% of data analysis is spent on the cleaning and preparing data. And it’s not just a first step, but it must be repeated many times over the course of analysis as new problems come to light or new data is collected."
Now that 80% of the work is done, we can do some analysis. One approach is to look at survival rates by income class, age category, or other relevant groups. To do this, we will use the following two functions:
* `group_by()` -- define groups within the data such that all following operations are performed *by group* rather than *by observation*. (`ungroup()` removes grouping)
* `summarize()` -- performs operations by groups and creates a new data frame with one row per group and one column per summary statistic.
Survival rates by income class:
```{r}
titanic_clean %>%
# group by income class
group_by(income_class) %>%
# calculate survival rate by group
summarize(survival_rate = mean(survived))
```
There appears to be a consistent increase in survival chances as passengers moved from lower to middle to upper class. What might be the reason?
Survival rates by age group:
```{r}
titanic_clean %>%
group_by(age_group) %>%
summarize(survival_rate = mean(survived))
```
What might cause different age groups to have different survival rates?
Survival rates by solo travelers:
```{r}
titanic_clean %>%
group_by(alone) %>%
summarize(survival_rate = mean(survived))
```
Given the variability of survival rates across different variables, it may be interesting to look at survival rates conditional on multiple variables. These next lines of code create a new data frame, `titanic_grouped`, with survival rates by age group, sex, and income class. It may also be the case that some groups have very few passengers that fit the description (e.g., upper class female toddlers). Since survival *rates* can be deceptive depending on how many passengers fall into a certain group, we will use the `n()` function create a count.
```{r}
titanic_grouped <- titanic_clean %>%
group_by(age_group, sex, income_class) %>%
summarize(count = n(),
survival_rate = mean(survived)) %>%
ungroup() %>%
arrange(-survival_rate, age_group)
head(titanic_grouped, 20)
```
Notice that a minus sign, `-`, can be used to change the ordering with `arrange()` from ascending (low to high) to descending (high to low).