Data Transformation

For the purpose of learning data transformation, lets use a dataset included in the base R. We will begin with a simple example and then dive in into more advanced data manipulation that will be useful in your research.

1 Loading and transforming the base R data

Code

data("mtcars")

1.1 Inspecting the Data

We can view the first few rows and structure of the dataset as follows.

head(mtcars)
glimpse(mtcars)

1.2 Data Manipulation

Until now, we used base R commands to reorder, sort and select data. For advanced data manipulation, tidyverse library opens a whole new avenue for more advanced data transformation. We will work with data frames for this purpose.

The dplyr package in R is a subset of tidyverse package that provides a powerful and elegant grammar for data manipulation. This page introduces the most important dplyr functions using examples.

1.3 Setup

Install and load the tidyverse package (which includes dplyr) and prepare an example dataset.

# Install and load tidyverse (which includes dplyr)
install.packages("tidyverse")
library(tidyverse)

1.4 The pipe operator

The tidyverse library contains a pipe-operator %>%. We can also build our own operator in R as per our need , but we won’t be discussing that right now. Lets understand how the pipe operator works. You can use pipe as if you are connecting a chain of dplyr commands. For example, if you want to select particular columns in a dataset, you could use select(dataset, column1, column2....). Using pipe operator, you can achieve the same result as follows.

1.5 Select Columns

Lets say you want the data about different cars regarding just the miles/gallon, number of cylinders and gross horsepower. You donot want other columns.

Code

mydata1 <- mtcars %>% 
  select(mpg, cyl, hp)

In the above command, we are simply specifying the dataset, and passing it through the pipe operator and selecting our required columns.

1.6 Filter Rows

Now, you are looking for cars with miles per gallon greater than 20 and number of cylinders that equal exacly 4. In this case, we want to filter our dataset based on this information.

Code

mydata1 <- filter(mydata1, mpg > 20, cyl == 4)

Alternatively, we can use a pipe operator mtcars %>% filter(mpg > 20, cyl == 4).

1.7 Using multiple `dplyr` commands using pipe operator

We could just combine both of the above tasks using the pipe.

Code

mydata2 <- mtcars %>% 
  select(mpg, cyl, hp) %>% 
  filter(mpg > 20, cyl == 4)

We should be aware that we can do the same thing in R in various ways.

1.8 Arrange Rows

Now, lets say you want to sort your data according to miles per gallon in a descending order.

Code

mydata3 <- mtcars %>% 
  arrange(desc(mpg))

1.9 Mutate: Add New Variables

If we want to create a new column or variable in our dataset, we can use mutate command. Suppose we want to see what is the power-to-weight ratio like for each car models.

Code

mydata3 <- mydata3 %>% 
  mutate(power_to_weight = hp / wt)

1.10 Summarize Data

To generate summary statistics such as mean, standard deviation, percentage et cetera, we can use summarise command. We want to know what’s the average, minimum , maximum and standard deviation of the power to weight ratio. We can calculate those as follows:

Code

mydata3 %>% 
  summarise(Average = mean(power_to_weight), 
            `Standard Deviation` = sd(power_to_weight),
            Minimum = min(power_to_weight),
            Maximum = max(power_to_weight))

   Average Standard Deviation  Minimum  Maximum
1 45.33466           16.28767 19.43574 93.83754

1.11 Grouped Summary

Now, lets say we want the same statistics grouped by a category, in this case, the engine. We can use group_by command.

Code

mydata3 %>% 
  group_by(vs) %>% 
  summarise(Average = mean(power_to_weight), 
            `Standard Deviation` = sd(power_to_weight),
            Minimum = min(power_to_weight),
            Maximum = max(power_to_weight))

# A tibble: 2 × 5
     vs Average `Standard Deviation` Minimum Maximum
  <dbl>   <dbl>                <dbl>   <dbl>   <dbl>
1     0    52.2                 16.0    38.3    93.8
2     1    36.5                 12.2    19.4    74.7

By now, you must be little bit familiar with how R base commands and dplyr commands work. The tidyverse package has a bundle of other packages within itself that are useful:

ggplot2 – generate nearly any quantitative plot you would find in a scientific journal
dplyr – shortcuts for subsetting, summarizing, rearranging, and joining datasets
tidyr – intuitive functions for changing the layout of your data
stringr – necessary functions for working with character data
lubridate – tools that make working with dates and times way easier

In the next section, we will be working with titanic dataset to demonstrate the use of powerful tidyverse tools.

2 Manipulating the Titanic Data

Lets load the titanic dataset first.

Code

# import using a url
titanic <- read.csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")

2.1 Overview of the data

To start, preview the data using the summary() function.

Code

summary(titanic)

    Survived          Pclass          Name               Sex           
 Min.   :0.0000   Min.   :1.000   Length:887         Length:887        
 1st Qu.:0.0000   1st Qu.:2.000   Class :character   Class :character  
 Median :0.0000   Median :3.000   Mode  :character   Mode  :character  
 Mean   :0.3856   Mean   :2.306                                        
 3rd Qu.:1.0000   3rd Qu.:3.000                                        
 Max.   :1.0000   Max.   :3.000                                        
      Age        Siblings.Spouses.Aboard Parents.Children.Aboard
 Min.   : 0.42   Min.   :0.0000          Min.   :0.0000         
 1st Qu.:20.25   1st Qu.:0.0000          1st Qu.:0.0000         
 Median :28.00   Median :0.0000          Median :0.0000         
 Mean   :29.47   Mean   :0.5254          Mean   :0.3833         
 3rd Qu.:38.00   3rd Qu.:1.0000          3rd Qu.:0.0000         
 Max.   :80.00   Max.   :8.0000          Max.   :6.0000         
      Fare        
 Min.   :  0.000  
 1st Qu.:  7.925  
 Median : 14.454  
 Mean   : 32.305  
 3rd Qu.: 31.137  
 Max.   :512.329

2.2 Filtering Data

The filter() function is used to remove or keep observations from a dataset based on one or multiple logical conditions.

Code

titanic %>% filter(Fare == 0)

   Survived Pclass                           Name  Sex Age
1         0      3             Mr. Lionel Leonard male  36
2         0      1           Mr. William Harrison male  40
3         1      3    Mr. William Henry Tornquist male  25
4         0      2             Mr. Francis Parkes male  21
5         0      3 Mr. William Cahoone Jr Johnson male  19
6         0      2  Mr. Alfred Fleming Cunningham male  22
7         0      2           Mr. William Campbell male  21
8         0      2         Mr. Anthony Wood Frost male  37
9         0      3             Mr. Alfred Johnson male  49
10        0      1   Mr. William Henry Marsh Parr male  30
11        0      2      Mr. Ennis Hastings Watson male  19
12        0      2            Mr. Robert J Knight male  41
13        0      1          Mr. Thomas Jr Andrews male  39
14        0      1                Mr. Richard Fry male  39
15        0      1 Jonkheer. John George Reuchlin male  38
   Siblings.Spouses.Aboard Parents.Children.Aboard Fare
1                        0                       0    0
2                        0                       0    0
3                        0                       0    0
4                        0                       0    0
5                        0                       0    0
6                        0                       0    0
7                        0                       0    0
8                        0                       0    0
9                        0                       0    0
10                       0                       0    0
11                       0                       0    0
12                       0                       0    0
13                       0                       0    0
14                       0                       0    0
15                       0                       0    0

To remove the observations with Fare equal to zero, we will use the != logical operator, which means not equal to. The new dataset will be called titanic_clean.

Code

# take the titanic data
titanic_clean <- titanic %>% # and then...
  # remove "bad" data
  filter(Fare != 0) # filter(Fare > 0) would also work

2.3 Creating New Variables using `mutate()`

mutate() function is used to create new variables. Variables are created using the = sign, and multiple variables can be created by separating statements with a ,.

In the following lines of code, several new variables are created:

age_sq: Age squared
alone: Binary variable that takes a value of 1 if the passenger was travelling alone and 0 otherwise. To do this, the ifelse() statement and | operator are used to output a value of 1 if Siblings.Spouses.Aboard or Parents.Children.Aboard is greater than zero. The OR operator, |, indicates that only one of the logical conditions needs to be true.
alone2: Same variable as alone, but created using the AND, &, operator, which requires both logical conditions to be true. In this case both Siblings.Spouses.Aboard and Parents.Children.Aboard need to be equal to zero.
age_group: Categorical variable indicating whether the passenger was an infant, toddler, child, teen, adult, middle age adult, or senior adult. This variable is created using the case_when() function, which allows you to “vectorize” multiple ifelse() statements. It essentially allows you to assign different outputs based on different logical conditions without having to nest several ifelse() statements. In the case below, different character outputs are assigned to different age ranges. The variable is then modified using factor() to define an ordinal ranking of the variable from youngest to oldest age group.
income_class: Categorical variable indicating the income class of the passenger. For this variable, I assume that passenger class (1st, 2nd, 3rd) is a suitable proxy for upper, middle, and lower income status.
Sex: A simple modification of the existing Sex variable to make the first letter capital.

Code

# take the titanic data
titanic_clean <- titanic %>% # and then...
  # remove "bad" data
  filter(Fare != 0) %>% # and then...
  # create new variables
  mutate( 
    age_sq = Age^2, 
    alone = ifelse(
      Siblings.Spouses.Aboard > 0 | Parents.Children.Aboard > 0, 0, 1
      ),
    alone2 = ifelse(
      Siblings.Spouses.Aboard == 0 & Parents.Children.Aboard == 0, 1, 0
      ),
    age_group = case_when(
      Age <  1             ~ "Infant",
      Age >= 1  & Age < 4  ~ "Toddler",
      Age >= 4  & Age < 13 ~ "Child",
      Age >= 13 & Age < 20 ~ "Teen",
      Age >= 20 & Age < 40 ~ "Adult",
      Age >= 40 & Age < 60 ~ "Middle Age Adult",
      Age >= 60            ~ "Senior Adult"
      ),
    age_group = factor(
      age_group, levels = c("Infant", "Toddler", "Child", "Teen",
                            "Adult", "Middle Age Adult", "Senior Adult")
      ),
    income_class = case_when(
      Pclass == 1 ~ "Upper Class",
      Pclass == 2 ~ "Middle Class",
      Pclass == 3 ~ "Lower Class"
      ),
    Sex = ifelse(
      Sex == "female", "Female", "Male"
      )
    )

2.4 Tidying Data

Now, we use select() to keep a set of variables we are interested in, rename() to rename variables, and arrange() to order the data.

Code

# take the titanic data
titanic_clean <- titanic %>% # and then...
  # remove "bad" data
  filter(Fare != 0) %>% # and then...
  # create new variables
  mutate( 
    age_sq = Age^2, 
    alone = ifelse(
      Siblings.Spouses.Aboard > 0 | Parents.Children.Aboard > 0, 0, 1
      ),
    alone2 = ifelse(
      Siblings.Spouses.Aboard == 0 & Parents.Children.Aboard == 0, 1, 0
      ),
    age_group = case_when(
      Age <  1             ~ "Infant",
      Age >= 1  & Age < 4  ~ "Toddler",
      Age >= 4  & Age < 13 ~ "Child",
      Age >= 13 & Age < 20 ~ "Teen",
      Age >= 20 & Age < 40 ~ "Adult",
      Age >= 40 & Age < 60 ~ "Middle Age Adult",
      Age >= 60            ~ "Senior Adult"
      ),
    age_group = factor(
      age_group, levels = c("Infant", "Toddler", "Child", "Teen",
                            "Adult", "Middle Age Adult", "Senior Adult")
      ),
    income_class = case_when(
      Pclass == 1 ~ "Upper Class",
      Pclass == 2 ~ "Middle Class",
      Pclass == 3 ~ "Lower Class"
      ),
    Sex = ifelse(
      Sex == "female", "Female", "Male"
      )
    ) %>%  # and then...
  # only keep variables of interest
  select(Survived, Sex, Age, alone, age_group, income_class) %>% # and then...
  # rename variables in a consistent manner (snake_case)
  rename(survived = Survived,
         sex = Sex,
         age = Age) %>% # and then...
  # order the data by age
  arrange(age)

head(titanic_clean)

  survived    sex  age alone age_group income_class
1        1   Male 0.42     0    Infant  Lower Class
2        1   Male 0.67     0    Infant Middle Class
3        1 Female 0.75     0    Infant  Lower Class
4        1 Female 0.75     0    Infant  Lower Class
5        1   Male 0.83     0    Infant Middle Class
6        1   Male 0.83     0    Infant Middle Class

2.5 Grouping and Summarizing Data

From the tidyr vignette:

“It is often said that 80% of data analysis is spent on the cleaning and preparing data. And it’s not just a first step, but it must be repeated many times over the course of analysis as new problems come to light or new data is collected.”

Now that 80% of the work is done, we can do some analysis. One approach is to look at survival rates by income class, age category, or other relevant groups. To do this, we will use the following two functions:

group_by() – define groups within the data such that all following operations are performed by group rather than by observation. (ungroup() removes grouping)
summarize() – performs operations by groups and creates a new data frame with one row per group and one column per summary statistic.

Survival rates by income class:

Code

titanic_clean %>% 
  # group by income class
  group_by(income_class) %>%
  # calculate survival rate by group
  summarize(survival_rate = mean(survived))

# A tibble: 3 × 2
  income_class survival_rate
  <chr>                <dbl>
1 Lower Class          0.244
2 Middle Class         0.489
3 Upper Class          0.645

There appears to be a consistent increase in survival chances as passengers moved from lower to middle to upper class. What might be the reason?

Survival rates by age group:

Code

titanic_clean %>% 
  group_by(age_group) %>% 
  summarize(survival_rate = mean(survived))

# A tibble: 7 × 2
  age_group        survival_rate
  <fct>                    <dbl>
1 Infant                   1    
2 Toddler                  0.56 
3 Child                    0.468
4 Teen                     0.390
5 Adult                    0.373
6 Middle Age Adult         0.401
7 Senior Adult             0.226

What might cause different age groups to have different survival rates?

Survival rates by solo travelers:

Code

titanic_clean %>% 
  group_by(alone) %>% 
  summarize(survival_rate = mean(survived))

# A tibble: 2 × 2
  alone survival_rate
  <dbl>         <dbl>
1     0         0.506
2     1         0.313

Given the variability of survival rates across different variables, it may be interesting to look at survival rates conditional on multiple variables. These next lines of code create a new data frame, titanic_grouped, with survival rates by age group, sex, and income class. It may also be the case that some groups have very few passengers that fit the description (e.g., upper class female toddlers). Since survival rates can be deceptive depending on how many passengers fall into a certain group, we will use the n() function create a count.

Code

titanic_grouped <- titanic_clean %>% 
  group_by(age_group, sex, income_class) %>% 
  summarize(count = n(),
            survival_rate = mean(survived)) %>% 
  ungroup() %>% 
  arrange(-survival_rate, age_group)

`summarise()` has grouped output by 'age_group', 'sex'. You can override using
the `.groups` argument.

Code

head(titanic_grouped, 20)

# A tibble: 20 × 5
   age_group        sex    income_class count survival_rate
   <fct>            <chr>  <chr>        <int>         <dbl>
 1 Infant           Female Lower Class      2         1    
 2 Infant           Male   Lower Class      1         1    
 3 Infant           Male   Middle Class     3         1    
 4 Infant           Male   Upper Class      1         1    
 5 Toddler          Female Middle Class     2         1    
 6 Toddler          Male   Middle Class     5         1    
 7 Child            Female Middle Class     6         1    
 8 Child            Male   Middle Class     1         1    
 9 Child            Male   Upper Class      2         1    
10 Teen             Female Middle Class     8         1    
11 Teen             Female Upper Class     13         1    
12 Senior Adult     Female Upper Class      3         1    
13 Adult            Female Upper Class     45         0.978
14 Middle Age Adult Female Upper Class     32         0.969
15 Adult            Female Middle Class    43         0.907
16 Middle Age Adult Female Middle Class    17         0.882
17 Teen             Female Lower Class     30         0.6  
18 Adult            Female Lower Class     72         0.569
19 Adult            Male   Upper Class     45         0.511
20 Senior Adult     Female Lower Class      2         0.5

Notice that a minus sign, -, can be used to change the ordering with arrange() from ascending (low to high) to descending (high to low).

1 Loading and transforming the base R data

1.1 Inspecting the Data

1.2 Data Manipulation

1.3 Setup

1.4 The pipe operator

1.5 Select Columns

1.6 Filter Rows

1.7 Using multiple dplyr commands using pipe operator

1.8 Arrange Rows

1.9 Mutate: Add New Variables

1.10 Summarize Data

1.11 Grouped Summary

2 Manipulating the Titanic Data

2.1 Overview of the data

2.2 Filtering Data

2.3 Creating New Variables using mutate()

2.4 Tidying Data

2.5 Grouping and Summarizing Data

1.7 Using multiple `dplyr` commands using pipe operator

2.3 Creating New Variables using `mutate()`