Data Visualization in R

This page introduces common plots using base R and ggplot2. Data visualization is a great way to understand trends and structure in data. It also helps us spot outliers and check data quality. Let's start with how plots work using base R commands.

1 Plots in R

Let's see how plots work using base R functions. We load the mtcars data again for this one.

Code

library(tidyverse)
data <- mtcars

1.1 Scatter Plot: MPG vs HP

Let's see how a car's mileage changes as its horsepower increases.

Code

plot(data$hp, data$mpg)

We see a negative relationship between mileage and horsepower: cars with more horsepower tend to get fewer miles per gallon, and vice versa.
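To put a number on that visual impression, one option is to compute the correlation and fit a simple straight line. The sketch below is an illustration rather than part of the original walkthrough; the object names `cars`, `hp_mpg_cor`, and `hp_slope` are made up for this example.

```r
# Uses the built-in mtcars data, as in the scatter plot above
cars <- mtcars

# Pearson correlation between horsepower and mileage;
# a negative value matches the downward trend in the plot
hp_mpg_cor <- cor(cars$hp, cars$mpg)
print(round(hp_mpg_cor, 2))

# Slope of a simple linear fit: the estimated change in mpg
# for each additional unit of horsepower
hp_slope <- unname(coef(lm(mpg ~ hp, data = cars))["hp"])
print(hp_slope)
```

A correlation only summarizes the linear part of the relationship, so it complements the scatter plot rather than replacing it.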

Can we create the same plot using one of the packages in the tidyverse? We can, with ggplot2.

Code

ggplot(data, aes(x = hp, y = mpg)) + 
  geom_point(color = "steelblue") +
  labs(title = "MPG vs Horsepower", x = "Horsepower", y = "Miles per Gallon") +
  theme_classic()

2 Visualization using `ggplot()`

You can create almost any kind of plot using the ggplot() function. You start a plot with ggplot(data, aes(x = ..., y = ...)) and then add elements such as geoms, text, fill colors, and facets using the + sign.
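The layering idea can be sketched by building a plot one step at a time. This example uses the built-in mtcars data as a stand-in, and the object names (`base_plot`, `with_points`, `with_facets`) are hypothetical.

```r
library(ggplot2)

# Step 1: data and aesthetic mappings only; this draws empty axes
base_plot <- ggplot(mtcars, aes(x = hp, y = mpg))

# Step 2: add a geom so the data are actually drawn
with_points <- base_plot + geom_point(color = "steelblue")

# Step 3: add facets and labels on top of the previous step
with_facets <- with_points +
  facet_wrap(~cyl) +
  labs(title = "MPG vs Horsepower by Number of Cylinders",
       x = "Horsepower",
       y = "Miles per Gallon")

# Printing a ggplot object renders it
print(with_facets)
```

Each intermediate object is a complete plot in its own right, which makes it easy to reuse a common base and vary only the later layers.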

2.1 Basic Plotting

The examples in this section use titanic_clean, a cleaned data frame of Titanic passengers with the columns survived, sex, age, alone, age_group, and income_class. First, create a basic bar chart of which passengers survived and which did not. For these plots, a numeric 0/1 survival variable is harder for a lay reader to interpret than a character variable that spells out whether a passenger survived or perished. A general rule of thumb is that graphs and figures should be interpretable by a viewer who has little to no knowledge of the data. In other words, figures should be entirely self-sufficient.

Code

titanic_clean <- titanic_clean %>%
  # recode the 0/1 survival indicator as readable labels
  mutate(survived_char = ifelse(survived == 1, "Survived", "Perished"),
         # fix the level order so "Survived" is plotted first
         survived_char = factor(survived_char,
                                levels = c("Survived", "Perished")))

ggplot(data = titanic_clean,
       mapping = aes(x = survived_char,
                     fill = sex)) +
  geom_bar(width = 0.5) +
  labs(x = NULL,
       y = "Count",
       title = "Survival Counts of Titanic Passengers by Sex",
       fill = NULL)
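The explicit levels argument in the recoding step matters because it controls the order of the bars. A minimal sketch with a small made-up vector (`survived` here is toy data, not the Titanic column) shows the difference:

```r
# Toy 0/1 survival indicator (hypothetical values for illustration)
survived <- c(1, 0, 0, 1, 0)
labels <- ifelse(survived == 1, "Survived", "Perished")

# Without explicit levels, factor() sorts the levels alphabetically
default_order <- factor(labels)
print(levels(default_order))

# With explicit levels, "Survived" comes first, so it is plotted first
custom_order <- factor(labels, levels = c("Survived", "Perished"))
print(levels(custom_order))

# Counts per label, in level order
print(table(custom_order))
```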

2.2 Intermediate Plotting

The next several plots highlight a few capabilities of ggplot(). In the following example, geom_text() overlays the passenger count on each colored bar segment, scale_y_continuous() specifies where the ticks occur along the y-axis, and theme_classic() is a built-in theme that removes some of the plot borders and grid lines.

Code

ggplot(titanic_clean,
       aes(x = survived_char, fill = sex)) +
  geom_bar(width = 0.5) +
  # add text to the plot that shows the size of each bar
  geom_text(aes(label = after_stat(count)), 
            stat = "count", 
            color = "white",
            position = position_stack(vjust = 0.5)) +
  labs(x = NULL,
       y = "Count",
       title = "Survival Counts of Titanic Passengers by Sex",
       fill = NULL) +
  # specify axis breaks
  scale_y_continuous(breaks = seq(0,500,100)) +
  # use classic theme
  theme_classic()
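The labels drawn by geom_text() with stat = "count" are the number of rows in each bar segment, so they can be cross-checked against a plain cross-tabulation. The sketch below uses mtcars as a stand-in data set; the columns `cyl_f` and `am_f` and the other object names are assumptions made for this example.

```r
library(ggplot2)

# Stand-in data: cylinders on the x-axis, transmission type as the fill
cars <- mtcars
cars$cyl_f <- factor(cars$cyl)
cars$am_f <- factor(cars$am, labels = c("Automatic", "Manual"))

# Cross-tabulate by hand: one row per stacked bar segment
segment_counts <- as.data.frame(table(cyl = cars$cyl_f,
                                      transmission = cars$am_f))
print(segment_counts)

# The same counts, computed by ggplot2 and centered within each segment
count_plot <- ggplot(cars, aes(x = cyl_f, fill = am_f)) +
  geom_bar(width = 0.5) +
  geom_text(aes(label = after_stat(count)),
            stat = "count",
            color = "white",
            position = position_stack(vjust = 0.5)) +
  labs(x = "Cylinders", y = "Count", fill = NULL) +
  theme_classic()
print(count_plot)
```

Here vjust = 0.5 inside position_stack() places each label at the vertical midpoint of its segment rather than at its top edge.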

In this next graph, facet_wrap() is used to split the graph into three panels based on the income_class variable, theme_bw() is another built-in theme, and theme() further modifies aspects of the graph. Within theme(), element_blank() removes things, element_text() modifies text, element_rect() modifies borders and backgrounds, and element_line() modifies lines.

Code

ggplot(titanic_clean,
       aes(x = survived_char, fill = sex)) +
  geom_bar(width = 0.5) +
  labs(x = NULL,
       y = "Count",
       title = "Survival Counts of Titanic Passengers by Sex and Socioeconomic Status",
       fill = NULL) +
  scale_y_continuous(breaks = seq(0,400,50)) +
  # split graph by socioeconomic status
  facet_wrap(~income_class) +
  # use the black and white theme
  theme_bw() +
  # eliminate most grid lines
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank())
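The plot above only uses element_blank(). As a rough illustration of the other three element functions, the following sketch styles a faceted mtcars plot; the specific theme settings are arbitrary choices for demonstration, not recommendations.

```r
library(ggplot2)

themed_plot <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  facet_wrap(~cyl) +
  labs(title = "MPG vs Horsepower by Number of Cylinders") +
  theme_bw() +
  theme(
    # element_text(): bold, centered title
    plot.title = element_text(face = "bold", hjust = 0.5),
    # element_rect(): light grey facet label background, no outline
    strip.background = element_rect(fill = "grey90", colour = NA),
    # element_line(): dashed horizontal grid lines
    panel.grid.major.y = element_line(linetype = "dashed"),
    # element_blank(): remove minor grid lines entirely
    panel.grid.minor = element_blank()
  )
print(themed_plot)
```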

You can manually specify colors within scale_fill_manual() or scale_color_manual() either by name (e.g., “red”) or by hexadecimal code (e.g., “#FF1234”). Check out https://coolors.co/ for some great user-generated palettes (with hex codes!). The viridis package also provides great pre-made color palettes.

Code

ggplot(titanic_clean,
       aes(x = age, fill = survived_char)) +
  # set the number of bins explicitly (30 is the default)
  geom_histogram(bins = 30) +
  # specify colors with hex codes
  scale_fill_manual(values = c("#E69F00", "#999999")) + 
  scale_x_continuous(breaks = seq(0,90,10)) +
  labs(y = "Count",
       x = "Age",
       title = "Survival Counts of Titanic Passengers by Age", 
       fill = NULL) +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid = element_blank(),
        # place the legend inside the plot panel (ggplot2 >= 3.5.0)
        legend.position = "inside",
        legend.position.inside = c(0.8, 0.8))
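When colors are passed as an unnamed vector, they are matched to the factor levels in order. A named vector makes the mapping explicit, and ggplot2 also ships viridis scales such as scale_fill_viridis_d(). The sketch below uses a tiny made-up data frame (`toy`); its values are purely illustrative.

```r
library(ggplot2)

# Hypothetical toy data with the same two survival labels
toy <- data.frame(
  status = factor(c("Survived", "Perished", "Perished", "Survived", "Perished"),
                  levels = c("Survived", "Perished"))
)

# Named vector: each color is tied to a level by name, not by position
status_colors <- c(Survived = "#E69F00", Perished = "#999999")

manual_plot <- ggplot(toy, aes(x = status, fill = status)) +
  geom_bar() +
  scale_fill_manual(values = status_colors)
print(manual_plot)

# Alternative: a discrete viridis palette built into ggplot2
viridis_plot <- ggplot(toy, aes(x = status, fill = status)) +
  geom_bar() +
  scale_fill_viridis_d(option = "D", end = 0.8)
print(viridis_plot)
```

With a named vector, reordering the factor levels later does not silently swap the colors.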

A density plot visualizes the distribution of a variable (basically a smoothed histogram). Be careful when comparing groups: the area under every density curve equals 1, so these plots communicate nothing about sample sizes.

Code

ggplot(titanic_clean,
       aes(x = age, fill = survived_char, linetype = survived_char)) +
  # alpha changes how transparent a color is
  geom_density(alpha = 0.5, linewidth = 0.6) +
  scale_fill_manual(values = c("#E69F00", "#999999")) + 
  scale_x_continuous(breaks = seq(0,90,10)) +
  labs(y = "Density",
       x = "Age",
       title = "Age Distributions of Titanic Passengers", 
       fill = NULL,
       linetype = NULL) +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid = element_blank(),
        # place the legend inside the plot panel (ggplot2 >= 3.5.0)
        legend.position = "inside",
        legend.position.inside = c(0.8, 0.8))
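The point that density curves hide sample size can be checked numerically. The sketch below simulates two groups of very different sizes (the data and the helper `area_under()` are made up for this illustration) and integrates each estimated density with the trapezoid rule.

```r
library(ggplot2)

set.seed(1)
small_group <- rnorm(50)    # 50 simulated observations
large_group <- rnorm(5000)  # 5000 simulated observations

# Approximate the area under a kernel density estimate (trapezoid rule)
area_under <- function(x) {
  d <- density(x)
  sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
}

area_small <- area_under(small_group)
area_large <- area_under(large_group)
print(c(small = area_small, large = area_large))  # both close to 1

# If group sizes matter, after_stat(count) rescales each curve
# by its number of observations
sim <- data.frame(
  value = c(small_group, large_group),
  group = rep(c("n = 50", "n = 5000"), times = c(50, 5000))
)
count_density <- ggplot(sim, aes(x = value, fill = group)) +
  geom_density(aes(y = after_stat(count)), alpha = 0.5)
print(count_density)
```

Both areas come out close to 1 despite the hundredfold difference in group size, which is why the count-scaled version can be the better choice when sample sizes are part of the story.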

For actual research, any or none of these examples may be suitable. The “best” way to visualize data depends heavily on the type of data and the audience you are communicating with. These two sources can be very helpful:

1. Data to Viz (https://www.data-to-viz.com/) – Helps you find appropriate graphs for your data
2. The R Graph Gallery (https://r-graph-gallery.com/index.html) – Provides R code for every graph imaginable

2.3 Saving Plots

Plots in R can be saved as objects, which can then be exported as images. With the ggsave() function, you can specify the image’s height, width, and resolution (dpi). You can save your plots in different file formats. For now, let's save the plot in the .png format.

Code

p <- ggplot(titanic_clean,
       aes(x = age, fill = survived_char, linetype = survived_char)) +
  # alpha changes how transparent a color is
  geom_density(alpha = 0.5, linewidth = 0.6) +
  scale_fill_manual(values = c("#E69F00", "#999999")) + 
  scale_x_continuous(breaks = seq(0,90,10)) +
  labs(y = "Density",
       x = "Age",
       title = "Age Distributions of Titanic Passengers", 
       fill = NULL,
       linetype = NULL) +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid = element_blank(),
        # place the legend inside the plot panel (ggplot2 >= 3.5.0)
        legend.position = "inside",
        legend.position.inside = c(0.8, 0.8))

ggsave(plot = p,
       filename = "D:/R_projects/darecodecamp/plots/titanic_density.png",
       height = 6,
       width = 9,
       dpi = 300)
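The absolute path above only exists on the author's machine. A more portable pattern, sketched below under the assumption that a temporary folder is an acceptable destination, builds the path with file.path() and creates the folder first. The plot, folder, and file names here are hypothetical.

```r
library(ggplot2)

# A small stand-in plot to save
scatter <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "steelblue")

# Build the output folder path and create it if it does not exist
out_dir <- file.path(tempdir(), "plots")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

png_path <- file.path(out_dir, "mpg_vs_hp.png")
pdf_path <- file.path(out_dir, "mpg_vs_hp.pdf")

# ggsave() picks the file format from the extension;
# width and height are in inches by default
ggsave(filename = png_path, plot = scatter, width = 9, height = 6, dpi = 300)
ggsave(filename = pdf_path, plot = scatter, width = 9, height = 6)

print(file.exists(c(png_path, pdf_path)))
```

In a real project, replacing tempdir() with a folder inside the project directory keeps the saved figures alongside the code.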

The titanic_clean data frame used in Section 2 is built by the following chunk, which the rendered page hides (echo: false). Run it after loading the tidyverse and before the plotting code in Section 2.

Code

titanic <- read.csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")

# take the titanic data
titanic_clean <- titanic %>% # and then...
  # remove "bad" data
  filter(Fare != 0) %>% # and then...
  # create new variables
  mutate(
    age_sq = Age^2,
    alone = ifelse(Siblings.Spouses.Aboard > 0 | Parents.Children.Aboard > 0, 0, 1),
    alone2 = ifelse(Siblings.Spouses.Aboard == 0 & Parents.Children.Aboard == 0, 1, 0),
    age_group = case_when(
      Age < 1 ~ "Infant",
      Age >= 1 & Age < 4 ~ "Toddler",
      Age >= 4 & Age < 13 ~ "Child",
      Age >= 13 & Age < 20 ~ "Teen",
      Age >= 20 & Age < 40 ~ "Adult",
      Age >= 40 & Age < 60 ~ "Middle Age Adult",
      Age >= 60 ~ "Senior Adult"
    ),
    age_group = factor(
      age_group,
      levels = c("Infant", "Toddler", "Child", "Teen", "Adult",
                 "Middle Age Adult", "Senior Adult")
    ),
    income_class = case_when(
      Pclass == 1 ~ "Upper Class",
      Pclass == 2 ~ "Middle Class",
      Pclass == 3 ~ "Lower Class"
    ),
    Sex = ifelse(Sex == "female", "Female", "Male")
  ) %>% # and then...
  # only keep variables of interest
  select(Survived, Sex, Age, alone, age_group, income_class) %>% # and then...
  # rename variables in a consistent manner (snake_case)
  rename(survived = Survived, sex = Sex, age = Age) %>% # and then...
  # order the data by age
  arrange(age)
