Data Visualization in R

This page introduces common plots using base R and ggplot2. Data visualization is a great way to understand trends and structure in data. It also lets us spot outliers and run a quick data-quality check. Let's start with how plots work using base R commands.

1 Plots in R

Let's see how plots work using base R functions. We load the mtcars data again for this one.

Code
library(tidyverse)
data <- mtcars

1.1 Scatter Plot: MPG vs HP

Let's see how the mileage of a car changes as its horsepower increases.

Code
plot(data$hp, data$mpg)

MPG vs Horsepower

We see a negative relationship between mileage and horsepower: cars with more horsepower tend to get fewer miles per gallon, and vice versa.
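The strength of this relationship can also be put into a number. The sketch below is only an illustration: it assumes a Pearson correlation and a straight-line fit are reasonable summaries of the scatter plot, which the text above does not claim.

```r
# Pearson correlation between horsepower and mileage in mtcars;
# a negative value matches the downward trend in the scatter plot
r <- cor(mtcars$hp, mtcars$mpg)
r

# Same base R scatter plot, with axis labels and a fitted straight line
plot(mtcars$hp, mtcars$mpg,
     xlab = "Horsepower", ylab = "Miles per Gallon",
     main = "MPG vs Horsepower")
abline(lm(mpg ~ hp, data = mtcars), col = "steelblue")
```

A correlation only captures the linear part of the trend, so it is best read alongside the plot rather than instead of it.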

Can we create the same plot using one of the packages in the tidyverse? We can, using ggplot2.

Code
ggplot(data, aes(x = hp, y = mpg)) + 
  geom_point(color = "steelblue") +
  labs(title = "MPG vs Horsepower", x = "Horsepower", y = "Miles per Gallon") +
  theme_classic()

2 Visualization using ggplot()

You can create almost any kind of plot using the ggplot() function. Start a plot with ggplot(data, aes(x = ..., y = ...)), add a geom such as geom_point() to draw the data, and then add elements like text, fill, facets, etc. to your plot using the + sign.
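To make the layering idea concrete, here is a minimal sketch that builds one plot in three steps. It uses mtcars rather than the Titanic data, and the object names (base, with_points, full) are made up for this illustration.

```r
library(ggplot2)

# Step 1: data and aesthetic mappings only; printing this draws empty axes
base <- ggplot(mtcars, aes(x = wt, y = mpg))

# Step 2: add a geom layer so the points are actually drawn
with_points <- base + geom_point()

# Step 3: add labels and facets, each with another `+`
full <- with_points +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon") +
  facet_wrap(~cyl)

full
```

Because each step is stored in its own object, you can print any intermediate version to see what a single `+` contributed.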

2.1 Basic Plotting

First, create a basic bar chart of which passengers survived and which did not. For these plots, a numeric 0/1 variable for survival is harder for a general reader to interpret than a character variable that spells out whether passengers lived or died. A general rule of thumb for graphs and figures is that they should be interpretable by a viewer who has little to no understanding of the data. In other words, figures should be self-explanatory.

Code
titanic_clean <- titanic_clean %>%
  mutate(survived_char = ifelse(survived == 1, "Survived", "Perished"),
         survived_char = factor(survived_char,
                                levels = c("Survived", "Perished")))

ggplot(data = titanic_clean,
       mapping = aes(x = survived_char,
                     fill = sex)) +
  geom_bar(width = 0.5) +
  labs(x = NULL,
       y = "Count",
       title = "Survival Counts of Titanic Passengers by Sex",
       fill = NULL)
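The recoding step can be tried on its own without the Titanic data. The sketch below uses a small made-up data frame (toy) and does the relabelling with a single factor() call, an alternative to the ifelse() plus factor() approach used above.

```r
# Hypothetical toy data standing in for titanic_clean (not the real data set)
toy <- data.frame(survived = c(1, 0, 0, 1, 0))

# factor() maps each code to a readable label in one step;
# the order of `levels`/`labels` sets the order of the bars in a plot
toy$survived_char <- factor(toy$survived,
                            levels = c(1, 0),
                            labels = c("Survived", "Perished"))

levels(toy$survived_char)
table(toy$survived_char)
```

Setting the level order explicitly matters because ggplot2 draws bars in factor-level order, not in the order the values appear in the data.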

2.2 Intermediate Plotting

These next several plots are meant to highlight a few capabilities of ggplot(). In the following example, geom_text() is used to overlay the passenger count on each colored bar segment, scale_y_continuous() is used to specify where the ticks occur along the y-axis, and theme_classic() is a built-in theme that removes some of the plot borders and grid lines.

Code
ggplot(titanic_clean,
       aes(x = survived_char, fill = sex)) +
  geom_bar(width = 0.5) +
  # add text to the plot that shows the size of each bar
  geom_text(aes(label = after_stat(count)), 
            stat = "count", 
            color = "white",
            position = position_stack(vjust = 0.5)) +
  labs(x = NULL,
       y = "Count",
       title = "Survival Counts of Titanic Passengers by Sex",
       fill = NULL) +
  # specify axis breaks
  scale_y_continuous(breaks = seq(0,500,100)) +
  # use classic theme
  theme_classic()
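The same labelling pattern works on any data set. Here is a small sketch on mtcars, with cyl and am standing in for survival and sex; the stand-in variables and the object names cars and p are assumptions for illustration only.

```r
library(ggplot2)

# mtcars stands in for titanic_clean; cyl and am become factors so that
# ggplot2 treats them as categories (am: 0 = automatic, 1 = manual)
cars <- transform(mtcars,
                  cyl = factor(cyl),
                  am  = factor(am, labels = c("Automatic", "Manual")))

p <- ggplot(cars, aes(x = cyl, fill = am)) +
  geom_bar(width = 0.5) +
  # stat = "count" computes the same counts that geom_bar() draws;
  # position_stack(vjust = 0.5) centers each label inside its segment
  geom_text(aes(label = after_stat(count)),
            stat = "count",
            color = "white",
            position = position_stack(vjust = 0.5))
p

# The counts computed for the label layer add up to the number of rows
sum(layer_data(p, 2)$count)
```

Both layers need the same position adjustment: geom_bar() stacks by default, so the text layer must stack too or the labels will not line up with the segments.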

In this next graph, facet_wrap() is used to split the graph into three panels based on the income_class variable, theme_bw() is another built-in theme, and theme() further modifies aspects of the graph. Within theme(), element_blank() removes things, element_text() modifies text, element_rect() modifies borders and backgrounds, and element_line() modifies lines.

Code
ggplot(titanic_clean,
       aes(x = survived_char, fill = sex)) +
  geom_bar(width = 0.5) +
  labs(x = NULL,
       y = "Count",
       title = "Survival Counts of Titanic Passengers by Sex and Socioeconomic Status",
       fill = NULL) +
  scale_y_continuous(breaks = seq(0,400,50)) +
  # split graph by socioeconomic status
  facet_wrap(~income_class) +
  # use the black and white theme
  theme_bw() +
  # eliminate most grid lines
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank())
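The code above only uses element_blank(). As a rough sketch of the other three element functions, the example below restyles a simple mtcars scatter plot; the specific colors and settings are arbitrary choices for illustration.

```r
library(ggplot2)

p_theme <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "MPG vs Weight") +
  theme_bw() +
  theme(
    # element_text(): bold, centered title
    plot.title = element_text(face = "bold", hjust = 0.5),
    # element_rect(): grey panel border with no fill
    panel.border = element_rect(color = "grey40", fill = NA),
    # element_line(): light dashed major grid lines
    panel.grid.major = element_line(color = "grey90", linetype = "dashed"),
    # element_blank(): drop minor grid lines entirely
    panel.grid.minor = element_blank()
  )
p_theme
```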

You can manually specify colors within scale_fill_manual() or scale_color_manual() either by name (e.g., “red”) or by hexadecimal code (e.g., “#FF1234”). Check out https://coolors.co/ for some great user-generated palettes (with hex codes!). The viridis package also provides great pre-made color palettes.

Code
ggplot(titanic_clean,
       aes(x = age, fill = survived_char)) +
  # set the number of bins explicitly (30 is the default)
  geom_histogram(bins = 30) +
  # specify colors with hex codes
  scale_fill_manual(values = c("#E69F00", "#999999")) + 
  scale_x_continuous(breaks = seq(0,90,10)) +
  labs(y = "Count",
       x = "Age",
       title = "Survival Counts of Titanic Passengers by Age", 
       fill = NULL) +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid = element_blank(),
        # place the legend inside the plot panel (ggplot2 >= 3.5.0)
        legend.position = "inside",
        legend.position.inside = c(0.8, 0.8))
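When colors are given as an unnamed vector, they are matched to factor levels by position. A named vector makes the pairing explicit. The sketch below shows that, plus a viridis palette via scale_fill_viridis_d(), which ships with ggplot2. It uses mtcars as a stand-in, and the names cars and am_colors are made up for this example.

```r
library(ggplot2)

cars <- transform(mtcars, am = factor(am, labels = c("Automatic", "Manual")))

# A named vector ties each color to a specific factor level
am_colors <- c(Automatic = "#E69F00", Manual = "#999999")

p_manual <- ggplot(cars, aes(x = mpg, fill = am)) +
  geom_histogram(binwidth = 2.5) +
  scale_fill_manual(values = am_colors)

# Discrete viridis palette instead of hand-picked colors
p_viridis <- ggplot(cars, aes(x = mpg, fill = am)) +
  geom_histogram(binwidth = 2.5) +
  scale_fill_viridis_d()

p_manual
p_viridis
```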

A density plot visualizes the distribution of a variable (essentially a smoothed histogram). Keep in mind that the area under any density curve equals 1, so these plots communicate nothing about sample sizes.

Code
ggplot(titanic_clean,
       aes(x = age, fill = survived_char, linetype = survived_char)) +
  # alpha changes how transparent a color is
  geom_density(alpha = 0.5, linewidth = 0.6) +
  scale_fill_manual(values = c("#E69F00", "#999999")) + 
  scale_x_continuous(breaks = seq(0,90,10)) +
  labs(y = "Density",
       x = "Age",
       title = "Age Distributions of Titanic Passengers", 
       fill = NULL,
       linetype = NULL) +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid = element_blank(),
        # place the legend inside the plot panel (ggplot2 >= 3.5.0)
        legend.position = "inside",
        legend.position.inside = c(0.8, 0.8))
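The point about area can be checked numerically. This sketch uses simulated data (not the Titanic ages) and a hypothetical helper, area_under(), that approximates the area under a kernel density estimate with the trapezoid rule.

```r
set.seed(1)
small <- rnorm(50)    # simulated sample with 50 observations
large <- rnorm(5000)  # simulated sample with 5000 observations

# Trapezoid-rule approximation of the area under density(x)
area_under <- function(x) {
  d <- density(x)
  sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
}

area_under(small)
area_under(large)
```

Both areas come out close to 1 even though one sample is a hundred times larger than the other, which is why a density plot alone cannot tell you how many observations sit behind each curve.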

For actual research, any or none of these examples may be suitable. The “best” way to visualize data depends heavily on the type of data and the audience you are communicating with. These two resources can be very helpful:

  1. Data to Viz – Helps you find appropriate graphs for your data
  2. The R Graph Gallery – Provides R code for every graph imaginable

2.3 Saving Plots

Plots in R can be saved as objects, which can then be exported as images. With the ggsave() function, you can specify the image’s height, width, and resolution (dpi). You can save your plots in different file formats. For now, let's save one in the .png format.

Code
p <- ggplot(titanic_clean,
       aes(x = age, fill = survived_char, linetype = survived_char)) +
  # alpha changes how transparent a color is
  geom_density(alpha = 0.5, linewidth = 0.6) +
  scale_fill_manual(values = c("#E69F00", "#999999")) + 
  scale_x_continuous(breaks = seq(0,90,10)) +
  labs(y = "Density",
       x = "Age",
       title = "Age Distributions of Titanic Passengers", 
       fill = NULL,
       linetype = NULL) +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid = element_blank(),
        # place the legend inside the plot panel (ggplot2 >= 3.5.0)
        legend.position = "inside",
        legend.position.inside = c(0.8, 0.8))
ggsave(plot = p,
       filename = "D:/R_projects/darecodecamp/plots/titanic_density.png",
       height = 6,
       width = 9,
       dpi = 300)
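An absolute path like the one above only works on the machine it was written for. As a sketch of a more portable pattern, the example below builds the path with file.path() and writes to a folder inside tempdir(). The folder name, file names, and p_demo are placeholders chosen for illustration; in a real project you would point out_dir at your own plots folder.

```r
library(ggplot2)

p_demo <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()

# Hypothetical output folder inside the session's temporary directory
out_dir <- file.path(tempdir(), "plots")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

png_file <- file.path(out_dir, "mpg_vs_hp.png")
pdf_file <- file.path(out_dir, "mpg_vs_hp.pdf")

# ggsave() picks the graphics device from the file extension;
# height and width are in inches by default
ggsave(filename = png_file, plot = p_demo, height = 6, width = 9, dpi = 300)
ggsave(filename = pdf_file, plot = p_demo, height = 6, width = 9)

file.exists(c(png_file, pdf_file))
```

Creating the folder first avoids a failed save when the target directory does not exist yet.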