---
title: "Data Visualization in R"
format:
html:
toc: true
toc-depth: 2
number-sections: true
code-fold: show
code-tools: true
smooth-scroll: true
---
This page introduces common plots using base R and `ggplot2`. Data visualization is a great way to understand trends and structure in data. It also lets us spot outliers and run a quick data-quality check. Let's start with how plots work using base R commands.
# Plots in R
Let's see how plots work using base R functions. We load the `mtcars` data again for this one.
```{r}
#| echo: true
#| warning: false
library(tidyverse)
data <- mtcars
```
## Scatter Plot: MPG vs HP
Let's see how the mileage of a car changes as its horsepower increases.
```{r}
#| echo: true
#| warning: false
#| fig-cap: "MPG vs Horsepower"
plot(data$hp, data$mpg)
```
We see a negative relationship between mileage and horsepower: a car with more horsepower gets lower mileage, and vice versa.
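As a rough numerical check on the pattern in the scatter plot, the sketch below computes the Pearson correlation and a simple linear fit. It uses the built-in `mtcars` data directly; the object names `r` and `fit` are illustrative and not part of the original walkthrough.

```r
# Pearson correlation between horsepower and mileage
r <- cor(mtcars$hp, mtcars$mpg)
round(r, 2)  # about -0.78

# Simple linear regression of mpg on hp; a negative slope means
# mileage falls as horsepower rises
fit <- lm(mpg ~ hp, data = mtcars)
coef(fit)["hp"]
```

A correlation this far below zero is consistent with the downward-sloping cloud of points, though it says nothing about why the two variables move together.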
Can we create the same plot using one of the packages in the `tidyverse`? We can.
```{r}
ggplot(data, aes(x = hp, y = mpg)) +
  geom_point(color = "steelblue") +
  labs(title = "MPG vs Horsepower", x = "Horsepower", y = "Miles per Gallon") +
  theme_classic()
```
# Visualization using `ggplot()`
You can create almost any kind of plot with the `ggplot()` function. Start with a simple plot by calling `ggplot(data, aes(x = ..., y = ...))` and adding a geom layer, then add further elements such as text, fill, and facets using the `+` sign.
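To make the layering idea concrete, here is a minimal sketch on the built-in `mtcars` data. The object names `base`, `with_points`, and `faceted` are illustrative; each `+` returns a new plot object, so the pieces can be built up step by step.

```r
library(ggplot2)

# Step 1: data and aesthetic mappings only; this draws empty axes
base <- ggplot(mtcars, aes(x = hp, y = mpg))

# Step 2: add a geom layer so the data are actually drawn
with_points <- base + geom_point()

# Step 3: keep adding elements: a colour mapping, labels, and facets
faceted <- with_points +
  aes(color = factor(cyl)) +
  labs(x = "Horsepower", y = "Miles per Gallon", color = "Cylinders") +
  facet_wrap(~cyl)

faceted  # printing the object renders the plot
```

Storing intermediate objects like this is optional, but it can make it easier to reuse one base plot with several different finishing touches.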
```{r}
#| echo: false
titanic <- read.csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")
# take the titanic data
titanic_clean <- titanic %>% # and then...
# remove "bad" data
filter(Fare != 0) %>% # and then...
# create new variables
mutate(
age_sq = Age^2,
alone = ifelse(
Siblings.Spouses.Aboard > 0 | Parents.Children.Aboard > 0, 0, 1
),
alone2 = ifelse(
Siblings.Spouses.Aboard == 0 & Parents.Children.Aboard == 0, 1, 0
),
age_group = case_when(
Age < 1 ~ "Infant",
Age >= 1 & Age < 4 ~ "Toddler",
Age >= 4 & Age < 13 ~ "Child",
Age >= 13 & Age < 20 ~ "Teen",
Age >= 20 & Age < 40 ~ "Adult",
Age >= 40 & Age < 60 ~ "Middle Age Adult",
Age >= 60 ~ "Senior Adult"
),
age_group = factor(
age_group, levels = c("Infant", "Toddler", "Child", "Teen",
"Adult", "Middle Age Adult", "Senior Adult")
),
income_class = case_when(
Pclass == 1 ~ "Upper Class",
Pclass == 2 ~ "Middle Class",
Pclass == 3 ~ "Lower Class"
),
Sex = ifelse(
Sex == "female", "Female", "Male"
)
) %>% # and then...
# only keep variables of interest
select(Survived, Sex, Age, alone, age_group, income_class) %>% # and then...
# rename variables in a consistent manner (snake_case)
rename(survived = Survived,
sex = Sex,
age = Age) %>% # and then...
# order the data by age
arrange(age)
```
## Basic Plotting
First, create a basic bar chart of which passengers survived and which did not. For the purpose of these plots, a numerical/binary variable for survival is less understandable to a layman than a character variable that spells out whether passengers lived or died. A general rule of thumb for graphs and figures is that they should be interpretable by a viewer who has little to no understanding of the data. In other words, figures should be entirely self-sufficient.
```{r}
titanic_clean <- titanic_clean %>%
  mutate(
    survived_char = ifelse(survived == 1, "Survived", "Perished"),
    survived_char = factor(survived_char,
                           levels = c("Survived", "Perished"))
  )

ggplot(data = titanic_clean,
       mapping = aes(x = survived_char,
                     fill = sex)) +
  geom_bar(width = 0.5) +
  labs(x = NULL,
       y = "Count",
       title = "Survival Counts of Titanic Passengers by Sex",
       fill = NULL)
```
## Intermediate Plotting
These next several plots are meant to highlight a few capabilities of `ggplot()`. In the following example, `geom_text()` overlays the passenger count on each colored bar segment, `scale_y_continuous()` specifies where the ticks occur along the y-axis, and `theme_classic()` is a built-in theme that removes some of the plot borders and grid lines.
```{r}
ggplot(titanic_clean,
       aes(x = survived_char, fill = sex)) +
  geom_bar(width = 0.5) +
  # add text to the plot that shows the size of each bar
  geom_text(aes(label = after_stat(count)),
            stat = "count",
            color = "white",
            position = position_stack(vjust = 0.5)) +
  labs(x = NULL,
       y = "Count",
       title = "Survival Counts of Titanic Passengers by Sex",
       fill = NULL) +
  # specify axis breaks
  scale_y_continuous(breaks = seq(0, 500, 100)) +
  # use classic theme
  theme_classic()
```
In this next graph, `facet_wrap()` is used to split the graph into three panels based on the `income_class` variable, `theme_bw()` is another built-in theme, and `theme()` further modifies aspects of the graph. Within `theme()`, `element_blank()` removes things, `element_text()` modifies text, `element_rect()` modifies borders and backgrounds, and `element_line()` modifies lines.
```{r, fig.width=10, fig.height=4}
ggplot(titanic_clean,
       aes(x = survived_char, fill = sex)) +
  geom_bar(width = 0.5) +
  labs(x = NULL,
       y = "Count",
       title = "Survival Counts of Titanic Passengers by Sex and Socioeconomic Status",
       fill = NULL) +
  scale_y_continuous(breaks = seq(0, 400, 50)) +
  # split graph by socioeconomic status
  facet_wrap(~income_class) +
  # use the black and white theme
  theme_bw() +
  # eliminate most grid lines
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank())
```
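The chunk above only uses `element_blank()`. As a small sketch of the other three helpers, the example below restyles a bar chart of the built-in `mtcars` data; the object name `themed` and the specific sizes and colours are arbitrary choices for illustration.

```r
library(ggplot2)

themed <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(width = 0.5) +
  labs(x = "Cylinders", y = "Count", title = "Cars by Number of Cylinders") +
  theme_bw() +
  theme(
    # element_text(): bold, larger title
    plot.title = element_text(face = "bold", size = 14),
    # element_rect(): light grey panel background with no outline
    panel.background = element_rect(fill = "grey95", color = NA),
    # element_line(): dashed horizontal grid lines
    panel.grid.major.y = element_line(color = "grey70", linetype = "dashed"),
    # element_blank(): drop the vertical grid lines entirely
    panel.grid.major.x = element_blank()
  )

themed
```

Each argument to `theme()` names one part of the plot, and the helper on the right-hand side has to match the kind of thing being styled (text, rectangle, or line).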
You can manually specify colors within `scale_fill_manual()` or `scale_color_manual()`, either by name (e.g., `"red"`) or by hexadecimal code (e.g., `"#FF1234"`). Check out https://coolors.co/ for some great user-generated palettes (with hex codes!). The `viridis` package also provides great pre-made color palettes.
```{r}
ggplot(titanic_clean,
       aes(x = age, fill = survived_char)) +
  # set the bin width explicitly (5-year bins, an arbitrary choice)
  # instead of relying on the default of 30 bins
  geom_histogram(binwidth = 5) +
  # specify colors with hex codes
  scale_fill_manual(values = c("#E69F00", "#999999")) +
  scale_x_continuous(breaks = seq(0, 90, 10)) +
  labs(y = "Count",
       x = "Age",
       title = "Survival Counts of Titanic Passengers by Age",
       fill = NULL) +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid = element_blank(),
        # place the legend inside the plot area (requires ggplot2 >= 3.5.0)
        legend.position = "inside",
        legend.position.inside = c(0.8, 0.8))
```
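Two variations on the colour choices above, sketched on the built-in `mtcars` data. The first passes a named vector to `scale_fill_manual()` so that each colour is tied to a specific factor level rather than to level order; the second uses `scale_fill_viridis_d()`, the discrete viridis scale that ships with `ggplot2`. The colours and object names are illustrative.

```r
library(ggplot2)

# Treat the number of gears as a categorical variable
cars_df <- transform(mtcars, gear = factor(gear))

# Shared base plot without any fill scale yet
base_hist <- ggplot(cars_df, aes(x = mpg, fill = gear)) +
  geom_histogram(binwidth = 2.5) +
  labs(x = "Miles per Gallon", y = "Count", fill = "Gears")

# Named vector: colours are matched to factor levels by name
gear_colors <- c("3" = "steelblue", "4" = "#E69F00", "5" = "#999999")
by_name <- base_hist + scale_fill_manual(values = gear_colors)

# Viridis: a pre-made discrete palette designed to be colour-blind friendly
by_viridis <- base_hist + scale_fill_viridis_d()

by_name
by_viridis
```

Naming the values is a small safeguard: if the factor levels are later reordered, the colours stay attached to the same groups.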
A density plot visualizes the distribution of a variable (basically a smoothed histogram). Be wary: the area under any density curve equals 1, so these plots communicate nothing about sample sizes.
```{r}
ggplot(titanic_clean,
       aes(x = age, fill = survived_char, linetype = survived_char)) +
  # alpha changes how transparent a color is
  geom_density(alpha = 0.5, linewidth = 0.6) +
  scale_fill_manual(values = c("#E69F00", "#999999")) +
  scale_x_continuous(breaks = seq(0, 90, 10)) +
  labs(y = "Density",
       x = "Age",
       title = "Age Distributions of Titanic Passengers",
       fill = NULL,
       linetype = NULL) +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid = element_blank(),
        # place the legend inside the plot area (requires ggplot2 >= 3.5.0)
        legend.position = "inside",
        legend.position.inside = c(0.8, 0.8))
```
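The unit-area property can be checked numerically. The sketch below uses base R's `density()` on the built-in `mtcars` data and integrates the curve with the trapezoid rule. It then shows one possible way to bring sample size back into the picture, by mapping `after_stat(count)` (the density multiplied by the number of observations in each group) to the y-axis. Variable names are illustrative.

```r
library(ggplot2)

# Kernel density estimate of mpg, evaluated on a grid of x values
d <- density(mtcars$mpg)

# Trapezoid rule: sum of (grid width * average of adjacent heights)
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area  # very close to 1, whatever the sample size

# Scale each curve by its group size so that larger groups look larger
scaled <- ggplot(mtcars, aes(x = mpg, fill = factor(am))) +
  geom_density(aes(y = after_stat(count)), alpha = 0.5) +
  labs(x = "Miles per Gallon",
       y = "Density x n",
       fill = "Transmission (1 = manual)")

scaled
```

With the count scaling, the area under each curve equals the number of observations in that group, so the two curves are no longer forced to look equally large.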
For actual research, all or none of these examples may be appropriate or suitable. The "best" way to visualize data is highly dependent on the type of data and the audience you are communicating with. These two sources can be very helpful:
1. [Data to Viz](https://www.data-to-viz.com/) -- Helps you find appropriate graphs for your data
2. [The R Graph Gallery](https://r-graph-gallery.com/index.html) -- Provides R code for every graph imaginable
## Saving Plots
Plots in R can be saved as objects, which can then be exported as images. With the `ggsave()` function, you can specify the image's height, width, and resolution (dpi). You can save plots in several file formats; for now, let's save one in the `.png` format.
```{r}
p <- ggplot(titanic_clean,
            aes(x = age, fill = survived_char, linetype = survived_char)) +
  # alpha changes how transparent a color is
  geom_density(alpha = 0.5, linewidth = 0.6) +
  scale_fill_manual(values = c("#E69F00", "#999999")) +
  scale_x_continuous(breaks = seq(0, 90, 10)) +
  labs(y = "Density",
       x = "Age",
       title = "Age Distributions of Titanic Passengers",
       fill = NULL,
       linetype = NULL) +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid = element_blank(),
        # place the legend inside the plot area (requires ggplot2 >= 3.5.0)
        legend.position = "inside",
        legend.position.inside = c(0.8, 0.8))
```
```r
ggsave(plot = p,
       filename = "D:/R_projects/darecodecamp/plots/titanic_density.png",
       height = 6,
       width = 9,
       dpi = 300)
```
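Because the path above is specific to one machine, here is a portable sketch that writes to a temporary folder instead. `ggsave()` infers the file format from the file extension, so changing `.png` to `.pdf` is enough to switch formats. The plot and the file names are placeholders.

```r
library(ggplot2)

# A small plot to save (built-in data, for illustration only)
p_demo <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()

# Write into a temporary directory so the example runs anywhere
out_dir <- tempdir()
png_path <- file.path(out_dir, "demo_plot.png")
pdf_path <- file.path(out_dir, "demo_plot.pdf")

# Height and width are in inches by default; dpi matters for raster formats
ggsave(filename = png_path, plot = p_demo, height = 6, width = 9, dpi = 300)
ggsave(filename = pdf_path, plot = p_demo, height = 6, width = 9)

# Confirm that both files were written
file.exists(c(png_path, pdf_path))
```

In a real project, a path relative to the project folder is usually easier to share than an absolute drive path, provided the target directory already exists.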