---
title: "Data and Objects"
format:
  html:
    toc: true
    toc-depth: 2
    number-sections: true
    code-fold: show
    code-tools: true
    smooth-scroll: true
---
We learned about some objects and their classes in the introduction section. In R, data are stored as objects, so the two terms are used interchangeably here. Let's walk through the main types of data and their classes.
# Types of data
## Doubles
Doubles are real numbers that can carry a decimal part. R stores them as double-precision floating-point values.
```{r, echo = TRUE}
typeof(1.234)
```
The `typeof()` function can also be applied to vectors.
```{r, echo = TRUE}
typeof(seq(0,3,.5))
```
## Integers
Integers are whole numbers.
```{r, echo = TRUE}
# R will default to storing numerical data as doubles
typeof(1)
# you can override this with the as.integer() function
typeof(as.integer(1))
# or you can use the : operator to create an integer sequence
typeof(1:4)
```
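As a small supplement to the chunk above, the sketch below shows the `L` suffix, another way to request an integer, and what happens when integers and doubles are mixed. The values are arbitrary illustrations.

```{r}
# The L suffix creates an integer literal directly
typeof(1L)

# Arithmetic that mixes an integer and a double returns a double
typeof(1L + 1.5)

# 1L and 1 are equal in value but not identical in type
1L == 1
identical(1L, 1)
```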
## Characters
Character data stores text. This is equivalent to strings in Stata or Microsoft Excel. You cannot apply mathematical operations to text.
```{r, echo = TRUE}
typeof("darecodecamp")
```
```{r, echo = TRUE}
diff_data_vector <- c(1:10, seq(0,5,.5), "darecodecamp")
typeof(diff_data_vector)
```
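Because an atomic vector can hold only one data type, including `"darecodecamp"` forces R to coerce every element of `diff_data_vector` to character. If you try to apply a mathematical operation to `diff_data_vector`, you will receive an error.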
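To make the coercion concrete, here is a minimal sketch using a hypothetical vector, `mixed_vector`, that is not part of the original example. It shows the error message R returns and one possible way to recover the numeric values.

```{r}
# Hypothetical example vector: one string forces every element to character
mixed_vector <- c(1, 2.5, "text")
mixed_vector

# Arithmetic on character data fails; tryCatch() captures the error message
tryCatch(mixed_vector + 1, error = function(e) conditionMessage(e))

# as.numeric() converts what it can and returns NA for the rest
# (R also issues a warning, which is suppressed here)
converted <- suppressWarnings(as.numeric(mixed_vector))
converted
```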
## Logicals
Logicals are data that can only take two values: `TRUE` and `FALSE` (or `T` and `F`). Check out <https://www.statmethods.net/management/operators.html> for a nice list of logical operators such as greater than, less than or equal to, not equal to, etc.
```{r, echo = TRUE}
2 > 45
2 + 1 == 3
typeof(2 + 1 == 3)
```
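Comparisons also work element by element on vectors, and the results can be combined with `&` (and), `|` (or), and `!` (not). The sketch below uses a made-up vector called `ages` purely for illustration.

```{r}
# Hypothetical ages, used only to illustrate logical operators
ages <- c(15, 22, 37, 64)

# Each comparison returns one logical value per element
adults <- ages >= 18
adults

# Combine conditions with & (and), | (or), and ! (not)
young_adults <- ages >= 18 & ages < 40
young_adults
!adults

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(adults)
```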
## Dates
You will inevitably work with dates in your research. In R, dates are internally stored as doubles (the number of days since 1970-01-01), but with the object `class` `Date`. Dates can be entered or converted using R's built-in `as.Date()` function.
```{r, echo = TRUE}
date <- as.Date("12/31/99", "%m/%d/%y")
date
```
```{r, echo = TRUE}
typeof(date)
class(date)
```
Since dates are stored as doubles, you can perform *some* mathematical operations, such as adding days.
```{r, echo = TRUE}
date + 1
date + 31
```
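A few more date operations follow from the same idea. The sketch below uses two arbitrary dates (`start_date` and `end_date` are illustrative names) to show subtraction, the underlying numeric value, and reformatting for display.

```{r}
# Arbitrary example dates in the default YYYY-MM-DD format
start_date <- as.Date("2020-01-01")
end_date <- as.Date("2020-03-01")

# Subtracting two dates gives the number of days between them
days_between <- as.numeric(end_date - start_date)
days_between

# The underlying number counts days since 1970-01-01
as.numeric(as.Date("1970-01-02"))

# format() turns a date back into text in the layout you choose
format(start_date, "%m/%d/%Y")
```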
# Data Structures
Now that you have a good understanding of the way R treats different types of data, the next step is understanding how to *store* different types of data. Atomic vectors, matrices, and arrays are used for storing homogeneous data. When you save an object of one of these data structures, all the data within will be saved as one data type. Data frames and lists are structures used for storing different types of data in one object.
## Atomic Vectors
The atomic vector is the most fundamental data structure in R. Unlike a [1 x n] matrix or a one-dimensional array, an atomic vector has no dimension attribute. In other words, atomic vectors cannot be classified as row or column vectors.
```{r, echo = TRUE}
dim(diff_data_vector)
```
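For comparison, the sketch below builds a plain vector, a one-row matrix, and a one-dimensional array from the same values. The object names are illustrative only.

```{r}
# A plain atomic vector has a length but no dimensions
plain_vector <- 1:5
dim(plain_vector)
length(plain_vector)

# The same values stored as a 1 x 5 matrix do have dimensions
row_matrix <- matrix(plain_vector, nrow = 1)
dim(row_matrix)

# A one-dimensional array also carries a dim attribute
one_d_array <- array(plain_vector, dim = 5)
dim(one_d_array)
```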
## Matrices
A matrix is an extension of the atomic vector. Matrices are essentially atomic vectors with a specified number of rows and columns. Similar to atomic vectors, the elements of a matrix must be the same data type. We worked with matrices briefly in the last section.
```{r, echo=TRUE}
A <- matrix(c(10, 8,
              5, 12), ncol = 2, byrow = TRUE)
typeof(A)
dim(A)
```
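Two details of `matrix()` are easy to miss: values are filled column by column unless `byrow = TRUE`, and a single text element turns the whole matrix into character data. The sketch below illustrates both with small made-up matrices.

```{r}
# By default, matrix() fills values down the columns
by_col <- matrix(1:6, nrow = 2)
by_col[1, ]

# byrow = TRUE fills across the rows instead
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
by_row[1, ]

# Like atomic vectors, matrices hold a single data type
mixed_matrix <- matrix(c(1, "a", 2, "b"), nrow = 2)
typeof(mixed_matrix)
```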
## Arrays
Arrays are objects that can store data in more than two dimensions. Matrices only have two dimensions: rows and columns. The easiest way to conceptualize an array is to picture a data cube. Think of a Rubik's Cube, where a number is stored within each of the little colored boxes.
```{r}
multiarray <- array(c(1:27), dim = c(row_Size = 3,
                                     column_Size = 3,
                                     matrices = 3))
multiarray
```
Similar to atomic vectors and matrices, you can select certain elements (or vectors/matrices) by indexing. Recall that leaving an index slot blank returns everything from that dimension.
```{r}
# select the element in the first row and first column of the first matrix
multiarray[1,1,1]
# select the element in the first row and second column of every matrix
multiarray[1,2, ]
# select all elements of the third matrix
multiarray[ , ,3]
```
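Beyond indexing, you can summarize an array along one of its dimensions. The sketch below rebuilds the same 3 x 3 x 3 layout under a separate, illustrative name (`cube`) and uses `apply()` to total each of the three matrices.

```{r}
# Illustrative 3 x 3 x 3 array holding the numbers 1 to 27
cube <- array(1:27, dim = c(3, 3, 3))

# Each slice along the third dimension is an ordinary matrix
dim(cube[, , 1])

# apply() with MARGIN = 3 runs sum() once per matrix
slice_totals <- apply(cube, 3, sum)
slice_totals
```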
## Factors
Factors store categorical information. For example, you may need to use some data that is ordinal in nature. Take for example Likert scale data. You cannot quantify how much greater *strongly agree* is than *somewhat agree*, but the ordering has meaning. By converting something to a factor using the `factor()` function, R will store it as a vector of integers with a corresponding set of character values.
```{r, echo = TRUE}
likert_levels <- c("strongly disagree", "strongly agree", "disagree", "agree",
                   "somewhat disagree", "somewhat agree", "neutral")
typeof(likert_levels)
likert_levels <- factor(likert_levels,
                        # Specify the ordering using levels
                        levels = c("strongly agree", "agree", "somewhat agree",
                                   "neutral", "somewhat disagree", "disagree",
                                   "strongly disagree"))
typeof(likert_levels)
# use the attributes function to see the levels
attributes(likert_levels)
```
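If you also want R to compare categories by rank, one option is to set `ordered = TRUE` in `factor()`. The sketch below uses a made-up `sizes` variable rather than the Likert data above.

```{r}
# Hypothetical ordinal variable; ordered = TRUE lets R compare levels
sizes <- factor(c("medium", "small", "large", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

# The integer codes give each value's position in the levels
size_codes <- as.integer(sizes)
size_codes
levels(sizes)

# Ordered factors support comparisons such as "below large"
below_large <- sizes < "large"
below_large

# table() counts observations per level, in level order
table(sizes)
```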
## Data Frames
A data frame is composed of equal-length vectors, each with its own attributes, which makes it a rectangular, two-dimensional structure (rows and columns). In other words, a data frame looks like a matrix with column names, except that each column can hold a different data type. You can create a data frame using the `data.frame()` function, or by importing data with `read.csv()` or `read.table()`.
```{r, echo = TRUE}
my_first_df <- data.frame(numbers = 1:4,
                          letters = c("a", "b", "c", "d"),
                          logicals = c(TRUE, FALSE, FALSE, TRUE),
                          # Woah you can make sequences with dates?!
                          dates = seq(as.Date("01/01/99", "%m/%d/%y"),
                                      as.Date("01/01/02", "%m/%d/%y"),
                                      "years")
                          )
my_first_df
```
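To see how a data frame differs from a matrix, it can help to check the type of each column. The sketch below uses a small illustrative data frame, `typed_df`, and sets `stringsAsFactors = FALSE` explicitly so the result does not depend on your R version.

```{r}
# Illustrative data frame with three columns of different types
typed_df <- data.frame(id = 1:3,
                       label = c("x", "y", "z"),
                       flag = c(TRUE, FALSE, TRUE),
                       stringsAsFactors = FALSE)

# Each column keeps its own data type
column_types <- sapply(typed_df, typeof)
column_types

# Converting to a matrix forces every value into a single type
typeof(as.matrix(typed_df))
```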
Like matrices, you can select certain elements of a data frame using brackets. However, since our columns now have names, you can select columns by their name.
```{r, echo = TRUE}
# If you select only one column, R will return an atomic vector
my_first_df[,"numbers"]
# If you select multiple columns, R will return another data frame
my_first_df[,c("letters", "dates")]
```
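Rows can be selected in the same way, most often with a logical condition placed in the row slot. The sketch below uses a hypothetical `scores_df` for illustration.

```{r}
# Hypothetical data frame used to illustrate row selection
scores_df <- data.frame(id = 1:4, score = c(55, 72, 90, 64))

# Keep only the rows where the condition is TRUE
high_scores <- scores_df[scores_df$score > 60, ]
high_scores

# Combine a row condition with a column name
high_ids <- scores_df[scores_df$score > 60, "id"]
high_ids
```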
You can also reference columns using the `$` operator.
```{r, echo = TRUE}
my_first_df$logicals
my_first_df$logicals[1]
```
Here is an example of how you might create a new variable:
```{r, echo = TRUE}
# First, let's move our dates up a week
my_first_df$new_date <- my_first_df$dates + 7
# Next, let's multiply two columns together
my_first_df$new_var <- my_first_df$numbers * my_first_df$logicals
# Remember how logicals take values of 1 and 0?
my_first_df
```
## Lists
In R, lists act as storage bins. Not only can you include different data types, you can store different data structures as well. At first, they will seem useless. As you start doing more advanced research and more advanced programming, lists will be your best friend. You can create lists using the `list()` function.
```{r, echo = TRUE}
my_first_list <- list("darecodecamp", as.Date("01/01/99", "%m/%d/%y"), A,
                      multiarray, likert_levels, my_first_df)
my_first_list
```
To extract a single element from a list, use double brackets `[[]]`. Single brackets `[]`, which we used for the other structures above, return a shorter list rather than the element itself.
```{r, echo = TRUE}
my_first_list[[6]]
```
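The difference between the two bracket styles is easiest to see with a small named list. In the sketch below, `example_list` and its element names are made up for illustration.

```{r}
# Illustrative list with named elements
example_list <- list(name = "darecodecamp", values = c(10, 20, 30))

# Single brackets return a smaller list
class(example_list["values"])

# Double brackets return the element itself
class(example_list[["values"]])

# Named elements can also be reached with $ and then indexed further
example_list$values[2]
```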
# Importing Data
We can either use a dataset included in a package or import an external dataset stored in one of many formats, such as .csv, .rds, .dta, or .txt.
As an example, let's import a dataset from the web. You can import from other sources, online or local, in the same way; you just need to specify the path correctly. Note that R uses `/` instead of `\` to separate sub-directories in a path.
```{r}
library(haven) # A library to import foreign format files
dat_mus08 <- read_dta("https://cameron.econ.ucdavis.edu/bgpe2011/mus08psidextract.dta")
```
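Local files work the same way. As a sketch that runs on any machine, the chunk below writes a tiny made-up .csv file to a temporary folder and reads it back with `read.csv()`. The file name and its contents are assumptions for illustration, and `file.path()` builds the path with `/` separators.

```{r}
# Build a path inside R's temporary folder (hypothetical file name)
csv_path <- file.path(tempdir(), "example_import.csv")

# Write a tiny made-up .csv file so there is something to import
writeLines(c("id,wage", "1,10.5", "2,12.0"), csv_path)

# read.csv() returns a data frame and guesses each column's type
imported_csv <- read.csv(csv_path)
imported_csv
sapply(imported_csv, typeof)
```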
For simplicity, let's use only the data for time period 1. We can use indexing, as we learned previously, to keep the rows where the time period equals 1.
```{r}
# Keep only the rows where t == 1
dat_mus08 <- dat_mus08[dat_mus08$t == 1, ]
```
Now we can do a little analysis. Let's check how many people work in the manufacturing industry and how many don't.
```{r}
table(dat_mus08$ind)
```
We see that in year 1 fewer people work in the manufacturing industry than in other industries. Percentages would be more informative than raw counts, so let's compute them.
```{r}
table(dat_mus08$ind) / nrow(dat_mus08)
```
About 40% of the total sample work in the manufacturing industry and the remaining 60% work in other industries. We know that `ind` is a dummy variable. What happens if we take its mean? It returns the proportion of 1's, i.e., the proportion of those working in the manufacturing industry.
```{r}
mean(dat_mus08$ind)
```
Now, what's the proportion of those not working in the manufacturing industry? We simply subtract the proportion of those working in the manufacturing industry from 1.
```{r}
1 - mean(dat_mus08$ind)
```
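The same arithmetic can be checked by hand on a tiny made-up dummy variable. In the sketch below, `dummy_example` is hypothetical, and `prop.table()` is shown as an alternative to dividing the table by the number of rows.

```{r}
# Hypothetical dummy variable: 2 ones out of 5 observations
dummy_example <- c(1, 0, 0, 1, 0)

# prop.table() converts counts into proportions
share_table <- prop.table(table(dummy_example))
share_table

# The mean of a 0/1 variable is the share of ones: (1+0+0+1+0)/5
share_ones <- mean(dummy_example)
share_ones

# The share of zeros is whatever remains
1 - share_ones
```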
About 40% work in the manufacturing industry. Let's see if this varies by gender. Multiplying `ind` by `fem` flags the women who work in manufacturing, which lets us compute the percentage of manufacturing workers who are female.
```{r}
dat_mus08$female_workers <- dat_mus08$fem * dat_mus08$ind
table(dat_mus08$female_workers)
```
We find that 10 of the 595 people in the sample are women who work in the manufacturing industry in year 1. What proportion is that of the total manufacturing workforce?
```{r}
table(dat_mus08$female_workers)[2] / table(dat_mus08$ind)[2]
```
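An equivalent way to get a share within a group is to take the mean of one dummy over the rows where the other dummy equals 1. The sketch below checks this on a small made-up data frame, `toy_workers`, not on the real data.

```{r}
# Made-up data: fem = 1 for women, ind = 1 for manufacturing workers
toy_workers <- data.frame(fem = c(1, 0, 0, 1, 0, 0),
                          ind = c(1, 1, 1, 0, 0, 1))

# Product of the two dummies flags women in manufacturing
toy_workers$female_workers <- toy_workers$fem * toy_workers$ind

# Share of manufacturing workers who are women: 1 woman out of 4 workers
share_by_product <- sum(toy_workers$female_workers) / sum(toy_workers$ind)
share_by_product

# Same result: mean of fem among the rows where ind equals 1
share_by_subset <- mean(toy_workers$fem[toy_workers$ind == 1])
share_by_subset
```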
About 4.3% of those working in the manufacturing industry are female. That means about 95.7% of the manufacturing workforce is male.
# Exporting and saving data
You can export your data to different file types. If you want to export to `.csv`, you can use the `write.csv()` function. Make sure that you save your transformed data under a new file name, so you do not overwrite the raw data you imported in the first place.
```r
write.csv(dat_mus08,
"D:/R_projects/darecodecamp/data/dat_mus08_v1.csv")
```
If you only work in R, consider saving your data in `.RData` or `.RDS` format. These R-specific file types have advantages over other formats, such as compressed file size, faster import and export, and retention of attributes such as column classes and factor levels.
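As a rough illustration of attribute retention, the sketch below saves a small made-up data frame to both formats in a temporary folder and reads it back. The object and file names are assumptions for illustration. `saveRDS()` and `readRDS()` handle a single object, and `row.names = FALSE` keeps `write.csv()` from adding an extra column of row numbers.

```{r}
# Made-up data frame with a Date column and a factor column
export_example <- data.frame(day = as.Date(c("2020-01-01", "2020-01-02")),
                             group = factor(c("a", "b")))

# Save to .rds in a temporary folder and read it back
rds_path <- file.path(tempdir(), "export_example.rds")
saveRDS(export_example, rds_path)
restored <- readRDS(rds_path)

# The restored object matches the original, classes included
identical(export_example, restored)

# A .csv round trip stores plain text, so the Date class is lost
export_csv_path <- file.path(tempdir(), "export_example.csv")
write.csv(export_example, export_csv_path, row.names = FALSE)
from_csv <- read.csv(export_csv_path)
class(restored$day)
class(from_csv$day)
```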