Data and Objects

We learned about some objects and their classes in the introduction section. Because R stores all data as objects, we use the terms data and objects interchangeably here. Let's walk through the main kinds of data and their classes.

1 Types of data

1.1 Doubles

Doubles are real numbers that can carry a decimal part. R stores them as double-precision floating-point values.

Code

typeof(1.234)

[1] "double"

The typeof() function can also be applied to vectors.

Code

typeof(seq(0,3,.5))

[1] "double"

1.2 Integers

Integers are real whole numbers.

Code

# R will default to storing numerical data as doubles
typeof(1)

[1] "double"

Code

# you can override this with the as.integer() function
typeof(as.integer(1))

[1] "integer"

Code

# or you can use the : operator to create an integer sequence
typeof(1:4)

[1] "integer"
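Another way to create an integer is to add the L suffix to a numeric literal. The short sketch below uses illustrative variable names that do not appear elsewhere in this section.

```r
# The L suffix tells R to store a literal as an integer
int_value <- 1L
dbl_value <- 1

typeof(int_value)              # "integer"
typeof(dbl_value)              # "double"

# Arithmetic that mixes an integer with a double yields a double
typeof(int_value + dbl_value)  # "double"
```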

1.3 Characters

Character data store text. This is equivalent to strings in Stata or Microsoft Excel. You cannot apply mathematical operations to text.

Code

typeof("darecodecamp")

[1] "character"

Code

diff_data_vector <- c(1:10, seq(0,5,.5), "darecodecamp")
typeof(diff_data_vector)

[1] "character"

If you try to apply a mathematical operation to diff_data_vector, you will receive an error because of "darecodecamp".
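The following sketch shows one way to see that error without stopping your script, and what happens if you try to convert the vector back to numbers. The use of tryCatch() and as.numeric() here is just one possible approach.

```r
diff_data_vector <- c(1:10, seq(0, 5, .5), "darecodecamp")

# sum() fails on character data; tryCatch() captures the error message
msg <- tryCatch(sum(diff_data_vector),
                error = function(e) conditionMessage(e))
msg

# as.numeric() converts what it can; the text element becomes NA
# (R also issues a warning, which we silence here)
converted <- suppressWarnings(as.numeric(diff_data_vector))
sum(is.na(converted))         # 1
sum(converted, na.rm = TRUE)  # 82.5
```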

1.4 Logicals

Logicals can take only two values: TRUE and FALSE (abbreviated T and F). Check out https://www.statmethods.net/management/operators.html for a nice list of comparison and logical operators such as greater than, less than or equal to, not equal to, etc.

Code

2 > 45

[1] FALSE

Code

2 + 1 == 3

[1] TRUE

Code

typeof(2 + 1 == 3)

[1] "logical"
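Comparisons also work element by element on vectors. Here is a small sketch with a made-up vector called scores:

```r
scores <- c(2, 45, 7)

# Comparisons are vectorized: one TRUE/FALSE per element
above_five <- scores > 5
above_five                  # FALSE  TRUE  TRUE

# & (and) and | (or) combine logical vectors element by element
scores > 5 & scores < 10    # FALSE FALSE  TRUE

# TRUE counts as 1 and FALSE as 0 in arithmetic
sum(above_five)             # 2
```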

1.5 Dates

You will inevitably work with dates in your research. In R, dates are internally stored as doubles, but with the Date class. Dates can be entered or converted using R's built-in as.Date() function.

Code

date <- as.Date("12/31/99", "%m/%d/%y")
date

[1] "1999-12-31"

Code

typeof(date)

[1] "double"

Code

class(date)

[1] "Date"

Since dates are stored as doubles, you can perform some mathematical operations, such as adding days.

Code

date + 1

[1] "2000-01-01"

Code

date + 31

[1] "2000-01-31"
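The number behind a Date is the count of days since 1970-01-01. The sketch below, with an illustrative variable name, exposes that number and subtracts one date from another.

```r
millennium_eve <- as.Date("1999-12-31")

# The underlying double is the number of days since 1970-01-01
as.numeric(millennium_eve)   # 10956

# Subtracting two dates gives a time difference in days
gap <- as.Date("2000-01-31") - millennium_eve
as.numeric(gap)              # 31
```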

2 Data Structures

Now that you have a good understanding of the way R treats different types of data, the next step is understanding how to store different types of data. Atomic vectors, matrices, and arrays are used for storing homogeneous data. When you save an object of one of these data structures, all the data within will be saved as one data type. Data frames and lists are structures used for storing different types of data in one object.

2.1 Atomic Vectors

The atomic vector is the most fundamental data structure in R. Unlike a [1 x n] matrix or a one-dimensional array, an atomic vector has no dimension attribute. That is, atomic vectors cannot be classified as row or column vectors.

Code

dim(diff_data_vector)

NULL
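To see the difference between a vector and a matrix, you can assign a dimension to a vector yourself. This is a minimal sketch with a throwaway vector v:

```r
v <- 1:6

# A plain atomic vector has a length but no dim attribute
is.null(dim(v))   # TRUE
length(v)         # 6

# Assigning a dim attribute turns the same data into a 2 x 3 matrix
dim(v) <- c(2, 3)
is.matrix(v)      # TRUE
```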

2.2 Matrices

A matrix is an extension of the atomic vector. Matrices are essentially atomic vectors with a specified number of rows and columns. Similar to atomic vectors, the elements of a matrix must be the same data type. We worked with matrices briefly in the last section.

Code

A <- matrix(c(10, 8,
              5, 12), ncol = 2, byrow = TRUE)
typeof(A)

[1] "double"

Code

dim(A)

[1] 2 2
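Two details are worth checking for yourself: matrix() fills column by column unless byrow = TRUE, and a single character element coerces the whole matrix to character. The object names below are illustrative.

```r
by_row <- matrix(c(10, 8, 5, 12), ncol = 2, byrow = TRUE)
by_col <- matrix(c(10, 8, 5, 12), ncol = 2)   # default is byrow = FALSE

by_row[1, 2]   # 8
by_col[1, 2]   # 5

# One character element forces every entry to become character
mixed <- matrix(c(1, 2, "a", 4), ncol = 2)
typeof(mixed)  # "character"
```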

2.3 Arrays

Arrays are objects that can store data in more than two dimensions. Matrices only have two dimensions: rows and columns. The easiest way to conceptualize an array is to picture a data cube. Think of a Rubik’s Cube, where a number is stored within each of the little colored boxes.

Code

multiarray <- array(c(1:27), dim = c(row_Size = 3, 
                                     column_Size = 3, 
                                     matrices = 3))
multiarray

, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

, , 3

     [,1] [,2] [,3]
[1,]   19   22   25
[2,]   20   23   26
[3,]   21   24   27

Similar to atomic vectors and matrices, you can select certain elements (or vectors/matrices) by indexing. Recall that leaving an index slot blank returns everything from that dimension.

Code

# select the element in the first row and first column of the first matrix
multiarray[1,1,1]

[1] 1

Code

# select the element in the first row and second column of every matrix
multiarray[1,2, ]

[1]  4 13 22

Code

# select all elements of the third matrix
multiarray[ , ,3]

     [,1] [,2] [,3]
[1,]   19   22   25
[2,]   20   23   26
[3,]   21   24   27
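Beyond indexing, you can summarize an array along one of its dimensions with apply(). The sketch below rebuilds the same 3 x 3 x 3 array under the illustrative name cube and sums each of the three matrices.

```r
cube <- array(1:27, dim = c(3, 3, 3))

dim(cube)             # 3 3 3

# MARGIN = 3 applies sum() to each slice along the third dimension
apply(cube, 3, sum)   # 45 126 207
```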

2.4 Factors

Factors store categorical information. Some data are ordinal in nature, such as Likert scale responses. You cannot quantify how much greater strongly agree is than somewhat agree, but the ordering has meaning. When you convert a vector with the factor() function, R stores it as a vector of integer codes with a corresponding set of character labels, called levels.

Code

likert_levels <- c("strongly disagree", "strongly agree", "disagree", "agree",
                   "somewhat disagree", "somewhat agree", "neutral")

typeof(likert_levels)

[1] "character"

Code

likert_levels <- factor(likert_levels,
                        # Specify the ordering using levels
                        levels = c("strongly agree", "agree", "somewhat agree", 
                                   "neutral", "somewhat disagree", "disagree",
                                   "strongly disagree"))

typeof(likert_levels)

[1] "integer"

Code

# use the attributes function to see the levels
attributes(likert_levels)

$levels
[1] "strongly agree"    "agree"             "somewhat agree"   
[4] "neutral"           "somewhat disagree" "disagree"         
[7] "strongly disagree"

$class
[1] "factor"
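Note that the levels argument sets the order of the levels, but R treats the factor as ordinal only if you also pass ordered = TRUE. Here is a small sketch with made-up responses and a shortened three-point scale:

```r
responses <- c("agree", "disagree", "neutral", "agree")

resp <- factor(responses,
               levels = c("disagree", "neutral", "agree"),
               ordered = TRUE)

is.ordered(resp)    # TRUE

# The integer codes follow the order given in levels
as.integer(resp)    # 3 1 2 3

# Ordered factors support comparisons against a level
resp > "neutral"    # TRUE FALSE FALSE TRUE
```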

2.5 Data Frames

A data frame is composed of equal-length vectors, each with its own name and data type, arranged as a rectangular two-dimensional structure (rows and columns). In other words, a data frame resembles a matrix with named columns, except that each column can hold a different type of data. You can create a data frame using the data.frame() function, or by importing data with read.csv() or read.table().

Code

my_first_df <- data.frame(numbers = 1:4, 
                          letters = c("a", "b", "c", "d"), 
                          logicals = c(TRUE, FALSE, FALSE, TRUE),
                          # Woah you can make sequences with dates?!
                          dates = seq(as.Date("01/01/99", "%m/%d/%y"), 
                                      as.Date("01/01/02", "%m/%d/%y"), 
                                      "years")
                          )


my_first_df

  numbers letters logicals      dates
1       1       a     TRUE 1999-01-01
2       2       b    FALSE 2000-01-01
3       3       c    FALSE 2001-01-01
4       4       d     TRUE 2002-01-01
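You can confirm that each column keeps its own type with sapply(). The sketch below uses a smaller, illustrative data frame called toy_df. Setting stringsAsFactors = FALSE is an assumption made here so that text columns stay character in older R versions as well.

```r
toy_df <- data.frame(numbers  = 1:4,
                     letters  = c("a", "b", "c", "d"),
                     logicals = c(TRUE, FALSE, FALSE, TRUE),
                     stringsAsFactors = FALSE)

# Each column keeps its own type
sapply(toy_df, typeof)

# Under the hood, a data frame is a list of equal-length columns
is.list(toy_df)   # TRUE
dim(toy_df)       # 4 3
```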

Like matrices, you can select certain elements of a data frame using brackets. However, since our columns now have names, you can select columns by their name.

Code

# If you select only one column, R will return an atomic vector
my_first_df[,"numbers"]

[1] 1 2 3 4

Code

# If you select multiple columns, R will return another data frame
my_first_df[,c("letters", "dates")]

  letters      dates
1       a 1999-01-01
2       b 2000-01-01
3       c 2001-01-01
4       d 2002-01-01

You can also reference columns using the $ operator.

Code

my_first_df$logicals

[1]  TRUE FALSE FALSE  TRUE

Code

my_first_df$logicals[1]

[1] TRUE

Here is an example of how you might create a new variable:

Code

# First, let's move our dates up a week
my_first_df$new_date <- my_first_df$dates + 7

# Next, let's multiply two columns together
my_first_df$new_var <- my_first_df$numbers * my_first_df$logicals

# Remember how logicals take values of 1 and 0?
my_first_df

  numbers letters logicals      dates   new_date new_var
1       1       a     TRUE 1999-01-01 1999-01-08       1
2       2       b    FALSE 2000-01-01 2000-01-08       0
3       3       c    FALSE 2001-01-01 2001-01-08       0
4       4       d     TRUE 2002-01-01 2002-01-08       4
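The same bracket notation can filter rows when you put a logical condition in the row slot. This is a minimal sketch using an illustrative data frame called small_df:

```r
small_df <- data.frame(numbers  = 1:4,
                       logicals = c(TRUE, FALSE, FALSE, TRUE))

# Keep only the rows where numbers is greater than 2
big_rows <- small_df[small_df$numbers > 2, ]
nrow(big_rows)                          # 2

# A logical column can serve as the row filter directly
small_df[small_df$logicals, "numbers"]  # 1 4
```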

2.6 Lists

In R, lists act as storage bins. Not only can you include different data types, you can store different data structures as well. At first, they will seem useless. As you start doing more advanced research and more advanced programming, lists will be your best friend. You can create lists using the list() function.

Code

my_first_list <- list("darecodecamp",  as.Date("01/01/99", "%m/%d/%y"), A, 
                      multiarray, likert_levels, my_first_df)

my_first_list

[[1]]
[1] "darecodecamp"

[[2]]
[1] "1999-01-01"

[[3]]
     [,1] [,2]
[1,]   10    8
[2,]    5   12

[[4]]
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

, , 3

     [,1] [,2] [,3]
[1,]   19   22   25
[2,]   20   23   26
[3,]   21   24   27


[[5]]
[1] strongly disagree strongly agree    disagree          agree            
[5] somewhat disagree somewhat agree    neutral          
7 Levels: strongly agree agree somewhat agree neutral ... strongly disagree

[[6]]
  numbers letters logicals      dates   new_date new_var
1       1       a     TRUE 1999-01-01 1999-01-08       1
2       2       b    FALSE 2000-01-01 2000-01-08       0
3       3       c    FALSE 2001-01-01 2001-01-08       0
4       4       d     TRUE 2002-01-01 2002-01-08       4

To extract an element from a list, use double brackets [[]]. Single brackets, which we used for the other structures, return a smaller list rather than the element itself.

Code

my_first_list[[6]]

  numbers letters logicals      dates   new_date new_var
1       1       a     TRUE 1999-01-01 1999-01-08       1
2       2       b    FALSE 2000-01-01 2000-01-08       0
3       3       c    FALSE 2001-01-01 2001-01-08       0
4       4       d     TRUE 2002-01-01 2002-01-08       4
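If you name the elements of a list, you can also extract them by name or with the $ operator. The sketch below uses a made-up list called bin to compare the bracket styles.

```r
bin <- list(name = "darecodecamp", values = 1:3)

# Single brackets return a list containing the element
class(bin["values"])     # "list"

# Double brackets return the element itself
class(bin[["values"]])   # "integer"

# Named elements can also be reached with $
bin$name                 # "darecodecamp"
```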

3 Importing Data

We can either use datasets included in R packages or import external datasets stored in formats such as .csv, .rds, .dta, and .txt.

As an example, let's try importing a dataset from the web. You can import from other sources (web or local) in the same way. You just need to specify the path correctly. Note that R paths use / instead of \ to separate directories (a backslash must be escaped as \\).

Code

library(haven) # A library to import foreign format files
dat_mus08 <- read_dta("https://cameron.econ.ucdavis.edu/bgpe2011/mus08psidextract.dta")
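Importing a local file works the same way. The sketch below first writes a tiny made-up .csv to R's temporary directory so that it can run anywhere, then reads it back with read.csv(). The file name and columns are hypothetical.

```r
# Build a path with file.path() so the separator is handled for you
csv_path <- file.path(tempdir(), "toy_data.csv")

# Write a small made-up dataset so there is something to import
write.csv(data.frame(id = 1:3, wage = c(10.5, 12, 9.75)),
          csv_path, row.names = FALSE)

# Import the local file
toy_data <- read.csv(csv_path)
str(toy_data)
```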

For simplicity, let's use the data for time period 1 only. We can use indexing, as we learned previously, to keep the rows where the time period equals 1.

Code

# Let's analyse the data for t == 1
dat_mus08 <- dat_mus08[dat_mus08$t == 1, ]

Now we can do a little analysis. Let's check how many people work in the manufacturing industry and how many don't.

Code

table(dat_mus08$ind)


  0   1 
362 233

We see that fewer people work in the manufacturing industry than in other industries in period 1. It would be more informative to know the shares. Let's compute them.

Code

table(dat_mus08$ind) / nrow(dat_mus08)


        0         1 
0.6084034 0.3915966

About 40% of the total sample work in the manufacturing industry and the remaining 60% work in other industries. We know that ind is a dummy variable. What happens if we take its mean? It returns the proportion of 1's, i.e., the proportion of those working in the manufacturing industry.

Code

mean(dat_mus08$ind)

[1] 0.3915966

Now, what's the proportion of those not working in the manufacturing industry? We simply subtract the proportion of those working in the manufacturing industry from 1.

Code

1 - mean(dat_mus08$ind)

[1] 0.6084034
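Base R's prop.table() gives the same shares directly from a table of counts. The sketch below uses a made-up dummy vector, toy_ind, rather than the dataset above.

```r
toy_ind <- c(0, 1, 0, 0, 1)

# Convert counts to shares
shares <- prop.table(table(toy_ind))
shares

# The mean of a 0/1 dummy equals the share of 1's
mean(toy_ind)       # 0.4
1 - mean(toy_ind)   # 0.6
```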

About 40% work in the manufacturing industry. Let's see whether this varies by gender. Multiplying ind by fem creates an indicator that equals 1 only for women who work in the manufacturing industry, which lets us count them.

Code

dat_mus08$female_workers <- dat_mus08$fem * dat_mus08$ind
table(dat_mus08$female_workers)


  0   1 
585  10

We find that 10 of the 595 people in the sample are women who work in the manufacturing industry in period 1. What proportion is that of the total manufacturing workforce?

Code

table(dat_mus08$female_workers)[2] / table(dat_mus08$ind)[2]

         1 
0.04291845

About 4.3% of those working in the manufacturing industry are female. That means about 95.7% of the workforce in the manufacturing industry is male.
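An equivalent way to get a share within a subgroup is to take the mean of one dummy over the rows where the other equals 1. The sketch below uses two made-up vectors, toy_ind and toy_fem, not the dataset above.

```r
toy_ind <- c(1, 1, 1, 1, 0, 0)   # 1 = works in manufacturing
toy_fem <- c(1, 0, 0, 0, 1, 0)   # 1 = female

# Product approach: women in manufacturing over all in manufacturing
share_product <- sum(toy_fem * toy_ind) / sum(toy_ind)

# Subsetting approach: mean of toy_fem among manufacturing workers
share_subset <- mean(toy_fem[toy_ind == 1])

share_product   # 0.25
share_subset    # 0.25
```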

4 Exporting and saving data

You can export your data to different file types. If you want to export to .csv, you can use the write.csv() function. Make sure that you export your transformed data, not the raw data you imported in the first place.

write.csv(dat_mus08,
          "D:/R_projects/darecodecamp/data/dat_mus08_v1.csv")

If you only work in R, consider saving your data in .RData or .RDS format. These R-specific file types have advantages over other formats, such as compressed file size, faster import and export, and retention of object attributes such as classes and factor levels.
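Here is a minimal sketch of a round trip through .rds using saveRDS() and readRDS(), compared with .csv. The data frame and file names are made up, and the files are written to R's temporary directory.

```r
toy <- data.frame(
  when   = as.Date(c("1999-01-01", "2000-01-01")),
  answer = factor(c("agree", "neutral"), levels = c("neutral", "agree"))
)

# Save to .rds and read it back
rds_path <- file.path(tempdir(), "toy.rds")
saveRDS(toy, rds_path)
restored <- readRDS(rds_path)

# The Date class and the factor levels survive the round trip
identical(toy, restored)   # TRUE

# A .csv round trip does not keep the Date class
csv_path <- file.path(tempdir(), "toy.csv")
write.csv(toy, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)
class(from_csv$when)       # no longer "Date"
```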