The purpose of this module is to cover some more advanced data manipulation techniques (combining datasets and reshaping data), to show how to run basic regressions, and to provide an overview of other R capabilities that you may find useful in the future.

R has several built-in datasets for educational purposes, which we will use for the following exercises.

discoveries – Yearly Numbers of Important Discoveries (inventions/scientific), 1860-1959
LakeHuron – Water Level of Lake Huron, 1875-1972
lynx – Annual Canadian Lynx Trappings, 1821-1934

Code
discoveries
Time Series:
Start = 1860 
End = 1959 
Frequency = 1 
  [1]  5  3  0  2  0  3  2  3  6  1  2  1  2  1  3  3  3  5  2  4  4  0  2  3  7
 [26] 12  3 10  9  2  3  7  7  2  3  3  6  2  4  3  5  2  2  4  0  4  2  5  2  3
 [51]  3  6  5  8  3  6  6  0  5  2  2  2  6  3  4  4  2  2  4  7  5  3  3  0  2
 [76]  2  2  1  3  4  2  2  1  1  1  2  1  4  4  3  2  1  4  1  1  1  0  0  2  0
Code
LakeHuron
Time Series:
Start = 1875 
End = 1972 
Frequency = 1 
 [1] 580.38 581.86 580.97 580.80 579.79 580.39 580.42 580.82 581.40 581.32
[11] 581.44 581.68 581.17 580.53 580.01 579.91 579.14 579.16 579.55 579.67
[21] 578.44 578.24 579.10 579.09 579.35 578.82 579.32 579.01 579.00 579.80
[31] 579.83 579.72 579.89 580.01 579.37 578.69 578.19 578.67 579.55 578.92
[41] 578.09 579.37 580.13 580.14 579.51 579.24 578.66 578.86 578.05 577.79
[51] 576.75 576.75 577.82 578.64 580.58 579.48 577.38 576.90 576.94 576.24
[61] 576.84 576.85 576.90 577.79 578.18 577.51 577.23 578.42 579.61 579.05
[71] 579.26 579.22 579.38 579.10 577.95 578.12 579.75 580.85 580.41 579.96
[81] 579.61 578.76 578.18 577.21 577.13 579.10 578.25 577.91 576.89 575.96
[91] 576.80 577.68 578.38 578.52 579.74 579.31 579.89 579.96
Code
lynx
Time Series:
Start = 1821 
End = 1934 
Frequency = 1 
  [1]  269  321  585  871 1475 2821 3928 5943 4950 2577  523   98  184  279  409
 [16] 2285 2685 3409 1824  409  151   45   68  213  546 1033 2129 2536  957  361
 [31]  377  225  360  731 1638 2725 2871 2119  684  299  236  245  552 1623 3311
 [46] 6721 4254  687  255  473  358  784 1594 1676 2251 1426  756  299  201  229
 [61]  469  736 2042 2811 4431 2511  389   73   39   49   59  188  377 1292 4031
 [76] 3495  587  105  153  387  758 1307 3465 6991 6313 3794 1836  345  382  808
 [91] 1388 2713 3800 3091 2985 3790  674   81   80  108  229  399 1132 2432 3574
[106] 2935 1537  529  485  662 1000 1590 2657 3396

Since these data are stored as time series objects rather than data frames, we first need to transform them into data frames.

Code
discovery <- data.frame(
  year = 1860:1959,
  num_disc = discoveries
  )

head(discovery)
  year num_disc
1 1860        5
2 1861        3
3 1862        0
4 1863        2
5 1864        0
6 1865        3
Code
huron_levels <- data.frame(
   year = 1875:1972,
   huron_level = LakeHuron
   )

head(huron_levels)
  year huron_level
1 1875      580.38
2 1876      581.86
3 1877      580.97
4 1878      580.80
5 1879      579.79
6 1880      580.39
Code
lynx_trappings <- data.frame(
  year = 1821:1934,
  num_trap = lynx
  )

head(lynx_trappings)
  year num_trap
1 1821      269
2 1822      321
3 1823      585
4 1824      871
5 1825     1475
6 1826     2821

1 Combining (Joining) Datasets

More often than not, you will want to combine datasets from several different sources. For example, you may want to combine population data by county from the US Census with county GDP data from the Bureau of Economic Analysis to create a final dataset with columns: county, population, GDP. To do this, the dplyr package has a series of functions called joins (based on the SQL language). Join commands are used to combine rows from two or more datasets, based on a related column between them. This resource provides nice visual representations for how joins work.

1.1 Inner Join

The most commonly used join is an inner_join(), which only keeps rows that match between two datasets (based on your matching variable). For the datasets we created above, each has the year variable that can be used to match observations. Since each of the datasets spans a different sequence of years, some observations will be dropped.

Code
# The join functions come from the dplyr package
library(dplyr)

# Create a new dataset with discoveries and levels of Lake Huron by year
disc_huron <- inner_join(discovery, huron_levels, by = "year")

head(disc_huron, 10)
   year num_disc huron_level
1  1875        3      580.38
2  1876        3      581.86
3  1877        5      580.97
4  1878        2      580.80
5  1879        4      579.79
6  1880        4      580.39
7  1881        0      580.42
8  1882        2      580.82
9  1883        3      581.40
10 1884        7      581.32

If you want to join more than two datasets, you can tack on extra join statements with pipes (%>%).

Code
# Join all three datasets
disc_huron_lynx <- inner_join(discovery, huron_levels, by = "year") %>% 
  inner_join(lynx_trappings, by = "year")

head(disc_huron_lynx, 10)
   year num_disc huron_level num_trap
1  1875        3      580.38     2251
2  1876        3      581.86     1426
3  1877        5      580.97      756
4  1878        2      580.80      299
5  1879        4      579.79      201
6  1880        4      580.39      229
7  1881        0      580.42      469
8  1882        2      580.82      736
9  1883        3      581.40     2042
10 1884        7      581.32     2811

You can also join datasets using multiple matching variables. An example may look like this:

Code
county_data <- inner_join(county_gdp, county_pop, by = c("year", "county"))
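
Since county_gdp and county_pop do not exist in your R session, here is a self-contained sketch of the same idea. The data frames and values below are made up purely for illustration:

Code
# Hypothetical toy data: GDP and population by county-year
county_gdp <- data.frame(
  year   = c(2019, 2019, 2020, 2020),
  county = c("Yolo", "Kern", "Yolo", "Kern"),
  gdp    = c(11.2, 42.3, 11.5, 43.1)
)
county_pop <- data.frame(
  year   = c(2019, 2019, 2020, 2020),
  county = c("Yolo", "Kern", "Yolo", "Kern"),
  pop    = c(220000, 900000, 221000, 905000)
)

# Rows are matched only when BOTH year and county agree
county_data <- inner_join(county_gdp, county_pop, by = c("year", "county"))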

1.2 Left Join

A left_join() will return all rows from the left (first listed) dataset and the matched rows from the right dataset. For rows in the left dataset with no match, R automatically fills the right dataset's columns with NA.

Code
disc_huron_left <- left_join(discovery, huron_levels, by = "year")

head(disc_huron_left, 20)
   year num_disc huron_level
1  1860        5          NA
2  1861        3          NA
3  1862        0          NA
4  1863        2          NA
5  1864        0          NA
6  1865        3          NA
7  1866        2          NA
8  1867        3          NA
9  1868        6          NA
10 1869        1          NA
11 1870        2          NA
12 1871        1          NA
13 1872        2          NA
14 1873        1          NA
15 1874        3          NA
16 1875        3      580.38
17 1876        3      581.86
18 1877        5      580.97
19 1878        2      580.80
20 1879        4      579.79

1.3 Full Join

A full_join() is similar to left_join(), except that all rows from both datasets are preserved. Again, the missing data is filled with NA.

Code
disc_huron_full <- full_join(discovery, huron_levels, by = "year")

head(disc_huron_full, 10)
   year num_disc huron_level
1  1860        5          NA
2  1861        3          NA
3  1862        0          NA
4  1863        2          NA
5  1864        0          NA
6  1865        3          NA
7  1866        2          NA
8  1867        3          NA
9  1868        6          NA
10 1869        1          NA
Code
tail(disc_huron_full, 10)
    year num_disc huron_level
104 1963       NA      576.89
105 1964       NA      575.96
106 1965       NA      576.80
107 1966       NA      577.68
108 1967       NA      578.38
109 1968       NA      578.52
110 1969       NA      579.74
111 1970       NA      579.31
112 1971       NA      579.89
113 1972       NA      579.96

1.4 Anti Join

Sometimes you will drop observations with inner_join() and not know why. To figure out why certain observations are being dropped, you can use anti_join(), which returns only the rows from the left dataset that have no match in the right dataset. This is a good way to diagnose problems such as the matching variables being stored as different types (e.g., numeric vs. character) in the two datasets, or a typo somewhere.

Code
anti_join(huron_levels, discovery, by = "year")
   year huron_level
1  1960      579.10
2  1961      578.25
3  1962      577.91
4  1963      576.89
5  1964      575.96
6  1965      576.80
7  1966      577.68
8  1967      578.38
9  1968      578.52
10 1969      579.74
11 1970      579.31
12 1971      579.89
13 1972      579.96
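
A common culprit is a key stored as character in one dataset and numeric in the other. Below is a minimal sketch of the fix, using made-up toy data:

Code
# Toy example: the same key stored as different types
a <- data.frame(year = c(2019, 2020), x = 1:2)      # year is numeric
b <- data.frame(year = c("2019", "2020"), y = 3:4)  # year is character

# Joining these directly will fail (recent versions of dplyr raise an
# "incompatible types" error), so convert the key to a common type first
b <- b %>% mutate(year = as.numeric(year))
inner_join(a, b, by = "year")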

2 Reshaping Datasets

Datasets are typically available in two different formats: wide and long. In a wide format dataset, values do not repeat in the subject/identifier column. In a long format dataset, values do repeat in the subject/identifier column.

Depending on your end goal, one format may be preferable to the other. In general, the long format is better for storing data, running statistical models, and graphing in ggplot. Wide format data is easier to read and is generally better suited for spreadsheets or tables in reports.
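
As a quick illustration, here is the same information laid out both ways, using two states from the us_rent_income data shown below:

Code
# Wide: one row per state, one column per measure
wide <- data.frame(
  state  = c("Iowa", "Maine"),
  income = c(30002, 26841),
  rent   = c(740, 808)
)

# Long: the state identifier repeats, one row per state-measure pair
long <- data.frame(
  state    = rep(c("Iowa", "Maine"), each = 2),
  variable = rep(c("income", "rent"), times = 2),
  estimate = c(30002, 740, 26841, 808)
)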

2.1 Pivot wider

The us_rent_income dataset (from the tidyr package) has median yearly income and median monthly rent by state. It is stored in R in long format: the variable column indicates whether each estimate is income or rent, and moe is the corresponding 90% margin of error.

Code
head(us_rent_income, 20)
# A tibble: 20 × 5
   GEOID NAME                 variable estimate   moe
   <chr> <chr>                <chr>       <dbl> <dbl>
 1 01    Alabama              income      24476   136
 2 01    Alabama              rent          747     3
 3 02    Alaska               income      32940   508
 4 02    Alaska               rent         1200    13
 5 04    Arizona              income      27517   148
 6 04    Arizona              rent          972     4
 7 05    Arkansas             income      23789   165
 8 05    Arkansas             rent          709     5
 9 06    California           income      29454   109
10 06    California           rent         1358     3
11 08    Colorado             income      32401   109
12 08    Colorado             rent         1125     5
13 09    Connecticut          income      35326   195
14 09    Connecticut          rent         1123     5
15 10    Delaware             income      31560   247
16 10    Delaware             rent         1076    10
17 11    District of Columbia income      43198   681
18 11    District of Columbia rent         1424    17
19 12    Florida              income      25952    70
20 12    Florida              rent         1077     3

The pivot_wider() function transforms datasets from long format to wide format.

Code
# pivot_wider() and pivot_longer() come from the tidyr package
library(tidyr)

us_rent_income_wide <- us_rent_income %>%
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe)
  )

head(us_rent_income_wide, 20)
# A tibble: 20 × 6
   GEOID NAME                 estimate_income estimate_rent moe_income moe_rent
   <chr> <chr>                          <dbl>         <dbl>      <dbl>    <dbl>
 1 01    Alabama                        24476           747        136        3
 2 02    Alaska                         32940          1200        508       13
 3 04    Arizona                        27517           972        148        4
 4 05    Arkansas                       23789           709        165        5
 5 06    California                     29454          1358        109        3
 6 08    Colorado                       32401          1125        109        5
 7 09    Connecticut                    35326          1123        195        5
 8 10    Delaware                       31560          1076        247       10
 9 11    District of Columbia           43198          1424        681       17
10 12    Florida                        25952          1077         70        3
11 13    Georgia                        27024           927        106        3
12 15    Hawaii                         32453          1507        218       18
13 16    Idaho                          25298           792        208        7
14 17    Illinois                       30684           952         83        3
15 18    Indiana                        27247           782        117        3
16 19    Iowa                           30002           740        143        4
17 20    Kansas                         29126           801        208        5
18 21    Kentucky                       24702           713        159        4
19 22    Louisiana                      25086           825        155        4
20 23    Maine                          26841           808        187        7

2.2 Pivot longer

The relig_income dataset (also from tidyr) is in wide format: it contains the number of individuals in each income bracket, by religion.

Code
head(relig_income, 20)
# A tibble: 18 × 11
   religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
   <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
 1 Agnostic      27        34        60        81        76       137        122
 2 Atheist       12        27        37        52        35        70         73
 3 Buddhist      27        21        30        34        33        58         62
 4 Catholic     418       617       732       670       638      1116        949
 5 Don’t k…      15        14        15        11        10        35         21
 6 Evangel…     575       869      1064       982       881      1486        949
 7 Hindu          1         9         7         9        11        34         47
 8 Histori…     228       244       236       238       197       223        131
 9 Jehovah…      20        27        24        24        21        30         15
10 Jewish        19        19        25        25        30        95         69
11 Mainlin…     289       495       619       655       651      1107        939
12 Mormon        29        40        48        51        56       112         85
13 Muslim         6         7         9        10         9        23         16
14 Orthodox      13        17        23        32        32        47         38
15 Other C…       9         7        11        13        13        14         18
16 Other F…      20        33        40        46        49        63         46
17 Other W…       5         2         3         4         2         7          3
18 Unaffil…     217       299       374       365       341       528        407
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#   `Don't know/refused` <dbl>

The pivot_longer() function transforms datasets from wide format to long format.

Code
relig_income_long <- relig_income %>%
  pivot_longer(
    cols = -religion,
    names_to = "income",
    values_to = "count"
    )

head(relig_income_long, 20)
# A tibble: 20 × 3
   religion income             count
   <chr>    <chr>              <dbl>
 1 Agnostic <$10k                 27
 2 Agnostic $10-20k               34
 3 Agnostic $20-30k               60
 4 Agnostic $30-40k               81
 5 Agnostic $40-50k               76
 6 Agnostic $50-75k              137
 7 Agnostic $75-100k             122
 8 Agnostic $100-150k            109
 9 Agnostic >150k                 84
10 Agnostic Don't know/refused    96
11 Atheist  <$10k                 12
12 Atheist  $10-20k               27
13 Atheist  $20-30k               37
14 Atheist  $30-40k               52
15 Atheist  $40-50k               35
16 Atheist  $50-75k               70
17 Atheist  $75-100k              73
18 Atheist  $100-150k             59
19 Atheist  >150k                 74
20 Atheist  Don't know/refused    76

Sometimes pivot_wider() and pivot_longer() are tricky to work with. Be sure to read the documentation for each function.
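
For instance, undoing the pivot_wider() from Section 2.1 requires the special ".value" sentinel, which tells pivot_longer() to split column names like estimate_income into a value column (estimate) and a category (income). A sketch:

Code
# Reverse the earlier pivot_wider(): split "estimate_income", "moe_rent",
# etc. on the underscore; ".value" keeps the first piece as a value column
us_rent_income_wide %>%
  pivot_longer(
    cols = -c(GEOID, NAME),
    names_to = c(".value", "variable"),
    names_sep = "_"
  )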

3 Linear Regression

The lm() function is used to fit linear regression models. Fitted models are saved as list objects (of class "lm") that contain all the relevant information about the model's performance and inner workings.

Below is a simple example of how you would run a regression where you believe that the number of world scientific discoveries (dependent variable) is a function of the water levels of Lake Huron and the number of Canada lynx trappings (independent variables) in the same year.

Code
discovery_regression <- lm(
  # number of discoveries is a function of Lake Huron levels and lynx trappings
  formula = num_disc ~ huron_level + num_trap,
  # specify the dataset we created earlier
  data = disc_huron_lynx
)

# print regression results
summary(discovery_regression)

Call:
lm(formula = num_disc ~ huron_level + num_trap, data = disc_huron_lynx)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7715 -1.6706 -0.5306  1.2394  6.7945 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -2.687e+02  1.403e+02  -1.915   0.0606 .
huron_level  4.701e-01  2.421e-01   1.941   0.0572 .
num_trap     1.316e-04  1.923e-04   0.684   0.4965  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.361 on 57 degrees of freedom
Multiple R-squared:  0.06419,   Adjusted R-squared:  0.03136 
F-statistic: 1.955 on 2 and 57 DF,  p-value: 0.1509
Code
# save regression results as a data frame
regression_results <- data.frame(summary(discovery_regression)$coef)
regression_results <- round(regression_results, digits = 5)
names(regression_results) <- c("coef", "stderr", "tval", "pval")
regression_results
                  coef    stderr     tval    pval
(Intercept) -268.68672 140.33032 -1.91467 0.06056
huron_level    0.47006   0.24213  1.94130 0.05717
num_trap       0.00013   0.00019  0.68442 0.49649
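
Because the fitted model is a list, base R also provides accessor functions for pulling out individual pieces. All of the following are standard functions from the stats package:

Code
coef(discovery_regression)             # coefficient estimates
confint(discovery_regression)          # confidence intervals
head(fitted(discovery_regression))     # fitted values
head(residuals(discovery_regression))  # residuals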

Here is another example using the Prestige dataset from the car package, which records the prestige of Canadian occupations along with education, income, and occupation type.

Code
library(car)
help("Prestige")
# View(Prestige)
reg1 <- lm(prestige ~ education, data = Prestige)
summary(reg1)

Call:
lm(formula = prestige ~ education, data = Prestige)

Residuals:
     Min       1Q   Median       3Q      Max 
-26.0397  -6.5228   0.6611   6.7430  18.1636 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -10.732      3.677  -2.919  0.00434 ** 
education      5.361      0.332  16.148  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.103 on 100 degrees of freedom
Multiple R-squared:  0.7228,    Adjusted R-squared:   0.72 
F-statistic: 260.8 on 1 and 100 DF,  p-value: < 2.2e-16
Code
reg2 <- lm(prestige ~ education + log(income) + type, data = Prestige)

summary(reg2)

Call:
lm(formula = prestige ~ education + log(income) + type, data = Prestige)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.511  -3.746   1.011   4.356  18.438 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -81.2019    13.7431  -5.909 5.63e-08 ***
education     3.2845     0.6081   5.401 5.06e-07 ***
log(income)  10.4875     1.7167   6.109 2.31e-08 ***
typeprof      6.7509     3.6185   1.866   0.0652 .  
typewc       -1.4394     2.3780  -0.605   0.5465    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.637 on 93 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.8555,    Adjusted R-squared:  0.8493 
F-statistic: 137.6 on 4 and 93 DF,  p-value: < 2.2e-16
Code
regression_results <- data.frame(summary(reg2)$coef)
regression_results <- round(regression_results, digits = 3)

names(regression_results) <- c("coef", "stderr", "tval", "pval")

regression_results
               coef stderr   tval  pval
(Intercept) -81.202 13.743 -5.909 0.000
education     3.284  0.608  5.401 0.000
log(income)  10.487  1.717  6.109 0.000
typeprof      6.751  3.618  1.866 0.065
typewc       -1.439  2.378 -0.605 0.546
Code
library(stargazer)

# Assuming reg1 and reg2 are your regression models
stargazer(reg1, reg2, 
          type = "text",
          title = "Regression Results", 
          align = TRUE, 
          dep.var.labels = c("Prestige"), 
          covariate.labels = c("Education", "Log(Income)", "Type (Professional)", "Type (White Collar)"),
          omit.stat = c("f", "ser"))

Regression Results
================================================
                        Dependent variable:     
                    ----------------------------
                              Prestige          
                         (1)            (2)     
------------------------------------------------
Education              5.361***      3.284***   
                       (0.332)        (0.608)   
                                                
Log(Income)                          10.487***  
                                      (1.717)   
                                                
Type (Professional)                   6.751*    
                                      (3.618)   
                                                
Type (White Collar)                   -1.439    
                                      (2.378)   
                                                
Constant              -10.732***    -81.202***  
                       (3.677)       (13.743)   
                                                
------------------------------------------------
Observations             102            98      
R2                      0.723          0.855    
Adjusted R2             0.720          0.849    
================================================
Note:                *p<0.1; **p<0.05; ***p<0.01

4 R Capabilities Beyond These Modules

Overall, these modules were intended to provide a surface-level introduction to R and the functions that you will use most frequently as a graduate student in economics (if you choose to use R). Below are some examples of R's other capabilities that may or may not be useful for you in the future.

4.1 Statistical Modeling

R is first and foremost a programming language for statistical computing. R has many built-in regression models; however, some very smart people have created some very useful packages for implementing complex statistical modeling techniques. I have used all of the following packages at some point in my research:

The fixest package: Fast and user-friendly estimation of econometric models with multiple fixed effects. Includes ordinary least squares (OLS), generalized linear models (GLM), and the negative binomial (see the short sketch after this list).

The ranger package: A fast implementation of Random Forests, particularly suited for high dimensional data. Ensembles of classification, regression, survival and probability prediction trees are supported.

The micEconAids package: Functions and tools for analyzing consumer demand with the Almost Ideal Demand System (AIDS) suggested by Deaton and Muellbauer (1980).

The cluster package: Methods for cluster analysis, much extended from the original code by Peter Rousseeuw, Anja Struyf, and Mia Hubert, based on Kaufman and Rousseeuw (1990), Finding Groups in Data.

The MatchIt package: Selects matched samples of the original treated and control groups with similar covariate distributions – can be used to match exactly on covariates, to match on propensity scores, or perform a variety of other matching procedures.
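
As a taste of the syntax, here is a minimal fixest sketch. It regresses sepal length on petal length with species fixed effects, using the built-in iris data only because it ships with R, not because it is an interesting model:

library(fixest)

# feols() is fixest's OLS workhorse; fixed effects go after the "|"
fe_model <- feols(Sepal.Length ~ Petal.Length | Species, data = iris)
summary(fe_model)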

4.2 GIS

R is also capable of handling spatial analysis and visualization. This is most important to agricultural, environmental, and applied economists, who often need to analyze spatial data in their research. This book is a great resource for learning spatial analysis in R. Below is an example of a GIS visualization that I made during my master's.

library(mapsf)
library(sf)
library(tmap)
library(ggspatial)
library(dplyr)   # for the grouping and summarising below
library(ggplot2) # for plotting

# Loading demographic health survey pooled data set (Nepal) for the years 2011, 2016 and 2022
mdat <- readRDS("../data/clean/merged.RDS")

# Calculating proportion of married women facing domestic violence by district and year
prevalence <- mdat %>% 
  group_by(DISTRICT, time) %>% 
  summarise(total_violence = round(mean(total_violence) * 100, 2))
 
# Importing shape file with district boundaries    
sdat <- st_read("../data/shp_districts/districts.shp")    

# Merge the calculated proportion with the shape file
pdat <- merge(sdat, prevalence, by = "DISTRICT")

# Plotting
pdat %>% 
  mutate(FIRST_DIST = DISTRICT) %>% 
  ggplot() + 
  geom_sf(data = sdat) +
  geom_sf(aes(fill = total_violence)) +
  facet_wrap(~time, ncol = 2) +
  theme_minimal() +
  xlab("Latitude") +
  ylab("Longtitude") +
  scale_fill_continuous(name = "IPV Prevalence (%)") +
  theme(legend.position = c(0.8,0.2),
        legend.direction = "horizontal",
        strip.background = element_rect(fill = "grey"))
        
ggsave(filename = "fig_dv.png",
       plot = last_plot(),
       path = "../output/",
       dpi = 600)

Domestic violence prevalence

4.3 Accessing databases

RStudio can interact with local or remote databases, running SQL-style queries through the DBI and odbc packages. See this link for more information.
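
A minimal sketch of what a connection might look like. The data source name ("my_database") and table name ("some_table") are placeholders for whatever your own setup uses:

library(DBI)

# Connect through a pre-configured ODBC data source name (hypothetical DSN)
con <- dbConnect(odbc::odbc(), dsn = "my_database")

# Run an ordinary SQL query and get the result back as a data frame
results <- dbGetQuery(con, "SELECT * FROM some_table LIMIT 10")

dbDisconnect(con)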

4.4 Accessing APIs

At some point you may be interested in using one of many publicly available data resources. In some cases, these data resources will have an application programming interface (API) that allows you to access/query data. This is a nice introduction to accessing APIs in R.

If you are lucky, someone has already created an easy-to-use package for accessing the API you are interested in. This is the case for NASS Quick Stats, a comprehensive database for agricultural data: the rnassqs package was built specifically for accessing NASS data in R. Note that you need to get your own API key from here.
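
A rough sketch of what a query could look like. The parameter names follow the Quick Stats conventions, but treat the exact values as illustrative and check the rnassqs documentation for specifics:

library(rnassqs)

# Authenticate with your personal API key (placeholder shown here)
nassqs_auth(key = "YOUR_API_KEY")

# Query corn area harvested in Iowa for 2020 (illustrative parameters)
params <- list(
  commodity_desc = "CORN",
  year = 2020,
  state_alpha = "IA",
  statisticcat_desc = "AREA HARVESTED"
)
corn_ia <- nassqs(params)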

4.5 Batch Downloading / Scraping data

Unfortunately, not all publicly available data will have an API to query from. Oftentimes data is hosted on a website through several different file links or dashboards, and you want all of it. R can be used to batch download data from several links or to scrape data from websites. The following is an example from Joey.

“The most comprehensive county-level datasets for crop acreage in the state of California that I have found are the annual crop reports compiled by the California County Agricultural Commissioners. They are readily available by year in .pdf, .xls, and .csv formats. However, downloading all the files individually and combining them manually is both time consuming and bad for reproducibility. The following batch of code should download all the currently available data (as of right now, 2020 is the most recent available year). This should take at most 5 minutes to run (depending on internet connection), but it usually runs in about 30 seconds. It will combine all the data into one object called county_crop_data.”

# These are the required packages
install.packages(c("RCurl", "XML", "stringr", "dplyr"))

library(RCurl)
library(XML)
library(stringr)  # for str_detect()
library(dplyr)    # for bind_rows() and the cleaning pipeline

# First we need to create a list of all .csv links on the webpage
url <- "https://www.nass.usda.gov/Statistics_by_State/California/Publications/AgComm/index.php"

html <- getURL(url)

doc <- htmlParse(html)

links <- xpathSApply(doc, "//a/@href")

free(doc)

csv_links <- links[str_detect(links, fixed(".csv"))]

# Clean up the links so the data can be easily downloaded
get_me <- paste("https://www.nass.usda.gov", csv_links, sep = "")

get_me <- unique(get_me)

# Remove the link for 2021; that year's data is incomplete
get_me <- get_me[!grepl("AgComm/2021/", get_me)]

# Loop over the links, importing the data one .csv at a time
crop_data <- list() # dump results here

for (i in get_me) {
  temp <- read.csv(i)
  names(temp) <- c("year", "commodity", "cropname", "countycode", "county",
                   "acres", "yield", "production", "price_per", "unit",
                   "value")
  crop_data[[length(crop_data) + 1]] <- temp
}

# Append all the data and clean it
county_crop_data <- do.call(bind_rows, crop_data) %>%
  # remove white space and correct misspelled county names
  mutate(county = trimws(county),
         county = ifelse(county == "San Luis Obisp", "San Luis Obispo", county)) %>%
  # remove state totals and rows with missing acreage
  filter(str_detect(county, "State|Sum of") == FALSE,
         !is.na(acres))

head(county_crop_data)

In this case, all the data is readily available in tables. You may also come across unorganized data that isn't available in tables at all. In such cases, Python has a more powerful scraping ecosystem than R and can handle much messier data.

4.6 Version control using git

R also supports version control systems like Git, which you can use to back up your work and collaborate with others. You can think of it as a combination of the Track Changes feature in Microsoft Word and the ability for multiple people to work on the same project. See this link for how to set up a GitHub account and connect RStudio with Git.
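
If you work in RStudio projects, one convenient path is the usethis package (this assumes you have already installed usethis and set up a GitHub account and personal access token):

library(usethis)

use_git()     # initialize a Git repository for the current RStudio project
use_github()  # create a matching GitHub repository and push to it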