The purpose of this module is to cover some more advanced data manipulation techniques (combining datasets and reshaping data), show how to run basic regressions, and to provide an overview of other R capabilities that you may find useful in the future.
R has several built-in datasets for educational purposes, which we will use for the following exercises.
discoveries – Yearly Numbers of Important Discoveries (inventions/scientific), 1860-1959
LakeHuron – Water Level of Lake Huron, 1875-1972
lynx – Annual Canadian Lynx Trappings, 1821-1934
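Since these series are stored as time-series vectors, we first need to transform them into data frames (the joins below also use dplyr, which loads with the tidyverse):

```r
library(tidyverse)

# Convert each built-in time series into a data frame with an explicit year column
discovery <- data.frame(
  year = 1860:1959,
  num_disc = discoveries
)
head(discovery)

huron_levels <- data.frame(
  year = 1875:1972,
  huron_level = LakeHuron
)
head(huron_levels)

lynx_trappings <- data.frame(
  year = 1821:1934,
  num_trap = lynx
)
head(lynx_trappings)
```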
1 Combining (Joining) Datasets
More often than not, you will want to combine datasets from several different sources. For example, you may want to combine population data by county from the US Census with county GDP data from the Bureau of Economic Analysis to create a final dataset with columns: county, population, GDP. To do this, the dplyr package has a series of functions called joins (based on the SQL language). Join commands are used to combine rows from two or more datasets, based on a related column between them. This resource provides nice visual representations for how joins work.
1.1 Inner Join
The most commonly used join is an inner_join(), which only keeps rows that match between two datasets (based on your matching variable). For the datasets we created above, each has the year variable that can be used to match observations. Since each of the datasets spans a different sequence of years, some observations will be dropped.
```r
# Create a new dataset with discoveries and levels of Lake Huron by year
disc_huron <- inner_join(discovery, huron_levels, by = "year")
head(disc_huron, 10)
```
If you want to join more than two datasets, you can tack on extra join statements with pipes (%>%).
```r
# Join all three datasets
disc_huron_lynx <- inner_join(discovery, huron_levels, by = "year") %>%
  inner_join(lynx_trappings, by = "year")
head(disc_huron_lynx, 10)
```
You can also join datasets using multiple matching variables. An example may look like this:
```r
county_data <- inner_join(county_gdp, county_pop, by = c("year", "county"))
```
1.2 Left Join
A left_join() will return all rows from the left (first listed) dataset and the matched rows from the right dataset. For left-hand rows that have no match, R automatically fills the right-hand columns with NA.
```r
disc_huron_left <- left_join(discovery, huron_levels, by = "year")
head(disc_huron_left, 20)
```
year num_disc huron_level
1 1860 5 NA
2 1861 3 NA
3 1862 0 NA
4 1863 2 NA
5 1864 0 NA
6 1865 3 NA
7 1866 2 NA
8 1867 3 NA
9 1868 6 NA
10 1869 1 NA
11 1870 2 NA
12 1871 1 NA
13 1872 2 NA
14 1873 1 NA
15 1874 3 NA
16 1875 3 580.38
17 1876 3 581.86
18 1877 5 580.97
19 1878 2 580.80
20 1879 4 579.79
1.3 Full Join
A full_join() is similar to a left_join(), except that all rows from both datasets are preserved. Again, the missing values are filled with NA.
```r
disc_huron_full <- full_join(discovery, huron_levels, by = "year")
head(disc_huron_full, 10)
```
year num_disc huron_level
1 1860 5 NA
2 1861 3 NA
3 1862 0 NA
4 1863 2 NA
5 1864 0 NA
6 1865 3 NA
7 1866 2 NA
8 1867 3 NA
9 1868 6 NA
10 1869 1 NA
```r
tail(disc_huron_full, 10)
```
year num_disc huron_level
104 1963 NA 576.89
105 1964 NA 575.96
106 1965 NA 576.80
107 1966 NA 577.68
108 1967 NA 578.38
109 1968 NA 578.52
110 1969 NA 579.74
111 1970 NA 579.31
112 1971 NA 579.89
113 1972 NA 579.96
1.4 Anti Join
Sometimes you will drop observations with inner_join() and not know why. To figure out why certain observations are being dropped, you can use anti_join(). An anti_join() returns only the observations from the left dataset that do not have a match in the right dataset. This is a good way to diagnose problems such as matching variables stored as different types (e.g., numeric vs. character) in the two datasets, or a typo somewhere.
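For example, to see which years of Lake Huron data never match a year in the discoveries data:

```r
# Years present in huron_levels (1875-1972) but not in discovery (1860-1959)
anti_join(huron_levels, discovery, by = "year")
```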
2 Reshaping Datasets
Datasets are typically available in two different formats: wide and long. A wide format dataset contains values that do not repeat in the subject/identifier column. A long format dataset contains values that do repeat in the subject/identifier column.
Depending on your end goal, one format may be preferable to the other. In general, the long format is better for storing data, running statistical models, and graphing in ggplot. Wide format data is easier to read and is generally better for spreadsheets or tables in reports.
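As a small illustration (the gdp_wide and gdp_long objects below are hypothetical, not built-in datasets), the same county GDP figures can be stored either way:

```r
# Wide format: one row per county, one column per year
gdp_wide <- data.frame(
  county   = c("Alameda", "Fresno"),
  gdp_2019 = c(100, 40),
  gdp_2020 = c(95, 42)
)

# Long format: the county repeats, with one row per county-year
gdp_long <- data.frame(
  county = c("Alameda", "Alameda", "Fresno", "Fresno"),
  year   = c(2019, 2020, 2019, 2020),
  gdp    = c(100, 95, 40, 42)
)
```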
2.1 Pivot wider
The us_rent_income dataset has median yearly income and median monthly rent by state. It is stored in R in a long format, where the estimate column holds the value for the measure named in the variable column and moe is its 90% margin of error.
```r
head(us_rent_income, 20)
```
# A tibble: 20 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01 Alabama income 24476 136
2 01 Alabama rent 747 3
3 02 Alaska income 32940 508
4 02 Alaska rent 1200 13
5 04 Arizona income 27517 148
6 04 Arizona rent 972 4
7 05 Arkansas income 23789 165
8 05 Arkansas rent 709 5
9 06 California income 29454 109
10 06 California rent 1358 3
11 08 Colorado income 32401 109
12 08 Colorado rent 1125 5
13 09 Connecticut income 35326 195
14 09 Connecticut rent 1123 5
15 10 Delaware income 31560 247
16 10 Delaware rent 1076 10
17 11 District of Columbia income 43198 681
18 11 District of Columbia rent 1424 17
19 12 Florida income 25952 70
20 12 Florida rent 1077 3
The pivot_wider() function transforms datasets from long format to wide format.
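For example, we can spread the estimate and moe values into separate income and rent columns:

```r
us_rent_income_wide <- us_rent_income %>%
  pivot_wider(
    names_from  = variable,
    values_from = c(estimate, moe)
  )
head(us_rent_income_wide, 20)
```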
2.2 Pivot longer
The relig_income dataset is in wide format, and it contains information on the number of individuals that fall within a certain income range based on their religion.
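```r
head(relig_income, 20)
```

The pivot_longer() function transforms datasets from wide format to long format. Here, the income-range columns are collapsed into an income column and a count column:

```r
relig_income_long <- relig_income %>%
  pivot_longer(
    cols      = -religion,
    names_to  = "income",
    values_to = "count"
  )
head(relig_income_long, 20)
```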
Sometimes pivot_wider() and pivot_longer() are tricky to work with. Be sure to read the documentation for each function.
3 Linear Regression
The lm() function is used to fit linear regression models. A fitted regression is stored as a list object that contains all of the relevant information about the model's fit and inner workings.
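A minimal sketch with the built-in cars dataset (not used elsewhere in this module) shows how to inspect that list:

```r
# Fit a simple model to a built-in dataset; the result is a list with class "lm"
fit <- lm(dist ~ speed, data = cars)
class(fit)
names(fit)              # components stored in the fitted object (coefficients, residuals, ...)
coef(fit)               # extract the estimated coefficients
summary(fit)$r.squared  # pull a single statistic out of the model summary
```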
Below is a simple example of how you would run a regression where you believe that the number of world scientific discoveries (dependent variable) is a function of the water levels of Lake Huron and the number of Canada lynx trappings (independent variables) in the same year.
```r
discovery_regression <- lm(
  # number of discoveries is a function of Lake Huron levels and lynx trappings
  formula = num_disc ~ huron_level + num_trap,
  # specify the dataset we created earlier
  data = disc_huron_lynx
)

# print regression results
summary(discovery_regression)
```
Call:
lm(formula = num_disc ~ huron_level + num_trap, data = disc_huron_lynx)
Residuals:
Min 1Q Median 3Q Max
-4.7715 -1.6706 -0.5306 1.2394 6.7945
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.687e+02 1.403e+02 -1.915 0.0606 .
huron_level 4.701e-01 2.421e-01 1.941 0.0572 .
num_trap 1.316e-04 1.923e-04 0.684 0.4965
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.361 on 57 degrees of freedom
Multiple R-squared: 0.06419, Adjusted R-squared: 0.03136
F-statistic: 1.955 on 2 and 57 DF, p-value: 0.1509
```r
# save regression results as a data frame
regression_results <- data.frame(summary(discovery_regression)$coef)
regression_results <- round(regression_results, digits = 5)
names(regression_results) <- c("coef", "stderr", "tval", "pval")
regression_results
```
4 R Capabilities Beyond These Modules
Overall, these modules are intended to provide a surface-level introduction to R and the functions that you will use most frequently as a graduate student in economics (if you choose to use R). Below are some examples of R’s other capabilities that may or may not be useful for you in the future.
4.1 Statistical Modeling
R is first and foremost a programming language for statistical computing. R has many built-in regression tools; beyond those, some very smart people have created very useful packages for implementing complex statistical modeling techniques. I have used all of the following packages at some point in my research:
The fixest package: Fast and user-friendly estimation of econometric models with multiple fixed effects. Includes ordinary least squares (OLS), generalized linear models (GLM) and the negative binomial. (A minimal usage sketch follows this list of packages.)
The ranger package: A fast implementation of Random Forests, particularly suited for high dimensional data. Ensembles of classification, regression, survival and probability prediction trees are supported.
The micEconAids package: Functions and tools for analyzing consumer demand with the Almost Ideal Demand System (AIDS) suggested by Deaton and Muellbauer (1980).
The cluster package: Methods for cluster analysis, much extended from the original code by Peter Rousseeuw, Anja Struyf and Mia Hubert, based on Kaufman and Rousseeuw (1990), “Finding Groups in Data”.
The MatchIt package: Selects matched samples of the original treated and control groups with similar covariate distributions – can be used to match exactly on covariates, to match on propensity scores, or perform a variety of other matching procedures.
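Below is a minimal sketch of what estimation with fixest can look like. It uses the built-in mtcars data purely as a stand-in; the variables and the fixed effect are illustrative, not part of this module's datasets:

```r
library(fixest)

# OLS of fuel economy on weight with cylinder-count fixed effects,
# clustering standard errors by the fixed-effect group
fe_model <- feols(mpg ~ wt | cyl, data = mtcars, cluster = ~cyl)
summary(fe_model)
```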
4.2 GIS
R is also capable of handling spatial analysis and visualization. This is especially relevant for agricultural, environmental, and applied economists, who often need to analyse spatial data in their research. This book is a great resource for learning spatial analysis in R. Below is an example of a GIS visualization that I made during my master's.
```r
library(mapsf)
library(sf)
library(tmap)
library(ggspatial)

# Loading demographic health survey pooled data set (Nepal) for the years 2011, 2016 and 2022
mdat <- readRDS("../data/clean/merged.RDS")

# Calculating proportion of married women facing domestic violence by district and year
prevalence <- mdat %>%
  group_by(DISTRICT, time) %>%
  summarise(total_violence = round(mean(total_violence) * 100, 2))

# Importing shape file with district boundaries
sdat <- st_read("../data/shp_districts/districts.shp")

# Merge the calculated proportion with the shape file
pdat <- merge(sdat, prevalence, by = "DISTRICT")

# Plotting
pdat %>%
  mutate(FIRST_DIST = DISTRICT) %>%
  ggplot() +
  geom_sf(data = sdat) +
  geom_sf(aes(fill = total_violence)) +
  facet_wrap(~time, ncol = 2) +
  theme_minimal() +
  xlab("Longitude") +
  ylab("Latitude") +
  scale_fill_continuous(name = "IPV Prevalence (%)") +
  theme(
    legend.position = c(0.8, 0.2),
    legend.direction = "horizontal",
    strip.background = element_rect(fill = "grey")
  )

ggsave(
  filename = "fig_dv.png",
  plot = last_plot(),
  path = "../output/",
  dpi = 600
)
```
Domestic violence prevalence
4.3 Accessing databases
R can connect to local and remote databases and query them with SQL using the odbc package (together with DBI). See the odbc package documentation for more information.
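A minimal connection sketch, where the data source name ("MyDatabase"), table, and query are placeholders for your own setup:

```r
library(DBI)
library(odbc)

# Connect to a database through an ODBC data source (DSN is a placeholder)
con <- dbConnect(odbc::odbc(), dsn = "MyDatabase")

# List the available tables and pull the results of a query into a data frame
dbListTables(con)
county_pop <- dbGetQuery(con, "SELECT county, population FROM census_pop WHERE year = 2020")

dbDisconnect(con)
```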
4.4 Accessing APIs
At some point you may be interested in using one of many publicly available data resources. In some cases, these resources will have an application programming interface (API) that allows you to access and query the data programmatically. This is a nice introduction to accessing APIs in R.
If you are lucky, someone has already created an easy-to-use package for accessing the API you are interested in. This is the case for NASS Quick Stats, a comprehensive database for agricultural data: the rnassqs package was built specifically for accessing NASS data in R. You will, however, need to request your own API key from the Quick Stats API page (https://quickstats.nass.usda.gov/api).
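A rough sketch of what a query can look like with rnassqs is below. The parameter names (commodity_desc, state_alpha, year__GE) come from the Quick Stats API and the key is a placeholder; treat this as a starting point and check the package documentation:

```r
library(rnassqs)

# Authenticate with your personal Quick Stats API key (placeholder shown here)
nassqs_auth(key = "YOUR_API_KEY")

# Example query: corn statistics for Virginia from 2015 onward
params <- list(
  commodity_desc = "CORN",
  state_alpha    = "VA",
  year__GE       = 2015
)
corn_va <- nassqs(params)
head(corn_va)
```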
4.5 Batch Downloading / Scraping data
Unfortunately, not all publicly available data will have an API to query from. Oftentimes data is hosted on a website through several different file links or dashboards, and you want all of it. R can be used to batch download data from several links or to scrape data from websites. The following is an example from Joey.
“The most comprehensive county-level datasets for crop acreage in the state of California that I have found are the annual crop reports compiled by the California County Agricultural Commissioners. They are readily available by year in .pdf, .xls, and .csv formats. However, downloading all the files individually and combining them manually is both time consuming and bad for reproducibility. The following batch of code should download all the currently available data (as of right now, 2020 is the most recent available year). This should take at most 5 minutes to run (depending on internet connection), but it usually runs in about 30 seconds. It will combine all the data into one object called county_crop_data.”
```r
# These are the required packages
install.packages(c("RCurl", "XML"))
library(RCurl)
library(XML)
library(dplyr)    # for bind_rows() and the pipe
library(stringr)  # for str_detect()

# First need to create a list of all .csv links on the webpage
url <- "https://www.nass.usda.gov/Statistics_by_State/California/Publications/AgComm/index.php"
html <- getURL(url)
doc <- htmlParse(html)
links <- xpathSApply(doc, "//a/@href")
free(doc)
csv_links <- links[str_detect(links, ".csv")]

# Clean up the links so the data can be easily downloaded
get_me <- paste("https://www.nass.usda.gov", csv_links, sep = "")
get_me <- unique(get_me)

# remove link for 2021, data is incomplete
get_me <- get_me[!grepl("AgComm/2021/", get_me)]

# Create a loop to import all the data one .csv at a time
crop_data <- list() # dump results here
for (i in get_me) {
  temp <- read.csv(i)
  names(temp) <- c(
    "year", "commodity", "cropname", "countycode", "county",
    "acres", "yield", "production", "price_per", "unit", "value"
  )
  crop_data[[length(crop_data) + 1]] <- temp
}

# Append all the data and clean it
county_crop_data <- do.call(bind_rows, crop_data) %>%
  # remove white space and correct misspelled county names
  mutate(
    county = trimws(county),
    county = ifelse(county == "San Luis Obisp", "San Luis Obispo", county)
  ) %>%
  # remove state totals and remove missing data
  filter(
    str_detect(county, "State|Sum of") == FALSE,
    is.na(acres) == FALSE
  )

head(county_crop_data)
```
In this case, all of the data is readily available in tables. You may also come across data that is not organized into tables at all; for that kind of heavily unstructured scraping, Python's tooling is generally more powerful than R's.
4.6 Version control using git
R also works with version control systems like git, which you can use to track changes, back up your work, and collaborate with others. You can think of it as Track Changes in Microsoft Word combined with a way for multiple people to work on the same project. See https://happygitwithr.com/big-picture for how to set up a GitHub account and connect RStudio with git.
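If you prefer to do the setup from the R console, the usethis package wraps most of these steps. A rough sketch, assuming you already have a GitHub account:

```r
library(usethis)

# Initialize git in the current RStudio project
use_git()

# Generate a GitHub personal access token, store it, then connect the project to GitHub
create_github_token()    # opens GitHub in the browser to generate a token
gitcreds::gitcreds_set() # paste the token so R can use it
use_github()             # creates the remote repository and pushes the project
```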