Importing data



Environmental Data Analysis and Visualization

Reading rectangular data into R

Reading rectangular data into R

Functions for reading rectangular data into R

readr

  • read_csv() - comma delimited files
  • read_csv2() - semicolon separated files (common in countries where , is used as the decimal place)
  • read_tsv() - tab delimited files
  • read_delim() - reads in files with any delimiter
  • read_fwf() - fixed width files

readxl

  • read_excel() - read xls or xlsx files

Reading data

nobel <- read_csv(file = "data/nobel.csv")
nobel
# A tibble: 935 × 26
      id firstname   surname  year category affiliation city  country born_date 
   <dbl> <chr>       <chr>   <dbl> <chr>    <chr>       <chr> <chr>   <date>    
 1     1 Wilhelm Co… Röntgen  1901 Physics  Munich Uni… Muni… Germany 1845-03-27
 2     2 Hendrik A.  Lorentz  1902 Physics  Leiden Uni… Leid… Nether… 1853-07-18
 3     3 Pieter      Zeeman   1902 Physics  Amsterdam … Amst… Nether… 1865-05-25
 4     4 Henri       Becque…  1903 Physics  École Poly… Paris France  1852-12-15
 5     5 Pierre      Curie    1903 Physics  École muni… Paris France  1859-05-15
 6     6 Marie       Curie    1903 Physics  <NA>        <NA>  <NA>    1867-11-07
 7     6 Marie       Curie    1911 Chemist… Sorbonne U… Paris France  1867-11-07
 8     8 Lord        Raylei…  1904 Physics  Royal Inst… Lond… United… 1842-11-12
 9     9 Philipp     Lenard   1905 Physics  Kiel Unive… Kiel  Germany 1862-06-07
10    10 J.J.        Thomson  1906 Physics  University… Camb… United… 1856-12-18
# ℹ 925 more rows
# ℹ 17 more variables: died_date <date>, gender <chr>, born_city <chr>,
#   born_country <chr>, born_country_code <chr>, died_city <chr>,
#   died_country <chr>, died_country_code <chr>, overall_motivation <chr>,
#   share <dbl>, motivation <chr>, born_country_original <chr>,
#   born_city_original <chr>, died_country_original <chr>,
#   died_city_original <chr>, city_original <chr>, country_original <chr>

Writing data

  • Write a file
df <- tribble(
  ~x, ~y,
  1,  "a",
  2,  "b",
  3,  "c"
)

write_csv(df, file = "data/df.csv")
  • Read it back in to inspect
read_csv("data/df.csv")
# A tibble: 3 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 b    
3     3 c    

Take charge of your files

Be aware that these things exist:

  • Working directory
  • File folders
  • RStudio projects

Working directory

  • Where R will look, by default, for files you ask it to load
  • Where files you write to disk will go, by default
  • You can type getwd() into the console to check the working directory


getwd()
[1] "/Users/cbgurbisz/Documents/Teaching/MRNE222-sp25/slides/10-data-import"

File folders

  • “file folders” are like a physical folder in a filing cabinet - they are containers to store multiple files together
  • organized by topic or project
  • USE THEM and know where to find them on your computer

RStudio projects

  • Feature for keeping all files associated with project together
  • Our labs and exercises are each associated with a project
  • Always start a new analysis by creating a new project. Keep all data and .qmd files associated with that project in your project folder.
  • Open the project by double-clicking on your project file - R now knows that the folder containing your project file is the working directory.

Project organization

  • Save .Rproj file in main project folder
  • Make folders for data, scripts (.qmd files), figures, and docs (reports, etc.)

Projects are key to reproducibility

Without .Rproj, you’d need to read and write files using your entire directory structure. Is this reproducible? Also, who wants to type all of that out?


read_csv("/users/cbgurbisz/documents/project/data/biomass.csv")

Projects are key to reproducibility

  • .Rproj tells R that your base project folder is the working directory.
  • You (and other users who clone your entire project and associated files and folders) now only need to tell R how to access files and folders within your project folder.


read_csv("data/biomass.csv")

Try it

ae-09 nobels-csv.qmd

  • Posit Cloud > open nobels-csv.qmd
  • Read in the nobels.csv file from the data-raw/ folder.
  • Split into two (STEM and non-STEM):
    • Create a new data frame, nobel_stem, that filters for the STEM fields (Physics, Medicine, Chemistry, and Economics).
    • Create another data frame, nobel_nonstem, that filters for the remaining fields.
  • Write the two data frames to nobel-stem.csv and nobel-nonstem.csv, respectively, to data/.

Hint: Use the %in% operator when filter()ing.

Filtering with %in%

The %in% operator is used to filter rows that contain a value in a vector


penguins |> 
  distinct(island)
# A tibble: 3 × 1
  island   
  <fct>    
1 Torgersen
2 Biscoe   
3 Dream    

Filtering with %in%

Keep rows containing values of island are in the vector of islands supplied - here just “Torgersen”

penguins |> 
  filter(island %in% "Torgersen") |> 
  distinct(island)
# A tibble: 1 × 1
  island   
  <fct>    
1 Torgersen

Filtering with %in%

Keep rows where values of island are in the vector of islands supplied - “Torgersen” and “Dream”.


penguins |> 
  filter(island %in% c("Torgersen", "Dream")) |> 
  distinct(island)
# A tibble: 2 × 1
  island   
  <fct>    
1 Torgersen
2 Dream    

Filtering with %in%

This can also be done with the == and | operators but %in% becomes useful when you have a longer list of values that you want to include.


penguins |> 
  filter(island == "Torgersen" |
         island == "Dream") |> 
  distinct(island)
# A tibble: 2 × 1
  island   
  <fct>    
1 Torgersen
2 Dream    

Variable names

Data with bad names

edibnb_badnames <- read_csv("data/edibnb-badnames.csv")
names(edibnb_badnames)
 [1] "ID"                   "Price"                "neighbourhood"       
 [4] "accommodates"         "Number of bathrooms"  "Number of Bedrooms"  
 [7] "n beds"               "Review Scores Rating" "Number of reviews"   
[10] "listing_url"         

R doesn’t allow spaces in variable names

ggplot(edibnb_badnames, aes(x = Number of bathrooms, y = Price)) +
  geom_point()
Error: <text>:1:40: unexpected symbol
1: ggplot(edibnb_badnames, aes(x = Number of
                                           ^

Option 1 - You define the column names when you read in the data

edibnb_col_names <- read_csv("data/edibnb-badnames.csv",
                             col_names = c("id", "price", 
                                           "neighborhood", "accommodates",
                                           "bathroom", "bedroom", 
                                           "bed", "review_scores_rating", 
                                           "n_reviews", "url"))

names(edibnb_col_names)
 [1] "id"                   "price"                "neighborhood"        
 [4] "accommodates"         "bathroom"             "bedroom"             
 [7] "bed"                  "review_scores_rating" "n_reviews"           
[10] "url"                 

Option 2 - Format text to snake_case using the clean_names function (from the janitor package)

library(janitor)

edibnb_clean_names <- read_csv("data/edibnb-badnames.csv") |>
  clean_names()

names(edibnb_clean_names)
 [1] "id"                   "price"                "neighbourhood"       
 [4] "accommodates"         "number_of_bathrooms"  "number_of_bedrooms"  
 [7] "n_beds"               "review_scores_rating" "number_of_reviews"   
[10] "listing_url"         

Variable types

Which type is x? Why?

# A tibble: 9 × 3
  x     y              z     
  <chr> <chr>          <chr> 
1 1     a              hi    
2 <NA>  b              hello 
3 3     Not applicable 9999  
4 4     d              ola   
5 5     e              hola  
6 .     f              whatup
7 7     g              wassup
8 8     h              sup   
9 9     i              <NA>  

Option 1. Be explicit about observations that are actually NAs

read_csv("data/df-na.csv", 
         na = c("", "NA", ".", "9999", "Not applicable"))
# A tibble: 9 × 3
      x y     z     
  <dbl> <chr> <chr> 
1     1 a     hi    
2    NA b     hello 
3     3 <NA>  <NA>  
4     4 d     ola   
5     5 e     hola  
6    NA f     whatup
7     7 g     wassup
8     8 h     sup   
9     9 i     <NA>  

Option 2. Specify column types

read_csv("data/df-na.csv", col_types = list(col_double(), 
                                            col_character(), 
                                            col_character()))
# A tibble: 9 × 3
      x y              z     
  <dbl> <chr>          <chr> 
1     1 a              hi    
2    NA b              hello 
3     3 Not applicable 9999  
4     4 d              ola   
5     5 e              hola  
6    NA f              whatup
7     7 g              wassup
8     8 h              sup   
9     9 i              <NA>  

Column types

type function data type
col_character() character
col_date() date
col_datetime() POSIXct (date-time)
col_double() double (numeric)
col_factor() factor
col_guess() let readr guess (default)
col_integer() integer
col_logical() logical
col_number() numbers mixed with non-number characters
col_numeric() double or integer
col_skip() do not read
col_time() time

Case study: favorite foods

Favorite foods

Favorite foods

fav_food <- read_excel("data/favourite-food.xlsx")

fav_food
# A tibble: 5 × 6
  `Student ID` `Full Name`      favourite.food     mealPlan          AGE   SES  
         <dbl> <chr>            <chr>              <chr>             <chr> <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only        4     High 
2            2 Barclay Lynn     French fries       Lunch only        5     Midd…
3            3 Jayendra Lyne    N/A                Breakfast and lu… 7     Low  
4            4 Leon Rossini     Anchovies          Lunch only        99999 Midd…
5            5 Chidiegwu Dunkel Pizza              Breakfast and lu… five  High 

Variable names

Clean variable names

fav_food <- read_excel("data/favourite-food.xlsx") |>
  clean_names() 

fav_food 
# A tibble: 5 × 6
  student_id full_name        favourite_food     meal_plan           age   ses  
       <dbl> <chr>            <chr>              <chr>               <chr> <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4     High 
2          2 Barclay Lynn     French fries       Lunch only          5     Midd…
3          3 Jayendra Lyne    N/A                Breakfast and lunch 7     Low  
4          4 Leon Rossini     Anchovies          Lunch only          99999 Midd…
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five  High 

Handling NAs

Handling NAs

fav_food <- read_excel("data/favourite-food.xlsx",
                       na = c("N/A", "99999")) |> 
  clean_names()

fav_food 
# A tibble: 5 × 6
  student_id full_name        favourite_food     meal_plan           age   ses  
       <dbl> <chr>            <chr>              <chr>               <chr> <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4     High 
2          2 Barclay Lynn     French fries       Lunch only          5     Midd…
3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7     Low  
4          4 Leon Rossini     Anchovies          Lunch only          <NA>  Midd…
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five  High 

Make age numeric

fav_food <- fav_food |>
  mutate( 
    age = if_else(age == "five", "5", age), 
    age = as.numeric(age) 
    ) 

fav_food
# A tibble: 5 × 6
  student_id full_name        favourite_food     meal_plan             age ses  
       <dbl> <chr>            <chr>              <chr>               <dbl> <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4 High 
2          2 Barclay Lynn     French fries       Lunch only              5 Midd…
3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7 Low  
4          4 Leon Rossini     Anchovies          Lunch only             NA Midd…
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5 High 

Make ses factor

fav_food |>
  count(ses)
# A tibble: 3 × 2
  ses        n
  <chr>  <int>
1 High       2
2 Low        1
3 Middle     2

Make ses factor

fav_food <- fav_food |>
  mutate(ses = fct_relevel(ses, "Low", "Middle", "High")) #<<

fav_food |>
  count(ses)
# A tibble: 3 × 2
  ses        n
  <fct>  <int>
1 Low        1
2 Middle     2
3 High       2

Putting it altogether

fav_food <- read_excel("data/favourite-food.xlsx", na = c("N/A", "99999")) |>
  clean_names() |>
  mutate(
    age = if_else(age == "five", "5", age), 
    age = as.numeric(age),
    ses = fct_relevel(ses, "Low", "Middle", "High")
  )

fav_food
# A tibble: 5 × 6
  student_id full_name        favourite_food     meal_plan             age ses  
       <dbl> <chr>            <chr>              <chr>               <dbl> <fct>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4 High 
2          2 Barclay Lynn     French fries       Lunch only              5 Midd…
3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7 Low  
4          4 Leon Rossini     Anchovies          Lunch only             NA Midd…
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5 High 

Out and back in

Write data to .csv and read back in.

write_csv(fav_food, file = "data/fav-food-clean.csv")

fav_food_clean <- read_csv("data/fav-food-clean.csv")

Often, we do data processing in one or more “processing” .qmd files, we write the output to file, and read the clean data into a different “analysis” .qmd file.

But what happened to ses when we read it back in?

fav_food_clean |>
  count(ses)
# A tibble: 3 × 2
  ses        n
  <chr>  <int>
1 High       2
2 Low        1
3 Middle     2

read_rds() and write_rds()

  • CSVs can be unreliable for saving interim results if there is specific variable type information you want to hold on to (factors, dates, etc.).

  • An alternative is RDS files. You can read and write them with read_rds() and write_rds(), respectively.

read_rds(path)
write_rds(x, path)

Out and back in, take 2

write_rds(fav_food, file = "data/fav-food-clean.rds")

fav_food_clean <- read_rds("data/fav-food-clean.rds")

fav_food_clean |>
  count(ses)
# A tibble: 3 × 2
  ses        n
  <fct>  <int>
1 Low        1
2 Middle     2
3 High       2

Other types of data sources

Other data sources

  • googlesheets4: Google Sheets
  • haven: SPSS, Stata, and SAS files
  • DBI, along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc): allows you to run SQL queries against a database and return a data frame
  • jsonline: JSON
  • xml2: xml
  • rvest: web scraping
  • httr: web APIs
  • sparklyr: data loaded into spark

Try it

ae-09 nobels-csv.qmd

  • Posit Cloud > sales-excel.qmd.
  • Load the sales.xlsx file from the data-raw/ folder, using appropriate arguments for the read_excel() function such that it looks like the output on the left.
  • Stretch goal: Manipulate the sales data such that it looks like the output on the right.
# A tibble: 9 × 2
  id      n    
  <chr>   <chr>
1 Brand 1 n    
2 1234    8    
3 8721    2    
4 1822    3    
5 Brand 2 n    
6 3333    1    
7 2156    3    
8 3987    6    
9 3216    5    
# A tibble: 7 × 3
  brand      id     n
  <chr>   <dbl> <dbl>
1 Brand 1  1234     8
2 Brand 1  8721     2
3 Brand 1  1822     3
4 Brand 2  3333     1
5 Brand 2  2156     3
6 Brand 2  3987     6
7 Brand 2  3216     5