Importing data

Environmental Data Analysis and Visualization

https://mrne222-sp25.github.io/website/

Reading rectangular data into R

Functions for reading rectangular data into R

readr

read_csv() - comma delimited files
read_csv2() - semicolon separated files (common in countries where , is used as the decimal place)
read_tsv() - tab delimited files
read_delim() - reads in files with any delimiter
read_fwf() - fixed width files
…

readxl

read_excel() - read xls or xlsx files
…

Reading data

nobel <- read_csv(file = "data/nobel.csv")
nobel

# A tibble: 935 × 26
      id firstname   surname  year category affiliation city  country born_date 
   <dbl> <chr>       <chr>   <dbl> <chr>    <chr>       <chr> <chr>   <date>    
 1     1 Wilhelm Co… Röntgen  1901 Physics  Munich Uni… Muni… Germany 1845-03-27
 2     2 Hendrik A.  Lorentz  1902 Physics  Leiden Uni… Leid… Nether… 1853-07-18
 3     3 Pieter      Zeeman   1902 Physics  Amsterdam … Amst… Nether… 1865-05-25
 4     4 Henri       Becque…  1903 Physics  École Poly… Paris France  1852-12-15
 5     5 Pierre      Curie    1903 Physics  École muni… Paris France  1859-05-15
 6     6 Marie       Curie    1903 Physics  <NA>        <NA>  <NA>    1867-11-07
 7     6 Marie       Curie    1911 Chemist… Sorbonne U… Paris France  1867-11-07
 8     8 Lord        Raylei…  1904 Physics  Royal Inst… Lond… United… 1842-11-12
 9     9 Philipp     Lenard   1905 Physics  Kiel Unive… Kiel  Germany 1862-06-07
10    10 J.J.        Thomson  1906 Physics  University… Camb… United… 1856-12-18
# ℹ 925 more rows
# ℹ 17 more variables: died_date <date>, gender <chr>, born_city <chr>,
#   born_country <chr>, born_country_code <chr>, died_city <chr>,
#   died_country <chr>, died_country_code <chr>, overall_motivation <chr>,
#   share <dbl>, motivation <chr>, born_country_original <chr>,
#   born_city_original <chr>, died_country_original <chr>,
#   died_city_original <chr>, city_original <chr>, country_original <chr>

Writing data

Write a file

df <- tribble(
  ~x, ~y,
  1,  "a",
  2,  "b",
  3,  "c"
)

write_csv(df, file = "data/df.csv")

Read it back in to inspect

read_csv("data/df.csv")

# A tibble: 3 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 b    
3     3 c

Take charge of your files

Be aware that these things exist:

Working directory
File folders
RStudio projects

Working directory

Where R will look, by default, for files you ask it to load
Where files you write to disk will go, by default
You can type getwd() into the console to check the working directory

getwd()

[1] "/Users/cbgurbisz/Documents/Teaching/MRNE222-sp25/slides/10-data-import"

File folders

“file folders” are like a physical folder in a filing cabinet - they are containers to store multiple files together
organized by topic or project
USE THEM and know where to find them on your computer

RStudio projects

Feature for keeping all files associated with project together
Our labs and exercises are each associated with a project
Always start a new analysis by creating a new project. Keep all data and .qmd files associated with that project in your project folder.
Open the project by double-clicking on your project file - R now knows that the folder containing your project file is the working directory.

Project organization

Save .Rproj file in main project folder
Make folders for data, scripts (.qmd files), figures, and docs (reports, etc.)

Projects are key to reproducibility

Without .Rproj, you’d need to read and write files using your entire directory structure. Is this reproducible? Also, who wants to type all of that out?

read_csv("/users/cbgurbisz/documents/project/data/biomass.csv")

Projects are key to reproducibility

.Rproj tells R that your base project folder is the working directory.
You (and other users who clone your entire project and associated files and folders) now only need to tell R how to access files and folders within your project folder.

read_csv("data/biomass.csv")

Try it

ae-09 nobels-csv.qmd

Posit Cloud > open nobels-csv.qmd
Read in the nobels.csv file from the data-raw/ folder.
Split into two (STEM and non-STEM):
- Create a new data frame, nobel_stem, that filters for the STEM fields (Physics, Medicine, Chemistry, and Economics).
- Create another data frame, nobel_nonstem, that filters for the remaining fields.
Write the two data frames to nobel-stem.csv and nobel-nonstem.csv, respectively, to data/.

Hint: Use the %in% operator when filter()ing.

Filtering with `%in%`

The %in% operator is used to filter rows that contain a value in a vector

penguins |> 
  distinct(island)

# A tibble: 3 × 1
  island   
  <fct>    
1 Torgersen
2 Biscoe   
3 Dream

Filtering with `%in%`

Keep rows containing values of island are in the vector of islands supplied - here just “Torgersen”

penguins |> 
  filter(island %in% "Torgersen") |> 
  distinct(island)

# A tibble: 1 × 1
  island   
  <fct>    
1 Torgersen

Filtering with `%in%`

Keep rows where values of island are in the vector of islands supplied - “Torgersen” and “Dream”.

penguins |> 
  filter(island %in% c("Torgersen", "Dream")) |> 
  distinct(island)

# A tibble: 2 × 1
  island   
  <fct>    
1 Torgersen
2 Dream

Filtering with `%in%`

This can also be done with the == and | operators but %in% becomes useful when you have a longer list of values that you want to include.

penguins |> 
  filter(island == "Torgersen" |
         island == "Dream") |> 
  distinct(island)

# A tibble: 2 × 1
  island   
  <fct>    
1 Torgersen
2 Dream

Variable names

Data with bad names

edibnb_badnames <- read_csv("data/edibnb-badnames.csv")
names(edibnb_badnames)

 [1] "ID"                   "Price"                "neighbourhood"       
 [4] "accommodates"         "Number of bathrooms"  "Number of Bedrooms"  
 [7] "n beds"               "Review Scores Rating" "Number of reviews"   
[10] "listing_url"

R doesn’t allow spaces in variable names

ggplot(edibnb_badnames, aes(x = Number of bathrooms, y = Price)) +
  geom_point()

Error: <text>:1:40: unexpected symbol
1: ggplot(edibnb_badnames, aes(x = Number of
                                           ^

Option 1 - You define the column names when you read in the data

edibnb_col_names <- read_csv("data/edibnb-badnames.csv",
                             col_names = c("id", "price", 
                                           "neighborhood", "accommodates",
                                           "bathroom", "bedroom", 
                                           "bed", "review_scores_rating", 
                                           "n_reviews", "url"))

names(edibnb_col_names)

 [1] "id"                   "price"                "neighborhood"        
 [4] "accommodates"         "bathroom"             "bedroom"             
 [7] "bed"                  "review_scores_rating" "n_reviews"           
[10] "url"

Option 2 - Format text to snake_case using the `clean_names` function (from the `janitor` package)

library(janitor)

edibnb_clean_names <- read_csv("data/edibnb-badnames.csv") |>
  clean_names()

names(edibnb_clean_names)

 [1] "id"                   "price"                "neighbourhood"       
 [4] "accommodates"         "number_of_bathrooms"  "number_of_bedrooms"  
 [7] "n_beds"               "review_scores_rating" "number_of_reviews"   
[10] "listing_url"

Variable types

Which type is `x`? Why?

# A tibble: 9 × 3
  x     y              z     
  <chr> <chr>          <chr> 
1 1     a              hi    
2 <NA>  b              hello 
3 3     Not applicable 9999  
4 4     d              ola   
5 5     e              hola  
6 .     f              whatup
7 7     g              wassup
8 8     h              sup   
9 9     i              <NA>

Option 1. Be explicit about observations that are actually NAs

read_csv("data/df-na.csv", 
         na = c("", "NA", ".", "9999", "Not applicable"))

# A tibble: 9 × 3
      x y     z     
  <dbl> <chr> <chr> 
1     1 a     hi    
2    NA b     hello 
3     3 <NA>  <NA>  
4     4 d     ola   
5     5 e     hola  
6    NA f     whatup
7     7 g     wassup
8     8 h     sup   
9     9 i     <NA>

Option 2. Specify column types

read_csv("data/df-na.csv", col_types = list(col_double(), 
                                            col_character(), 
                                            col_character()))

# A tibble: 9 × 3
      x y              z     
  <dbl> <chr>          <chr> 
1     1 a              hi    
2    NA b              hello 
3     3 Not applicable 9999  
4     4 d              ola   
5     5 e              hola  
6    NA f              whatup
7     7 g              wassup
8     8 h              sup   
9     9 i              <NA>

Column types

type function	data type
`col_character()`	character
`col_date()`	date
`col_datetime()`	POSIXct (date-time)
`col_double()`	double (numeric)
`col_factor()`	factor
`col_guess()`	let readr guess (default)
`col_integer()`	integer
`col_logical()`	logical
`col_number()`	numbers mixed with non-number characters
`col_numeric()`	double or integer
`col_skip()`	do not read
`col_time()`	time

Case study: favorite foods

Favorite foods

fav_food <- read_excel("data/favourite-food.xlsx")

fav_food

# A tibble: 5 × 6
  `Student ID` `Full Name`      favourite.food     mealPlan          AGE   SES  
         <dbl> <chr>            <chr>              <chr>             <chr> <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only        4     High 
2            2 Barclay Lynn     French fries       Lunch only        5     Midd…
3            3 Jayendra Lyne    N/A                Breakfast and lu… 7     Low  
4            4 Leon Rossini     Anchovies          Lunch only        99999 Midd…
5            5 Chidiegwu Dunkel Pizza              Breakfast and lu… five  High

Variable names

Clean variable names

fav_food <- read_excel("data/favourite-food.xlsx") |>
  clean_names() 

fav_food

# A tibble: 5 × 6
  student_id full_name        favourite_food     meal_plan           age   ses  
       <dbl> <chr>            <chr>              <chr>               <chr> <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4     High 
2          2 Barclay Lynn     French fries       Lunch only          5     Midd…
3          3 Jayendra Lyne    N/A                Breakfast and lunch 7     Low  
4          4 Leon Rossini     Anchovies          Lunch only          99999 Midd…
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five  High

Handling NAs

fav_food <- read_excel("data/favourite-food.xlsx",
                       na = c("N/A", "99999")) |> 
  clean_names()

fav_food

# A tibble: 5 × 6
  student_id full_name        favourite_food     meal_plan           age   ses  
       <dbl> <chr>            <chr>              <chr>               <chr> <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4     High 
2          2 Barclay Lynn     French fries       Lunch only          5     Midd…
3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7     Low  
4          4 Leon Rossini     Anchovies          Lunch only          <NA>  Midd…
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five  High

Make `age` numeric

fav_food <- fav_food |>
  mutate( 
    age = if_else(age == "five", "5", age), 
    age = as.numeric(age) 
    ) 

fav_food

# A tibble: 5 × 6
  student_id full_name        favourite_food     meal_plan             age ses  
       <dbl> <chr>            <chr>              <chr>               <dbl> <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4 High 
2          2 Barclay Lynn     French fries       Lunch only              5 Midd…
3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7 Low  
4          4 Leon Rossini     Anchovies          Lunch only             NA Midd…
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5 High

Make `ses` factor

fav_food |>
  count(ses)

# A tibble: 3 × 2
  ses        n
  <chr>  <int>
1 High       2
2 Low        1
3 Middle     2

Make `ses` factor

fav_food <- fav_food |>
  mutate(ses = fct_relevel(ses, "Low", "Middle", "High")) #<<

fav_food |>
  count(ses)

# A tibble: 3 × 2
  ses        n
  <fct>  <int>
1 Low        1
2 Middle     2
3 High       2

Putting it altogether

fav_food <- read_excel("data/favourite-food.xlsx", na = c("N/A", "99999")) |>
  clean_names() |>
  mutate(
    age = if_else(age == "five", "5", age), 
    age = as.numeric(age),
    ses = fct_relevel(ses, "Low", "Middle", "High")
  )

fav_food

# A tibble: 5 × 6
  student_id full_name        favourite_food     meal_plan             age ses  
       <dbl> <chr>            <chr>              <chr>               <dbl> <fct>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4 High 
2          2 Barclay Lynn     French fries       Lunch only              5 Midd…
3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7 Low  
4          4 Leon Rossini     Anchovies          Lunch only             NA Midd…
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5 High

Out and back in

Write data to .csv and read back in.

write_csv(fav_food, file = "data/fav-food-clean.csv")

fav_food_clean <- read_csv("data/fav-food-clean.csv")

Often, we do data processing in one or more “processing” .qmd files, we write the output to file, and read the clean data into a different “analysis” .qmd file.

But what happened to `ses` when we read it back in?

fav_food_clean |>
  count(ses)

# A tibble: 3 × 2
  ses        n
  <chr>  <int>
1 High       2
2 Low        1
3 Middle     2

`read_rds()` and `write_rds()`

CSVs can be unreliable for saving interim results if there is specific variable type information you want to hold on to (factors, dates, etc.).
An alternative is RDS files. You can read and write them with read_rds() and write_rds(), respectively.

read_rds(path)
write_rds(x, path)

Out and back in, take 2

write_rds(fav_food, file = "data/fav-food-clean.rds")

fav_food_clean <- read_rds("data/fav-food-clean.rds")

fav_food_clean |>
  count(ses)

# A tibble: 3 × 2
  ses        n
  <fct>  <int>
1 Low        1
2 Middle     2
3 High       2

Other types of data sources

Other data sources

googlesheets4: Google Sheets
haven: SPSS, Stata, and SAS files
DBI, along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc): allows you to run SQL queries against a database and return a data frame
jsonline: JSON
xml2: xml
rvest: web scraping
httr: web APIs
sparklyr: data loaded into spark

Try it

ae-09 nobels-csv.qmd

Posit Cloud > sales-excel.qmd.
Load the sales.xlsx file from the data-raw/ folder, using appropriate arguments for the read_excel() function such that it looks like the output on the left.
Stretch goal: Manipulate the sales data such that it looks like the output on the right.

# A tibble: 9 × 2
  id      n    
  <chr>   <chr>
1 Brand 1 n    
2 1234    8    
3 8721    2    
4 1822    3    
5 Brand 2 n    
6 3333    1    
7 2156    3    
8 3987    6    
9 3216    5

# A tibble: 7 × 3
  brand      id     n
  <chr>   <dbl> <dbl>
1 Brand 1  1234     8
2 Brand 1  8721     2
3 Brand 1  1822     3
4 Brand 2  3333     1
5 Brand 2  2156     3
6 Brand 2  3987     6
7 Brand 2  3216     5

Importing data

Reading rectangular data into R

Reading rectangular data into R

Functions for reading rectangular data into R

readr

readxl

Reading data

Writing data

Take charge of your files

Working directory

File folders

RStudio projects

Project organization

Projects are key to reproducibility

Projects are key to reproducibility

Try it

ae-09 nobels-csv.qmd

Filtering with %in%

Filtering with %in%

Filtering with %in%

Filtering with %in%

Variable names

Data with bad names

R doesn’t allow spaces in variable names

Option 1 - You define the column names when you read in the data

Option 2 - Format text to snake_case using the clean_names function (from the janitor package)

Variable types

Which type is x? Why?

Option 1. Be explicit about observations that are actually NAs

Option 2. Specify column types

Column types

Case study: favorite foods

Favorite foods

Favorite foods

Variable names

Clean variable names

Handling NAs

Handling NAs

Make age numeric

Make ses factor

Make ses factor

Putting it altogether

Out and back in

But what happened to ses when we read it back in?

read_rds() and write_rds()

Out and back in, take 2

Other types of data sources

Other data sources

Try it

ae-09 nobels-csv.qmd

Filtering with `%in%`

Filtering with `%in%`

Filtering with `%in%`

Filtering with `%in%`

Option 2 - Format text to snake_case using the `clean_names` function (from the `janitor` package)

Which type is `x`? Why?

Make `age` numeric

Make `ses` factor

Make `ses` factor

But what happened to `ses` when we read it back in?

`read_rds()` and `write_rds()`