Iteration

Environmental Data Analysis and Visualization

Performing the same action for multiple “things”

facet_wrap() and facet_grid() draw plots for each subset of data
group_by() plus summarize() computes summary stats for each subset of data

R also has functions that enable us to apply any function across subsets of data or across multiple datasets.

Modifying multiple columns using `dplyr::across()`

We want to perform the same computation across multiple columns

Have: a tibble called df

# A tibble: 10 × 4
        a       b       c      d
    <dbl>   <dbl>   <dbl>  <dbl>
 1  1.40  -2.78    0.0305  0.881
 2 -0.350 -0.145   0.992   0.472
 3 -0.885  1.63    0.0272  0.460
 4  3.12  -0.665  -0.630  -0.462
 5  0.778  0.380  -0.683  -0.266
 6  0.583 -0.462  -0.672  -0.248
 7  1.34   0.102   0.636   2.36 
 8 -1.70  -1.18    0.0962 -0.557
 9 -2.91  -0.186   1.69   -0.748
10  1.24   0.0190  0.420   1.19

Want: the median of each column

# A tibble: 1 × 4
      a      b      c     d
  <dbl>  <dbl>  <dbl> <dbl>
1 0.680 -0.166 0.0633 0.106

We could calculate the median of each column individually

df |> summarize(
  a = median(a),
  b = median(b),
  c = median(c),
  d = median(d))

# A tibble: 1 × 4
      a      b      c     d
  <dbl>  <dbl>  <dbl> <dbl>
1 0.680 -0.166 0.0633 0.106

But we could also use the `across()` function to do it

df |> 
  summarize(
    across(a:d, median))

# A tibble: 1 × 4
      a      b      c     d
  <dbl>  <dbl>  <dbl> <dbl>
1 0.680 -0.166 0.0633 0.106

As with writing functions, iterating across columns in this way makes our code shorter and less prone to copy-and-paste errors.
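The comparison above can be reproduced with a minimal, self-contained sketch. The tibble `df_demo` and its random seed are illustrative assumptions, not the data shown in these slides.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for df: four columns of random normal values
set.seed(42)
df_demo <- tibble(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

# One median() call per column
by_hand <- df_demo |>
  summarize(a = median(a), b = median(b), c = median(c), d = median(d))

# The same computation written once with across()
with_across <- df_demo |>
  summarize(across(a:d, median))

# Both approaches should return the same four medians
all.equal(unlist(by_hand), unlist(with_across))
```

If a fifth column were added later, only the `a:d` selection would need to change in the `across()` version.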

The `across()` function usage

across(.cols, .fns, .names)

.cols: columns to which you want to apply the calculation
.fns: function(s) you want to apply to each column
.names: how you want to name the output columns

Selecting columns (`.cols`)

Only columns a and b: use : to select columns located sequentially in the dataframe

df |> 
  summarize(
    across(a:b, median)
  )

# A tibble: 1 × 2
      a      b
  <dbl>  <dbl>
1 0.680 -0.166

Only columns a and c: use c() to concatenate the column names if they are not sequentially located in the dataframe.

df |> 
  summarize(
    across(c(a, c), median)
  )

# A tibble: 1 × 2
      a      c
  <dbl>  <dbl>
1 0.680 0.0633

All columns: use everything()

df |> 
  summarize(
    across(everything(), median)
  )

# A tibble: 1 × 4
      a      b      c     d
  <dbl>  <dbl>  <dbl> <dbl>
1 0.680 -0.166 0.0633 0.106

Use `where()` to select columns by data type

For data frames with multiple data types, use where() to apply across() only to columns of a certain type.

# A tibble: 53,940 × 7
   carat cut       color clarity depth table price
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  0.23 Ideal     E     SI2      61.5    55   326
 2  0.21 Premium   E     SI1      59.8    61   326
 3  0.23 Good      E     VS1      56.9    65   327
 4  0.29 Premium   I     VS2      62.4    58   334
 5  0.31 Good      J     SI2      63.3    58   335
 6  0.24 Very Good J     VVS2     62.8    57   336
 7  0.24 Very Good I     VVS1     62.3    57   336
 8  0.26 Very Good H     SI1      61.9    55   337
 9  0.22 Fair      E     VS2      65.1    61   337
10  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows

diamonds |> 
  summarize(
    across(where(is.numeric), mean)
  )

# A tibble: 1 × 7
  carat depth table price     x     y     z
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.798  61.7  57.5 3933.  5.73  5.73  3.54

`where()` functions

where(is.numeric) selects all numeric columns.
where(is.character) selects all string columns.
where(is.Date) selects all date columns (is.Date() comes from the lubridate package).
where(is.POSIXct) selects all date-time columns (is.POSIXct() also comes from lubridate).
where(is.logical) selects all logical columns.
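As a small sketch of how these predicates behave, the example below applies two of them to a made-up tibble. The `samples` data and its column names are assumptions chosen for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical data with character, numeric, and logical columns
samples <- tibble(
  site = c("A", "B", "C"),
  temp_c = c(18.2, 19.1, 17.6),
  ph = c(8.05, 8.06, 7.03),
  flagged = c(FALSE, TRUE, FALSE)
)

# Only temp_c and ph are numeric, so only they are summarized
numeric_means <- samples |>
  summarize(across(where(is.numeric), mean))

# Only site is character, so only it is converted to lower case
lower_text <- samples |>
  mutate(across(where(is.character), tolower))

names(numeric_means)
lower_text$site
```

Note that the logical column `flagged` is not picked up by where(is.numeric), because is.numeric() returns FALSE for logical vectors.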

What about when the function requires additional arguments?

# A tibble: 5 × 4
      a      b      c      d
  <dbl>  <dbl>  <dbl>  <dbl>
1  1.91 -1.42  -0.615 -0.221
2 -1.12 -1.28  -1.89   0.383
3 NA     0.227 -0.945 -0.564
4 -1.11  1.30  NA      1.56 
5  1.83 NA     NA      1.02

df_miss |> 
  summarize(
    across(a:d, median))

# A tibble: 1 × 4
      a     b     c     d
  <dbl> <dbl> <dbl> <dbl>
1    NA    NA    NA 0.383

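The output contains missing values because median() returns NA for any column that contains an NA. We need to pass the argument na.rm = TRUE.

Passing na.rm = TRUE directly to across() still works, but as of dplyr 1.1.0 it triggers a deprecation warning: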

df_miss |> 
  summarize(
    across(a:d, median, na.rm = TRUE))

Warning: There was 1 warning in `summarize()`.
ℹ In argument: `across(a:d, median, na.rm = TRUE)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))

# A tibble: 1 × 4
      a      b      c     d
  <dbl>  <dbl>  <dbl> <dbl>
1 0.359 -0.529 -0.945 0.383

You can define an anonymous function “on the fly”: the function is never saved as an object in the global environment.

df_miss |> 
  summarize(
    across(a:d, function(x) median(x, na.rm = TRUE)))

# A tibble: 1 × 4
      a      b      c     d
  <dbl>  <dbl>  <dbl> <dbl>
1 0.359 -0.529 -0.945 0.383

You can shorten your code with the shorthand syntax for anonymous functions (available in R 4.1.0 and later): replace function with \

df_miss |> 
  summarize(
    across(a:d, \(x) median(x, na.rm = TRUE)))

# A tibble: 1 × 4
      a      b      c     d
  <dbl>  <dbl>  <dbl> <dbl>
1 0.359 -0.529 -0.945 0.383
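To check that the two spellings are interchangeable, here is a minimal sketch. The tibble `df_na` is a hypothetical example, not df_miss from the slides.

```r
library(dplyr)
library(tibble)

# Hypothetical data with one missing value per column
df_na <- tibble(a = c(1, NA, 3), b = c(NA, 10, 30))

# Anonymous function written with the function keyword
long_form <- df_na |>
  summarize(across(a:b, function(x) median(x, na.rm = TRUE)))

# The same anonymous function written with the \(x) shorthand
short_form <- df_na |>
  summarize(across(a:b, \(x) median(x, na.rm = TRUE)))

# Both forms drop the NA values before computing the median
all.equal(unlist(long_form), unlist(short_form))
```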

`across()` with multiple functions

Use list() to calculate the median and number of missing values in df_miss

df_miss |> 
  summarize(
    across(a:d, list(
      median = \(x) median(x, na.rm = TRUE),
      n_missing = \(x) sum(is.na(x))
    )))

# A tibble: 1 × 8
  a_median a_n_missing b_median b_n_missing c_median c_n_missing d_median
     <dbl>       <int>    <dbl>       <int>    <dbl>       <int>    <dbl>
1    0.359           1   -0.529           1   -0.945           2    0.383
# ℹ 1 more variable: d_n_missing <int>

Note: By default, across() “glues” the column name and function name together as {.col}_{.fn} because we applied multiple named functions to each column.

Take control of the resulting column names with `.names` argument

Maybe we want the function name to come first, followed by the column name, and we want to separate them with -:

df_miss |> 
  summarize(
    across(a:d, list(
      median = \(x) median(x, na.rm = TRUE),
      n_missing = \(x) sum(is.na(x))),
      .names = "{.fn}-{.col}"
    ))

# A tibble: 1 × 8
  `median-a` `n_missing-a` `median-b` `n_missing-b` `median-c` `n_missing-c`
       <dbl>         <int>      <dbl>         <int>      <dbl>         <int>
1      0.359             1     -0.529             1     -0.945             2
# ℹ 2 more variables: `median-d` <dbl>, `n_missing-d` <int>

Put it all in ""
{.fn} takes the function name
{.col} takes the column name
The - between them separates the function name and the column name with a hyphen
Because - is not allowed in a syntactic R name, the new column names print inside backticks
Customize however you want
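The sketch below compares the default names with a custom `.names` pattern. The data frame `df_na`, the list `fns`, and the pattern "{.fn}_of_{.col}" are all illustrative assumptions.

```r
library(dplyr)
library(tibble)

# Hypothetical data with missing values
df_na <- tibble(a = c(1, NA, 3), b = c(NA, 10, 30))

# A named list of functions; the list names feed {.fn}
fns <- list(
  median = \(x) median(x, na.rm = TRUE),
  n_missing = \(x) sum(is.na(x))
)

# Default naming: column name, underscore, function name
default_names <- df_na |>
  summarize(across(a:b, fns)) |>
  names()

# Custom naming: function name first, then the column name
custom_names <- df_na |>
  summarize(across(a:b, fns, .names = "{.fn}_of_{.col}")) |>
  names()

default_names
custom_names
```

Using an underscore or other syntactic separator in the pattern keeps the resulting names usable without backticks.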

`across()` works with `mutate()` too

df

# A tibble: 10 × 4
        a       b       c      d
    <dbl>   <dbl>   <dbl>  <dbl>
 1  1.40  -2.78    0.0305  0.881
 2 -0.350 -0.145   0.992   0.472
 3 -0.885  1.63    0.0272  0.460
 4  3.12  -0.665  -0.630  -0.462
 5  0.778  0.380  -0.683  -0.266
 6  0.583 -0.462  -0.672  -0.248
 7  1.34   0.102   0.636   2.36 
 8 -1.70  -1.18    0.0962 -0.557
 9 -2.91  -0.186   1.69   -0.748
10  1.24   0.0190  0.420   1.19

df |> 
  mutate(
    across(a:d, \(x) x + 2))

# A tibble: 10 × 4
        a      b     c     d
    <dbl>  <dbl> <dbl> <dbl>
 1  3.40  -0.782  2.03  2.88
 2  1.65   1.85   2.99  2.47
 3  1.12   3.63   2.03  2.46
 4  5.12   1.33   1.37  1.54
 5  2.78   2.38   1.32  1.73
 6  2.58   1.54   1.33  1.75
 7  3.34   2.10   2.64  4.36
 8  0.301  0.819  2.10  1.44
 9 -0.909  1.81   3.69  1.25
10  3.24   2.02   2.42  3.19

Wait, I wanted to keep the original columns!

Specify the .names argument to keep original columns and add new columns.

df |> 
  mutate(
    across(a:d, \(x) x + 2, .names = "{.col}_add_two"))

# A tibble: 10 × 8
        a       b       c      d a_add_two b_add_two c_add_two d_add_two
    <dbl>   <dbl>   <dbl>  <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1  1.40  -2.78    0.0305  0.881     3.40     -0.782      2.03      2.88
 2 -0.350 -0.145   0.992   0.472     1.65      1.85       2.99      2.47
 3 -0.885  1.63    0.0272  0.460     1.12      3.63       2.03      2.46
 4  3.12  -0.665  -0.630  -0.462     5.12      1.33       1.37      1.54
 5  0.778  0.380  -0.683  -0.266     2.78      2.38       1.32      1.73
 6  0.583 -0.462  -0.672  -0.248     2.58      1.54       1.33      1.75
 7  1.34   0.102   0.636   2.36      3.34      2.10       2.64      4.36
 8 -1.70  -1.18    0.0962 -0.557     0.301     0.819      2.10      1.44
 9 -2.91  -0.186   1.69   -0.748    -0.909     1.81       3.69      1.25
10  1.24   0.0190  0.420   1.19      3.24      2.02       2.42      3.19
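A more applied sketch of the same idea is a unit conversion. The `temps` tibble, its column names, and the "_as_f" suffix are hypothetical and only serve to show the difference between overwriting and appending columns.

```r
library(dplyr)
library(tibble)

# Hypothetical temperature readings in degrees Celsius
temps <- tibble(air_c = c(20, 25), water_c = c(10, 15))

# Without .names, the converted values overwrite air_c and water_c
overwritten <- temps |>
  mutate(across(c(air_c, water_c), \(x) x * 9 / 5 + 32))

# With .names, the originals are kept and new columns are appended
kept <- temps |>
  mutate(across(c(air_c, water_c), \(x) x * 9 / 5 + 32, .names = "{.col}_as_f"))

names(overwritten)
names(kept)
```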

Intro to `purrr::map()`: reading and writing multiple files

Reading in multiple files the slow, error-prone way

First read in each file and assign the output to a different object:

ph_a <- read_csv("data/sitenames/2024-05-21_YOR_BR_PH_A.csv")
ph_b <- read_csv("data/sitenames/2024-05-21_YOR_BR_PH_B.csv")
ph_c <- read_csv("data/sitenames/2024-05-21_YOR_BR_PH_C.csv")
ph_d <- read_csv("data/sitenames/2024-05-21_YOR_BR_PH_D.csv")
ph_e <- read_csv("data/sitenames/2024-05-21_YOR_BR_PH_E.csv")

ph_a

# A tibble: 61 × 3
   datetime         ph site 
   <chr>         <dbl> <chr>
 1 5/21/24 9:50   8.05 a    
 2 5/21/24 9:55   8.06 a    
 3 5/21/24 10:00  8.01 a    
 4 5/21/24 10:05  8.05 a    
 5 5/21/24 10:10  8.03 a    
 6 5/21/24 10:15  8.05 a    
 7 5/21/24 10:20  8.05 a    
 8 5/21/24 10:25  8.03 a    
 9 5/21/24 10:30  8.03 a    
10 5/21/24 10:35  8.03 a    
# ℹ 51 more rows

Then combine the dataframe objects into a tibble using bind_rows()

data <- bind_rows(ph_a, ph_b, ph_c, ph_d, ph_e)

data

# A tibble: 305 × 3
   datetime         ph site 
   <chr>         <dbl> <chr>
 1 5/21/24 9:50   8.05 a    
 2 5/21/24 9:55   8.06 a    
 3 5/21/24 10:00  8.01 a    
 4 5/21/24 10:05  8.05 a    
 5 5/21/24 10:10  8.03 a    
 6 5/21/24 10:15  8.05 a    
 7 5/21/24 10:20  8.05 a    
 8 5/21/24 10:25  8.03 a    
 9 5/21/24 10:30  8.03 a    
10 5/21/24 10:35  8.03 a    
# ℹ 295 more rows

Problems with this approach:

verbose: so much code
error-prone: easy to mistype the parts we are changing
clutters the environment: we have to save each file as an object, but we only need the final combined object.

Automate it

The basic steps:

Use list.files() to generate a character vector of the file names in your data folder
Use purrr::map() to read in each file in the file name list
Use list_rbind() to combine the data frames into a single data frame

List files in a directory

paths <- list.files("data/sitenames", full.names = TRUE)

paths

[1] "data/sitenames/2024-05-21_YOR_BR_PH_A.csv"
[2] "data/sitenames/2024-05-21_YOR_BR_PH_B.csv"
[3] "data/sitenames/2024-05-21_YOR_BR_PH_C.csv"
[4] "data/sitenames/2024-05-21_YOR_BR_PH_D.csv"
[5] "data/sitenames/2024-05-21_YOR_BR_PH_E.csv"

Because we use projects, relative paths start from the folder containing our project files
If your data files are in a subfolder, include the subfolder in the path
full.names = TRUE returns each file name with its directory path prepended, so the paths can be passed straight to read_csv()
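list.files() returns every file in the folder, so a folder that also holds non-data files may need a filter. The sketch below uses the `pattern` argument of list.files() for that; the temporary folder and file names are made up for illustration.

```r
# Build a throwaway folder with two .csv files and one unrelated file
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("site_A.csv", "site_B.csv", "notes.txt")))

# Without a pattern, all three files are listed
all_files <- list.files(demo_dir, full.names = TRUE)

# pattern takes a regular expression; "\\.csv$" keeps names ending in .csv
csv_only <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

basename(all_files)
basename(csv_only)
```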

class(paths)

[1] "character"

The output is a character vector of file names (with full directory structure included).

Use `map()` to read in each element of the `paths` vector

files <- map(paths, read_csv)

files[[1]]

# A tibble: 61 × 3
   datetime         ph site 
   <chr>         <dbl> <chr>
 1 5/21/24 9:50   8.05 a    
 2 5/21/24 9:55   8.06 a    
 3 5/21/24 10:00  8.01 a    
 4 5/21/24 10:05  8.05 a    
 5 5/21/24 10:10  8.03 a    
 6 5/21/24 10:15  8.05 a    
 7 5/21/24 10:20  8.05 a    
 8 5/21/24 10:25  8.03 a    
 9 5/21/24 10:30  8.03 a    
10 5/21/24 10:35  8.03 a    
# ℹ 51 more rows

The output is a list

class(files)

[1] "list"

Recall, a list is a vector of data elements. Each element of the list can be anything - a single observation, a vector, a data frame, a plot, etc.
files is a list of data frames.
We can extract an element of a list using [[]]. We place the element number inside the brackets, e.g., files[[3]].
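The difference between double and single brackets is easy to check with a small sketch. The list `files_demo` below is a hypothetical stand-in for files and uses base R data frames.

```r
# Hypothetical list of two data frames, standing in for files
files_demo <- list(
  data.frame(ph = c(8.05, 8.06)),
  data.frame(ph = c(7.03, 7.02, 7.01))
)

# [[ ]] returns the element itself: here, a data frame
second_element <- files_demo[[2]]

# [ ] returns a shorter list that still wraps the element
second_as_list <- files_demo[2]

class(second_element)
class(second_as_list)
nrow(second_element)
```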

Use `list_rbind()` to collapse list elements into a single data frame

files |> 
  list_rbind()

# A tibble: 305 × 3
   datetime         ph site 
   <chr>         <dbl> <chr>
 1 5/21/24 9:50   8.05 a    
 2 5/21/24 9:55   8.06 a    
 3 5/21/24 10:00  8.01 a    
 4 5/21/24 10:05  8.05 a    
 5 5/21/24 10:10  8.03 a    
 6 5/21/24 10:15  8.05 a    
 7 5/21/24 10:20  8.05 a    
 8 5/21/24 10:25  8.03 a    
 9 5/21/24 10:30  8.03 a    
10 5/21/24 10:35  8.03 a    
# ℹ 295 more rows

Put it all together

ph <- list.files("data/sitenames", full.names = TRUE) |> 
  map(read_csv) |> 
  list_rbind()

ph

# A tibble: 305 × 3
   datetime         ph site 
   <chr>         <dbl> <chr>
 1 5/21/24 9:50   8.05 a    
 2 5/21/24 9:55   8.06 a    
 3 5/21/24 10:00  8.01 a    
 4 5/21/24 10:05  8.05 a    
 5 5/21/24 10:10  8.03 a    
 6 5/21/24 10:15  8.05 a    
 7 5/21/24 10:20  8.05 a    
 8 5/21/24 10:25  8.03 a    
 9 5/21/24 10:30  8.03 a    
10 5/21/24 10:35  8.03 a    
# ℹ 295 more rows
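The whole pipeline can be tried without the course data using a self-contained sketch. The temporary folder, the file names, and the pH values are assumptions; show_col_types = FALSE only silences the column-type message that read_csv() prints.

```r
library(purrr)
library(readr)

# Write two small hypothetical files to a temporary folder
demo_dir <- file.path(tempdir(), "ph_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(data.frame(ph = c(8.05, 8.06), site = "a"), file.path(demo_dir, "PH_A.csv"))
write_csv(data.frame(ph = c(7.03, 7.02), site = "b"), file.path(demo_dir, "PH_B.csv"))

# List the files, read each one, and stack the results
ph_demo <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE) |>
  map(\(path) read_csv(path, show_col_types = FALSE)) |>
  list_rbind()

nrow(ph_demo)
```

No intermediate objects such as ph_a or ph_b are created, which addresses the environment clutter noted earlier.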

What if you need to extract data from the file name?

For example, we have another dataset with no site name column, but the site name is contained in the file names.

list.files("data/no_sitenames", full.names = TRUE) |> 
  map(read_csv) |> 
  list_rbind()

# A tibble: 305 × 2
   datetime         ph
   <chr>         <dbl>
 1 5/21/24 9:50   8.05
 2 5/21/24 9:55   8.06
 3 5/21/24 10:00  8.01
 4 5/21/24 10:05  8.05
 5 5/21/24 10:10  8.03
 6 5/21/24 10:15  8.05
 7 5/21/24 10:20  8.05
 8 5/21/24 10:25  8.03
 9 5/21/24 10:30  8.03
10 5/21/24 10:35  8.03
# ℹ 295 more rows

No site name - there’s no way to identify which site the data came from.

Use `set_names()` to extract info from the file names

list.files("data/no_sitenames", full.names = TRUE) |> 
  set_names(basename)

                    2024-05-21_YOR_BR_PH_A.csv 
"data/no_sitenames/2024-05-21_YOR_BR_PH_A.csv" 
                    2024-05-21_YOR_BR_PH_B.csv 
"data/no_sitenames/2024-05-21_YOR_BR_PH_B.csv" 
                    2024-05-21_YOR_BR_PH_C.csv 
"data/no_sitenames/2024-05-21_YOR_BR_PH_C.csv" 
                    2024-05-21_YOR_BR_PH_D.csv 
"data/no_sitenames/2024-05-21_YOR_BR_PH_D.csv" 
                    2024-05-21_YOR_BR_PH_E.csv 
"data/no_sitenames/2024-05-21_YOR_BR_PH_E.csv"

Each element of the vector is still a full file path. The “basenames” printed above each path are the names of the elements.

Now use map() to read in the data files

list.files("data/no_sitenames", full.names = TRUE) |> 
  set_names(basename) |> 
  map(read_csv)

$`2024-05-21_YOR_BR_PH_A.csv`
# A tibble: 61 × 2
   datetime         ph
   <chr>         <dbl>
 1 5/21/24 9:50   8.05
 2 5/21/24 9:55   8.06
 3 5/21/24 10:00  8.01
 4 5/21/24 10:05  8.05
 5 5/21/24 10:10  8.03
 6 5/21/24 10:15  8.05
 7 5/21/24 10:20  8.05
 8 5/21/24 10:25  8.03
 9 5/21/24 10:30  8.03
10 5/21/24 10:35  8.03
# ℹ 51 more rows

$`2024-05-21_YOR_BR_PH_B.csv`
# A tibble: 61 × 2
   datetime         ph
   <chr>         <dbl>
 1 5/21/24 9:50   8.06
 2 5/21/24 9:55   8.07
 3 5/21/24 10:00  8.03
 4 5/21/24 10:05  8.07
 5 5/21/24 10:10  8.06
 6 5/21/24 10:15  8.07
 7 5/21/24 10:20  8.07
 8 5/21/24 10:25  8.04
 9 5/21/24 10:30  8.05
10 5/21/24 10:35  8.05
# ℹ 51 more rows

$`2024-05-21_YOR_BR_PH_C.csv`
# A tibble: 61 × 2
   datetime         ph
   <chr>         <dbl>
 1 5/22/24 9:50   7.03
 2 5/22/24 9:55   7.02
 3 5/22/24 10:00  7.02
 4 5/22/24 10:05  7.02
 5 5/22/24 10:10  7.01
 6 5/22/24 10:15  7.01
 7 5/22/24 10:20  7.01
 8 5/22/24 10:25  7.01
 9 5/22/24 10:30  7.01
10 5/22/24 10:35  7   
# ℹ 51 more rows

$`2024-05-21_YOR_BR_PH_D.csv`
# A tibble: 61 × 2
   datetime         ph
   <chr>         <dbl>
 1 5/21/24 9:50   7.98
 2 5/21/24 9:55   8.02
 3 5/21/24 10:00  7.97
 4 5/21/24 10:05  8.03
 5 5/21/24 10:10  7.99
 6 5/21/24 10:15  8.02
 7 5/21/24 10:20  8.04
 8 5/21/24 10:25  7.99
 9 5/21/24 10:30  8   
10 5/21/24 10:35  8.01
# ℹ 51 more rows

$`2024-05-21_YOR_BR_PH_E.csv`
# A tibble: 61 × 2
   datetime         ph
   <chr>         <dbl>
 1 5/21/24 9:50   8.03
 2 5/21/24 9:55   8.06
 3 5/21/24 10:00  8.01
 4 5/21/24 10:05  8.06
 5 5/21/24 10:10  8.03
 6 5/21/24 10:15  8.05
 7 5/21/24 10:20  8.06
 8 5/21/24 10:25  8.03
 9 5/21/24 10:30  8.01
10 5/21/24 10:35  8.01
# ℹ 51 more rows

The output is a named list of data frames, one per file.
The “name” of each element is the name of the .csv file it was read from.

Use `set_names()` and `mutate()` to extract info from the file names

Use the names_to argument in list_rbind() to create a site column that is populated with the corresponding list element names

list.files("data/no_sitenames", full.names = TRUE) |> 
  set_names(basename) |> 
  map(read_csv) |> 
  list_rbind(names_to = "site")

# A tibble: 305 × 3
   site                       datetime         ph
   <chr>                      <chr>         <dbl>
 1 2024-05-21_YOR_BR_PH_A.csv 5/21/24 9:50   8.05
 2 2024-05-21_YOR_BR_PH_A.csv 5/21/24 9:55   8.06
 3 2024-05-21_YOR_BR_PH_A.csv 5/21/24 10:00  8.01
 4 2024-05-21_YOR_BR_PH_A.csv 5/21/24 10:05  8.05
 5 2024-05-21_YOR_BR_PH_A.csv 5/21/24 10:10  8.03
 6 2024-05-21_YOR_BR_PH_A.csv 5/21/24 10:15  8.05
 7 2024-05-21_YOR_BR_PH_A.csv 5/21/24 10:20  8.05
 8 2024-05-21_YOR_BR_PH_A.csv 5/21/24 10:25  8.03
 9 2024-05-21_YOR_BR_PH_A.csv 5/21/24 10:30  8.03
10 2024-05-21_YOR_BR_PH_A.csv 5/21/24 10:35  8.03
# ℹ 295 more rows
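Here is a self-contained sketch of the set_names() and names_to combination. The temporary folder, file names, and pH values are hypothetical.

```r
library(purrr)
library(readr)

# Write two small hypothetical files that have no site column
demo_dir <- file.path(tempdir(), "ph_named_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(data.frame(ph = c(8.05, 8.06)), file.path(demo_dir, "PH_A.csv"))
write_csv(data.frame(ph = c(7.03, 7.02)), file.path(demo_dir, "PH_B.csv"))

# Name each path by its file name, read the files, then keep the names
ph_named <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE) |>
  set_names(basename) |>
  map(\(path) read_csv(path, show_col_types = FALSE)) |>
  list_rbind(names_to = "site")

unique(ph_named$site)
```

map() carries the names of its input over to its output, which is why naming the paths before reading is enough.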

The site column is too long, but we can fix that

Use substr() to extract the 22nd character from the site names

list.files("data/no_sitenames", full.names = TRUE) |> 
  set_names(basename) |> 
  map(read_csv) |> 
  list_rbind(names_to = "site") |> 
  mutate(site = substr(site, 22, 22))

# A tibble: 305 × 3
   site  datetime         ph
   <chr> <chr>         <dbl>
 1 A     5/21/24 9:50   8.05
 2 A     5/21/24 9:55   8.06
 3 A     5/21/24 10:00  8.01
 4 A     5/21/24 10:05  8.05
 5 A     5/21/24 10:10  8.03
 6 A     5/21/24 10:15  8.05
 7 A     5/21/24 10:20  8.05
 8 A     5/21/24 10:25  8.03
 9 A     5/21/24 10:30  8.03
10 A     5/21/24 10:35  8.03
# ℹ 295 more rows

Use `separate_wider_delim()` to keep and separate all elements of the file names

list.files("data/no_sitenames", full.names = TRUE) |> 
  set_names(basename) |> 
  map(read_csv) |> 
  list_rbind(names_to = "file") |> 
  separate_wider_delim(file, delim = "_", 
                       names = c("date", "river", "salinity", "param", "chamber"))

# A tibble: 305 × 7
   date       river salinity param chamber datetime         ph
   <chr>      <chr> <chr>    <chr> <chr>   <chr>         <dbl>
 1 2024-05-21 YOR   BR       PH    A.csv   5/21/24 9:50   8.05
 2 2024-05-21 YOR   BR       PH    A.csv   5/21/24 9:55   8.06
 3 2024-05-21 YOR   BR       PH    A.csv   5/21/24 10:00  8.01
 4 2024-05-21 YOR   BR       PH    A.csv   5/21/24 10:05  8.05
 5 2024-05-21 YOR   BR       PH    A.csv   5/21/24 10:10  8.03
 6 2024-05-21 YOR   BR       PH    A.csv   5/21/24 10:15  8.05
 7 2024-05-21 YOR   BR       PH    A.csv   5/21/24 10:20  8.05
 8 2024-05-21 YOR   BR       PH    A.csv   5/21/24 10:25  8.03
 9 2024-05-21 YOR   BR       PH    A.csv   5/21/24 10:30  8.03
10 2024-05-21 YOR   BR       PH    A.csv   5/21/24 10:35  8.03
# ℹ 295 more rows

Close - we need to get rid of the “.csv”

list.files("data/no_sitenames", full.names = TRUE) |> 
  set_names(basename) |> 
  map(read_csv) |> 
  list_rbind(names_to = "file") |> 
  separate_wider_delim(file, delim = "_", 
                       names = c("date", "river", "salinity", "param", "site")) |> 
  mutate(site = substr(site, 1, 1))

# A tibble: 305 × 7
   date       river salinity param site  datetime         ph
   <chr>      <chr> <chr>    <chr> <chr> <chr>         <dbl>
 1 2024-05-21 YOR   BR       PH    A     5/21/24 9:50   8.05
 2 2024-05-21 YOR   BR       PH    A     5/21/24 9:55   8.06
 3 2024-05-21 YOR   BR       PH    A     5/21/24 10:00  8.01
 4 2024-05-21 YOR   BR       PH    A     5/21/24 10:05  8.05
 5 2024-05-21 YOR   BR       PH    A     5/21/24 10:10  8.03
 6 2024-05-21 YOR   BR       PH    A     5/21/24 10:15  8.05
 7 2024-05-21 YOR   BR       PH    A     5/21/24 10:20  8.05
 8 2024-05-21 YOR   BR       PH    A     5/21/24 10:25  8.03
 9 2024-05-21 YOR   BR       PH    A     5/21/24 10:30  8.03
10 2024-05-21 YOR   BR       PH    A     5/21/24 10:35  8.03
# ℹ 295 more rows
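substr(site, 1, 1) relies on every site code being exactly one character long. As one possible alternative, not the approach used in these slides, the extension can be stripped before splitting with tools::file_path_sans_ext(), which ships with R. The tibble `files_tbl` is a hypothetical stand-in that holds only the file names.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in: just the file-name column, using names from the slides
files_tbl <- tibble(
  file = c("2024-05-21_YOR_BR_PH_A.csv", "2024-05-21_YOR_BR_PH_B.csv")
)

# Drop the ".csv" extension first, then split on the underscores
parsed <- files_tbl |>
  mutate(file = tools::file_path_sans_ext(file)) |>
  separate_wider_delim(file, delim = "_",
                       names = c("date", "river", "salinity", "param", "site"))

parsed$site
```

This version would also handle longer site codes, since it does not depend on character positions.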

Assign the output to an object

ph <- list.files("data/no_sitenames", full.names = TRUE) |> 
  set_names(basename) |> 
  map(read_csv) |> 
  list_rbind(names_to = "file") |> 
  separate_wider_delim(file, delim = "_", 
                       names = c("date", "river", "salinity", "param", "site")) |> 
  mutate(site = substr(site, 1, 1))

Save so we can start our analysis in a fresh .qmd

write_csv(ph, file = "data/ph_combined.csv")

You can use `map()` (and the family of other `map()` functions) to perform almost any operation across multiple elements of a vector or list. More to come.
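As one example of another member of the family, purrr::walk2() can write several files in one call. This is a sketch under assumptions: the data frame `ph_small`, the output folder, and the file-name pattern are all hypothetical.

```r
library(purrr)
library(readr)

# Hypothetical combined data with a site column
ph_small <- data.frame(site = c("A", "A", "B"), ph = c(8.05, 8.06, 7.03))

# Split into a named list with one data frame per site
by_site <- split(ph_small, ph_small$site)

# Build one output path per site inside a temporary folder
out_dir <- file.path(tempdir(), "ph_by_site")
dir.create(out_dir, showWarnings = FALSE)
out_paths <- file.path(out_dir, paste0("ph_", names(by_site), ".csv"))

# walk2() pairs each data frame with its path and calls write_csv() on each pair
walk2(by_site, out_paths, \(data, path) write_csv(data, path))

list.files(out_dir)
```

walk2() is used instead of map2() here because write_csv() is called for its side effect (the file on disk) rather than for a return value.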
