Iteration



Environmental Data Analysis and Visualization

Performing the same action for multiple “things”

  • facet_wrap() and facet_grid() draw plots for each subset of data
  • group_by() plus summarize() computes summary stats for each subset of data

R also has functions that enable us to apply any function across subsets of data or across multiple datasets.

Modifying multiple columns using dplyr::across()

We want to perform the same computation across multiple columns

Have: a tibble called df

# A tibble: 10 × 4
        a       b       c      d
    <dbl>   <dbl>   <dbl>  <dbl>
 1  1.40  -2.78    0.0305  0.881
 2 -0.350 -0.145   0.992   0.472
 3 -0.885  1.63    0.0272  0.460
 4  3.12  -0.665  -0.630  -0.462
 5  0.778  0.380  -0.683  -0.266
 6  0.583 -0.462  -0.672  -0.248
 7  1.34   0.102   0.636   2.36 
 8 -1.70  -1.18    0.0962 -0.557
 9 -2.91  -0.186   1.69   -0.748
10  1.24   0.0190  0.420   1.19 

Want: the median of each column

# A tibble: 1 × 4
      a      b      c     d
  <dbl>  <dbl>  <dbl> <dbl>
1 0.680 -0.166 0.0633 0.106

We could calculate the median of each column individually

df |> summarize(
  a = median(a),
  b = median(b),
  c = median(c),
  d = median(d))
# A tibble: 1 × 4
      a      b      c     d
  <dbl>  <dbl>  <dbl> <dbl>
1 0.680 -0.166 0.0633 0.106

But we could also use the across() function to do it

df |> 
  summarize(
    across(a:d, median))
# A tibble: 1 × 4
      a      b      c     d
  <dbl>  <dbl>  <dbl> <dbl>
1 0.680 -0.166 0.0633 0.106


As with functions, iterating across columns in this way makes our code shorter and less prone to copy-and-paste errors.

The across() function usage

across(.cols, .fns, .names)

  • .cols: columns to which you want to apply the calculation
  • .fns: function(s) you want to apply to each column
  • .names: how you want to name the output columns
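
The three arguments can be seen working together in a minimal sketch. The tibble toy and its values are made up for illustration and are not part of the course data.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  a = c(1, 2, 3),
  b = c(10, 20, 30),
  label = c("x", "y", "z")
)

toy_means <- toy |>
  summarize(
    across(
      .cols = a:b,             # which columns to use
      .fns = mean,             # what to compute for each column
      .names = "mean_{.col}"   # how to name the results
    )
  )

toy_means
```

The result should have one row and two columns, mean_a and mean_b; the label column is left out because it is not in .cols.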

Selecting columns (.cols)

Only columns a and b: use : to select columns located sequentially in the dataframe

df |> 
  summarize(
    across(a:b, median)
  )
# A tibble: 1 × 2
      a      b
  <dbl>  <dbl>
1 0.680 -0.166

Only columns a and c: use c() to concatenate the column names if they are not sequentially located in the dataframe.

df |> 
  summarize(
    across(c(a, c), median)
  )
# A tibble: 1 × 2
      a      c
  <dbl>  <dbl>
1 0.680 0.0633

All columns: use everything()

df |> 
  summarize(
    across(everything(), median)
  )
# A tibble: 1 × 4
      a      b      c     d
  <dbl>  <dbl>  <dbl> <dbl>
1 0.680 -0.166 0.0633 0.106

Use where() to select columns by data type

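For data frames with multiple data types, use where() inside across() to apply a function only to columns of a certain type. The example uses the diamonds dataset from ggplot2; the preview shows its first seven columns, and the full dataset also has the numeric columns x, y, and z.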

# A tibble: 53,940 × 7
   carat cut       color clarity depth table price
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  0.23 Ideal     E     SI2      61.5    55   326
 2  0.21 Premium   E     SI1      59.8    61   326
 3  0.23 Good      E     VS1      56.9    65   327
 4  0.29 Premium   I     VS2      62.4    58   334
 5  0.31 Good      J     SI2      63.3    58   335
 6  0.24 Very Good J     VVS2     62.8    57   336
 7  0.24 Very Good I     VVS1     62.3    57   336
 8  0.26 Very Good H     SI1      61.9    55   337
 9  0.22 Fair      E     VS2      65.1    61   337
10  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows
diamonds |> 
  summarize(
    across(where(is.numeric), mean)
  )
# A tibble: 1 × 7
  carat depth table price     x     y     z
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.798  61.7  57.5 3933.  5.73  5.73  3.54

where() functions

  • where(is.numeric) selects all numeric columns.
  • where(is.character) selects all string columns.
  • where(is.Date) selects all date columns (is.Date() comes from lubridate).
  • where(is.POSIXct) selects all date-time columns (is.POSIXct() also comes from lubridate).
  • where(is.logical) selects all logical columns.
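
As a rough illustration of type-based selection, the sketch below applies one function to the numeric columns and another to the character columns. The tibble sites, its column names, and its values are hypothetical.

```r
library(dplyr)

# Hypothetical mixed-type data, used only for illustration
sites <- tibble(
  site = c("A", "B", "C"),
  ph = c(8.05, 8.06, 7.03),
  temp_c = c(21.5, 22.0, 20.8),
  flagged = c(FALSE, FALSE, TRUE)
)

# Summarize only the numeric columns (ph and temp_c)
numeric_means <- sites |>
  summarize(across(where(is.numeric), mean))

# Modify only the character columns (site)
sites_lower <- sites |>
  mutate(across(where(is.character), tolower))
```

Note that the logical column flagged is not picked up by where(is.numeric), so it is left out of numeric_means.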

What about when the function requires additional arguments?

# A tibble: 5 × 4
      a      b      c      d
  <dbl>  <dbl>  <dbl>  <dbl>
1  1.91 -1.42  -0.615 -0.221
2 -1.12 -1.28  -1.89   0.383
3 NA     0.227 -0.945 -0.564
4 -1.11  1.30  NA      1.56 
5  1.83 NA     NA      1.02 
df_miss |> 
  summarize(
    across(a:d, median))
# A tibble: 1 × 4
      a     b     c     d
  <dbl> <dbl> <dbl> <dbl>
1    NA    NA    NA 0.383


median() returns NA for every column that contains a missing value, so we need to pass the argument na.rm = TRUE.

Passing na.rm = TRUE directly to across() still works, but it was deprecated in dplyr 1.1.0 and now produces a warning:

df_miss |> 
  summarize(
    across(a:d, median, na.rm = TRUE))
Warning: There was 1 warning in `summarize()`.
ℹ In argument: `across(a:d, median, na.rm = TRUE)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))
# A tibble: 1 × 4
      a      b      c     d
  <dbl>  <dbl>  <dbl> <dbl>
1 0.359 -0.529 -0.945 0.383

Instead, define a function “on the fly” inside across(). The function is not saved as an object in the global environment.

df_miss |> 
  summarize(
    across(a:d, function(x) median(x, na.rm = TRUE)))
# A tibble: 1 × 4
      a      b      c     d
  <dbl>  <dbl>  <dbl> <dbl>
1 0.359 -0.529 -0.945 0.383

A function defined on the fly like this is called an “anonymous function”. You can shorten it by replacing function with \ (a shorthand available in R 4.1.0 and later):

df_miss |> 
  summarize(
    across(a:d, \(x) median(x, na.rm = TRUE)))
# A tibble: 1 × 4
      a      b      c     d
  <dbl>  <dbl>  <dbl> <dbl>
1 0.359 -0.529 -0.945 0.383
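
The sketch below compares three spellings that should behave the same way: the long anonymous function, the \(x) shorthand, and a named helper. The tibble toy_miss and the helper name median_narm are hypothetical, chosen only for this example.

```r
library(dplyr)

# Hypothetical toy data with missing values
toy_miss <- tibble(a = c(1, NA, 3), b = c(NA, 4, 10))

# Hypothetical named helper: defined once, reusable in several across() calls
median_narm <- function(x) median(x, na.rm = TRUE)

res_long <- toy_miss |>
  summarize(across(a:b, function(x) median(x, na.rm = TRUE)))

res_short <- toy_miss |>
  summarize(across(a:b, \(x) median(x, na.rm = TRUE)))

res_named <- toy_miss |>
  summarize(across(a:b, median_narm))

# Compare the three results
all.equal(res_long, res_short)
all.equal(res_long, res_named)
```

A named helper is worth the extra line when the same computation is needed in several places; otherwise the shorthand keeps the code compact.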

across() with multiple functions

Use list() to calculate the median and number of missing values in df_miss

df_miss |> 
  summarize(
    across(a:d, list(
      median = \(x) median(x, na.rm = TRUE),
      n_missing = \(x) sum(is.na(x))
    )))
# A tibble: 1 × 8
  a_median a_n_missing b_median b_n_missing c_median c_n_missing d_median
     <dbl>       <int>    <dbl>       <int>    <dbl>       <int>    <dbl>
1    0.359           1   -0.529           1   -0.945           2    0.383
# ℹ 1 more variable: d_n_missing <int>

Note: By default, across() names each output column {.col}_{.fn}. It “glues” the column name and function name together because we applied multiple functions to each column.

Take control of the resulting column names with .names argument

Maybe we want the function name to come first, followed by the column name, separated by -:

df_miss |> 
  summarize(
    across(a:d, list(
      median = \(x) median(x, na.rm = TRUE),
      n_missing = \(x) sum(is.na(x))
    ), .names = "{.fn}-{.col}"))
# A tibble: 1 × 8
  `median-a` `n_missing-a` `median-b` `n_missing-b` `median-c` `n_missing-c`
       <dbl>         <int>      <dbl>         <int>      <dbl>         <int>
1      0.359             1     -0.529             1     -0.945             2
# ℹ 2 more variables: `median-d` <dbl>, `n_missing-d` <int>


  • Put the whole specification in quotes ("")
  • {.fn} is replaced by the function name
  • {.col} is replaced by the column name
  • - is the separator between the function name and the column name. Because a hyphen is not valid in a syntactic R name, the new names print in backticks; use _ (.names = "{.fn}_{.col}") to avoid that
  • Customize however you want
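
To see only the effect of .names, the following sketch compares the default column names with a custom specification that uses an underscore as the separator. The tibble toy and the list fns are hypothetical.

```r
library(dplyr)

# Hypothetical toy data and a named list of functions
toy <- tibble(a = c(1, 2, 3), b = c(4, 5, 6))
fns <- list(min = min, max = max)

# Default naming: column name first, then function name
default_names <- toy |>
  summarize(across(a:b, fns)) |>
  names()

# Custom naming: function name first, then column name
custom_names <- toy |>
  summarize(across(a:b, fns, .names = "{.fn}_{.col}")) |>
  names()

default_names
custom_names
```

The names in the list (min and max here) are what {.fn} picks up, so it pays to name the list elements clearly.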

across() works with mutate() too

df
# A tibble: 10 × 4
        a       b       c      d
    <dbl>   <dbl>   <dbl>  <dbl>
 1  1.40  -2.78    0.0305  0.881
 2 -0.350 -0.145   0.992   0.472
 3 -0.885  1.63    0.0272  0.460
 4  3.12  -0.665  -0.630  -0.462
 5  0.778  0.380  -0.683  -0.266
 6  0.583 -0.462  -0.672  -0.248
 7  1.34   0.102   0.636   2.36 
 8 -1.70  -1.18    0.0962 -0.557
 9 -2.91  -0.186   1.69   -0.748
10  1.24   0.0190  0.420   1.19 
df |> 
  mutate(
    across(a:d, \(x) x + 2))
# A tibble: 10 × 4
        a      b     c     d
    <dbl>  <dbl> <dbl> <dbl>
 1  3.40  -0.782  2.03  2.88
 2  1.65   1.85   2.99  2.47
 3  1.12   3.63   2.03  2.46
 4  5.12   1.33   1.37  1.54
 5  2.78   2.38   1.32  1.73
 6  2.58   1.54   1.33  1.75
 7  3.34   2.10   2.64  4.36
 8  0.301  0.819  2.10  1.44
 9 -0.909  1.81   3.69  1.25
10  3.24   2.02   2.42  3.19

Wait, I wanted to keep the original columns!

Specify the .names argument to keep original columns and add new columns.

df |> 
  mutate(
    across(a:d, \(x) x + 2, .names = "{.col}_add_two"))
# A tibble: 10 × 8
        a       b       c      d a_add_two b_add_two c_add_two d_add_two
    <dbl>   <dbl>   <dbl>  <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1  1.40  -2.78    0.0305  0.881     3.40     -0.782      2.03      2.88
 2 -0.350 -0.145   0.992   0.472     1.65      1.85       2.99      2.47
 3 -0.885  1.63    0.0272  0.460     1.12      3.63       2.03      2.46
 4  3.12  -0.665  -0.630  -0.462     5.12      1.33       1.37      1.54
 5  0.778  0.380  -0.683  -0.266     2.78      2.38       1.32      1.73
 6  0.583 -0.462  -0.672  -0.248     2.58      1.54       1.33      1.75
 7  1.34   0.102   0.636   2.36      3.34      2.10       2.64      4.36
 8 -1.70  -1.18    0.0962 -0.557     0.301     0.819      2.10      1.44
 9 -2.91  -0.186   1.69   -0.748    -0.909     1.81       3.69      1.25
10  1.24   0.0190  0.420   1.19      3.24      2.02       2.42      3.19
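
The same pattern combines with where(). As a hedged sketch, the code below standardizes every numeric column and stores the results in new columns. The tibble toy and the suffix _z are illustrative choices, not part of the course data.

```r
library(dplyr)

# Hypothetical toy data with an identifier column
toy <- tibble(
  id = c("s1", "s2", "s3"),
  a = c(1, 2, 3),
  b = c(10, 20, 60)
)

# Standardize each numeric column: subtract the mean, divide by the sd
toy_scaled <- toy |>
  mutate(
    across(
      where(is.numeric),
      \(x) (x - mean(x)) / sd(x),
      .names = "{.col}_z"
    )
  )

toy_scaled
```

The id column is untouched because it is not numeric, and the original a and b columns are kept because .names sends the results to new columns.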

Intro to purrr::map(): reading and writing multiple files

Reading in multiple files the slow, error-prone way

First read in each file and assign the output to a different object:

ph_a <- read_csv("data/sitenames/2024-05-21_YOR_BR_PH_A.csv")
ph_b <- read_csv("data/sitenames/2024-05-21_YOR_BR_PH_B.csv")
ph_c <- read_csv("data/sitenames/2024-05-21_YOR_BR_PH_C.csv")
ph_d <- read_csv("data/sitenames/2024-05-21_YOR_BR_PH_D.csv")
ph_e <- read_csv("data/sitenames/2024-05-21_YOR_BR_PH_E.csv")

ph_a
# A tibble: 61 × 3
   datetime         ph site 
   <chr>         <dbl> <chr>
 1 5/21/24 9:50   8.05 a    
 2 5/21/24 9:55   8.06 a    
 3 5/21/24 10:00  8.01 a    
 4 5/21/24 10:05  8.05 a    
 5 5/21/24 10:10  8.03 a    
 6 5/21/24 10:15  8.05 a    
 7 5/21/24 10:20  8.05 a    
 8 5/21/24 10:25  8.03 a    
 9 5/21/24 10:30  8.03 a    
10 5/21/24 10:35  8.03 a    
# ℹ 51 more rows

Then combine the dataframe objects into a tibble using bind_rows()

data <- bind_rows(ph_a, ph_b, ph_c, ph_d, ph_e)

data
# A tibble: 305 × 3
   datetime         ph site 
   <chr>         <dbl> <chr>
 1 5/21/24 9:50   8.05 a    
 2 5/21/24 9:55   8.06 a    
 3 5/21/24 10:00  8.01 a    
 4 5/21/24 10:05  8.05 a    
 5 5/21/24 10:10  8.03 a    
 6 5/21/24 10:15  8.05 a    
 7 5/21/24 10:20  8.05 a    
 8 5/21/24 10:25  8.03 a    
 9 5/21/24 10:30  8.03 a    
10 5/21/24 10:35  8.03 a    
# ℹ 295 more rows

Problems with this approach:

  • verbose: so much code
  • error-prone: easy to mistype the parts we are changing
  • clutters the environment: we have to save each file as an object, but we only need the final combined object.

Automate it

The basic steps:

  1. Use list.files() to generate a character vector of file names in your data folder
  2. Use purrr::map() to read in each file named in that vector
  3. Use list_rbind() to combine the data frames into a single data frame
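
The three steps above can be tried without any course data. This sketch first writes three small CSV files to a temporary folder and then reads them back. The folder tmp_dir, the file names, and the values are all made up for the example.

```r
library(readr)
library(purrr)

# Set-up: write three small, hypothetical CSV files to a temporary folder
tmp_dir <- tempfile("iteration_demo_")
dir.create(tmp_dir)
for (s in c("A", "B", "C")) {
  write_csv(
    data.frame(ph = c(8.0, 8.1), site = s),
    file.path(tmp_dir, paste0("demo_PH_", s, ".csv"))
  )
}

# Step 1: character vector of file paths
paths <- list.files(tmp_dir, full.names = TRUE)

# Step 2: read each file; the result is a list of data frames
files <- map(paths, \(p) read_csv(p, show_col_types = FALSE))

# Step 3: stack the list elements into one data frame
combined <- list_rbind(files)

combined
```

Here show_col_types = FALSE only silences the column-specification message that read_csv() would otherwise print once per file.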

List files in a directory


paths <- list.files("data/sitenames", full.names = TRUE)

paths
[1] "data/sitenames/2024-05-21_YOR_BR_PH_A.csv"
[2] "data/sitenames/2024-05-21_YOR_BR_PH_B.csv"
[3] "data/sitenames/2024-05-21_YOR_BR_PH_C.csv"
[4] "data/sitenames/2024-05-21_YOR_BR_PH_D.csv"
[5] "data/sitenames/2024-05-21_YOR_BR_PH_E.csv"
  • Because we use projects, R knows the base folder is the folder containing our project files
  • If your data files are in a subfolder, specify the folder structure
  • full.names = TRUE returns each file name with its directory path prepended, so read_csv() can find the file
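
If a folder holds other files besides the data, the pattern argument of list.files() can restrict the result to matching names. The sketch below is a small demonstration in a temporary folder; demo_dir and the file names are hypothetical.

```r
# Set-up: a temporary folder with two CSV files and one text file (all empty)
demo_dir <- tempfile("list_files_demo_")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("site_A.csv", "site_B.csv", "notes.txt")))

# Without full.names, only the file names are returned
all_names <- list.files(demo_dir)

# pattern is a regular expression; "\\.csv$" keeps names ending in .csv
csv_paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

all_names
basename(csv_paths)
```

Filtering at this stage keeps a stray file such as a README from being passed to read_csv() later.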

class(paths)
[1] "character"

The output is a character vector of file names (with full directory structure included).

Use map() to read in each element of the paths vector

files <- map(paths, read_csv)

files[[1]]
# A tibble: 61 × 3
   datetime         ph site 
   <chr>         <dbl> <chr>
 1 5/21/24 9:50   8.05 a    
 2 5/21/24 9:55   8.06 a    
 3 5/21/24 10:00  8.01 a    
 4 5/21/24 10:05  8.05 a    
 5 5/21/24 10:10  8.03 a    
 6 5/21/24 10:15  8.05 a    
 7 5/21/24 10:20  8.05 a    
 8 5/21/24 10:25  8.03 a    
 9 5/21/24 10:30  8.03 a    
10 5/21/24 10:35  8.03 a    
# ℹ 51 more rows

The output is a list

class(files)
[1] "list"
  • Recall, a list is a vector of data elements. Each element of the list can be anything - a single observation, a vector, a data frame, a plot, etc.
  • files is a list of data frames.
  • We can extract an element of a list using [[]], placing the element number inside the brackets, e.g., files[[3]].
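
The difference between double and single brackets is easy to check on a small list. The list x below is hypothetical.

```r
# Hypothetical list whose elements have different types
x <- list(1:3, "a", data.frame(ph = 8.05))

# [[ ]] returns the element itself
first_element <- x[[1]]
class(first_element)

# [ ] returns a shorter list that still wraps the element
first_sublist <- x[1]
class(first_sublist)
```

This is why files[[1]] prints a tibble, whereas files[1] would print a list of length one containing that tibble.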

Use list_rbind() to collapse list elements into a single data frame

files |> 
  list_rbind()
# A tibble: 305 × 3
   datetime         ph site 
   <chr>         <dbl> <chr>
 1 5/21/24 9:50   8.05 a    
 2 5/21/24 9:55   8.06 a    
 3 5/21/24 10:00  8.01 a    
 4 5/21/24 10:05  8.05 a    
 5 5/21/24 10:10  8.03 a    
 6 5/21/24 10:15  8.05 a    
 7 5/21/24 10:20  8.05 a    
 8 5/21/24 10:25  8.03 a    
 9 5/21/24 10:30  8.03 a    
10 5/21/24 10:35  8.03 a    
# ℹ 295 more rows

Put it all together

ph <- list.files("data/sitenames", full.names = TRUE) |> 
  map(read_csv) |> 
  list_rbind()

ph
# A tibble: 305 × 3
   datetime         ph site 
   <chr>         <dbl> <chr>
 1 5/21/24 9:50   8.05 a    
 2 5/21/24 9:55   8.06 a    
 3 5/21/24 10:00  8.01 a    
 4 5/21/24 10:05  8.05 a    
 5 5/21/24 10:10  8.03 a    
 6 5/21/24 10:15  8.05 a    
 7 5/21/24 10:20  8.05 a    
 8 5/21/24 10:25  8.03 a    
 9 5/21/24 10:30  8.03 a    
10 5/21/24 10:35  8.03 a    
# ℹ 295 more rows

What if you need to extract data from the file name?

For example, we have another dataset with no site name column; the site name appears only in the file names.

list.files("data/no_sitenames", full.names = TRUE) |> 
  map(read_csv) |> 
  list_rbind()
# A tibble: 305 × 2
   datetime         ph
   <chr>         <dbl>
 1 5/21/24 9:50   8.05
 2 5/21/24 9:55   8.06
 3 5/21/24 10:00  8.01
 4 5/21/24 10:05  8.05
 5 5/21/24 10:10  8.03
 6 5/21/24 10:15  8.05
 7 5/21/24 10:20  8.05
 8 5/21/24 10:25  8.03
 9 5/21/24 10:30  8.03
10 5/21/24 10:35  8.03
# ℹ 295 more rows


There is no site column, so there’s no way to identify which site each row came from.

Use set_names() to extract info from the file names

list.files("data/no_sitenames", full.names = TRUE) |> 
  set_names(basename) 
                    2024-05-21_YOR_BR_PH_A.csv 
"data/no_sitenames/2024-05-21_YOR_BR_PH_A.csv" 
                    2024-05-21_YOR_BR_PH_B.csv 
"data/no_sitenames/2024-05-21_YOR_BR_PH_B.csv" 
                    2024-05-21_YOR_BR_PH_C.csv 
"data/no_sitenames/2024-05-21_YOR_BR_PH_C.csv" 
                    2024-05-21_YOR_BR_PH_D.csv 
"data/no_sitenames/2024-05-21_YOR_BR_PH_D.csv" 
                    2024-05-21_YOR_BR_PH_E.csv 
"data/no_sitenames/2024-05-21_YOR_BR_PH_E.csv" 

The full file paths are vector elements. The “basenames” above are the names of each element.
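
Because set_names() only works on the character vector, it can be tried without any files on disk. In this sketch the paths are typed by hand rather than produced by list.files().

```r
library(purrr)

# Paths typed by hand for illustration; the files do not need to exist
paths <- c(
  "data/no_sitenames/2024-05-21_YOR_BR_PH_A.csv",
  "data/no_sitenames/2024-05-21_YOR_BR_PH_B.csv"
)

# Passing a function: it is applied to the vector to create the names
named_paths <- set_names(paths, basename)

names(named_paths)     # the file names without the folder
unname(named_paths)    # the values are still the full paths
```

map() keeps these names on its output, which is what later lets list_rbind() record where each row came from.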

Now use map() to read in the data files

list.files("data/no_sitenames", full.names = TRUE) |> 
  set_names(basename) |> 
  map(read_csv)
$`2024-05-21_YOR_BR_PH_A.csv`
# A tibble: 61 × 2
   datetime         ph
   <chr>         <dbl>
 1 5/21/24 9:50   8.05
 2 5/21/24 9:55   8.06
 3 5/21/24 10:00  8.01
 4 5/21/24 10:05  8.05
 5 5/21/24 10:10  8.03
 6 5/21/24 10:15  8.05
 7 5/21/24 10:20  8.05
 8 5/21/24 10:25  8.03
 9 5/21/24 10:30  8.03
10 5/21/24 10:35  8.03
# ℹ 51 more rows

$`2024-05-21_YOR_BR_PH_B.csv`
# A tibble: 61 × 2
   datetime         ph
   <chr>         <dbl>
 1 5/21/24 9:50   8.06
 2 5/21/24 9:55   8.07
 3 5/21/24 10:00  8.03
 4 5/21/24 10:05  8.07
 5 5/21/24 10:10  8.06
 6 5/21/24 10:15  8.07
 7 5/21/24 10:20  8.07
 8 5/21/24 10:25  8.04
 9 5/21/24 10:30  8.05
10 5/21/24 10:35  8.05
# ℹ 51 more rows

$`2024-05-21_YOR_BR_PH_C.csv`
# A tibble: 61 × 2
   datetime         ph
   <chr>         <dbl>
 1 5/22/24 9:50   7.03
 2 5/22/24 9:55   7.02
 3 5/22/24 10:00  7.02
 4 5/22/24 10:05  7.02
 5 5/22/24 10:10  7.01
 6 5/22/24 10:15  7.01
 7 5/22/24 10:20  7.01
 8 5/22/24 10:25  7.01
 9 5/22/24 10:30  7.01
10 5/22/24 10:35  7   
# ℹ 51 more rows

$`2024-05-21_YOR_BR_PH_D.csv`
# A tibble: 61 × 2
   datetime         ph
   <chr>         <dbl>
 1 5/21/24 9:50   7.98
 2 5/21/24 9:55   8.02
 3 5/21/24 10:00  7.97
 4 5/21/24 10:05  8.03
 5 5/21/24 10:10  7.99
 6 5/21/24 10:15  8.02
 7 5/21/24 10:20  8.04
 8 5/21/24 10:25  7.99
 9 5/21/24 10:30  8   
10 5/21/24 10:35  8.01
# ℹ 51 more rows

$`2024-05-21_YOR_BR_PH_E.csv`
# A tibble: 61 × 2
   datetime         ph
   <chr>         <dbl>
 1 5/21/24 9:50   8.03
 2 5/21/24 9:55   8.06
 3 5/21/24 10:00  8.01
 4 5/21/24 10:05  8.06
 5 5/21/24 10:10  8.03
 6 5/21/24 10:15  8.05
 7 5/21/24 10:20  8.06
 8 5/21/24 10:25  8.03
 9 5/21/24 10:30  8.01
10 5/21/24 10:35  8.01
# ℹ 51 more rows
  • The output is a named list of data frames, one element per file.
  • The “name” of each element is the name of the .csv file it was read from.

Use set_names() and mutate() to extract info from the file names

Use the names_to argument in list_rbind() to create a site column that is populated with the corresponding list element names

list.files("data/no_sitenames", full.names = TRUE) |> 
  set_names(basename) |> 
  map(read_csv) |> 
  list_rbind(names_to = "site")
# A tibble: 305 × 3
   site                       datetime         ph
   <chr>                      <chr>         <dbl>
 1 2024-05-21_YOR_BR_PH_A.csv 5/21/24 9:50   8.05
 2 2024-05-21_YOR_BR_PH_A.csv 5/21/24 9:55   8.06
 3 2024-05-21_YOR_BR_PH_A.csv 5/21/24 10:00  8.01
 4 2024-05-21_YOR_BR_PH_A.csv 5/21/24 10:05  8.05
 5 2024-05-21_YOR_BR_PH_A.csv 5/21/24 10:10  8.03
 6 2024-05-21_YOR_BR_PH_A.csv 5/21/24 10:15  8.05
 7 2024-05-21_YOR_BR_PH_A.csv 5/21/24 10:20  8.05
 8 2024-05-21_YOR_BR_PH_A.csv 5/21/24 10:25  8.03
 9 2024-05-21_YOR_BR_PH_A.csv 5/21/24 10:30  8.03
10 2024-05-21_YOR_BR_PH_A.csv 5/21/24 10:35  8.03
# ℹ 295 more rows

The site values are too long, but we can fix that.

Use substr() to extract the 22nd character (the site letter) from each file name

list.files("data/no_sitenames", full.names = TRUE) |> 
  set_names(basename) |> 
  map(read_csv) |> 
  list_rbind(names_to = "site") |> 
  mutate(site = substr(site, 22, 22))
# A tibble: 305 × 3
   site  datetime         ph
   <chr> <chr>         <dbl>
 1 A     5/21/24 9:50   8.05
 2 A     5/21/24 9:55   8.06
 3 A     5/21/24 10:00  8.01
 4 A     5/21/24 10:05  8.05
 5 A     5/21/24 10:10  8.03
 6 A     5/21/24 10:15  8.05
 7 A     5/21/24 10:20  8.05
 8 A     5/21/24 10:25  8.03
 9 A     5/21/24 10:30  8.03
10 A     5/21/24 10:35  8.03
# ℹ 295 more rows
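
Extracting by position only works while every file name has the same length. As one possible alternative, a regular expression can pick out the capital letter that sits directly before ".csv". In the sketch below the second file name, with the river code JAMES, is hypothetical and was added to show where position-based extraction goes wrong.

```r
library(stringr)

# First name is from the data folder; the second is a hypothetical longer name
file_names <- c(
  "2024-05-21_YOR_BR_PH_A.csv",
  "2024-05-21_JAMES_BR_PH_B.csv"
)

# Position-based: correct only when the site letter is the 22nd character
by_position <- substr(file_names, 22, 22)

# Pattern-based: one capital letter immediately followed by ".csv" at the end
by_pattern <- str_extract(file_names, "[A-Z](?=\\.csv$)")

by_position
by_pattern
```

For the longer name, the 22nd character falls inside "PH", so only the pattern-based version recovers the site letter for both files.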

Use separate_wider_delim() to keep and separate all elements of the file names

list.files("data/no_sitenames", full.names = TRUE) |> 
  set_names(basename) |> 
  map(read_csv) |> 
  list_rbind(names_to = "file") |> 
  separate_wider_delim(file, delim = "_", 
                       names = c("date", "river", "salinity", "param", "chamber"))
# A tibble: 305 × 7
   date       river salinity param chamber datetime         ph
   <chr>      <chr> <chr>    <chr> <chr>   <chr>         <dbl>
 1 2024-05-21 YOR   BR       PH    A.csv   5/21/24 9:50   8.05
 2 2024-05-21 YOR   BR       PH    A.csv   5/21/24 9:55   8.06
 3 2024-05-21 YOR   BR       PH    A.csv   5/21/24 10:00  8.01
 4 2024-05-21 YOR   BR       PH    A.csv   5/21/24 10:05  8.05
 5 2024-05-21 YOR   BR       PH    A.csv   5/21/24 10:10  8.03
 6 2024-05-21 YOR   BR       PH    A.csv   5/21/24 10:15  8.05
 7 2024-05-21 YOR   BR       PH    A.csv   5/21/24 10:20  8.05
 8 2024-05-21 YOR   BR       PH    A.csv   5/21/24 10:25  8.03
 9 2024-05-21 YOR   BR       PH    A.csv   5/21/24 10:30  8.03
10 2024-05-21 YOR   BR       PH    A.csv   5/21/24 10:35  8.03
# ℹ 295 more rows

Close - we need to get rid of the “.csv”

list.files("data/no_sitenames", full.names = TRUE) |> 
  set_names(basename) |> 
  map(read_csv) |> 
  list_rbind(names_to = "file") |> 
  separate_wider_delim(file, delim = "_", 
                       names = c("date", "river", "salinity", "param", "site")) |> 
  mutate(site = substr(site, 1, 1))  # keep the site letter, drop ".csv"
# A tibble: 305 × 7
   date       river salinity param site  datetime         ph
   <chr>      <chr> <chr>    <chr> <chr> <chr>         <dbl>
 1 2024-05-21 YOR   BR       PH    A     5/21/24 9:50   8.05
 2 2024-05-21 YOR   BR       PH    A     5/21/24 9:55   8.06
 3 2024-05-21 YOR   BR       PH    A     5/21/24 10:00  8.01
 4 2024-05-21 YOR   BR       PH    A     5/21/24 10:05  8.05
 5 2024-05-21 YOR   BR       PH    A     5/21/24 10:10  8.03
 6 2024-05-21 YOR   BR       PH    A     5/21/24 10:15  8.05
 7 2024-05-21 YOR   BR       PH    A     5/21/24 10:20  8.05
 8 2024-05-21 YOR   BR       PH    A     5/21/24 10:25  8.03
 9 2024-05-21 YOR   BR       PH    A     5/21/24 10:30  8.03
10 2024-05-21 YOR   BR       PH    A     5/21/24 10:35  8.03
# ℹ 295 more rows
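
Another option, sketched here under the assumption that the file extension carries no useful information, is to strip the extension before splitting the name. The tibble toy stands in for the output of list_rbind(names_to = "file"); its two rows are illustrative.

```r
library(dplyr)
library(tidyr)

# Stand-in for the combined data with a "file" column
toy <- tibble(
  file = c("2024-05-21_YOR_BR_PH_A.csv", "2024-05-21_YOR_BR_PH_B.csv"),
  ph = c(8.05, 8.06)
)

toy_split <- toy |>
  # Remove the extension first so the last piece is just the site letter
  mutate(file = tools::file_path_sans_ext(file)) |>
  separate_wider_delim(
    file,
    delim = "_",
    names = c("date", "river", "salinity", "param", "site")
  )

toy_split
```

With this ordering the site column would still be correct if site codes were longer than one character.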

Assign the output to an object

ph <- list.files("data/no_sitenames", full.names = TRUE) |> 
  set_names(basename) |> 
  map(read_csv) |> 
  list_rbind(names_to = "file") |> 
  separate_wider_delim(file, delim = "_", 
                       names = c("date", "river", "salinity", "param", "site")) |> 
  mutate(site = substr(site, 1, 1))  # keep the site letter, drop ".csv"

Save so we can start our analysis in a fresh .qmd



write_csv(ph, file = "data/ph_combined.csv")

You can use map() (and the family of other map() functions) to perform almost any operation across multiple elements of a vector or list. More to come.
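
As a small preview of that family, the sketch below applies functions to each element of a list and returns either a list or a typed vector. The list readings and its values are hypothetical.

```r
library(purrr)

# Hypothetical pH readings from three sites
readings <- list(
  a = c(8.05, 8.06, 8.01),
  b = c(7.03, 7.02),
  c = c(7.98, 8.02, 7.97, 8.03)
)

# map() always returns a list
n_as_list <- map(readings, length)

# map_int() and map_dbl() return named atomic vectors instead
n_readings <- map_int(readings, length)
mean_ph <- map_dbl(readings, mean)

n_readings
mean_ph
```

The typed variants raise an error if a result does not fit the promised type, which can catch surprises early.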