Iteration

Author

David Gerard

Published

June 3, 2025

Learning Objectives

For-loops.
Iteration.
Chapter 11 of HOPR
Purrr Cheat Sheet.
Purrr Overview.

For Loops

Load the tidyverse
```
library(tidyverse)
```
Iteration is the repetition of some amount of code.
If we didn’t know the sum() function, how would we add up the elements of a vector?
```
x <- c(8, 1, 3, 1, 3)
```
We could manually add the elements.
```
x[1] + x[2] + x[3] + x[4] + x[5]
```
```
[1] 16
```
But this is prone to error (through copy and paste). Also, what if x has 10,000 elements?

For loops to the rescue!

```
sumval <- 0
for (i in seq_along(x)) {
   sumval <- sumval + x[[i]]
}
sumval
```
```
[1] 16
```

Each for loop contains the following elements:
1. Output: This is sumval above. We allocate the space for the output before the for loop.
2. Sequence: This is seq_along(x) above, which evaluates to 1 2 3 4 5. These are the values that i will go through each iteration.
3. Body: This is the code between the curly braces {}. This is the code that will be evaluated each iteration with a new value of i.

The loop above is equivalent to running the following code, one line per value of i:

```
sumval <- 0
sumval <- sumval + x[[1]]
sumval <- sumval + x[[2]]
sumval <- sumval + x[[3]]
sumval <- sumval + x[[4]]
sumval <- sumval + x[[5]]
sumval
```
```
[1] 16
```
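As a second illustration of the three elements, here is a minimal sketch that multiplies the elements of x instead of adding them. The name prodval is introduced here for illustration; note that the output starts at 1 rather than 0, since 1 is the identity for multiplication.

```r
x <- c(8, 1, 3, 1, 3)

prodval <- 1                    ## 1. Output: allocated before the loop
for (i in seq_along(x)) {       ## 2. Sequence: i takes the values 1, 2, 3, 4, 5
  prodval <- prodval * x[[i]]   ## 3. Body: run once for each value of i
}
prodval                         ## 72, the same as prod(x)
```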

You often want to fill a vector with values. You should create this vector beforehand using the vector() function.

For example, let’s calculate a vector of cumulative sums of x.

```
cumvec <- vector(mode = "double", length = length(x))
cumvec
```
```
[1] 0 0 0 0 0
```
```
for (i in seq_along(cumvec)) {
   if (i == 1) {
     cumvec[[i]] <- x[[i]]
   } else {
     cumvec[[i]] <- cumvec[[i - 1]] + x[[i]]
   }
}
cumvec
```
```
[1]  8  9 12 13 16
```
```
## Same as cumsum(x)
cumsum(x)
```
```
[1]  8  9 12 13 16
```
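The mode argument of vector() controls what the pre-allocated output is filled with. A short sketch of the most common modes follows (the variable names are just for illustration):

```r
dbl <- vector(mode = "double", length = 3)     ## 0 0 0
lgl <- vector(mode = "logical", length = 3)    ## FALSE FALSE FALSE
chr <- vector(mode = "character", length = 3)  ## "" "" ""
lst <- vector(mode = "list", length = 2)       ## a list of two NULLs

dbl
lgl
chr
lst
```

Use mode = "list" when each iteration produces something other than a single number, logical, or string (for example, a fitted model).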

Exercise: The first two numbers of the Fibonacci Sequence are 0 and 1. Each succeeding number is the sum of the previous two numbers in the sequence. For example, the third element is 1 = 0 + 1, while the fourth element is 2 = 1 + 1, and the fifth element is 3 = 2 + 1. Use a for loop to calculate the first 100 Fibonacci Numbers. Sanity Check: The $\log_2$ of the 100th Fibonacci Number is about 67.57.
Looping is often done over the columns of a data frame.
Note: for a data frame df, seq_along(df) is the same as 1:ncol(df) which is the same as 1:length(df) (since data frames are special cases of lists).
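The equivalence in the note can be checked directly. The sketch below also shows one reason to prefer seq_along(): for a zero-length input, 1:length() counts down from 1 to 0 and the loop body runs twice, while seq_along() gives an empty sequence. The objects df, empty, n_safe, and n_unsafe are made up for this illustration.

```r
df <- data.frame(a = 1:3, b = letters[1:3])
identical(seq_along(df), 1:ncol(df))    ## TRUE
identical(seq_along(df), 1:length(df))  ## TRUE

empty <- numeric(0)
seq_along(empty)   ## integer(0): a loop over this never runs
1:length(empty)    ## 1 0: a loop over this runs twice

## Count how many times each loop body runs
n_safe <- 0
for (i in seq_along(empty)) n_safe <- n_safe + 1
n_unsafe <- 0
for (i in 1:length(empty)) n_unsafe <- n_unsafe + 1
c(n_safe, n_unsafe)  ## 0 2
```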

Let’s calculate the mean of each column of mtcars.

```
data("mtcars")
mean_vec <- vector(mode = "numeric", length = length(mtcars))
for (i in seq_along(mtcars)) {
  mean_vec[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
}
mean_vec
```
```
 [1]  20.0906   6.1875 230.7219 146.6875   3.5966   3.2172  17.8487   0.4375
 [9]   0.4062   3.6875   2.8125
```
```
colMeans(mtcars)
```
```
     mpg      cyl     disp       hp     drat       wt     qsec       vs 
 20.0906   6.1875 230.7219 146.6875   3.5966   3.2172  17.8488   0.4375 
      am     gear     carb 
  0.4062   3.6875   2.8125
```

Why not just use colMeans()? Because most summaries have no such helper. There is no “colSDs” function, for example. Iteration lets you apply any function to every column (or element), not just the few that come with a built-in column-wise version.
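As a sketch of the same pattern with a different summary, the loop below uses median() as a stand-in for any function of your choosing. It also copies the column names onto the output so that the result is labelled the way colMeans() output is. The name med_vec is introduced here for illustration.

```r
data("mtcars")

med_vec <- vector(mode = "double", length = length(mtcars))
names(med_vec) <- names(mtcars)   ## label the output by column name
for (i in seq_along(mtcars)) {
  med_vec[[i]] <- median(mtcars[[i]], na.rm = TRUE)
}
med_vec
```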
Exercise: Use a for loop to calculate the standard deviation of each penguin trait in the penguins data frame from the palmerpenguins package.

purrr

Basic Mappings

R is a functional programming language, which means that you can pass functions to functions.
Suppose on mtcars we want to calculate the column-wise mean, the column-wise median, the column-wise standard deviation, the column-wise maximum, the column-wise minimum, and the column-wise MAD. The for-loop for each of these would look nearly identical, differing only in the function fun that is applied:
```
fun <- mean ## placeholder: swap in median, sd, max, min, or mad
funvec <- rep(NA, length = length(mtcars))
for (i in seq_along(funvec)) {
  funvec[i] <- fun(mtcars[[i]], na.rm = TRUE)
}
funvec
```
Ideally, we would like to just tell R what function to apply to each column of mtcars. This is what the purrr package allows us to do.
purrr is a part of the tidyverse, and so does not need to be loaded separately.
map_*() takes a vector (or list or data frame) as input, applies a provided function to each element of that vector, and outputs a vector of the same length. The suffix determines the type of the output:
- map() returns a list.
- map_lgl() returns a logical vector.
- map_int() returns an integer vector.
- map_dbl() returns a double vector.
- map_chr() returns a character vector.
```
map_dbl(mtcars, mean)
map_dbl(mtcars, median)
map_dbl(mtcars, sd)
map_dbl(mtcars, mad)
map_dbl(mtcars, min)
map_dbl(mtcars, max)
```
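The other variants work the same way. Here is a small sketch on a made-up list (lst and the result names are hypothetical), choosing the variant that matches what the function returns for each element:

```r
library(purrr)

lst <- list(a = c(1, 2, 3), b = c("x", "y"), d = TRUE)

as_list <- map(lst, length)          ## a list of three integers
n_each  <- map_int(lst, length)      ## a = 3, b = 2, d = 1
is_num  <- map_lgl(lst, is.numeric)  ## a = TRUE, b = FALSE, d = FALSE
as_text <- map_chr(lst, toString)    ## a = "1, 2, 3", b = "x, y", d = "TRUE"

n_each
is_num
as_text
```

In each case the names of the input are carried over to the output.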
You can pass on more arguments in map_*().
```
map_dbl(mtcars, mean, na.rm = TRUE)
```
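mtcars has no missing values, so the effect of na.rm = TRUE is easier to see on a small made-up list (vals below is hypothetical). The sketch also shows the same thing written with an anonymous function, a form covered under Shortcuts below. As far as I know, recent purrr documentation leans toward that form for extra arguments, but treat that as a style preference rather than a rule.

```r
library(purrr)

vals <- list(a = c(1, NA, 3), b = c(2, 4, 6))

with_na  <- map_dbl(vals, mean)                          ## a = NA, b = 4
no_na    <- map_dbl(vals, mean, na.rm = TRUE)            ## a = 2,  b = 4
no_na_fn <- map_dbl(vals, \(v) mean(v, na.rm = TRUE))    ## same as no_na

with_na
no_na
```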
Suppose you want to get the output of summary() on each column.
```
map(mtcars, summary)
```
Exercise (RDS 21.5.3.1): Write code that uses one of the map functions to:
1. Determine the type of each column in nycflights13::flights.
2. Compute the number of unique values in each column of palmerpenguins::penguins.
3. Generate 10 random normals for each of $\mu = -10, 0, 10, \ldots, 100$.

Shortcuts

Instead of specifying a built-in function, you can create an anonymous function to map over.
Anonymous functions are non-named functions that are used as inputs to other functions.
Typically, they are one-liners and are of the form
```
function(args) code-using-args
```

E.g., an anonymous function that outputs the interquartile range of a vector is

```
function(x) quantile(x, 0.75) - quantile(x, 0.25)
```
```
function (x) 
quantile(x, 0.75) - quantile(x, 0.25)
```

R 4.1.0 and above allows for a shorter syntax for anonymous functions.

```
\(x) quantile(x, 0.75) - quantile(x, 0.25)
```
```
function (x) 
quantile(x, 0.75) - quantile(x, 0.25)
```
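To see the interquartile-range function in action, the sketch below calls it directly on a short vector and then maps it over the columns of mtcars. The unname() call is an addition of mine: it strips the "75%" label that quantile() attaches to its result. The comparison against the built-in IQR() is only a sanity check. The \(x) syntax needs R 4.1.0 or later.

```r
library(purrr)

## Call an anonymous function directly by wrapping it in parentheses
direct <- (\(x) unname(quantile(x, 0.75) - quantile(x, 0.25)))(c(1, 2, 3, 4, 5))
direct  ## 2

## Map the same anonymous function over each column of mtcars
iqr_vec <- map_dbl(mtcars, \(x) unname(quantile(x, 0.75) - quantile(x, 0.25)))
iqr_vec

## Sanity check against the built-in IQR()
all.equal(iqr_vec, map_dbl(mtcars, IQR))  ## TRUE
```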

For example, the following are three equivalent ways to calculate the mean of each column in mtcars.
```
map_dbl(mtcars, mean)
map_dbl(mtcars, function(x) mean(x))
map_dbl(mtcars, \(x) mean(x))
```
You can think of the anonymous-function versions as first defining a function
```
.f <- function(x) {
  mean(x)
}
```
and then passing this function to map_dbl().
```
map_dbl(mtcars, .f)
```
Why is this useful? Consider the following chunk of code which allows us to fit many simple linear regression models:
```
mtcars |>
  nest(.by = cyl) |>
  mutate(lmout = map(data, \(df) lm(mpg ~ wt, data = df))) ->
  sumdf
```
- nest(.by = cyl) will create a new data frame containing a list column of data frames, where each data frame has the same value of cyl for all units within that data frame.
- \(df) lm(mpg ~ wt, data = df) defines a function (called an “anonymous function”) that will fit a linear model of mpg on wt where those variables are in the data frame df.
- The map() call fits that linear model to each of the three data frames in the list-column called data created by nest().
- What is returned is a data frame containing a new list column called lmout that contains the three lm objects that you can use to get fits and summaries.
```
summary(sumdf$lmout[[1]])
```
```
Call:
lm(formula = mpg ~ wt, data = df)

Residuals:
     1      2      3      4      5      6      7 
-0.125  0.584  1.929 -0.690  0.355 -1.045 -1.008 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    28.41       4.18    6.79   0.0011
wt             -2.78       1.33   -2.08   0.0918

Residual standard error: 1.17 on 5 degrees of freedom
Multiple R-squared:  0.465, Adjusted R-squared:  0.357 
F-statistic: 4.34 on 1 and 5 DF,  p-value: 0.0918
```
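To see what each step produces, the sketch below separates the nesting from the model fitting and then pulls the slope out of each fit. The names nested, fits, and slopes are introduced here for illustration and are not part of the code above.

```r
library(tidyverse)

## Step 1: one row per value of cyl, with the remaining columns in `data`
nested <- mtcars |>
  nest(.by = cyl)
nested$cyl                  ## 6 4 8 (order of first appearance)
map_int(nested$data, nrow)  ## 7 11 14 cars in each group

## Step 2: fit the same model to each nested data frame
fits <- map(nested$data, \(df) lm(mpg ~ wt, data = df))

## Step 3: extract one number per fit, here the slope on wt
slopes <- map_dbl(fits, \(fit) coef(fit)[["wt"]])
slopes                      ## the first is about -2.78, matching the summary above
```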

We can use map() to get a list of summaries.

```
sumdf |>
  mutate(sumlm = map(lmout, summary)) ->
  sumdf
```
```
sumdf$sumlm[[1]]
```
```

Call:
lm(formula = mpg ~ wt, data = df)

Residuals:
     1      2      3      4      5      6      7 
-0.125  0.584  1.929 -0.690  0.355 -1.045 -1.008 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    28.41       4.18    6.79   0.0011
wt             -2.78       1.33   -2.08   0.0918

Residual standard error: 1.17 on 5 degrees of freedom
Multiple R-squared:  0.465, Adjusted R-squared:  0.357 
F-statistic: 4.34 on 1 and 5 DF,  p-value: 0.0918
```

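If you want to extract the $R^2$ from every summary, you can do this with map_dbl() as well. Passing a string instead of a function is a shortcut that extracts the element with that name from each item in the list.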

```
sumdf$sumlm[[1]]$r.squared ## only gets one R^2 out.
```
```
[1] 0.4645
```
```
## Gets all R^2 out
sumdf |>
  mutate(rsquared = map_dbl(sumlm, "r.squared")) ->
  sumdf

sumdf$rsquared
```
```
[1] 0.4645 0.5086 0.4230
```
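The string in map_dbl(sumlm, "r.squared") is an extraction shortcut. A minimal sketch on a made-up list of records (recs is hypothetical) shows that it gives the same result as an anonymous function using [[, and that an integer extracts by position instead of by name:

```r
library(purrr)

recs <- list(
  list(id = "a", score = 0.5),
  list(id = "b", score = 0.9)
)

by_name <- map_dbl(recs, "score")             ## 0.5 0.9
by_fun  <- map_dbl(recs, \(r) r[["score"]])   ## same thing, written out
identical(by_name, by_fun)                    ## TRUE

by_pos <- map_chr(recs, 1)                    ## first element of each: "a" "b"
by_pos
```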

Exercise: A $t$-test is used to test for differences in population means. R implements this with t.test(). For example, if I want to test for differences between the mean mpg’s of automatics and manuals (coded in variable am), I would use the following syntax.
```
t.test(mpg ~ am, data = mtcars)$p.value
```
Use map() to get the $p$-value for this test within each group of cyl.

keep() and discard()

keep() selects all elements (for a data frame, all variables) for which some function returns TRUE.

E.g. let’s keep all numeric variables and calculate their means in the palmerpenguins::penguins data frame.

```
library(palmerpenguins)
```
```

Attaching package: 'palmerpenguins'

The following objects are masked from 'package:datasets':

    penguins, penguins_raw
```
```
data("penguins")
penguins |>
  keep(is.numeric) |>
  map_dbl(mean, na.rm = TRUE)
```
```
   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
            43.92             17.15            200.92           4201.75 
             year 
          2008.03
```

discard() does the opposite: it drops the variables for which the function returns TRUE, leaving those for which it returns FALSE.

Let’s count the number of each value for each categorical variable:

```
penguins |>
  discard(is.numeric) |>
  map(table)
```
```
$species

   Adelie Chinstrap    Gentoo 
      152        68       124 

$island

   Biscoe     Dream Torgersen 
      168       124        52 

$sex

female   male 
   165    168
```
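keep() and discard() also work on plain lists and accept anonymous functions as the test. A small sketch on a made-up list (lst, long, and short are hypothetical names) shows that the two functions split the input into complementary pieces:

```r
library(purrr)

lst <- list(a = 1:5, b = c(2.5, 3.5), d = 10:20)

long  <- keep(lst, \(v) length(v) > 2)     ## elements a and d
short <- discard(lst, \(v) length(v) > 2)  ## element b

names(long)
names(short)
```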

Other less useful functions are available in Section 21.9 of RDS.
Exercise: In the mtcars data frame, keep only variables that have a mean greater than 10 and calculate their mean. Hint: You’ll have to use some of the shortcuts above.