Iteration

Author

David Gerard

Published

June 3, 2025

Learning Objectives

For Loops

  • Load the tidyverse

    library(tidyverse)
  • Iteration is the repetition of some amount of code.

  • If we didn’t know the sum() function, how would we add up the elements of a vector?

    x <- c(8, 1, 3, 1, 3)
  • We could manually add the elements.

    x[1] + x[2] + x[3] + x[4] + x[5]
    [1] 16

    But this is prone to error (through copy and paste). Also, what if x has 10,000 elements?

  • For loops to the rescue!

    sumval <- 0
    for (i in seq_along(x)) {
       sumval <- sumval + x[[i]]
    }
    sumval
    [1] 16
  • Each for loop contains the following elements:

    1. Output: This is sumval above. We allocate the space for the output before the for loop.
    2. Sequence: This is seq_along(x) above, which evaluates to 1 2 3 4 5. These are the values that i takes, one per iteration.
    3. Body: This is the code between the curly braces {}. This is the code that will be evaluated each iteration with a new value of i.
  • R does not literally rewrite the code, but the loop above is equivalent to running:

    sumval <- 0
    sumval <- sumval + x[[1]]
    sumval <- sumval + x[[2]]
    sumval <- sumval + x[[3]]
    sumval <- sumval + x[[4]]
    sumval <- sumval + x[[5]]
    sumval
    [1] 16
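  • A small sketch of why the sequence is written as seq_along(x) rather than 1:length(x): the two agree for non-empty vectors, but for an empty vector 1:length(x) counts down from 1 to 0. The names empty, good_seq, bad_seq, and n_iter are made up for this illustration.

    ```r
    empty <- numeric(0)          # hypothetical empty input
    good_seq <- seq_along(empty) # integer(0), so a loop over it never runs
    bad_seq <- 1:length(empty)   # 1:0, which is the vector 1 0

    n_iter <- 0
    for (i in good_seq) {
      n_iter <- n_iter + 1
    }
    n_iter                       # 0: the body was skipped entirely
    length(bad_seq)              # 2: a loop over bad_seq would run twice
    ```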
  • You often want to fill a vector with values. You should create this vector beforehand using the vector() function.

  • For example, let’s calculate a vector of cumulative sums of x.

    cumvec <- vector(mode = "double", length = length(x))
    cumvec
    [1] 0 0 0 0 0
    for (i in seq_along(cumvec)) {
       if (i == 1) {
         cumvec[[i]] <- x[[i]]
       } else {
         cumvec[[i]] <- cumvec[[i - 1]] + x[[i]] 
       }
    }
    cumvec
    [1]  8  9 12 13 16
    ## Same as cumsum(x)
    cumsum(x)
    [1]  8  9 12 13 16
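  • As a rough illustration of what vector() allocates, each mode comes with its own default fill value. The object names below are chosen only for this example.

    ```r
    dbl_out <- vector(mode = "double", length = 3)     # 0 0 0
    int_out <- vector(mode = "integer", length = 3)    # 0 0 0, stored as integers
    lgl_out <- vector(mode = "logical", length = 3)    # FALSE FALSE FALSE
    chr_out <- vector(mode = "character", length = 3)  # three empty strings
    lst_out <- vector(mode = "list", length = 2)       # a list of two NULLs

    typeof(dbl_out)
    typeof(int_out)
    lst_out
    ```

    Picking the mode that matches what the loop body produces keeps the output from being silently coerced to another type.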
  • Exercise: The first two numbers of the Fibonacci Sequence are 0 and 1. Each succeeding number is the sum of the previous two numbers in the sequence. For example, the third element is 1 = 0 + 1, the fourth element is 2 = 1 + 1, and the fifth element is 3 = 2 + 1. Use a for loop to calculate the first 100 Fibonacci Numbers. Sanity Check: The \(\log_2\) of the 100th Fibonacci Number is about 67.57.

  • Looping is often done over the columns of a data frame.

  • Note: for a data frame df, seq_along(df) is the same as 1:ncol(df) which is the same as 1:length(df) (since data frames are special cases of lists).

  • Let’s calculate the mean of each column of mtcars

    data("mtcars")
    mean_vec <- vector(mode = "numeric", length = length(mtcars))
    for (i in seq_along(mtcars)) {
      mean_vec[[i]] <- mean(mtcars[[i]], na.rm = TRUE)   
    }
    mean_vec
     [1]  20.0906   6.1875 230.7219 146.6875   3.5966   3.2172  17.8487   0.4375
     [9]   0.4062   3.6875   2.8125
    colMeans(mtcars)
         mpg      cyl     disp       hp     drat       wt     qsec       vs 
     20.0906   6.1875 230.7219 146.6875   3.5966   3.2172  17.8488   0.4375 
          am     gear     carb 
      0.4062   3.6875   2.8125 
  • Why not just use colMeans()? Well, base R has no “colSDs” function, so iteration is important for applying functions that lack a ready-made column-wise version to multiple elements in R.
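  • A sketch of the same loop pattern with a different summary, the column-wise median. Unlike the output of colMeans(), the output of a for loop is unnamed unless the names are copied over. The name med_vec is made up for this example.

    ```r
    data("mtcars")
    med_vec <- vector(mode = "double", length = length(mtcars))
    for (i in seq_along(mtcars)) {
      med_vec[[i]] <- median(mtcars[[i]], na.rm = TRUE)
    }
    names(med_vec) <- names(mtcars)  # carry the column names over to the output
    med_vec
    ```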

  • Exercise: Use a for loop to calculate the standard deviation of each penguin trait in the penguins data frame from the palmerpenguins package.

purrr

Basic Mappings

  • R is a functional programming language, which means that you can pass functions to functions.

  • Suppose on mtcars we want to calculate the column-wise mean, the column-wise median, the column-wise standard deviation, the column-wise maximum, the column-wise minimum, and the column-wise MAD. The for loop for each of these would look nearly identical, with only the function (called fun here) changing:

    fun <- median ## placeholder: swap in mean, sd, mad, min, or max
    funvec <- rep(NA_real_, length.out = length(mtcars))
    for (i in seq_along(funvec)) {
      funvec[[i]] <- fun(mtcars[[i]], na.rm = TRUE)
    }
    funvec
  • Ideally, we would like to just tell R what function to apply to each column of mtcars. This is what the purrr package allows us to do.

  • purrr is a part of the tidyverse, and so does not need to be loaded separately.

  • map_*() takes a vector (or list or data frame) as input, applies a provided function to each element of that vector, and outputs a vector of the same length.

    • map() returns a list.
    • map_lgl() returns a logical vector.
    • map_int() returns an integer vector.
    • map_dbl() returns a double vector.
    • map_chr() returns a character vector.
    map_dbl(mtcars, mean)
    map_dbl(mtcars, median)
    map_dbl(mtcars, sd)
    map_dbl(mtcars, mad)
    map_dbl(mtcars, min)
    map_dbl(mtcars, max)
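  • To make the difference between the variants concrete, here is a minimal sketch in which each call returns a different type. The object names are chosen for this example only.

    ```r
    library(purrr)
    data("mtcars")

    col_class <- map_chr(mtcars, class)       # character vector, one class per column
    n_obs     <- map_int(mtcars, length)      # integer vector, one length per column
    is_num    <- map_lgl(mtcars, is.numeric)  # logical vector
    col_range <- map(mtcars, range)           # list, since range() returns two numbers

    col_class
    n_obs
    is_num
    col_range$mpg
    ```

    The typed variants expect the function to return a single value per element, which is why range() is paired with map() here rather than map_dbl().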
  • You can pass on more arguments in map_*().

    map_dbl(mtcars, mean, na.rm = TRUE)
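  • Since mtcars has no missing values, na.rm = TRUE changes nothing there. A sketch on the built-in airquality data frame, which does contain missing values, shows the effect of the extra argument. The names no_rm and with_rm are made up for this example.

    ```r
    library(purrr)
    data("airquality")

    no_rm   <- map_dbl(airquality, mean)                # Ozone and Solar.R come back as NA
    with_rm <- map_dbl(airquality, mean, na.rm = TRUE)  # missing values are dropped first

    no_rm
    with_rm
    ```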
  • Suppose you want to get the output of summary() on each column.

    map(mtcars, summary)
  • Exercise (RDS 21.5.3.1): Write code that uses one of the map functions to:

    1. Determine the type of each column in nycflights13::flights.
    2. Compute the number of unique values in each column of palmerpenguins::penguins.
    3. Generate 10 random normals for each of \(\mu = -10, 0, 10, \ldots, 100\).

Shortcuts

  • Instead of specifying a built-in function, you can create an anonymous function to map over.

  • Anonymous functions are unnamed functions that are used as inputs to other functions.

  • Typically, they are one-liners and are of the form

    function(args) code-using-args
  • E.g., an anonymous function that outputs the interquartile range of a vector is

    function(x) quantile(x, 0.75) - quantile(x, 0.25)
    function (x) 
    quantile(x, 0.75) - quantile(x, 0.25)
  • R 4.1.0 and above allows for a shorter syntax for anonymous functions.

    \(x) quantile(x, 0.75) - quantile(x, 0.25)
    function (x) 
    quantile(x, 0.75) - quantile(x, 0.25)
  • For example, the following are three equivalent ways to calculate the mean of each column in mtcars.

    map_dbl(mtcars, mean)
    map_dbl(mtcars, function(x) mean(x))
    map_dbl(mtcars, \(x) mean(x))
  • You can think about the last two calls as creating a function

    .f <- function(x) {
      mean(x)
    }

    and then passing this function to map_dbl().

    map_dbl(mtcars, .f)
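  • Anonymous functions become more useful when the computation combines several functions. A minimal sketch, assuming R 4.1.0 or later for the \(x) syntax; the names range_width and coef_var are made up for this example.

    ```r
    library(purrr)
    data("mtcars")

    # Width of the range of each column (maximum minus minimum)
    range_width <- map_dbl(mtcars, \(x) max(x) - min(x))

    # Coefficient of variation of each column (standard deviation over mean)
    coef_var <- map_dbl(mtcars, \(x) sd(x) / mean(x))

    range_width
    coef_var
    ```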
  • Why is this useful? Consider the following chunk of code which allows us to fit many simple linear regression models:

    mtcars |>
      nest(.by = cyl) |>
      mutate(lmout = map(data, \(df) lm(mpg ~ wt, data = df))) ->
      sumdf
    • nest(.by = cyl) will create a new data frame containing a list column of data frames, where each data frame has the same value of cyl for all units within that data frame.
    • \(df) lm(mpg ~ wt, data = df) defines a function (called an “anonymous function”) that will fit a linear model of mpg on wt where those variables are in the data frame df.
    • The map() call fits that linear model to each of the three data frames in the list-column called data created by nest().
    • What is returned is a data frame containing a new list column called lmout that contains the three lm objects that you can use to get fits and summaries.
    summary(sumdf$lmout[[1]])
    
    Call:
    lm(formula = mpg ~ wt, data = df)
    
    Residuals:
         1      2      3      4      5      6      7 
    -0.125  0.584  1.929 -0.690  0.355 -1.045 -1.008 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)    28.41       4.18    6.79   0.0011
    wt             -2.78       1.33   -2.08   0.0918
    
    Residual standard error: 1.17 on 5 degrees of freedom
    Multiple R-squared:  0.465, Adjusted R-squared:  0.357 
    F-statistic: 4.34 on 1 and 5 DF,  p-value: 0.0918
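  • The same pattern can pull a single number out of each fitted model. The following is a sketch, not the only way to do it, that stores the estimated slope on wt for each value of cyl. The names slopedf, slope, and fit are made up for this example.

    ```r
    library(tidyverse)
    data("mtcars")

    mtcars |>
      nest(.by = cyl) |>
      mutate(
        lmout = map(data, \(df) lm(mpg ~ wt, data = df)),
        slope = map_dbl(lmout, \(fit) coef(fit)[["wt"]])  # one number per model
      ) ->
      slopedf

    slopedf |>
      select(cyl, slope)
    ```

    Because each slope is a single number, map_dbl() can return an ordinary double column instead of a list column.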
  • We can use map() to get a list of summaries.

    sumdf |>
      mutate(sumlm = map(lmout, summary)) ->
      sumdf
    
    sumdf$sumlm[[1]]
    
    Call:
    lm(formula = mpg ~ wt, data = df)
    
    Residuals:
         1      2      3      4      5      6      7 
    -0.125  0.584  1.929 -0.690  0.355 -1.045 -1.008 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)    28.41       4.18    6.79   0.0011
    wt             -2.78       1.33   -2.08   0.0918
    
    Residual standard error: 1.17 on 5 degrees of freedom
    Multiple R-squared:  0.465, Adjusted R-squared:  0.357 
    F-statistic: 4.34 on 1 and 5 DF,  p-value: 0.0918
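  • If you want to extract the \(R^2\), you can do this using map_dbl() as well. Passing a string instead of a function extracts the element with that name from each element of the list.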

    sumdf$sumlm[[1]]$r.squared ## only gets one R^2 out.
    [1] 0.4645
    ## Gets all R^2 out
    sumdf |>
      mutate(rsquared = map_dbl(sumlm, "r.squared")) ->
      sumdf
    
    sumdf$rsquared
    [1] 0.4645 0.5086 0.4230
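  • A sketch of how extraction by name or position works, using a small hand-made list in place of the model summaries. The list fits and its values are toy numbers invented for this example.

    ```r
    library(purrr)

    # Toy stand-in for a list of summaries (values are made up)
    fits <- list(
      list(r.squared = 0.9, df = 5),
      list(r.squared = 0.5, df = 8)
    )

    by_name <- map_dbl(fits, "r.squared")            # a string extracts by name
    by_fun  <- map_dbl(fits, \(s) s[["r.squared"]])  # the equivalent anonymous function
    by_pos  <- map_dbl(fits, 2)                      # an integer extracts by position (here, df)

    by_name
    by_pos
    ```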
  • Exercise: A \(t\)-test is used to test for differences in population means. R implements this with t.test(). For example, if I want to test for differences between the mean mpg’s of automatics and manuals (coded in variable am), I would use the following syntax.

    t.test(mpg ~ am, data = mtcars)$p.value

    Use map() to get the \(p\)-value for this test within each group of cyl.

keep() and discard()

  • keep() selects all elements (for a data frame, all variables) for which some function returns TRUE.

  • E.g. let’s keep all numeric variables and calculate their means in the palmerpenguins::penguins data frame.

    library(palmerpenguins)
    
    Attaching package: 'palmerpenguins'
    The following objects are masked from 'package:datasets':
    
        penguins, penguins_raw
    data("penguins")
    penguins |>
      keep(is.numeric) |>
      map_dbl(mean, na.rm = TRUE)
       bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
                43.92             17.15            200.92           4201.75 
                 year 
              2008.03 
  • discard() selects all elements (for a data frame, all variables) for which some function returns FALSE.

  • Let’s count the number of each value for each categorical variable:

    penguins |>
      discard(is.numeric) |>
      map(table)
    $species
    
       Adelie Chinstrap    Gentoo 
          152        68       124 
    
    $island
    
       Biscoe     Dream Torgersen 
          168       124        52 
    
    $sex
    
    female   male 
       165    168 
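  • keep() and discard() work on any list, not just data frames, and the predicate can be an anonymous function. A minimal sketch on a made-up list lst:

    ```r
    library(purrr)

    lst <- list(a = 1:3, b = "x", c = c(2.5, 3), d = TRUE)

    kept    <- keep(lst, is.numeric)          # elements a and c
    dropped <- discard(lst, is.numeric)       # elements b and d
    long    <- keep(lst, \(v) length(v) > 1)  # elements a and c, via an anonymous predicate

    names(kept)
    names(dropped)
    names(long)
    ```

    Together, keep() and discard() with the same predicate split a list into two non-overlapping pieces.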
  • Other less useful functions are available in Section 21.9 of RDS.

  • Exercise: In the mtcars data frame, keep only variables that have a mean greater than 10 and calculate their mean. Hint: You’ll have to use some of the shortcuts above.