Learning Objectives

For Loops

purrr

Basic Mappings

  • R is a functional programming language, which means that you can pass functions as arguments to other functions.
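
  • As a minimal sketch of what passing a function to a function looks like, the hypothetical helper apply_to_mpg() below (not part of any package) takes a function f as its argument and calls it on mtcars$mpg.

    ```r
    # Hypothetical helper for illustration only: `f` is any function
    # that accepts a numeric vector.
    apply_to_mpg <- function(f) {
      f(mtcars$mpg)
    }

    mpg_mean <- apply_to_mpg(mean)     # same as mean(mtcars$mpg)
    mpg_median <- apply_to_mpg(median) # same as median(mtcars$mpg)
    ```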

  • Suppose that, for mtcars, we want to calculate the column-wise mean, median, standard deviation, maximum, minimum, and MAD. The for-loops for these would all look nearly identical, differing only in the function applied (written as the placeholder fun below):

    # `fun` is a placeholder for mean, median, sd, max, min, or mad
    funvec <- rep(NA_real_, length.out = length(mtcars))
    for (i in seq_along(funvec)) {
      funvec[i] <- fun(mtcars[[i]], na.rm = TRUE)
    }
    funvec
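
  • One way to avoid copying the loop six times is to wrap it in a function that takes the summary function as an argument. The sketch below assumes a hypothetical helper named col_summary() (the name is ours, not from any package) and assumes that fun accepts an na.rm argument, as mean, median, sd, max, min, and mad all do.

    ```r
    # Hypothetical helper: apply `fun` to every column of data frame `df`.
    # Assumes `fun` accepts an `na.rm` argument and returns one number.
    col_summary <- function(df, fun) {
      out <- rep(NA_real_, length.out = length(df))
      for (i in seq_along(out)) {
        out[i] <- fun(df[[i]], na.rm = TRUE)
      }
      names(out) <- names(df)
      out
    }

    col_means <- col_summary(mtcars, mean)
    col_mads <- col_summary(mtcars, mad)
    ```

    This is essentially the pattern that map_dbl() packages up for us.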
  • Ideally, we would like to just tell R what function to apply to each column of mtcars. This is what the purrr package allows us to do.

  • purrr is part of the core tidyverse, so it is loaded by library(tidyverse) and does not need to be loaded separately.

  • map_*() takes a vector (or list or data frame) as input, applies a provided function to each element of that vector, and outputs a vector of the same length.

    • map() returns a list.
    • map_lgl() returns a logical vector.
    • map_int() returns an integer vector.
    • map_dbl() returns a double vector.
    • map_chr() returns a character vector.
    map_dbl(mtcars, mean)
    map_dbl(mtcars, median)
    map_dbl(mtcars, sd)
    map_dbl(mtcars, mad)
    map_dbl(mtcars, min)
    map_dbl(mtcars, max)
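
  • To see how the suffix controls the output type, here is a small sketch on a toy list x (made up for illustration). Each call applies a function to every element and returns a vector of the type named by the suffix.

    ```r
    library(purrr)

    # Toy input: a named list of three vectors
    x <- list(a = 1:3, b = c(2.5, 3.5), c = 10)

    lens <- map_int(x, length)       # integer vector of lengths
    is_int <- map_lgl(x, is.integer) # logical vector: is each element an integer vector?
    labels <- map_chr(x, toString)   # character vector, one string per element
    ranges <- map(x, range)          # list, since each result has length 2
    ```

    The typed variants require each result to have length one; map() has no such restriction, which is why it is the one to use with range().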
  • You can pass additional arguments along to the mapped function by adding them after the function in map_*().

    map_dbl(mtcars, mean, na.rm = TRUE)
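
  • Since mtcars has no missing values, the effect of na.rm = TRUE is easier to see on a toy list (made up for illustration) that contains an NA.

    ```r
    library(purrr)

    # Toy input with one missing value
    x <- list(a = c(1, NA, 3), b = c(4, 5, 6))

    with_na <- map_dbl(x, mean)                  # a is NA because of the missing value
    without_na <- map_dbl(x, mean, na.rm = TRUE) # NA is dropped before averaging
    ```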
  • Suppose you want to get the output of summary() on each column.

    map(mtcars, summary)
  • Exercise (RDS 21.5.3.1): Write code that uses one of the map functions to:

    1. Determine the type of each column in nycflights13::flights.
    2. Compute the number of unique values in each column of palmerpenguins::penguins.
    3. Generate 10 random normals for each of \(\mu = -10, 0, 10, \ldots, 100\).

Shortcuts

  • Instead of specifying a built-in function, you can create an anonymous function to map over.

  • Anonymous functions are unnamed functions that are used as inputs to other functions.

  • Typically, they are one-liners of the form

    function(args) code-using-args
  • E.g., an anonymous function that outputs the interquartile range of a vector is

    function(x) quantile(x, 0.75) - quantile(x, 0.25)
    ## function(x) quantile(x, 0.75) - quantile(x, 0.25)
  • R 4.1.0 and above allows a shorter syntax for anonymous functions.

    \(x) quantile(x, 0.75) - quantile(x, 0.25)
    ## \(x) quantile(x, 0.75) - quantile(x, 0.25)
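
  • As a sketch of how this anonymous function can be used, the code below maps it over the columns of mtcars and compares the result with base R's IQR(), which computes the same quantity. It assumes R 4.1.0 or later for the \(x) syntax.

    ```r
    library(purrr)

    # Interquartile range of each column via the anonymous function
    iqr_manual <- map_dbl(mtcars, \(x) quantile(x, 0.75) - quantile(x, 0.25))

    # Same quantity using the built-in function
    iqr_builtin <- map_dbl(mtcars, IQR)

    # Compare values, ignoring names
    all.equal(unname(iqr_manual), unname(iqr_builtin))
    ```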
  • For example, the following are three equivalent ways to calculate the mean of each column in mtcars.

    map_dbl(mtcars, mean)
    map_dbl(mtcars, function(x) mean(x))
    map_dbl(mtcars, \(x) mean(x))
  • You can think of the last two calls as shorthand for first defining a function

    .f <- function(x) {
      mean(x)
    }

    and then passing that function to map_dbl().

    map_dbl(mtcars, .f)
  • Why is this useful? Consider the following chunk of code which allows us to fit many simple linear regression models:

    mtcars |>
      nest(.by = cyl) |>
      mutate(lmout = map(data, \(df) lm(mpg ~ wt, data = df))) ->
      sumdf
    • nest(.by = cyl) will create a new data frame containing a list column of data frames, where each data frame has the same value of cyl for all units within that data frame.
    • \(df) lm(mpg ~ wt, data = df) defines a function (called an “anonymous function”) that will fit a linear model of mpg on wt where those variables are in the data frame df.
    • The map() call fits that linear model to each of the three data frames in the list-column called data created by nest().
    • What is returned is a data frame containing a new list column called lmout that contains the three lm objects that you can use to get fits and summaries.
    summary(sumdf$lmout[[1]])
    ## 
    ## Call:
    ## lm(formula = mpg ~ wt, data = df)
    ## 
    ## Residuals:
    ##      1      2      3      4      5      6      7 
    ## -0.125  0.584  1.929 -0.690  0.355 -1.045 -1.008 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)    28.41       4.18    6.79   0.0011
    ## wt             -2.78       1.33   -2.08   0.0918
    ## 
    ## Residual standard error: 1.17 on 5 degrees of freedom
    ## Multiple R-squared:  0.465,  Adjusted R-squared:  0.357 
    ## F-statistic: 4.34 on 1 and 5 DF,  p-value: 0.0918
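
  • Once the models sit in a list column, the same pattern can pull a number out of each fit. The sketch below refits the models and adds a column with the estimated slope on wt; the names fits, fit, and slope are ours, chosen for illustration. It assumes a tidyr version that supports nest(.by = ).

    ```r
    library(dplyr)
    library(tidyr)
    library(purrr)

    fits <- mtcars |>
      nest(.by = cyl) |>
      mutate(
        # one lm object per value of cyl
        lmout = map(data, \(df) lm(mpg ~ wt, data = df)),
        # extract the wt coefficient from each lm object
        slope = map_dbl(lmout, \(fit) coef(fit)[["wt"]])
      )

    fits |> select(cyl, slope)
    ```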
  • We can use map() to get a list of summaries.

    sumdf |>
      mutate(sumlm = map(lmout, summary)) ->
      sumdf
    
    sumdf$sumlm[[1]]
    ## 
    ## Call:
    ## lm(formula = mpg ~ wt, data = df)
    ## 
    ## Residuals:
    ##      1      2      3      4      5      6      7 
    ## -0.125  0.584  1.929 -0.690  0.355 -1.045 -1.008 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)    28.41       4.18    6.79   0.0011
    ## wt             -2.78       1.33   -2.08   0.0918
    ## 
    ## Residual standard error: 1.17 on 5 degrees of freedom
    ## Multiple R-squared:  0.465,  Adjusted R-squared:  0.357 
    ## F-statistic: 4.34 on 1 and 5 DF,  p-value: 0.0918
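  • If you want to extract the \(R^2\) from each summary, you can do this with map_dbl() as well. Passing a string instead of a function extracts the element with that name from each list element.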

    sumdf$sumlm[[1]]$r.squared ## only gets one R^2 out.
    ## [1] 0.4645
    ## Gets all R^2 out
    sumdf |>
      mutate(rsquared = map_dbl(sumlm, "r.squared")) ->
      sumdf
    
    sumdf$rsquared
    ## [1] 0.4645 0.5086 0.4230
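
  • The string shortcut is equivalent to mapping an anonymous function that uses $. The sketch below checks this on summaries built with base R's split() rather than nest(), purely to keep the example short; the names sums, by_name, and by_function are ours.

    ```r
    library(purrr)

    # One summary.lm object per value of cyl (list names are "4", "6", "8")
    sums <- map(split(mtcars, mtcars$cyl), \(df) summary(lm(mpg ~ wt, data = df)))

    by_name <- map_dbl(sums, "r.squared")          # extract by element name
    by_function <- map_dbl(sums, \(s) s$r.squared) # same thing, written out

    identical(by_name, by_function)
    ```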
  • Exercise: A \(t\)-test is used to test for differences in population means. R implements this with t.test(). For example, if I want to test for differences between the mean mpg’s of automatics and manuals (coded in variable am), I would use the following syntax.

    t.test(mpg ~ am, data = mtcars)$p.value

    Use map() to get the \(p\)-value for this test within each group of cyl.

keep() and discard()

  • keep() keeps all elements (for a data frame, all variables) for which a given function returns TRUE.

  • E.g. let’s keep all numeric variables and calculate their means in the palmerpenguins::penguins data frame.

    library(palmerpenguins)
    data("penguins")
    penguins |>
      keep(is.numeric) |>
      map_dbl(mean, na.rm = TRUE)
    ##    bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
    ##             43.92             17.15            200.92           4201.75 
    ##              year 
    ##           2008.03
  • discard() does the opposite: it drops all elements for which the function returns TRUE, keeping only those for which it returns FALSE.

  • Let’s count the number of each value for each categorical variable:

    penguins |>
      discard(is.numeric) |>
      map(table)
    ## $species
    ## 
    ##    Adelie Chinstrap    Gentoo 
    ##       152        68       124 
    ## 
    ## $island
    ## 
    ##    Biscoe     Dream Torgersen 
    ##       168       124        52 
    ## 
    ## $sex
    ## 
    ## female   male 
    ##    165    168
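
  • keep() and discard() with the same function split the input into two complementary pieces. A minimal sketch on a toy list (made up for illustration):

    ```r
    library(purrr)

    # Toy input mixing numeric and character vectors
    x <- list(a = 1:3, b = "text", c = c(2.5, 3.5), d = c("u", "v"))

    kept <- keep(x, is.character)       # elements b and d
    dropped <- discard(x, is.character) # elements a and c
    ```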
  • Other less useful functions are available in Section 21.9 of RDS.

  • Exercise: In the mtcars data frame, keep only variables that have a mean greater than 10 and calculate their mean. Hint: You’ll have to use some of the shortcuts above.