Iteration
Learning Objectives
- For-loops.
- Iteration.
- Chapter 11 of HOPR
- Purrr Cheat Sheet.
- Purrr Overview.
For Loops
Load the tidyverse:

    library(tidyverse)
Iteration is the repetition of some amount of code.
If we didn’t know the sum() function, how would we add up the elements of a vector?

    x <- c(8, 1, 3, 1, 3)

We could manually add the elements.

    x[1] + x[2] + x[3] + x[4] + x[5]

    [1] 16

But this is prone to error (through copy and paste). Also, what if x has 10,000 elements?

For loops to the rescue!
    sumval <- 0
    for (i in seq_along(x)) {
      sumval <- sumval + x[[i]]
    }
    sumval

    [1] 16

Each for loop contains the following elements:
- Output: This is sumval above. We allocate the space for the output before the for loop.
- Sequence: This is seq_along(x) above, which evaluates to 1 2 3 4 5. These are the values that i will go through each iteration.
- Body: This is the code between the curly braces {}. This is the code that will be evaluated each iteration with a new value of i.
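The notes above do not say why the sequence is written with seq_along(x) rather than 1:length(x). One common reason is the behavior on empty vectors. The following is a minimal sketch of that difference; the names empty, n_safe, and n_unsafe are made up for illustration.

```r
# An empty vector: the loop body should never run.
empty <- numeric(0)

# seq_along() gives a length-zero sequence, so the loop is skipped.
n_safe <- 0
for (i in seq_along(empty)) {
  n_safe <- n_safe + 1
}

# 1:length(empty) is 1:0, which counts down through 1 and 0,
# so the body runs twice even though there is nothing to loop over.
n_unsafe <- 0
for (i in 1:length(empty)) {
  n_unsafe <- n_unsafe + 1
}

n_safe    # 0 iterations
n_unsafe  # 2 iterations
```

For a non-empty vector the two sequences are identical, so the choice only matters in edge cases like this one.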
The loop above is equivalent to running the following code:

    sumval <- 0
    sumval <- sumval + x[[1]]
    sumval <- sumval + x[[2]]
    sumval <- sumval + x[[3]]
    sumval <- sumval + x[[4]]
    sumval <- sumval + x[[5]]
    sumval

    [1] 16

You often want to fill a vector with values. You should create this vector beforehand using the vector() function.

For example, let’s calculate a vector of cumulative sums of x.

    cumvec <- vector(mode = "double", length = length(x))
    cumvec

    [1] 0 0 0 0 0

    for (i in seq_along(cumvec)) {
      if (i == 1) {
        cumvec[[i]] <- x[[i]]
      } else {
        cumvec[[i]] <- cumvec[[i - 1]] + x[[i]]
      }
    }
    cumvec

    [1]  8  9 12 13 16

    ## Same as cumsum(x)
    cumsum(x)

    [1]  8  9 12 13 16

Exercise: The first two numbers of the Fibonacci Sequence are 0 and 1. Each succeeding number is the sum of the previous two numbers in the sequence. For example, the third element is 1 = 0 + 1, while the fourth element is 2 = 1 + 1, and the fifth element is 3 = 2 + 1. Use a for loop to calculate the first 100 Fibonacci Numbers. Sanity Check: The \(\log_2\) of the 100th Fibonacci Number is about 67.57.
Looping is often done over the columns of a data frame.
Note: for a data frame df, seq_along(df) is the same as 1:ncol(df) which is the same as 1:length(df) (since data frames are special cases of lists).

Let’s calculate the mean of each column of mtcars.

    data("mtcars")
    mean_vec <- vector(mode = "numeric", length = length(mtcars))
    for (i in seq_along(mtcars)) {
      mean_vec[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
    }
    mean_vec

    [1]  20.0906   6.1875 230.7219 146.6875   3.5966   3.2172  17.8487   0.4375
    [9]   0.4062   3.6875   2.8125

    colMeans(mtcars)

         mpg      cyl     disp       hp     drat       wt     qsec       vs
     20.0906   6.1875 230.7219 146.6875   3.5966   3.2172  17.8488   0.4375
          am     gear     carb
      0.4062   3.6875   2.8125

Why not just use colMeans()? Well, there is no “colSDs” function, so iteration is important for applying non-implemented functions to multiple elements in R.

Exercise: Use a for loop to calculate the standard deviation of each penguin trait in the penguins data frame from the palmerpenguins package.
purrr
Basic Mappings
R is a functional programming language, which means that you can pass functions to functions.
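As a minimal sketch of what passing a function to a function looks like, the helper apply_fun() below is a hypothetical name introduced only for illustration. It takes a vector and a function, and calls the function on the vector.

```r
# Hypothetical helper: applies whatever function `fun` it is given to `v`.
apply_fun <- function(v, fun) {
  fun(v)
}

x <- c(8, 1, 3, 1, 3)
apply_fun(x, sum)     # same as sum(x)
apply_fun(x, median)  # same as median(x)
```

The function is passed by name without parentheses, so it is handed over as an object rather than called on the spot.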
Suppose on mtcars we want to calculate the column-wise mean, the column-wise median, the column-wise standard deviation, the column-wise maximum, the column-wise minimum, and the column-wise MAD. The for loop for each of these would look very similar, with fun standing in for the function being applied:

    ## fun is a placeholder for mean, median, sd, max, min, or mad
    funvec <- rep(NA_real_, length.out = length(mtcars))
    for (i in seq_along(funvec)) {
      funvec[i] <- fun(mtcars[[i]], na.rm = TRUE)
    }
    funvec

Ideally, we would like to just tell R what function to apply to each column of mtcars. This is what the purrr package allows us to do.

purrr is a part of the tidyverse, and so does not need to be loaded separately.
map_*() takes a vector (or list or data frame) as input, applies a provided function on each element of that vector, and outputs a vector of the same length.
- map() returns a list.
- map_lgl() returns a logical vector.
- map_int() returns an integer vector.
- map_dbl() returns a double vector.
- map_chr() returns a character vector.
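To make the return types concrete, here is a small sketch on a toy list. The list lst and the result names are illustrative and not part of the notes; the functions being mapped are all base R.

```r
library(purrr)

# A small illustrative list (not from the notes above)
lst <- list(a = 1:3, b = c(2.5, 3.5))

lens  <- map_int(lst, length)      # integer vector: one length per element
ints  <- map_lgl(lst, is.integer)  # logical vector: is each element integer?
means <- map_dbl(lst, mean)        # double vector: one mean per element
strs  <- map_chr(lst, toString)    # character vector: each element as one string
rngs  <- map(lst, range)           # list: range() returns two values per element
```

The suffix is a promise about the output type: the typed variants expect the function to return a single value of that type for every element, which is why range() is mapped with plain map() here.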
    map_dbl(mtcars, mean)
    map_dbl(mtcars, median)
    map_dbl(mtcars, sd)
    map_dbl(mtcars, mad)
    map_dbl(mtcars, min)
    map_dbl(mtcars, max)

You can pass additional arguments to the mapped function through map_*().

    map_dbl(mtcars, mean, na.rm = TRUE)

Suppose you want to get the output of summary() on each column.

    map(mtcars, summary)

Exercise (RDS 21.5.3.1): Write code that uses one of the map functions to:
- Determine the type of each column in nycflights13::flights.
- Compute the number of unique values in each column of palmerpenguins::penguins.
- Generate 10 random normals for each of \(\mu = -10, 0, 10, \ldots, 100\).
Shortcuts
Instead of specifying a built-in function, you can create an anonymous function to map over.
Anonymous functions are non-named functions that are used as inputs to other functions.
Typically, they are one-liners and are of the form

    function(args) code-using-args

E.g., an anonymous function that outputs the interquartile range of a vector is

    function(x) quantile(x, 0.75) - quantile(x, 0.25)

    function (x) quantile(x, 0.75) - quantile(x, 0.25)

R 4.1 and above allows for a shorter syntax for anonymous functions.

    \(x) quantile(x, 0.75) - quantile(x, 0.25)

    function (x) quantile(x, 0.75) - quantile(x, 0.25)

For example, the following are three equivalent ways to calculate the mean of each column in mtcars.

    map_dbl(mtcars, mean)
    map_dbl(mtcars, function(x) mean(x))
    map_dbl(mtcars, \(x) mean(x))

You can think about this as purrr creating an anonymous function

    .f <- function(.) {
      mean(.)
    }

and then calling this function in map().

    map_dbl(mtcars, .f)

Why is this useful? Consider the following chunk of code which allows us to fit many simple linear regression models:
    mtcars |>
      nest(.by = cyl) |>
      mutate(lmout = map(data, \(df) lm(mpg ~ wt, data = df))) ->
      sumdf

- nest(.by = cyl) will create a new data frame containing a list column of data frames, where each data frame has the same value of cyl for all units within that data frame.
- \(df) lm(mpg ~ wt, data = df) defines a function (called an “anonymous function”) that will fit a linear model of mpg on wt where those variables are in the data frame df.
- The map() call fits that linear model to each of the three data frames in the list-column called data created by nest().
- What is returned is a data frame containing a new list column called lmout that contains the three lm objects that you can use to get fits and summaries.
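The same split-then-fit idea can be sketched without nest(), using base R's split() to build a plain named list of data frames. This is an alternative illustration and not the approach used in the notes; the names by_cyl, fits, and slopes are made up here.

```r
library(purrr)

# Split mtcars into one data frame per value of cyl (base R, no nest())
by_cyl <- split(mtcars, mtcars$cyl)

# Fit the same regression of mpg on wt within each piece
fits <- map(by_cyl, \(df) lm(mpg ~ wt, data = df))

# Pull the estimated slope on wt out of each fitted model
slopes <- map_dbl(fits, \(fit) coef(fit)[["wt"]])
slopes
```

The nest() version keeps the models alongside cyl in a data frame, which is convenient for further mutate() calls, while the split() version returns a list named by the cyl values.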
    summary(sumdf$lmout[[1]])

    Call:
    lm(formula = mpg ~ wt, data = df)

    Residuals:
         1      2      3      4      5      6      7
    -0.125  0.584  1.929 -0.690  0.355 -1.045 -1.008

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)    28.41       4.18    6.79   0.0011
    wt             -2.78       1.33   -2.08   0.0918

    Residual standard error: 1.17 on 5 degrees of freedom
    Multiple R-squared:  0.465, Adjusted R-squared:  0.357
    F-statistic: 4.34 on 1 and 5 DF,  p-value: 0.0918

We can use map() to get a list of summaries.

    sumdf |>
      mutate(sumlm = map(lmout, summary)) ->
      sumdf
    sumdf$sumlm[[1]]

    Call:
    lm(formula = mpg ~ wt, data = df)

    Residuals:
         1      2      3      4      5      6      7
    -0.125  0.584  1.929 -0.690  0.355 -1.045 -1.008

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)    28.41       4.18    6.79   0.0011
    wt             -2.78       1.33   -2.08   0.0918

    Residual standard error: 1.17 on 5 degrees of freedom
    Multiple R-squared:  0.465, Adjusted R-squared:  0.357
    F-statistic: 4.34 on 1 and 5 DF,  p-value: 0.0918

If you want to extract the \(R^2\), you can do this using map as well.

    sumdf$sumlm[[1]]$r.squared ## only gets one R^2 out.

    [1] 0.4645

    ## Gets all R^2 out
    sumdf |>
      mutate(rsquared = map_dbl(sumlm, "r.squared")) ->
      sumdf
    sumdf$rsquared

    [1] 0.4645 0.5086 0.4230

Exercise: A \(t\)-test is used to test for differences in population means. R implements this with t.test(). For example, if I want to test for differences between the mean mpg’s of automatics and manuals (coded in variable am), I would use the following syntax.

    t.test(mpg ~ am, data = mtcars)$p.value

Use map() to get the \(p\)-value for this test within each group of cyl.
keep() and discard().
keep() selects all variables that return TRUE according to some function.

E.g. let’s keep all numeric variables and calculate their means in the palmerpenguins::penguins data frame.

    library(palmerpenguins)

    Attaching package: 'palmerpenguins'
    The following objects are masked from 'package:datasets':
        penguins, penguins_raw

    data("penguins")
    penguins |>
      keep(is.numeric) |>
      map_dbl(mean, na.rm = TRUE)

       bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g
                43.92             17.15            200.92           4201.75
                 year
              2008.03

discard() will select all variables that return FALSE according to some function.

Let’s count the number of each value for each categorical variable:

    penguins |>
      discard(is.numeric) |>
      map(table)

    $species

       Adelie Chinstrap    Gentoo
          152        68       124

    $island

       Biscoe     Dream Torgersen
          168       124        52

    $sex

    female   male
       165    168

Other less useful functions are available in Section 21.9 of RDS.
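keep() and discard() also work on plain lists and accept any function that returns a single TRUE or FALSE, including one you write yourself. The sketch below uses a toy list; lst and the predicate big_num are illustrative names that do not appear in the notes.

```r
library(purrr)

# Illustrative list, not from the notes
lst <- list(a = 1:5, b = c("x", "y"), c = c(0.5, 1.5, 2.5))

# Predicate: TRUE for numeric vectors with more than two elements
big_num <- \(v) is.numeric(v) && length(v) > 2

kept      <- keep(lst, big_num)     # elements where big_num() is TRUE
discarded <- discard(lst, big_num)  # elements where big_num() is FALSE

names(kept)       # "a" "c"
names(discarded)  # "b"
```

With the same predicate, the two results partition the input: every element ends up in exactly one of them.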
Exercise: In the mtcars data frame, keep only variables that have a mean greater than 10 and calculate their mean. Hint: You’ll have to use some of the shortcuts above.