library(tidyverse)
Iteration
Learning Objectives
- For-loops.
- Iteration.
- Chapter 11 of HOPR
- Purrr Cheat Sheet.
- Purrr Overview.
For Loops
Load the tidyverse:
library(tidyverse)
Iteration is the repetition of some amount of code.
If we didn’t know the sum() function, how would we add up the elements of a vector?
x <- c(8, 1, 3, 1, 3)
We could manually add the elements.
x[1] + x[2] + x[3] + x[4] + x[5]
[1] 16
But this is prone to error (through copy and paste). Also, what if x has 10,000 elements?
For loops to the rescue!
sumval <- 0
for (i in seq_along(x)) {
  sumval <- sumval + x[[i]]
}
sumval
[1] 16
Each for loop contains the following elements:
- Output: This is sumval above. We allocate the space for the output before the for loop.
- Sequence: This is seq_along(x) above, which evaluates to 1 2 3 4 5. These are the values that i will go through each iteration.
- Body: This is the code between the curly braces {}. This is the code that will be evaluated each iteration with a new value of i.
In the above loop, R effectively evaluates the following code:
sumval <- 0
sumval <- sumval + x[[1]]
sumval <- sumval + x[[2]]
sumval <- sumval + x[[3]]
sumval <- sumval + x[[4]]
sumval <- sumval + x[[5]]
sumval
[1] 16
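The sequence part of the loop is worth a closer look. The sketch below, which uses a hypothetical empty vector called empty, suggests why seq_along(x) is usually a safer sequence than 1:length(x): for an empty vector, 1:length(x) counts down from 1 to 0, so the body would run twice.

```r
# Hypothetical empty input, used only for illustration
empty <- numeric(0)

seq_along(empty)   # integer(0): nothing to iterate over
1:length(empty)    # 1 0: two values, even though there are no elements

# Count how many times each loop body runs
count_safe <- 0
for (i in seq_along(empty)) {
  count_safe <- count_safe + 1
}

count_unsafe <- 0
for (i in 1:length(empty)) {
  count_unsafe <- count_unsafe + 1
}

c(safe = count_safe, unsafe = count_unsafe)
```

With seq_along() the body never runs, which is the behavior you almost always want for empty input.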
You often want to fill a vector with values. You should create this vector beforehand using the vector() function.
For example, let’s calculate a vector of cumulative sums of x.
cumvec <- vector(mode = "double", length = length(x))
cumvec
[1] 0 0 0 0 0
for (i in seq_along(cumvec)) {
  if (i == 1) {
    cumvec[[i]] <- x[[i]]
  } else {
    cumvec[[i]] <- cumvec[[i - 1]] + x[[i]]
  }
}
cumvec
[1] 8 9 12 13 16
## Same as cumsum(x)
cumsum(x)
[1] 8 9 12 13 16
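As a small sketch of the preallocation step, the mode argument of vector() determines what the empty slots look like. The object names below are hypothetical and only serve the illustration.

```r
# Preallocate output containers of different types
dbl_vec <- vector(mode = "double", length = 3)     # 0 0 0
lgl_vec <- vector(mode = "logical", length = 3)    # FALSE FALSE FALSE
chr_vec <- vector(mode = "character", length = 3)  # "" "" ""
lst_vec <- vector(mode = "list", length = 2)       # a list of two NULLs

dbl_vec
lgl_vec
chr_vec
lst_vec
```

Choosing the mode that matches what the loop body produces avoids silent type conversions while the vector is being filled.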
Exercise: The first two numbers of the Fibonacci Sequence are 0 and 1. Each succeeding number is the sum of the previous two numbers in the sequence. For example, the third element is 1 = 0 + 1, while the fourth element is 2 = 1 + 1, and the fifth element is 3 = 2 + 1. Use a for loop to calculate the first 100 Fibonacci Numbers. Sanity Check: The \(\log_2\) of the 100th Fibonacci Number is about 67.57.
Looping is often done over the columns of a data frame.
Note: for a data frame df with at least one column, seq_along(df) is the same as 1:ncol(df), which is the same as 1:length(df) (since data frames are special cases of lists).
Let’s calculate the mean of each column of mtcars:
data("mtcars")
mean_vec <- vector(mode = "numeric", length = length(mtcars))
for (i in seq_along(mtcars)) {
  mean_vec[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
}
mean_vec
[1]  20.0906   6.1875 230.7219 146.6875   3.5966   3.2172  17.8487   0.4375
[9]   0.4062   3.6875   2.8125
colMeans(mtcars)
     mpg      cyl     disp       hp     drat       wt     qsec       vs
 20.0906   6.1875 230.7219 146.6875   3.5966   3.2172  17.8488   0.4375
      am     gear     carb
  0.4062   3.6875   2.8125
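The loop output has no names, while colMeans() returns a named vector. A minimal sketch of lining the two up is to copy the column names onto the loop result and compare with all.equal(), which allows for tiny floating-point differences such as the last printed digit of qsec.

```r
data("mtcars")

# Same loop as above: one mean per column
mean_vec <- vector(mode = "numeric", length = length(mtcars))
for (i in seq_along(mtcars)) {
  mean_vec[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
}

# Attach the column names so the result is labeled like colMeans()
names(mean_vec) <- names(mtcars)

# Compare up to numerical tolerance
all.equal(mean_vec, colMeans(mtcars))
```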
Why not just use colMeans()? Well, base R has no “colSDs()” function, so iteration is important for applying non-implemented functions to multiple elements in R.
Exercise: Use a for loop to calculate the standard deviation of each penguin trait in the penguins data frame from the palmerpenguins package.
purrr
Basic Mappings
R is a functional programming language, which means that you can pass functions to other functions.
Suppose on mtcars we want to calculate the column-wise mean, the column-wise median, the column-wise standard deviation, the column-wise maximum, the column-wise minimum, and the column-wise MAD. The for loop would look very similar each time, with only the placeholder function fun changing:
funvec <- rep(NA, length.out = length(mtcars))
for (i in seq_along(funvec)) {
  funvec[i] <- fun(mtcars[[i]], na.rm = TRUE)
}
funvec
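One way to make the placeholder fun concrete, sketched here under the assumption that every summary function accepts na.rm, is to wrap the loop in a function that takes the summary function as an argument. The name col_summary is hypothetical and not part of any package.

```r
# Hypothetical helper: apply `fun` to every column of `df` with a for loop
col_summary <- function(df, fun) {
  out <- vector(mode = "double", length = length(df))
  for (i in seq_along(df)) {
    out[[i]] <- fun(df[[i]], na.rm = TRUE)
  }
  names(out) <- names(df)
  out
}

data("mtcars")
col_summary(mtcars, mean)
col_summary(mtcars, sd)
col_summary(mtcars, mad)
```

Passing mean, sd, or mad as an argument is exactly the "functions to functions" idea; purrr packages this pattern up so you do not have to write the loop yourself.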
Ideally, we would like to just tell R what function to apply to each column of mtcars. This is what the purrr package allows us to do.
purrr is a part of the tidyverse, and so does not need to be loaded separately.
map_*() takes a vector (or list or data frame) as input, applies a provided function on each element of that vector, and outputs a vector of the same length.
- map() returns a list.
- map_lgl() returns a logical vector.
- map_int() returns an integer vector.
- map_dbl() returns a double vector.
- map_chr() returns a character vector.
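To see how the variants differ, here is a small sketch on a hypothetical named list called lst. Each call applies one function to every element; only the type of the returned container changes.

```r
library(purrr)

# Hypothetical input, used only for illustration
lst <- list(a = 1:3, b = c(2.5, 3.5), c = c(10, 20, 30, 40))

map(lst, length)          # a list with one integer per element
map_int(lst, length)      # integer vector: a = 3, b = 2, c = 4
map_dbl(lst, mean)        # double vector:  a = 2, b = 3, c = 25
map_lgl(lst, is.integer)  # logical vector: a = TRUE, b = FALSE, c = FALSE
map_chr(lst, toString)    # character vector, e.g. a = "1, 2, 3"
```

The names of the input carry over to the output, which is why the results above are labeled a, b, and c.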
map_dbl(mtcars, mean)
map_dbl(mtcars, median)
map_dbl(mtcars, sd)
map_dbl(mtcars, mad)
map_dbl(mtcars, min)
map_dbl(mtcars, max)
You can pass additional arguments on to the mapped function by listing them after it in map_*().
map_dbl(mtcars, mean, na.rm = TRUE)
Suppose you want to get the output of summary() on each column.
map(mtcars, summary)
Exercise (RDS 21.5.3.1): Write code that uses one of the map functions to:
- Determine the type of each column in nycflights13::flights.
- Compute the number of unique values in each column of palmerpenguins::penguins.
- Generate 10 random normals for each of \(\mu = -10, 0, 10, \ldots, 100\).
Shortcuts
Instead of specifying a built-in function, you can create an anonymous function to map over.
Anonymous functions are non-named functions that are used as inputs to other functions.
Typically, they are one-liners and are of the form
function(args) code-using-args
E.g., an anonymous function that outputs the interquartile range of a vector is
function(x) quantile(x, 0.75) - quantile(x, 0.25)
function (x) quantile(x, 0.75) - quantile(x, 0.25)
R 4.1 and above allows for a shorter syntax for anonymous functions.
\(x) quantile(x, 0.75) - quantile(x, 0.25)
function (x) quantile(x, 0.75) - quantile(x, 0.25)
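As a quick check that both spellings behave the same, the sketch below gives each function a temporary, hypothetical name (iqr_long and iqr_short) purely so it can be called, and compares the result with the built-in IQR().

```r
data("mtcars")

# The same function written with the long and the short syntax
iqr_long  <- function(x) quantile(x, 0.75) - quantile(x, 0.25)
iqr_short <- \(x) quantile(x, 0.75) - quantile(x, 0.25)

iqr_long(mtcars$mpg)
iqr_short(mtcars$mpg)

# An anonymous function can also be called in place, without a name
(\(x) quantile(x, 0.75) - quantile(x, 0.25))(mtcars$mpg)

# Built-in equivalent; quantile() labels its result "75%", so drop the name first
all.equal(unname(iqr_long(mtcars$mpg)), IQR(mtcars$mpg))
```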
For example, the following are three equivalent ways to calculate the mean of each column in mtcars.
map_dbl(mtcars, mean)
map_dbl(mtcars, function(x) mean(x))
map_dbl(mtcars, \(x) mean(x))
You can think about this as purrr creating an anonymous function
.f <- function(.) {
  mean(.)
}
and then calling this function in map().
map_dbl(mtcars, .f)
Why is this useful? Consider the following chunk of code which allows us to fit many simple linear regression models:
mtcars |>
  nest(.by = cyl) |>
  mutate(lmout = map(data, \(df) lm(mpg ~ wt, data = df))) ->
  sumdf
- nest(.by = cyl) will create a new data frame containing a list column of data frames, where each data frame has the same value of cyl for all units within that data frame.
- \(df) lm(mpg ~ wt, data = df) defines a function (called an “anonymous function”) that will fit a linear model of mpg on wt where those variables are in the data frame df.
- The map() call fits that linear model to each of the three data frames in the list-column called data created by nest().
- What is returned is a data frame containing a new list column called lmout that contains the three lm objects that you can use to get fits and summaries.
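To make the nesting step less abstract, the following sketch inspects the intermediate result before any models are fit. The name nested is hypothetical, and the .by argument assumes a reasonably recent version of tidyr.

```r
library(tidyverse)

# Hypothetical intermediate object: one row per value of cyl
nested <- mtcars |>
  nest(.by = cyl)

names(nested)   # "cyl" and the list-column "data"
nrow(nested)    # one row for each of the three cyl groups

# Each element of the list-column is itself a data frame
map_int(nested$data, nrow)   # number of cars in each group
names(nested$data[[1]])      # remaining columns; cyl has moved to the outer data frame
```

Seeing that data holds ordinary data frames explains why lm() can be applied to each element with map().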
summary(sumdf$lmout[[1]])
Call:
lm(formula = mpg ~ wt, data = df)

Residuals:
     1      2      3      4      5      6      7
-0.125  0.584  1.929 -0.690  0.355 -1.045 -1.008

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    28.41       4.18    6.79   0.0011
wt             -2.78       1.33   -2.08   0.0918

Residual standard error: 1.17 on 5 degrees of freedom
Multiple R-squared: 0.465, Adjusted R-squared: 0.357
F-statistic: 4.34 on 1 and 5 DF, p-value: 0.0918
We can use map() to get a list of summaries.
sumdf |>
  mutate(sumlm = map(lmout, summary)) ->
  sumdf
sumdf$sumlm[[1]]
Call:
lm(formula = mpg ~ wt, data = df)

Residuals:
     1      2      3      4      5      6      7
-0.125  0.584  1.929 -0.690  0.355 -1.045 -1.008

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    28.41       4.18    6.79   0.0011
wt             -2.78       1.33   -2.08   0.0918

Residual standard error: 1.17 on 5 degrees of freedom
Multiple R-squared: 0.465, Adjusted R-squared: 0.357
F-statistic: 4.34 on 1 and 5 DF, p-value: 0.0918
If you want to extract the \(R^2\), you can do this using map_dbl() as well. Passing a string instead of a function extracts the element with that name from each item.
sumdf$sumlm[[1]]$r.squared ## only gets one R^2 out.
[1] 0.4645
## Gets all R^2 out
sumdf |>
  mutate(rsquared = map_dbl(sumlm, "r.squared")) ->
  sumdf
sumdf$rsquared
[1] 0.4645 0.5086 0.4230
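The string "r.squared" works here because purrr treats a string as shorthand for "extract the element with this name". The sketch below rebuilds the models under the hypothetical name fits and checks that the shorthand matches an explicit anonymous function; the same pattern can pull out other quantities, such as the slope on wt.

```r
library(tidyverse)

# Rebuild the nested fits and their summaries (hypothetical object name)
fits <- mtcars |>
  nest(.by = cyl) |>
  mutate(
    lmout = map(data, \(df) lm(mpg ~ wt, data = df)),
    sumlm = map(lmout, summary)
  )

# Name-based extraction versus an explicit anonymous function
by_name     <- map_dbl(fits$sumlm, "r.squared")
by_function <- map_dbl(fits$sumlm, \(s) s$r.squared)
identical(by_name, by_function)

# The same idea applied to the fitted models: slope of mpg on wt per group
slopes <- map_dbl(fits$lmout, \(m) coef(m)[["wt"]])
slopes
```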
Exercise: A \(t\)-test is used to test for differences in population means. R implements this with t.test(). For example, if I want to test for differences between the mean mpg’s of automatics and manuals (coded in variable am), I would use the following syntax.
t.test(mpg ~ am, data = mtcars)$p.value
Use map() to get the \(p\)-value for this test within each group of cyl.
keep() and discard()
keep() selects all variables that return TRUE according to some function.
E.g. let’s keep all numeric variables and calculate their means in the palmerpenguins::penguins data frame.
library(palmerpenguins)
Attaching package: 'palmerpenguins'
The following objects are masked from 'package:datasets': penguins, penguins_raw
data("penguins")
penguins |>
  keep(is.numeric) |>
  map_dbl(mean, na.rm = TRUE)
   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g
            43.92             17.15            200.92           4201.75
             year
          2008.03
discard() will select all variables that return FALSE according to some function.
Let’s count the number of each value for each categorical variable:
penguins |>
  discard(is.numeric) |>
  map(table)
$species

   Adelie Chinstrap    Gentoo
      152        68       124

$island

   Biscoe     Dream Torgersen
      168       124        52

$sex

female   male
   165    168
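keep() and discard() are complements: with the same predicate, every column ends up in exactly one of the two results. A minimal sketch on a hypothetical data frame called toy:

```r
library(purrr)

# Hypothetical data frame, used only for illustration
toy <- data.frame(
  id    = 1:3,
  group = c("a", "b", "a"),
  score = c(0.5, 1.5, 2.5)
)

kept      <- keep(toy, is.numeric)     # columns where is.numeric() is TRUE
discarded <- discard(toy, is.numeric)  # columns where is.numeric() is FALSE

names(kept)       # "id" "score"
names(discarded)  # "group"

# Together the two pieces account for every column
ncol(kept) + ncol(discarded) == ncol(toy)
```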
Other, less commonly used functions are described in Section 21.9 of RDS.
Exercise: In the mtcars data frame, keep only variables that have a mean greater than 10 and calculate their mean. Hint: You’ll have to use some of the shortcuts above.