Load the tidyverse
library(tidyverse)
Iteration is the repetition of some amount of code.
If we didn’t know the sum() function, how would we add up the elements of a vector?
x <- c(8, 1, 3, 1, 3)
We could manually add the elements.
x[1] + x[2] + x[3] + x[4] + x[5]
## [1] 16
But this is prone to error (through copy and paste). Also, what if x has 10,000 elements?
For loops to the rescue!
sumval <- 0
for (i in seq_along(x)) {
  sumval <- sumval + x[[i]]
}
sumval
## [1] 16
Each for loop contains the following elements:
Output: sumval above. We allocate the space for the output before the for loop.
Sequence: seq_along(x) above, which evaluates to 1 2 3 4 5. These are the values that i takes, one per iteration.
Body: the code between the curly braces {}. This is the code that will be evaluated each iteration with a new value of i.
In the above sequence, the loop is equivalent to running the following code:
sumval <- 0
sumval <- sumval + x[[1]]
sumval <- sumval + x[[2]]
sumval <- sumval + x[[3]]
sumval <- sumval + x[[4]]
sumval <- sumval + x[[5]]
sumval
## [1] 16
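The loop above iterates over seq_along(x) rather than 1:length(x). A minimal sketch of why this matters, assuming a hypothetical empty vector x_empty (not part of the original example): with zero elements, 1:length() counts down from 1 to 0, while seq_along() yields an empty sequence, so the loop body never runs.

```r
## Hypothetical empty input, used only for illustration
x_empty <- numeric(0)

## 1:length(x_empty) is 1:0, which counts down: 1 0
bad_seq <- 1:length(x_empty)

## seq_along() returns a zero-length sequence instead
good_seq <- seq_along(x_empty)

## Looping over good_seq skips the body, so the sum stays at 0
sumval_empty <- 0
for (i in good_seq) {
  sumval_empty <- sumval_empty + x_empty[[i]]
}
sumval_empty
```

Looping over bad_seq instead would try to evaluate x_empty[[1]] and stop with a subscript error.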
You often want to fill a vector with values. You should create this vector beforehand using the vector() function.
For example, let’s calculate a vector of cumulative sums of x.
cumvec <- vector(mode = "double", length = length(x))
cumvec
## [1] 0 0 0 0 0
for (i in seq_along(cumvec)) {
  if (i == 1) {
    cumvec[[i]] <- x[[i]]
  } else {
    cumvec[[i]] <- cumvec[[i - 1]] + x[[i]]
  }
}
cumvec
## [1] 8 9 12 13 16
## Same as cumsum(x)
cumsum(x)
## [1] 8 9 12 13 16
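The mode argument of vector() controls what kind of empty container is allocated. The sketch below shows the default fill value for a few common modes; the variable names are illustrative only.

```r
## Each call pre-allocates a vector of the requested type and length
dbl_vec <- vector(mode = "double", length = 3)     ## filled with 0
chr_vec <- vector(mode = "character", length = 2)  ## filled with ""
lgl_vec <- vector(mode = "logical", length = 2)    ## filled with FALSE
lst_vec <- vector(mode = "list", length = 2)       ## filled with NULL

dbl_vec
chr_vec
lgl_vec
lst_vec
```

A list is the natural choice when each iteration produces something other than a single number, such as a fitted model.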
Exercise: The first two numbers of the Fibonacci Sequence are 0 and 1. Each succeeding number is the sum of the previous two numbers in the sequence. For example, the third element is 1 = 0 + 1, while the fourth element is 2 = 1 + 1, and the fifth element is 3 = 2 + 1. Use a for loop to calculate the first 100 Fibonacci Numbers. Sanity Check: The \(\log_2\) of the 100th Fibonacci Number is about 67.57.
Looping is often done over the columns of a data frame.
Note: for a data frame df, seq_along(df) is the same as 1:ncol(df), which is the same as 1:length(df) (since data frames are special cases of lists).
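A quick check of this equivalence on mtcars, as a small sketch (the variable names are illustrative):

```r
data("mtcars")

## A data frame is a list of columns, so its length is its number of columns
same_as_ncol <- identical(seq_along(mtcars), 1:ncol(mtcars))
same_as_length <- identical(seq_along(mtcars), 1:length(mtcars))
df_is_list <- is.list(mtcars)

same_as_ncol
same_as_length
df_is_list
```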
Let’s calculate the mean of each column of mtcars.
data("mtcars")
mean_vec <- vector(mode = "numeric", length = length(mtcars))
for (i in seq_along(mtcars)) {
  mean_vec[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
}
mean_vec
## [1] 20.0906 6.1875 230.7219 146.6875 3.5966 3.2172 17.8487 0.4375
## [9] 0.4062 3.6875 2.8125
colMeans(mtcars)
## mpg cyl disp hp drat wt qsec vs
## 20.0906 6.1875 230.7219 146.6875 3.5966 3.2172 17.8487 0.4375
## am gear carb
## 0.4062 3.6875 2.8125
Why not just use colMeans()? Well, there is no “colSDs” function, so iteration is important for applying non-implemented functions to multiple elements in R.
Exercise: Use a for loop to calculate the standard deviation of each penguin trait in the penguins data frame from the palmerpenguins package.
R is a functional programming language, which means that you can pass functions to other functions.
Suppose on mtcars we want to calculate the column-wise mean, the column-wise median, the column-wise standard deviation, the column-wise maximum, the column-wise minimum, and the column-wise MAD.
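The for loop for each of these summaries would look very similar. Only the function fun changes:
funvec <- rep(NA_real_, length.out = length(mtcars))
for (i in seq_along(funvec)) {
  ## fun is a placeholder for mean, median, sd, mad, min, or max
  funvec[[i]] <- fun(mtcars[[i]], na.rm = TRUE)
}
funvec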
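One way to avoid copying this loop six times is to wrap it in a function that takes the summary function as an argument. The sketch below is only an illustration in base R: col_summary() is a hypothetical helper, not a function from any package.

```r
## col_summary() is a hypothetical helper: it applies fun to every column of df.
## Extra arguments (such as na.rm = TRUE) are forwarded to fun through ...
col_summary <- function(df, fun, ...) {
  out <- vector(mode = "double", length = length(df))
  for (i in seq_along(df)) {
    out[[i]] <- fun(df[[i]], ...)
  }
  names(out) <- names(df)
  out
}

## The same loop now serves every summary; only the function passed in changes
col_sds <- col_summary(mtcars, sd, na.rm = TRUE)
col_mads <- col_summary(mtcars, mad, na.rm = TRUE)
col_sds
```

Passing sd or mad by name, without parentheses, hands the function itself to col_summary() rather than calling it.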
Ideally, we would like to just tell R what function to apply to each column of mtcars. This is what the purrr package allows us to do.
purrr is a part of the tidyverse, and so does not need to be loaded separately.
map_*() takes a vector (or list or data frame) as input, applies a provided function on each element of that vector, and outputs a vector of the same length.
map() returns a list.
map_lgl() returns a logical vector.
map_int() returns an integer vector.
map_dbl() returns a double vector.
map_chr() returns a character vector.
map_dbl(mtcars, mean)
map_dbl(mtcars, median)
map_dbl(mtcars, sd)
map_dbl(mtcars, mad)
map_dbl(mtcars, min)
map_dbl(mtcars, max)
You can pass additional arguments on to the function through map_*().
map_dbl(mtcars, mean, na.rm = TRUE)
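Any named argument placed after the function is forwarded to it on every call. As a small sketch with two other arguments (trim for mean() and probs for quantile(), chosen here only for illustration):

```r
library(purrr)

## 10% trimmed mean of each column: trim = 0.1 is passed on to mean()
trimmed_means <- map_dbl(mtcars, mean, trim = 0.1)

## 90th percentile of each column: probs = 0.9 is passed on to quantile()
q90 <- map_dbl(mtcars, quantile, probs = 0.9)

trimmed_means
q90
```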
Suppose you want to get the output of summary() on each column.
map(mtcars, summary)
Exercise (RDS 21.5.3.1): Write code that uses one of the map functions to:
nycflights13::flights.
palmerpenguins::penguins.
Instead of specifying a built-in function, you can create an anonymous function to map over.
Anonymous functions are non-named functions that are used as inputs to other functions.
Typically, they are one-liners and are of the form
function(args) code-using-args
E.g., an anonymous function that outputs the interquartile range of a vector is
function(x) quantile(x, 0.75) - quantile(x, 0.25)
## function(x) quantile(x, 0.75) - quantile(x, 0.25)
R 4.1 and above allows for a shorter syntax for anonymous functions.
\(x) quantile(x, 0.75) - quantile(x, 0.25)
## \(x) quantile(x, 0.75) - quantile(x, 0.25)
For example, the following are three equivalent ways to calculate the mean of each column in mtcars.
map_dbl(mtcars, mean)
map_dbl(mtcars, function(x) mean(x))
map_dbl(mtcars, \(x) mean(x))
You can think about this as purrr creating an anonymous function
.f <- function(.) {
  mean(.)
}
and then calling this function in map().
map_dbl(mtcars, .f)
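Anonymous functions become more useful when no single built-in function does exactly what you want. As a sketch, the interquartile-range function from earlier can be mapped over the columns of mtcars; the comparison against stats::IQR() is included only as a sanity check.

```r
library(purrr)

## Map the anonymous interquartile-range function over each column
iqr_manual <- map_dbl(mtcars, \(x) quantile(x, 0.75) - quantile(x, 0.25))

## Base R's IQR() computes the same quantity with its default settings
iqr_builtin <- map_dbl(mtcars, IQR)

all.equal(unname(iqr_manual), unname(iqr_builtin))
```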
Why is this useful? Consider the following chunk of code which allows us to fit many simple linear regression models:
mtcars |>
  nest(.by = cyl) |>
  mutate(lmout = map(data, \(df) lm(mpg ~ wt, data = df))) ->
  sumdf
nest(.by = cyl) will create a new data frame containing a list column of data frames, where each data frame has the same value of cyl for all units within that data frame.
\(df) lm(mpg ~ wt, data = df) defines a function (called an “anonymous function”) that will fit a linear model of mpg on wt where those variables are in the data frame df.
The map() call fits that linear model to each of the three data frames in the list-column called data created by nest().
The result is a new list-column called lmout that contains the three lm objects that you can use to get fits and summaries.
summary(sumdf$lmout[[1]])
##
## Call:
## lm(formula = mpg ~ wt, data = df)
##
## Residuals:
## 1 2 3 4 5 6 7
## -0.125 0.584 1.929 -0.690 0.355 -1.045 -1.008
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.41 4.18 6.79 0.0011
## wt -2.78 1.33 -2.08 0.0918
##
## Residual standard error: 1.17 on 5 degrees of freedom
## Multiple R-squared: 0.465, Adjusted R-squared: 0.357
## F-statistic: 4.34 on 1 and 5 DF, p-value: 0.0918
We can use map() to get a list of summaries.
sumdf |>
  mutate(sumlm = map(lmout, summary)) ->
  sumdf
sumdf$sumlm[[1]]
##
## Call:
## lm(formula = mpg ~ wt, data = df)
##
## Residuals:
## 1 2 3 4 5 6 7
## -0.125 0.584 1.929 -0.690 0.355 -1.045 -1.008
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.41 4.18 6.79 0.0011
## wt -2.78 1.33 -2.08 0.0918
##
## Residual standard error: 1.17 on 5 degrees of freedom
## Multiple R-squared: 0.465, Adjusted R-squared: 0.357
## F-statistic: 4.34 on 1 and 5 DF, p-value: 0.0918
If you want to extract the \(R^2\), you can do this using a map function as well:
sumdf$sumlm[[1]]$r.squared ## only gets one R^2 out.
## [1] 0.4645
## Gets all R^2 out
sumdf |>
  mutate(rsquared = map_dbl(sumlm, "r.squared")) ->
  sumdf
sumdf$rsquared
## [1] 0.4645 0.5086 0.4230
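Passing a string such as "r.squared" to a map function extracts the element with that name from each list entry. The sketch below shows that this is equivalent to an anonymous function using $. To stay self-contained it builds the models with base R's split() instead of nest(), which is an assumption of this example; split() orders the groups as 4, 6, 8 rather than 6, 4, 8.

```r
library(purrr)

## Fit one model per cyl group (split() is used here instead of nest())
models <- map(split(mtcars, mtcars$cyl), \(df) lm(mpg ~ wt, data = df))
summaries <- map(models, summary)

## String shortcut: pull out the element named "r.squared" from each summary
r2_by_name <- map_dbl(summaries, "r.squared")

## The same extraction written as an anonymous function
r2_by_fun <- map_dbl(summaries, \(s) s$r.squared)

identical(r2_by_name, r2_by_fun)
```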
Exercise: A \(t\)-test is used to test for differences in population means. R implements this with t.test(). For example, if I want to test for a difference between the mean mpg of automatics and manuals (coded in variable am), I would use the following syntax.
t.test(mpg ~ am, data = mtcars)$p.value
Use map() to get the \(p\)-value for this test within each group of cyl.
keep() and discard()
keep() selects all variables for which some function returns TRUE.
E.g. let’s keep all numeric variables and calculate their means in the palmerpenguins::penguins data frame.
library(palmerpenguins)
data("penguins")
penguins |>
  keep(is.numeric) |>
  map_dbl(mean, na.rm = TRUE)
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 43.92 17.15 200.92 4201.75
## year
## 2008.03
discard() will select all variables for which some function returns FALSE.
Let’s count the number of each value for each categorical variable:
penguins |>
  discard(is.numeric) |>
  map(table)
## $species
##
## Adelie Chinstrap Gentoo
## 152 68 124
##
## $island
##
## Biscoe Dream Torgersen
## 168 124 52
##
## $sex
##
## female male
## 165 168
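keep() and discard() are complements: with the same function, every variable ends up in exactly one of the two results. A small sketch of this, using the built-in iris data frame instead of penguins (a substitution made here only to avoid needing an extra package):

```r
library(purrr)

## Split the columns of iris by whether they are numeric
num_cols <- keep(iris, is.numeric)
other_cols <- discard(iris, is.numeric)

names(num_cols)
names(other_cols)

## Together the two pieces account for every column
length(num_cols) + length(other_cols) == length(iris)
```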
Other less useful functions are available in Section 21.9 of RDS.
Exercise: In the mtcars data frame, keep only variables that have a mean greater than 10 and calculate their mean. Hint: You’ll have to use some of the shortcuts above.