Many Models

Author

David Gerard

Published

June 3, 2025

Learning Objectives

Use a list-column to run many models at once.
Chapter 25 from RDS.

Motivation

The gapminder data

library(tidyverse)
library(broom)
library(gapminder)
data("gapminder")
glimpse(gapminder)

Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.80, 30.33, 32.00, 34.02, 36.09, 38.44, 39.85, 40.82, 41.6…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4, 820.9, 853.1, 836.2, 740.0, 786.1, 978.0, 852.4, 649.…

Suppose we want to look at how life expectancy has changed over time in each country:

gapminder %>%
  ggplot(aes(x = year, y = lifeExp, group = country)) +
  geom_line(alpha = 1/3) +
  xlab("Year") +
  ylab("Life Expectancy")

General trend is going up. But there are some countries where this doesn’t happen. How do we quantify the trend? We can do this with one country using a linear model:

gapminder %>%
  filter(country == "United States") ->
  usdf

ggplot(usdf, aes(x = year, y = lifeExp)) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_line(alpha = 1/3) +
  xlab("Year") +
  ylab("Life Expectancy")

`geom_smooth()` using formula = 'y ~ x'

us_lmout <- lm(lifeExp ~ year, data = usdf)
tidy_uslm <- tidy(us_lmout)
tidy_uslm

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept) -291.     13.8         -21.1 1.25e- 9
2 year           0.184   0.00696      26.5 1.37e-10

So each year, the US has been increasing its life expectancy by 0.1842 years.
How can we get these coefficient estimates for each country?
For the exercises in this lesson plan, you should log the pop variable now:
```
gapminder %>%
  mutate(logpop = log2(pop)) ->
  gapminder
```
Exercise: Create a line plot for year vs log-population for each country.
Exercise: For China, fit a linear model for log-population on year. Make sure the assumptions of the linear model are fulfilled. Interpret the coefficients

Nest

The nest() function from the tidyr package will turn a grouped data frame into a data frame where each group is a row. All of the observations are placed in the data variable as a data frame. Let’s look at an example.

gapminder %>%
  group_by(country, continent) %>%
  nest() ->
  nested_gap_df

nested_gap_df

# A tibble: 142 × 3
# Groups:   country, continent [142]
   country     continent data             
   <fct>       <fct>     <list>           
 1 Afghanistan Asia      <tibble [12 × 5]>
 2 Albania     Europe    <tibble [12 × 5]>
 3 Algeria     Africa    <tibble [12 × 5]>
 4 Angola      Africa    <tibble [12 × 5]>
 5 Argentina   Americas  <tibble [12 × 5]>
 6 Australia   Oceania   <tibble [12 × 5]>
 7 Austria     Europe    <tibble [12 × 5]>
 8 Bahrain     Asia      <tibble [12 × 5]>
 9 Bangladesh  Asia      <tibble [12 × 5]>
10 Belgium     Europe    <tibble [12 × 5]>
# ℹ 132 more rows

The data variable is a list. Each element is a data frame that contains all of the observations in the group we created by country.

Observations from Afghanistan

nested_gap_df$data[[1]]

# A tibble: 12 × 5
    year lifeExp      pop gdpPercap logpop
   <int>   <dbl>    <int>     <dbl>  <dbl>
 1  1952    28.8  8425333      779.   23.0
 2  1957    30.3  9240934      821.   23.1
 3  1962    32.0 10267083      853.   23.3
 4  1967    34.0 11537966      836.   23.5
 5  1972    36.1 13079460      740.   23.6
 6  1977    38.4 14880372      786.   23.8
 7  1982    39.9 12881816      978.   23.6
 8  1987    40.8 13867957      852.   23.7
 9  1992    41.7 16317921      649.   24.0
10  1997    41.8 22227415      635.   24.4
11  2002    42.1 25268405      727.   24.6
12  2007    43.8 31889923      975.   24.9

Observations from Albania

nested_gap_df$data[[2]]

# A tibble: 12 × 5
    year lifeExp     pop gdpPercap logpop
   <int>   <dbl>   <int>     <dbl>  <dbl>
 1  1952    55.2 1282697     1601.   20.3
 2  1957    59.3 1476505     1942.   20.5
 3  1962    64.8 1728137     2313.   20.7
 4  1967    66.2 1984060     2760.   20.9
 5  1972    67.7 2263554     3313.   21.1
 6  1977    68.9 2509048     3533.   21.3
 7  1982    70.4 2780097     3631.   21.4
 8  1987    72   3075321     3739.   21.6
 9  1992    71.6 3326498     2497.   21.7
10  1997    73.0 3428038     3193.   21.7
11  2002    75.7 3508512     4604.   21.7
12  2007    76.4 3600523     5937.   21.8

Notice that I used double brackets to extract the elements of data because it is a list.
Exercise: Extract the United States data frame from nested_gap_df. This code should be general so that the location of the US in the data frame does not matter.

Fit a model

We can now use purrr to fit a model on all of the elements of data.

nested_gap_df %>%
  mutate(lmout = map(data, ~lm(lifeExp ~ year, data = .))) ->
  nested_gap_df

Recall: the data = . argument says that we should place all of the elements from the data column in the nested_gap_df data frame as the data argument in lm().

Let’s look at the results

nested_gap_df

# A tibble: 142 × 4
# Groups:   country, continent [142]
   country     continent data              lmout 
   <fct>       <fct>     <list>            <list>
 1 Afghanistan Asia      <tibble [12 × 5]> <lm>  
 2 Albania     Europe    <tibble [12 × 5]> <lm>  
 3 Algeria     Africa    <tibble [12 × 5]> <lm>  
 4 Angola      Africa    <tibble [12 × 5]> <lm>  
 5 Argentina   Americas  <tibble [12 × 5]> <lm>  
 6 Australia   Oceania   <tibble [12 × 5]> <lm>  
 7 Austria     Europe    <tibble [12 × 5]> <lm>  
 8 Bahrain     Asia      <tibble [12 × 5]> <lm>  
 9 Bangladesh  Asia      <tibble [12 × 5]> <lm>  
10 Belgium     Europe    <tibble [12 × 5]> <lm>  
# ℹ 132 more rows

Each element of the lmout column is the model output from fitting lm() to each data frame in the data column.

## Afghanistan fit
nested_gap_df$lmout[[1]]


Call:
lm(formula = lifeExp ~ year, data = .)

Coefficients:
(Intercept)         year  
   -507.534        0.275

## Albania fit
nested_gap_df$lmout[[2]]


Call:
lm(formula = lifeExp ~ year, data = .)

Coefficients:
(Intercept)         year  
   -594.073        0.335

Exercise: For each country, fit a linear model of log-population on year. Save this fit as the lmpop column in nested_gap_df.

Get model summaries

Also use map to create columns of model summaries using tidy(), augment(), or glance().

nested_gap_df %>%
  mutate(tidyout = map(lmout, ~tidy(., conf.int = TRUE)),
         augmentout = map(lmout, augment),
         glanceout = map(lmout, glance)) ->
  nested_gap_df

We now have model summaries within all groups

## Afghanistan summary
nested_gap_df$tidyout[[1]]

# A tibble: 2 × 7
  term        estimate std.error statistic      p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>    <dbl>     <dbl>
1 (Intercept) -508.      40.5        -12.5 0.000000193  -598.     -417.   
2 year           0.275    0.0205      13.5 0.0000000984    0.230     0.321

## Albania summary
nested_gap_df$tidyout[[2]]

# A tibble: 2 × 7
  term        estimate std.error statistic    p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>    <dbl>     <dbl>
1 (Intercept) -594.      65.7        -9.05 0.00000394 -740.     -448.   
2 year           0.335    0.0332     10.1  0.00000146    0.261     0.409

Exercise: Use tidy() to get model summaries of your linear model fits of log-population on year. Add these summaries as another column in nested_gap_df.

Unnest

Use the unnest() function, followed by the ungroup() function, to unnest a single list column.

unnest(nested_gap_df, tidyout) %>%
  ungroup() ->
  model_summary_df

model_summary_df

# A tibble: 284 × 13
   country  continent data     lmout term  estimate std.error statistic  p.value
   <fct>    <fct>     <list>   <lis> <chr>    <dbl>     <dbl>     <dbl>    <dbl>
 1 Afghani… Asia      <tibble> <lm>  (Int… -5.08e+2  40.5        -12.5  1.93e- 7
 2 Afghani… Asia      <tibble> <lm>  year   2.75e-1   0.0205      13.5  9.84e- 8
 3 Albania  Europe    <tibble> <lm>  (Int… -5.94e+2  65.7         -9.05 3.94e- 6
 4 Albania  Europe    <tibble> <lm>  year   3.35e-1   0.0332      10.1  1.46e- 6
 5 Algeria  Africa    <tibble> <lm>  (Int… -1.07e+3  43.8        -24.4  3.07e-10
 6 Algeria  Africa    <tibble> <lm>  year   5.69e-1   0.0221      25.7  1.81e-10
 7 Angola   Africa    <tibble> <lm>  (Int… -3.77e+2  46.6         -8.08 1.08e- 5
 8 Angola   Africa    <tibble> <lm>  year   2.09e-1   0.0235       8.90 4.59e- 6
 9 Argenti… Americas  <tibble> <lm>  (Int… -3.90e+2   9.68       -40.3  2.14e-12
10 Argenti… Americas  <tibble> <lm>  year   2.32e-1   0.00489     47.4  4.22e-13
# ℹ 274 more rows
# ℹ 4 more variables: conf.low <dbl>, conf.high <dbl>, augmentout <list>,
#   glanceout <list>

You do this so you can plot some cool things.

library(ggthemes)
model_summary_df %>%
  filter(term == "year") %>%
  mutate(country = fct_reorder(country, estimate)) %>%
  ggplot(aes(x = country, y = estimate, color = continent)) +
  geom_point() +
  geom_segment(aes(x = country, xend = country, y = conf.low, yend = conf.high), 
               color = "black", 
               alpha = 1/3) +
  ylab("Estimate of Rate of Life Expectancy Increase") +
  xlab("Country") +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  scale_color_colorblind()

You can check all of the models’ qualities by looking at the residuals

unnest(nested_gap_df, augmentout) ->
  augment_df

augment_df %>%
  ggplot(aes(x = year, y = .resid, group = country)) +
  geom_line(alpha = 1/3) +
  facet_wrap(.~continent) +
  geom_hline(yintercept = 0, lty = 2, col = "blue") +
  xlab("Residuals") +
  ylab("Year")

Above, we see there is a lot of curvature we are not accounting for.
Exercise: Unnest the tidy() output from the linear model fit of log-population on year. Which countries have seen the greatest increase in population? Have any seen a decline?

`mtcars` exercise

Recall how we used the split() function in the iterators worksheet on the mtcars dataset to get the regression coefficient estimates of a regression of mpg on wt.

data("mtcars")
mtcars %>% 
  split(.$cyl) %>% 
  map(~lm(mpg ~ wt, data = .)) %>%
  map(~tidy(.)) %>%
  map_dbl(~.$estimate[2])

     4      6      8 
-5.647 -2.780 -2.192

Redo this exercise using the nest()-map()-unnest() workflow we just went through.