Many Models

Author

David Gerard

Published

June 3, 2025

Learning Objectives

  • Use a list-column to run many models at once.
  • Chapter 25 from RDS.

Motivation

  • The gapminder data

    library(tidyverse)
    library(broom)
    library(gapminder)
    data("gapminder")
    glimpse(gapminder)
    Rows: 1,704
    Columns: 6
    $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
    $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
    $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
    $ lifeExp   <dbl> 28.80, 30.33, 32.00, 34.02, 36.09, 38.44, 39.85, 40.82, 41.6…
    $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
    $ gdpPercap <dbl> 779.4, 820.9, 853.1, 836.2, 740.0, 786.1, 978.0, 852.4, 649.…
  • Suppose we want to look at how life expectancy has changed over time in each country:

    gapminder %>%
      ggplot(aes(x = year, y = lifeExp, group = country)) +
      geom_line(alpha = 1/3) +
      xlab("Year") +
      ylab("Life Expectancy")

  • General trend is going up. But there are some countries where this doesn’t happen. How do we quantify the trend? We can do this with one country using a linear model:

    gapminder %>%
      filter(country == "United States") ->
      usdf
    
    ggplot(usdf, aes(x = year, y = lifeExp)) +
      geom_line() +
      geom_smooth(method = "lm", se = FALSE) +
      geom_line(alpha = 1/3) +
      xlab("Year") +
      ylab("Life Expectancy")
    `geom_smooth()` using formula = 'y ~ x'

    us_lmout <- lm(lifeExp ~ year, data = usdf)
    tidy_uslm <- tidy(us_lmout)
    tidy_uslm
    # A tibble: 2 × 5
      term        estimate std.error statistic  p.value
      <chr>          <dbl>     <dbl>     <dbl>    <dbl>
    1 (Intercept) -291.     13.8         -21.1 1.25e- 9
    2 year           0.184   0.00696      26.5 1.37e-10
  • So each year, the US has been increasing its life expectancy by 0.1842 years.

  • How can we get these coefficient estimates for each country?

  • For the exercises in this lesson plan, you should log the pop variable now:

    gapminder %>%
      mutate(logpop = log2(pop)) ->
      gapminder
  • Exercise: Create a line plot for year vs log-population for each country.

  • Exercise: For China, fit a linear model for log-population on year. Make sure the assumptions of the linear model are fulfilled. Interpret the coefficients

Nest

  • The nest() function from the tidyr package will turn a grouped data frame into a data frame where each group is a row. All of the observations are placed in the data variable as a data frame. Let’s look at an example.

    gapminder %>%
      group_by(country, continent) %>%
      nest() ->
      nested_gap_df
    
    nested_gap_df
    # A tibble: 142 × 3
    # Groups:   country, continent [142]
       country     continent data             
       <fct>       <fct>     <list>           
     1 Afghanistan Asia      <tibble [12 × 5]>
     2 Albania     Europe    <tibble [12 × 5]>
     3 Algeria     Africa    <tibble [12 × 5]>
     4 Angola      Africa    <tibble [12 × 5]>
     5 Argentina   Americas  <tibble [12 × 5]>
     6 Australia   Oceania   <tibble [12 × 5]>
     7 Austria     Europe    <tibble [12 × 5]>
     8 Bahrain     Asia      <tibble [12 × 5]>
     9 Bangladesh  Asia      <tibble [12 × 5]>
    10 Belgium     Europe    <tibble [12 × 5]>
    # ℹ 132 more rows
  • The data variable is a list. Each element is a data frame that contains all of the observations in the group we created by country.

  • Observations from Afghanistan

    nested_gap_df$data[[1]]
    # A tibble: 12 × 5
        year lifeExp      pop gdpPercap logpop
       <int>   <dbl>    <int>     <dbl>  <dbl>
     1  1952    28.8  8425333      779.   23.0
     2  1957    30.3  9240934      821.   23.1
     3  1962    32.0 10267083      853.   23.3
     4  1967    34.0 11537966      836.   23.5
     5  1972    36.1 13079460      740.   23.6
     6  1977    38.4 14880372      786.   23.8
     7  1982    39.9 12881816      978.   23.6
     8  1987    40.8 13867957      852.   23.7
     9  1992    41.7 16317921      649.   24.0
    10  1997    41.8 22227415      635.   24.4
    11  2002    42.1 25268405      727.   24.6
    12  2007    43.8 31889923      975.   24.9
  • Observations from Albania

    nested_gap_df$data[[2]]
    # A tibble: 12 × 5
        year lifeExp     pop gdpPercap logpop
       <int>   <dbl>   <int>     <dbl>  <dbl>
     1  1952    55.2 1282697     1601.   20.3
     2  1957    59.3 1476505     1942.   20.5
     3  1962    64.8 1728137     2313.   20.7
     4  1967    66.2 1984060     2760.   20.9
     5  1972    67.7 2263554     3313.   21.1
     6  1977    68.9 2509048     3533.   21.3
     7  1982    70.4 2780097     3631.   21.4
     8  1987    72   3075321     3739.   21.6
     9  1992    71.6 3326498     2497.   21.7
    10  1997    73.0 3428038     3193.   21.7
    11  2002    75.7 3508512     4604.   21.7
    12  2007    76.4 3600523     5937.   21.8
  • Notice that I used double brackets to extract the elements of data because it is a list.

  • Exercise: Extract the United States data frame from nested_gap_df. This code should be general so that the location of the US in the data frame does not matter.

Fit a model

  • We can now use purrr to fit a model on all of the elements of data.

    nested_gap_df %>%
      mutate(lmout = map(data, ~lm(lifeExp ~ year, data = .))) ->
      nested_gap_df
  • Recall: the data = . argument says that we should place all of the elements from the data column in the nested_gap_df data frame as the data argument in lm().

  • Let’s look at the results

    nested_gap_df
    # A tibble: 142 × 4
    # Groups:   country, continent [142]
       country     continent data              lmout 
       <fct>       <fct>     <list>            <list>
     1 Afghanistan Asia      <tibble [12 × 5]> <lm>  
     2 Albania     Europe    <tibble [12 × 5]> <lm>  
     3 Algeria     Africa    <tibble [12 × 5]> <lm>  
     4 Angola      Africa    <tibble [12 × 5]> <lm>  
     5 Argentina   Americas  <tibble [12 × 5]> <lm>  
     6 Australia   Oceania   <tibble [12 × 5]> <lm>  
     7 Austria     Europe    <tibble [12 × 5]> <lm>  
     8 Bahrain     Asia      <tibble [12 × 5]> <lm>  
     9 Bangladesh  Asia      <tibble [12 × 5]> <lm>  
    10 Belgium     Europe    <tibble [12 × 5]> <lm>  
    # ℹ 132 more rows
  • Each element of the lmout column is the model output from fitting lm() to each data frame in the data column.

    ## Afghanistan fit
    nested_gap_df$lmout[[1]]
    
    Call:
    lm(formula = lifeExp ~ year, data = .)
    
    Coefficients:
    (Intercept)         year  
       -507.534        0.275  
    ## Albania fit
    nested_gap_df$lmout[[2]]
    
    Call:
    lm(formula = lifeExp ~ year, data = .)
    
    Coefficients:
    (Intercept)         year  
       -594.073        0.335  
  • Exercise: For each country, fit a linear model of log-population on year. Save this fit as the lmpop column in nested_gap_df.

Get model summaries

  • Also use map to create columns of model summaries using tidy(), augment(), or glance().

    nested_gap_df %>%
      mutate(tidyout = map(lmout, ~tidy(., conf.int = TRUE)),
             augmentout = map(lmout, augment),
             glanceout = map(lmout, glance)) ->
      nested_gap_df
  • We now have model summaries within all groups

    ## Afghanistan summary
    nested_gap_df$tidyout[[1]]
    # A tibble: 2 × 7
      term        estimate std.error statistic      p.value conf.low conf.high
      <chr>          <dbl>     <dbl>     <dbl>        <dbl>    <dbl>     <dbl>
    1 (Intercept) -508.      40.5        -12.5 0.000000193  -598.     -417.   
    2 year           0.275    0.0205      13.5 0.0000000984    0.230     0.321
    ## Albania summary
    nested_gap_df$tidyout[[2]]
    # A tibble: 2 × 7
      term        estimate std.error statistic    p.value conf.low conf.high
      <chr>          <dbl>     <dbl>     <dbl>      <dbl>    <dbl>     <dbl>
    1 (Intercept) -594.      65.7        -9.05 0.00000394 -740.     -448.   
    2 year           0.335    0.0332     10.1  0.00000146    0.261     0.409
  • Exercise: Use tidy() to get model summaries of your linear model fits of log-population on year. Add these summaries as another column in nested_gap_df.

Unnest

  • Use the unnest() function, followed by the ungroup() function, to unnest a single list column.

    unnest(nested_gap_df, tidyout) %>%
      ungroup() ->
      model_summary_df
    
    model_summary_df
    # A tibble: 284 × 13
       country  continent data     lmout term  estimate std.error statistic  p.value
       <fct>    <fct>     <list>   <lis> <chr>    <dbl>     <dbl>     <dbl>    <dbl>
     1 Afghani… Asia      <tibble> <lm>  (Int… -5.08e+2  40.5        -12.5  1.93e- 7
     2 Afghani… Asia      <tibble> <lm>  year   2.75e-1   0.0205      13.5  9.84e- 8
     3 Albania  Europe    <tibble> <lm>  (Int… -5.94e+2  65.7         -9.05 3.94e- 6
     4 Albania  Europe    <tibble> <lm>  year   3.35e-1   0.0332      10.1  1.46e- 6
     5 Algeria  Africa    <tibble> <lm>  (Int… -1.07e+3  43.8        -24.4  3.07e-10
     6 Algeria  Africa    <tibble> <lm>  year   5.69e-1   0.0221      25.7  1.81e-10
     7 Angola   Africa    <tibble> <lm>  (Int… -3.77e+2  46.6         -8.08 1.08e- 5
     8 Angola   Africa    <tibble> <lm>  year   2.09e-1   0.0235       8.90 4.59e- 6
     9 Argenti… Americas  <tibble> <lm>  (Int… -3.90e+2   9.68       -40.3  2.14e-12
    10 Argenti… Americas  <tibble> <lm>  year   2.32e-1   0.00489     47.4  4.22e-13
    # ℹ 274 more rows
    # ℹ 4 more variables: conf.low <dbl>, conf.high <dbl>, augmentout <list>,
    #   glanceout <list>
  • You do this so you can plot some cool things.

    library(ggthemes)
    model_summary_df %>%
      filter(term == "year") %>%
      mutate(country = fct_reorder(country, estimate)) %>%
      ggplot(aes(x = country, y = estimate, color = continent)) +
      geom_point() +
      geom_segment(aes(x = country, xend = country, y = conf.low, yend = conf.high), 
                   color = "black", 
                   alpha = 1/3) +
      ylab("Estimate of Rate of Life Expectancy Increase") +
      xlab("Country") +
      theme(axis.text.x = element_blank(),
            axis.ticks.x = element_blank()) +
      scale_color_colorblind()

  • You can check all of the models’ qualities by looking at the residuals

    unnest(nested_gap_df, augmentout) ->
      augment_df
    
    augment_df %>%
      ggplot(aes(x = year, y = .resid, group = country)) +
      geom_line(alpha = 1/3) +
      facet_wrap(.~continent) +
      geom_hline(yintercept = 0, lty = 2, col = "blue") +
      xlab("Residuals") +
      ylab("Year")

  • Above, we see there is a lot of curvature we are not accounting for.

  • Exercise: Unnest the tidy() output from the linear model fit of log-population on year. Which countries have seen the greatest increase in population? Have any seen a decline?

mtcars exercise

Recall how we used the split() function in the iterators worksheet on the mtcars dataset to get the regression coefficient estimates of a regression of mpg on wt.

data("mtcars")
mtcars %>% 
  split(.$cyl) %>% 
  map(~lm(mpg ~ wt, data = .)) %>%
  map(~tidy(.)) %>%
  map_dbl(~.$estimate[2])
     4      6      8 
-5.647 -2.780 -2.192 

Redo this exercise using the nest()-map()-unnest() workflow we just went through.