The gapminder data
library(tidyverse)
library(broom)
library(gapminder)
data("gapminder")
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.80, 30.33, 32.00, 34.02, 36.09, 38.44, 39.85, 40.82, 41.6…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4, 820.9, 853.1, 836.2, 740.0, 786.1, 978.0, 852.4, 649.…
Suppose we want to look at how life expectancy has changed over time in each country:
gapminder %>%
ggplot(aes(x = year, y = lifeExp, group = country)) +
geom_line(alpha = 1/3) +
xlab("Year") +
ylab("Life Expectancy")
General trend is going up. But there are some countries where this doesn’t happen. How do we quantify the trend? We can do this with one country using a linear model:
gapminder %>%
filter(country == "United States") ->
usdf
ggplot(usdf, aes(x = year, y = lifeExp)) +
geom_line() +
geom_smooth(method = "lm", se = FALSE) +
geom_line(alpha = 1/3) +
xlab("Year") +
ylab("Life Expectancy")
## `geom_smooth()` using formula = 'y ~ x'
us_lmout <- lm(lifeExp ~ year, data = usdf)
tidy_uslm <- tidy(us_lmout)
tidy_uslm
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -291. 13.8 -21.1 1.25e- 9
## 2 year 0.184 0.00696 26.5 1.37e-10
So each year, the US has been increasing its life expectancy by 0.1842 years.
How can we get these coefficient estimates for each country?
For the exercises in this lesson plan, you should log the
pop
variable now:
gapminder %>%
mutate(logpop = log2(pop)) ->
gapminder
Exercise: Create a line plot for year vs log-population for each country.
Exercise: For China, fit a linear model for log-population on year. Make sure the assumptions of the linear model are fulfilled. Interpret the coefficients
The nest()
function from the tidyr package will turn
a grouped data frame into a data frame where each group is a
row. All of the observations are placed in the data
variable as a data frame. Let’s look at an example.
gapminder %>%
group_by(country, continent) %>%
nest() ->
nested_gap_df
nested_gap_df
## # A tibble: 142 × 3
## # Groups: country, continent [142]
## country continent data
## <fct> <fct> <list>
## 1 Afghanistan Asia <tibble [12 × 5]>
## 2 Albania Europe <tibble [12 × 5]>
## 3 Algeria Africa <tibble [12 × 5]>
## 4 Angola Africa <tibble [12 × 5]>
## 5 Argentina Americas <tibble [12 × 5]>
## 6 Australia Oceania <tibble [12 × 5]>
## 7 Austria Europe <tibble [12 × 5]>
## 8 Bahrain Asia <tibble [12 × 5]>
## 9 Bangladesh Asia <tibble [12 × 5]>
## 10 Belgium Europe <tibble [12 × 5]>
## # … with 132 more rows
The data
variable is a list. Each element is a data
frame that contains all of the observations in the group we created by
country
.
Observations from Afghanistan
nested_gap_df$data[[1]]
## # A tibble: 12 × 5
## year lifeExp pop gdpPercap logpop
## <int> <dbl> <int> <dbl> <dbl>
## 1 1952 28.8 8425333 779. 23.0
## 2 1957 30.3 9240934 821. 23.1
## 3 1962 32.0 10267083 853. 23.3
## 4 1967 34.0 11537966 836. 23.5
## 5 1972 36.1 13079460 740. 23.6
## 6 1977 38.4 14880372 786. 23.8
## 7 1982 39.9 12881816 978. 23.6
## 8 1987 40.8 13867957 852. 23.7
## 9 1992 41.7 16317921 649. 24.0
## 10 1997 41.8 22227415 635. 24.4
## 11 2002 42.1 25268405 727. 24.6
## 12 2007 43.8 31889923 975. 24.9
Observations from Albania
nested_gap_df$data[[2]]
## # A tibble: 12 × 5
## year lifeExp pop gdpPercap logpop
## <int> <dbl> <int> <dbl> <dbl>
## 1 1952 55.2 1282697 1601. 20.3
## 2 1957 59.3 1476505 1942. 20.5
## 3 1962 64.8 1728137 2313. 20.7
## 4 1967 66.2 1984060 2760. 20.9
## 5 1972 67.7 2263554 3313. 21.1
## 6 1977 68.9 2509048 3533. 21.3
## 7 1982 70.4 2780097 3631. 21.4
## 8 1987 72 3075321 3739. 21.6
## 9 1992 71.6 3326498 2497. 21.7
## 10 1997 73.0 3428038 3193. 21.7
## 11 2002 75.7 3508512 4604. 21.7
## 12 2007 76.4 3600523 5937. 21.8
Notice that I used double brackets to extract the elements of
data
because it is a list.
Exercise: Extract the United States data frame
from nested_gap_df
. This code should be general so
that the location of the US in the data frame does not matter.
We can now use purrr to fit a model on all of the elements of
data
.
nested_gap_df %>%
mutate(lmout = map(data, ~lm(lifeExp ~ year, data = .))) ->
nested_gap_df
Recall: the data = .
argument says that we should
place all of the elements from the data
column in the
nested_gap_df
data frame as the data
argument
in lm()
.
Let’s look at the results
nested_gap_df
## # A tibble: 142 × 4
## # Groups: country, continent [142]
## country continent data lmout
## <fct> <fct> <list> <list>
## 1 Afghanistan Asia <tibble [12 × 5]> <lm>
## 2 Albania Europe <tibble [12 × 5]> <lm>
## 3 Algeria Africa <tibble [12 × 5]> <lm>
## 4 Angola Africa <tibble [12 × 5]> <lm>
## 5 Argentina Americas <tibble [12 × 5]> <lm>
## 6 Australia Oceania <tibble [12 × 5]> <lm>
## 7 Austria Europe <tibble [12 × 5]> <lm>
## 8 Bahrain Asia <tibble [12 × 5]> <lm>
## 9 Bangladesh Asia <tibble [12 × 5]> <lm>
## 10 Belgium Europe <tibble [12 × 5]> <lm>
## # … with 132 more rows
Each element of the lmout
column is the model output
from fitting lm()
to each data frame in the
data
column.
## Afghanistan fit
nested_gap_df$lmout[[1]]
##
## Call:
## lm(formula = lifeExp ~ year, data = .)
##
## Coefficients:
## (Intercept) year
## -507.534 0.275
## Albania fit
nested_gap_df$lmout[[2]]
##
## Call:
## lm(formula = lifeExp ~ year, data = .)
##
## Coefficients:
## (Intercept) year
## -594.073 0.335
Exercise: For each country, fit a linear model
of log-population on year. Save this fit as the lmpop
column in nested_gap_df
.
Also use map to create columns of model summaries using
tidy()
, augment()
, or
glance()
.
nested_gap_df %>%
mutate(tidyout = map(lmout, ~tidy(., conf.int = TRUE)),
augmentout = map(lmout, augment),
glanceout = map(lmout, glance)) ->
nested_gap_df
We now have model summaries within all groups
## Afghanistan summary
nested_gap_df$tidyout[[1]]
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -508. 40.5 -12.5 0.000000193 -598. -417.
## 2 year 0.275 0.0205 13.5 0.0000000984 0.230 0.321
## Albania summary
nested_gap_df$tidyout[[2]]
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -594. 65.7 -9.05 0.00000394 -740. -448.
## 2 year 0.335 0.0332 10.1 0.00000146 0.261 0.409
Exercise: Use tidy()
to get model
summaries of your linear model fits of log-population on year. Add these
summaries as another column in nested_gap_df
.
Use the unnest()
function, followed by the
ungroup()
function, to unnest a single list column.
unnest(nested_gap_df, tidyout) %>%
ungroup() ->
model_summary_df
model_summary_df
## # A tibble: 284 × 13
## country continent data lmout term estimate std.e…¹ stati…² p.value
## <fct> <fct> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia <tibble> <lm> (Int… -5.08e+2 4.05e+1 -12.5 1.93e- 7
## 2 Afghanistan Asia <tibble> <lm> year 2.75e-1 2.05e-2 13.5 9.84e- 8
## 3 Albania Europe <tibble> <lm> (Int… -5.94e+2 6.57e+1 -9.05 3.94e- 6
## 4 Albania Europe <tibble> <lm> year 3.35e-1 3.32e-2 10.1 1.46e- 6
## 5 Algeria Africa <tibble> <lm> (Int… -1.07e+3 4.38e+1 -24.4 3.07e-10
## 6 Algeria Africa <tibble> <lm> year 5.69e-1 2.21e-2 25.7 1.81e-10
## 7 Angola Africa <tibble> <lm> (Int… -3.77e+2 4.66e+1 -8.08 1.08e- 5
## 8 Angola Africa <tibble> <lm> year 2.09e-1 2.35e-2 8.90 4.59e- 6
## 9 Argentina Americas <tibble> <lm> (Int… -3.90e+2 9.68e+0 -40.3 2.14e-12
## 10 Argentina Americas <tibble> <lm> year 2.32e-1 4.89e-3 47.4 4.22e-13
## # … with 274 more rows, 4 more variables: conf.low <dbl>, conf.high <dbl>,
## # augmentout <list>, glanceout <list>, and abbreviated variable names
## # ¹​std.error, ²​statistic
You do this so you can plot some cool things.
library(ggthemes)
model_summary_df %>%
filter(term == "year") %>%
mutate(country = fct_reorder(country, estimate)) %>%
ggplot(aes(x = country, y = estimate, color = continent)) +
geom_point() +
geom_segment(aes(x = country, xend = country, y = conf.low, yend = conf.high),
color = "black",
alpha = 1/3) +
ylab("Estimate of Rate of Life Expectancy Increase") +
xlab("Country") +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
scale_color_colorblind()
You can check all of the models’ qualities by looking at the residuals
unnest(nested_gap_df, augmentout) ->
augment_df
augment_df %>%
ggplot(aes(x = year, y = .resid, group = country)) +
geom_line(alpha = 1/3) +
facet_wrap(.~continent) +
geom_hline(yintercept = 0, lty = 2, col = "blue") +
xlab("Residuals") +
ylab("Year")
Above, we see there is a lot of curvature we are not accounting for.
Exercise: Unnest the tidy()
output
from the linear model fit of log-population on year. Which countries
have seen the greatest increase in population? Have any seen a
decline?
mtcars
exerciseRecall how we used the split()
function in the iterators
worksheet on the mtcars
dataset to get the regression
coefficient estimates of a regression of mpg
on
wt
.
data("mtcars")
mtcars %>%
split(.$cyl) %>%
map(~lm(mpg ~ wt, data = .)) %>%
map(~tidy(.)) %>%
map_dbl(~.$estimate[2])
## 4 6 8
## -5.647 -2.780 -2.192
Redo this exercise using the
nest()
-map()
-unnest()
workflow we
just went through.