library(broom)
For the most popular types of model output (t-tests, linear models, generalized linear models), the broom package provides three functions that aid in data analysis:
tidy(): Provides summary information about the model (such as parameter estimates and \(p\)-values) in a tidy format. We used this last class.
augment(): Creates data derived from the model, such as residuals and fitted values, and adds it to the original data in a tidy format.
glance(): Prints a single-row summary of the model fit.
These three functions are very useful and incorporate well with the tidyverse.
You will see examples of using these functions for the linear models below.
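As a minimal sketch of how the three functions are called (the particular model here, mpg on weight, is just for illustration; the lesson's models come below):
fit <- lm(mpg ~ wt, data = mtcars)  # mtcars ships with R
tidy(fit)     # coefficient-level summary: one row per term
augment(fit)  # observation-level: the original data plus columns like .fitted and .resid
glance(fit)   # model-level: a single-row summary of the fit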
mtcars
For this lesson, we will use the (infamous) mtcars
dataset that comes with R by default.
library(tidyverse)
data("mtcars")
glimpse(mtcars)
## Rows: 32
## Columns: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
It’s “infamous” because every R class uses it. But it’s a really nice dataset to showcase some statistical methods.
Our goal with this dataset is to determine which variables affect fuel consumption (the mpg variable).
To begin, we’ll look at the association between the variables log-weight and mpg (more on why we used log-weight later).
ggplot(mtcars, aes(x = wt, mpg)) +
geom_point() +
scale_x_log10() +
xlab("Weight") +
ylab("Miles Per Gallon")
Log-weight appears to be negatively associated with mpg, and the data approximately fall on a line.
Every line may be represented by a formula of the form \[ Y = \beta_0 + \beta_1 X \]
\(Y\) = response variable on the \(y\)-axis
\(X\) = explanatory variable on the \(x\)-axis
\(\beta_1\) = slope (rise over run)
\(\beta_0\) = \(y\)-intercept (the value of the line at \(X = 0\))
You can represent any line in terms of its slope and its \(y\)-intercept:
If \(\beta_1 < 0\) then the line slopes downward. If \(\beta_1 > 0\) then the line slopes upward. If \(\beta_1 = 0\) then the line is horizontal.
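As a quick sketch (the intercept 2 and the slopes 4, 0, and -4 are made-up values for illustration), you can draw lines with the same intercept but different slopes to see these three cases:
library(tidyverse)
tibble(x = seq(0, 5, by = 0.1)) %>%
  mutate(up = 2 + 4 * x,         # slope 4: upward sloping
         flat = 2 + 0 * x,       # slope 0: horizontal
         down = 2 - 4 * x) %>%   # slope -4: downward sloping
  pivot_longer(-x, names_to = "slope", values_to = "y") %>%
  ggplot(aes(x = x, y = y, color = slope)) +
  geom_line()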
Exercise: Suppose we consider the line defined by the following equation: \[ Y = 2 + 4X \] What are the intercept and the slope of this line? Does the line slope upward or downward?
A line does not exactly fit the mtcars dataset, but a line does approximate the mtcars data.
Model: Response variable = line + noise. \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \]
We typically assume that the noise (\(\epsilon_i\)’s) for each individual has mean 0 and some variance \(\sigma^2\). We estimate \(\sigma^2\).
LINEAR MODEL IN A NUTSHELL:
Given \(X_i\), the mean of \(Y_i\) is \(\beta_0 + \beta_1 X_i\). Individual points vary about this mean.
Some intuition:
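Here is a small simulation sketch (the values \(\beta_0 = 2\), \(\beta_1 = 4\), and \(\sigma = 3\) are made up for illustration): given each \(X_i\), the \(Y_i\) are centered on the line \(\beta_0 + \beta_1 X_i\) but scatter about it.
set.seed(1)
x <- runif(100, min = 0, max = 5)               # explanatory variable
y <- 2 + 4 * x + rnorm(100, mean = 0, sd = 3)   # beta0 + beta1 * x + noise
ggplot(tibble(x, y), aes(x = x, y = y)) +
  geom_point() +
  geom_abline(intercept = 2, slope = 4)         # the true mean line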
Interpretation: \(\beta_1\) is the expected (average) change in \(Y\) associated with a one-unit increase in \(X\).
Exercise: What is the interpretation of \(\beta_0\)?
How do we estimate \(\beta_0\) and \(\beta_1\)?
Ordinary Least Squares (OLS): choose the intercept and slope that minimize the sum of squared vertical distances between the observed points and the line.
(Plots compare a bad fit, a better fit, and the best fit: the OLS fit.)
How to find OLS fits in R:
Make sure you have the explanatory variables in the format you want:
mtcars %>%
mutate(logwt = log(wt)) ->
mtcars
Use lm()
lmout <- lm(mpg ~ logwt, data = mtcars)
lmtide <- tidy(lmout)
select(lmtide, term, estimate)
## # A tibble: 2 × 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 39.3
## 2 logwt -17.1
The first argument in lm()
is a
formula, where the response variable is to the left of
the tilde and the explanatory variable is to the right of the tilde.
response ~ explanatory
This formula says that lm()
should find the OLS
estimates of the following model:
response = \(\beta_0\) + \(\beta_1\)explanatory + noise
The data
argument tells lm()
where to
find the response and explanatory variables.
We often put a “hat” over the coefficient names to denote that they are estimates: here \(\hat{\beta}_0 \approx 39.3\) and \(\hat{\beta}_1 \approx -17.1\).
Thus, the estimated line is \[ \hat{\text{mpg}} = 39.3 - 17.1 \log(\text{wt}). \]
Exercise: Fit a linear model of displacement on weight. What are the regression coefficient estimates? Interpret them.
We assume that the variance of \(Y_i\) is the same for each \(X_i\).
Call this variance \(\sigma^2\).
We estimate it by the variability in the residuals.
The variance of residuals is the estimated variance in the data. This is a general technique.
In R, use the broom function glance() to get the estimated standard deviation. It's the value in the sigma column.
glance(lmout) %>%
select(sigma)
## # A tibble: 1 × 1
## sigma
## <dbl>
## 1 2.67
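As a sanity check (a sketch, not shown in the output above), you can compute this by hand: lm() estimates \(\sigma\) by the square root of the sum of squared residuals divided by \(n - 2\).
# should match glance(lmout)$sigma (about 2.67)
sqrt(sum(residuals(lmout)^2) / df.residual(lmout))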
Estimating the variance/standard deviation is important because
it is a component in the standard error of \(\hat{\beta}_1\) and \(\hat{\beta}_0\). These standard errors are
output by tidy()
.
tidy(lmout) %>%
select(term, std.error)
## # A tibble: 2 × 2
## term std.error
## <chr> <dbl>
## 1 (Intercept) 1.76
## 2 logwt 1.51
The variance is also used when calculating prediction intervals.
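For example (a quick sketch; making predictions with predict() is covered in more detail below), asking predict() for a prediction interval uses this estimated variance:
# 95% prediction interval (columns fit, lwr, upr) for a car with log-weight 1
predict(lmout, newdata = data.frame(logwt = 1), interval = "prediction")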
The sign of \(\beta_1\) indicates the type of relationship between the two quantitative variables: negative (\(\beta_1 < 0\)), no relationship (\(\beta_1 = 0\)), or positive (\(\beta_1 > 0\)).
Hypothesis Testing: We usually test \(H_0: \beta_1 = 0\) (no association between \(X\) and \(Y\)) against \(H_A: \beta_1 \neq 0\) (some association).
The sampling distribution of \(\hat{\beta}_1\) comes from statistical theory. The \(t\)-statistic is \(\hat{\beta}_1 / SE(\hat{\beta}_1)\). It has a \(t\)-distribution with \(n-2\) degrees of freedom.
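The test statistic and \(p\)-value for the test of \(H_0: \beta_1 = 0\) appear in the statistic and p.value columns of tidy() (a sketch; the output is not reproduced here):
tidy(lmout) %>%
  select(term, statistic, p.value)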
The confidence intervals for \(\beta_0\) and \(\beta_1\) are easy to obtain from the
output of tidy()
if you set
conf.int = TRUE
.
lmtide <- tidy(lmout, conf.int = TRUE)
select(lmtide, conf.low, conf.high)
## # A tibble: 2 × 2
## conf.low conf.high
## <dbl> <dbl>
## 1 35.7 42.8
## 2 -20.2 -14.0
Exercise (8.18 in OpenIntro 4th Edition): Which is higher? Determine if (i) or (ii) is higher or if they are equal. Explain your reasoning. For a regression line, the uncertainty associated with the slope estimate, \(\hat{\beta}_1\), is higher when
Exercise: Load the bac dataset from the openintro R package.
Create an appropriate plot to visualize the association between the number of beers consumed and BAC.
Does the relationship appear positive or negative?
Write out the equation of the OLS line.
Do we have evidence that the number of beers is associated with BAC? Formally justify.
Interpret the coefficient estimates.
What happens to the standard errors of the estimates when we force the intercept to be 0? Why?
Definitions
Interpolation: making a prediction at an explanatory-variable value within the range of the observed data.
Extrapolation: making a prediction at an explanatory-variable value outside the range of the observed data.
Why is extrapolation bad? The fitted line only approximates the relationship over the range of the data we observed; outside that range the relationship may look completely different, so extrapolated predictions can be wildly wrong.
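As a quick sketch with the fit from above: the heaviest car in mtcars weighs about 5,400 lbs, so predicting at a log-weight of 5 (a weight of roughly \(e^5 \approx 148\) thousand lbs) is a huge extrapolation, and the line returns a nonsensical negative fuel efficiency.
# about 39.3 - 17.1 * 5 = -46 mpg, which is impossible
predict(lmout, newdata = data.frame(logwt = 5))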
To make a prediction, you need a data frame with a variable that has the exact same name as the explanatory variable.
newdf <- tribble(~logwt,
1,
1.5)
Then you use the predict()
function to obtain
predictions.
newdf %>%
mutate(predictions = predict(object = lmout, newdata = newdf)) ->
newdf
newdf
## # A tibble: 2 × 2
## logwt predictions
## <dbl> <dbl>
## 1 1 22.2
## 2 1.5 13.6
Exercise: Derive the predictions above by hand
(not using predict()
).
Exercise: from the BAC dataset, suppose someone
had 5 beers. Use predict()
to predict their BAC.
The linear model has many assumptions.
You should always check these assumptions.
Assumptions in decreasing order of importance:
1. Linearity: the mean of \(Y_i\) is a linear function of \(X_i\).
2. Independence: knowing one observation's error tells you nothing about another's.
3. Equal variance: the variance of \(Y_i\) is the same for every value of \(X_i\).
4. Normality: the errors (\(\epsilon_i\)'s) are normally distributed.
Problems when violated:
1. Linearity violated: the fitted line, and every conclusion built on it, can be misleading.
2. Independence violated: the standard errors, and hence the \(p\)-values and confidence intervals, are unreliable.
3. Equal variance violated: the standard errors, and hence the \(p\)-values and confidence intervals, are unreliable.
4. Normality violated: prediction intervals are unreliable (tests and confidence intervals are usually fine in moderately large samples).
Exercise: What assumptions are made about the distribution of the explanatory variable (the \(x_i\)’s)?
To evaluate independence, think about the problem and how the data were collected.
Example of non-independence: The temperature today and the temperature tomorrow. If it is warm today, it is probably warm tomorrow.
Example of non-independence: You are collecting a survey. To obtain individuals, you select a house at random and then ask all participants in this house to answer the survey. The participants’ responses inside each house are probably not independent because they probably share similar beliefs/backgrounds/situations.
Example of independence: You are collecting a survey. To obtain individuals, you randomly dial phone numbers until an individual picks up the phone.
Evaluate the remaining issues by plotting the residuals.
The residuals are the observed values minus the predicted values. \[ r_i = y_i - \hat{y}_i \]
In the linear model, \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_i\).
Obtain the residuals by using augment()
from broom.
They will be the .resid
variable.
aout <- augment(lmout)
glimpse(aout)
## Rows: 32
## Columns: 9
## $ .rownames <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive…
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2,…
## $ logwt <dbl> 0.9632, 1.0561, 0.8416, 1.1678, 1.2355, 1.2413, 1.2726, 1.1…
## $ .fitted <dbl> 22.80, 21.21, 24.88, 19.30, 18.15, 18.05, 17.51, 19.44, 19.…
## $ .resid <dbl> -1.79984, -0.21293, -2.07761, 2.09684, 0.55260, 0.05164, -3…
## $ .hat <dbl> 0.03930, 0.03263, 0.05637, 0.03193, 0.03539, 0.03582, 0.038…
## $ .sigma <dbl> 2.694, 2.715, 2.686, 2.686, 2.713, 2.715, 2.646, 2.548, 2.6…
## $ .cooksd <dbl> 9.677e-03, 1.109e-04, 1.917e-02, 1.051e-02, 8.149e-04, 7.21…
## $ .std.resid <dbl> -0.68788, -0.08110, -0.80119, 0.79833, 0.21077, 0.01970, -1…
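As a quick check using the first row above: the Mazda RX4 has an observed mpg of 21.0 and a fitted value of about 22.80, so its residual is \[ r_1 = 21.0 - 22.80 \approx -1.80, \] which matches the first entry of .resid.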
You should always make the following scatterplots: the residuals versus the fitted values, and the residuals versus each explanatory variable. The residuals always go on the \(y\)-axis.
In the simple linear model, you can probably evaluate these issues by plotting the data (\(x_i\) vs \(y_i\)). But residual plots generalize to much more complicated models, whereas just plotting the data does not.
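A sketch of the most important of these plots, the residuals versus the fitted values, using the aout data frame from above (look for curvature and for unequal spread):
ggplot(aout, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  xlab("Fitted Values") +
  ylab("Residuals")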
Generate fake data:
set.seed(1)
x <- rexp(100)
x <- x - min(x) + 0.5
y <- log(x) * 20 + rnorm(100, sd = 4)
df_fake <- tibble(x, y)
Curved (but always increasing) relationship between \(x\) and \(y\).
Variance looks equal for all \(x\)
Residual plot has a parabolic shape.
Solution: These indicate a \(\log\) transformation of \(x\) could help.
df_fake %>%
mutate(logx = log(x)) ->
df_fake
lm_fake <- lm(y ~ logx, data = df_fake)
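To confirm the fix worked (a sketch, assuming the lm_fake fit above), re-make the residual plot from the new fit; the parabolic shape should be gone:
augment(lm_fake) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)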
Generate fake data:
set.seed(1)
x <- rnorm(100)
y <- -x^2 + rnorm(100)
df_fake <- tibble(x, y)
Curved relationship between \(x\) and \(y\)
Sometimes the relationship is increasing, sometimes it is decreasing.
Variance looks equal for all \(x\)
Residual plot has a parabolic form.
Solution: Include a squared term in the model (or hire a statistician).
lm_fake <- lm(y ~ I(x^2), data = df_fake)  # I() so that ^2 means squaring, not formula crossing
Generate fake data:
set.seed(1)
x <- rnorm(100)
y <- exp(x + rnorm(100, sd = 1/2))
df_fake <- tibble(x, y)
Curved relationship between \(x\) and \(y\)
Variance looks like it increases as \(y\) increases
Residual plot has a parabolic form.
Residual plot variance looks larger to the right and smaller to the left.
Solution: Take a log-transformation of \(y\).
df_fake %>%
mutate(logy = log(y)) ->
df_fake
lm_fake <- lm(logy ~ x, data = df_fake)
Generate fake data:
set.seed(1)
x <- runif(100) * 10
y <- 0.85 * x + rnorm(100, sd = (x - 5) ^ 2)
df_fake <- tibble(x, y)
Linear relationship between \(x\) and \(y\).
Variance is different for different values of \(x\).
Residual plots really good at showing this.
Solution: The modern solution is to use sandwich estimates of the standard errors (hire a statistician).
library(sandwich)
lm_fake <- lm(y ~ x, data = df_fake)
semat <- sandwich(lm_fake)  # heteroscedasticity-robust (sandwich) covariance matrix of the coefficient estimates
tidy(lm_fake) %>%
mutate(sandwich_se = sqrt(diag(semat)),
sandwich_t = estimate / sandwich_se,
sandwich_p = 2 * pt(-abs(sandwich_t), df = df.residual(lm_fake)))
## # A tibble: 2 × 8
## term estimate std.error statistic p.value sandwich_se sandwi…¹ sandw…²
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.86 2.01 -1.43 0.157 2.78 -1.03 0.307
## 2 x 1.37 0.345 3.97 0.000137 0.508 2.70 0.00827
## # … with abbreviated variable names ¹sandwich_t, ²sandwich_p
Exercise: From the mtcars
data
frame, run a regression of MPG on displacement. Evaluate the regression
fit and fix any issues. Interpret the coefficient estimates.
Exercise: Consider the data frame
case0801
from the Sleuth3 R package:
library(Sleuth3)
data("case0801")
Find an appropriate linear regression model for relating the effect of island area on species number. Find the regression estimates. Interpret them.
Generally, when you use logs, you interpret associations on a multiplicative scale instead of an additive scale.
No log: A one-unit increase in \(x\) is associated with a \(\beta_1\)-unit change in \(y\).
Log \(x\): A doubling of \(x\) is associated with a \(\beta_1 \log(2)\)-unit change in \(y\).
Log \(y\): A one-unit increase in \(x\) is associated with multiplying the predicted \(y\) by \(e^{\beta_1}\).
Log both: A doubling of \(x\) is associated with multiplying the predicted \(y\) by \(2^{\beta_1}\).
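For example, a worked sketch using the log-weight fit from earlier (\(x\) logged, \(y\) not, \(\hat{\beta}_1 \approx -17.1\)): \[ \hat{\beta}_1 \log(2) = -17.1 \times 0.693 \approx -11.9, \] so cars that weigh twice as much are estimated to get about 11.9 fewer miles per gallon.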
Exercise: Re-interpret the regression
coefficient estimates you calculated using the case0801
dataset.
Exercise: Re-interpret the regression coefficient
estimates you calculated in the mtcars
data frame when you
were exploring the effect of displacement on mpg.
broom functions and output columns used in this lesson:
augment(): $.resid, $.fitted
tidy(): $term, $estimate, $std.error, $statistic, $p.value
glance(): $r.squared, $AIC, $BIC