Learning Objectives

Motivation

Broom

library(broom)

mtcars

Line Review

Simple Linear Regression Model

Estimating Coefficients

Estimating Variance

Hypothesis Testing

Prediction (Interpolation)

Assumptions

Assumptions and Violations

  • The linear model has many assumptions.

  • You should always check these assumptions.

  • Assumptions in decreasing order of importance

    1. Linearity - The relationship looks like a straight line.
    2. Independence - The knowledge of the value of one observation does not give you any information on the value of another.
    3. Equal Variance - The spread is the same for every value of \(x\).
    4. Normality - The distribution of the errors isn’t too skewed and there aren’t any extreme outliers. (Only an issue if you have outliers and a small number of observations, because of the central limit theorem.)
  • Problems when violated

    1. Linearity violated - Linear regression line does not pick up actual relationship. Results aren’t meaningful.
    2. Independence violated - Linear regression line is unbiased, but standard errors are off. Your \(p\)-values are too small.
    3. Equal Variance violated - Linear regression line is unbiased, but standard errors are off. Your \(p\)-values may be too small, or too large.
    4. Normality violated - Unstable results if outliers are present and sample size is small. Not usually a big deal.
  • Exercise: What assumptions are made about the distribution of the explanatory variable (the \(x_i\)’s)?
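The claim that violated independence makes \(p\)-values too small can be seen in a small simulation (a sketch, not from the notes): autocorrelated errors paired with a time-trend predictor whose true slope is zero.

```r
# Sketch: when errors are autocorrelated (independence violated), the usual
# lm() standard errors are too small, so we reject a true H0 far too often.
set.seed(1)
nsim <- 500
n    <- 100
pvals <- replicate(nsim, {
  x <- 1:n                                           # time-trend predictor
  e <- as.numeric(arima.sim(list(ar = 0.9), n = n))  # AR(1) errors, not independent
  y <- e                                             # true slope on x is 0
  summary(lm(y ~ x))$coefficients[2, 4]              # p-value for the slope
})
mean(pvals < 0.05)  # far above the nominal 0.05
```

At the 0.05 level we should falsely reject about 5% of the time; with these correlated errors the rejection rate is many times that.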

Evaluating Independence

  • Think about the problem.

    • Were different responses measured on the same observational/experimental unit?
    • Were data collected in groups?
  • Example of non-independence: The temperature today and the temperature tomorrow. If it is warm today, it is probably warm tomorrow.

  • Example of non-independence: You are collecting a survey. To obtain individuals, you select a house at random and then ask all participants in this house to answer the survey. The participants’ responses inside each house are probably not independent because they probably share similar beliefs/backgrounds/situations.

  • Example of independence: You are collecting a survey. To obtain individuals, you randomly dial phone numbers until an individual picks up the phone.

Evaluating other assumptions

  • Evaluate issues by plotting the residuals.

  • The residuals are the observed values minus the predicted values. \[ r_i = y_i - \hat{y}_i \]

  • In the linear model, \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_i\).

  • Obtain the residuals by using augment() from broom. They will be the .resid variable.

    lmout <- lm(mpg ~ logwt, data = mutate(mtcars, logwt = log(wt)))  # reconstructed fit: mpg on log(wt) (matches the output below)
    aout <- augment(lmout)
    glimpse(aout)
    ## Rows: 32
    ## Columns: 9
    ## $ .rownames  <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive…
    ## $ mpg        <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2,…
    ## $ logwt      <dbl> 0.9632, 1.0561, 0.8416, 1.1678, 1.2355, 1.2413, 1.2726, 1.1…
    ## $ .fitted    <dbl> 22.80, 21.21, 24.88, 19.30, 18.15, 18.05, 17.51, 19.44, 19.…
    ## $ .resid     <dbl> -1.79984, -0.21293, -2.07761, 2.09684, 0.55260, 0.05164, -3…
    ## $ .hat       <dbl> 0.03930, 0.03263, 0.05637, 0.03193, 0.03539, 0.03582, 0.038…
    ## $ .sigma     <dbl> 2.694, 2.715, 2.686, 2.686, 2.713, 2.715, 2.646, 2.548, 2.6…
    ## $ .cooksd    <dbl> 9.677e-03, 1.109e-04, 1.917e-02, 1.051e-02, 8.149e-04, 7.21…
    ## $ .std.resid <dbl> -0.68788, -0.08110, -0.80119, 0.79833, 0.21077, 0.01970, -1…
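As a sanity check (a sketch using an arbitrary fit to mtcars, not necessarily the one above), `.resid` matches the by-hand definition \(r_i = y_i - \hat{y}_i\):

```r
library(broom)
fit <- lm(mpg ~ wt, data = mtcars)        # any lm fit works here
r_byhand <- mtcars$mpg - predict(fit)     # y_i - yhat_i
all.equal(unname(r_byhand), augment(fit)$.resid)  # TRUE
```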
  • You should always make the following scatterplots. The residuals always go on the \(y\)-axis.

    • Fits \(\hat{y}_i\) vs residuals \(r_i\).
    • Response \(y_i\) vs residuals \(r_i\).
    • Explanatory variable \(x_i\) vs residuals \(r_i\).
  • In the simple linear model, you can probably evaluate these issues by plotting the data (\(x_i\) vs \(y_i\)). But residual plots generalize to much more complicated models, whereas just plotting the data does not.
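The three plots can be sketched with ggplot2 as follows (assuming the mpg-on-log(wt) fit shown in the augment() output above):

```r
library(broom)
library(ggplot2)
library(dplyr)
df_cars <- mutate(mtcars, logwt = log(wt))
lmout   <- lm(mpg ~ logwt, data = df_cars)
aout    <- augment(lmout)
# Residuals always on the y-axis:
ggplot(aout, aes(x = .fitted, y = .resid)) + geom_point() + geom_hline(yintercept = 0)  # fits vs residuals
ggplot(aout, aes(x = mpg,     y = .resid)) + geom_point() + geom_hline(yintercept = 0)  # response vs residuals
ggplot(aout, aes(x = logwt,   y = .resid)) + geom_point() + geom_hline(yintercept = 0)  # explanatory vs residuals
```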

Example 1: A perfect residual plot


  • The mean trend is a straight line
  • Residuals seem to be centered at 0 for all \(x\)
  • Variance looks equal for all \(x\)
  • Everything looks perfect

Example 2: Curved Monotone Relationship, Equal Variances

  • Generate fake data:

    set.seed(1)
    x <- rexp(100)
    x <- x - min(x) + 0.5
    y <- log(x) * 20 + rnorm(100, sd = 4)
    df_fake <- tibble(x, y)

  • Curved (but always increasing) relationship between \(x\) and \(y\).

  • Variance looks equal for all \(x\)

  • Residual plot has a parabolic shape.

  • Solution: These features indicate that a \(\log\) transformation of \(x\) could help.

    df_fake %>%
      mutate(logx = log(x)) ->
      df_fake
    lm_fake <- lm(y ~ logx, data = df_fake)

Example 3: Curved Non-monotone Relationship, Equal Variances

  • Generate fake data:

    set.seed(1)
    x <- rnorm(100)
    y <- -x^2 + rnorm(100)
    df_fake <- tibble(x, y)

  • Curved relationship between \(x\) and \(y\)

  • Sometimes the relationship is increasing, sometimes it is decreasing.

  • Variance looks equal for all \(x\)

  • Residual plot has a parabolic form.

  • Solution: Include a squared term in the model (or hire a statistician).

    lmout <- lm(y ~ x + I(x^2), data = df_fake)  # I() needed: a bare x^2 in a formula means just x
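One caveat worth a sketch: in an R formula, `^` is formula syntax (interaction crossing) rather than arithmetic, so `y ~ x^2` fits the same model as `y ~ x`. Wrap the square in `I()` (or use `poly()`) to actually include a squared term.

```r
set.seed(1)
x <- rnorm(100)
y <- -x^2 + rnorm(100)
df_fake <- data.frame(x, y)
fit_wrong <- lm(y ~ x^2, data = df_fake)         # formula x^2 collapses to x
fit_right <- lm(y ~ x + I(x^2), data = df_fake)  # I() protects the arithmetic
length(coef(fit_wrong))  # 2: intercept and x only
length(coef(fit_right))  # 3: intercept, x, and x^2
```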

Example 4: Curved Relationship, Variance Increases with \(Y\)

  • Generate fake data:

    set.seed(1)
    x <- rnorm(100)
    y <- exp(x + rnorm(100, sd = 1/2))
    df_fake <- tibble(x, y)

  • Curved relationship between \(x\) and \(y\)

  • Variance looks like it increases as \(y\) increases

  • Residual plot has a parabolic form.

  • Residual plot variance looks larger to the right and smaller to the left.

  • Solution: Take a log-transformation of \(y\).

    df_fake %>%
      mutate(logy = log(y)) ->
      df_fake
    lm_fake <- lm(logy ~ x, data = df_fake)

Example 5: Linear Relationship, Equal Variances, Skewed Distribution


  • Straight line relationship between \(x\) and \(y\).
  • Variances about equal for all \(x\)
  • Skew for all \(x\)
  • Residual plots show skew.
  • Solution: Do nothing, but report skew (usually OK to do)

Example 6: Linear Relationship, Unequal Variances

  • Generate fake data:

    set.seed(1)
    x <- runif(100) * 10
    y <- 0.85 * x + rnorm(100, sd = (x - 5) ^ 2)
    df_fake <- tibble(x, y)

  • Linear relationship between \(x\) and \(y\).

  • Variance is different for different values of \(x\).

  • Residual plots are really good at showing this.

  • Solution: The modern solution is to use sandwich estimates of the standard errors (hire a statistician).

    library(sandwich)
    lm_fake <- lm(y ~ x, data = df_fake)
    semat <- sandwich(lm_fake)
    tidy(lm_fake) %>%
      mutate(sandwich_se = sqrt(diag(semat)),
             sandwich_t  = estimate / sandwich_se,
             sandwich_p  = 2 * pt(-abs(sandwich_t), df = df.residual(lm_fake)))
    ## # A tibble: 2 × 8
    ##   term        estimate std.error statistic  p.value sandwich_se sandwi…¹ sandw…²
    ##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>       <dbl>    <dbl>   <dbl>
    ## 1 (Intercept)    -2.86     2.01      -1.43 0.157          2.78     -1.03 0.307  
    ## 2 x               1.37     0.345      3.97 0.000137       0.508     2.70 0.00827
    ## # … with abbreviated variable names ¹​sandwich_t, ²​sandwich_p
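Equivalently (a sketch assuming the lmtest package is installed alongside sandwich), `coeftest()` reports the robust tests in one call:

```r
library(sandwich)
library(lmtest)  # assumption: lmtest is installed
set.seed(1)
x <- runif(100) * 10
y <- 0.85 * x + rnorm(100, sd = (x - 5)^2)
lm_fake <- lm(y ~ x)
coeftest(lm_fake, vcov. = sandwich)  # robust (sandwich) SEs and p-values
```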

Some Exercises

Interpreting Coefficients when you use logs

Summary of R commands