This review assumes that you have had some basic statistics course and have seen all of these concepts in the past.
If you haven’t taken a basic statistics course, you will find this review very difficult and incomplete.
If you are having difficulty remembering these topics, then you can review them in the free OpenIntro book: https://leanpub.com/openintro-statistics. The most pertinent sections (from the third edition) are 3.1, 4.1, 4.2, 4.3, 5.1, 5.2, 5.3, 6.1, 6.2, and 6.5.
Distribution: The possible values of a variable and how often it takes those values.
A density describes the distribution of a quantitative variable. You can think of it as approximating a histogram. It is a curve where the area under the curve between any two values is the proportion of observations between those two values, and where the total area under the curve is 1.
The density of birthweights in America:
The distributions of many variables in Statistics approximate the normal distribution.
Normal densities with different means.
Normal densities with different standard deviations
Density Function (height of curve, NOT probability of a value).
dnorm(x = 2, mean = 1, sd = 1)
## [1] 0.242
Random Generation (generate samples from a given normal distribution).
samp <- rnorm(n = 1000, mean = 1, sd = 1)
head(samp)
## [1] 0.3735 1.1836 0.1644 2.5953 1.3295 0.1795
Cumulative Distribution Function (probability of being less than or equal to some value).
pnorm(q = 2, mean = 1, sd = 1)
## [1] 0.8413
Quantile function (find the value that has a given probability of being less than or equal to it).
qnorm(p = 0.8413, mean = 1, sd = 1)
## [1] 2
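Two quick checks tie these functions together (using the same mean-1, sd-1 normal as above): the CDF gives interval probabilities by subtraction, and qnorm() is the inverse of pnorm().

```r
## P(0 < X < 2) for X ~ N(1, 1): subtract two CDF values
pnorm(q = 2, mean = 1, sd = 1) - pnorm(q = 0, mean = 1, sd = 1)
## [1] 0.6827

## qnorm() undoes pnorm(): we recover the original value 2
qnorm(p = pnorm(q = 2, mean = 1, sd = 1), mean = 1, sd = 1)
## [1] 2
```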
Exercise: Use rnorm() to generate 10,000 random draws from a normal distribution with mean 5 and standard deviation 2. What proportion are less than 3? Can you think up a way to approximate this proportion using a different function?
Exercise: In Hong Kong, human male height is approximately normally distributed with mean 171.5 cm and standard deviation 5.5 cm. What proportion of the Hong Kong population is between 170 cm and 180 cm?
The \(t\)-distribution shows up a lot in Statistics.
\(t\)-distributions with different degrees of freedom:
Density Function
dt(x = -6, df = 2)
## [1] 0.004269
Random Generation
samp <- rt(n = 1000, df = 2)
head(samp)
## [1] 0.89857 -1.07176 0.09639 0.79371 -0.42428 -0.64561
Cumulative Distribution Function
pt(q = 2, df = 2)
## [1] 0.9082
Quantile Function
qt(p = 0.9082, df = 2)
## [1] 1.999
There are many other distributions implemented in R. To see the most common, run:
help("Distributions")
Exercise: Calculate the 0.75 quantile of the \(t\)-distribution for degrees of freedom \(1, 2, 3,\ldots,50\). Reproduce this plot:
Can you use a function to come up with the asymptotic value of the 0.75 quantile as the degrees of freedom approaches infinity?
Observational/experimental Units: The people/places/things/animals/groups that we collect information about. Also known as “individuals” or “cases”. Sometimes I just say “units”.
Variable: A property of the observational/experimental units.
Value: The specific level of a variable for an observational/experimental unit.
Quantitative Variable: The variable takes on numerical values where arithmetic operations (plus/minus/divide/times) make sense.
Categorical Variable: The variable puts observational/experimental units into different groups/categories based on the values of that variable.
Binary Variable: A categorical variable that takes on only two values.
Population: The collection of all observational units we are interested in.
Parameter: A numerical summary of the population.
Sample: A subset of the population (some observational units, but not all of them).
Statistic: A numeric summary of the sample.
Graphic:
Sampling Distribution: The distribution of a statistic over many hypothetical random samples from the population.
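To make the definition concrete, here is a small simulation (my own sketch, not from the original notes): repeatedly draw samples from a known population and look at the distribution of the sample mean across those samples.

```r
set.seed(1)
## Population: normal with mean 5, sd 2. Draw 1000 samples of size 30
## and compute the sample mean of each one.
sampmeans <- replicate(n = 1000, mean(rnorm(n = 30, mean = 5, sd = 2)))

## The sampling distribution of the mean centers at the population mean,
## with standard deviation about sd / sqrt(n) = 2 / sqrt(30).
mean(sampmeans)
sd(sampmeans)
```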
All of Statistics: We see a pattern in the sample. Is that pattern real (present in the population), or is it just an artifact of random sampling?
We will go through a lot of examples of this below.
For the examples below, we will use the data from the Sleuth3 package in R.
library(Sleuth3)
library(tidyverse)
library(broom)
Exercise: Read the help page of the ex0126 data frame from the Sleuth3 package. What are the observational units? What are the variables? Which are quantitative and which are categorical?
Exercise: Read the help page of the ex0223 data frame from the Sleuth3 package. What are the observational units? What are the variables? Which are quantitative and which are categorical?
Example: Researchers measured the volume of the left hippocampus in 15 pairs of twins where one twin had schizophrenia and the other did not. They were interested in whether the left hippocampus differed in size between the unaffected and schizophrenic twins.
Observational Units: The twin pairs.
Population: All twin pairs where one twin has schizophrenia and the other does not.
Sample: The 15 twin pairs in our study.
Variable: The difference in volume of the left hippocampus between the twins. We derived this quantitative variable by subtracting one volume from the other.
case0202 %>%
as_tibble() %>%
mutate(diff = Unaffected - Affected) %>%
select(diff) ->
schizo
glimpse(schizo)
## Rows: 15
## Columns: 1
## $ diff <dbl> 0.67, -0.19, 0.09, 0.19, 0.13, 0.40, 0.04, 0.10, 0.50, 0.07, 0.23…
Pattern: Use a histogram/boxplot to visualize the shift from 0.
ggplot(schizo, aes(x = diff)) +
geom_histogram(bins = 15, fill = "white", color = "black") +
geom_vline(xintercept = 0, lty = 2) +
xlab("Difference in Brain Volumes")
Graphic:
Parameter of interest: Mean difference in left hippocampus volumes for all twins.
Estimate: Use sample mean
schizo %>%
summarize(meandiff = mean(diff))
## # A tibble: 1 × 1
## meandiff
## <dbl>
## 1 0.199
0.199 is our “best guess” for the parameter, but it is almost certainly not the value of the parameter (since we didn’t measure everyone).
Hypothesis Testing:
Graphic:
The distribution of possible null sample means is given by statistical theory. Specifically, the \(t\)-statistic (the sample mean divided by the estimated standard deviation of the sampling distribution of the mean, i.e. the standard error) has a \(t\) distribution with \(n - 1\) degrees of freedom (\(n\) is the sample size). This works as long as your data aren't too skewed, or your sample size is large enough.
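As a sketch of what t.test() computes under the hood, here is the \(t\)-statistic and its \(p\)-value by hand. The data are simulated stand-ins (the point is the arithmetic, not the dataset):

```r
set.seed(1)
x <- rnorm(n = 15, mean = 0.2, sd = 0.2)  # stand-in data, n = 15

## t-statistic: sample mean divided by its estimated standard error
tstat <- mean(x) / (sd(x) / sqrt(length(x)))

## two-sided p-value from a t distribution with n - 1 degrees of freedom
pval <- 2 * pt(q = -abs(tstat), df = length(x) - 1)

## these match the statistic and p-value reported by t.test()
tout <- t.test(x)
c(tstat, tout$statistic)
c(pval, tout$p.value)
```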
Function: t.test()
tout <- t.test(schizo$diff)
tout
##
## One Sample t-test
##
## data: schizo$diff
## t = 3.2, df = 14, p-value = 0.006
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.0667 0.3306
## sample estimates:
## mean of x
## 0.1987
The tidy() function from the broom package will format the output of common procedures into a convenient data frame.
tdf <- tidy(tout)
tdf$estimate
## mean of x
## 0.1987
tdf$p.value
## [1] 0.006062
We often want a range of “likely” values. These are called confidence intervals. t.test() will return these confidence intervals, giving the lowest and highest likely values for the mean difference in volumes:
tdf$conf.low
## [1] 0.0667
tdf$conf.high
## [1] 0.3306
Interpreting confidence intervals:
Graphic:
Intuition: Statistical theory tells us that the sample mean will be within (approximately) 2 standard deviations of the population mean in 95% of repeated samples. This is two standard deviations of the sampling distribution of the sample mean, not two standard deviations of the sample. So we just add and subtract (approximately) two standard deviations of the sampling distribution from the sample mean.
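A quick simulation (my own, not from the notes) illustrating the coverage interpretation: across repeated samples from the same population, about 95% of the intervals returned by t.test() contain the true mean.

```r
set.seed(1)
truemean <- 5
covered <- replicate(n = 2000, {
  samp <- rnorm(n = 20, mean = truemean, sd = 2)
  ci <- t.test(samp)$conf.int
  ci[1] <= truemean & truemean <= ci[2]  # did this interval capture the truth?
})
mean(covered)  # should be close to 0.95
```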
Exercise: An investor sued his broker for lack of diversification. Below are the rates of return (in percent) for the investor’s portfolio over 39 months (data from Moore, McCabe, and Craig (2017)). The average of the S&P 500 stock index for the same period was 0.95%. Does the broker perform worse than average? Explore the data, set up hypotheses, and run the appropriate test.
stock <- tibble(return = c(-8.36, 1.63, -2.27, -2.93, -2.70,
-2.93, -9.14, -2.64, 6.82, -2.35,
-3.58, 6.13, 7.00, -15.25, -8.66,
-1.03, -9.16, -1.25, -1.22, -10.27,
-5.11, -0.80, -1.44, 1.28, -0.65,
4.34, 12.22, -7.21, -0.09, 7.34,
5.04, -7.24, -2.14, -1.01, -1.41,
12.03, -2.53, 4.33, 1.35))
Example: Beak depths were measured on Daphne Major finches in 1976 (before a harsh drought) and in 1978 (after a harsh drought). The researchers hypothesized that finches with deeper beaks were more likely to survive.
data("case0201")
case0201 %>%
mutate(Year = as.factor(Year)) ->
finch
glimpse(finch)
## Rows: 178
## Columns: 2
## $ Year <fct> 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976…
## $ Depth <dbl> 6.2, 6.8, 7.1, 7.1, 7.4, 7.8, 7.9, 8.0, 8.2, 8.4, 8.4, 8.4, 8.5,…
Observational Units: The finches.
Population: All finches.
Sample: The 178 finches that the researchers measured.
Variables: The year the finch was measured (binary/categorical) and the beak depth (quantitative). Possible to also treat year as quantitative.
Pattern: Use a boxplot to see if the groups differ.
ggplot(finch, aes(x = Year, y = Depth)) +
geom_boxplot(coef = Inf) +
geom_jitter(alpha = 1/3)
Parameter of interest: Difference in mean beak depths between 1976 finches and 1978 finches.
Estimate: The difference in mean beak depths between 1976 finches and 1978 finches in our sample.
finch %>%
group_by(Year) %>%
summarize(meandepth = mean(Depth)) %>%
spread(key = Year, value = meandepth) %>%
mutate(diff = `1978` - `1976`) ## do you remember why we need back-ticks?
## # A tibble: 1 × 3
## `1976` `1978` diff
## <dbl> <dbl> <dbl>
## 1 9.47 10.1 0.669
Hypothesis Test:
Graphic:
The distribution of possible null sample means comes from statistical theory. The \(t\)-statistic has a \(t\) distribution with a complicated degrees-of-freedom formula (the Welch-Satterthwaite approximation).
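That complicated degrees-of-freedom formula is the Welch-Satterthwaite approximation. A sketch on simulated stand-in groups (my own toy data) showing it matches what t.test() uses by default:

```r
set.seed(1)
x <- rnorm(n = 89, mean = 9.5, sd = 1)   # stand-in for one group
y <- rnorm(n = 89, mean = 10.1, sd = 1)  # stand-in for the other group

## Welch-Satterthwaite degrees of freedom from the per-group variances
vx <- var(x) / length(x)
vy <- var(y) / length(y)
df_welch <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

## matches the df that t.test() reports (it does not assume equal variances)
c(df_welch, t.test(x, y)$parameter)
```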
Function: t.test(). The quantitative variable goes to the left of the tilde and the binary variable goes to the right of the tilde.
tout <- t.test(Depth ~ Year, data = finch)
tdf <- tidy(tout)
tdf$estimate
## [1] -0.6685
tdf$p.value
## [1] 8.739e-06
t.test() also returns a 95% confidence interval for the difference in means. This has the exact same interpretation as in the previous section.
c(tdf$conf.low, tdf$conf.high)
## [1] -0.9564 -0.3806
Assumptions (in decreasing order of importance):
Exercise: The Armed Forces Qualifying Test is designed for evaluating the suitability of military recruits. There are different subjects tested: arithmetic, word knowledge, paragraph comprehension, and mathematics. They also provide a composite score. These data are stored in the ex0222 data frame from the Sleuth3 package. For which subjects do we have evidence that the genders differ? Make appropriate exploratory plots, set up the appropriate hypotheses, and run appropriate hypothesis tests.
Assumptions of \(t\)-tools in decreasing order of importance:
Hire a statistician if the independence assumption is violated.
If data are very skewed (violating assumption 2) and the sample size is small and the quantitative variable is always positive, then take a log to make the data more symmetric.
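A small illustration (simulated, right-skewed data of my own) of why the log transform helps: positive, skewed data become roughly symmetric on the log scale.

```r
set.seed(1)
x <- exp(rnorm(n = 100, mean = 2, sd = 1))  # log-normal: heavily right-skewed

## raw scale: mean pulled well above the median by the long right tail
c(mean = mean(x), median = median(x))

## log scale: mean and median nearly equal, indicating symmetry
c(mean = mean(log(x)), median = median(log(x)))
```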
Example: Researchers seeded clouds with either silver iodide or a control substance and measured rainfall.
data("case0301")
ggplot(case0301, aes(x = Treatment, y = Rainfall)) +
geom_boxplot()
ggplot(case0301, aes(x = Treatment, y = Rainfall)) +
geom_boxplot() +
scale_y_log10()
case0301 %>%
mutate(logRainfall = log(Rainfall)) %>%
t.test(logRainfall ~ Treatment, data = .) %>%
tidy() ->
tout
Inference is then stated in terms of medians and ratios of medians, since the log of the median is the median of the log (this is not true of means). Equivalently, exponentiating the median of the logged data recovers the median on the original scale. Also, when data are symmetric, the median equals the mean.
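You can verify the key fact numerically: because the log is monotone, it commutes with the median, but the analogous statement for means is false (the data here are simulated).

```r
set.seed(1)
x <- exp(rnorm(n = 1001, mean = 2, sd = 1))  # positive, skewed data

## log commutes with the median ...
log(median(x))
median(log(x))

## ... but not with the mean
log(mean(x))
mean(log(x))
```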
\(p\)-value from t.test():
tout$p.value
## [1] 0.01408
Estimate ratio of median rainfall in seeded group to median rainfall in unseeded group:
exp(tout$estimate)
## [1] 3.139
Confidence interval for ratio of median rainfall in seeded group to median rainfall in unseeded group:
exp(c(low = tout$conf.low, high = tout$conf.high))
## low high
## 1.272 7.742
Exercise: Read up on the ex0330 dataset from the Sleuth3 R package. Determine if education level is associated with income. Interpret any estimates and confidence intervals you derive.
Example: A 2013 Gallup poll (https://news.gallup.com/poll/161198/favor-russian-nuclear-arms-reductions.aspx) surveyed 1,028 U.S. adults. Of the respondents, 576 said they support nuclear arms reduction.
Observational Units: U.S. adults
Population: All U.S. adults.
Sample: The 1028 U.S. adults in the survey.
Variable: Did they support nuclear arms reduction (yes or no). This is a binary variable.
Pattern: Calculate sample proportion.
576 / 1028
## [1] 0.5603
Parameter of interest: Proportion of U.S. adults who support nuclear arms reduction.
Estimate: the sample proportion, 0.56.
Hypothesis Testing:
Graphic:
The distribution of possible null sample proportions comes from statistical theory. The number of successes has a binomial distribution with success probability 0.5 and size parameter equal to the sample size. The sample proportion is the number of successes divided by the sample size.
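As a sketch of where the exact \(p\)-value comes from, we can work directly with pbinom(). Doubling the tail probability is one common two-sided convention; with a symmetric null (probability 0.5) it agrees with binom.test().

```r
## Under H0: p = 0.5 with n = 1028, how surprising is 576 successes?
n <- 1028
x <- 576

## upper-tail probability P(X >= 576), doubled for a two-sided test
2 * pbinom(q = x - 1, size = n, prob = 0.5, lower.tail = FALSE)
```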
Function: prop.test() (when you have a large number of both successes and failures) or binom.test() (for any number of successes and failures).
bout <- tidy(binom.test(x = 576, n = 1028, p = 0.5))
bout %>%
select(estimate, p.value, conf.low, conf.high)
## # A tibble: 1 × 4
## estimate p.value conf.low conf.high
## <dbl> <dbl> <dbl> <dbl>
## 1 0.560 0.000123 0.529 0.591
pout <- tidy(prop.test(x = 576, n = 1028, p = 0.5))
pout %>%
select(estimate, p.value, conf.low, conf.high)
## # A tibble: 1 × 4
## estimate p.value conf.low conf.high
## <dbl> <dbl> <dbl> <dbl>
## 1 0.560 0.000125 0.529 0.591
Exercise (from OpenIntro): The 2010 General Social Survey asked 1,259 US residents: “Do you think the use of marijuana should be made legal, or not?” 48% of the respondents said it should be made legal. Set up appropriate hypotheses and run a test of whether the population proportion differs from 0.5.
Example: 86 lung cancer patients and 86 controls were interviewed on their smoking history in a 1954 study. The data can be found at: https://data-science-master.github.io/lectures/data/smoke.csv.
smoke <- read_csv(file = "../data/smoke.csv")
glimpse(smoke)
## Rows: 172
## Columns: 3
## $ individual <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ smoke <chr> "smoker", "smoker", "smoker", "smoker", "smoker", "smoker",…
## $ cancer <chr> "cancer", "cancer", "control", "cancer", "cancer", "cancer"…
Usually, for two categorical variables, data are presented/provided in two-way tables:
tabdat <- table(smoke$smoke, smoke$cancer)
tabdat
##
## cancer control
## nonsmoker 3 14
## smoker 83 72
addmargins(tabdat)
##
## cancer control Sum
## nonsmoker 3 14 17
## smoker 83 72 155
## Sum 86 86 172
Or two-way tables of proportions.
prop.table(tabdat)
##
## cancer control
## nonsmoker 0.01744 0.08140
## smoker 0.48256 0.41860
Observational Units: People.
Population: All people.
Sample: The 172 people in the study.
Variables: The smoking status of the person (smoke) and the cancer status of the person (cancer).
Pattern: Visualize by calculating conditional proportions. For example, the following code calculates the proportion of cancer cases that are smokers, and the proportion of controls that are smokers.
prop.table(tabdat, margin = 2)
##
## cancer control
## nonsmoker 0.03488 0.16279
## smoker 0.96512 0.83721
If you need a fancy plot for your boss, try a mosaic plot:
library(ggthemes) # provides scale_fill_colorblind()
as.data.frame(tabdat) %>%
ggplot(aes(y = Freq, x = Var2, fill = Var1)) +
geom_col(position = "fill") +
scale_fill_colorblind(name = "Smoking\nStatus") +
xlab("Cancer Status") +
ylab("Proportion")
Intuition: If the two variables are not related, then we would expect about the same proportion of smokers in each group. But if these conditional proportions are very different, then the two variables might be related.
Null hypothesis: cancer status and smoking status are independent.
Alternative hypothesis: cancer status and smoking status are dependent.
Function: fisher.test() tests whether we see more smokers in one group versus the other. Or, symmetrically, whether we see more cancer cases in the smoker group versus the nonsmoker group. We won’t get into the test statistic or the null distribution of this test.
tidy(fisher.test(tabdat)) %>%
select(p.value, estimate)
## # A tibble: 1 × 2
## p.value estimate
## <dbl> <dbl>
## 1 0.00882 0.188
The value in the “estimate” column is the odds ratio. It’s the multiplicative change in the odds of being a nonsmoker when you consider the cancer versus control groups. Specifically \[ \frac{3/86}{83/86} \div \frac{14/86}{72/86} \]
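For reference, the sample odds ratio from the table simplifies as shown below (my arithmetic). Note that fisher.test() reports a conditional maximum-likelihood estimate, which is why it prints 0.188 rather than exactly this value.

```r
## odds of nonsmoker in the cancer group divided by
## odds of nonsmoker in the control group; the /86 terms cancel
(3 / 83) / (14 / 72)
## [1] 0.1859
```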
You might see a chi-squared test in the wild. It is only appropriate when there is a large number of counts in each cell. Since we have computers, I don’t see a reason to prefer it over Fisher’s exact test, but most of the time it is fine to use:
tidy(chisq.test(tabdat)) %>%
select(p.value)
## # A tibble: 1 × 1
## p.value
## <dbl>
## 1 0.0106
Exercise: Researchers collected information on women 60 years old or older. They asked them about their sexual activity and their coffee-drinking status. Data are from: Diokno, Ananias C., Morton B. Brown, and A. Regula Herzog. “Sexual function in the elderly.” Archives of Internal Medicine 150, no. 1 (1990): 197-200. The table is below:
tabdat <- as.table(matrix(c(15, 115, 25, 70), nrow = 2))
dimnames(tabdat) <- list(coffee = c("yes", "no"), sex = c("yes", "no"))
tabdat
## sex
## coffee yes no
## yes 15 25
## no 115 70
Does there appear to be an association between coffee drinking and sexual activity? Create a table to explore this association. Then run a hypothesis test. What is the null hypothesis?