Biostatistics
Dr. Lea Hildebrandt
We will first focus on modeling categorical relationships (of variables that are qualitative!).
These data are usually expressed in terms of counts.
Example: Candy types
Bag of candy: 30 chocolates, 33 licorices, and 37 gumballs.
Is the distribution fair (i.e. does each candy type make up 1/3 of the bag), and is the fact that there are only 30 chocolates just a random accident?
What is the likelihood that the counts would come out this way if the true probability of each candy type is 1/3?
The \(\chi^2\) test assesses whether the observed counts differ from the counts expected under the null hypothesis (\(H_0\)).
\[ \chi^2 = \sum_i\frac{(observed_i - expected_i)^2}{expected_i} \]
The null hypothesis in our example is that the proportion of each type of candy is equal (1/3, i.e. an expected count of ~33.33 out of the 100 candies).
If we plug in our values from above, we would calculate \(\chi^2\) like this:
\[ \chi^2 = \frac{(30 - 33.33)^2}{33.33} + \frac{(33 - 33.33)^2}{33.33} + \frac{(37 - 33.33)^2}{33.33} = 0.74 \]
On its own, the \(\chi^2\) statistic is not interpretable: we have to compare it to its sampling distribution under \(H_0\).
The shape of the chi-squared distribution depends on the degrees of freedom, i.e. the number of independent values that enter the sum.
For the candy example, we use a chi-squared distribution with \(df = k - 1\) (\(k\) being 3 for the 3 candy categories; we lose one degree of freedom because the expected counts are constrained to sum to the total number of candies). If we look at this distribution and find \(\chi^2 = 0.74\) on the x-axis, we see that it does not fall far into the tail of the distribution but rather lies in the middle. If we calculate the p-value, we get \(P(\chi^2 > 0.74) = 0.691\).
It is thus not particularly surprising to find this distribution of candies and we would not reject \(H_0\) (equal proportions).
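In R, a minimal sketch of this goodness-of-fit test could look like this (chisq.test() assumes equal expected proportions of 1/3 by default; the counts are taken from the example above):
# chi-squared goodness-of-fit test for the candy counts
candyCounts <- c(chocolate = 30, licorice = 33, gumball = 37)
chisq.test(candyCounts)
# X-squared = 0.74, df = 2, p-value = 0.691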
The \(\chi^2\) test is also used to test whether two categorical variables are related to each other.
Example: Are Black drivers more likely to be searched by police than white drivers when they are pulled over?
We have two variables: skin color (Black vs. white) and being searched (true vs. false). We can represent the data in a contingency table:
searched | Black | White | Black (relative) | White (relative) |
---|---|---|---|---|
FALSE | 36244 | 239241 | 0.1295298 | 0.8550062 |
TRUE | 1219 | 3108 | 0.0043565 | 0.0111075 |
If there is no relationship between skin color and being searched, the proportion of searches would be the same for both skin colors. This gives us our expected values, which we can determine from the marginal probabilities (the expected count in each cell is \(N \cdot P(row) \cdot P(column)\)):
| | Black | White | Marginal |
|---|---|---|---|
| Not searched | P(NS)*P(B) | P(NS)*P(W) | P(NS) |
| Searched | P(S)*P(B) | P(S)*P(W) | P(S) |
| Marginal | P(B) | P(W) | |
If we compute the standardized squared difference between observed and expected values and sum them up, we get \(\chi^2 = 828.3\):
searched | driver_race | n | expected | stdSqDiff |
---|---|---|---|---|
FALSE | Black | 36244 | 36883.67 | 11.09 |
TRUE | Black | 1219 | 579.33 | 706.31 |
FALSE | White | 239241 | 238601.33 | 1.71 |
TRUE | White | 3108 | 3747.67 | 109.18 |
We can then compute the p-value using a chi-squared distribution with \(DF = (nRows - 1) * (nColumns - 1) = (2-1) * (2-1) = 1\)
We can also calculate a \(\chi^2\) test easily in R:
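A minimal sketch of the call (the table is built here from the observed counts above; the object name policeTable is illustrative, and correct = FALSE switches off the continuity correction so that the statistic matches the hand calculation):
# build the 2x2 contingency table from the observed counts
policeTable <- matrix(c(36244, 1219, 239241, 3108), nrow = 2,
                      dimnames = list(searched = c("FALSE", "TRUE"),
                                      driver_race = c("Black", "White")))
# chi-squared test of independence (no continuity correction)
chisq.test(policeTable, correct = FALSE)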
Pearson's Chi-squared test
data: summaryDf2wayTable
X-squared = 828.3, df = 1, p-value < 2.2e-16
The results indicate that the data would be highly unlikely if there were no true relationship between skin color and police searches! We would thus reject \(H_0\).
If we want to know not only whether but also how the data differ from what we would expect under \(H_0\), we can examine the residuals of the model.
The residuals tell us for each cell how much the observed data deviates from the expected data.
To make the residuals more comparable across cells, we look at the standardized residuals:
\[ \text{standardized residual}_{ij} = \frac{observed_{ij} - expected_{ij}}{\sqrt{expected_{ij}}} \]
where \(i\) and \(j\) are the rows and columns respectively.
searched | driver_race | Standardized residuals |
---|---|---|
FALSE | Black | -3.330746 |
TRUE | Black | 26.576456 |
FALSE | White | 1.309550 |
TRUE | White | -10.449072 |
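In R, these standardized residuals can be extracted directly from the test object (a minimal sketch, reusing the policeTable matrix from the sketch above):
# $residuals contains (observed - expected) / sqrt(expected) for each cell
chisqResult <- chisq.test(policeTable, correct = FALSE)
chisqResult$residuals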
Alternatively, we can represent the relative likelihood of different outcomes as odds ratios:
\[ odds_{searched|black} = \frac{N_{searched\cap black}}{N_{not\ searched\cap black}} = \frac{1219}{36244} = 0.034 \]
\[ odds_{searched|white} = \frac{N_{searched\cap white}}{N_{not\ searched\cap white}} = \frac{3108}{239241} = 0.013 \]
\[ odds\ ratio = \frac{odds_{searched|black}}{odds_{searched|white}} = 2.59 \]
The odds of being searched are 2.59x higher for Black vs. white drivers!
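The same numbers can be reproduced with a few lines of R (counts taken from the contingency table above; the variable names are illustrative):
# odds of being searched, separately for each group
oddsSearchedBlack <- 1219 / 36244
oddsSearchedWhite <- 3108 / 239241
# odds ratio: how much higher are the odds for Black drivers?
oddsSearchedBlack / oddsSearchedWhite  # ~2.59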
Simpson's Paradox is a great example of how summaries can be misleading.
If we look at the baseball data below, we see that David Justice has a better batting average in every single year, but Derek Jeter has a better overall batting average:
Player | 1995 (hits/at-bats) | 1995 (Avg) | 1996 (hits/at-bats) | 1996 (Avg) | 1997 (hits/at-bats) | 1997 (Avg) | Combined (hits/at-bats) | Combined (Avg)
---|---|---|---|---|---|---|---|---
Derek Jeter | 12/48 | .250 | 183/582 | .314 | 190/654 | .291 | 385/1284 | .300
David Justice | 104/411 | .253 | 45/140 | .321 | 163/495 | .329 | 312/1046 | .298
How can this be?
Simpson’s Paradox: a pattern that is present in the combined dataset may be different (or even reversed) in subsets of the data.
This happens if another (lurking) variable changes across the subsets (e.g. the number of at-bats, i.e. the denominator).
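A quick check in R (hits and at-bats copied from the table above) reproduces the paradox:
# hits and at-bats per season for both players
jeterHits   <- c(12, 183, 190);  jeterAB   <- c(48, 582, 654)
justiceHits <- c(104, 45, 163);  justiceAB <- c(411, 140, 495)
# per-season averages: Justice is better in every single year
jeterHits / jeterAB        # 0.250 0.314 0.291
justiceHits / justiceAB    # 0.253 0.321 0.329
# combined averages: Jeter is better overall
sum(jeterHits) / sum(jeterAB)       # 0.300
sum(justiceHits) / sum(justiceAB)   # 0.298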
We want to look at a dataset that was used for an analysis of the relationship between income inequality (Gini index) and the prevalence of hate crimes in the USA.
It looks like there is a positive relationship between the variables. How can we quantify this relationship?
Covariance: How much do two variables co-vary with each other?
Variance (single variable): \(s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1}\)
Covariance (two variables): \(covariance = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n - 1}\)
“Is there a relation between the deviations of two different variables (from their respective means) across observations?”
The covariance will be far from 0 (and positive) if the data points deviate from their means in similar ways, and negative if the deviations go in opposite directions.
The covariance scales with the overall level of variance in the data, so it is not very useful for describing the strength of a relationship.
Correlation coefficient (Pearson correlation): scales the covariance by the standard deviations of the two variables and thus standardizes it, so that it varies between -1 and 1!
\[ r = \frac{covariance}{s_xs_y} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)s_x s_y} \]
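In R, cov() and cor() compute these quantities directly; a minimal sketch with simulated data:
# simulate two positively related variables
set.seed(123)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)
cov(x, y)                    # covariance
cov(x, y) / (sd(x) * sd(y))  # standardizing the covariance ...
cor(x, y)                    # ... gives the Pearson correlation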
The correlation between income inequality and hate crimes is \(r = .42\), which seems to be a reasonably strong (positive) relationship.
We can test whether such a relationship could occur by chance, even if there is actually no relationship. In this case, our null hypothesis is \(H_0: r = 0\).
To test whether there is a significant relationship, we can transform the \(r\) statistic into a \(t\) statistic:
\[ t_r = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}} \]
We can compute this with R:
# perform correlation test on hate crime data
cor.test(
  hateCrimes$avg_hatecrimes_per_100k_fbi,
  hateCrimes$gini_index
)
Pearson's product-moment correlation
data: hateCrimes$avg_hatecrimes_per_100k_fbi and hateCrimes$gini_index
t = 3.2182, df = 48, p-value = 0.002314
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1619097 0.6261922
sample estimates:
cor
0.4212719
The p-value is quite small, which indicates that it would be quite unlikely to find an r value this extreme or more extreme if the null hypothesis were true. We would thus reject \(H_0: r = 0\).
In the plot, we saw an outlier: the District of Columbia was quite different from the other data points.
The Pearson’s correlation coefficient \(r\) is highly sensitive to outliers:
(Figure: simulated data illustrating how a single outlier can distort Pearson's r.)
We can use a different correlation coefficient, though, which is less sensitive to outliers: Spearman correlation.
It is based on ranking (i.e. ordering) the data and using the ranks (instead of the data) for the correlation.
For the simulated data, the Spearman correlation is -0.45: the influence of the outlier is greatly reduced, and the correlation is no longer significant!
For the hate crimes data, the Spearman correlation is also much lower and no longer significant:
corTestSpearman <- cor.test(hateCrimes$avg_hatecrimes_per_100k_fbi,
                            hateCrimes$gini_index,
                            method = "spearman")
corTestSpearman
Spearman's rank correlation rho
data: hateCrimes$avg_hatecrimes_per_100k_fbi and hateCrimes$gini_index
S = 20146, p-value = 0.8221
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.03261836
Doing a well-controlled, randomized experiment (randomized controlled trial, RCT) is extremely helpful for gathering causal evidence.
However, it is not always possible or ethical to do an experiment!
We can still collect (observational) data. However, if we correlate two variables, we can’t conclude that one causes the other: They might be related but there could also be a third variable that causes both.
If we have observational data, causal graphs can be helpful for interpreting causality:
Green arrow: positive relationship, red: negative
Circle: observed, rectangle: latent (unobservable)
ExamGrade and FinishTime seem negatively related if we ignore the other variables!
Knowledge = mediator
StudyTime = proxy for Knowledge; if we “control” for it (hold it constant, e.g. only look at people with the same amount of study time), we would see that ExamGrade and FinishTime are unrelated!
We sometimes want to know whether a single value, the mean of a group, differs from a specific value, e.g. whether the mean blood pressure in the sample differs from (or is bigger than) 80.
We can test this using the \(t\)-test, which we have already encountered in the “Hypothesis Testing” session!
\[ t = \frac{\bar{X} - \mu}{SEM} \]
\[ SEM = \frac{\hat{\sigma}}{\sqrt{n}} \]
\(\bar{X}\) is the mean of our sample, \(\mu\) the hypothesized population mean (e.g. the value we want to test against, such as 80 for the blood pressure example), and \(\hat{\sigma}\) the estimated standard deviation of the sample.
We can easily calculate the \(t\)-test in R:
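For example, a minimal sketch with simulated systolic blood pressure values (the data and variable name are illustrative), testing whether the mean is greater than 80:
# simulated blood pressure values
set.seed(42)
bp <- rnorm(100, mean = 82, sd = 10)
# one-sample t-test: is the mean greater than 80?
t.test(bp, mu = 80, alternative = "greater")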
More often, we want to know whether there is a difference between the means of two groups.
Example: Do regular marijuana smokers watch more television?
In this example, we expect that they watch more TV, which leads us to the following directed hypotheses:
\(H_0\): marijuana smokers watch the same amount of or less TV,
\(H_A\): marijuana smokers watch more TV.
If the observations are independent (i.e. you really have two unrelated groups), you can use a very similar formula to calculate the \(t\) statistic:
\[ t = \frac{\bar{X_1} - \bar{X_2}}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \]
where \(\bar{X_1}\) and \(\bar{X_2}\) are the group means, \(S_1^2\) and \(S_2^2\) the group variances, and \(n_1\) and \(n_2\) the group sizes.
Here are the results from a one-tailed \(t\)-test in R:
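A sketch of how such a call could look (the output below comes from the course data; the stand-in data frame df_tv here is simulated and purely illustrative):
# illustrative stand-in data with the same variable names as in the output
set.seed(7)
df_tv <- data.frame(
  RegularMarij = factor(rep(c("No", "Yes"), times = c(200, 60))),
  TVHrsNum     = c(rnorm(200, mean = 2.0, sd = 1.5), rnorm(60, mean = 2.3, sd = 1.5))
)
# Welch two-sample t-test, one-sided: is TV watching lower in "No" than in "Yes"?
t.test(TVHrsNum ~ RegularMarij, data = df_tv, alternative = "less")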
Welch Two Sample t-test
data: TVHrsNum by RegularMarij
t = -1.2147, df = 116.9, p-value = 0.1135
alternative hypothesis: true difference in means between group No and group Yes is less than 0
95 percent confidence interval:
-Inf 0.09866006
sample estimates:
mean in group No mean in group Yes
2.02963 2.30000
If we have repeated observations of the same subjects (i.e. a within-subject design), we might want to compare the same subjects across multiple, repeated measurements.
If we want to test whether blood pressure differs between the first and second measurement session across individuals, we can use a paired \(t\)-test.
A paired \(t\)-test is equivalent to a one-sample \(t\)-test, using the mean of the pairwise differences as \(\bar{X}\) and e.g. 0 (if we expect no difference under \(H_0\)) as \(\mu\).
In R, we would run a paired-samples \(t\)-test like this:
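A minimal sketch (the output below comes from the course data; the two vectors here are simulated stand-ins for each subject's systolic blood pressure in the two sessions, in the same subject order):
# simulated paired measurements for 200 subjects (illustrative only)
set.seed(1)
bp_time1 <- rnorm(200, mean = 122, sd = 12)
bp_time2 <- bp_time1 - rnorm(200, mean = 1, sd = 5)
# paired t-test: compare the two measurement sessions within subjects
t.test(bp_time1, bp_time2, paired = TRUE)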
Paired t-test
data: BPsys by timepoint
t = 2.7369, df = 199, p-value = 0.006763
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
0.2850857 1.7549143
sample estimates:
mean difference
1.02
Often, we want to compare more than two means, e.g. different treatment groups or timepoints.
Example: Different treatments for blood pressure
\(H_0\): all means are equal
\(H_A\): not all means are equal (i.e. at least one differs)
We can partition the variance in the data into different parts:
\(SS_{total}\) = total variance in the data
\(SS_{model}\) = Variance explained by the model
\(SS_{error}\) = Variance not explained by the model
We can use those to calculate the mean squares for the model and the error:
\(MS_{model} =\frac{SS_{model}}{df_{model}}= \frac{SS_{model}}{p-1}\) (\(p\) is the number of means we have calculated)
\(MS_{error} = \frac{SS_{error}}{df_{error}} = \frac{SS_{error}}{N - p}\)
We want to test whether the variance accounted for by the model is greater than expected by chance (\(H_0\): no difference). To do so, we compare the two mean squares: \(F = \frac{MS_{model}}{MS_{error}}\), which under \(H_0\) follows an \(F\) distribution with \(df_{model}\) and \(df_{error}\) degrees of freedom.
In R, we would run an ANOVA like this:
library(tidyverse)  # for %>%, mutate() and fct_relevel()

# reorder the factor levels so that "placebo" is the control condition/intercept
df <- df %>%
  mutate(group2 = fct_relevel(group, "placebo", "drug1", "drug2"))

# fit the model using the factor directly (R creates the dummy variables automatically)
lmResultAnovaBasic <- lm(sysBP ~ group2, data = df)
summary(lmResultAnovaBasic)
Call:
lm(formula = sysBP ~ group2, data = df)
Residuals:
Min 1Q Median 3Q Max
-29.0838 -7.7452 -0.0978 7.6872 23.4313
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 141.595 1.656 85.502 < 2e-16 ***
group2drug1 -10.237 2.342 -4.371 2.92e-05 ***
group2drug2 -2.027 2.342 -0.865 0.389
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.936 on 105 degrees of freedom
Multiple R-squared: 0.1695, Adjusted R-squared: 0.1537
F-statistic: 10.71 on 2 and 105 DF, p-value: 5.83e-05
We can see a \(t\)-test for every drug. This is because the factor group2 is automatically dummy coded by R: each drug is compared against the intercept, which is the mean of the placebo group!
The \(t\)-test shows us that drug1 differs significantly from placebo.
The \(F\)-statistic (also called omnibus test) actually tests our hypothesis of no difference between conditions.
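If we just want the classical ANOVA table with the omnibus \(F\)-test, we can pass the fitted model to anova() (a minimal sketch, reusing the model object fitted above):
# ANOVA table: omnibus F-test of the group factor
# (same F as reported at the bottom of the summary() output)
anova(lmResultAnovaBasic)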
Learning objectives:
What are contingency tables and how do we use \(\chi^2\)-tests to assess a significant relationship between categorical variables?
Be able to describe what a correlation coefficient is and compute & interpret a correlation.
Know what a \(t\)-test and an ANOVA are and how to compute and interpret them.
Next:
Practice exercises!
General Linear Model