21 Bits and pieces
21.1 Code to generate course art
# install.packages('readr')
# install.packages('tidyverse')
# install.packages("devtools")
# devtools::install_github("BlakeRMills/MetBrewer")
library(readr)
library(MetBrewer)
library(tidyverse)
course_code <- "STA303"
my_colours <- c(met.brewer("Cross", n = 8), met.brewer("Cross", n = 9))
set.seed(parse_number(course_code))
# simulate a random stacked-area dataset: one value per group per "year"
ngroup <- 17
names <- paste("G_", seq(1, ngroup), sep = "")
DAT <- data.frame()
for (i in seq(1:30)) {
  data <- data.frame(matrix(0, ngroup, 3))
  data[, 1] <- i
  data[, 2] <- sample(names, nrow(data))
  data[, 3] <- prop.table(sample(c(rep(0, 100), c(1:ngroup)), nrow(data)))
  DAT <- rbind(DAT, data)
}
colnames(DAT) <- c("Year", "Group", "Value")
DAT <- DAT[order(DAT$Year, DAT$Group), ]
# stacked area chart with all axes, text and legend stripped, for the course banner
ggplot(DAT, aes(x = Year, y = rev(Value), fill = Group)) +
  geom_area(alpha = 1) +
  theme_bw() +
  scale_fill_manual(values = my_colours) +
  theme(
    text = element_blank(),
    line = element_blank(),
    title = element_blank(),
    legend.position = "none",
    panel.border = element_blank(),
    panel.background = element_blank(),
    plot.margin = margin(0, -2.7, 0, -2.7, "cm"))
ggsave(paste0(course_code, "-base.png"), width = 24, height = 2)
21.2 M1 supporting information on matrices (not assessed)
21.2.1 Background
21.2.1.1 Some true things about matrices
- The rank of a matrix is the number of linearly independent columns it has.
- If the number of columns equals the rank of the matrix, all the columns are linearly independent. If the number of columns is greater than the rank, not all the columns are linearly independent.
- You can only invert a square matrix if all its columns are linearly independent (i.e., its determinant is non-zero).
Why do we care? In linear regression, to estimate \(\boldsymbol\beta\), our vector of coefficients, we calculate \((\boldsymbol X^T\boldsymbol X)^{-1}\boldsymbol X^T\boldsymbol Y\). The elements of \(\boldsymbol\beta\) can’t be estimated if \(\boldsymbol X^T\boldsymbol X\) (a square matrix) isn’t invertible.
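To see that formula in action, here is a minimal sketch (my addition, using a made-up toy predictor rather than course data) that computes \(\hat{\boldsymbol\beta}\) by hand and checks it against lm():
# Toy example: beta-hat = (X'X)^{-1} X'y, computed by hand
set.seed(123)
x <- 1:10
y <- 2 + 3 * x + rnorm(10)
X <- cbind(1, x)                      # model matrix: a column of 1s plus x
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat
coef(lm(y ~ x))                       # matches the by-hand calculation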
Clarification to what I said in the tests activity: We usually perform regression with an intercept because we don't want to assume our line passes through the origin, 0. So, if there is an intercept (a column of 1s in the model matrix), we must convert every categorical variable with \(k\) levels into \(k-1\) dummy variables to keep the intercept and still satisfy linear independence. If we ditch the intercept, we can have \(k\) dummies, but this is usually only useful in the specific case of ANOVA.
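As a quick (unassessed) illustration of that rule, R's model.matrix() applies it automatically; the two calls below are a sketch using the palmerpenguins data that we work with in the next section:
library(palmerpenguins)
head(model.matrix(~ species, data = penguins))      # intercept + (k - 1) dummies
head(model.matrix(~ 0 + species, data = penguins))  # no intercept, k dummies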
21.2.2 Example
library(tidyverse)
library(palmerpenguins)
# function to replace NAs with 0 and text with 1
dummify <- function(x){
  if_else(is.na(x), 0, 1)
}
# create a smaller toy version of the penguin dataset (just for display purposes)
set.seed(24601)
pengwings <- penguins %>%
  group_by(species) %>%
  sample_n(4) %>%
  select(body_mass_g, species)
pengwings
## # A tibble: 12 × 2
## # Groups: species [3]
## body_mass_g species
## <int> <fct>
## 1 3900 Adelie
## 2 3700 Adelie
## 3 3450 Adelie
## 4 3175 Adelie
## 5 4500 Chinstrap
## 6 3850 Chinstrap
## 7 3650 Chinstrap
## 8 4550 Chinstrap
## 9 5150 Gentoo
## 10 5000 Gentoo
## 11 5700 Gentoo
## 12 4600 Gentoo
# creating a version of the data where there is a dummy column for each level of species instead of one species column
wider <- pengwings %>%
  pivot_wider(id_cols = everything(), names_from = species, values_from = species, names_prefix = "species.") %>%
  mutate_at(vars(starts_with("species.")), dummify)
wider
## # A tibble: 12 × 4
## body_mass_g species.Adelie species.Chinstrap species.Gentoo
## <int> <dbl> <dbl> <dbl>
## 1 3900 1 0 0
## 2 3700 1 0 0
## 3 3450 1 0 0
## 4 3175 1 0 0
## 5 4500 0 1 0
## 6 3850 0 1 0
## 7 3650 0 1 0
## 8 4550 0 1 0
## 9 5150 0 0 1
## 10 5000 0 0 1
## 11 5700 0 0 1
## 12 4600 0 0 1
21.2.2.1 Classic regression, k-1 dummies (intercept)
mod1 <- lm(body_mass_g ~ species, data = pengwings)
summary(mod1)
##
## Call:
## lm(formula = body_mass_g ~ species, data = pengwings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -512.50 -310.94 -34.37 348.44 587.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3556.3 206.8 17.199 3.42e-08 ***
## speciesChinstrap 581.2 292.4 1.988 0.07809 .
## speciesGentoo 1556.2 292.4 5.322 0.00048 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 413.6 on 9 degrees of freedom
## Multiple R-squared: 0.7627, Adjusted R-squared: 0.71
## F-statistic: 14.46 on 2 and 9 DF, p-value: 0.001545
head(model.matrix(mod1))
## (Intercept) speciesChinstrap speciesGentoo
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 1 0
## 6 1 1 0
Does the rank of the model matrix equal the number of columns?
qr(model.matrix(mod1))$rank == ncol(model.matrix(mod1))
## [1] TRUE
Okay, linearly independent, we’re good to go!
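Equivalently (a small check I've added, not part of the original example), \(\boldsymbol X^T\boldsymbol X\) for mod1 has a non-zero determinant, so it can be inverted:
X1 <- model.matrix(mod1)
det(t(X1) %*% X1)        # non-zero, so (X'X)^{-1} exists
solve(t(X1) %*% X1)      # inverts without complaint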
21.2.2.2 Classic regression but trying to force k dummies (with an intercept)
mod2 <- lm(body_mass_g ~ species.Adelie + species.Gentoo + species.Chinstrap, data = wider)
summary(mod2)
##
## Call:
## lm(formula = body_mass_g ~ species.Adelie + species.Gentoo +
## species.Chinstrap, data = wider)
##
## Residuals:
## Min 1Q Median 3Q Max
## -512.50 -310.94 -34.38 348.44 587.50
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4137.5 206.8 20.010 9.04e-09 ***
## species.Adelie -581.2 292.4 -1.988 0.07809 .
## species.Gentoo 975.0 292.4 3.334 0.00874 **
## species.Chinstrap NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 413.6 on 9 degrees of freedom
## Multiple R-squared: 0.7627, Adjusted R-squared: 0.71
## F-statistic: 14.46 on 2 and 9 DF, p-value: 0.001545
model.matrix(mod2)
## (Intercept) species.Adelie species.Gentoo species.Chinstrap
## 1 1 1 0 0
## 2 1 1 0 0
## 3 1 1 0 0
## 4 1 1 0 0
## 5 1 0 0 1
## 6 1 0 0 1
## 7 1 0 0 1
## 8 1 0 0 1
## 9 1 0 1 0
## 10 1 0 1 0
## 11 1 0 1 0
## 12 1 0 1 0
## attr(,"assign")
## [1] 0 1 2 3
qr(model.matrix(mod2))$rank
## [1] 3
Does the rank of the model matrix equal the number of columns?
qr(model.matrix(mod2))$rank == ncol(model.matrix(mod2))
## [1] FALSE
Not linearly independent. Why?
Well, the intercept column is a linear combination of the three species columns!
m <- model.matrix(mod2)
m[, 1] == m[, 2] + m[, 3] + m[, 4]
## 1 2 3 4 5 6 7 8 9 10 11 12
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
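For contrast with mod1 (again, a check I've added rather than part of the original example), \(\boldsymbol X^T\boldsymbol X\) for mod2 is singular, which is exactly why lm() couldn't estimate the last coefficient:
X2 <- model.matrix(mod2)
det(t(X2) %*% X2)        # zero (up to floating point), so no inverse exists
# solve(t(X2) %*% X2)    # uncommenting this gives a "system is computationally singular" error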
21.2.3 Regression NOT classic, actually ANOVA! (no intercept)
mod3 <- lm(body_mass_g ~ 0 + species.Adelie + species.Gentoo + species.Chinstrap, data = wider)
summary(mod3)
##
## Call:
## lm(formula = body_mass_g ~ 0 + species.Adelie + species.Gentoo +
## species.Chinstrap, data = wider)
##
## Residuals:
## Min 1Q Median 3Q Max
## -512.50 -310.94 -34.38 348.44 587.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## species.Adelie 3556.2 206.8 17.20 3.42e-08 ***
## species.Gentoo 5112.5 206.8 24.73 1.39e-09 ***
## species.Chinstrap 4137.5 206.8 20.01 9.04e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 413.6 on 9 degrees of freedom
## Multiple R-squared: 0.9932, Adjusted R-squared: 0.9909
## F-statistic: 435.8 on 3 and 9 DF, p-value: 4.659e-10
model.matrix(mod3)
## species.Adelie species.Gentoo species.Chinstrap
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 0 0 1
## 6 0 0 1
## 7 0 0 1
## 8 0 0 1
## 9 0 1 0
## 10 0 1 0
## 11 0 1 0
## 12 0 1 0
## attr(,"assign")
## [1] 1 2 3
qr(model.matrix(mod3))$rank
## [1] 3
Challenge question for +100 stats respect points: How do you interpret these coefficients?
Does the rank of the model matrix equal the number of columns?
qr(model.matrix(mod3))$rank == ncol(model.matrix(mod3))
## [1] TRUE
Great, back to being linearly independent.
21.2.4 Further reading (if you want it)
As I said at the beginning, I'm not planning to assess you on any of this. If you're interested in knowing more, or just think matrix algebra is delicious, I think this is a delightfully approachable walk-through: https://online.stat.psu.edu/stat462/node/132/
21.3 p values (recap)
Remember: “Small but mighty!” The smaller the p value, the stronger the evidence against the null hypothesis.
- We should never be making claims in favour of/against the alternative hypothesis. Our statistical claims are always about the null. This may inform practical decisions in less strict ways, depending on practical significance, risks, etc., but a correct statistical claim for a p value should ALWAYS be about the null. Absolutes are rare in teaching statistics.
- A common threshold for rejecting or failing to reject the null hypothesis is 0.05. This is mostly from habit/convention, rather than some truly meaningful cosmic value. “Yeah, 1 in 20? Unlikely, sure.” Different disciplines/sub-disciplines develop norms appropriate to their context.
- Many statisticians—especially in light of the reproducibility crisis and poor public, and even sometimes researcher, understanding of p values—prefer to make statements about the strength of evidence, not just reject/fail to reject.
p value range            | Strength comment
=========================+=====================================================
> 0.1                    | No evidence against the null hypothesis
0.05 < p value < 0.1     | Weak evidence against the null hypothesis
0.01 < p value < 0.05    | Moderate/some evidence against the null hypothesis
0.001 < p value < 0.01   | Strong evidence against the null hypothesis
< 0.001                  | Very strong evidence against the null hypothesis
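If you like, you can encode the table above as a little helper so your wording stays consistent; this is a sketch of my own, not something we'll use in assessments:
# map a p value (or vector of p values) to the strength-of-evidence wording above
evidence <- function(p) {
  dplyr::case_when(
    p > 0.1   ~ "no evidence against the null hypothesis",
    p > 0.05  ~ "weak evidence against the null hypothesis",
    p > 0.01  ~ "moderate/some evidence against the null hypothesis",
    p > 0.001 ~ "strong evidence against the null hypothesis",
    TRUE      ~ "very strong evidence against the null hypothesis"
  )
}
evidence(c(0.2, 0.03, 0.0004))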
21.3.1 What if you get a value that is exactly one of these thresholds?
It would be very rare to get a p value of exactly one of these thresholds. If you were to, you could decide based on your judgment of the context. Should you be more or less conservative in your interpretation based on the needs of the project/client?
So, you’ll almost always have something that, while it may round to 0.001 (or another threshold), is really something like 0.001345789 or 0.00098963. What judgment would you make in each case, with this in mind?
But remember, these are fuzzy boundaries based on easy-to-grasp probabilities for humans, particularly humans used to thinking in base 10. 1 in 20, 1 in 100, 1 in 1000, etc. Easy to think about but not universally meaningful thresholds. Honestly, being a little over or under really doesn’t make that much of a difference. p values are continuous random variables. We impose discrete categories for our own ease. Used too rigidly, these guidelines have the same problem as rejecting/failing to reject, but just with a few more levels.