21 Bits and pieces

21.1 Code to generate course art

# install.packages('readr')
# install.packages('tidyverse')
# install.packages("devtools")
# devtools::install_github("BlakeRMills/MetBrewer")

library(readr)
library(MetBrewer)
library(tidyverse)

course_code <- "STA303"

# build a 17-colour vector (8 + 9 from the Cross palette) to match the 17 groups below
my_colours <- c(met.brewer("Cross", n = 8), met.brewer("Cross", n = 9))

# seed the random number generator with the numeric part of the course code
set.seed(parse_number(course_code))

ngroup <- 17
names <- paste("G_", seq(1, ngroup), sep = "")
DAT <- data.frame()

# for each of 30 "years", assign the groups in a random order and give them
# random values, normalised so each year's values sum to 1
for (i in 1:30) {
  data <- data.frame(matrix(0, ngroup, 3))
  data[, 1] <- i
  data[, 2] <- sample(names, nrow(data))
  data[, 3] <- prop.table(sample(c(rep(0, 100), 1:ngroup), nrow(data)))
  DAT <- rbind(DAT, data)
}
colnames(DAT) <- c("Year", "Group", "Value")
DAT <- DAT[order(DAT$Year, DAT$Group), ]

# stacked area chart, with all text, lines, and the legend stripped away
ggplot(DAT, aes(x = Year, y = rev(Value), fill = Group)) +
  geom_area(alpha = 1) +
  theme_bw() +
  scale_fill_manual(values = my_colours) +
  theme(
    text = element_blank(),
    line = element_blank(),
    title = element_blank(),
    legend.position = "none",
    panel.border = element_blank(),
    panel.background = element_blank(),
    plot.margin = margin(0, -2.7, 0, -2.7, "cm"))

ggsave(paste0(course_code, "-base.png"), width = 24, height = 2)

21.2 M1 supporting information on matrices (not assessed)

21.2.1 Background

21.2.1.1 Some true things about matrices

  • The rank of a matrix is the number of linearly independent columns your matrix has.
  • If the number of columns equals the rank of the matrix, all the columns are linearly independent. If the number of columns is greater than the rank, the columns are not all linearly independent.
  • You can only invert a square matrix if all its columns are linearly independent (equivalently, its determinant is non-zero). See the sketch just below.
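
You can check these facts numerically. Here is a minimal sketch in base R, using a made-up 3×3 matrix whose third column is the sum of the first two:

A <- matrix(c(1, 2, 3,
              4, 5, 6,
              5, 7, 9), nrow = 3)  # columns: c1, c2, and c3 = c1 + c2
qr(A)$rank  # 2, not 3: the columns are not all linearly independent
det(A)      # 0 (up to floating point), so A is not invertible
# solve(A)  # uncommenting this line throws a "system is exactly singular" error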

Why do we care? In linear regression, to estimate \(\boldsymbol\beta\), our vector of coefficients, we calculate \(\hat{\boldsymbol\beta} = (\boldsymbol X^T\boldsymbol X)^{-1}\boldsymbol X^T\boldsymbol Y\). The elements of \(\boldsymbol\beta\) can’t be estimated if \(\boldsymbol X^T\boldsymbol X\) (a square matrix) isn’t invertible.
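
To see this calculation in action, here is a minimal sketch using the built-in mtcars data; note that solve() is exactly the matrix inversion that fails when the columns are linearly dependent:

X <- model.matrix(mpg ~ wt, data = mtcars)  # model matrix with an intercept column
Y <- mtcars$mpg
solve(t(X) %*% X) %*% t(X) %*% Y   # (X^T X)^{-1} X^T Y, computed "by hand"
coef(lm(mpg ~ wt, data = mtcars))  # lm() produces the same estimates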

Clarification to what I said in the tests activity: we usually perform regression with an intercept because we don’t want to assume our line passes through the origin, 0. So, if there is an intercept (a column of 1s in the model matrix), we must convert every categorical variable with \(k\) levels into \(k-1\) dummy variables to keep the intercept and still satisfy linear independence. If we ditch the intercept, we can have \(k\) dummies, but this is usually only useful in the specific case of ANOVA.

21.2.2 Example

library(tidyverse)
library(palmerpenguins)
# function to replace NAs with 0 and text with 1
dummify <- function(x){
  if_else(is.na(x), 0, 1)
}

# create a smaller toy version of the penguin dataset (just for display purposes)
set.seed(24601)
pengwings <- penguins %>% 
  group_by(species) %>% 
  sample_n(4) %>% 
  select(body_mass_g, species)

pengwings
## # A tibble: 12 × 2
## # Groups:   species [3]
##    body_mass_g species  
##          <int> <fct>    
##  1        3900 Adelie   
##  2        3700 Adelie   
##  3        3450 Adelie   
##  4        3175 Adelie   
##  5        4500 Chinstrap
##  6        3850 Chinstrap
##  7        3650 Chinstrap
##  8        4550 Chinstrap
##  9        5150 Gentoo   
## 10        5000 Gentoo   
## 11        5700 Gentoo   
## 12        4600 Gentoo
# create a version of the data with a dummy column for each level of
# species, instead of a single species column
wider <- pengwings %>% 
  pivot_wider(id_cols = everything(), names_from = species,
              values_from = species, names_prefix = "species.") %>% 
  mutate_at(vars(starts_with("species.")), dummify)

wider
## # A tibble: 12 × 4
##    body_mass_g species.Adelie species.Chinstrap species.Gentoo
##          <int>          <dbl>             <dbl>          <dbl>
##  1        3900              1                 0              0
##  2        3700              1                 0              0
##  3        3450              1                 0              0
##  4        3175              1                 0              0
##  5        4500              0                 1              0
##  6        3850              0                 1              0
##  7        3650              0                 1              0
##  8        4550              0                 1              0
##  9        5150              0                 0              1
## 10        5000              0                 0              1
## 11        5700              0                 0              1
## 12        4600              0                 0              1

21.2.2.1 Classic regression, k-1 dummies (intercept)

mod1 <- lm(body_mass_g ~ species, data = pengwings)
summary(mod1)
## 
## Call:
## lm(formula = body_mass_g ~ species, data = pengwings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -512.50 -310.94  -34.37  348.44  587.50 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3556.3      206.8  17.199 3.42e-08 ***
## speciesChinstrap    581.2      292.4   1.988  0.07809 .  
## speciesGentoo      1556.2      292.4   5.322  0.00048 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 413.6 on 9 degrees of freedom
## Multiple R-squared:  0.7627, Adjusted R-squared:   0.71 
## F-statistic: 14.46 on 2 and 9 DF,  p-value: 0.001545
head(model.matrix(mod1))
##   (Intercept) speciesChinstrap speciesGentoo
## 1           1                0             0
## 2           1                0             0
## 3           1                0             0
## 4           1                0             0
## 5           1                1             0
## 6           1                1             0

Does the rank of the model matrix equal the number of columns?

qr(model.matrix(mod1))$rank == ncol(model.matrix(mod1))
## [1] TRUE

Okay, linearly independent, we’re good to go!

21.2.2.2 Classic regression but trying to force k dummies (with an intercept)

mod2 <- lm(body_mass_g ~ species.Adelie + species.Gentoo + species.Chinstrap, data=wider)
summary(mod2)
## 
## Call:
## lm(formula = body_mass_g ~ species.Adelie + species.Gentoo + 
##     species.Chinstrap, data = wider)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -512.50 -310.94  -34.38  348.44  587.50 
## 
## Coefficients: (1 not defined because of singularities)
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4137.5      206.8  20.010 9.04e-09 ***
## species.Adelie      -581.2      292.4  -1.988  0.07809 .  
## species.Gentoo       975.0      292.4   3.334  0.00874 ** 
## species.Chinstrap       NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 413.6 on 9 degrees of freedom
## Multiple R-squared:  0.7627, Adjusted R-squared:   0.71 
## F-statistic: 14.46 on 2 and 9 DF,  p-value: 0.001545
model.matrix(mod2)
##    (Intercept) species.Adelie species.Gentoo species.Chinstrap
## 1            1              1              0                 0
## 2            1              1              0                 0
## 3            1              1              0                 0
## 4            1              1              0                 0
## 5            1              0              0                 1
## 6            1              0              0                 1
## 7            1              0              0                 1
## 8            1              0              0                 1
## 9            1              0              1                 0
## 10           1              0              1                 0
## 11           1              0              1                 0
## 12           1              0              1                 0
## attr(,"assign")
## [1] 0 1 2 3
qr(model.matrix(mod2))$rank
## [1] 3

Does the rank of the model matrix equal the number of columns?

qr(model.matrix(mod2))$rank == ncol(model.matrix(mod2))
## [1] FALSE

Not linearly independent. Why?

Well, the intercept column is a linear combination of the three species columns!

m <- model.matrix(mod2)

m[,1] == m[,2] + m[,3] + m[,4]
##    1    2    3    4    5    6    7    8    9   10   11   12 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
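
As an aside, base R can surface this dependency for you: alias() reports coefficients that are linear combinations of other columns. A minimal sketch:

alias(mod2)
# should report that species.Chinstrap is a linear combination of the
# intercept and the other two dummy columns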

21.2.2.3 Regression NOT classic, actually ANOVA! (no intercept)

mod3 <- lm(body_mass_g ~ 0 + species.Adelie + species.Gentoo + species.Chinstrap, data=wider)
summary(mod3)
## 
## Call:
## lm(formula = body_mass_g ~ 0 + species.Adelie + species.Gentoo + 
##     species.Chinstrap, data = wider)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -512.50 -310.94  -34.38  348.44  587.50 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## species.Adelie      3556.2      206.8   17.20 3.42e-08 ***
## species.Gentoo      5112.5      206.8   24.73 1.39e-09 ***
## species.Chinstrap   4137.5      206.8   20.01 9.04e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 413.6 on 9 degrees of freedom
## Multiple R-squared:  0.9932, Adjusted R-squared:  0.9909 
## F-statistic: 435.8 on 3 and 9 DF,  p-value: 4.659e-10
model.matrix(mod3)
##    species.Adelie species.Gentoo species.Chinstrap
## 1               1              0                 0
## 2               1              0                 0
## 3               1              0                 0
## 4               1              0                 0
## 5               0              0                 1
## 6               0              0                 1
## 7               0              0                 1
## 8               0              0                 1
## 9               0              1                 0
## 10              0              1                 0
## 11              0              1                 0
## 12              0              1                 0
## attr(,"assign")
## [1] 1 2 3
qr(model.matrix(mod3))$rank
## [1] 3

Challenge question for +100 stats respect points: How do you interpret these coefficients?

Does the rank of the model matrix equal the number of columns?

qr(model.matrix(mod3))$rank == ncol(model.matrix(mod3))
## [1] TRUE

Great, back to being linearly independent.
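
If you want to check your answer to the challenge question, here is a quick sketch comparing mod3’s coefficients to a simple summary of the pengwings data (mean_body_mass is just a label for display):

pengwings %>% 
  group_by(species) %>% 
  summarise(mean_body_mass = mean(body_mass_g))  # per-species means

coef(mod3)  # compare with the summary above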

21.2.3 Further reading (if you want it)

As I said at the beginning, I’m not planning to assess you on any of this. If you’re interested in knowing more, or just think matrix algebra is delicious, I think this is a delightfully approachable walk-through: https://online.stat.psu.edu/stat462/node/132/

21.3 p values (recap)

Remember: “Small but mighty!” The smaller the p value, the stronger the evidence against the null hypothesis.

  • We should never make claims in favour of or against the alternative hypothesis. Our statistical claims are always about the null. These claims may inform practical decisions in less strict ways, depending on practical significance, risks, etc., but a correct statistical claim from a p value should ALWAYS be about the null. Absolutes are rare in teaching statistics, but this is one of them.
  • A common threshold for rejecting or failing to reject the null hypothesis is 0.05. This is mostly habit/convention rather than some truly meaningful cosmic value: “Yeah, 1 in 20? Unlikely, sure.” Different disciplines/sub-disciplines develop norms appropriate to their context.
  • Many statisticians, especially in light of the reproducibility crisis and poor understanding of p values among the public (and sometimes even researchers), prefer to make statements about the strength of evidence, not just reject/fail to reject. The table below gives common guidelines.
p value range          | Strength comment
-----------------------|----------------------------------------------------
> 0.1                  | No evidence against the null hypothesis
0.05 < p value < 0.1   | Weak evidence against the null hypothesis
0.01 < p value < 0.05  | Moderate/some evidence against the null hypothesis
0.001 < p value < 0.01 | Strong evidence against the null hypothesis
< 0.001                | Very strong evidence against the null hypothesis
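
If you ever want to apply these guidelines programmatically, here is a minimal sketch using dplyr’s case_when() (evidence_strength() is a made-up helper, not a package function):

library(dplyr)

# made-up helper: map p values to the strength-of-evidence comments above
evidence_strength <- function(p) {
  case_when(
    p > 0.1   ~ "No evidence against the null hypothesis",
    p > 0.05  ~ "Weak evidence against the null hypothesis",
    p > 0.01  ~ "Moderate/some evidence against the null hypothesis",
    p > 0.001 ~ "Strong evidence against the null hypothesis",
    TRUE      ~ "Very strong evidence against the null hypothesis"
  )
}

evidence_strength(c(0.2, 0.03, 0.0004))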

21.3.1 What if you get a value that is exactly one of these thresholds?

It would be very rare to get a p value that is exactly one of these thresholds. If you did, you could decide based on your judgment of the context: should you be more or less conservative in your interpretation, given the needs of the project/client?

So, you’ll almost always have something that, while it may round to 0.001 (or another threshold), is really something like 0.001345789 or 0.00098963. What judgment would you make in each case, with this in mind?

But remember, these are fuzzy boundaries based on probabilities that are easy for humans to grasp, particularly humans used to thinking in base 10: 1 in 20, 1 in 100, 1 in 1000, etc. Easy to think about, but not universally meaningful thresholds. Honestly, being a little over or under really doesn’t make much of a difference. p values are continuous random variables; we impose discrete categories for our own ease. Used too rigidly, these guidelines have the same problem as rejecting/failing to reject, just with a few more levels.