JupyterHub template

Guide to the folders and files

This files is a Markdown based README file. It will be rendered on GitHub as a landing page of sorts to help someone understand this project. It is good practice to have a README file to help orient a viewer to your project.

Folders

The data-raw folder should have the raw data provided by the client, as well where the licensed data for the postcode conversion, the census data and your web scraping results should be saved. This file should NOT be made publicly available.
The data folder should contain any datasets created in the data-prep.Rmd and used in sta303-final-project.Rmd.
The cache folder will appear after you set up the Canadian census data. It doesn’t not need to be submitted as part of you assessment.

Files

README.MD this file that you’re reading! It tells you about the contents and set up of this project.
data-prep.Rmd is where you should write the code for creating the datasets you will use in your main report. It should read in data from raw-data and the data sets you plan to use for the report into data. All other modelling, summaries, etc., should be done in the main Rmd.
- The purpose of this is to make sure that your main report is reproducible with the data you would be allowed to share publicly, should you choose to do so. I.e., you could upload your data folder and your sta303-final-project.Rmd to a public GitHub repo.
sta303-final-project.Rmd is the file that should be updated to create your final report. It should only use data from the data folder.
report.tex is the LaTeX template that helps create the final report. You should not need to edit it.

A note on files paths

All files paths should be relative to sta303-final-project.Rmd and of the form data/filename or raw-data/filename.

Guide to the data

Table 1: Guide to client provided data
Dataset	Variables	Description
`cust_dev.Rds`	`cust_id`, `dev_id`	Matches each customer to their current device.
`customer.Rds`	`cust_id`, `dob`, `postcode`, `sex`, `pronouns`, `emoji_modifier`	General information we held about each customer for use in the app and associated profile and chat features.
`device.Rds`	`dev_id`, `device_name`, `line`, `released`	Additional information about each device in these data. May be useful for connecting with the industry dataset.
`cust_sleep.Rds`	`cust_id`, `date`, `duration`, `flags`	Sleep data for each customer.

Guide to the variables

Table 2: Client data variables
Variable	Description
`cust_id`	Unique ID for each customer.
`dob`	Date of birth, as entered by customer. Year, month, day format.
`postcode`	Postal code of customer.
`sex`	Biological sex, used for calculations of health metrics, like body-mass index and base metabolic rate.
`pronouns`	Pronouns for social profile.
`emoji_modifier`	Code for skin tone modifier for emojis when using the chat and react features of the social component of our app. These are the standard unicode modifiers, based on the Fitzpatrick scale. If not set, this means the default yellow colour used.
`dev_id`	Unique ID for each device registered with our app.
`device_name`	Name of device type.
`line`	Line of products this device belongs to.
`released`	Release date for this particular device type. Year, month, day format.
`date`	For sleep data, date sleep session started. Year, month, day format.
duration	Duration, in minutes, of sleep session.
flags	Number of times there was a quality flag during the sleep session. Flags may occur due to missing data, or due to data being recorded but sufficiently unusual to suggest it may be a sensor error or other data quality issue.

Additional hints

Table 3: Hints for data you must acquire elsewhere
Dataset	Variables	Description
Web scraped device data	`Device name`, `Line`, `Recommended retail price`, `Battery life`, `Water resitance`, `Heart rate sensor`, `Pulse oximiter`, `GPS`, `Sleep tracking`, `Smart notifications`, `Contactless payments`, `Released`, `Brand`	Source: https://fitnesstrackerinfohub.netlify.app/. See Data and hints.
Median income data	`CSDuid`, `hhld_median_inc`, `Population`	Procured through the Census Mapper API. See Data and hints.
Postcode conversion file	`PC`, `CSDuid`	Sourced through U of T Libraries. See Data and hints.

Web scraping a website with one simple table

Here is some sample code, using the rvest package

# Note: In adapting this for your code, 
# please ensure all libraries are in a setup chunk at the beginning

# These are the libraries I find useful for webscraping
library(tidyverse)
library(polite)
library(rvest)

url <- "url you are scarping"

# Make sure this code is updated appropriately to provide 
# informative user_agent details
target <- bow(url,
              user_agent = "liza.bolton@utoronto.ca for STA303/1002 project",
              force = TRUE)

# Any details provided in the robots text on crawl delays and 
# which agents are allowed to scrape
target

html <- scrape(target)

device_data <- html %>% 
  html_elements("table") %>% 
  html_table() %>% 
  pluck(1) # added, in case you're getting a list format

Postal code conversion file

As a university of Toronto student you have access to a Census Canada Postal Code Conversion Files
Ethical considerations
- You are asked to accept a license agreement to get access to this data.
- This is data that should NOT go directly on to your GitHub. Think carefully about how you will process this data, saving only the information you need for any data that is part of your final submission.
I recommend downloading the .sav file version. It says it is for SPSS, but it is easy to read into R.
Choose an appropriate year and make sure you specify/justify it in your write-up.

Uploading the data to the JupyterHub

You may need to compress the .sav file before uploading to ensure a smooth upload. It will automatically decompressed for you, upon upload.

If you are using the JupyterHub, please use the break_glass_in_case_of_emergency.Rds in the data-raw folder. It is the result of the below code (i.e., it is postcode).

# install.packages("haven")
library(haven)
library(tidyverse)
dataset = read_sav("raw-data/pccfNat_fccpNat_082021sav.sav")

postcode <- dataset %>% 
  select(PC, CSDuid)

Getting income data from the Canadian census

Sign up for the cancensus API through https://censusmapper.ca/
Get your API key and replace it below
- You can access your key by clicking your profile picture and choosing ‘Edit profile’. It will be displayed on that page.

# install.packages("cancensus")
library(cancensus)


options(cancensus.api_key = "your API key here",
        cancensus.cache_path = "cache") # this sets a folder for your cache


# get all regions as at the 2016 Census (2020 not up yet)
regions <- list_census_regions(dataset = "CA16")

regions_filtered <-  regions %>% 
  filter(level == "CSD") %>% # Figure out what CSD means in Census data
  as_census_region_list()

# This can take a while
# We want to get household median income
census_data_csd <- get_census(dataset='CA16', regions = regions_filtered,
                          vectors=c("v_CA16_2397"), 
                          level='CSD', geo_format = "sf")

# Simplify to only needed variables
median_income <- census_data_csd %>% 
  as_tibble() %>% 
  select(CSDuid = GeoUID, contains("median"), Population) %>% 
  mutate(CSDuid = parse_number(CSDuid)) %>% 
  rename(hhld_median_inc = 2)

Privacy?

For this project, there are two distinct ways you should think about data protection and privacy:

1) As if you were really consultants working for a client. (Report world)

This will be what you consider in the writing of your report. What would a real consultant do for a real client?

2) As responsible students/statisticians at U of T. (GitHub/portfolio world)

This will be what you consider in sharing your report in the ‘real world’. To access the postcode data you need to agree to the license around it (make sure you do this, even if you need to use the break-glass in the end). This means this file should NOT be shared on your GitHub or as part of any portfolio of your work. ‘Keep it secret, keep it safe’.

A useful distinction:

Private/protected data in data-raw folder and this should NOT go on GitHub.
Publishable prepped data used directly in the report (as set up in data-prep.Rmd) in data folder and this could go on GitHub/your personal portfolio.

Note, you also shouldn’t include your API key anywhere as that is private information about your account.

Other

If you have convergence issues using glmer, consider changing the optimizer: control=glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e5)). (add this to your glmer call)
You may want to rescale age. scales::rescale is pretty handy for this. (Note: there may be other more appropriate ways to consider age, depending on the research question.)
You can fit equivalent models in glmer and gam, but sometimes gam fits them faster!

Data and hints