Data and hints

JupyterHub template

Pull the data here!

Guide to the folders and files

This file is a Markdown-based README. It will be rendered on GitHub as a landing page of sorts to help someone understand this project. It is good practice to include a README file to orient a viewer to your project.

Folders

Files

A note on file paths

All file paths should be relative to sta303-final-project.Rmd and of the form data/filename or raw-data/filename.
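For example, reading files with these relative paths looks like the following. The filenames come from Table 1 below; the round trip at the end of the snippet exists only to make it self-contained and runnable outside the project.

```r
# In the project, paths are relative to sta303-final-project.Rmd, e.g.:
#   cust_dev <- readRDS("data/cust_dev.Rds")
# and raw files you acquire yourself live in raw-data/.

# Self-contained demonstration of the saveRDS()/readRDS() round trip:
tmp <- tempfile(fileext = ".Rds")
saveRDS(data.frame(cust_id = 1:3), tmp)
cust_dev <- readRDS(tmp)
```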

Guide to the data

Table 1: Guide to client-provided data

| Dataset | Variables | Description |
|---|---|---|
| cust_dev.Rds | cust_id, dev_id | Matches each customer to their current device. |
| customer.Rds | cust_id, dob, postcode, sex, pronouns, emoji_modifier | General information we held about each customer for use in the app and associated profile and chat features. |
| device.Rds | dev_id, device_name, line, released | Additional information about each device in these data. May be useful for connecting with the industry dataset. |
| cust_sleep.Rds | cust_id, date, duration, flags | Sleep data for each customer. |
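As a sketch of how these files link together (using toy stand-ins shaped like the datasets in Table 1, since the real files are not loaded here), cust_dev connects customers to their device details:

```r
library(dplyr)

# Toy stand-ins with the structure described in Table 1
cust_dev <- tibble(cust_id = c(1, 2), dev_id = c("d1", "d2"))
device   <- tibble(dev_id = c("d1", "d2"),
                   device_name = c("Alpha", "Beta"))
customer <- tibble(cust_id = c(1, 2), postcode = c("A1A 1A1", "B2B 2B2"))

# Match each customer to their current device, then pull in device details
full_customer <- customer %>%
  left_join(cust_dev, by = "cust_id") %>%
  left_join(device, by = "dev_id")
```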

Guide to the variables

Table 2: Client data variables

| Variable | Description |
|---|---|
| cust_id | Unique ID for each customer. |
| dob | Date of birth, as entered by the customer, in year-month-day format. |
| postcode | Postal code of customer. |
| sex | Biological sex, used for calculating health metrics such as body mass index and basal metabolic rate. |
| pronouns | Pronouns for social profile. |
| emoji_modifier | Code for the skin tone modifier applied to emojis in the chat and react features of the social component of our app. These are the standard Unicode modifiers, based on the Fitzpatrick scale. If not set, the default yellow colour is used. |
| dev_id | Unique ID for each device registered with our app. |
| device_name | Name of device type. |
| line | Line of products this device belongs to. |
| released | Release date for this particular device type, in year-month-day format. |
| date | For sleep data, the date the sleep session started, in year-month-day format. |
| duration | Duration, in minutes, of the sleep session. |
| flags | Number of times there was a quality flag during the sleep session. Flags may occur due to missing data, or due to data being recorded but sufficiently unusual to suggest a sensor error or other data quality issue. |
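Since duration is in minutes and flags is a count per session, one natural per-customer summary is flags per hour asleep. A sketch with toy data shaped like cust_sleep.Rds (not the real data):

```r
library(dplyr)

# Toy stand-in shaped like cust_sleep.Rds
cust_sleep <- tibble(
  cust_id  = c(1, 1, 2),
  date     = as.Date(c("2022-01-01", "2022-01-02", "2022-01-01")),
  duration = c(420, 480, 360),  # minutes
  flags    = c(2, 0, 3)
)

# Quality-flag rate per hour of recorded sleep, per customer
sleep_summary <- cust_sleep %>%
  group_by(cust_id) %>%
  summarise(total_minutes  = sum(duration),
            total_flags    = sum(flags),
            flags_per_hour = total_flags / (total_minutes / 60))
```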

Additional hints

Table 3: Hints for data you must acquire elsewhere

| Dataset | Variables | Description |
|---|---|---|
| Web scraped device data | Device name, Line, Recommended retail price, Battery life, Water resistance, Heart rate sensor, Pulse oximeter, GPS, Sleep tracking, Smart notifications, Contactless payments, Released, Brand | Source: https://fitnesstrackerinfohub.netlify.app/. See Data and hints. |
| Median income data | CSDuid, hhld_median_inc, Population | Procured through the CensusMapper API. See Data and hints. |
| Postcode conversion file | PC, CSDuid | Sourced through U of T Libraries. See Data and hints. |

Web scraping a website with one simple table

Here is some sample code using the rvest package:

# Note: In adapting this for your code, 
# please ensure all libraries are in a setup chunk at the beginning

# These are the libraries I find useful for web scraping
library(tidyverse)
library(polite)
library(rvest)

url <- "url you are scraping"

# Make sure this code is updated appropriately to provide 
# informative user_agent details
target <- bow(url,
              user_agent = "liza.bolton@utoronto.ca for STA303/1002 project",
              force = TRUE)

# Printing target shows any details from the site's robots.txt,
# e.g. crawl delays and which agents are allowed to scrape
target

html <- scrape(target)

device_data <- html %>% 
  html_elements("table") %>% 
  html_table() %>% 
  pluck(1) # added, in case you're getting a list format

Postal code conversion file

Uploading the data to the JupyterHub

You may need to compress the .sav file before uploading to ensure a smooth upload. It will be decompressed for you automatically upon upload.

If you are using the JupyterHub, please use the break_glass_in_case_of_emergency.Rds in the raw-data folder. It is the result of the below code (i.e., it is postcode).

# install.packages("haven")
library(haven)
library(tidyverse)
dataset <- read_sav("raw-data/pccfNat_fccpNat_082021sav.sav")

postcode <- dataset %>% 
  select(PC, CSDuid)
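Once postcode is available, it can be joined to the customer data through the postal code; note the column is named PC on one side and postcode on the other. A sketch with toy stand-ins (in the real conversion file, a postal code may match more than one CSDuid, so check for and handle duplicates):

```r
library(dplyr)

# Toy stand-ins: postcode as built above, customer as in the client data
postcode <- tibble(PC = c("A1A 1A1", "B2B 2B2"), CSDuid = c(1001, 1002))
customer <- tibble(cust_id = c(1, 2), postcode = c("A1A 1A1", "B2B 2B2"))

# Attach a CSDuid to each customer via their postal code
customer_geo <- customer %>%
  left_join(postcode, by = c("postcode" = "PC"))
```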

Getting income data from the Canadian census

# install.packages("cancensus")
library(cancensus)


options(cancensus.api_key = "your API key here",
        cancensus.cache_path = "cache") # this sets a folder for your cache


# get all regions as at the 2016 Census (2020 not up yet)
regions <- list_census_regions(dataset = "CA16")

regions_filtered <-  regions %>% 
  filter(level == "CSD") %>% # Figure out what CSD means in Census data
  as_census_region_list()

# This can take a while
# We want to get household median income
census_data_csd <- get_census(dataset='CA16', regions = regions_filtered,
                          vectors=c("v_CA16_2397"), 
                          level='CSD', geo_format = "sf")

# Simplify to only needed variables
median_income <- census_data_csd %>% 
  as_tibble() %>% 
  select(CSDuid = GeoUID, contains("median"), Population) %>% 
  mutate(CSDuid = parse_number(CSDuid)) %>% 
  rename(hhld_median_inc = 2)
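median_income, as built above, can then be joined onto any table that carries a CSDuid. A sketch with toy stand-ins (the real tables are not loaded here):

```r
library(dplyr)

# Toy stand-ins: median_income as produced by the census code above,
# and a customer table that already has a CSDuid attached
median_income <- tibble(CSDuid = c(1001, 1002),
                        hhld_median_inc = c(65000, 72000),
                        Population = c(5000, 12000))
customers_csd <- tibble(cust_id = c(1, 2), CSDuid = c(1002, 1001))

# Attach area-level median household income to each customer
with_income <- customers_csd %>%
  left_join(median_income, by = "CSDuid")
```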

Privacy?

For this project, there are two distinct ways you should think about data protection and privacy:

1) As if you were really consultants working for a client. (Report world)

This will be what you consider in the writing of your report. What would a real consultant do for a real client?

2) As responsible students/statisticians at U of T. (GitHub/portfolio world)

This will be what you consider in sharing your report in the ‘real world’. To access the postcode data you need to agree to the license around it (make sure you do this, even if you need to use the break-glass in the end). This means this file should NOT be shared on your GitHub or as part of any portfolio of your work. ‘Keep it secret, keep it safe’.


Note: you also shouldn’t include your API key anywhere, as that is private information about your account.
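One common pattern (a suggestion, not a requirement) is to keep the key in your .Renviron file and read it with Sys.getenv(), so it never appears in code you commit. The variable name CANCENSUS_API_KEY is chosen here for illustration, not mandated by the package:

```r
# In ~/.Renviron (NOT committed to GitHub), add one line:
#   CANCENSUS_API_KEY=your-actual-key

# In your Rmd, read the key from the environment instead of hard-coding it
options(cancensus.api_key = Sys.getenv("CANCENSUS_API_KEY"),
        cancensus.cache_path = "cache")
```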

Other