This file is a Markdown-based README. It will be rendered on GitHub as a landing page of sorts to help someone understand this project. It is good practice to have a README file to orient a viewer to your project.

The `data-raw` folder should hold the raw data provided by the client. It is also where the licensed data for the postcode conversion, the census data and your web scraping results should be saved. This folder should NOT be made publicly available.

The `data` folder should contain any datasets created in `data-prep.Rmd` and used in `sta303-final-project.Rmd`.

The `cache` folder will appear after you set up the Canadian census data. It does not need to be submitted as part of your assessment.

`README.MD` is this file that you’re reading! It tells you about the contents and setup of this project.

`data-prep.Rmd` is where you should write the code for creating the datasets you will use in your main report. It should read in data from `data-raw` and write the datasets you plan to use for the report into `data`. All other modelling, summaries, etc., should be done in the main Rmd.

... `data` folder and your `sta303-final-project.Rmd` to a public GitHub repo.

`sta303-final-project.Rmd` is the file that should be updated to create your final report. It should only use data from the `data` folder.

`report.tex` is the LaTeX template that helps create the final report. You should not need to edit it.

All file paths should be relative to `sta303-final-project.Rmd` and of the form `data/filename` or `data-raw/filename`.

Dataset | Variables | Description |
---|---|---|
`cust_dev.Rds` | `cust_id`, `dev_id` | Matches each customer to their current device. |
`customer.Rds` | `cust_id`, `dob`, `postcode`, `sex`, `pronouns`, `emoji_modifier` | General information we held about each customer for use in the app and associated profile and chat features. |
`device.Rds` | `dev_id`, `device_name`, `line`, `released` | Additional information about each device in these data. May be useful for connecting with the industry dataset. |
`cust_sleep.Rds` | `cust_id`, `date`, `duration`, `flags` | Sleep data for each customer. |
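
The shared ID columns in the table above suggest how the files link together. As a minimal sketch, assuming `cust_id` and `dev_id` are the join keys and using made-up toy rows rather than the client data, the customer, customer-device and device tables could be combined like this:

```r
library(dplyr)

# Toy rows with made-up values; only the column names follow the table above
customer <- data.frame(cust_id = c("c1", "c2"),
                       postcode = c("A1A1A1", "B2B2B2"))
cust_dev <- data.frame(cust_id = c("c1", "c2"),
                       dev_id = c("d1", "d2"))
device <- data.frame(dev_id = c("d1", "d2"),
                     device_name = c("Tracker A", "Tracker B"))

# Attach each customer's current device, then the device details
customer_device <- customer %>%
  left_join(cust_dev, by = "cust_id") %>%
  left_join(device, by = "dev_id")
```

In your own `data-prep.Rmd` the three data frames would come from reading the `.Rds` files in `data-raw` instead of being typed in by hand.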

Variable | Description |
---|---|
`cust_id` | Unique ID for each customer. |
`dob` | Date of birth, as entered by customer. Year, month, day format. |
`postcode` | Postal code of customer. |
`sex` | Biological sex, used for calculations of health metrics, like body-mass index and base metabolic rate. |
`pronouns` | Pronouns for social profile. |
`emoji_modifier` | Code for skin tone modifier for emojis when using the chat and react features of the social component of our app. These are the standard unicode modifiers, based on the Fitzpatrick scale. If not set, this means the default yellow colour is used. |
`dev_id` | Unique ID for each device registered with our app. |
`device_name` | Name of device type. |
`line` | Line of products this device belongs to. |
`released` | Release date for this particular device type. Year, month, day format. |
`date` | For sleep data, date sleep session started. Year, month, day format. |
`duration` | Duration, in minutes, of sleep session. |
`flags` | Number of times there was a quality flag during the sleep session. Flags may occur due to missing data, or due to data being recorded but sufficiently unusual to suggest it may be a sensor error or other data quality issue. |

Dataset | Variables | Description |
---|---|---|
Web scraped device data | `Device name`, `Line`, `Recommended retail price`, `Battery life`, `Water resitance`, `Heart rate sensor`, `Pulse oximiter`, `GPS`, `Sleep tracking`, `Smart notifications`, `Contactless payments`, `Released`, `Brand` | Source: https://fitnesstrackerinfohub.netlify.app/. See Data and hints. |
Median income data | `CSDuid`, `hhld_median_inc`, `Population` | Procured through the Census Mapper API. See Data and hints. |
Postcode conversion file | `PC`, `CSDuid` | Sourced through U of T Libraries. See Data and hints. |
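
The variable names in the table above hint at how the external data could reach each customer: `postcode` matches `PC`, and `CSDuid` links the conversion file to the income data. The sketch below assumes exactly that and uses made-up toy values, with `postcode_conv` as a hypothetical name for the conversion data. The real conversion file may list a postal code more than once, so check for duplicates before joining.

```r
library(dplyr)

# Toy rows with made-up values; only the column names follow the tables above
customer <- data.frame(cust_id = c("c1", "c2"),
                       postcode = c("A1A1A1", "B2B2B2"))
postcode_conv <- data.frame(PC = c("A1A1A1", "B2B2B2"),
                            CSDuid = c(1001, 1002))
median_income <- data.frame(CSDuid = c(1001, 1002),
                            hhld_median_inc = c(60000, 75000),
                            Population = c(5000, 12000))

# Postal code -> census subdivision -> household median income
customer_income <- customer %>%
  left_join(postcode_conv, by = c("postcode" = "PC")) %>%
  left_join(median_income, by = "CSDuid")
```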

Here is some sample code, using the `rvest` package:

```r
# Note: In adapting this for your code,
# please ensure all libraries are in a setup chunk at the beginning

# These are the libraries I find useful for web scraping
library(tidyverse)
library(polite)
library(rvest)

url <- "url you are scraping"

# Make sure this code is updated appropriately to provide
# informative user_agent details
target <- bow(url,
              user_agent = "liza.bolton@utoronto.ca for STA303/1002 project",
              force = TRUE)

# Any details provided in the robots text on crawl delays and
# which agents are allowed to scrape
target

html <- scrape(target)

device_data <- html %>%
  html_elements("table") %>%
  html_table() %>%
  pluck(1) # added, in case you're getting a list format
```

As a University of Toronto student, you have access to the Census Canada Postal Code Conversion Files.
Ethical considerations
I recommend downloading the .sav file version. It says it is for SPSS, but it is easy to read into R.
Choose an appropriate year and make sure you specify/justify it in your write-up.
You may need to compress the .sav file before uploading to ensure a smooth upload. It will be automatically decompressed for you upon upload.
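
One way to read a .sav file into R is `read_sav()` from the `haven` package. The sketch below is an illustration, not the required approach: because the licensed file cannot be shown, it writes a tiny made-up stand-in to a temporary file and reads it back. In your project you would point `read_sav()` at your own file in `data-raw` instead.

```r
library(haven)

# Toy stand-in for the licensed file (made-up values, real column names)
toy_pccf <- data.frame(PC = c("A1A1A1", "B2B2B2"),
                       CSDuid = c("1001", "1002"),
                       stringsAsFactors = FALSE)
sav_path <- tempfile(fileext = ".sav")
write_sav(toy_pccf, sav_path)

# Read the SPSS file and keep only the two columns needed
postcode <- read_sav(sav_path)
postcode <- postcode[, c("PC", "CSDuid")]
```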
If you are using the JupyterHub, please use the `break_glass_in_case_of_emergency.Rds` in the `data-raw` folder. It is the result of the below code (i.e., it is `postcode`).

```r
# install.packages("cancensus")
library(tidyverse) # for %>%, filter(), select(), parse_number()
library(cancensus)

options(cancensus.api_key = "your API key here",
        cancensus.cache_path = "cache") # this sets a folder for your cache

# get all regions as at the 2016 Census (2020 not up yet)
regions <- list_census_regions(dataset = "CA16")

regions_filtered <- regions %>%
  filter(level == "CSD") %>% # Figure out what CSD means in Census data
  as_census_region_list()

# This can take a while
# We want to get household median income
census_data_csd <- get_census(dataset = "CA16", regions = regions_filtered,
                              vectors = c("v_CA16_2397"),
                              level = "CSD", geo_format = "sf")

# Simplify to only needed variables
median_income <- census_data_csd %>%
  as_tibble() %>%
  select(CSDuid = GeoUID, contains("median"), Population) %>%
  mutate(CSDuid = parse_number(CSDuid)) %>%
  rename(hhld_median_inc = 2)
```

For this project, there are two distinct ways you should think about data protection and privacy:
1) As if you were really consultants working for a client. (Report world)
This will be what you consider in the writing of your report. What would a real consultant do for a real client?
2) As responsible students/statisticians at U of T. (GitHub/portfolio world)
This will be what you consider in sharing your report in the ‘real world’. To access the postcode data you need to agree to the license around it (make sure you do this, even if you need to use the `break-glass` in the end). This means this file should NOT be shared on your GitHub or as part of any portfolio of your work. ‘Keep it secret, keep it safe’.
A useful distinction:

... `data-raw` folder and this should NOT go on GitHub.

... `data-prep.Rmd`) in `data` folder and this could go on GitHub/your personal portfolio.

Note, you also shouldn’t include your API key anywhere, as that is private information about your account.

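
One common way to keep an API key out of shared files is to read it from an environment variable instead of typing it into the Rmd. This is a minimal sketch under assumptions: the variable name `CENSUSMAPPER_API_KEY` is hypothetical, and you would set it yourself in a private file such as `~/.Renviron` that is never committed.

```r
# Hypothetical environment variable name; define it in your own private
# ~/.Renviron rather than in any file that goes on GitHub
api_key <- Sys.getenv("CENSUSMAPPER_API_KEY")

if (!nzchar(api_key)) {
  message("CENSUSMAPPER_API_KEY is not set; census downloads will not work.")
}

# Same options as in the census code, but without the key written in the file
options(cancensus.api_key = api_key,
        cancensus.cache_path = "cache")
```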
... `glmer`, consider changing the optimizer: `control=glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e5))`. (Add this to your `glmer` call.)

... `scales::rescale` is pretty handy for this. (Note: there may be other more appropriate ways to consider age, depending on the research question.)

... `glmer` and `gam`, but sometimes `gam` fits them faster!
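
To show where the `control` argument quoted above sits in a `glmer` call, here is a small sketch. It uses the `cbpp` dataset that ships with `lme4` purely as a stand-in (it is not the project data), and the `age` vector is made up to illustrate what `scales::rescale` does by default, which is to map a numeric vector onto the range 0 to 1.

```r
library(lme4)

# Stand-in model on lme4's built-in cbpp data, with the optimizer changed
fit <- glmer(
  cbind(incidence, size - incidence) ~ period + (1 | herd),
  data = cbpp,
  family = binomial,
  control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5))
)

# Made-up ages, rescaled so the minimum becomes 0 and the maximum becomes 1
age <- c(20, 35, 50)
age_scaled <- scales::rescale(age)
```

Putting numeric predictors on a comparable scale before fitting is a common step with mixed models, though whether it suits your variables depends on the research question.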