TV Golden Age

Published: January 8, 2019

Notable topics: Data manipulation, Logistic regression

Recorded on: 2019-01-08

Timestamps by: Alex Cookson

Timestamps

Quick tip on how to start exploring a new dataset

Investigating an inconsistency: some shows have a count of seasons that differs from the number of seasons given in the data

%in%, all

Using %in% operator and all function to only get shows that have a first season and don't have skipped seasons in the data
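A minimal sketch of this filter, assuming a hypothetical `ratings` table with `title` and `season` columns (the screencast's actual column names may differ). Comparing `season` to `row_number()` only works if rows are sorted by season within each show, so the sketch sorts first.

```r
library(dplyr)

# Hypothetical data: show B is missing season 1, show C skips season 2
ratings <- tibble(
  title  = c("A", "A", "A", "B", "B", "C", "C"),
  season = c(1, 2, 3, 2, 3, 1, 3)
)

complete_shows <- ratings %>%
  arrange(title, season) %>%
  group_by(title) %>%
  # keep a show only if it has season 1 and its seasons run 1, 2, 3, ...
  filter(1 %in% season, all(season == row_number())) %>%
  ungroup()
```

Only show A survives here, because it is the only one with a first season and no gaps.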

Asking, "Which seasons have the most variation in ratings?"

facet_wrap

Using facet_wrap function to separate different shows on a line graph into multiple small graphs

Writing custom embedded function to get width of breaks on the x-axis to always be even (e.g., season 2, 4, 6, etc.)
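One way to sketch the two plotting steps above, using made-up data. The helper `even_breaks` is a hypothetical name, not the function written in the screencast; it relies on `scale_x_continuous()` accepting a function that takes the axis limits and returns break positions.

```r
library(ggplot2)

# Hypothetical ratings for two shows
ratings <- data.frame(
  title  = rep(c("Show A", "Show B"), each = 6),
  season = rep(1:6, times = 2),
  rating = c(8.0, 8.2, 8.5, 8.1, 7.9, 7.4, 7.0, 7.3, 7.1, 6.8, 6.9, 6.5)
)

# Hypothetical helper: even breaks from 2 up to the axis maximum
# (assumes the axis maximum is at least 2)
even_breaks <- function(limits) seq(2, max(limits), by = 2)

p <- ggplot(ratings, aes(season, rating)) +
  geom_line() +
  scale_x_continuous(breaks = even_breaks) +
  facet_wrap(~ title)  # one small panel per show
```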

Committing, finding, and explaining a common error of using the same variable name when summarizing multiple things

%/%

Using truncated division operator %/% to bin data into two-year bins instead of annual (e.g., 1990 and 1991 get binned to 1990)
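The binning arithmetic can be sketched in base R. The `years` vector is made up for illustration.

```r
# Hypothetical years
years <- c(1990, 1991, 1992, 1993, 2000, 2001)

# Integer-divide by the bin width, then multiply back
two_year_bin <- 2 * (years %/% 2)
# 1990 and 1991 both map to 1990, 1992 and 1993 to 1992, and so on
```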

mutate

Using subsetting (with square brackets) within the mutate function to calculate mean on only a subset of data (without needing to filter)
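A small sketch of subsetting inside `mutate`, assuming hypothetical `year`, `season`, and `rating` columns. The mean is taken over season 1 rows only, but no rows are dropped from the data.

```r
library(dplyr)

# Hypothetical data
shows <- tibble(
  year   = c(2000, 2000, 2000, 2001, 2001),
  season = c(1, 2, 1, 1, 3),
  rating = c(8, 6, 7, 9, 5)
)

with_s1_mean <- shows %>%
  group_by(year) %>%
  # rating[season == 1] restricts the mean to season 1 rows within each year
  mutate(mean_season1 = mean(rating[season == 1])) %>%
  ungroup()
```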

gather

Using gather function (now pivot_longer) to reshape metrics stored as separate columns into tidy format, in order to graph them all at once with a facet_wrap
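A sketch of the reshape using `pivot_longer`, the current replacement for `gather`. The `by_year` table and its metric columns are invented for illustration.

```r
library(tidyr)
library(ggplot2)

# Hypothetical summary table with one column per metric
by_year <- data.frame(
  year        = c(1990, 1992, 1994),
  mean_rating = c(7.8, 7.9, 8.1),
  n_shows     = c(12, 15, 21)
)

# One row per year and metric
tidy_metrics <- pivot_longer(
  by_year,
  cols      = c(mean_rating, n_shows),
  names_to  = "metric",
  values_to = "value"
)

# One panel per metric, each with its own y-axis
p <- ggplot(tidy_metrics, aes(year, value)) +
  geom_line() +
  facet_wrap(~ metric, scales = "free_y")
```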

pmin

Using pmin function to lump all seasons after 4 into one row (it still shows "4", but it represents "4+")
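The lumping step can be sketched in base R with a made-up vector of season numbers.

```r
# Hypothetical season numbers
season <- c(1, 2, 3, 4, 5, 6, 7)

# pmin takes the element-wise minimum, so every season above 4 becomes 4
season_lumped <- pmin(season, 4)
```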

Asking, "If season 1 is good, do you get a second season?" (show survival)

paste0, spread

Using paste0 and spread functions to get season 1-3 ratings into three columns, one for each season
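A sketch of the widening step with hypothetical column names. `spread` is superseded by `pivot_wider` in current tidyr, but it still works.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: show B has no season 3
ratings <- tibble(
  title  = c("A", "A", "A", "B", "B"),
  season = c(1, 2, 3, 1, 2),
  rating = c(8.1, 7.9, 7.5, 6.0, 6.4)
)

wide <- ratings %>%
  # turn 1, 2, 3 into valid column names: season1, season2, season3
  mutate(season = paste0("season", season)) %>%
  spread(season, rating)
```

Shows without a given season get `NA` in that column.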

distinct

Using distinct function with .keep_all argument to remove duplicates by only keeping the first one that appears
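A short sketch with invented data, where the same title appears twice.

```r
library(dplyr)

# Hypothetical data with a duplicated title
shows <- tibble(
  title = c("A", "A", "B"),
  year  = c(2000, 2005, 2001)
)

# Keep the first row for each title, along with all other columns
first_rows <- distinct(shows, title, .keep_all = TRUE)
```

Without `.keep_all = TRUE`, only the `title` column would be returned.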

glm

Using logistic regression to answer, "Does season 1 rating affect the probability of getting a second season?" (note he forgets to specify the family argument, fixed at 57:25)
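A minimal sketch of why the family argument matters, on made-up data with hypothetical column names. `glm` defaults to a Gaussian family, so leaving it out fits an ordinary linear model rather than a logistic regression.

```r
# Hypothetical data: season 1 rating and whether a second season followed
shows <- data.frame(
  season1 = c(5.5, 6.0, 6.5, 7.0, 7.0, 7.5, 8.0, 8.5, 9.0, 9.0),
  renewed = c(0, 0, 1, 0, 1, 0, 1, 1, 1, 0)
)

# Family omitted: this is a linear model, not a logistic regression
linear_fit <- glm(renewed ~ season1, data = shows)

# Family specified: logistic regression
logistic_fit <- glm(renewed ~ season1, data = shows, family = "binomial")
```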

ntile, cut

Using ntile function to divide data into N bins (5 in this case), then eventually using cut function instead
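A sketch of the difference between the two functions, on a made-up vector of ratings. The break points passed to `cut` are an assumption chosen for illustration, not the ones used in the screencast.

```r
library(dplyr)

# Hypothetical ratings, already sorted
rating <- c(5.1, 6.2, 6.8, 7.0, 7.4, 7.9, 8.3, 8.8, 9.0, 9.4)

# ntile: 5 bins with (roughly) equal numbers of observations
rating_quintile <- ntile(rating, 5)

# cut: bins with fixed boundaries, so bin sizes can differ
rating_bin <- cut(rating, breaks = c(0, 6, 7, 8, 9, 10))
```

`ntile` bins depend on the data, while `cut` bins have boundaries that can be read directly off the labels.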

Adding year as an independent variable to the logistic regression model

Adding an interaction term (season 1 interacting with year) to the logistic regression model

augment

Using augment function as a method of visualizing and interpreting coefficients of regression model

crossing

Using crossing function to create new data to test the logistic regression model on and interpret model coefficients
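A sketch of the idea on simulated data. The data, column names, and grid values are all assumptions, and base `predict` stands in for whatever prediction step the screencast uses.

```r
library(tidyr)

# Simulated data (hypothetical): renewal depends on season 1 rating
set.seed(2019)
n <- 200
shows <- data.frame(
  season1 = runif(n, 5, 9.5),
  year    = sample(1990:2018, n, replace = TRUE)
)
shows$renewed <- rbinom(n, 1, plogis(-6 + 0.9 * shows$season1))

model <- glm(renewed ~ season1 + year, data = shows, family = "binomial")

# Every combination of the chosen ratings and years
new_data <- crossing(season1 = seq(6, 9, by = 1), year = c(1995, 2005, 2015))

# Predicted probability of a second season for each combination
new_data$predicted <- predict(model, newdata = new_data, type = "response")
```

Reading predicted probabilities off a grid like this is often easier than interpreting coefficients on the log-odds scale.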

splines

Fitting natural splines using the splines package, which would capture a non-linear relationship
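A sketch of a natural spline term in a logistic regression, on simulated data. The data and the choice of 3 degrees of freedom are assumptions for illustration.

```r
library(splines)

# Simulated data (hypothetical): renewal probability peaks at mid ratings
set.seed(2019)
n <- 300
season1 <- runif(n, 5, 9.5)
renewed <- rbinom(n, 1, plogis(1 - (season1 - 7.5)^2))
shows <- data.frame(season1, renewed)

# ns() expands season1 into 3 spline basis columns
spline_fit <- glm(renewed ~ ns(season1, df = 3), data = shows, family = "binomial")
```

The model has one intercept plus one coefficient per basis column, so the individual coefficients are hard to read; plotting predictions is usually the practical way to interpret it.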

Summary of screencast