TV Golden Age


January 8, 2019

Notable topics: Data manipulation, Logistic regression

Recorded on: 2019-01-08

Timestamps by: Alex Cookson



Quick tip on how to start exploring a new dataset

Investigating an inconsistency: some shows have a count of seasons that differs from the number of seasons given in the data


Using %in% operator and all function to only get shows that have a first season and don't have skipped seasons in the data
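
A minimal sketch of this kind of filter, assuming a hypothetical `ratings` table with one row per show and season (the column names `title` and `season` are illustrative, and the exact condition used in the screencast may differ):

```r
library(dplyr)

# Hypothetical toy data: one row per show-season.
ratings <- tibble(
  title  = c("A", "A", "A", "B", "B", "C", "C"),
  season = c(1, 2, 3, 1, 3, 2, 3)
)

complete_shows <- ratings %>%
  group_by(title) %>%
  # Keep a show only if every season from 1 up to its latest one is present:
  # this requires a first season and rules out skipped seasons.
  filter(all(seq_len(max(season)) %in% season)) %>%
  ungroup()

unique(complete_shows$title)
```

Show B is dropped because season 2 is missing, and show C because it has no season 1.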

Asking, "Which seasons have the most variation in ratings?"


Using facet_wrap function to separate different shows on a line graph into multiple small graphs

Writing a custom embedded (inline) function so that breaks on the x-axis always fall on even numbers (e.g., seasons 2, 4, 6)
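
One way such a break function could look, as a sketch. The name `even_breaks` is hypothetical, and it assumes the upper axis limit is at least 2:

```r
# Hypothetical helper: given the axis limits, return the even numbers
# from 2 up to the upper limit.
even_breaks <- function(limits) {
  seq(2, floor(max(limits)), by = 2)
}

even_breaks(c(1, 9))
```

A function like this can be passed to ggplot2 as `scale_x_continuous(breaks = even_breaks)`, since the `breaks` argument accepts a function of the limits.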

Committing, finding, and explaining a common error of using the same variable name when summarizing multiple things


Using integer division operator %/% to bin data into two-year bins instead of annual ones (e.g., 1990 and 1991 get binned to 1990)
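
The binning arithmetic can be sketched in base R. The `year` vector here is made-up example data:

```r
# Hypothetical example years.
year <- c(1990, 1991, 1992, 1993, 2018)

# Integer-divide by the bin width, then multiply back to get the
# first year of each two-year bin.
two_year_bin <- 2 * (year %/% 2)

two_year_bin
```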


Using subsetting (with square brackets) within the mutate function to calculate mean on only a subset of data (without needing to filter)
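
A small sketch of the idea, assuming hypothetical columns `title`, `season`, and `rating`. The square-bracket subset restricts the mean to season 1 rows while every row of the show is kept:

```r
library(dplyr)

# Hypothetical toy data.
ratings <- tibble(
  title  = c("A", "A", "A", "B", "B"),
  season = c(1, 2, 3, 1, 2),
  rating = c(8.0, 7.0, 6.0, 9.0, 9.5)
)

with_s1 <- ratings %>%
  group_by(title) %>%
  # Mean of rating over season 1 rows only; no filter() needed.
  mutate(season1_rating = mean(rating[season == 1])) %>%
  ungroup()

with_s1
```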


Using gather function (now pivot_longer) to get metrics as columns into tidy format, in order to graph them all at once with a facet_wrap


Using pmin function to lump all seasons after 4 into one row (it still shows "4", but it represents "4+")
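
As a quick illustration with made-up season numbers, `pmin` caps each value at 4, so every later season collapses into the "4+" group:

```r
# Hypothetical season numbers.
season <- c(1, 2, 3, 4, 5, 9)

# Element-wise minimum of each season and 4.
season_lumped <- pmin(season, 4)

season_lumped
```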

Asking, "If season 1 is good, do you get a second season?" (show survival)


Using paste0 and spread functions to get season 1-3 ratings into three columns, one for each season
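
A sketch of the reshaping step on hypothetical data. The `season1`-style column names are an assumption, and `spread` has since been superseded by `pivot_wider`:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: show B has no third season.
ratings <- tibble(
  title  = c("A", "A", "A", "B", "B"),
  season = c(1, 2, 3, 1, 2),
  rating = c(8.0, 7.0, 6.0, 9.0, 9.5)
)

wide <- ratings %>%
  filter(season <= 3) %>%
  # Turn 1, 2, 3 into labels that work as column names.
  mutate(season = paste0("season", season)) %>%
  # One column per season label, filled with that season's rating.
  spread(season, rating)

wide
```

Shows without a given season get `NA` in that column.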


Using distinct function with .keep_all argument to remove duplicates by only keeping the first one that appears
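
A minimal sketch with made-up data, where `title` is the hypothetical column that defines a duplicate:

```r
library(dplyr)

# Hypothetical toy data: title "A" appears twice.
shows <- tibble(
  title = c("A", "A", "B"),
  year  = c(2001, 2005, 2010)
)

# Keep the first row for each title, along with all other columns.
first_per_title <- distinct(shows, title, .keep_all = TRUE)

first_per_title
```

Without `.keep_all = TRUE`, only the `title` column would be returned.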


Using logistic regression to answer, "Does season 1 rating affect the probability of getting a second season?" (note he forgets to specify the family argument, fixed at 57:25)
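
A sketch of such a model on simulated data. The data, column names, and coefficients are invented for illustration and are not the screencast's results. Note that `glm` defaults to `family = gaussian`, so leaving out the `family` argument fits an ordinary linear model rather than a logistic regression:

```r
set.seed(2019)

# Simulated (hypothetical) data: one row per show.
n <- 200
season1_rating <- runif(n, 5, 9)
has_season2 <- rbinom(n, 1, plogis(-6 + 0.9 * season1_rating))
shows <- data.frame(season1_rating, has_season2)

# family = "binomial" is what makes this a logistic regression.
fit <- glm(has_season2 ~ season1_rating, data = shows, family = "binomial")

summary(fit)$coefficients
```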


Using ntile function to divide data into N bins (5 in this case), then eventually using cut function instead
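
The difference between the two approaches can be sketched on made-up ratings. The break points passed to `cut` are an assumption, not the ones used in the screencast:

```r
library(dplyr)

# Hypothetical ratings.
rating <- c(5.1, 6.0, 6.4, 7.2, 7.9, 8.3, 8.8, 9.1, 6.9, 7.5)

# ntile: 5 groups of (roughly) equal size, assigned by rank.
by_ntile <- ntile(rating, 5)

# cut: intervals with fixed boundaries, so group sizes can differ.
by_cut <- cut(rating, breaks = c(5, 6, 7, 8, 9, 10))

table(by_ntile)
table(by_cut)
```

With `cut`, the bin labels show the actual rating ranges, which can be easier to interpret than bin numbers.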

Adding year as an independent variable to the logistic regression model

Adding an interaction term (season 1 interacting with year) to the logistic regression model


Using augment function as a method of visualizing and interpreting coefficients of the regression model


Using crossing function to create new data to test the logistic regression model on and interpret model coefficients
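
A sketch of this workflow on simulated data. The data, column names, and grid values are hypothetical, and base R's `predict` is used here to get the predictions:

```r
library(tidyr)

set.seed(2019)

# Simulated (hypothetical) data: one row per show.
n <- 300
shows <- data.frame(
  season1_rating = runif(n, 5, 9),
  year = sample(1990:2018, n, replace = TRUE)
)
shows$has_season2 <- rbinom(n, 1, plogis(-6 + 0.9 * shows$season1_rating))

# Logistic regression with an interaction between rating and year.
fit <- glm(has_season2 ~ season1_rating * year,
           data = shows, family = "binomial")

# Every combination of the chosen ratings and years.
new_data <- crossing(
  season1_rating = seq(5, 9, by = 1),
  year = c(1990, 2000, 2010)
)

# Predicted probability of a second season for each combination.
new_data$predicted <- as.numeric(
  predict(fit, newdata = new_data, type = "response")
)

new_data
```

Reading predicted probabilities off a grid like this is often easier than interpreting interaction coefficients directly.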


Fitting natural splines using the splines package, which would capture a non-linear relationship
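
A sketch of a natural spline term in a logistic regression, again on simulated data. The choice of `df = 3` and the quadratic shape of the simulated relationship are assumptions for illustration:

```r
library(splines)

set.seed(2019)

# Simulated (hypothetical) data with a non-linear effect of rating.
n <- 300
season1_rating <- runif(n, 5, 9)
has_season2 <- rbinom(n, 1, plogis(-1 + 0.8 * (season1_rating - 7)^2))
shows <- data.frame(season1_rating, has_season2)

# ns() expands the rating into a natural spline basis with 3 degrees
# of freedom, letting the fitted curve bend.
fit_spline <- glm(has_season2 ~ ns(season1_rating, df = 3),
                  data = shows, family = "binomial")

coef(fit_spline)
```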

Summary of screencast