Horror Movies

Notable topics: ANOVA, Text mining (tidytext package), LASSO regression (glmnet package)

Recorded on: 2019-10-21

Timestamps by: Alex Cookson





Extracting digits (release year) from character string using regex, along with good explanation of extract function
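As a rough illustration of the extract step, here is a minimal sketch using tidyr's extract function. The titles and the column names (title, year) are made-up assumptions for the example, not the data from the screencast.

```r
library(tidyr)

# Toy data (hypothetical): titles with the release year in parentheses
movies <- data.frame(
  title = c("Example Movie A (2012)", "Example Movie B (2017)"),
  stringsAsFactors = FALSE
)

# The regex has one capture group, so `into` names one new column.
# remove = FALSE keeps the original title column;
# convert = TRUE turns the captured "2012" into an integer.
movies_with_year <- extract(
  movies, title,
  into = "year",
  regex = "\\((\\d{4})\\)",
  remove = FALSE,
  convert = TRUE
)
```

Each capture group in the regex maps to one name in `into`, so a pattern with two groups would need two column names.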


Quick check on why parse_number is unable to parse some values: are they NA, or is there some other reason?

Visually investigating correlation between budget and rating

Investigating correlation between MPAA rating (PG-13, R, etc.) and rating using boxplots


Using pull function to quickly check levels of a factor

Using ANOVA to check whether variation between groups (MPAA rating) is larger than variation within groups
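A one-way ANOVA of this kind can be sketched with base R's aov function. The ratings below are invented toy values and the column names (movie_rating, review_rating) are assumptions for illustration only.

```r
# Toy data (hypothetical): review ratings for three MPAA rating groups
ratings <- data.frame(
  movie_rating  = factor(rep(c("PG-13", "R", "NOT RATED"), each = 4)),
  review_rating = c(5.1, 4.8, 5.5, 5.0,
                    5.9, 6.2, 5.7, 6.0,
                    4.2, 4.9, 4.5, 4.4)
)

# One-way ANOVA: does mean review rating differ across MPAA groups?
fit <- aov(review_rating ~ movie_rating, data = ratings)

# The first element of the summary is the ANOVA table
anova_table <- summary(fit)[[1]]
f_value <- anova_table[["F value"]][1]  # between-group vs within-group variance
p_value <- anova_table[["Pr(>F)"]][1]
```

A large F value means the group means differ by more than the spread inside each group would explain.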


Separating genre using separate_rows function (instead of str_split and unnest)
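A minimal sketch of separate_rows on a delimited genre column follows. The toy rows and the "| " delimiter are assumptions about how the genres are stored.

```r
library(tidyr)

# Toy data (hypothetical): several genres packed into one string per movie
movies <- data.frame(
  title  = c("Example Movie A", "Example Movie B"),
  genres = c("Horror| Thriller", "Comedy| Horror| Sci-Fi"),
  stringsAsFactors = FALSE
)

# `sep` is a regular expression, so the pipe has to be escaped.
# Each genre ends up on its own row, with the title repeated.
by_genre <- separate_rows(movies, genres, sep = "\\| ")
```

This does in one call what str_split followed by unnest would do in two.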


Removing boilerplate "Directed by..." and "With..." part of plot variable and isolating plot, first using regex, then by using separate function with periods as separator

Unnesting word tokens, removing stop words, and counting appearances

Aggregating by word to find words that appear in high- or low-rated movies
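The tokenizing and aggregating steps could look roughly like the sketch below, using tidytext and dplyr. The plots, ratings, and column names are invented for illustration, and counting each word once per movie is an assumption rather than something the notes confirm.

```r
library(dplyr)
library(tidytext)

# Toy data (hypothetical plots and ratings)
movies <- tibble(
  title = c("Example Movie A", "Example Movie B", "Example Movie C"),
  review_rating = c(6.5, 3.0, 5.0),
  plot = c("A family moves into a haunted house.",
           "Zombies attack a haunted town.",
           "A family fights zombies in the woods.")
)

word_ratings <- movies %>%
  unnest_tokens(word, plot) %>%                # one row per lowercased word
  anti_join(stop_words, by = "word") %>%       # drop common words like "a", "the"
  distinct(title, word, .keep_all = TRUE) %>%  # count each word once per movie
  group_by(word) %>%
  summarize(n_movies = n(),
            avg_rating = mean(review_rating)) %>%
  arrange(desc(n_movies))
```

With real data, filtering to words that appear in a minimum number of movies keeps the averages from being driven by a single title.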

Discussing potential confounding factors for ratings associated with specific words

Searching for duplicated movie titles

De-duping using distinct function


Loading in and explaining glmnet package

Using rownames and match functions to build an index from movie titles, so that the matching rating can be pulled out of the original dataset
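The indexing idea can be sketched in base R as follows. The feature matrix and ratings table are small hypothetical stand-ins; the point is only that match lines up the response with the matrix rows even when the row order differs.

```r
# Toy ratings table (hypothetical)
ratings <- data.frame(
  title = c("Example Movie A", "Example Movie B", "Example Movie C"),
  review_rating = c(6.5, 3.0, 5.0),
  stringsAsFactors = FALSE
)

# Toy feature matrix whose rows are movies, in a different order
# and covering only some of the titles
word_matrix <- matrix(
  c(1, 0,
    0, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Example Movie C", "Example Movie A"),
                  c("haunted", "zombies"))
)

# For each matrix row, find the position of its title in the ratings table
idx <- match(rownames(word_matrix), ratings$title)

# Response vector in the same order as the matrix rows
y <- ratings$review_rating[idx]
```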

Actually using glmnet function to create lasso model

Showing built-in plot of lasso lambda against mean-squared error

Explaining when certain terms appeared in the lasso model as the lambda value dropped
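For readers who want to try the lasso steps, here is a minimal sketch on simulated data. The predictors word1 to word20 and the response are generated purely for illustration, and the use of cv.glmnet for the lambda versus mean-squared error plot is an assumption about how that plot was produced.

```r
library(glmnet)

set.seed(2019)

# Simulated data (hypothetical): 200 movies, 20 word indicators,
# only word1 and word2 actually affect the rating
n <- 200
p <- 20
x <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("word", 1:p)))
y <- 5 + 1.5 * x[, "word1"] - 1 * x[, "word2"] + rnorm(n)

# glmnet defaults to alpha = 1, i.e. the lasso penalty,
# and fits a whole path of lambda values
lasso <- glmnet(x, y)

# Step along the path at which each term first gets a nonzero coefficient
# (lambda decreases as the step index increases)
coef_path <- as.matrix(coef(lasso))[-1, ]
entry_step <- apply(coef_path != 0, 1, function(z) which(z)[1])

# Cross-validation to choose lambda; plot() draws mean-squared error
# against lambda on a log scale
cv <- cv.glmnet(x, y)
plot(cv)

# Terms kept at the more conservative lambda.1se
b <- as.matrix(coef(cv, s = "lambda.1se"))
selected <- rownames(b)[b[, 1] != 0]
```

Terms with a small entry_step are the ones the model picks up first as the penalty relaxes, which is one way to read off which predictors carry the strongest signal.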

Gathering all variables except for title, so that the dataset is very tall


Using unite function to combine two variables (better alternative to paste)
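The gather and unite steps can be sketched together as below. The columns (language, release_country) and the ": " separator are assumptions chosen for the example.

```r
library(tidyr)

# Toy data (hypothetical): one row per movie, one column per attribute
movies <- data.frame(
  title = c("Example Movie A", "Example Movie B"),
  language = c("English", "Spanish"),
  release_country = c("USA", "Mexico"),
  stringsAsFactors = FALSE
)

# Gather every column except title into type/value pairs (a tall dataset)
tall <- gather(movies, type, value, -title)

# Combine type and value into a single feature label;
# unite drops the two input columns by default
features <- unite(tall, feature, type, value, sep = ": ")
```

Compared with paste inside mutate, unite handles the column removal and placement in one step.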

Creating a new lasso model with many additional variables beyond plot words