Board Game Reviews

Notable topics: LASSO regression (glmnet package)

Recorded on: 2019-03-14

Timestamps by: Alex Cookson

## Screencast

## Timestamps

Starting EDA (exploratory data analysis) with counts of categorical variables

Specifying scale_x_log10 function's breaks argument to get sensible tick marks for time on histogram

Tweaking geom_histogram function's binwidth argument to get something that makes sense for log scale

Using separate_rows to break down comma-separated values for three different categorical variables

Using top_n to get top 20 observations from each of several categories (not quite right, fixed at 17:47)

Troubleshooting various issues with faceted graph (e.g., ordering, values appearing in multiple categories)

Starting prediction of average rating with a linear model

Splitting data into train/test sets (training/holdout)

Investigating relationship between max number of players and average rating (to determine if it should be in linear model)

Exploring average rating over time ("Do newer games tend to be rated higher/lower?")

Discussing necessity of controlling for year a game was published in the linear model

Non-model approach to exploring the relationship between game features (e.g., card game, made in Germany) and average rating

Using geom_boxplot function to create boxplot of average ratings for most common game features

Using unite function to combine multiple variables into one

Introducing Lasso regression as good option when you have many features likely to be correlated with one another

Writing code to set up Lasso regression using glmnet and tidytext packages

Adding average rating to the feature matrix (warning: method is messy)

Using setdiff function to find games that are in one set, but not in another (while setting up matrix for Lasso regression)

Spotting the error stemming from the step above (calling row names from the wrong data)

Explaining what a Lasso regression does, including the penalty parameter lambda

Using a cross-validated Lasso model to choose the level of the penalty parameter (lambda)

Adding non-categorical variables to the Lasso model to control for them (e.g., max number of players)

Using unite function to combine multiple variables into one, separated by a colon

Graphing the top 20 coefficients in the Lasso model that have the biggest effect on predicted average rating

Mentioning the yardstick package as a way to evaluate the model's performance

Discussing drawbacks of linear models like Lasso (can't capture non-linear relationships or interaction effects unless those terms are added explicitly)
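
The Lasso steps listed above (splitting comma-separated features with separate_rows, building a sparse feature matrix with tidytext, choosing lambda by cross-validation with glmnet, and ranking coefficients) can be sketched roughly as follows. This is a minimal illustration on simulated data, not the code written in the screencast: the `games` data, the category names, the effect sizes, and every object name are assumptions made up for the example.

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(glmnet)

set.seed(2019)

# Hypothetical toy data standing in for the board game data set:
# one row per game, two comma-separated categories, and an average rating.
n_games <- 200
all_categories <- c("Card Game", "Wargame", "Dice",
                    "Economic", "Party Game", "Fantasy")

games <- tibble(
  game_id = seq_len(n_games),
  category = replicate(n_games,
                       paste(sample(all_categories, 2), collapse = ","))
) %>%
  # Simulated ratings: wargames rated higher, party games lower (assumption).
  mutate(average_rating = 6 +
           0.8 * grepl("Wargame", category, fixed = TRUE) -
           0.6 * grepl("Party Game", category, fixed = TRUE) +
           rnorm(n_games, sd = 0.3))

# One row per game/feature pair.
features <- games %>%
  select(game_id, category) %>%
  separate_rows(category, sep = ",") %>%
  mutate(value = 1)

# Sparse matrix: rows are games, columns are categories.
feature_matrix <- features %>%
  cast_sparse(game_id, category, value)

# Align the response with the matrix rows by looking row names up in the
# same data the matrix was built from.
ratings <- games$average_rating[
  match(as.integer(rownames(feature_matrix)), games$game_id)
]

# Cross-validated Lasso (alpha = 1 is the glmnet default).
cv_lasso <- cv.glmnet(feature_matrix, ratings)

# Coefficients at the more heavily penalised "lambda.1se" choice,
# largest absolute effects first.
coefs <- coef(cv_lasso, s = "lambda.1se")

coef_table <- tibble(
  term = rownames(coefs),
  estimate = as.numeric(as.matrix(coefs))
) %>%
  filter(term != "(Intercept)", estimate != 0) %>%
  arrange(desc(abs(estimate)))

print(coef_table)
```

Matching `ratings` to `rownames(feature_matrix)` is the step where pulling row names from the wrong object silently misaligns features and ratings, which is the kind of error flagged in the timestamps. Calling `plot(cv_lasso)` shows the cross-validated error across values of lambda.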