Board Game Reviews

LASSO regression (glmnet package)


March 14, 2019

Notable topics: LASSO regression (glmnet package)

Recorded on: 2019-03-14

Timestamps by: Alex Cookson




Starting EDA (exploratory data analysis) with counts of categorical variables


Specifying scale_x_log10 function's breaks argument to get sensible tick marks for time on histogram


Tweaking geom_histogram function's binwidth argument to get something that makes sense for log scale


Using separate_rows to break down comma-separated values for three different categorical variables
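To make the reshaping step concrete, here is a minimal sketch of what splitting comma-separated values into one row per value does. It is written in Python as a simplified stand-in, not the R code from the screencast: the helper name merely mirrors tidyr's `separate_rows`, and the game names and the `category` column are hypothetical.

```python
def separate_rows(rows, column, sep=","):
    """Return one output row per separated value found in `column`.

    Simplified stand-in for the idea behind tidyr's separate_rows;
    every other field of the row is copied unchanged.
    """
    long_rows = []
    for row in rows:
        for value in row[column].split(sep):
            long_rows.append({**row, column: value.strip()})
    return long_rows

# Hypothetical data: one row per game, several categories per cell.
games = [
    {"name": "Game A", "category": "Card Game,Fantasy"},
    {"name": "Game B", "category": "Dice"},
]

long_games = separate_rows(games, "category")
for row in long_games:
    print(row["name"], "|", row["category"])
```

After this step each game appears once per category, which is what makes simple counts per category possible.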


Using top_n to get top 20 observations from each of several categories (not quite right, fixed at 17:47)


Troubleshooting various issues with faceted graph (e.g., ordering, values appearing in multiple categories)


Starting prediction of average rating with a linear model

Splitting data into train/test sets (training/holdout)
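A train/holdout split can be sketched as follows. This is a Python illustration rather than the R code used in the screencast; the 80/20 proportion, the seed, and the function name `split_train_holdout` are assumptions made for the example.

```python
import random

def split_train_holdout(rows, holdout_fraction=0.2, seed=2019):
    """Shuffle rows reproducibly and return (train, holdout) lists."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(rows)          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Hypothetical data: ten games identified only by an id.
games = [{"game_id": i} for i in range(10)]
train, holdout = split_train_holdout(games)
print(len(train), "training rows,", len(holdout), "holdout rows")
```

The model is fitted only on the training rows; the holdout rows are kept aside to check predictions later.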

Investigating relationship between max number of players and average rating (to determine if it should be in the linear model)

Exploring average rating over time ("Do newer games tend to be rated higher/lower?")

Discussing necessity of controlling for year a game was published in the linear model

Non-model approach to exploring relationship between game features (e.g., card game, made in Germany) and average rating


Using geom_boxplot function to create boxplot of average ratings for most common game features


Using unite function to combine multiple variables into one

Introducing Lasso regression as a good option when you have many features likely to be correlated with one another

Writing code to set up Lasso regression using glmnet and tidytext packages

Adding average rating to the feature matrix (warning: method is messy)


Using setdiff function to find games that are in one set, but not in another (while setting up matrix for Lasso regression)

Spotting the error stemming from the step above (calling row names from the wrong data)


Explaining what a Lasso regression does, including the penalty parameter lambda
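For reference, the standard Lasso objective for a numeric outcome can be written as below. This follows the form documented for glmnet with alpha = 1; the screencast's own explanation may be worded differently. Here N is the number of observations, p the number of features, and lambda the penalty parameter.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\; \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert
```

A larger lambda shrinks more coefficients to exactly zero; lambda = 0 recovers ordinary least squares.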


Using a cross-validated Lasso model to choose the level of the penalty parameter (lambda)
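The idea of choosing lambda by cross-validation can be sketched on a toy problem. The Python below is an illustration only, not the screencast's R code: it uses a single feature with no intercept so the Lasso fit has a closed form, and the simulated data, the lambda grid, and all function names are assumptions. In R, glmnet's `cv.glmnet` automates this over a whole path of lambda values.

```python
import random

def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; anything within the penalty becomes 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def fit_lasso_1d(xs, ys, lam):
    """Closed-form Lasso for one feature and no intercept.

    Minimises (1/(2n)) * sum((y - b*x)^2) + lam * |b|.
    """
    n = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / n
    z = sum(x * x for x in xs) / n
    return soft_threshold(rho, lam) / z

def cv_error(xs, ys, lam, k=5):
    """Mean squared prediction error, averaged over k held-out folds."""
    fold_errors = []
    pairs = list(zip(xs, ys))
    for fold in range(k):
        train = [p for i, p in enumerate(pairs) if i % k != fold]
        test = [p for i, p in enumerate(pairs) if i % k == fold]
        b = fit_lasso_1d([x for x, _ in train], [y for _, y in train], lam)
        fold_errors.append(sum((y - b * x) ** 2 for x, y in test) / len(test))
    return sum(fold_errors) / k

# Simulated data (assumption): y depends linearly on x, plus noise.
rng = random.Random(2019)
xs = [rng.gauss(0, 1) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

# Try each candidate penalty and keep the one with the lowest CV error.
lambdas = [0.0, 0.01, 0.1, 0.5, 1.0, 5.0]
cv_errors = {lam: cv_error(xs, ys, lam) for lam in lambdas}
best_lambda = min(cv_errors, key=cv_errors.get)
print("best lambda:", best_lambda)
```

Each candidate lambda is scored only on data the fit did not see, so the chosen penalty reflects predictive error rather than training fit.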

Adding non-categorical variables to the Lasso model to control for them (e.g., max number of players)


Using unite function to combine multiple variables into one, separated by a colon

Graphing the top 20 coefficients in the Lasso model that have the biggest effect on predicted average rating


Mentioning the yardstick package as a way to evaluate the model's performance

Discussing drawbacks of linear models like Lasso (can't capture non-linear relationships or interaction effects on their own)