Board Game Reviews

LASSO regression (glmnet package)

Published: March 14, 2019

Notable topics: LASSO regression (glmnet package)

Recorded on: 2019-03-14

Timestamps by: Alex Cookson

Timestamps

Starting EDA (exploratory data analysis) with counts of categorical variables

scale_x_log10

Specifying scale_x_log10 function's breaks argument to get sensible tick marks for time on histogram

geom_histogram

Tweaking geom_histogram function's binwidth argument to get something that makes sense for log scale

separate_rows

Using separate_rows to break down comma-separated values for three different categorical variables

top_n

Using top_n to get top 20 observations from each of several categories (not quite right, fixed at 17:47)

facet_wrap

Troubleshooting various issues with faceted graph (e.g., ordering, values appearing in multiple categories)

lm

Starting prediction of average rating with a linear model

Splitting data into train/test sets (training/holdout)

Investigating relationship between max number of players and average rating (to determine if it should be in linear model)

Exploring average rating over time ("Do newer games tend to be rated higher/lower?")

Discussing necessity of controlling for year a game was published in the linear model

Non-model approach to exploring relationship between game features (e.g., card game, made in Germany) and average rating

geom_boxplot

Using geom_boxplot function to create boxplot of average ratings for most common game features

unite

Using unite function to combine multiple variables into one

Introducing Lasso regression as a good option when you have many features likely to be correlated with one another

Writing code to set up Lasso regression using glmnet and tidytext packages

Adding average rating to the feature matrix (warning: method is messy)

setdiff

Using setdiff function to find games that are in one set, but not in another (while setting up matrix for Lasso regression)

Spotting the error stemming from the step above (calling row names from the wrong data)

glmnet (glmnet package)

Explaining what a Lasso regression does, including the penalty parameter lambda

cv.glmnet (glmnet package)

Using a cross-validated Lasso model to choose the level of the penalty parameter (lambda)

Adding non-categorical variables to the Lasso model to control for them (e.g., max number of players)

unite

Using unite function to combine multiple variables into one, separated by a colon

Graphing the top 20 coefficients in the Lasso model that have the biggest effect on predicted average rating

yardstick

Mentioning the yardstick package as a way to evaluate the model's performance

Discussing drawbacks of linear models like Lasso (can't capture non-linear relationships or interaction effects)
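
As a rough illustration of what the penalty parameter lambda does in a Lasso regression, here is a minimal sketch in plain Python. It is not the screencast's code, which uses R's glmnet. It assumes a simplified objective, (1/2n) * sum of squared errors + lambda * sum of absolute coefficients, with no intercept and no standardisation, and it fits that objective by coordinate descent. The dataset is made up: two hypothetical features coded as +1/-1, one with a strong effect on the response and one with a weak effect.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n) * sum((y - X b)^2) + lam * sum(|b_j|), without an intercept.

    Simplified illustration only: assumes every column of X has a non-zero entry.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # sum of squares of feature j
            for i in range(n):
                # Prediction from every feature except j
                pred_others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_others)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n)
    return beta


# Made-up data (assumption): two +1/-1 coded features, uncorrelated by design.
X = [(1, 1), (1, -1), (-1, 1), (-1, -1)] * 2
# Response built from a strong effect (1.0) and a weak effect (0.1), no noise.
y = [1.0 * a + 0.1 * b for a, b in X]

for lam in (0.0, 0.3, 2.0):
    coefs = lasso_coordinate_descent(X, y, lam)
    print(lam, [round(c, 3) for c in coefs])
```

With lambda at 0 the fit matches ordinary least squares. As lambda grows, the weak coefficient is set to exactly zero first, then the strong one, which is why Lasso doubles as a feature selector. The sketch takes lambda as given; choosing it by cross-validation, as cv.glmnet does in the screencast, is a separate step not shown here.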