Wine Ratings

Text mining (tidytext package), LASSO regression (glmnet package)

Published

May 30, 2019

Notable topics: Text mining (tidytext package), LASSO regression (glmnet package)

Recorded on: 2019-05-30

Timestamps by: Alex Cookson

Timestamps

extract
tidyr

Using extract function from tidyr package to pull out year from text field

extract
tidyr

Changing extract function to pull out year column more accurately

Starting to explore prediction of points

fct_lump
fct_relevel

Using fct_lump on country variable to collapse countries into an "Other" category, then fct_relevel to set the baseline category for a linear model

Investigating year as a potential confounding variable

Investigating "taster_name" as a potential confounding variable

tidy
broom

Coefficient (TIE fighter) plot to see effect size of terms in a linear model, using tidy function from broom package

str_replace

Polishing category names for presentation in graph using str_replace function

augment

Using augment function to add predictions of linear model to original data

Plotting predicted points vs. actual points

Using ANOVA to determine the amount of variation that is explained by different terms

tidytext

Using tidytext package to set up wine review text for Lasso regression

pairwise_cor
widyr

Setting up and using pairwise_cor function to look at words that appear in reviews together

cast_sparse
tidytext

Creating sparse matrix using cast_sparse function from tidytext package; used to perform a regression on positive/negative words

Checking if rownames of sparse matrix correspond to the wine_id values they represent

glmnet

Setting up the sparse matrix for use with the glmnet package, which fits a sparse regression using the Lasso method

glmnet

Writing the actual code to fit the Lasso regression

Basic explanation of Lasso regression

tidy

Putting Lasso model into tidy format

Explaining how the number of terms increases as lambda (penalty parameter) decreases

Answering how we choose a lambda value (penalty parameter) for Lasso regression

Using parallelization for intensive computations

Adding price (from original linear model) to Lasso regression

glmnet

Showing the glmnet.fit piece of a Lasso (glmnet) model

Picking a lambda value (penalty parameter) and explaining which one to pick

Taking the most extreme coefficients (positive and negative) by grouping them by direction

tidytext

Demonstrating tidytext package's sentiment lexicon, then looking at individual reviews to demonstrate the model

Visualizing each coefficient's effect on a single review

str_trunc

Using str_trunc to truncate character strings
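
The text-mining and Lasso steps listed above can be sketched roughly as follows. This is a minimal illustration, not the screencast's actual code: the `reviews` data frame, its eight toy rows, and the variable names `review_words`, `word_matrix`, and `lasso_fit` are all hypothetical stand-ins for the real wine data.

```r
library(dplyr)
library(tidytext)
library(glmnet)

# Hypothetical toy data standing in for the wine reviews (not the real dataset)
reviews <- tibble(
  wine_id = 1:8,
  points = c(95, 93, 92, 90, 85, 84, 82, 80),
  description = c(
    "Elegant and complex with a long finish",
    "Rich, elegant fruit and a long finish",
    "Complex and rich with ripe fruit",
    "Ripe fruit and an elegant finish",
    "Simple fruit with a short finish",
    "Thin and simple, slightly bitter",
    "Bitter and thin with a short finish",
    "Simple, thin and bitter"
  )
)

# One row per (review, word), dropping stop words and repeats within a review
review_words <- reviews %>%
  unnest_tokens(word, description) %>%
  anti_join(stop_words, by = "word") %>%
  distinct(wine_id, word)

# Sparse matrix: rows are wine_id values, columns are words
word_matrix <- review_words %>%
  cast_sparse(wine_id, word)

# Row names hold the wine_id values as strings; use them to align the response
wine_ids <- as.integer(rownames(word_matrix))
points <- reviews$points[match(wine_ids, reviews$wine_id)]

# glmnet fits the Lasso by default (alpha = 1) over a decreasing lambda sequence
lasso_fit <- glmnet(word_matrix, points)

# Number of nonzero coefficients at each lambda value
data.frame(lambda = lasso_fit$lambda, nonzero_terms = lasso_fit$df)
```

Aligning `points` through the row names, rather than assuming the matrix rows follow the original row order, guards against reviews that drop out entirely once stop words are removed. The final data frame shows how many terms are in the model at each value of the penalty parameter.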