The Office

Text mining (tidytext package), LASSO regression (glmnet package)

Published: March 15, 2020

Recorded on: 2020-03-15

Timestamps by: Alex Cookson

Timestamps

Overview of transcripts data

Overview of ratings data

fct_inorder

Using fct_inorder function to create a factor with levels based on when they appear in the dataframe
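
As a minimal sketch of what fct_inorder does, the snippet below builds a factor from a few episode titles. It assumes the forcats package is installed; the vector `episode_names` is a made-up stand-in, not the screencast's data.

```r
library(forcats)

# Hypothetical episode titles, in the order they would appear in the data
episode_names <- c("Pilot", "Diversity Day", "Health Care", "Pilot")

# Levels follow order of first appearance rather than alphabetical order
episode_factor <- fct_inorder(episode_names)
levels(episode_factor)

# For comparison, base factor() sorts the levels alphabetically
levels(factor(episode_names))
```

Keeping the levels in data order is what lets a plot show episodes chronologically along the axis.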

theme
element_text

Using theme and element_text to turn axis labels 90 degrees

geom_line
geom_point

Creating a line graph with points at each observation (using geom_line and geom_point)

Adding text labels to very high and very low-rated episodes

theme
element_blank

Using theme function's panel.grid.major argument to get rid of some extraneous gridlines, using element_blank function

geom_text_repel
ggrepel

Using geom_text_repel from ggrepel package to experiment with different labelling (before abandoning this approach)

row_number

Using row_number function to add episode_number field to make graphing easier

Explanation of why number of ratings (votes) is relevant to interpreting the graph

unnest_tokens
tidytext

Using unnest_tokens function from tidytext package to split full-sentence text field to individual words

anti_join

Using anti_join function to filter out stop words (e.g., and, or, the)
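
The tokenizing and stop-word steps can be sketched together as below. This is an illustration only: the tibble `office_lines` and its two lines of dialogue are invented, and it assumes dplyr and tidytext are installed (`stop_words` is the stop-word table that ships with tidytext).

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the transcripts data
office_lines <- tibble(
  character = c("Michael", "Dwight"),
  text = c("That is what she said", "Bears eat beets")
)

words <- office_lines %>%
  # One row per word; unnest_tokens also lowercases and strips punctuation
  unnest_tokens(word, text) %>%
  # Keep only the words that do NOT appear in the stop-word list
  anti_join(stop_words, by = "word")

words
```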

str_remove_all

Using str_remove_all function to get rid of quotation marks from character names (quirks that might pop up when parsing)

bind_tf_idf
tidytext

Asking, "Are there words that are specific to certain characters?" (using bind_tf_idf function)
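
A small sketch of bind_tf_idf on invented counts may help show why it surfaces character-specific words. The tibble `word_counts` is hypothetical; the call assumes dplyr and tidytext are installed and takes the term, document, and count columns in that order.

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts per character
word_counts <- tibble(
  character = c("Michael", "Michael", "Dwight", "Dwight"),
  word      = c("party", "paper", "beets", "paper"),
  n         = c(3, 2, 4, 2)
)

character_tf_idf <- word_counts %>%
  # term = word, document = character, n = count
  bind_tf_idf(word, character, n) %>%
  arrange(desc(tf_idf))

character_tf_idf
```

Here "paper" is used by both characters, so its inverse document frequency is zero and it drops to the bottom, while words used by only one character rise to the top.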

reorder_within
scale_x_reordered

Using reorder_within function to re-order factors within a grouping (when a term appears in multiple groups) and scale_x_reordered function to graph

Asking, "What affects the popularity of an episode?"

Dealing with inconsistent episode names between datasets

str_remove

Using str_remove function and some regex to remove "(Parts 1&2)" from some episode names

str_to_lower

Using str_to_lower function to further align episode names (addresses inconsistent capitalization)
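
The two name-cleaning steps might look roughly like the following. The episode titles and the exact regular expression are assumptions for illustration; the pattern actually used in the screencast may differ. It assumes the stringr package is installed.

```r
library(stringr)

# Hypothetical episode names as they might appear in one dataset
episode_names <- c("The Dundies", "Niagara (Parts 1&2)", "Goodbye, Michael")

# Parentheses are regex metacharacters, so they are escaped with \\
without_parts <- str_remove(episode_names, " \\(Parts 1&2\\)")

# Lowercase everything so capitalization differences no longer block a join
cleaned <- str_to_lower(without_parts)
cleaned
```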

Setting up dataframe of features for a LASSO regression, with director and writer each being a feature with its own line

separate_rows

Using separate_rows function to separate episodes with multiple writers so that each has their own row
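
As a rough sketch of separate_rows, the example below splits a multi-writer field into one row per writer. The data frame `episodes` and the ";" separator are assumptions made for illustration; it assumes the tidyr package is installed.

```r
library(tidyr)

# Hypothetical episodes, one with several writers in a single field
episodes <- data.frame(
  episode = c("pilot", "diversity day"),
  writer  = c("Ricky Gervais;Stephen Merchant;Greg Daniels", "B.J. Novak")
)

# Each writer gets a separate row; the episode value is repeated
writers_long <- separate_rows(episodes, writer, sep = ";")
writers_long
```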

log2

Using log2 function to transform the number-of-lines fields into something more usable (since they are log-normally distributed)

cast_sparse
tidytext

Using cast_sparse function from tidytext package to create a sparse matrix of features by episode
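
A minimal sketch of cast_sparse, assuming a long table with one row per episode-feature pair. The tibble `features` and its values are invented; the call assumes dplyr, tidytext, and Matrix are installed.

```r
library(dplyr)
library(tidytext)

# Hypothetical long-format features: one row per (episode, feature) pair
features <- tibble(
  episode = c("pilot", "pilot", "diversity day"),
  feature = c("writer: Greg Daniels", "director: Ken Kwapis", "director: Ken Kwapis"),
  value   = c(1, 1, 1)
)

# Rows become episodes, columns become features, and missing pairs are stored as 0
feature_matrix <- features %>%
  cast_sparse(episode, feature, value)

dim(feature_matrix)
```

A sparse matrix is a natural fit here because most episodes lack most writers and directors, and glmnet accepts sparse input directly.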

semi_join

Using semi_join function as a "filtering join"
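
To illustrate the "filtering join" idea, the sketch below keeps only the ratings rows whose episode also appears in a second table. Both tibbles and their values are made up; it assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical ratings and transcript tables (values are invented)
ratings <- tibble(
  episode_name = c("pilot", "diversity day", "health care"),
  imdb_rating  = c(7.6, 8.3, 7.9)
)
transcripts <- tibble(
  episode_name = c("pilot", "health care", "health care"),
  character    = c("Michael", "Dwight", "Jim")
)

# semi_join filters ratings to matching episodes without adding columns
# or duplicating rows when the second table has repeated matches
kept <- semi_join(ratings, transcripts, by = "episode_name")
kept
```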

Setting up dataframes (after we have our features) to run LASSO regression

cv.glmnet
glmnet

Using cv.glmnet function from glmnet package to run a cross-validated LASSO regression

Explanation of how to pick a lambda penalty parameter
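
The cross-validation and lambda choice can be sketched on simulated data as follows. Everything here is an assumption for illustration: the features are random noise, only the first two have a true effect, and the names `x`, `y`, and `cv_fit` are hypothetical. It assumes the glmnet package is installed.

```r
library(glmnet)

set.seed(2020)

# Simulated data: 100 observations, 20 features, only two with a real effect
n <- 100
p <- 20
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(x) <- paste0("feature_", seq_len(p))
y <- 2 * x[, 1] - 1.5 * x[, 2] + rnorm(n)

# alpha = 1 (the default) gives the LASSO penalty; cv.glmnet tries a path of lambdas
cv_fit <- cv.glmnet(x, y, alpha = 1)

# Two common choices: the lambda with the lowest cross-validated error,
# and the largest lambda within one standard error of that minimum
cv_fit$lambda.min
cv_fit$lambda.1se

# Coefficients at the more conservative choice; many are shrunk exactly to zero
coefs <- coef(cv_fit, s = "lambda.1se")
coefs
```

Calling plot(cv_fit) draws the cross-validated error against log(lambda), which is a common way to inspect the trade-off visually.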

Explanation of output of LASSO model

Outline of why David likes regularized linear models (which is what LASSO is)

Summary of screencast