Medium Articles

Notable topics: Text mining (tidytext package)

Recorded on: 2018-12-03

Timestamps by: Alex Cookson

## Timestamps

Using summarise_at and starts_with functions to quickly sum up all variables starting with "tag_"

Using gather function (now pivot_longer) to convert topic tag variables from wide to tall (tidy) format
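The gather step above can be sketched with a minimal Python analogue using pandas (the screencast itself uses R's tidyr; the column names here are hypothetical stand-ins for the dataset's "tag_" columns):

```python
import pandas as pd

# Illustrative wide-format data: one row per article, one 0/1 column per topic tag
wide = pd.DataFrame({
    "title": ["Post A", "Post B"],
    "tag_ai": [1, 0],
    "tag_data_science": [0, 1],
})

# pandas' melt is the analogue of tidyr's gather/pivot_longer:
# tag columns become (tag, value) pairs, one row per article-tag combination
tall = wide.melt(id_vars="title", var_name="tag", value_name="value")
print(tall.shape)  # (4, 3): 2 articles x 2 tags
```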

Explanation of how the gathering step above lets us find the most/least common tags

Explanation of using median (instead of mean) as measure of central tendency for number of claps an article got

Visualizing log-normal (ish) distribution of number of claps an article gets
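The median-vs-mean point can be illustrated with simulated data in Python (a sketch, not the screencast's actual dataset): for a right-skewed, roughly log-normal variable like claps, a few huge articles drag the mean far above the median.

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulated claps: log-normal, mimicking the heavily right-skewed
# distribution seen in the screencast (parameters are illustrative)
claps = rng.lognormal(mean=2.0, sigma=1.5, size=100_000)

# The long right tail inflates the mean, so the median is the better
# summary of what a "typical" article gets
print(np.mean(claps) > np.median(claps))  # True for right-skewed data
```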

Using pmin function to cap reading times at 10 minutes, so everything longer lands in a single "10+" bin
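A one-line Python analogue of that pmin capping step (np.minimum plays the role of R's pmin; the reading times here are made up):

```python
import numpy as np

reading_times = np.array([2, 5, 8, 12, 33])
# np.minimum against a scalar is the analogue of R's pmin:
# everything of 10 minutes or more collapses into the 10-minute bin
capped = np.minimum(reading_times, 10)
print(capped.tolist())  # [2, 5, 8, 10, 10]
```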

Changing scale_x_continuous function's breaks argument to get custom labels and tick marks on a histogram

Discussion of using mean vs. median as measure of central tendency for reading time (he decides on mean)

Starting text mining analysis

Using unnest_tokens function from tidytext package to split character string into individual words

Explanation of stop words and using anti_join function from the dplyr package (with tidytext's stop_words lexicon) to get rid of them

Using str_detect function to filter out "words" that are just numbers (e.g., "2", "35")
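The three text-cleaning steps above (tokenize, drop stop words, drop number-only tokens) can be sketched in plain Python; the stop-word list here is a tiny illustrative stand-in for the much larger lexicons tidytext ships:

```python
import re

title = "The 7 Best Tools for Data Science in 2018"

# Lowercase word tokens: a rough analogue of tidytext's unnest_tokens
words = re.findall(r"[a-z0-9']+", title.lower())

# Tiny illustrative stop-word set (not a real lexicon)
stop_words = {"the", "for", "in", "of", "a", "to"}
words = [w for w in words if w not in stop_words]

# Drop "words" that are just numbers, mirroring the str_detect filter
words = [w for w in words if not re.fullmatch(r"\d+", w)]
print(words)  # ['best', 'tools', 'data', 'science']
```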

Quick analysis of which individual words are associated with more/fewer claps ("What are the hype words?")

Using geometric mean as alternative to median to get more distinction between words (note 27:33 where he makes a quick fix)
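A minimal Python sketch of that geometric mean, with a +1 offset so zero-clap articles don't zero out the product (the screencast applies a similar adjustment; the clap counts here are made up):

```python
import numpy as np

claps = np.array([0, 3, 10, 500])

# Geometric mean via the log trick, with a +1/-1 offset to handle zeros
geo_mean = np.exp(np.mean(np.log(claps + 1))) - 1
arith_mean = claps.mean()

# The geometric mean is far less dominated by the single 500-clap outlier
print(geo_mean < arith_mean)  # True
```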

Starting analysis of clusters of related words (e.g., "neural" is linked to "network")

Finding correlations between pairs of words using pairwise_cor function from widyr package
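What pairwise_cor computes can be sketched in pandas (an analogue, not widyr's implementation): on a 0/1 document-word matrix, ordinary Pearson correlation between columns is the phi coefficient between words. The toy documents below are hypothetical:

```python
import pandas as pd

# Binary document-word matrix: rows = documents, columns = words (made-up data)
docs = pd.DataFrame({
    "neural":  [1, 1, 0, 0],
    "network": [1, 1, 0, 0],
    "startup": [0, 0, 1, 1],
}, index=["d1", "d2", "d3", "d4"])

# Pearson correlation on 0/1 indicators = phi coefficient between words,
# the quantity widyr's pairwise_cor reports
cors = docs.corr()
print(cors.loc["neural", "network"])  # 1.0: the words always co-occur
print(cors.loc["neural", "startup"])  # -1.0: they never co-occur
```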

Using ggraph and igraph packages to make network plot of correlated pairs of words

Using geom_node_text to add labels for points (vertices) in the network plot

Filtering original data to include only words that appear in the network plot (the 150 most correlated word pairs)

Adding colour as a dimension to the network plot, representing geometric mean of claps

Changing default colour scale to one with Blue = Low and Red = High using scale_colour_gradient2 function

Adding dark outlines to points on network plot with a hack

Starting to predict number of claps based on title tag (Lasso regression)

Explanation of data format needed to conduct Lasso regression (and using cast_sparse function to get sparse matrix)
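The sparse-matrix format cast_sparse produces can be sketched with scipy (a Python analogue; the triplets below are hypothetical): tidy (row, column, value) triples, one per article-word pair, become a matrix that stores only nonzero entries.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tidy triplets: (article index, word index, count) for each article-word pair
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 1, 1, 2])
vals = np.array([1, 1, 1, 1])

# csr_matrix from (data, (row, col)) triplets: the analogue of cast_sparse
X = csr_matrix((vals, (rows, cols)), shape=(3, 3))
print(X.shape)  # (3, 3)
print(X.nnz)    # 4 stored entries; every other cell is an implicit zero
```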

Bringing in number of claps to the sparse matrix (un-tidy methods)

Using cv.glmnet function (cv = cross validated) from glmnet package to run Lasso regression
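A hedged Python analogue of that step using scikit-learn's LassoCV (not glmnet; the simulated "articles" and "words" below are made up): like cv.glmnet, it cross-validates over a path of penalty values (glmnet's lambda, sklearn's alpha) and keeps the best one.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
# Hypothetical design matrix: 200 "articles" x 20 binary "word" indicators
X = rng.integers(0, 2, size=(200, 20)).astype(float)
# Response depends only on the first two words, plus noise
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.5, size=200)

# LassoCV cross-validates the L1 penalty strength and refits at the best value,
# analogous to cv.glmnet's cross-validated lambda selection
model = LassoCV(cv=5).fit(X, y)

# The two true predictors keep large coefficients of the right sign
print(model.coef_[0], model.coef_[1])
```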

Finding and fixing mistake in defining Lasso model

Explanation of Lasso model

Using tidy function from the broom package to tidy up the Lasso model

Visualizing how specific words affect the prediction of claps as lambda (Lasso's penalty parameter) changes

Summary of screencast