Medium Articles

Text mining (tidytext package)

Published: December 3, 2018

Notable topics: Text mining (tidytext package)

Recorded on: 2018-12-03

Timestamps by: Alex Cookson

Timestamps

summarise_at
starts_with

Using summarise_at and starts_with functions to quickly sum up all variables starting with "tag_"
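A minimal sketch of that step, assuming one row per article and one 0/1 indicator column per tag. The toy data and column names are hypothetical, not taken from the screencast.

```r
library(dplyr)

# Hypothetical toy data: one row per article, one 0/1 column per tag
articles <- tibble(
  title = c("a", "b", "c"),
  tag_ai = c(1, 0, 1),
  tag_data_science = c(1, 1, 1)
)

# Sum every column whose name starts with "tag_"
tag_totals <- articles %>%
  summarise_at(vars(starts_with("tag_")), sum)
```

In current dplyr, `summarise(across(starts_with("tag_"), sum))` is the recommended equivalent.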

gather

Using gather function (now pivot_longer) to convert topic tag variables from wide to tall (tidy) format

Explanation of how gathering step above will let us find the most/least common tags
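A rough illustration of why the tall format helps, using pivot_longer (the successor to gather). The data and column names are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per article, one 0/1 column per tag
articles <- tibble(
  title = c("a", "b", "c"),
  tag_ai = c(1, 0, 1),
  tag_data_science = c(1, 1, 1)
)

# Wide to tall: one row per article/tag pair, then count the tags that apply
tag_counts <- articles %>%
  pivot_longer(starts_with("tag_"), names_to = "tag", values_to = "value") %>%
  filter(value == 1) %>%
  count(tag, sort = TRUE)
```

Once each tag is a value in a single column, finding the most or least common tags is an ordinary count.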

median

Explanation of using median (instead of mean) as measure of central tendency for number of claps an article got

Visualizing log-normal (ish) distribution of number of claps an article gets

pmin

Using pmin function to cap reading times at 10 minutes, so that all articles of 10 minutes or more fall into a single bin
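A small sketch of the capping idea. The reading times below are made up for illustration.

```r
# Hypothetical reading times in minutes
reading_time <- c(2, 7, 10, 14, 45)

# pmin takes the element-wise minimum, so anything above 10 becomes 10
capped <- pmin(reading_time, 10)
```

Here `capped` holds 2, 7, 10, 10, 10: values under the cap are unchanged and everything else collapses to 10.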

scale_x_continuous

Changing scale_x_continuous function's breaks argument to get custom labels and tick marks on a histogram
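One way this might look, assuming reading times have been capped at 10 minutes so the last tick can be labelled "10+". The data, break positions, and labels are assumptions, not the values used in the screencast.

```r
library(ggplot2)

# Hypothetical reading times, capped at 10 minutes
df <- data.frame(reading_time = pmin(c(1, 3, 4, 6, 8, 12, 25), 10))

# breaks sets where ticks go; labels sets the text shown at each tick
p <- ggplot(df, aes(reading_time)) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(
    breaks = seq(2, 10, 2),
    labels = c(seq(2, 8, 2), "10+")
  )
```

The `labels` vector must have the same length as `breaks`.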

Discussion of using mean vs. median as measure of central tendency for reading time (he decides on mean)

Starting text mining analysis

unnest_tokens
tidytext

Using unnest_tokens function from tidytext package to split character string into individual words
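A minimal sketch of the tokenizing step, with hypothetical titles and column names.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: one row per article
titles <- tibble(
  post_id = 1:2,
  title = c("Neural Network Basics", "Learning Python in 2018")
)

# One row per word; the new column is called `word`, the input column is `title`
title_words <- titles %>%
  unnest_tokens(word, title)
```

By default unnest_tokens lowercases the text and strips punctuation, and the other columns (here `post_id`) are kept alongside each word.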

anti_join
dplyr

Explanation of stop words and using anti_join function from dplyr (with the stop_words dataset from tidytext package) to get rid of them
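A short sketch of stop-word removal. anti_join comes from dplyr and the stop_words dataset ships with tidytext; the toy words are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokenized titles: one row per word
title_words <- tibble(
  post_id = c(1, 1, 1, 1),
  word = c("the", "neural", "of", "python")
)

# Keep only rows whose word has no match in the stop_words dataset
cleaned <- title_words %>%
  anti_join(stop_words, by = "word")
```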

str_detect

Using str_detect function to filter out "words" that are just numbers (e.g., "2", "35")
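A possible version of this filter. The regular expression is an assumption (the screencast may use a different pattern), and the toy words are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical tokens, some of which are purely numeric
words <- tibble(word = c("python", "2", "35", "3d", "neural"))

# Drop tokens made up entirely of digits; mixed tokens like "3d" are kept
kept <- words %>%
  filter(!str_detect(word, "^[0-9]+$"))
```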

Quick analysis of which individual words are associated with more/fewer claps ("What are the hype words?")

Using geometric mean as alternative to median to get more distinction between words (note 27:33 where he makes a quick fix)
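A sketch of one common way to compute a geometric mean when counts can be zero. The "+ 1" offset and the helper name `geometric_mean` are assumptions for illustration, not necessarily what the screencast uses.

```r
# Hypothetical helper: geometric mean with a +1 offset so zero claps are allowed
geometric_mean <- function(x) {
  exp(mean(log(x + 1))) - 1
}

# Made-up clap counts for articles containing some word
claps <- c(0, 9, 99)

geometric_mean(claps)  # averages on the log scale, then transforms back
mean(claps)            # the arithmetic mean is pulled up by the largest value
```

Because it averages on the log scale, the geometric mean is less dominated by a few viral articles than the arithmetic mean, while still varying more smoothly across words than the median.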

Starting analysis of clusters of related words (e.g., "neural" is linked to "network")

pairwise_cor
widyr

Finding correlated pairs of words using pairwise_cor function from widyr package
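A minimal sketch, assuming one row per (article, word) pair. The data and column names are hypothetical.

```r
library(dplyr)
library(widyr)

# Hypothetical tokenized titles: which words appear in which article
title_words <- tibble(
  post_id = c(1, 1, 2, 2, 3, 3, 4, 4),
  word = c("neural", "network", "neural", "network",
           "python", "data", "python", "network")
)

# Correlate words by whether they appear in the same articles
word_cors <- title_words %>%
  pairwise_cor(word, post_id, sort = TRUE)
```

The result has one row per ordered pair of words, in columns `item1`, `item2`, and `correlation`.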

ggraph
igraph

Using ggraph and igraph packages to make network plot of correlated pairs of words

geom_node_text

Using geom_node_text to add labels for points (vertices) in the network plot

Filtering original data to only include words that appear in the network plot (the 150 most correlated word pairs)

Adding colour as a dimension to the network plot, representing geometric mean of claps

scale_colour_gradient2

Changing default colour scale to one where blue = low and red = high with scale_colour_gradient2 function

Adding dark outlines to points on network plot with a hack

Starting to predict number of claps based on title tag (Lasso regression)

cast_sparse

Explanation of data format needed to conduct Lasso regression (and using cast_sparse function to get sparse matrix)

Bringing in number of claps to the sparse matrix (un-tidy methods)
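A rough sketch of these two steps: building the sparse matrix, then lining up the response vector with its rows. The data and object names are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data
posts <- tibble(post_id = 1:3, claps = c(120, 5, 40))
title_words <- tibble(
  post_id = c(1, 1, 2, 3, 3),
  word = c("neural", "network", "python", "neural", "python")
)

# One row per article, one column per word, 1 where the word appears
word_matrix <- title_words %>%
  cast_sparse(post_id, word)

# The row names carry the article ids, so use them to align the claps
row_ids <- as.integer(rownames(word_matrix))
claps <- posts$claps[match(row_ids, posts$post_id)]
```

The matrix is no longer a tidy data frame, which is why the response has to be matched to rows by id rather than joined.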

cv.glmnet
glmnet

Using cv.glmnet function (cv = cross validated) from glmnet package to run Lasso regression
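A small illustration of the call on simulated data rather than the Medium dataset. The matrix, coefficients, and seed are all made up.

```r
library(glmnet)

set.seed(2018)

# Simulated stand-in for a word-indicator matrix: 200 articles, 10 "words"
n <- 200
p <- 10
x <- matrix(rbinom(n * p, 1, 0.3), nrow = n,
            dimnames = list(NULL, paste0("word", 1:p)))

# Simulated response: only word1 and word2 actually matter
y <- 1.5 * x[, "word1"] - 1 * x[, "word2"] + rnorm(n, sd = 0.5)

# alpha = 1 (the default) gives the Lasso penalty; lambda is chosen by cross-validation
cv_fit <- cv.glmnet(x, y)

# Coefficients at the largest lambda within one standard error of the best
coef(cv_fit, s = "lambda.1se")
```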

Finding and fixing mistake in defining Lasso model

Explanation of Lasso model

tidy
broom

Using tidy function from the broom package to tidy up the Lasso model

Visualizing how specific words affect the prediction of claps as lambda (Lasso's penalty parameter) changes
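A sketch of getting the coefficient path into a tidy data frame, again on simulated data. The objects here are hypothetical and the plot itself is left out.

```r
library(glmnet)
library(broom)
library(dplyr)

set.seed(2018)

# Simulated stand-in for a word-indicator matrix and a response
x <- matrix(rbinom(200 * 5, 1, 0.3), nrow = 200,
            dimnames = list(NULL, paste0("word", 1:5)))
y <- 1.5 * x[, "word1"] + rnorm(200, sd = 0.5)

fit <- glmnet(x, y)

# One row per (term, lambda) pair with a non-zero estimate
coef_path <- tidy(fit) %>%
  filter(term != "(Intercept)")
```

Plotting `estimate` against `lambda` for each `term` shows how coefficients shrink toward zero as the penalty grows.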

Summary of screencast