Simpsons Guest Stars

Text mining (tidytext package)

Published

August 29, 2019

Notable topics: Text mining (tidytext package)

Recorded on: 2019-08-29

Timestamps by: Alex Cookson

Screencast

Timestamps

str_detect

Using str_detect function to find guests that played themselves
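A minimal sketch of that step, assuming a hypothetical `guests` table with a `role` column (the guest names and the column names are invented for illustration, not taken from the screencast):

```r
library(dplyr)
library(stringr)

# Hypothetical toy data: guest names and column names are assumptions
guests <- tibble(
  guest_star = c("Guest A", "Guest B", "Guest C"),
  role = c("Himself", "Edna Krabappel; Ms. Melon", "Herself")
)

# Flag rows whose role contains "self" (matches Himself, Herself, Themselves)
guests <- guests %>%
  mutate(self = str_detect(role, "self"))
```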

separate_rows

Using separate_rows function and regex to get delimited values onto different rows (e.g., "Edna Krabappel; Ms. Melon" gets split into two rows)
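As a rough illustration, assuming the roles sit in a hypothetical `role` column separated by a semicolon and optional whitespace:

```r
library(tidyr)
library(tibble)

# Hypothetical one-row table; the guest name is a placeholder
roles <- tibble(guest_star = "Guest B", role = "Edna Krabappel; Ms. Melon")

# sep is treated as a regular expression: a semicolon plus any trailing whitespace
roles_long <- separate_rows(roles, role, sep = ";\\s*")
```

Each delimited value ends up on its own row, with the other columns repeated.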

parse_number

Using parse_number function to convert a numeric variable coded as character to a proper numeric variable
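A small sketch of the conversion, using a made-up character vector (which variable is converted in the screencast is not stated here):

```r
library(readr)

# Hypothetical numeric values stored as text
season_chr <- c("1", "12", "30")

# parse_number extracts the number from each string and returns a double vector
season_num <- parse_number(season_chr)
```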

Downloading and importing supplementary dataset of dialogue

semi_join

Using semi_join function to filter dataframe based on values that appear in another dataframe
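One way this filtering could look, assuming hypothetical `dialogue` and `guest_roles` tables whose key columns have different names (all values below are placeholders):

```r
library(dplyr)

# Hypothetical toy tables; column names and lines are assumptions
dialogue <- tibble(
  character = c("Homer Simpson", "Ms. Melon", "Edna Krabappel-Flanders"),
  line = c("placeholder line 1", "placeholder line 2", "placeholder line 3")
)
guest_roles <- tibble(role = c("Ms. Melon", "Edna Krabappel"))

# Keep only dialogue rows whose character appears among the guest roles;
# semi_join filters rows and adds no columns from guest_roles
guest_dialogue <- dialogue %>%
  semi_join(guest_roles, by = c("character" = "role"))
```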

anti_join

Using anti_join function to check which values in a dataframe do not appear in another dataframe
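A sketch of that check, again with hypothetical tables and column names:

```r
library(dplyr)

# Hypothetical toy tables; column names are assumptions
guest_roles <- tibble(role = c("Ms. Melon", "Edna Krabappel"))
dialogue <- tibble(character = c("Homer Simpson", "Ms. Melon", "Edna Krabappel-Flanders"))

# Roles with no exact match in the dialogue data, e.g. because of a naming difference
unmatched_roles <- guest_roles %>%
  anti_join(dialogue, by = c("role" = "character"))
```

Rows returned here are the ones worth inspecting for spelling or naming mismatches.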

ifelse

Using ifelse function to recode a single value with another (i.e., "Edna Krabappel" becomes "Edna Krabappel-Flanders")
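A minimal sketch of the recode, assuming a hypothetical `guest_roles` table with a `role` column:

```r
library(dplyr)

# Hypothetical toy table
guest_roles <- tibble(role = c("Edna Krabappel", "Ms. Melon"))

# Replace one specific value and leave every other value unchanged
guest_roles <- guest_roles %>%
  mutate(role = ifelse(role == "Edna Krabappel", "Edna Krabappel-Flanders", role))
```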

Explaining the goal of all the data cleaning steps

sample

Using sample function to get an example line for each character
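A possible shape for this step, assuming a hypothetical `dialogue` table with one row per spoken line (character names and lines are placeholders):

```r
library(dplyr)

set.seed(2019)  # only so the sketch is reproducible

# Hypothetical toy data
dialogue <- tibble(
  character = c("Character A", "Character A", "Character A", "Character B", "Character B"),
  line = c("first line", "second line", "third line", "fourth line", "fifth line")
)

# Pick one random line per character
example_lines <- dialogue %>%
  group_by(character) %>%
  summarize(example_line = sample(line, 1))
```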

geom_histogram

Setting geom_histogram function's binwidth and center arguments to get specific bin sizes
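A sketch of how these two arguments interact, using made-up counts (the variable name `n_lines` is an assumption). With `binwidth = 1` and `center = 0`, bins are one unit wide and centered on whole numbers, so each integer value gets its own bar:

```r
library(ggplot2)

# Hypothetical counts
df <- data.frame(n_lines = c(1, 1, 2, 3, 3, 3, 5, 8))

# binwidth sets the bin size; center pins a bin's midpoint to a chosen value
p <- ggplot(df, aes(n_lines)) +
  geom_histogram(binwidth = 1, center = 0)
```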

unnest_tokens, anti_join

tidytext

Using unnest_tokens function from tidytext package and anti_join function from dplyr to split dialogue into individual words and remove stop words (e.g., "the", "or", "and")
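A minimal sketch of tokenizing and stop-word removal, assuming a hypothetical `dialogue` table with a `text` column (the sentence is invented). `stop_words` is the stop-word table that ships with tidytext:

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data
dialogue <- tibble(
  character = "Character A",
  text = "Ach, the tractor and the shed"
)

# One lowercase word per row, then drop any word found in the stop-word list
words <- dialogue %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
```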

bind_tf_idf

tidytext

Using bind_tf_idf function from tidytext package to get the TF-IDF (term frequency-inverse document frequency) of individual words
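A small sketch, assuming word counts have already been computed per role (the table, its column names, and the counts are made up). The arguments are the term column, the document column, and the count column:

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts per role
word_counts <- tibble(
  role = c("Role A", "Role A", "Role B", "Role B"),
  word = c("ach", "school", "school", "lunch"),
  n = c(3, 1, 2, 2)
)

# Adds tf, idf and tf_idf columns, treating each role as one document
tf_idf <- word_counts %>%
  bind_tf_idf(word, role, n)
```

In this toy data "school" appears for both roles, so its TF-IDF is zero, while "ach" appears for only one role and scores higher.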

top_n

Using top_n function to get the top 1 TF-IDF value for each role
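A sketch with made-up scores, assuming a hypothetical table that already has a `tf_idf` column:

```r
library(dplyr)

# Hypothetical TF-IDF scores
scores <- tibble(
  role = c("Role A", "Role A", "Role B", "Role B"),
  word = c("ach", "school", "school", "lunch"),
  tf_idf = c(0.52, 0, 0, 0.35)
)

# Keep the highest-scoring word within each role (ties would keep several rows)
top_words <- scores %>%
  group_by(role) %>%
  top_n(1, tf_idf) %>%
  ungroup()
```

In current dplyr releases `top_n` is superseded by `slice_max`, so `slice_max(tf_idf, n = 1)` would be the newer spelling of the same idea.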

paste0

Using paste0 function to combine two character variables (e.g., "Groundskeeper Willie" and "ach" (separate variables) become "Groundskeeper Willie: ach")
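A minimal sketch of building such a label, assuming hypothetical `role` and `word` columns:

```r
library(dplyr)

# Hypothetical one-row table
top_words <- tibble(role = "Groundskeeper Willie", word = "ach")

# paste0 concatenates without inserting a separator, so the ": " is added explicitly
top_words <- top_words %>%
  mutate(label = paste0(role, ": ", word))
```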

Explanation of what TF-IDF (term frequency-inverse document frequency) tells us and how it is a "catchphrase detector"
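For reference, a common definition of the measure, and as far as I know the one tidytext's bind_tf_idf uses (with a natural logarithm). The symbols are introduced here for illustration: $n_{t,d}$ is the count of term $t$ in document $d$ (here, a role) and $N$ is the number of documents.

```latex
\[
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \ln \frac{N}{\left|\{\, d : n_{t,d} > 0 \,\}\right|}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
\]
```

A word scores high when one role says it often and few other roles say it at all, which is why it behaves like a "catchphrase detector".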

Summary of screencast

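Timestamps, in the order of the entries above

0:04:15 str_detect

0:07:55 separate_rows

0:09:55 parse_number

0:14:45 Downloading and importing supplementary dataset of dialogue

0:16:10 semi_join

0:18:05 anti_join

0:20:50 ifelse

0:26:20 Explaining the goal of all the data cleaning steps

0:31:25 sample

0:33:20 geom_histogram

0:37:25 unnest_tokens, anti_join

0:38:55 bind_tf_idf

0:42:50 top_n

0:44:05 paste0

0:48:10 Explanation of TF-IDF

0:56:40 Summary of screencast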