Beyonce and Taylor Swift Lyrics

Text analysis, tf_idf, Log odds ratio, Diverging bar graph, Lollipop graph

Published: September 28, 2020

Recorded on: 2020-09-28

Timestamps by: Eric Fletcher

View code

Screencast

Timestamps

fct_reorder
forcats

Use fct_reorder from the forcats package to reorder the title factor levels by the sales variable, so the bars in the geom_col plot appear sorted by sales.
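
A minimal sketch of this reordering. The `album_sales` table and its figures are hypothetical, made up for illustration, and are not the data used in the screencast.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical sales figures, invented for illustration only
album_sales <- tibble(
  title = c("Album A", "Album B", "Album C"),
  sales = c(3e6, 9e6, 5e6)
)

# fct_reorder sorts the factor levels of title by sales (ascending)
sales_ordered <- album_sales %>%
  mutate(title = fct_reorder(title, sales))

# With title on the y-axis, the largest bar ends up at the top
p <- ggplot(sales_ordered, aes(sales, title)) +
  geom_col()
```

Without the reordering step, ggplot2 would place the titles in alphabetical order rather than by sales.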

labels
scales

Use labels = dollar from the scales package to format the geom_col x-axis values as currency.
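
As a rough sketch of the currency formatting, assuming a made-up `album_sales` table (hypothetical values, not the screencast's data):

```r
library(ggplot2)
library(scales)

# Hypothetical sales figures, invented for illustration only
album_sales <- data.frame(
  title = c("Album A", "Album B"),
  sales = c(1500000, 9000000)
)

# Passing the dollar function as labels formats each axis break as currency
p <- ggplot(album_sales, aes(sales, title)) +
  geom_col() +
  scale_x_continuous(labels = dollar)

# The same function can be called directly on a number
formatted <- dollar(1000)
```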

rename_all, str_to_lower
dplyr, stringr

Use rename_all(str_to_lower) to convert variable names to lowercase.
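
A small sketch of the renaming step. The `raw_lyrics` table and its column names are assumptions chosen for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical table with capitalized column names
raw_lyrics <- tibble(Artist = "artist", Album = "album", Lyrics = "some lyric")

# rename_all applies str_to_lower to every column name
clean_lyrics <- raw_lyrics %>%
  rename_all(str_to_lower)

names(clean_lyrics)
```

In current dplyr, `rename_with(str_to_lower)` is the non-superseded equivalent.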

unnest_tokens
tidytext

Use unnest_tokens from the tidytext package to split the lyrics into one word per row.

anti_join
dplyr

Use anti_join from the dplyr package with stop_words from the tidytext package to find the most common words in the lyrics, excluding stop words.
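
The tokenizing and stop-word removal can be sketched as follows. The `lyrics` table, its column names, and the lyric text are invented for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical lyrics, invented for illustration only
lyrics <- tibble(
  title = c("Song 1", "Song 2"),
  lyric = c("The rain is falling on the river", "Dancing in the rain")
)

# One row per word; unnest_tokens lowercases and strips punctuation by default
words <- lyrics %>%
  unnest_tokens(word, lyric)

# Drop rows whose word appears in tidytext's stop_words, then count
common_words <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
```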

bind_tf_idf
tidytext

Use bind_tf_idf from the tidytext package to compute tf, the share of an album's words that each word accounts for, and idf, how specific each word is to a particular album.
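
A minimal sketch of the calculation on a toy word-count table. The `word_counts` data is hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical per-album word counts, invented for illustration only
word_counts <- tibble(
  album = c("A", "A", "B", "B"),
  word  = c("love", "rain", "love", "fire"),
  n     = c(3, 1, 2, 2)
)

# Arguments are: term column, document column, count column
album_tf_idf <- word_counts %>%
  bind_tf_idf(word, album, n)
```

Here "love" appears in both albums, so its idf is ln(2/2) = 0 and its tf_idf is 0, while "rain" appears in only one of the two albums and gets an idf of ln(2).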

reorder_within, scale_y_reordered, slice_max
tidytext, dplyr

Use reorder_within with scale_y_reordered to reorder the bars within each facet panel. David replaces top_n with slice_max from the dplyr package to show the top 10 words, setting with_ties = FALSE so that ties do not add extra rows.
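
A sketch of per-facet ordering, assuming a toy `word_scores` table (hypothetical values) and a top 2 rather than a top 10 to keep it short:

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Hypothetical tf_idf scores, invented for illustration only
word_scores <- tibble(
  album  = rep(c("A", "B"), each = 3),
  word   = c("rain", "love", "car", "love", "fire", "car"),
  tf_idf = c(0.30, 0.10, 0.20, 0.05, 0.40, 0.25)
)

top_words <- word_scores %>%
  group_by(album) %>%
  slice_max(tf_idf, n = 2, with_ties = FALSE) %>%
  ungroup() %>%
  # Order words by tf_idf separately within each album
  mutate(word = reorder_within(word, tf_idf, album))

p <- ggplot(top_words, aes(tf_idf, word)) +
  geom_col() +
  facet_wrap(~ album, scales = "free_y") +
  # Strips the album suffix that reorder_within adds to the labels
  scale_y_reordered()
```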

bind_log_odds
tidylo

Use bind_log_odds from the tidylo package to calculate the weighted log odds ratio for each album and word, that is, how much more common a word is in a specific album than across all the other albums.
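
A minimal sketch on a toy table. The `word_counts` data is hypothetical.

```r
library(dplyr)
library(tidylo)

# Hypothetical per-album word counts, invented for illustration only
word_counts <- tibble(
  album = c("A", "A", "B", "B"),
  word  = c("love", "rain", "love", "fire"),
  n     = c(3, 1, 2, 2)
)

# Arguments are: set column (album), feature column (word), count column
album_log_odds <- word_counts %>%
  bind_log_odds(album, word, n) %>%
  arrange(desc(log_odds_weighted))
```

A word that occurs only in one album, such as "fire" in album B here, gets a positive log_odds_weighted for that album.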

filter, str_length
dplyr, stringr

Use filter(str_length(word) <= 3) to list the short words and, from that list, pick out common filler words to remove, such as ah, uh, ha, ey, eeh, and huh.
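
A sketch of the two steps, assuming a toy `words` table and a hand-picked `fillers` vector (both hypothetical):

```r
library(dplyr)
library(stringr)

# Hypothetical tokens, invented for illustration only
words <- tibble(word = c("ah", "love", "huh", "river", "uh", "sun"))

# Step 1: inspect short words to spot fillers by eye
short_words <- words %>%
  filter(str_length(word) <= 3)

# Step 2: remove only the ones judged to be fillers (assumed list)
fillers <- c("ah", "uh", "huh")
cleaned <- words %>%
  filter(!word %in% fillers)
```

Filtering on length alone would also drop real words such as "sun", which is why the short words serve only as a candidate list.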

distinct, mdy, str_remove
dplyr, lubridate, stringr

Use str_remove(released, " \\(.*\\)") from the stringr package to strip the parenthetical note, then mdy from the lubridate package to parse the dates in the released variable.
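
A small sketch of the date cleanup. The strings in `released` are assumed examples of the format, not rows copied from the dataset.

```r
library(stringr)
library(lubridate)

# Assumed example strings: a date, optionally followed by a parenthetical note
released <- c("October 24, 2006 (US)", "July 24, 2020")

# Remove a space plus everything from "(" through ")"
cleaned <- str_remove(released, " \\(.*\\)")

# Parse month-day-year text into Date values
parsed <- mdy(cleaned)
```

Both parentheses are escaped in the pattern; an unescaped closing parenthesis makes the regular expression invalid.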

inner_join, fct_recode
dplyr, forcats

Use inner_join from the dplyr package to join taylor_swift_words with release_dates.

David ends up having to use fct_recode because the albums reputation and folklore were not lowercase in a previous table, which excluded them from the inner_join.
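
The mismatch and its fix can be sketched as below. Which table held which capitalization is an assumption here, and both toy tables are hypothetical stand-ins for taylor_swift_words and release_dates.

```r
library(dplyr)
library(forcats)

# Hypothetical stand-ins; the capitalization split is assumed for illustration
words <- tibble(
  album = c("Red", "reputation", "folklore"),
  word  = c("a", "b", "c")
)
release_dates <- tibble(
  album    = c("Red", "Reputation", "Folklore"),
  released = as.Date(c("2012-10-22", "2017-11-10", "2020-07-24"))
)

# inner_join keeps only exact key matches, so two albums silently drop out
before <- inner_join(words, release_dates, by = "album")

# fct_recode maps old levels to new ones: new_name = "old_name"
release_dates_fixed <- release_dates %>%
  mutate(album = as.character(
    fct_recode(album, reputation = "Reputation", folklore = "Folklore")
  ))

after <- inner_join(words, release_dates_fixed, by = "album")
```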

fct_reorder, geom_col
forcats, ggplot2

Use fct_reorder from the forcats package to reorder album factor levels by sorting along the released variable to be used in the faceted geom_col.

bind_rows, unnest_tokens
dplyr, tidytext

Use bind_rows from the dplyr package to combine ts with beyonce, then unnest_tokens from the tidytext package to get one word per row per artist.
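
A sketch of stacking the two artists before tokenizing. The contents of `ts` and `beyonce` are invented, and tagging each table with an `artist` column via mutate is one assumed way of keeping track of the source.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row tables, invented for illustration only
ts <- tibble(title = "Song T", lyric = "Walking in the rain")
beyonce <- tibble(title = "Song B", lyric = "Dancing all night long")

all_words <- bind_rows(
  ts %>% mutate(artist = "Taylor Swift"),
  beyonce %>% mutate(artist = "Beyonce")
) %>%
  # One row per word, with the artist column carried along
  unnest_tokens(word, lyric)
```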

bind_log_odds
tidylo

Use bind_log_odds to figure out which words are more likely to come from a Taylor Swift song or a Beyonce song.

slice_max, geom_col, ifelse, fct_reorder
dplyr, ggplot2, forcats

Use slice_max from the dplyr package to select the top 100 words by num_words_total and then the top 25 by log_odds_weighted. The results feed a diverging bar chart showing which words are most characteristic of Beyonce songs versus Taylor Swift songs.
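
A scaled-down sketch of the two-stage selection and the diverging bars. The `word_stats` values are hypothetical, the cutoffs are 4 and 3 instead of 100 and 25, ranking by absolute value is an assumption, and so is treating positive values as the Taylor Swift direction.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical word statistics, invented for illustration only
word_stats <- tibble(
  word              = c("baby", "love", "girl", "shake", "red"),
  num_words_total   = c(500, 900, 300, 120, 80),
  log_odds_weighted = c(-6.0, 1.5, -3.0, 4.0, 5.0)
)

top_words <- word_stats %>%
  # Stage 1: keep the most frequent words overall
  slice_max(num_words_total, n = 4, with_ties = FALSE) %>%
  # Stage 2: keep the most lopsided of those (absolute value is assumed)
  slice_max(abs(log_odds_weighted), n = 3, with_ties = FALSE) %>%
  mutate(
    direction = ifelse(log_odds_weighted > 0, "Taylor Swift", "Beyonce"),
    word = fct_reorder(word, log_odds_weighted)
  )

# Bars extend left or right of zero depending on the sign
p <- ggplot(top_words, aes(log_odds_weighted, word, fill = direction)) +
  geom_col()
```

Filtering on frequency first keeps rare words with extreme log odds from dominating the chart.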

scale_x_continuous
ggplot2

Use scale_x_continuous to make the log_odds_weighted scale more interpretable.

geom_col, geom_point, geom_vline
ggplot2

Take the previous plot and turn it into a lollipop graph with geom_point(aes(size = num_words_total, color = direction)).
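
One way to sketch the lollipop version, assuming thin geom_col bars act as the sticks (the bar width, the dashed reference line, and the `top_words` values are all assumptions for illustration):

```r
library(ggplot2)

# Hypothetical word statistics, invented for illustration only
top_words <- data.frame(
  word              = c("baby", "girl", "shake"),
  num_words_total   = c(500, 300, 120),
  log_odds_weighted = c(-6, -3, 4),
  direction         = c("Beyonce", "Beyonce", "Taylor Swift")
)

p <- ggplot(top_words, aes(log_odds_weighted, reorder(word, log_odds_weighted))) +
  # Thin bars form the lollipop sticks
  geom_col(width = 0.1) +
  # Reference line at zero separates the two directions
  geom_vline(xintercept = 0, linetype = "dashed") +
  # Points form the lollipop heads, sized by overall frequency
  geom_point(aes(size = num_words_total, color = direction))
```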

ifelse
base

Use ifelse to change the 1x label on the x-axis to "same".
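
A hedged sketch of such axis labels. It assumes the axis breaks sit at natural logs of simple ratios and that both directions are labeled by magnitude; the `ratios` vector and the label format are illustrative choices, not taken from the screencast.

```r
library(ggplot2)

# Assumed ratios for the axis breaks
ratios <- 2^seq(-2, 2)   # 0.25, 0.5, 1, 2, 4

# Label each break by its magnitude, replacing the 1x label with "same"
ratio_labels <- ifelse(
  ratios == 1,
  "same",
  paste0(pmax(ratios, 1 / ratios), "x")
)

# Breaks are placed on the log scale, labels show the ratio
x_scale <- scale_x_continuous(breaks = log(ratios), labels = ratio_labels)
```

The resulting labels read "4x", "2x", "same", "2x", "4x" from left to right.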

pivot_wider, clean_names, geom_abline, geom_point, slice_max, scale_y_log10, scale_x_log10, geom_text
tidyr, ggplot2, dplyr

Create a geom_point scatter plot with a geom_abline reference line to show the most popular words the two artists have in common.
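
A sketch of the wide reshape and scatter plot. The `artist_counts` values are hypothetical, and clean_names comes from the janitor package, which this sketch assumes is installed.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(janitor)

# Hypothetical per-artist word counts, invented for illustration only
artist_counts <- tibble(
  artist = c("Taylor Swift", "Taylor Swift", "Beyonce", "Beyonce"),
  word   = c("love", "baby", "love", "baby"),
  n      = c(300, 120, 250, 400)
)

# One row per word, one count column per artist
wide <- artist_counts %>%
  pivot_wider(names_from = artist, values_from = n) %>%
  # Turns "Taylor Swift" into taylor_swift, "Beyonce" into beyonce
  clean_names()

p <- ggplot(wide, aes(taylor_swift, beyonce)) +
  # Default abline is y = x: words on it are used equally by both artists
  geom_abline(color = "red") +
  geom_point() +
  geom_text(aes(label = word), vjust = 1, hjust = 1, check_overlap = TRUE) +
  scale_x_log10() +
  scale_y_log10()
```

Words above the line are used more by Beyonce in this toy data, and words below it more by Taylor Swift.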

Summary of screencast.