Friends

Data Manipulation, Linear Modeling, Pairwise Correlation, Text Mining

Published: September 7, 2020


Recorded on: 2020-09-07

Timestamps by: Eric Fletcher


Screencast

Timestamps

count

dplyr

Use the dplyr package's count function to count the unique combinations of values across multiple variables.
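As a minimal sketch of this step, the snippet below counts combinations of several variables at once. The toy_lines tibble is a hypothetical stand-in for the friends data, not the real dataset.

```r
library(dplyr)

# Hypothetical stand-in for the friends dialogue data (one row per line spoken)
toy_lines <- tibble(
  season  = c(1, 1, 1, 1, 2),
  episode = c(1, 1, 1, 2, 1),
  speaker = c("Monica Geller", "Monica Geller", "Joey Tribbiani",
              "Monica Geller", "Joey Tribbiani")
)

# One row per unique season / episode / speaker combination, with its count in n
speaker_counts <- toy_lines %>%
  count(season, episode, speaker, sort = TRUE)

print(speaker_counts)
```

With sort = TRUE the most frequent combination comes first.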

geom_col, fct_reorder

ggplot2, forcats

Use geom_col to show how many lines of dialogue there are for each character. Use fct_reorder to reorder the speaker factor levels by sorting along n.
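A small sketch of the reorder-then-plot pattern, assuming a hypothetical toy_counts table with made-up line counts:

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical per-speaker line counts (made-up numbers)
toy_counts <- tibble(
  speaker = c("Ross Geller", "Rachel Green", "Monica Geller"),
  n       = c(5, 9, 7)
)

# Reorder the speaker levels by n so the bars appear sorted
ordered_counts <- toy_counts %>%
  mutate(speaker = fct_reorder(speaker, n))

p <- ggplot(ordered_counts, aes(n, speaker)) +
  geom_col()

print(levels(ordered_counts$speaker))
```

Without fct_reorder, the bars would be drawn in alphabetical order of speaker.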

semi_join

dplyr

Use semi_join to join the friends dataset with main_cast using by = "speaker", which returns all rows from friends that have a match in main_cast.
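To illustrate the filtering join, here is a sketch with two hypothetical toy tables standing in for friends and main_cast:

```r
library(dplyr)

# Hypothetical stand-ins for the friends and main_cast tables
toy_friends <- tibble(
  speaker = c("Ross Geller", "Gunther", "Rachel Green", "Ross Geller"),
  text    = c("line 1", "line 2", "line 3", "line 4")
)
toy_main_cast <- tibble(
  speaker = c("Ross Geller", "Rachel Green"),
  n       = c(2L, 1L)
)

# Keep only the rows of toy_friends whose speaker appears in toy_main_cast;
# no columns from toy_main_cast are added to the result
main_lines <- toy_friends %>%
  semi_join(toy_main_cast, by = "speaker")

print(main_lines)
```

Unlike inner_join, semi_join never duplicates rows or adds columns; it only filters the left-hand table.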

unite, inner_join, glue, fct_reorder

tidyr, glue, forcats

Use unite to create the episode_number variable which pastes together season and episode with sep = ".".

Then, use inner_join to combine the above dataset with friends_info using by = c("season", "episode").

Then, use mutate with the glue package, instead of unite, to combine the columns as { season }.{ episode } { title }.

Then use fct_reorder(episode_title, season + .001 * episode) to order it by season first then episode.
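The steps above can be sketched as follows. The two toy tables and their titles are hypothetical; only the column names season, episode, title, episode_number and episode_title come from the notes.

```r
library(dplyr)
library(tidyr)
library(glue)
library(forcats)

# Hypothetical stand-ins for the friends lines and the friends_info table
toy_lines <- tibble(
  season  = c(1L, 1L, 2L),
  episode = c(2L, 1L, 1L),
  speaker = c("Ross Geller", "Rachel Green", "Ross Geller")
)
toy_info <- tibble(
  season  = c(1L, 1L, 2L),
  episode = c(1L, 2L, 1L),
  title   = c("Title A", "Title B", "Title C")
)

episodes <- toy_lines %>%
  # Paste season and episode together; remove = FALSE keeps the originals for the join
  unite(episode_number, season, episode, sep = ".", remove = FALSE) %>%
  inner_join(toy_info, by = c("season", "episode")) %>%
  mutate(
    # as.character() is a defensive choice here so a plain character vector is reordered
    episode_title = as.character(glue("{ season }.{ episode } { title }")),
    # season + .001 * episode sorts by season first, then by episode within a season
    episode_title = fct_reorder(episode_title, season + .001 * episode)
  )

print(levels(episodes$episode_title))
```

The .001 multiplier works as long as no season has 1000 or more episodes, since the episode part then never overtakes the season part.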

geom_point, as.integer, geom_text, geom_line

ggplot2, base

Use geom_point to visualize episode_title and us_views_millions.

Use as.integer to change episode_title to integer class.

Add labels to geom_point using geom_text with check_overlap = TRUE so text that overlaps previous text in the same layer will not be plotted.
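A rough sketch of this plot, using a hypothetical toy_episodes data frame with made-up viewership numbers:

```r
library(ggplot2)

# Hypothetical episode data; the viewership values are made up
toy_episodes <- data.frame(
  episode_title = factor(
    c("1.1 Title A", "1.2 Title B", "2.1 Title C"),
    levels = c("1.1 Title A", "1.2 Title B", "2.1 Title C")
  ),
  us_views_millions = c(21.5, 20.2, 32.1)
)

# as.integer() turns the ordered factor into positions 1, 2, 3 on a continuous axis
p <- ggplot(toy_episodes, aes(as.integer(episode_title), us_views_millions)) +
  geom_line() +
  geom_point() +
  # check_overlap = TRUE drops labels that would overlap earlier labels in this layer
  geom_text(aes(label = episode_title), check_overlap = TRUE, vjust = -1)

print(p)
```

Converting the factor to an integer gives a continuous x axis, so geom_line can connect consecutive episodes without setting a group aesthetic.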

geom_point, as.integer, geom_text, geom_line

ggplot2, base

Run the above plot again using imdb_rating instead of us_views_millions.

semi_join, geom_boxplot, coord_flip, fct_reorder, complete, fill, scale_x_log10

dplyr, ggplot2, forcats, tidyr

Ahead of modeling:

Use geom_boxplot to visualize the distribution of how much each main character speaks.

Use the complete function with fill = list(n = 0) to turn implicit missing combinations into explicit rows, with n set to 0 instead of NA.

Demonstrates how to account for missing imdb_rating values using the fill function with .direction = "downup", which carries the IMDb rating across all rows with the same title.
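The complete and fill steps can be sketched like this. The toy_counts table and its values are hypothetical, and grouping by episode_title is an assumption about how "across the same title" was implemented.

```r
library(dplyr)
library(tidyr)

# Hypothetical per-episode line counts; Phoebe has no row for the second episode
toy_counts <- tibble(
  episode_title = c("1.1 Title A", "1.1 Title A", "1.2 Title B"),
  speaker       = c("Chandler Bing", "Phoebe Buffay", "Chandler Bing"),
  imdb_rating   = c(8.3, 8.3, 8.1),
  n             = c(10, 7, 12)
)

completed <- toy_counts %>%
  # Add a row for every episode / speaker combination; new rows get n = 0
  complete(episode_title, speaker, fill = list(n = 0)) %>%
  # The new rows have NA ratings, so copy the rating from the same episode
  group_by(episode_title) %>%
  fill(imdb_rating, .direction = "downup") %>%
  ungroup()

print(completed)
```

The "downup" direction fills downwards first and then upwards, so the rating is recovered wherever the missing row falls within its group.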

semi_join, summarize, add_count, geom_boxplot, geom_smooth, geom_point

dplyr, ggplot2

Ahead of modeling:

Use summarize with cor(log2(n), imdb_rating) to find the correlation between each speaker's number of lines and the IMDb rating. The correlation is positive for every speaker, which makes David suspect a confounder: some episodes are longer than others because they are two-parters, and those tend to have higher ratings due to important moments. David addresses this confounding factor by using the percentage of lines instead of the number of lines.

Visualize the results with geom_boxplot, and with geom_point plus geom_smooth.
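A sketch of the correlation step and the switch to percentages, on a hypothetical toy_counts table with made-up values. The column names episode_lines and pct_lines are assumptions for illustration, and using add_count with wt is one possible way to get the per-episode total.

```r
library(dplyr)

# Hypothetical counts: later episodes are longer and also rated higher
toy_counts <- tibble(
  episode_title = rep(c("e1", "e2", "e3"), each = 2),
  speaker       = rep(c("Rachel Green", "Ross Geller"), times = 3),
  n             = c(2, 2, 4, 8, 16, 16),
  imdb_rating   = rep(c(8.0, 8.4, 9.0), each = 2)
)

with_pct <- toy_counts %>%
  # Total lines per episode, then each speaker's share of that total
  add_count(episode_title, wt = n, name = "episode_lines") %>%
  mutate(pct_lines = n / episode_lines)

correlations <- with_pct %>%
  group_by(speaker) %>%
  summarize(
    cor_lines = cor(log2(n), imdb_rating),    # raw counts, affected by episode length
    cor_pct   = cor(pct_lines, imdb_rating)   # share of lines, length factored out
  )

print(correlations)
```

In this toy data every speaker's raw count correlates positively with the rating simply because the longer episodes are rated higher, which is the pattern the notes describe.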

spread, across, semi_join, lm, aov

tidyr, dplyr, stats

Use a linear model to predict imdb rating based on various variables.
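The notes do not list the predictors, so the following is only a sketch under assumptions: a hypothetical toy_pct table with made-up values, spread into one column per speaker, and each speaker's share of lines used as a predictor.

```r
library(dplyr)
library(tidyr)

# Hypothetical share of lines per speaker and episode (made-up values)
toy_pct <- tibble(
  episode_title = rep(paste0("e", 1:6), each = 2),
  imdb_rating   = rep(c(8.1, 8.4, 8.0, 8.9, 8.3, 8.2), each = 2),
  speaker       = rep(c("Rachel", "Ross"), times = 6),
  pct_lines     = c(0.20, 0.18, 0.25, 0.15, 0.15, 0.25,
                    0.30, 0.20, 0.22, 0.21, 0.18, 0.24)
)

# One row per episode, one column per speaker
wide <- toy_pct %>%
  spread(speaker, pct_lines)

# Predict the rating from each speaker's share of lines
fit <- lm(imdb_rating ~ Rachel + Ross, data = wide)

print(summary(fit))
```

With real data, the speaker columns would need backticks in the formula if their names contain spaces.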

unnest_tokens, anti_join, bind_log_odds, semi_join, geom_col, scale_y_reordered

tidytext, tidylo, ggplot2

Use the tidytext and tidylo packages to see which words are most common for each character, and whether they are said more often than would be expected by chance.

Use geom_col to visualize the most overrepresented words per character according to log_odds_weighted.
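A sketch of the text-mining pipeline on a hypothetical toy_lines table with invented dialogue. Removing stop words with anti_join is an assumption based on the function list.

```r
library(dplyr)
library(tidytext)
library(tidylo)

# Hypothetical dialogue lines (invented text)
toy_lines <- tibble(
  speaker = c("Joey", "Joey", "Phoebe", "Phoebe"),
  text    = c("pizza pizza sandwich", "you want pizza",
              "smelly cat guitar", "guitar song pizza")
)

word_odds <- toy_lines %>%
  unnest_tokens(word, text) %>%               # one row per word
  anti_join(stop_words, by = "word") %>%      # drop common stop words such as "you"
  count(speaker, word) %>%
  bind_log_odds(speaker, word, n) %>%         # adds the log_odds_weighted column
  arrange(desc(log_odds_weighted))

print(word_odds)
```

Words with a high log_odds_weighted are the ones a speaker uses more than their overall frequency across speakers would suggest.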

unite, semi_join, pairwise_cor

widyr, tidyr

Use the widyr package and pairwise correlation to determine which characters tend to appear in the same scenes together.

Use geom_col to visualize the correlation between characters.
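A sketch of the pairwise correlation step, assuming a hypothetical toy_scenes table. The scene_id name and the use of unite to build it are assumptions for illustration.

```r
library(dplyr)
library(tidyr)
library(widyr)

# Hypothetical scene appearances: Ross and Rachel always appear together
toy_scenes <- tibble(
  season  = 1L,
  episode = 1L,
  scene   = c(1, 1, 2, 2, 3, 3, 4, 4, 4),
  speaker = c("Monica", "Chandler", "Monica", "Chandler",
              "Ross", "Rachel", "Ross", "Rachel", "Monica")
)

scene_cors <- toy_scenes %>%
  unite(scene_id, season, episode, scene) %>%   # one identifier per scene
  count(speaker, scene_id) %>%
  pairwise_cor(speaker, scene_id, n, sort = TRUE)

print(scene_cors)
```

Each pair appears twice in the result, once in each order, with the most strongly correlated pairs first.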

Summary of screencast.

Timestamps (one per entry above, in the same order)

0:07:30

0:09:35

0:12:07

0:12:30

0:15:45

0:19:95

0:21:35

0:26:45

0:34:05

0:42:00

0:54:15

1:00:25