Friends
Notable topics: Data Manipulation, Linear Modeling, Pairwise Correlation, Text Mining
Recorded on: 2020-09-07
Timestamps by: Eric Fletcher
Screencast
Timestamps
Use dplyr package's count function to count the unique values of multiple variables.
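As a minimal sketch of this step, the example below counts unique combinations of several variables at once. The `toy_lines` data frame is a made-up stand-in for the friends dialogue data, so the values are illustrative only.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the friends dialogue data (values are made up)
toy_lines <- tibble(
  season  = c(1, 1, 1, 1, 2),
  episode = c(1, 1, 1, 2, 1),
  speaker = c("Monica Geller", "Monica Geller", "Joey Tribbiani",
              "Joey Tribbiani", "Monica Geller")
)

# One row per unique season/episode/speaker combination;
# n is the number of rows (lines of dialogue) in each combination
line_counts <- toy_lines %>%
  count(season, episode, speaker, sort = TRUE)

print(line_counts)
```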
Use geom_col to show how many lines of dialogue there are for each character. Use fct_reorder to reorder the speaker factor levels by sorting along n.
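A small sketch of the bar chart step, assuming a pre-counted table with `speaker` and `n` columns. The `speaker_counts` values are invented for illustration.

```r
library(dplyr)
library(tibble)
library(forcats)
library(ggplot2)

# Hypothetical counts of lines per speaker (values are made up)
speaker_counts <- tibble(
  speaker = c("Ross", "Rachel", "Chandler"),
  n       = c(30, 50, 40)
)

# Reorder the speaker levels by n so the bars are sorted by size
speaker_counts <- speaker_counts %>%
  mutate(speaker = fct_reorder(speaker, n))

# Horizontal bars: n on the x axis, speaker on the y axis
p <- ggplot(speaker_counts, aes(n, speaker)) +
  geom_col()
```

Printing `p` draws the chart. Without fct_reorder, the bars would appear in alphabetical order of speaker.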
Use semi_join to join friends dataset with main_cast with by = "speaker", returning all rows from friends with a match in main_cast.
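The filtering join can be sketched as follows. Both data frames here are small invented stand-ins for friends and main_cast.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for the friends and main_cast data
toy_lines <- tibble(
  speaker = c("Ross Geller", "Gunther", "Rachel Green", "Ross Geller"),
  text    = c("Hi.", "Coffee?", "Hey!", "Hello again.")
)
main_cast <- tibble(
  speaker = c("Ross Geller", "Rachel Green"),
  n       = c(2, 1)
)

# semi_join keeps rows of toy_lines whose speaker appears in main_cast,
# without adding any columns from main_cast
main_lines <- toy_lines %>%
  semi_join(main_cast, by = "speaker")

print(main_lines)
```

Unlike inner_join, semi_join never duplicates rows or brings over columns such as `n`, which makes it a good fit for filtering one table by membership in another.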
Use unite to create the episode_number variable which pastes together season and episode with sep = ".".
Then, use inner_join to combine above dataset with friends_info with by = c("season", "episode").
Then, use mutate with the glue package instead to build episode_title from { season }.{ episode } { title }.
Then use fct_reorder(episode_title, season + .001 * episode) to order it by season first then episode.
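The four steps above can be sketched together as one pipeline. The `toy_counts` and `toy_info` tables and their titles and view counts are invented stand-ins, and `remove = FALSE` is an assumption added here so that season and episode stay available for the join.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(glue)
library(forcats)

# Hypothetical per-episode line counts and episode metadata (values are made up)
toy_counts <- tibble(
  season  = c(1, 1, 2),
  episode = c(2, 1, 1),
  n       = c(10, 12, 9)
)
toy_info <- tibble(
  season            = c(1, 1, 2),
  episode           = c(1, 2, 1),
  title             = c("Title A", "Title B", "Title C"),
  us_views_millions = c(21, 20, 32)
)

episodes <- toy_counts %>%
  # Paste season and episode together, e.g. "1.2"
  unite(episode_number, season, episode, sep = ".", remove = FALSE) %>%
  # Attach title and viewership for each episode
  inner_join(toy_info, by = c("season", "episode")) %>%
  mutate(
    # Build a readable label with glue
    episode_title = glue("{ season }.{ episode } { title }"),
    # Order by season first, then episode within season
    episode_title = fct_reorder(episode_title, season + .001 * episode)
  )

print(levels(episodes$episode_title))
```

The `season + .001 * episode` trick gives a correct ordering as long as no season has 1000 or more episodes.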
Use geom_point to visualize episode_title and us_views_millions.
Use as.integer to change episode_title to integer class.
Add labels to geom_point using geom_text with check_overlap = TRUE so text that overlaps previous text in the same layer will not be plotted.
Run the above plot again using imdb_rating instead of us_views_millions.
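A rough sketch of the scatter plot with labels, using an invented `toy_episodes` table. The colour mapping and axis labels are assumptions added for illustration, not details taken from the screencast.

```r
library(ggplot2)
library(forcats)

# Hypothetical episode data (values are made up)
toy_episodes <- data.frame(
  episode_title     = c("1.1 Title A", "1.2 Title B", "2.1 Title C"),
  season            = c(1, 1, 2),
  us_views_millions = c(21, 20, 32)
)
# Keep the factor levels in order of appearance
toy_episodes$episode_title <- fct_inorder(toy_episodes$episode_title)

# as.integer() turns the ordered factor into positions 1, 2, 3 on the x axis
p <- ggplot(toy_episodes, aes(as.integer(episode_title), us_views_millions)) +
  geom_point(aes(color = factor(season))) +
  # check_overlap = TRUE drops labels that would cover earlier labels
  geom_text(aes(label = episode_title), vjust = 1, hjust = 1,
            check_overlap = TRUE) +
  labs(x = "Episode number", y = "US views (millions)", color = "Season")
```

Swapping `us_views_millions` for an `imdb_rating` column in `aes()` would give the second version of the plot.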
Ahead of modeling:
Use geom_boxplot to visualize the distribution of speaking for main characters.
Use the complete function with fill = list(n = 0) to add rows for speaker and episode combinations that are missing from the data set, setting n to 0 for them.
Demonstration of how to account for missing imdb_rating values using the fill function with .direction = "downup" to keep the imdb rating across the same title.
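The two missing-value steps above can be sketched on toy data. The tables below are invented, and grouping by `title` before filling is an assumption about how the rating is kept within the same title.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical line counts: Rachel has no row for episode "1.2 B"
toy_counts <- tibble(
  episode_title = c("1.1 A", "1.1 A", "1.2 B"),
  speaker       = c("Ross", "Rachel", "Ross"),
  n             = c(5, 3, 4)
)

# complete() adds the missing episode/speaker combination with n = 0
completed <- toy_counts %>%
  complete(episode_title, speaker, fill = list(n = 0))

# Hypothetical ratings: a two-part title where only one part has a rating
toy_ratings <- tibble(
  title       = c("Two-Parter", "Two-Parter", "Other"),
  part        = c(1, 2, 1),
  imdb_rating = c(NA, 9, 8)
)

# fill() copies the rating down, then up, within each title
filled <- toy_ratings %>%
  group_by(title) %>%
  fill(imdb_rating, .direction = "downup") %>%
  ungroup()

print(completed)
print(filled)
```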
Ahead of modeling:
Use summarize with cor(log2(n), imdb_rating) to find the correlation between each speaker's number of lines and the IMDb rating. The correlation is positive for all speakers, which makes David suspect that some episodes are longer than others because they are two-part episodes, and that those also rate higher due to important moments. David addresses this confounding factor by using each speaker's percentage of lines instead of the number of lines.
Visualize results with geom_boxplot, geom_point with geom_smooth.
Use a linear model to predict imdb rating based on various variables.
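A minimal sketch of the pre-modeling and modeling steps, on invented data. The notes do not say which predictors the model uses, so the formula `imdb_rating ~ speaker + pct` is purely a hypothetical choice for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical lines per speaker per episode (values are made up)
toy <- tibble(
  episode     = rep(1:4, each = 2),
  speaker     = rep(c("Ross", "Rachel"), times = 4),
  n           = c(10, 20, 20, 40, 30, 30, 40, 80),
  imdb_rating = rep(c(8.0, 8.2, 8.4, 9.0), each = 2)
)

# Correlation between log2(number of lines) and rating, per speaker
by_speaker <- toy %>%
  group_by(speaker) %>%
  summarize(correlation = cor(log2(n), imdb_rating))

# Replace raw counts with each speaker's share of lines in the episode,
# which removes the effect of longer episodes having more lines overall
with_pct <- toy %>%
  group_by(episode) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup()

# Hypothetical model: rating as a function of speaker and share of lines
model <- lm(imdb_rating ~ speaker + pct, data = with_pct)

print(by_speaker)
print(coef(model))
```

In this toy data longer episodes have higher ratings, so every speaker shows a positive correlation with raw counts, which mirrors the confounding described above.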
Use the tidytext and tidylo packages to see what words are most common amongst characters, and whether they are said more times than would be expected by chance.
Use geom_col to visualize the most overrepresented words per character according to log_odds_weighted.
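The text mining step can be sketched as below, assuming a table with `speaker` and `text` columns. The sample lines are invented, and this sketch does not remove stop words.

```r
library(dplyr)
library(tibble)
library(tidytext)
library(tidylo)

# Hypothetical dialogue lines (made up for illustration)
toy_lines <- tibble(
  speaker = c("Joey", "Joey", "Ross", "Ross"),
  text    = c("How you doing", "I love pizza",
              "We were on a break", "I love dinosaurs")
)

# Split each line into lowercase words, then count words per speaker
word_counts <- toy_lines %>%
  unnest_tokens(word, text) %>%
  count(speaker, word)

# Add a log_odds_weighted column: how much more often a speaker
# uses a word than expected from overall usage
log_odds <- word_counts %>%
  bind_log_odds(speaker, word, n)

print(log_odds)
```

Taking the top few words per speaker by log_odds_weighted and passing them to geom_col gives the kind of chart described above.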
Use the widyr package and pairwise correlation to determine which characters tend to appear in the same scenes together.
Use geom_col to visualize the correlation between characters.
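A short sketch of the pairwise correlation step. The `scene_id` column is a hypothetical identifier for a scene, and the appearances are invented.

```r
library(dplyr)
library(tibble)
library(widyr)

# Hypothetical scene appearances (scene_id is an assumed column name)
toy_scenes <- tibble(
  scene_id = c(1, 1, 2, 2, 3, 3, 4),
  speaker  = c("Monica", "Chandler", "Monica", "Chandler",
               "Ross", "Rachel", "Ross")
)

# Correlate speakers by whether they appear in the same scenes;
# with no value column, presence is treated as 1 and absence as 0
scene_cors <- toy_scenes %>%
  distinct(scene_id, speaker) %>%
  pairwise_cor(speaker, scene_id, sort = TRUE)

print(scene_cors)
```

The result has one row per ordered pair of speakers, in columns item1, item2 and correlation, which can be filtered to one character and plotted with geom_col.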
Summary of screencast.