Friends

Notable topics: Data Manipulation, Linear Modeling, Pairwise Correlation, Text Mining

Recorded on: 2020-09-07

Timestamps by: Eric Fletcher

## Screencast

## Timestamps

Use the `dplyr` package's `count` function to count the unique values of multiple variables.

Use `geom_col` to show how many lines of dialogue there are for each character. Use `fct_reorder` to reorder the `speaker` factor levels by sorting along `n`.
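As a rough illustration of the counting and plotting steps, here is a minimal sketch. The `friends_toy` tibble is made-up stand-in data, not the real dataset, and only the `speaker` and `season` columns are assumed.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Made-up stand-in for the dialogue data (hypothetical rows)
friends_toy <- tibble(
  speaker = c("Ross", "Rachel", "Ross", "Joey", "Ross", "Rachel"),
  season  = c(1, 1, 1, 1, 2, 2)
)

# count() with several variables tallies each unique combination
season_speaker_counts <- friends_toy %>%
  count(season, speaker)

# Lines of dialogue per character, most frequent first
speaker_counts <- friends_toy %>%
  count(speaker, sort = TRUE)

# Reorder the speaker factor by n so the bars appear sorted
p <- speaker_counts %>%
  mutate(speaker = fct_reorder(speaker, n)) %>%
  ggplot(aes(n, speaker)) +
  geom_col()
```

Printing `p` draws the bar chart. Without `fct_reorder`, the bars would follow alphabetical order of `speaker`.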

Use `semi_join` to join the `friends` dataset with `main_cast` using `by = "speaker"`, returning all rows from `friends` with a match in `main_cast`.
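A small sketch of the filtering join, using made-up data. How `main_cast` is built in the screencast is not stated here, so the two-row table below is only an assumption.

```r
library(dplyr)

# Made-up dialogue lines (hypothetical)
friends_toy <- tibble(
  speaker = c("Ross", "Gunther", "Rachel", "Ross", "Janice"),
  text    = c("Hi.", "Coffee?", "Hey!", "Hello again.", "Oh my.")
)

# Hypothetical lookup table of main characters
main_cast <- tibble(speaker = c("Ross", "Rachel"))

# semi_join() keeps rows of friends_toy that have a match in main_cast,
# without adding any columns from main_cast
main_lines <- friends_toy %>%
  semi_join(main_cast, by = "speaker")
```

Unlike `inner_join`, `semi_join` never duplicates rows or adds columns; it only filters the left table.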

Use `unite` to create the `episode_number` variable, which pastes together `season` and `episode` with `sep = "."`.

Then, use `inner_join` to combine the above dataset with `friends_info` using `by = c("season", "episode")`.

Then, use `mutate` and the `glue` package instead of `unite` to combine `{ season }.{ episode } { title }` into `episode_title`.

Then use `fct_reorder(episode_title, season + .001 * episode)` to order it by `season` first, then `episode`.
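The labelling steps above can be sketched as follows. Both tibbles are made-up stand-ins with hypothetical titles; only the column names `season`, `episode`, and `title` are taken from the notes.

```r
library(dplyr)
library(tidyr)
library(glue)
library(forcats)

# Made-up per-line data and episode metadata (hypothetical values)
lines_toy <- tibble(
  season  = c(1, 1, 2, 10),
  episode = c(1, 2, 1, 1),
  speaker = c("Ross", "Rachel", "Joey", "Monica")
)
info_toy <- tibble(
  season  = c(1, 1, 2, 10),
  episode = c(1, 2, 1, 1),
  title   = c("Pilot A", "Pilot B", "Opener C", "Opener D")
)

# First approach: unite() pastes season and episode into one column
united <- lines_toy %>%
  unite(episode_number, season, episode, sep = ".", remove = FALSE)

# Second approach: join the metadata, build the label with glue(),
# then order the factor by season first and episode second
episodes <- lines_toy %>%
  inner_join(info_toy, by = c("season", "episode")) %>%
  mutate(
    episode_title = glue("{ season }.{ episode } { title }"),
    episode_title = fct_reorder(episode_title, season + .001 * episode)
  )

levels(episodes$episode_title)
```

The `season + .001 * episode` trick works as long as no season has 1,000 or more episodes. A plain alphabetical ordering would place season 10 before season 2.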

Use `geom_point` to visualize `episode_title` and `us_views_millions`.

Use `as.integer` to convert the `episode_title` factor to its integer codes, so episodes are plotted by position rather than by label.

Add labels to `geom_point` using `geom_text` with `check_overlap = TRUE` so text that overlaps previous text in the same layer will not be plotted.

Run the above plot again using `imdb_rating` instead of `us_views_millions`.
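A minimal sketch of the scatter plot with labels, assuming an `episode_title` factor already ordered by season and episode. All values are made up, and the aesthetic choices (`color`, `vjust`, `hjust`) are illustrative assumptions rather than the screencast's exact settings.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Made-up episode metadata (hypothetical values)
episodes_toy <- tibble(
  season            = c(1, 1, 2, 2),
  episode           = c(1, 2, 1, 2),
  title             = c("Ep A", "Ep B", "Ep C", "Ep D"),
  us_views_millions = c(21.5, 20.2, 32.1, 29.8),
  imdb_rating       = c(8.3, 8.1, 8.6, 8.4)
) %>%
  mutate(
    episode_title = paste0(season, ".", episode, " ", title),
    episode_title = fct_reorder(episode_title, season + .001 * episode)
  )

# as.integer() on a factor returns the position of each level
episode_index <- as.integer(episodes_toy$episode_title)

p <- episodes_toy %>%
  ggplot(aes(as.integer(episode_title), us_views_millions)) +
  geom_point(aes(color = factor(season))) +
  # check_overlap = TRUE skips labels that would overlap earlier ones
  geom_text(aes(label = title), vjust = 1, hjust = 1, check_overlap = TRUE) +
  labs(x = "Episode number", color = "Season")
```

Swapping `us_views_millions` for `imdb_rating` in `aes()` gives the ratings version of the same plot.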

Ahead of modeling: Use `geom_boxplot` to visualize the distribution of speaking for main characters.

Use the `complete` function with `fill = list(n = 0)` to add rows for combinations that are missing from the data set and set their `n` to 0 instead of `NA`.
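A small sketch of `complete` on made-up counts. The choice of `episode` and `speaker` as the variables to complete over is an assumption for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up line counts; Rachel has no row for episode 2 (hypothetical)
speaker_lines <- tibble(
  episode = c(1, 1, 2),
  speaker = c("Ross", "Rachel", "Ross"),
  n       = c(10, 8, 12)
)

# complete() adds the missing episode/speaker combination,
# and fill = list(n = 0) puts 0 there instead of NA
completed <- speaker_lines %>%
  complete(episode, speaker, fill = list(n = 0))
```

Without this step, a character who says nothing in an episode simply has no row, which would bias per-episode summaries.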

Demonstration of how to account for missing `imdb_rating` values using the `fill` function with `.direction = "downup"` to keep the IMDb rating across the same title.
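One way this could look, assuming the data are grouped by `title` before filling. The grouping and the values are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up ratings with gaps (hypothetical titles and values)
ratings <- tibble(
  title       = c("Part A", "Part A", "Part B", "Part B"),
  imdb_rating = c(NA, 8.5, 9.1, NA)
)

# Within each title, fill missing ratings downward first, then upward
filled <- ratings %>%
  group_by(title) %>%
  fill(imdb_rating, .direction = "downup") %>%
  ungroup()
```

Grouping matters here: without `group_by(title)`, a rating could leak from one title into the next.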

Ahead of modeling: Use `summarize` with `cor(log2(n), imdb_rating)` to find the correlation between each speaker's number of lines and the IMDb rating. The correlation is positive for every speaker, which leads David to suspect that some episodes are simply longer than others: two-part episodes have more lines for everyone and also tend to have higher ratings because they contain important moments. David addresses this `confounding factor` by using `percentage of lines` instead of `number of lines`.
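The confounding issue can be illustrated with made-up numbers in which episodes get longer and better rated together. Every value below is hypothetical, and `pct_lines` is an illustrative column name.

```r
library(dplyr)

# Made-up data: both speakers get more lines in longer episodes,
# and longer episodes happen to be rated higher
speaker_lines <- tibble(
  episode     = rep(1:4, each = 2),
  speaker     = rep(c("Ross", "Rachel"), times = 4),
  n           = c(10, 10, 20, 20, 40, 40, 80, 80),
  imdb_rating = rep(c(8.0, 8.2, 8.4, 8.6), each = 2)
)

# Raw line counts correlate with rating for every speaker
cors <- speaker_lines %>%
  group_by(speaker) %>%
  summarize(correlation = cor(log2(n), imdb_rating))

# Share of lines within each episode removes the episode-length effect
with_pct <- speaker_lines %>%
  group_by(episode) %>%
  mutate(pct_lines = n / sum(n)) %>%
  ungroup()
```

In this toy data each speaker's share stays at one half in every episode, so the apparent relationship with rating comes entirely from episode length.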

Visualize results with `geom_boxplot`, and `geom_point` with `geom_smooth`.

Use a `linear model` to predict IMDb rating based on various variables.
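A hedged sketch of what such a model might look like with base R's `lm`. The predictors here (season plus each character's share of lines) and all values are assumptions, since the notes do not list the variables used.

```r
# Made-up wide data: one row per episode, one column per predictor
episodes_wide <- data.frame(
  imdb_rating = c(8.1, 8.4, 8.3, 8.9, 8.6, 8.2),
  season      = c(1, 1, 2, 2, 3, 3),
  Ross        = c(0.20, 0.25, 0.22, 0.30, 0.28, 0.21),
  Rachel      = c(0.25, 0.20, 0.24, 0.22, 0.18, 0.26)
)

# Ordinary least squares with rating as the outcome
fit <- lm(imdb_rating ~ season + Ross + Rachel, data = episodes_wide)

# Coefficient table with estimates and standard errors
summary(fit)
```

With so few rows the estimates are meaningless; the point is only the shape of the call.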

Use the `tidytext` and `tidylo` packages to see what words are most common amongst characters, and whether they are said more times than would be expected by chance.

Use `geom_col` to visualize the most overrepresented words per character according to `log_odds_weighted`.
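A minimal sketch of the weighted log odds workflow, using made-up dialogue. `unnest_tokens` comes from `tidytext` and `bind_log_odds` from `tidylo`; the text itself is invented.

```r
library(dplyr)
library(tidytext)
library(tidylo)

# Made-up lines of dialogue (hypothetical)
lines_toy <- tibble(
  speaker = c("Joey", "Joey", "Ross", "Ross"),
  text    = c("pizza pizza sandwich", "how you doin",
              "dinosaur fossil dinosaur", "how are you")
)

# One row per word, then count words per speaker
word_counts <- lines_toy %>%
  unnest_tokens(word, text) %>%
  count(speaker, word)

# bind_log_odds() adds a log_odds_weighted column: higher values mean
# the word is more characteristic of that speaker than of the others
log_odds <- word_counts %>%
  bind_log_odds(speaker, word, n) %>%
  arrange(desc(log_odds_weighted))
```

Words that only one speaker uses score higher for that speaker than words both speakers use equally often. The result can be passed to `ggplot` with `geom_col` to plot the top words per character.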

Use the `widyr` package and `pairwise correlation` to determine which characters tend to appear in the same scenes together.

Use `geom_col` to visualize the correlation between characters.
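A small sketch of pairwise correlation with `widyr::pairwise_cor`, on made-up scene data. Treating each `scene` as the feature and each `speaker` as the item is an assumption about how the data were arranged.

```r
library(dplyr)
library(widyr)

# Made-up record of who speaks in which scene (hypothetical)
scene_speakers <- tibble(
  scene   = c(1, 1, 2, 2, 3, 3, 4, 4),
  speaker = c("Monica", "Chandler", "Monica", "Chandler",
              "Ross", "Rachel", "Ross", "Chandler")
)

# Correlation of scene membership for each pair of speakers;
# with no value column, presence in a scene is treated as 1
pairs <- scene_speakers %>%
  pairwise_cor(speaker, scene, sort = TRUE)
```

The result has one row per ordered pair (`item1`, `item2`) with a `correlation` column, which can be filtered to one character and plotted with `geom_col`.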

Summary of screencast.