Chopped

Data manipulation, Modeling (Linear Regression, Random Forest, and Natural Spline)

Published: August 24, 2020

Notable topics: Data manipulation, Modeling (Linear Regression, Random Forest, and Natural Spline)

Recorded on: 2020-08-24

Timestamps by: Eric Fletcher


Screencast

Timestamps

geom_histogram

ggplot

Use geom_histogram to visualize the distribution of episode ratings.
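A minimal sketch of this step, assuming a data frame with an episode_rating column. The toy values and the binwidth are illustrative choices, not taken from the screencast.

```r
library(ggplot2)

# Hypothetical toy ratings; the real episode data is not reproduced here.
chopped <- data.frame(
  episode_rating = c(7.9, 8.2, 8.5, 8.4, 8.8, 9.0, 8.6, 8.3, 7.5, 8.7)
)

# One bar per 0.25-wide rating bin (binwidth is an assumed value).
p <- ggplot(chopped, aes(episode_rating)) +
  geom_histogram(binwidth = 0.25)

p
```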

geom_point, geom_line

ggplot

Use geom_point and geom_line with color = factor(season) to visualize the episode rating for every episode.

group_by, summarize

dplyr

Use group_by and summarize to show the average rating for each season and the number of episodes in each season.

geom_line, geom_point

ggplot

Continuing from previous row:

Use geom_line and geom_point with size = n_episodes to visualize the average rating for each season with point size indicating the total number of episodes (larger = more episodes, smaller = fewer episodes).
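The two season-level rows above can be sketched together as follows. The toy data and the avg_rating column name are assumptions; n_episodes follows the description.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: one row per episode.
chopped <- tibble(
  season = c(1, 1, 1, 2, 2, 3),
  episode_rating = c(8.0, 8.4, 8.2, 8.6, 9.0, 7.5)
)

# Average rating and episode count per season.
by_season <- chopped %>%
  group_by(season) %>%
  summarize(
    n_episodes = n(),
    avg_rating = mean(episode_rating)
  )

# Point size encodes how many episodes each season has.
p <- ggplot(by_season, aes(season, avg_rating)) +
  geom_line() +
  geom_point(aes(size = n_episodes))

p
```

If some ratings are missing, mean(episode_rating, na.rm = TRUE) is one way to keep those seasons in the summary.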

fct_reorder

forcats

Use fct_reorder to reorder the episode_name factor levels by sorting along the episode_rating variable.

geom_point, glue, arrange

ggplot, dplyr, glue

Use geom_point to visualize the top episodes by rating.

Use the 'glue' package to place season number and episode number before episode name on the y axis.
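A rough sketch of the top-episodes plot, combining the fct_reorder and glue steps. The toy data, the season_episode column, the label format, and the cutoff of three episodes are assumptions for illustration.

```r
library(dplyr)
library(forcats)
library(ggplot2)
library(glue)

# Hypothetical toy data.
chopped <- tibble(
  season = c(1, 1, 2, 2),
  season_episode = c(1, 2, 1, 2),
  episode_name = c("Alpha", "Bravo", "Charlie", "Delta"),
  episode_rating = c(8.1, 9.0, 7.4, 8.8)
)

top_episodes <- chopped %>%
  arrange(desc(episode_rating)) %>%
  head(3) %>%
  mutate(
    # Prefix the name with season and episode number.
    label = as.character(glue("{season}.{season_episode} {episode_name}")),
    # Order the labels by rating so the plot is sorted.
    label = fct_reorder(label, episode_rating)
  )

p <- ggplot(top_episodes, aes(episode_rating, label)) +
  geom_point()

p
```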

pivot_longer, separate_rows

tidyr

Use pivot_longer to combine ingredients into one single column.

Use separate_rows with sep = ", " to separate out the ingredients with each ingredient getting its own row.
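A small sketch of the reshaping, assuming one comma-separated ingredient string per course column. The column names appetizer, entree, and dessert and the toy values are assumptions.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per episode, one column per course.
chopped <- tibble(
  series_episode = c(1, 2),
  appetizer = c("baby octopus, bok choy", "chocolate, duck"),
  entree = c("duck, green onions", "bok choy, honey"),
  dessert = c("prunes, animal crackers", "chocolate, quail eggs")
)

ingredients <- chopped %>%
  # Stack the three course columns into course / ingredient pairs.
  pivot_longer(
    cols = c(appetizer, entree, dessert),
    names_to = "course",
    values_to = "ingredient"
  ) %>%
  # Split each comma-separated string so every ingredient gets its own row.
  separate_rows(ingredient, sep = ", ")

ingredients
```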

fct_lump, fct_reorder

forcats

Use fct_lump to lump ingredients together except for the 10 most frequent.

Use fct_reorder to reorder ingredient factor levels by sorting against n.
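A toy version of the lumping step. It keeps the 2 most frequent ingredients rather than 10 so the effect is visible on a small, made-up vector.

```r
library(dplyr)
library(forcats)

# Hypothetical toy data: one row per ingredient appearance.
ingredients <- tibble(
  ingredient = c("duck", "duck", "duck", "honey", "honey", "prunes", "quail eggs")
)

counts <- ingredients %>%
  # Everything outside the top 2 becomes "Other".
  mutate(ingredient = fct_lump(ingredient, n = 2)) %>%
  count(ingredient) %>%
  # Sort factor levels by count for plotting.
  mutate(ingredient = fct_reorder(ingredient, n))

counts
```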

geom_col

ggplot

Use geom_col to create a stacked bar plot to visualize the most common ingredients by course.

fct_relevel

forcats

Use fct_relevel to reorder course factor levels to appetizer, entree, dessert.

fct_rev, scale_fill_discrete

forcats, ggplot

Use fct_rev and scale_fill_discrete with guide = guide_legend(reverse = TRUE) to reorder the segments within the stacked bar plot.
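The three stacked-bar rows above (geom_col, fct_relevel, fct_rev) might fit together roughly like this. The counts are made up, and the column names are assumptions.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical counts of ingredient appearances by course.
counts <- tibble(
  ingredient = c("duck", "duck", "duck", "honey", "honey"),
  course = c("dessert", "appetizer", "entree", "entree", "dessert"),
  n = c(1, 4, 6, 2, 5)
)

plot_data <- counts %>%
  mutate(
    # Put courses in serving order, then reverse so the segments
    # stack in that order within each bar.
    course = fct_relevel(course, "appetizer", "entree", "dessert"),
    course = fct_rev(course)
  )

p <- ggplot(plot_data, aes(n, ingredient, fill = course)) +
  geom_col() +
  # Flip the legend back so it reads appetizer, entree, dessert.
  scale_fill_discrete(guide = guide_legend(reverse = TRUE))

p
```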

add_count, distinct, pairwise_cor

widyr, dplyr

Use the widyr package and pairwise_cor to find out what ingredients appear together.

Mentioned: David Robinson - The {widyr} Package YouTube Talk at 2020 R Conference
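A minimal sketch of the correlation step on made-up data. The minimum-appearance threshold of 2 is an assumption, used here only to show where add_count fits.

```r
library(dplyr)
library(widyr)

# Hypothetical toy data: one row per ingredient per episode.
ingredients <- tibble(
  series_episode = c(1, 1, 2, 2, 3, 3, 4, 4),
  ingredient = c("duck", "honey", "duck", "honey",
                 "prunes", "rice", "prunes", "duck")
)

correlations <- ingredients %>%
  distinct(series_episode, ingredient) %>%
  # Drop ingredients that appear fewer than 2 times (assumed threshold).
  add_count(ingredient) %>%
  filter(n >= 2) %>%
  select(-n) %>%
  # Correlate ingredients on whether they share an episode.
  pairwise_cor(ingredient, series_episode, sort = TRUE)

correlations
```

In this toy data honey and prunes never share an episode, so their correlation is strongly negative.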

ggraph, geom_edge_link, geom_node_point, geom_node_text

widyr, ggraph, tidygraph

Use ggraph, geom_edge_link, geom_node_point, and geom_node_text to create a network diagram of the ingredients that shows how they are connected.
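A rough sketch of such a network plot. The edge list stands in for pairwise_cor output with made-up correlations, and the "fr" layout and edge_alpha mapping are illustrative choices rather than the screencast's exact settings.

```r
library(ggraph)
library(tidygraph)

# Hypothetical edge list shaped like pairwise_cor output.
edges <- data.frame(
  item1 = c("duck", "duck", "honey"),
  item2 = c("honey", "prunes", "prunes"),
  correlation = c(0.6, 0.3, 0.4)
)

# Nodes are ingredients; edges carry the correlation.
graph <- as_tbl_graph(edges, directed = FALSE)

set.seed(2020)  # the force-directed layout is random
p <- ggraph(graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation)) +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

p
```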

pairwise_count

widyr

Use pairwise_count from widyr to count the number of times each pair of items appear together within a group defined by feature.

unite, pairwise_count

tidyr, widyr

Use unite from tidyr to paste the episode_course and series_episode columns into a single column, then use pairwise_count to check whether any pairs of ingredients appear together in the same course across episodes.
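A small sketch of the idea on made-up data. The name of the united column, course_id, is hypothetical.

```r
library(dplyr)
library(tidyr)
library(widyr)

# Hypothetical toy data: one row per ingredient per course per episode.
ingredients <- tibble(
  series_episode = c(1, 1, 1, 1, 2, 2, 2, 2),
  episode_course = c("appetizer", "appetizer", "entree", "entree",
                     "appetizer", "appetizer", "entree", "entree"),
  ingredient = c("duck", "honey", "rice", "prunes",
                 "duck", "honey", "duck", "rice")
)

pair_counts <- ingredients %>%
  # One id per episode-and-course combination, e.g. "1_appetizer".
  unite(course_id, series_episode, episode_course) %>%
  distinct(course_id, ingredient) %>%
  # Count how often each pair shares a course.
  pairwise_count(ingredient, course_id, sort = TRUE)

pair_counts
```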

summarize, min, mean, max

dplyr, base

Use summarize with min, mean, max, and n() to create the first_season, avg_season, last_season, and n_appearances variables.
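A minimal sketch of that summary on made-up data, using the variable names from the description.

```r
library(dplyr)

# Hypothetical toy data: one row per ingredient appearance.
ingredients <- tibble(
  season = c(1, 3, 5, 2, 2),
  ingredient = c("duck", "duck", "duck", "honey", "honey")
)

by_ingredient <- ingredients %>%
  group_by(ingredient) %>%
  summarize(
    first_season = min(season),
    avg_season = mean(season),
    last_season = max(season),
    n_appearances = n()
  )

by_ingredient
```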

slice, tail

dplyr, base


Use slice with tail to get the n ingredients that appear in early and late seasons.
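One way to sketch this is to sort by average season and keep the first and last few rows. The toy data, the placeholder ingredient names, and n_each = 2 are assumptions.

```r
library(dplyr)

# Hypothetical per-ingredient summary.
by_ingredient <- tibble(
  ingredient = c("a", "b", "c", "d", "e", "f"),
  avg_season = c(30, 2, 41, 12, 25, 7)
)

n_each <- 2  # how many early and late ingredients to keep (assumed)

early_and_late <- by_ingredient %>%
  arrange(avg_season) %>%
  # First n_each rows plus the last n_each rows.
  slice(c(1:n_each, tail(row_number(), n_each)))

early_and_late
```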

semi_join, geom_boxplot, fct_reorder

dplyr, ggplot, forcats

Use geom_boxplot to visualize the distribution of each ingredient across all seasons.

pivot_wider, lm, linear_reg, set_engine, fit, initial_split, training, plot, vfold_cv, fit_resamples, tune_grid, collect_metrics, geom_line, tidy, rand_forest, clean_names, step_ns, prep, juice

tidymodels, stats, rsample, ggplot, broom, parsnip, janitor, base

Fit predictive models (linear regression, random forest, and natural spline) to determine if episode rating is explained by the ingredients or season.

Use pivot_wider with values_fill = list(value = 0) to create one indicator column per ingredient, with 1 indicating the ingredient was used and 0 indicating it wasn't.
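A simplified sketch of the wide indicator table feeding a model. It uses base lm with splines::ns in place of the tidymodels workflow (linear_reg, step_ns, tune_grid) listed above, and the data and formula are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(splines)

# Hypothetical toy data: two ingredients per episode, one episode per season.
ingredients <- tibble(
  series_episode = rep(1:8, each = 2),
  season = rep(1:8, each = 2),
  episode_rating = rep(c(8.0, 8.4, 8.6, 8.9, 8.7, 8.3, 8.5, 8.8), each = 2),
  ingredient = c("duck", "honey", "duck", "rice", "honey", "rice",
                 "duck", "prunes", "honey", "prunes", "rice", "prunes",
                 "duck", "honey", "rice", "prunes")
)

# One row per episode, one 0/1 column per ingredient.
wide <- ingredients %>%
  mutate(value = 1) %>%
  pivot_wider(
    names_from = ingredient,
    values_from = value,
    values_fill = list(value = 0)
  )

# Rating as a natural spline of season plus one ingredient indicator
# (assumed formula, not the screencast's model).
fit <- lm(episode_rating ~ ns(season, df = 2) + duck, data = wide)

summary(fit)
```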

Summary of screencast.

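Timestamps, in the order of the entries above:

0:05:20 geom_histogram
0:06:30 geom_point, geom_line
0:07:15 group_by, summarize
0:07:15 geom_line, geom_point
0:10:55 fct_reorder
0:10:55 top episodes by rating (geom_point, glue)
0:15:20 pivot_longer, separate_rows
0:18:10 fct_lump, fct_reorder
0:18:10 geom_col
0:19:45 fct_relevel
0:21:00 fct_rev, scale_fill_discrete
0:23:20 add_count, distinct, pairwise_cor
0:26:20 ggraph, geom_edge_link, geom_node_point, geom_node_text
0:28:00 pairwise_count
0:30:15 unite, pairwise_count
0:31:55 summarize, min, mean, max
0:34:35 slice, tail
0:35:40 semi_join, geom_boxplot, fct_reorder
0:36:50 predictive models (linear regression, random forest, natural spline)
1:17:25 Summary of screencast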