Chopped

Data manipulation, Modeling (Linear Regression, Random Forest, and Natural Spline)

Published: August 24, 2020

Recorded on: 2020-08-24

Timestamps by: Eric Fletcher

Timestamps

geom_histogram
ggplot

Use geom_histogram to visualize the distribution of episode ratings.

geom_point, geom_line
ggplot

Use geom_point and geom_line with color = factor(season) to visualize the episode rating for every episode.

group_by, summarize
dplyr

Use group_by and summarize to show the average rating for each season and the number of episodes in each season.
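
A minimal sketch of this step, using a small made-up data frame. The column names season and episode_rating, and the values, are assumptions for illustration and not the screencast's actual data:

```r
library(dplyr)

# Hypothetical toy data: one row per episode (names and values are assumed)
chopped <- tibble(
  season = c(1, 1, 2, 2, 2),
  episode_rating = c(8.0, 9.0, 7.0, 8.0, 9.0)
)

# One row per season: episode count and mean rating
by_season <- chopped %>%
  group_by(season) %>%
  summarize(
    n_episodes = n(),
    avg_rating = mean(episode_rating)
  )
```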

geom_line, geom_point
ggplot

Continuing from previous row:

Use geom_line and geom_point with size = n_episodes to visualize the average rating for each season with point size indicating the total number of episodes (larger = more episodes, smaller = fewer episodes).
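
A sketch of how such a plot might be built. The by_season table below is invented for illustration (its column names and values are assumptions), standing in for a per-season summary:

```r
library(dplyr)
library(ggplot2)

# Hypothetical per-season summary (values are made up)
by_season <- tibble(
  season = 1:4,
  n_episodes = c(13, 13, 10, 20),
  avg_rating = c(8.1, 8.4, 8.0, 8.6)
)

# Line for the trend; point size is mapped to the number of episodes
p <- ggplot(by_season, aes(season, avg_rating)) +
  geom_line() +
  geom_point(aes(size = n_episodes))
```

Mapping size inside aes() on geom_point only keeps the line width constant while the points scale with n_episodes.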

fct_reorder
forcats

Use fct_reorder to reorder the episode_name factor levels by sorting along the episode_rating variable.

geom_point, glue, arrange
ggplot, dplyr, glue

Use geom_point to visualize the top episodes by rating.

Use the 'glue' package to place season number and episode number before episode name on the y axis.

pivot_longer, separate_rows
tidyr

Use pivot_longer to combine ingredients into one single column.

Use separate_rows with sep = ", " to separate out the ingredients with each ingredient getting its own row.
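
The two steps above can be sketched as follows. The toy data and the column names (series_episode, appetizer, entree, dessert, course, ingredient) are assumptions for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one comma-separated ingredient string per course
chopped <- tibble(
  series_episode = c(1, 2),
  appetizer = c("watermelon, sardines", "salmon, ketchup"),
  entree = c("duck, ginger", "pork, rhubarb"),
  dessert = c("prunes, rice", "chocolate, figs")
)

ingredients <- chopped %>%
  # Stack the three course columns into course / ingredient pairs
  pivot_longer(
    cols = c(appetizer, entree, dessert),
    names_to = "course",
    values_to = "ingredient"
  ) %>%
  # Split each comma-separated string so every ingredient gets its own row
  separate_rows(ingredient, sep = ", ")
```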

fct_lump, fct_reorder
forcats

Use fct_lump to lump ingredients together except for the 10 most frequent.

Use fct_reorder to reorder ingredient factor levels by sorting against n.
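
A small sketch of lumping and reordering, keeping the 2 most frequent levels instead of 10 so the toy example stays short. The data is made up for illustration:

```r
library(dplyr)
library(forcats)

# Hypothetical toy data: "duck" x4, "figs" x3, two ingredients appearing once
ingredients <- tibble(
  ingredient = c(rep("duck", 4), rep("figs", 3), "rice", "kale")
)

top_ingredients <- ingredients %>%
  # Keep the 2 most frequent levels; everything else becomes "Other"
  mutate(ingredient = fct_lump(ingredient, n = 2)) %>%
  count(ingredient) %>%
  # Order the factor levels by count so a plot would sort by frequency
  mutate(ingredient = fct_reorder(ingredient, n))
```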

geom_col
ggplot

Use geom_col to create a stacked bar plot to visualize the most common ingredients by course.

fct_relevel
forcats

Use fct_relevel to reorder course factor levels to appetizer, entree, dessert.

fct_rev, scale_fill_discrete
forcats, ggplot

Use fct_rev and scale_fill_discrete with guide = guide_legend(reverse = TRUE) to reorder the segments within the stacked bar plot.
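
One way the releveling and reversing might fit together, sketched on invented counts (the counts table and its column names are assumptions):

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical ingredient counts by course (values are made up)
counts <- tibble(
  ingredient = rep(c("duck", "figs"), each = 3),
  course = rep(c("appetizer", "entree", "dessert"), 2),
  n = c(3, 5, 1, 2, 1, 4)
)

plot_data <- counts %>%
  mutate(
    # Put the courses in serving order, then reverse so segments stack that way
    course = fct_relevel(course, "appetizer", "entree", "dessert"),
    course = fct_rev(course)
  )

p <- ggplot(plot_data, aes(n, ingredient, fill = course)) +
  geom_col() +
  # Reverse the legend so it reads appetizer, entree, dessert again
  scale_fill_discrete(guide = guide_legend(reverse = TRUE))
```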

add_count, distinct, pairwise_cor
widyr, dplyr

Use the widyr package and pairwise_cor to find out what ingredients appear together.
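
A minimal sketch of pairwise_cor on made-up data. The column names ingredient and episode_course are assumptions. With no value column supplied, each ingredient is treated as present (1) or absent (0) in each group, so the result is a binary correlation:

```r
library(dplyr)
library(widyr)

# Hypothetical toy data: which ingredients appear in which episode/course
ingredients <- tibble(
  episode_course = c("e1", "e1", "e2", "e2", "e3", "e3", "e4", "e4"),
  ingredient = c("ginger", "duck", "ginger", "duck",
                 "figs", "rice", "figs", "duck")
)

# Correlation of each ingredient pair across episode_course groups
cors <- ingredients %>%
  pairwise_cor(ingredient, episode_course, sort = TRUE)
```

In this toy data, ginger and figs never share a group and between them cover every group, so their correlation is -1.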

Mentioned: David Robinson - The {widyr} Package YouTube Talk at 2020 R Conference

ggraph, geom_edge_link, geom_node_point, geom_node_text
widyr, ggraph, tidygraph

Use ggraph, geom_edge_link, geom_node_point, and geom_node_text to create an ingredient network diagram showing how the ingredients relate to one another.

pairwise_count
widyr

Use pairwise_count from widyr to count the number of times each pair of items appears together within a group defined by feature.

unite, pairwise_count
tidyr, widyr

Use unite from the tidyr package to paste together the episode_course and series_episode columns into one column, then check whether any pairs of ingredients appear together in the same course across episodes.
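
A sketch of combining unite with pairwise_count. The data is made up, and the naming here is an assumption: the new episode_course column is built from hypothetical series_episode and course columns, which may not match the screencast's exact code:

```r
library(dplyr)
library(tidyr)
library(widyr)

# Hypothetical toy data (column names and values are assumed)
ingredients <- tibble(
  series_episode = c(1, 1, 1, 1, 2, 2, 2, 2),
  course = rep(c("appetizer", "appetizer", "entree", "entree"), 2),
  ingredient = c("ginger", "duck", "figs", "rice",
                 "ginger", "duck", "figs", "duck")
)

pair_counts <- ingredients %>%
  # Paste episode and course into one grouping key, e.g. "1_appetizer"
  unite(episode_course, series_episode, course) %>%
  # Count how often each ingredient pair shares an episode/course
  pairwise_count(ingredient, episode_course, sort = TRUE)
```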

summarize, min, mean, max
dplyr, base

Use summarize with min, mean, max, and n() to create the first_season, avg_season, last_season, and n_appearances variables.
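
A minimal sketch of that summary on made-up data. The input column names season and ingredient are assumptions; the output names follow the description above:

```r
library(dplyr)

# Hypothetical toy data: one row per ingredient appearance
ingredients <- tibble(
  season = c(1, 5, 9, 2, 4),
  ingredient = c("duck", "duck", "duck", "figs", "figs")
)

by_ingredient <- ingredients %>%
  group_by(ingredient) %>%
  summarize(
    first_season = min(season),   # earliest season the ingredient appears
    avg_season = mean(season),    # average season of its appearances
    last_season = max(season),    # latest season the ingredient appears
    n_appearances = n()           # total number of appearances
  )
```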

slice, tail
dplyr, base

Use slice with tail to get the n ingredients that appear in early and late seasons.

semi_join, geom_boxplot, fct_reorder
dplyr, ggplot, forcats

Use geom_boxplot to visualize the distribution of each ingredient across all seasons.

pivot_wider, lm, linear_reg, set_engine, fit, initial_split, training, plot, base, vfold_cv, fit_resamples, tune_grid, collect_metrics, geom_line, tidy, rand_forest, clean_names, step_ns, tune_grid, collect_metrics, prep, juice
tidymodels, stats, rsample, ggplot, broom, parsnip, janitor

Fit predictive models (linear regression, random forest, and natural spline) to determine whether episode rating is explained by the ingredients or the season.

Use pivot_wider with values_fill = list(value = 0) to create one column per ingredient, with 1 indicating the ingredient was used and 0 indicating it wasn't.
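
A sketch of building those 0/1 indicator columns. The toy data and the column names series_episode, ingredient, and value are assumptions for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per ingredient used in an episode
ingredients <- tibble(
  series_episode = c(1, 1, 2, 2),
  ingredient = c("duck", "figs", "duck", "rice")
)

wide <- ingredients %>%
  # Mark every observed episode/ingredient pair with a 1
  mutate(value = 1) %>%
  # One column per ingredient; unobserved pairs are filled with 0
  pivot_wider(
    names_from = ingredient,
    values_from = value,
    values_fill = list(value = 0)
  )
```

The resulting wide table has one row per episode, which is the shape a model formula with one predictor per ingredient would expect.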

Summary of screencast.