Beach Volleyball
Notable topics: Data cleaning, Logistic regression
Recorded on: 2020-05-18
Timestamps by: Eric Fletcher
Screencast
Timestamps
Use pivot_longer from the tidyr package to pivot the data set from wide to long.
Use mutate_at from the dplyr package with starts_with to change the class to character for all columns that start with w_ and l_.
Use separate from the tidyr package to split the name variable into three columns, with extra = "merge" and fill = "right".
Use rename from the dplyr package to rename w_player1, w_player2, l_player1, and l_player2.
Use pivot_wider from the tidyr package to pivot the name variable from long to wide.
Use str_to_upper from the stringr package to convert the winner_loser values w and l to uppercase.
Add unique row numbers for each match using mutate with row_number from the dplyr package.
Separate the score values into multiple rows using separate_rows from the tidyr package.
Use separate from the tidyr package to split the actual scores into two columns, one for the winner's score w_score and another for the loser's score l_score.
Use na_if from the dplyr package to change the "Forfeit or other" value in the score variable to NA.
Use str_remove from the stringr package to strip retired from the scores that include it.
Check how often the winner's score w_score is greater than the loser's score l_score, confirming that it is greater at least 1/3 of the time.
Use summarize from the dplyr package to create summary statistics, including the number of matches, winning percentage, date of first match, and date of most recent match.
Use type_convert from the readr package to convert character class variables to numeric.
Use summarize_all from the dplyr package to calculate which fraction of the data is not NA.
Use summarize from the dplyr package to determine each player's number of matches, winning percentage, average attacks, average errors, average kills, average aces, average serve errors, and total rows with data for years prior to 2019.
The summary statistics are then used to explore how we could predict whether a player will win in 2019, using geom_point and logistic regression. Initially, David wanted to predict performance based on a player's first-year performance. (NOTE - David mistakenly grouped by year and age. He catches this around 1:02:00.)
Use year from the lubridate package within a group_by to determine the age of each player given their birthdate.
Turn the summary statistics at timestamp 42:00 into a . %>% (dot-pipe) function.
Summary of screencast.
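The score-cleaning steps and the . %>% function above can be sketched roughly as follows. This is a minimal illustration and not the code from the screencast: the toy matches data frame, the match_id column name, the separators, and the summarize_sets helper are all assumptions made for the example, and the steps are applied in an order chosen for brevity.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical toy data; the real data set has many more columns.
matches <- tibble(
  w_player1 = c("A", "B", "C"),
  score = c("21-18, 21-12", "Forfeit or other", "21-19, 15-10 retired")
)

sets <- matches %>%
  # Unique row number for each match
  mutate(match_id = row_number()) %>%
  # Treat forfeits as missing and strip the " retired" marker
  mutate(
    score = na_if(score, "Forfeit or other"),
    score = str_remove(score, " retired")
  ) %>%
  # One row per set, then one column per side (convert = TRUE makes them integers)
  separate_rows(score, sep = ", ") %>%
  separate(score, c("w_score", "l_score"), sep = "-", convert = TRUE)

# A . %>% expression builds a reusable function out of a pipeline
# (summarize_sets is a hypothetical name for this sketch)
summarize_sets <- . %>%
  summarize(
    n_sets = n(),
    sets_won_by_winner = sum(w_score > l_score, na.rm = TRUE)
  )

sets %>%
  group_by(match_id) %>%
  summarize_sets()
```

Because summarize_sets is an ordinary function of one argument, the same summary can be reused on any grouped data frame with w_score and l_score columns, which is the convenience the dot-pipe step is after.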