Beach Volleyball

Data cleaning, Logistic regression

Published

May 18, 2020

Notable topics: Data cleaning, Logistic regression

Recorded on: 2020-05-18

Timestamps by: Eric Fletcher

Screencast

Timestamps

pivot_longer
tidyr

Use pivot_longer from the tidyr package to pivot the data set from wide to long.
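A minimal sketch of the wide-to-long step, on a toy table rather than the screencast's data. The column names (match_id, w_player1, l_player1) and the values are assumptions chosen to mimic the w_/l_ naming pattern.

```r
library(tidyr)

# Toy data (hypothetical values); one row per match, one column per role
matches <- tibble::tibble(
  match_id  = 1:2,
  w_player1 = c("Ann", "Cat"),
  l_player1 = c("Bea", "Dee")
)

# Stack the w_/l_ columns into name/value pairs: one row per match and column
long <- pivot_longer(
  matches,
  cols = c(w_player1, l_player1),
  names_to = "name",
  values_to = "value"
)
```

Each match now contributes one row per pivoted column, which is what later steps such as separate operate on.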

mutate_at
dplyr

Use mutate_at from the dplyr package with starts_with to change the class to character for all columns that start with w_ or l_.
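As a rough illustration, the class change might look like the following. The toy columns w_rank and l_rank are assumptions, not columns confirmed from the screencast.

```r
library(dplyr)

# Toy data (hypothetical): numeric columns with the w_/l_ prefixes
matches <- tibble(
  match_id = 1:2,
  w_rank   = c(1, 5),
  l_rank   = c(3, 2)
)

# Convert every column whose name starts with w_ or l_ to character
converted <- matches %>%
  mutate_at(vars(starts_with("w_"), starts_with("l_")), as.character)
```

Making the classes uniform like this is typically what lets columns of mixed types be pivoted into a single value column.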

separate
tidyr

Use separate from the tidyr package to separate the name variable into three columns with extra = "merge" and fill = "right".
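A small sketch of how extra = "merge" and fill = "right" behave. The input strings and the output column names (winner_loser, player, stat) are assumptions for illustration.

```r
library(tidyr)

# Toy long-format data (hypothetical names and values)
long <- tibble::tibble(
  name  = c("w_player1", "w_p1_tot_attacks", "l_rank"),
  value = c("Ann", "12", "3")
)

parts <- separate(
  long, name,
  into  = c("winner_loser", "player", "stat"),
  sep   = "_",
  extra = "merge",  # pieces beyond the third stay glued to the last column
  fill  = "right"   # names with fewer than three pieces get NA on the right
)
```

Here "w_p1_tot_attacks" keeps "tot_attacks" intact in the third column, while "w_player1" and "l_rank" get NA there.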

rename
dplyr

Use rename from the dplyr package to rename w_player1, w_player2, l_player1, and l_player2.

pivot_wider
tidyr

Use pivot_wider from the tidyr package to pivot the name variable from long to wide.
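A sketch of the long-to-wide step on toy data. The columns match_id, winner_loser, name, and value are assumed here to show the shape of the operation.

```r
library(tidyr)

# Toy long-format data (hypothetical): two players per team, one match
long <- tibble::tibble(
  match_id     = c(1, 1, 1, 1),
  winner_loser = c("w", "w", "l", "l"),
  name         = c("player1", "player2", "player1", "player2"),
  value        = c("Ann", "Bea", "Cat", "Dee")
)

# Spread the name variable into columns: one row per match and team
wide <- pivot_wider(long, names_from = name, values_from = value)
```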

str_to_upper
stringr

Use str_to_upper to convert the winner_loser w and l values to uppercase.

row_number
dplyr

Add unique row numbers for each match using mutate with row_number from the dplyr package.

separate_rows
tidyr

Separate the score values into multiple rows using separate_rows from the tidyr package.

separate
tidyr

Use separate from the tidyr package to split the actual scores into two columns, one for the winner's score w_score and another for the loser's score l_score.
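The two score steps (separate_rows, then separate) can be sketched together. The score format "21-18, 21-12" and the match_id column are assumptions for this toy example.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): one string of set scores per match
matches <- tibble(
  match_id = 1:2,
  score    = c("21-18, 21-12", "21-19, 17-21, 15-10")
)

sets <- matches %>%
  # One row per set
  separate_rows(score, sep = ", ") %>%
  # Split each set into the winner's and loser's points, converted to integers
  separate(score, c("w_score", "l_score"), sep = "-", convert = TRUE)
```

Note that w_score belongs to the match winner, so it can be lower than l_score in an individual set.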

na_if
dplyr

Use na_if from the dplyr package to change the "Forfeit or other" value in the score variable to NA.

str_remove
stringr

Use str_remove from the stringr package to strip the retired text from scores that include it.
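A sketch of both score clean-up steps (na_if, then str_remove). The exact strings "Forfeit or other" and " retired" are assumed here, as are the toy scores.

```r
library(dplyr)
library(stringr)

# Toy data (hypothetical score strings)
scores <- tibble(
  score = c("21-18, 21-12", "Forfeit or other", "21-15, 8-3 retired")
)

cleaned <- scores %>%
  mutate(
    score = na_if(score, "Forfeit or other"),  # placeholder value becomes NA
    score = str_remove(score, " retired")      # drop the trailing label only
  )
```

str_remove edits the string rather than dropping the row, so the partial score of a retired match is kept.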

mutate, group_by, summarize
dplyr

Determine how many times the winner's score w_score is greater than the loser's score l_score at least 1/3 of the time.

summarize
dplyr

Use summarize from the dplyr package to create summary statistics, including the number of matches, winning percentage, date of first match, and date of most recent match.

type_convert
readr

Use type_convert from the readr package to convert character class variables to numeric.

summarize_all
dplyr

Use summarize_all from the dplyr package to calculate which fraction of the data is not NA.
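A minimal sketch of the non-NA fraction calculation. The column names tot_attacks and tot_aces and their values are assumptions.

```r
library(dplyr)

# Toy data (hypothetical): statistics with some missing values
stats <- tibble(
  tot_attacks = c(10, NA, 7, NA),
  tot_aces    = c(1, 2, NA, 4)
)

# The mean of a logical vector is the share of TRUEs, so this gives
# the fraction of non-missing values in each column
coverage <- stats %>%
  summarize_all(~ mean(!is.na(.)))
```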

summarize, inner_join, geom_point, glm, cbind
dplyr, ggplot2

Use summarize from the dplyr package to determine each player's number of matches, winning percentage, average attacks, average errors, average kills, average aces, average serve errors, and total rows with data for years prior to 2019.

The summary statistics are then used to explore how we could predict whether a player will win in 2019, using geom_point and logistic regression. Initially, David wanted to predict performance based on players' first-year performance. (NOTE - David mistakenly grouped by year and age. He catches this around 1:02:00.)
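One plausible shape for a logistic regression with glm and cbind is sketched below. The formula, the column names (n_wins, n_matches, avg_kills), and the values are assumptions, not the model fitted in the screencast.

```r
# Toy per-player summary (hypothetical values)
players <- data.frame(
  n_wins    = c(3, 8, 5, 12, 9, 2),
  n_matches = c(10, 12, 11, 15, 16, 9),
  avg_kills = c(0.40, 0.62, 0.47, 0.70, 0.55, 0.38)
)

# cbind(successes, failures) lets a binomial glm model win counts
# directly, so players with more matches carry more weight
model <- glm(
  cbind(n_wins, n_matches - n_wins) ~ avg_kills,
  data   = players,
  family = "binomial"
)
```

In this toy data the win rate rises with avg_kills, so the fitted coefficient is positive; coef(model) or summary(model) shows the estimates.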

summarize, year
lubridate

Use year from the lubridate package within a group_by to determine the age for each player given their birthdate.
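A sketch of computing an age inside group_by, assuming age is taken as the match year minus the birth year. The columns player, birthdate, and date and their values are hypothetical.

```r
library(dplyr)
library(lubridate)

# Toy data (hypothetical): one row per player per match
matches <- tibble(
  player    = c("Ann", "Ann", "Bea"),
  birthdate = as.Date(c("1990-06-01", "1990-06-01", "1985-01-15")),
  date      = as.Date(c("2018-07-04", "2018-08-10", "2018-07-04"))
)

# group_by can create the grouping variable on the fly
by_age <- matches %>%
  group_by(player, age = year(date) - year(birthdate)) %>%
  summarize(n_matches = n(), .groups = "drop")
```

This is a year-difference approximation; it ignores whether the birthday has already passed in the match year.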

Turn the summary statistics at timestamp 42:00 into a . %>% (dot-pipe) function.
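A sketch of the . %>% idiom: starting a pipeline with the dot placeholder yields a reusable function instead of a result. The name summarize_players, the columns, and the statistics are assumptions, not the screencast's definitions.

```r
library(dplyr)

# A pipeline that starts with `.` is stored as a function of its input
summarize_players <- . %>%
  summarize(
    n_matches  = n(),
    pct_winner = mean(winner_loser == "W")
  )

# Toy data (hypothetical)
toy <- tibble(
  player       = c("Ann", "Ann", "Bea", "Bea", "Bea"),
  winner_loser = c("W", "L", "W", "W", "L")
)

# The same summary can now be reused after any grouping
by_player <- toy %>%
  group_by(player) %>%
  summarize_players()
```

Defining the summary once this way avoids repeating the same summarize call when it is needed for several groupings.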

Summary of screencast.