Beach Volleyball

Notable topics: Data cleaning, Logistic regression

Recorded on: 2020-05-18

Timestamps by: Eric Fletcher

## Screencast

## Timestamps

Use `pivot_longer` from the `tidyr` package to pivot the data set from wide to long.
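
A minimal sketch of the pivot on a toy table. The `matches` data and its values are hypothetical, not the screencast's data set.

```r
library(tidyr)

# Hypothetical wide table: one row per match, one column per player slot.
matches <- tibble::tibble(
  match_id  = 1:2,
  w_player1 = c("Ana", "Cleo"),
  l_player1 = c("Bea", "Dana")
)

# Stack every column except match_id into name/value pairs.
# pivot_longer() calls the new columns "name" and "value" by default.
matches_long <- pivot_longer(matches, cols = -match_id)
matches_long
```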

Use `mutate_at` from the `dplyr` package with `starts_with` to change the class to `character` for all columns that start with `w_` and `l_`.
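
A small sketch of the class change. The column names below are made up for illustration; `starts_with` has to be wrapped in `vars()` when used with `mutate_at`.

```r
library(dplyr)

# Hypothetical columns following the w_/l_ prefix convention.
matches <- tibble(
  match_id      = 1:2,
  w_p1_tot_aces = c(3, 1),
  l_p1_tot_aces = c(0, 2)
)

# Convert every w_* and l_* column to character; match_id is left alone.
matches_chr <- matches %>%
  mutate_at(vars(starts_with("w_"), starts_with("l_")), as.character)
matches_chr
```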

Use `separate` from the `tidyr` package to separate the `name` variable into three columns with `extra = "merge"` and `fill = "right"`.
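
A sketch of what the two arguments do, on hypothetical values: `extra = "merge"` keeps any surplus pieces together in the last column, and `fill = "right"` pads short values with `NA` on the right. The output column names are assumptions.

```r
library(tidyr)

# Hypothetical long-format names: one with four pieces, one with only two.
stats_long <- tibble::tibble(
  name  = c("w_p1_tot_attacks", "l_rank"),
  value = c("25", "3")
)

stats_split <- separate(
  stats_long, name,
  into  = c("winner_loser", "player", "stat"),
  sep   = "_",
  extra = "merge",  # "tot_attacks" stays in one piece
  fill  = "right"   # "l_rank" gets NA in the third column
)
stats_split
```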

Use `rename` from the `dplyr` package to rename `w_player1`, `w_player2`, `l_player1`, and `l_player2`.
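
For reference, `rename` takes `new_name = old_name` pairs. The new names below are hypothetical, since these notes do not record which names were chosen.

```r
library(dplyr)

matches <- tibble(
  w_player1 = "Ana", w_player2 = "Bea",
  l_player1 = "Cleo", l_player2 = "Dana"
)

# New names on the left, existing names on the right.
matches_renamed <- matches %>%
  rename(
    w_p1_name = w_player1,
    w_p2_name = w_player2,
    l_p1_name = l_player1,
    l_p2_name = l_player2
  )
names(matches_renamed)
```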

Use `pivot_wider` from the `tidyr` package to pivot the `name` variable from long to wide.
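
A sketch of the reverse pivot on hypothetical data, spreading the values of `name` back out into one column per statistic.

```r
library(tidyr)

# Hypothetical long table: one row per match and statistic.
stats_long <- tibble::tibble(
  match_id = c(1, 1, 2, 2),
  name     = c("tot_aces", "tot_kills", "tot_aces", "tot_kills"),
  value    = c(3, 12, 1, 9)
)

# One row per match, one column per distinct value of `name`.
stats_wide <- pivot_wider(stats_long, names_from = name, values_from = value)
stats_wide
```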

Use `str_to_upper` from the `stringr` package to convert the `winner_loser` values `w` and `l` to uppercase.
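
A one-step sketch on a hypothetical `winner_loser` column:

```r
library(dplyr)
library(stringr)

players <- tibble(winner_loser = c("w", "l", "w"))

# Overwrite the column with its uppercase version.
players_upper <- players %>%
  mutate(winner_loser = str_to_upper(winner_loser))
players_upper
```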

Add unique row numbers for each match using `mutate` with `row_number` from the `dplyr` package.
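
A sketch assuming the table has one row per match, so the row number can serve as a match identifier. The `match_id` name and the score strings are hypothetical.

```r
library(dplyr)

matches <- tibble(score = c("21-18, 21-15", "21-19, 18-21, 15-12"))

# row_number() with no arguments numbers the rows 1, 2, 3, ...
matches_id <- matches %>%
  mutate(match_id = row_number())
matches_id
```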

Separate the `score` values into multiple rows using `separate_rows` from the `tidyr` package.
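
A sketch assuming each `score` value lists its sets separated by a comma and a space. That format is an assumption about the data, not something these notes confirm.

```r
library(tidyr)

matches <- tibble::tibble(
  match_id = 1:2,
  score    = c("21-18, 21-15", "21-19, 18-21, 15-12")
)

# One row per set; match_id is repeated for each set of the same match.
sets <- separate_rows(matches, score, sep = ", ")
sets
```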

Use `separate` from the `tidyr` package to split the actual scores into two columns, one for the winner's score `w_score` and another for the loser's score `l_score`.
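
A sketch on hypothetical set scores. `convert = TRUE` is an optional extra here that turns the resulting pieces into integers.

```r
library(tidyr)

sets <- tibble::tibble(
  match_id = c(1L, 1L),
  score    = c("21-18", "21-15")
)

# Split "21-18" into w_score = 21 and l_score = 18.
set_scores <- separate(
  sets, score,
  into    = c("w_score", "l_score"),
  sep     = "-",
  convert = TRUE
)
set_scores
```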

Use `na_if` from the `dplyr` package to change the `Forfeit or other` value from the `score` variable to `NA`.
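
A sketch on a hypothetical character vector: `na_if` replaces every element equal to the second argument with `NA` and leaves the rest untouched.

```r
library(dplyr)

scores <- c("21-18, 21-15", "Forfeit or other")

scores_clean <- na_if(scores, "Forfeit or other")
scores_clean
```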

Use `str_remove` from the `stringr` package to remove scores that include `retired`.
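
Note that `str_remove` deletes only the first match of the pattern inside each string, not the whole value. The sketch below assumes the marker appears as a trailing ` retired`; the exact format in the data may differ.

```r
library(stringr)

# Hypothetical score strings; the second ends with a "retired" marker.
scores <- c("21-18, 21-15", "21-18, 5-2 retired")

scores_clean <- str_remove(scores, " retired")
scores_clean
```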

Determine how many times the winner's score `w_score` is greater than the loser's score `l_score` at least 1/3 of the time.

Use `summarize` from the `dplyr` package to create the summary statistics, including the number of matches, winning percentage, date of first match, and date of most recent match.
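
A sketch of such a per-player summary. The input table and all column names are hypothetical stand-ins.

```r
library(dplyr)

# Hypothetical long table: one row per player per match.
player_matches <- tibble(
  name         = c("Ana", "Ana", "Ana", "Bea"),
  winner_loser = c("W", "L", "W", "L"),
  date         = as.Date(c("2018-05-01", "2018-06-10", "2019-04-20", "2018-05-01"))
)

player_summary <- player_matches %>%
  group_by(name) %>%
  summarize(
    n_matches   = n(),
    pct_winner  = mean(winner_loser == "W"),  # share of rows that are wins
    first_match = min(date),
    last_match  = max(date)
  )
player_summary
```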

Use `type_convert` from the `readr` package to convert `character` class variables to `numeric`.
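
A sketch on a hypothetical data frame. `type_convert` re-guesses the type of every character column, so columns that hold only numbers become numeric while genuine text stays character.

```r
library(readr)

stats <- data.frame(
  name     = c("Ana", "Bea"),
  tot_aces = c("3", "1"),
  stringsAsFactors = FALSE
)

# Prints the guessed column specification as a message.
stats_num <- type_convert(stats)
str(stats_num)
```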

Use `summarize_all` from the `dplyr` package to calculate which fraction of the data is not `NA`.
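
A sketch on hypothetical columns: taking the mean of `!is.na(.)` gives the share of non-missing values in each column.

```r
library(dplyr)

stats <- tibble(
  tot_aces  = c(3, NA, 1, NA),
  tot_kills = c(12, 9, NA, 10)
)

# One row, with the non-missing fraction for every column.
frac_present <- stats %>%
  summarize_all(~ mean(!is.na(.)))
frac_present
```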

Use `summarize` from the `dplyr` package to determine each player's number of matches, winning percentage, average attacks, average errors, average kills, average aces, average serve errors, and total rows with data for years prior to 2019.
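
A reduced sketch with a single statistic. The table, the column names, and the use of `na.rm = TRUE` for the averages are assumptions; the other averages would follow the same pattern.

```r
library(dplyr)

player_matches <- tibble(
  name         = c("Ana", "Ana", "Ana", "Bea"),
  year         = c(2017, 2018, 2019, 2018),
  winner_loser = c("W", "L", "W", "W"),
  tot_aces     = c(2, NA, 5, 1)
)

before_2019 <- player_matches %>%
  filter(year < 2019) %>%
  group_by(name) %>%
  summarize(
    n_matches   = n(),
    pct_winner  = mean(winner_loser == "W"),
    avg_aces    = mean(tot_aces, na.rm = TRUE),  # ignore missing values
    n_with_data = sum(!is.na(tot_aces))          # rows that have the statistic
  )
before_2019
```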

The summary statistics are then used to explore how we could predict whether a player will win in 2019, using `geom_point` and logistic regression. Initially, David wanted to predict performance based on players' first-year performance. (NOTE - David mistakenly grouped by `year` and `age`. He catches this around 1:02:00.)
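
As a rough illustration of the logistic regression step, here is a sketch using base R's `glm` with `family = binomial`. The data, the `avg_kills` predictor, and the `won_2019` outcome are invented for the example and are not the model built in the screencast.

```r
# Hypothetical per-player data: an average from earlier years and a 0/1 outcome.
players <- data.frame(
  avg_kills = c(4, 5, 6, 7, 8, 9, 10, 11),
  won_2019  = c(0, 0, 1, 0, 1, 0, 1, 1)
)

# Logistic regression: models the log-odds of winning as linear in avg_kills.
model <- glm(won_2019 ~ avg_kills, data = players, family = binomial)

# Predicted probability of winning for a player averaging 9 kills.
p_win <- predict(model, newdata = data.frame(avg_kills = 9), type = "response")
p_win
```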

Use `year` from the `lubridate` package within a `group_by` to determine the `age` of each player given their `birthdate`.
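
A sketch assuming age is approximated as the difference between the match year and the birth year. The table and the `date` column are hypothetical.

```r
library(dplyr)
library(lubridate)

player_matches <- tibble(
  name      = c("Ana", "Ana", "Bea"),
  birthdate = as.Date(c("1990-03-15", "1990-03-15", "1985-11-02")),
  date      = as.Date(c("2018-05-01", "2019-04-20", "2018-05-01"))
)

# group_by() can create the grouping variable on the fly.
by_age <- player_matches %>%
  group_by(name, age = year(date) - year(birthdate)) %>%
  summarize(n_matches = n(), .groups = "drop")
by_age
```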

Turn the summary statistics at timestamp `42:00` into a `.` DOT `%>%` PIPE function.
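
Starting a pipeline with the placeholder `.` makes `%>%` return a reusable function instead of a result. A sketch with a hypothetical `summarize_players` helper and toy data:

```r
library(dplyr)

# A pipeline that begins with `.` becomes a function of its input.
summarize_players <- . %>%
  summarize(
    n_matches  = n(),
    pct_winner = mean(winner_loser == "W")
  )

player_matches <- tibble(
  name         = c("Ana", "Ana", "Bea"),
  winner_loser = c("W", "L", "W")
)

# The same summary can now be reused after any grouping.
by_player <- player_matches %>%
  group_by(name) %>%
  summarize_players()
by_player
```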

Summary of screencast.