Himalayan Climbers
Notable topics: Data Manipulation, Empirical Bayes, Logistic Regression Model
Recorded on: 2020-09-21
Timestamps by: Eric Fletcher
Screencast
Timestamps
Create a geom_col chart to visualize the top 50 tallest mountains.
Use fct_reorder to reorder the peak_name factor levels by sorting along the height_metres variable.
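The two steps above can be sketched as follows. This is a minimal illustration, not the screencast's code: the `peaks` tibble is a toy stand-in with approximate heights, and only the column names `peak_name` and `height_metres` come from the notes.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Toy stand-in for the peaks data (assumption; heights are approximate)
peaks <- tibble(
  peak_name = c("Cho Oyu", "Everest", "Makalu", "Lhotse", "Kangchenjunga"),
  height_metres = c(8188, 8849, 8485, 8516, 8586)
)

top_peaks <- peaks %>%
  arrange(desc(height_metres)) %>%
  head(50) %>%                                   # keep the 50 tallest
  # Reorder factor levels by height so the bars are drawn in sorted order
  mutate(peak_name = fct_reorder(peak_name, height_metres))

p <- ggplot(top_peaks, aes(height_metres, peak_name)) +
  geom_col() +
  labs(x = "Height (metres)", y = NULL)
```

Without `fct_reorder`, the bars would follow alphabetical order, because a character column is converted to a factor with alphabetical levels.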
Use summarize with across to get the total number of climbs, climbers, deaths, and first year climbed.
Use mutate to calculate the percent death rate for members and hired staff.
Use inner_join and select to join with the peaks dataset by peak_id.
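A hedged sketch of the summarize, mutate, and join steps above. The toy `expeditions` and `peaks` tibbles and the column names `members`, `member_deaths`, `hired_staff`, `hired_staff_deaths`, and `year` are assumptions made for illustration; only `peak_id` and the general approach come from the notes.

```r
library(dplyr)

# Toy stand-ins for the real datasets (assumption)
expeditions <- tibble(
  peak_id = c("EVER", "EVER", "AMAD", "AMAD"),
  year = c(1953, 1996, 1961, 2000),
  members = c(10, 12, 4, 6),
  member_deaths = c(0, 2, 0, 1),
  hired_staff = c(20, 8, 2, 3),
  hired_staff_deaths = c(0, 1, 0, 0)
)
peaks <- tibble(
  peak_id = c("EVER", "AMAD"),
  peak_name = c("Everest", "Ama Dablam"),
  height_metres = c(8849, 6812)  # approximate heights
)

peak_summary <- expeditions %>%
  group_by(peak_id) %>%
  summarize(
    n_climbs = n(),
    first_climb = min(year),
    # Sum the four count columns in one call
    across(c(members, member_deaths, hired_staff, hired_staff_deaths), sum)
  ) %>%
  mutate(
    pct_death = member_deaths / members,
    pct_hired_staff_deaths = hired_staff_deaths / hired_staff
  ) %>%
  inner_join(
    peaks %>% select(peak_id, peak_name, height_metres),
    by = "peak_id"
  )
```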
Discuss statistical noise, how it distorts the death rate for mountains with few climbs, and how to account for it with statistical methods such as Beta Binomial Regression and Empirical Bayes.
Further description of Empirical Bayes and how it avoids overestimating the death rate for mountains with few climbers.
Recommended reading: Introduction to Empirical Bayes: Examples from Baseball Statistics by David Robinson
Use the ebbr package (Empirical Bayes for Binomial in R) to create an Empirical Bayes estimate for each mountain by fitting a prior distribution across the data and shrinking each raw death rate up or down toward that prior.
Use a geom_point chart to visualize the difference between the raw death rate and new ebbr fitted death rate.
Use geom_point to visualize how deadly each mountain is, with geom_errorbarh showing the 95% credible interval from its lower to its upper bound.
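To make the shrinkage idea above concrete, here is a minimal base-R sketch. It does not call ebbr. Instead it fits a Beta prior by the method of moments, a simpler stand-in that is not necessarily how ebbr estimates its prior. The data and all variable names are made up for illustration.

```r
# Toy data (assumption): climbers and deaths per mountain
peaks <- data.frame(
  peak = c("A", "B", "C", "D", "E"),
  climbers = c(2000, 500, 50, 10, 1200),
  deaths = c(30, 10, 3, 1, 12)
)
peaks$raw <- peaks$deaths / peaks$climbers

# Fit a Beta(alpha0, beta0) prior to the raw rates by the method of moments
m <- mean(peaks$raw)
v <- var(peaks$raw)
common <- m * (1 - m) / v - 1
alpha0 <- m * common
beta0 <- (1 - m) * common

# Posterior for each mountain is Beta(deaths + alpha0, survivors + beta0)
post_a <- peaks$deaths + alpha0
post_b <- peaks$climbers - peaks$deaths + beta0

peaks$fitted <- post_a / (post_a + post_b)   # shrunken death rate
peaks$low <- qbeta(0.025, post_a, post_b)    # 95% credible interval
peaks$high <- qbeta(0.975, post_a, post_b)
```

Mountains with few climbers (such as D, with 10) are pulled strongly toward the prior mean and get wide intervals, while mountains with many climbers stay close to their raw rate.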
Use geom_point to visualize the relationship between death rate and height of mountain.
There is not a clear relationship, but David does briefly mention how one could use Beta Binomial Regression to further inspect for possible relationships / trends.
Use geom_histogram and geom_boxplot to visualize the distribution of time it took climbers to go from basecamp to the mountain’s high point for successful climbs only.
Use mutate to calculate the number of days it took climbers to get from basecamp to the highpoint.
Add a column using case_when and str_detect so that any termination_reason containing the word Success is recoded to Success, then use a vector with %in% to set several other termination_reason values to NA and recode the rest to Failed.
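The day calculation and the recoding described above might look like this. The column names `basecamp_date` and `highpoint_date`, the example reason strings, and the `na_reasons` vector are assumptions made for illustration; the screencast's actual values may differ.

```r
library(dplyr)
library(stringr)

# Hypothetical list of reasons to treat as missing (assumption)
na_reasons <- c("Unknown", "Attempt rumoured", "Did not attempt climb")

# Toy stand-in for the expeditions data (assumption)
expeditions <- tibble(
  termination_reason = c(
    "Success (main peak)", "Bad weather (storms, high winds)",
    "Unknown", "Success (subpeak)", "Did not attempt climb"
  ),
  basecamp_date = as.Date(c("2000-04-01", "2001-04-05", NA, "2003-09-10", NA)),
  highpoint_date = as.Date(c("2000-05-15", "2001-04-30", NA, "2003-10-01", NA))
)

expeditions <- expeditions %>%
  mutate(
    # Days from basecamp to the high point
    days_to_highpoint = as.numeric(
      difftime(highpoint_date, basecamp_date, units = "days")
    ),
    # Conditions are checked in order; the first match wins
    success = case_when(
      str_detect(termination_reason, "Success") ~ "Success",
      termination_reason %in% na_reasons ~ NA_character_,
      TRUE ~ "Failed"
    )
  )
```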
Use fct_lump to show the top 10 mountains while lumping the other factor levels (mountains) into Other.
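A small sketch of the lumping step, using toy data and `n = 2` instead of 10 so the effect is visible. The `climbs` tibble is made up for illustration.

```r
library(dplyr)
library(forcats)

# Toy data (assumption): one row per climb
climbs <- tibble(
  peak_name = c(
    "Everest", "Everest", "Everest",
    "Cho Oyu", "Cho Oyu",
    "Ama Dablam", "Manaslu"
  )
)

lumped <- climbs %>%
  # Keep the 2 most frequent peaks; everything else becomes "Other"
  mutate(peak_name = fct_lump(peak_name, n = 2)) %>%
  count(peak_name)
```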
For just Mount Everest, use geom_histogram and geom_density with fill = success to visualize the days from basecamp to highpoint for climbs that ended in success, failure or other.
For just Mount Everest, use geom_histogram to see the distribution of climbs per year.
For just Mount Everest, use geom_line and geom_point to visualize pct_death over time by decade.
Use mutate with pmax and integer division to create a decade variable that lumps together the data for 1970 and before.
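The decade trick above can be sketched as follows. The `year` column name and the toy years are assumptions. Integer division truncates each year to its decade, and `pmax` then raises anything earlier than 1970 to 1970, so all early climbs share one bucket.

```r
library(dplyr)

# Toy data (assumption)
everest <- tibble(year = c(1953, 1969, 1975, 1996, 2019))

everest <- everest %>%
  mutate(decade = pmax(10 * (year %/% 10), 1970))
```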
Write a function for summary statistics such as n_climbs, pct_success, first_climb, pct_death, and pct_hired_staff_death.
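One way such a helper could be written is shown below. The function name `summarize_expeditions`, the input column names, and the toy data are hypothetical; only the output statistic names come from the notes. The function works on whatever grouping the caller has applied, so the same code can summarize by decade, by peak, or overall.

```r
library(dplyr)

# Hypothetical helper: summarizes an (optionally grouped) expeditions table
summarize_expeditions <- function(tbl) {
  tbl %>%
    summarize(
      n_climbs = n(),
      pct_success = mean(success == "Success", na.rm = TRUE),
      first_climb = min(year),
      across(c(members, member_deaths, hired_staff, hired_staff_deaths), sum)
    ) %>%
    mutate(
      pct_death = member_deaths / members,
      pct_hired_staff_death = hired_staff_deaths / hired_staff
    )
}

# Toy data (assumption)
everest <- tibble(
  decade = c(1970, 1970, 1990, 1990),
  year = c(1975, 1978, 1996, 1999),
  success = c("Success", "Failed", "Success", "Success"),
  members = c(10, 10, 20, 20),
  member_deaths = c(1, 0, 1, 1),
  hired_staff = c(5, 5, 10, 10),
  hired_staff_deaths = c(0, 1, 0, 0)
)

by_decade <- everest %>%
  group_by(decade) %>%
  summarize_expeditions()
```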
For just Mount Everest, use geom_line and geom_point to visualize pct_success over time by decade.
For just Mount Everest, use geom_line and geom_point to visualize pct_hired_staff_deaths over time by decade.
David decides to visualize the pct_hired_staff_deaths and pct_death charts together on the same plot.
For just Mount Everest, fit a logistic regression model to predict the probability of death, using format.pval to format the p.value.
Use fct_lump to lump together all expedition_role factors except for the n most frequent.
Use group_by with integer division and summarize to calculate n_climbers and pct_death for age bucketed into decades.
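A minimal sketch of the age bucketing above. The `members` tibble and its `age` and `died` columns are toy assumptions.

```r
library(dplyr)

# Toy data (assumption): one row per climber
members <- tibble(
  age = c(25, 29, 31, 38, 45, 52),
  died = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

age_summary <- members %>%
  # Integer division truncates each age to the start of its decade
  group_by(age = 10 * (age %/% 10)) %>%
  summarize(
    n_climbers = n(),
    pct_death = mean(died)
  )
```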
Use geom_point and geom_errorbarh to visualize the logistic regression model with confidence intervals.
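The modelling steps above can be sketched in base R as follows. The data are simulated, the predictors `age` and `hired` are assumptions, and the intervals are simple Wald intervals on the odds-ratio scale, which may differ from the intervals used in the screencast.

```r
# Simulated data (assumption), for illustration only
set.seed(2020)
n <- 2000
members <- data.frame(
  age = sample(18:70, n, replace = TRUE),
  hired = rbinom(n, 1, 0.3) == 1
)
p_death <- plogis(-3 + 0.02 * (members$age - 40) + 0.5 * members$hired)
members$died <- rbinom(n, 1, p_death) == 1

# Logistic regression: probability of death as a function of age and role
model <- glm(died ~ age + hired, data = members, family = "binomial")

coefs <- as.data.frame(summary(model)$coefficients)
# format.pval turns raw p-values into readable strings
coefs$p_value <- format.pval(coefs[["Pr(>|z|)"]], digits = 2)

# Odds ratios with Wald 95% confidence intervals
coefs$odds_ratio <- exp(coefs$Estimate)
coefs$conf_low <- exp(coefs$Estimate - 1.96 * coefs[["Std. Error"]])
coefs$conf_high <- exp(coefs$Estimate + 1.96 * coefs[["Std. Error"]])
```

The `odds_ratio`, `conf_low`, and `conf_high` columns are the values that a point-and-errorbar plot of the model would display, one row per term.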
Summary of screencast