College Majors and Income

Graphing for EDA (Exploratory Data Analysis)

Notable topics: Graphing for EDA (Exploratory Data Analysis)

Recorded on: 2018-10-14

Timestamps by: Alex Cookson

## Timestamps

Using read_csv function to import data directly from GitHub to R (without cloning the repository)

Creating a histogram (geom_histogram), then a boxplot (geom_boxplot), to explore the distribution of salaries

Using fct_reorder function to sort boxplot of college majors by salary
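
A minimal sketch of the sorting step, using made-up data rather than the dataset from the screencast (the column names `major_category` and `median` and all values are assumptions). By default `fct_reorder` orders the factor levels by the median of the second argument within each level:

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Toy data: hypothetical categories and salaries, for illustration only
majors <- tibble(
  major_category = c("Arts", "Arts", "Engineering", "Engineering",
                     "Business", "Business"),
  median = c(30000, 34000, 60000, 70000, 40000, 46000)
)

# Reorder factor levels by the per-category median of `median`
majors_sorted <- majors %>%
  mutate(major_category = fct_reorder(major_category, median))

# Boxplot with categories sorted by salary; coord_flip() keeps labels readable
p <- ggplot(majors_sorted, aes(major_category, median)) +
  geom_boxplot() +
  coord_flip()
```

Calling `levels(majors_sorted$major_category)` shows the new ordering that the plot will follow.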

Using dollar_format function from scales package to convert scientific notation to dollar format (e.g., "4e+04" becomes "$40,000")
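
As a small illustration, `dollar_format()` returns a labelling function, which can be called directly or passed to a ggplot2 scale. The variable name `to_dollars` is just a placeholder:

```r
library(scales)

# dollar_format() builds a function that formats numbers as dollar amounts
to_dollars <- dollar_format()

to_dollars(40000)
#> [1] "$40,000"

# In a plot, the same function would typically be supplied as axis labels, e.g.
# scale_y_continuous(labels = dollar_format())
```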

Creating a dotplot (geom_point) of 20 top-earning majors (includes adjusting axis, using the colour aesthetic, and adding error bars)

Using str_to_title function to convert string from ALL CAPS to Title Case
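
For example, applied to an illustrative all-caps string (the value here is an assumption, not taken from the screencast):

```r
library(stringr)

# Convert an ALL CAPS label to Title Case
major <- "PETROLEUM ENGINEERING"
major_title <- str_to_title(major)

major_title
#> [1] "Petroleum Engineering"
```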

Creating a Bland-Altman graph to explore relationship between sample size and median salary

Using geom_text_repel function from ggrepel package to get text labels on scatter plot points

Using count function's wt argument to specify what should be counted (default is number of rows)
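
A short sketch of the difference, using a toy table with a hypothetical `total` column holding the number of graduates per major:

```r
library(dplyr)

# Toy data: hypothetical majors and graduate counts
majors <- tibble(
  major_category = c("Arts", "Arts", "Engineering"),
  total = c(100, 250, 400)
)

# Default: n is the number of rows (majors) in each category
by_rows <- count(majors, major_category)

# With wt: n is the sum of `total` within each category
by_total <- count(majors, major_category, wt = total)
```

Here `by_rows$n` is 2 and 1, while `by_total$n` is 350 and 400.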

Spicing up a dull bar graph by adding a redundant colour aesthetic (trick from Julia Silge)

Starting to explore relationship between gender and salary

Creating a stacked bar graph (geom_col) of gender breakdown within majors

Using summarise_at to aggregate men and women from majors into categories of majors
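
One way this aggregation might look, assuming hypothetical `men` and `women` count columns (names and values are made up). Note that `summarise_at` is superseded in current dplyr by `summarise(across(...))`, though it still works:

```r
library(dplyr)

# Toy data: hypothetical counts of men and women per major
majors <- tibble(
  major_category = c("Arts", "Arts", "Engineering", "Engineering"),
  men = c(100, 50, 900, NA),
  women = c(300, 150, 200, 100)
)

# Sum both columns within each major category, ignoring missing values
by_category <- majors %>%
  group_by(major_category) %>%
  summarise_at(vars(men, women), sum, na.rm = TRUE)
```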

Graphing scatterplot (geom_point) of share of women and median salary

Using geom_smooth function to add a line of best fit to scatterplot above

Explanation of why not to aggregate first when performing a statistical test (including explanation of Simpson's Paradox)
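
A toy illustration of Simpson's Paradox, not the example used in the screencast: the data below are invented so that the trend within each group is positive while the trend across the pooled data is negative.

```r
# Toy data: two hypothetical groups with made-up x and y values
grads <- data.frame(
  group = rep(c("A", "B"), each = 3),
  x = c(1, 2, 3, 6, 7, 8),
  y = c(10, 11, 12, 2, 3, 4)
)

# Slope when all rows are pooled together
overall_slope <- coef(lm(y ~ x, data = grads))[["x"]]

# Slope fitted separately within each group
within_slopes <- sapply(
  split(grads, grads$group),
  function(d) coef(lm(y ~ x, data = d))[["x"]]
)
```

In this sketch `overall_slope` is negative even though both entries of `within_slopes` are positive, which is why the level at which data are combined matters before testing.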

Fixing geom_smooth so that we get one overall line while still being able to map to the colour aesthetic

Predicting median salary from share of women with weighted linear regression (to take sample sizes into account)
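
A minimal sketch of a weighted fit with base R's `lm`, where the `weights` argument lets majors with larger samples count for more. The column names and values are assumptions for illustration:

```r
# Toy data: hypothetical share of women, median salary and sample size
grads <- data.frame(
  share_women = c(0.1, 0.3, 0.5, 0.7, 0.9),
  median = c(70000, 60000, 45000, 40000, 32000),
  sample_size = c(20, 150, 400, 300, 60)
)

# Weighted linear regression: each row is weighted by its sample size
fit <- lm(median ~ share_women, data = grads, weights = sample_size)

slope <- coef(fit)[["share_women"]]
```

`summary(fit)` would then report the estimate, standard error and p-value for `share_women`.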

Using nest function and tidy function from the broom package to apply a linear model to many categories at once
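
The nest-then-model pattern can be sketched as follows, on invented data with hypothetical column names. The `nest(data = -major_category)` form assumes tidyr 1.0 or later; the syntax used at the time of the recording may have differed:

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Toy data: two hypothetical categories with made-up values
grads <- tibble(
  major_category = rep(c("A", "B"), each = 4),
  share_women = c(0.1, 0.3, 0.6, 0.9, 0.2, 0.4, 0.5, 0.8),
  median = c(70000, 62000, 50000, 41000, 38000, 36000, 37000, 33000)
)

models <- grads %>%
  nest(data = -major_category) %>%               # one row per category
  mutate(
    model = map(data, ~ lm(median ~ share_women, data = .x)),
    tidied = map(model, tidy)                    # coefficients as a data frame
  ) %>%
  unnest(tidied) %>%
  filter(term == "share_women")                  # keep only the slope term
```

The result has one row per category, with `estimate` and `p.value` columns for the slope.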

Using p.adjust function to adjust p-values to correct for multiple testing (using FDR, False Discovery Rate)
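
As a small worked example with made-up p-values, `method = "fdr"` applies the Benjamini-Hochberg adjustment:

```r
# Hypothetical p-values from five separate tests
p_values <- c(0.001, 0.01, 0.03, 0.04, 0.5)

# Adjust for multiple testing by controlling the false discovery rate
p_adjusted <- p.adjust(p_values, method = "fdr")

p_adjusted
#> [1] 0.005 0.025 0.050 0.050 0.500
```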

Showing how to add an appendix to an RMarkdown file with code that doesn't run when compiled

Using fct_lump function to aggregate major categories into the top four and an "Other" category
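
A brief sketch with an invented factor: `fct_lump(f, n = 4)` keeps the four most frequent levels and collapses the rest into "Other". In recent forcats versions `fct_lump_n` is the preferred spelling of the same operation.

```r
library(forcats)

# Toy factor: hypothetical categories with different frequencies
cats <- factor(c(
  rep("Engineering", 5), rep("Education", 4), rep("Business", 3),
  rep("Arts", 2), "Law", "Health"
))

# Keep the top four categories; everything else becomes "Other"
lumped <- fct_lump(cats, n = 4)

levels(lumped)
#> [1] "Arts"        "Business"    "Education"   "Engineering" "Other"
```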

Mapping sample size to the size aesthetic within the aes function

Using ggplotly function from plotly package to create an interactive scatterplot (tooltips appear when moused over)

Exploring IQR (Inter-Quartile Range) of salaries by major