College Majors and Income

Graphing for EDA (Exploratory Data Analysis)

Published: October 14, 2018

Notable topics: Graphing for EDA (Exploratory Data Analysis)

Recorded on: 2018-10-14

Timestamps by: Alex Cookson

Timestamps

read_csv

Using the read_csv function to import data directly from GitHub into R (without cloning the repository)

geom_histogram
geom_boxplot

Creating a histogram (geom_histogram), then a boxplot (geom_boxplot), to explore the distribution of salaries

fct_reorder

Using fct_reorder function to sort boxplot of college majors by salary

dollar_format
scales

Using dollar_format function from scales package to convert scientific notation to dollar format (e.g., "4e+04" becomes "$40,000")
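
The conversion described above can be sketched as follows. This is a minimal illustration that assumes the scales package is installed; the salary values and the name `to_dollars` are made up for the example and do not come from the screencast.

```r
library(scales)

# Hypothetical salary values; R may print numbers like these as 4e+04
salaries <- c(40000, 110000)

# dollar_format() returns a formatting function, which is then applied to the values
to_dollars <- dollar_format()
formatted <- to_dollars(salaries)
print(formatted)

# In a ggplot2 chart the same formatter can be passed to an axis scale, e.g.
# scale_y_continuous(labels = dollar_format())
```

Passing the formatter to the axis scale changes only the labels, so the underlying numeric values stay available for plotting.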

geom_point

Creating a dotplot (geom_point) of 20 top-earning majors (includes adjusting axis, using the colour aesthetic, and adding error bars)

str_to_title

Using str_to_title function to convert string from ALL CAPS to Title Case
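
A small sketch of that conversion, assuming the stringr package is installed. The major names below are illustrative stand-ins, not values quoted from the dataset.

```r
library(stringr)

# Hypothetical ALL CAPS major names
majors <- c("PETROLEUM ENGINEERING", "ACTUARIAL SCIENCE")

# str_to_title() capitalises the first letter of each word and lowercases the rest
titled <- str_to_title(majors)
print(titled)
```

Note that str_to_title capitalises every word, so short words such as "AND" come out as "And".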

Creating a Bland-Altman graph to explore the relationship between sample size and median salary

geom_text_repel
ggrepel

Using geom_text_repel function from ggrepel package to get text labels on scatter plot points

count

Using count function's wt argument to specify what should be counted (default is number of rows)
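
The difference between the default behaviour and the wt argument can be sketched like this, assuming the dplyr package is installed. The data frame and its column names (`major_category`, `graduates`) are hypothetical.

```r
library(dplyr)

# Hypothetical data: one row per major, with a number of graduates for each
majors <- data.frame(
  major_category = c("Engineering", "Engineering", "Arts"),
  graduates = c(100, 250, 80)
)

# Default: counts the number of rows in each category
row_counts <- count(majors, major_category)

# With wt: sums the graduates column within each category instead
weighted_counts <- count(majors, major_category, wt = graduates)

print(row_counts)
print(weighted_counts)
```

In both cases the result is stored in a column named `n`; only what gets added up differs.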

Spicing up a dull bar graph by adding a redundant colour aesthetic (trick from Julia Silge)

Starting to explore the relationship between gender and salary

geom_col

Creating a stacked bar graph (geom_col) of gender breakdown within majors

summarise_at

Using summarise_at to aggregate men and women from majors into categories of majors

geom_point

Graphing a scatterplot (geom_point) of share of women against median salary

geom_smooth

Using geom_smooth function to add a line of best fit to scatterplot above

Explanation of why not to aggregate first when performing a statistical test (including explanation of Simpson's Paradox)

geom_smooth

Fixing geom_smooth so that we get one overall line while still being able to map to the colour aesthetic

lm

Predicting median salary from share of women with weighted linear regression (to take sample sizes into account)
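
A weighted regression of this kind might look like the sketch below, which uses only base R. The data values and the column names (`share_women`, `median_salary`, `sample_size`) are invented for illustration and are not the screencast's data.

```r
# Hypothetical data: one row per major
majors <- data.frame(
  share_women   = c(0.10, 0.25, 0.40, 0.55, 0.70, 0.85),
  median_salary = c(62000, 55000, 48000, 42000, 38000, 33000),
  sample_size   = c(30, 120, 400, 250, 90, 15)
)

# weights = sample_size gives majors with more respondents more influence on the fit
fit <- lm(median_salary ~ share_women, data = majors, weights = sample_size)
print(summary(fit))
```

Without the weights argument, a major with 15 respondents would count as much as one with 400.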

nest
tidy
broom

Using the nest function (from tidyr) and the tidy function (from the broom package) to fit a linear model to many categories at once

p.adjust

Using p.adjust function to adjust p-values to correct for multiple testing (using FDR, False Discovery Rate)
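
As a brief sketch of the adjustment, using base R only. The p-values below are made up for the example.

```r
# Hypothetical p-values, e.g. one per major category
p_values <- c(0.001, 0.01, 0.04, 0.20, 0.50)

# method = "fdr" applies the Benjamini-Hochberg false discovery rate correction
adjusted <- p.adjust(p_values, method = "fdr")
print(adjusted)
```

Adjusted p-values are never smaller than the originals, so some results that look significant in isolation may no longer pass a threshold once multiple testing is taken into account.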

Showing how to add an appendix to an R Markdown file containing code that doesn't run when the document is compiled

fct_lump

Using fct_lump function to aggregate major categories into the top four and an "Other" category
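
The lumping step can be sketched as below, assuming the forcats package is installed. The six single-letter categories are placeholders rather than real major categories.

```r
library(forcats)

# Hypothetical categories with distinct frequencies: A appears 6 times, F once
categories <- factor(rep(c("A", "B", "C", "D", "E", "F"), times = 6:1))

# Keep the four most frequent levels and collapse the rest into "Other"
lumped <- fct_lump(categories, n = 4)
print(table(lumped))
```

Lumping rare categories this way keeps a colour legend readable while still accounting for every row.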

Adding sample size to the size aesthetic within the aes function

ggplotly
plotly

Using ggplotly function from plotly package to create an interactive scatterplot (tooltips appear when moused over)

Exploring the IQR (Interquartile Range) of salaries by major