Seattle Pet Names

Hypergeometric hypothesis testing, Adjusting for multiple hypothesis testing


March 15, 2019

Notable topics: Hypergeometric hypothesis testing, Adjusting for multiple hypothesis testing

Recorded on: 2019-03-15

Timestamps by: Alex Cookson





Using the mdy function from the lubridate package to convert character-formatted dates to Date class
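A minimal sketch of the date conversion, assuming month-day-year strings. The example values are hypothetical and not taken from the screencast's dataset.

```r
library(lubridate)

# Hypothetical character-formatted dates in month-day-year order
issue_dates <- c("March 15 2019", "12/18/2018")

# mdy() parses month-day-year strings into Date objects
parsed <- mdy(issue_dates)
class(parsed)
```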


Exploratory bar graph showing top species of cats, using geom_col function


Specifying the facet_wrap function's ncol argument to stack graphs vertically (instead of side-by-side)
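A rough sketch of a bar graph built with geom_col and stacked vertically with ncol = 1. The pets data frame and its counts are made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical pre-counted data: one row per species/name pair
pets <- tibble(
  species = c("Cat", "Cat", "Dog", "Dog"),
  name    = c("Luna", "Lucy", "Lucy", "Max"),
  n       = c(111, 102, 337, 270)
)

# geom_col() draws bars whose heights come straight from the n column;
# ncol = 1 puts the facets in a single column, so they stack vertically
p <- ggplot(pets, aes(name, n)) +
  geom_col() +
  facet_wrap(~ species, ncol = 1, scales = "free_y") +
  coord_flip()
```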

Asking, "Are some animal names associated with particular dog breeds?"


Explanation of add_count function
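To illustrate the difference from count, here is a small sketch on a hypothetical dogs table. The column name name_total is an assumption chosen for this example.

```r
library(dplyr)

# Hypothetical licence records: one row per dog
dogs <- tibble(
  name  = c("Lucy", "Lucy", "Max", "Lucy", "Max", "Bella"),
  breed = c("Labrador", "Poodle", "Labrador", "Labrador", "Poodle", "Poodle")
)

# count() collapses the data to one row per name
dogs %>% count(name)

# add_count() keeps every row and appends the group size as a new column
with_totals <- dogs %>% add_count(name, name = "name_total")
```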

Adding up various metrics (e.g., number of names overall, number of breeds overall), but note a mistake that gets fixed at 17:05

Calculating a ratio for names that appear over-represented within a breed, then explaining how small samples can be misleading
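One way to see why small samples mislead is to compute such a ratio by hand. The counts below are hypothetical, and over_ratio is a helper defined only for this sketch.

```r
# Hypothetical counts: the name appears on 100 of 10,000 dogs overall (1%)
name_total <- 100
total_dogs <- 10000

# Ratio of the name's share within a breed to its share among all dogs
over_ratio <- function(n_in_breed, breed_total) {
  (n_in_breed / breed_total) / (name_total / total_dogs)
}

small_breed <- over_ratio(1, 2)      # one dog out of two: a ratio of 50
large_breed <- over_ratio(30, 1000)  # 30 dogs out of 1,000: a ratio of 3
```

The tiny breed produces the far larger ratio even though it rests on a single dog, which is why a ratio alone is not enough.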

Spotting and fixing an aggregation mistake

Explanation of how to investigate which names might be over-represented within a breed

Explanation of how to use hypergeometric distribution to test for name over-representation
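As a sketch of the underlying formula, suppose (symbols chosen here for illustration) there are N dogs in total, K of them have the name, the breed contains n dogs, and k dogs in the breed have the name. If names were assigned independently of breed, the chance of seeing k or more such dogs would be:

```latex
\[
P(X \ge k) \;=\; \sum_{i=k}^{\min(n,\,K)}
  \frac{\binom{K}{i}\,\binom{N-K}{\,n-i\,}}{\binom{N}{n}}
\]
```

A small value suggests the name is over-represented within the breed.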


Using phyper function to calculate p-values for a one-sided hypergeometric test
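A minimal sketch of the one-sided test with hypothetical counts. In phyper(q, m, n, k), m is the number of dogs with the name, n the number without it, and k the size of the breed. Passing observed - 1 with lower.tail = FALSE gives the probability of observed or more.

```r
# Hypothetical counts
total_dogs  <- 10000  # all dogs
name_total  <- 100    # dogs with the name
breed_total <- 1000   # dogs of the breed
observed    <- 30     # dogs of the breed that have the name

# P(X >= observed): lower.tail = FALSE returns P(X > q), so q is observed - 1
p_over <- phyper(observed - 1,
                 name_total,
                 total_dogs - name_total,
                 breed_total,
                 lower.tail = FALSE)

# Cross-check by summing the point probabilities directly
p_check <- sum(dhyper(observed:min(name_total, breed_total),
                      name_total,
                      total_dogs - name_total,
                      breed_total))
```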

Additional explanation of hypergeometric distribution

First investigation of why and how to interpret a p-value histogram (second at 29:45, third at 37:45, and answer at 39:30)

Noticing that zeros are missing (i.e., breed/name combinations with 0 dogs do not appear in the data), which matters for the hypergeometric test


Using complete function to turn implicit zeros (for breed/name combination) into explicit zeros
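A small sketch of the idea on a hypothetical count table, where the Poodle/Max combination is absent because no such dog exists.

```r
library(dplyr)
library(tidyr)

# Hypothetical counts; there is no row for Poodle/Max
counts <- tibble(
  breed = c("Labrador", "Labrador", "Poodle"),
  name  = c("Lucy", "Max", "Lucy"),
  n     = c(3L, 2L, 4L)
)

# complete() adds every breed/name combination; fill sets n to 0 for new rows
completed <- counts %>%
  complete(breed, name, fill = list(n = 0L))
```

In a real table, any per-breed or per-name totals stored as columns would also need values on the newly added rows.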

Second investigation of p-value histogram (after adding in implicit zeros)


Explanation of multiple hypothesis testing and correction methods (e.g., Bonferroni, Holm), and applying using p.adjust function
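A short sketch of both corrections on a made-up vector of p-values.

```r
# Hypothetical p-values from five tests
p_values <- c(0.001, 0.01, 0.02, 0.04, 0.30)

# Bonferroni multiplies each p-value by the number of tests (capped at 1)
bonferroni <- p.adjust(p_values, method = "bonferroni")

# Holm is a step-down variant that is never more conservative than Bonferroni
holm <- p.adjust(p_values, method = "holm")
```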


Explanation of False Discovery Rate (FDR) control as a method for correcting for multiple hypothesis testing, and applying using p.adjust function
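A sketch of FDR control on the same kind of made-up p-values. In p.adjust, method = "fdr" is an alias for the Benjamini-Hochberg method ("BH").

```r
# Hypothetical p-values from five tests
p_values <- c(0.001, 0.01, 0.02, 0.04, 0.30)

# Benjamini-Hochberg adjustment; "fdr" and "BH" give the same result
fdr <- p.adjust(p_values, method = "fdr")
bh  <- p.adjust(p_values, method = "BH")

# Tests kept when controlling the false discovery rate at 5%
significant <- fdr <= 0.05
```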

Third investigation of p-value histogram, to hunt for under-represented names

Answer to why the p-value distribution is not well-behaved


Using the crossing function to create a simulated dataset to explore how different values affect the p-value

Explanation of how the total number of names and the total number of breeds affect the p-value
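A rough sketch of a simulated grid built with crossing. The specific totals and the fixed observed count are assumptions made for this example, not values from the screencast.

```r
library(dplyr)
library(tidyr)

# crossing() builds every combination of the supplied values (3 x 3 = 9 rows)
simulated <- crossing(
  name_total  = c(100, 200, 300),
  breed_total = c(100, 200, 300)
) %>%
  mutate(
    total    = 10000,  # hypothetical number of dogs overall
    observed = 5,      # hypothetical count for the breed/name combination
    # One-sided p-value for seeing `observed` or more dogs with the name
    p_value  = phyper(observed - 1,
                      name_total,
                      total - name_total,
                      breed_total,
                      lower.tail = FALSE)
  )
```

Holding the observed count fixed while the totals grow shows how the same count becomes less surprising as more dogs share the name or the breed.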

More general explanation of what different shapes of p-value histogram might indicate


Renaming variables within a transmute function, using backticks to get names with spaces in them
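A minimal sketch of the renaming, using a hypothetical results table. Backticks let a column name contain spaces or other characters that are not valid in a plain R name.

```r
library(dplyr)

# Hypothetical results, one row per breed/name combination
results <- tibble(
  breed   = c("Labrador", "Poodle"),
  name    = c("Lucy", "Max"),
  n       = c(30L, 12L),
  p_value = c(0.0001, 0.02)
)

# transmute() keeps only the listed columns, renamed as they are selected
display <- results %>%
  transmute(
    Breed       = breed,
    Name        = name,
    `# of dogs` = n,
    `P-value`   = p_value
  )
```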


Using kable function from the knitr package to create a nice-looking table
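A short sketch of rendering a table with kable. The data frame is made up, and digits = 3 is just one possible formatting choice.

```r
library(knitr)

# Hypothetical table; check.names = FALSE preserves names containing spaces
tbl <- data.frame(
  Breed       = c("Labrador", "Poodle"),
  Name        = c("Lucy", "Max"),
  `# of dogs` = c(30L, 12L),
  `P-value`   = c(0.00012, 0.0204),
  check.names = FALSE
)

# kable() returns a formatted table that renders in an R Markdown document
out <- kable(tbl, digits = 3)
out
```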

Explanation of one-sided p-value (as opposed to two-sided p-value)
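To make the distinction concrete, here is a sketch of the two one-sided p-values for a single hypothetical breed/name combination. Each tail answers a different question, and neither is a two-sided test.

```r
# Hypothetical counts: about 10 dogs expected by chance, only 4 observed
total_dogs  <- 10000
name_total  <- 100
breed_total <- 1000
observed    <- 4

# Over-representation: P(X >= observed)
p_over <- phyper(observed - 1, name_total, total_dogs - name_total,
                 breed_total, lower.tail = FALSE)

# Under-representation: P(X <= observed)
p_under <- phyper(observed, name_total, total_dogs - name_total,
                  breed_total, lower.tail = TRUE)

# Both tails include P(X == observed), so together they exceed 1 by that amount
overlap <- dhyper(observed, name_total, total_dogs - name_total, breed_total)
```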

Summary of screencast