Ramen Reviews

Web scraping (rvest package)

Published: June 3, 2019

Notable topics: Web scraping (rvest package)

Recorded on: 2019-06-03

Timestamps by: Alex Cookson

Timestamps

Looking at the website the data came from

gather

Using gather function (now pivot_longer) to convert wide data to long (tidy) format

Graphing counts of all categorical variables at once, then exploring them

fct_lump

Using fct_lump function to lump three categorical variables to the top N categories and "Other"

reorder_within

Using reorder_within function to re-order factors that have the same name across multiple facets

lm

Using lm function (linear model) to predict star rating

Visualising effects (and 95% CI) of independent variables in linear model with a coefficient plot (TIE fighter plot)

fct_relevel

Using fct_relevel function to get "Other" as the base reference level for categorical independent variables in a linear model

extract

Using extract function and regex to split a camelCase variable into two separate variables

facet_wrap

Using facet_wrap function to split coefficient / TIE fighter plot into three separate plots, based on type of coefficient

geom_vline

Using geom_vline function to add reference line to graph

unnest_tokens
tidytext

Using unnest_tokens function from tidytext package to explore the relationship between variety (a sparse categorical variable) and star rating

Explanation of how he would approach variety variable with Lasso regression

rvest

Web scraping using the rvest package and SelectorGadget (a Chrome extension for finding CSS selectors)

read_html, html_node, html_table
rvest

Actually writing code for web scraping, using read_html, html_node, and html_table functions

clean_names
janitor

Using clean_names function from janitor package to clean up names of variables

Explanation of web scraping task: get full review text using the links from the review summary table scraped above

parse_number

Using parse_number function as alternative to as.integer function to cleverly drop extra weird text in review number

Using SelectorGadget (a Chrome extension for finding CSS selectors) to identify the part of the page that contains review text

html_nodes, html_text, str_subset
rvest

Using html_nodes, html_text, and str_subset functions to write custom function to scrape review text identified in step above

message

Adding message function to custom scraping function to display URLs as they are being scraped

unnest_tokens, anti_join

Using unnest_tokens and anti_join functions to split review text into individual words and remove stop words (e.g., "the", "or", "and")

Catching a mistake in the custom function causing it to read the same URL every time

str_detect

Using str_detect function to filter out review paragraphs that do not contain a keyword

str_remove

Using str_remove function and regex to get rid of string that follows a specific pattern

possibly, safely
purrr

Explanation of possibly and safely functions in purrr package

Reviewing output of the URL that failed to scrape, including using character(0) as a default null value

pairwise_cor
widyr

Using pairwise_cor function from widyr package to see which words tend to appear in reviews together

igraph, ggraph

Using igraph and ggraph packages to make network plot of word correlations

geom_node_text
igraph, ggraph

Using geom_node_text function to add labels to network plot

igraph, ggraph

Including all words (not just those connected to others) as vertices in the network plot

Tweaking and refining network plot aesthetics (vertex size and colour)

Weird hack for getting a dark outline on hard-to-see vertex points

Summary of screencast
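
Sketch: possibly with character(0) as the default

The timestamps above mention wrapping the custom scraping function with possibly from purrr and using character(0) as the default value for a URL that fails. The following is a minimal sketch of that pattern, not the code from the screencast. The function fake_scrape and the example.com URLs are hypothetical stand-ins so the example runs without network access; in the screencast the wrapped function reads real pages with rvest.

```r
library(purrr)

# Hypothetical stand-in for a scraping function: it fails on one input
# so the example can run offline.
fake_scrape <- function(url) {
  if (grepl("broken", url)) stop("could not read page")
  paste("review text from", url)
}

# possibly() returns a wrapped version of the function that yields
# `otherwise` instead of raising an error.
safe_scrape <- possibly(fake_scrape, otherwise = character(0), quiet = TRUE)

# Hypothetical URLs, for illustration only
urls <- c(
  "https://example.com/1",
  "https://example.com/broken",
  "https://example.com/3"
)

results <- map(urls, safe_scrape)

# The failed URL comes back as a zero-length character vector,
# so it contributes no rows once the results are unnested.
lengths(results)
```

Using a zero-length vector as the default, rather than NA, means a failed page simply drops out of later steps such as unnesting or tokenizing. safely is the alternative when the error message itself should be kept for inspection.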