TidyTuesday Tweets

Notable topics: Text mining (tidytext package)

Recorded on: 2019-01-06

Timestamps by: Alex Cookson

## Screencast

## Timestamps

Importing an rds file using read_rds function

Using floor_date function from lubridate package to round dates down (that's what the floor part does) to the month level
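
As a minimal sketch of the month-level rounding described above (the timestamps and the `created_at` name are made up for illustration, not taken from the tweet data):

```r
library(lubridate)

# Hypothetical tweet timestamps
created_at <- ymd_hms(c("2018-04-18 14:30:00", "2018-05-01 09:00:00"))

# floor_date() rounds each timestamp down to the start of its month
tweet_month <- floor_date(created_at, "month")
tweet_month
```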

Asking, "Which tweets get the most retweets?"

Using contains function to select only columns that contain a certain string ("retweet" in this case)

Exploring likes/retweets ratio, including dealing with one or the other being 0 (dividing by zero gives an infinite or undefined ratio rather than a usable number)
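
A small sketch of why zeros are a problem for a ratio, using made-up counts. Adding 1 to both counts is one common workaround and is an assumption here; the screencast may handle it differently.

```r
# Hypothetical counts for three tweets
likes    <- c(10, 0, 0)
retweets <- c(2, 3, 0)

# Naive ratio: R does not stop with an error, but the results are unusable
naive_ratio <- retweets / likes   # 0.2, Inf, NaN

# Assumed workaround: add 1 to both counts so the ratio is always finite
smoothed_ratio <- (retweets + 1) / (likes + 1)
smoothed_ratio
```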

Starting exploration of actual text of tweets

Using unnest_tokens function from tidytext package to break tweets into individual words (using token argument specifically for tweet-style text)

Using anti_join function to filter out stop words (e.g., "and", "or", "the") from tokenized data frame
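
The tokenizing and stop-word steps above can be sketched roughly as follows. The example tweets and column names are invented, and this sketch uses the default word tokenizer, since the tweet-specific `token` option may not be available in every tidytext version.

```r
library(dplyr)
library(tidytext)

# Hypothetical tweets; the real data has many more columns
tweets <- tibble(
  id   = 1:2,
  text = c("The tidyverse and the tidytext package",
           "Plotting or modeling with tidyverse")
)

tweet_words <- tweets %>%
  # One row per word; unnest_tokens() lowercases by default
  unnest_tokens(word, text) %>%
  # Drop rows whose word appears in tidytext's stop_words data frame
  anti_join(stop_words, by = "word")

tweet_words
```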

Calculating summary statistics per word (average retweets and likes), then looking at distributions

Explanation of Poisson-lognormal distribution (number of retweets fits this distribution)

Additional example of Poisson-lognormal distribution (number of likes)
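
To make the distribution concrete, here is a simulation sketch: a Poisson-lognormal count is a Poisson draw whose rate is itself log-normally distributed. The parameter values are arbitrary illustrations, not estimates from the tweet data.

```r
set.seed(2019)  # arbitrary seed so the sketch is reproducible

n <- 10000
# Each simulated tweet gets its own expected count from a log-normal distribution
expected <- rlnorm(n, meanlog = 1, sdlog = 1.5)
# The observed count is Poisson noise around that expected value
retweets <- rpois(n, lambda = expected)

# The mixture is heavily right-skewed: the variance far exceeds the mean
mean(retweets)
var(retweets)
# On a log scale the counts are much less skewed
summary(log1p(retweets))
```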

Explanation of geometric mean as better summary statistic than median or arithmetic mean
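
A minimal sketch of the geometric mean in base R. The helper name `geometric_mean` and its `offset` argument are hypothetical; the offset is an assumed way to cope with zero counts, whose logarithm is undefined.

```r
# Geometric mean: average on the log scale, then transform back.
# `offset` (assumed here) is added before the log and removed afterwards.
geometric_mean <- function(x, offset = 0) {
  exp(mean(log(x + offset))) - offset
}

# Made-up retweet counts with one viral tweet
retweets <- c(0, 1, 2, 3, 500)

mean(retweets)                        # pulled far up by the outlier
median(retweets)                      # ignores how large the outlier is
geometric_mean(retweets, offset = 1)  # sits between the two
```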

Using floor_date function from lubridate package to floor dates to the week level and tweaking so that a week starts on Monday (default is Sunday)
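
A short sketch of the week-start tweak, using an arbitrary example date:

```r
library(lubridate)

tweet_date <- as.Date("2018-04-18")  # a Wednesday

# Default: weeks start on Sunday
week_sunday <- floor_date(tweet_date, "week")                  # 2018-04-15
# week_start = 1 makes weeks start on Monday instead
week_monday <- floor_date(tweet_date, "week", week_start = 1)  # 2018-04-16
```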

Asking, "What topic is each week about?" using just the tweet text

Calculating TF-IDF of tweets, with week as the "document"
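
As a rough sketch of treating each week as a document, `bind_tf_idf` from tidytext takes a table of per-week word counts. The weeks, words, and counts below are invented for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts per week
word_counts <- tibble(
  week = as.Date(c("2018-04-02", "2018-04-02", "2018-04-09", "2018-04-09")),
  word = c("rstats", "salary", "rstats", "mortality"),
  n    = c(5, 3, 4, 6)
)

week_tf_idf <- word_counts %>%
  # Arguments: term column, document column, count column
  bind_tf_idf(word, week, n) %>%
  arrange(desc(tf_idf))

week_tf_idf
```

A word that appears in every week (here "rstats") gets an inverse document frequency of 0, so only week-specific words score above zero.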

Using top_n and group_by functions to select the top TF-IDF score for each week

Using str_detect function to filter out "words" that are just numbers (e.g., 16, 36)
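
One way to drop number-only tokens is sketched below. The regular expression is an assumption; the screencast may use a different pattern.

```r
library(dplyr)
library(stringr)

# Hypothetical tokens, including two that are just numbers
top_words <- tibble(word = c("16", "36", "beer", "r4ds"))

words_only <- top_words %>%
  # Keep rows where the token is NOT made up entirely of digits
  filter(!str_detect(word, "^[0-9]+$"))

words_only
```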

Using distinct function with .keep_all argument to ensure only top 1 result, as alternative to top_n function (which includes ties)
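
The difference between the two approaches shows up when scores tie. A sketch with invented scores:

```r
library(dplyr)

# Hypothetical scores: week "a" has two words tied for the top score
scores <- tibble(
  week   = c("a", "a", "b"),
  word   = c("x", "y", "z"),
  tf_idf = c(0.5, 0.5, 0.2)
)

# top_n() keeps every row tied for first place, so week "a" appears twice
with_ties <- scores %>%
  group_by(week) %>%
  top_n(1, tf_idf) %>%
  ungroup()

# Sorting and then keeping the first row per week guarantees one row each
one_per_week <- scores %>%
  arrange(desc(tf_idf)) %>%
  distinct(week, .keep_all = TRUE)
```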

Making Jenny Bryan disappointed

Using geom_text function to add text labels to graph to show the word associated with each week

Using geom_text_repel function from ggrepel package as an alternative to geom_text function for adding text labels to graph

Using rvest package to scrape web data from a table in Tidy Tuesday README

Starting to look at #rstats tweets

Spotting signs of fake accounts with purchased followers (lots of hashtags)

Explanation of spotting fake accounts

Using str_detect to filter out web URLs

Using str_count function and some regex to count how many hashtags a tweet has
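
A minimal sketch of counting hashtags per tweet. The pattern `#\\w+` is an assumption and the tweets are made up.

```r
library(stringr)

# Hypothetical tweets
tweets <- c(
  "Loving #rstats and #TidyTuesday",
  "No tags here",
  "#rstats #datascience #python #bigdata #ai"
)

# Count matches of "#" followed by one or more word characters
hashtag_count <- str_count(tweets, "#\\w+")
hashtag_count  # 2 0 5
```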

Creating a Bland-Altman plot (total on x-axis, variable of interest on y-axis)

Using geom_text function with check_overlap argument to add labels to scatterplot

Asking, "Who are the most active #rstats tweeters?"

Summary of screencast