TidyTuesday Tweets

Text mining (tidytext package)

Published: January 6, 2019

Notable topics: Text mining (tidytext package)

Recorded on: 2019-01-06

Timestamps by: Alex Cookson

Screencast

Timestamps

0:01:20  Importing an rds file using read_rds function
0:02:55  Using floor_date function from lubridate package to round dates down (that's what the floor part does) to the month level (function: floor_date; package: lubridate)
0:05:25  Asking, "Which tweets get the most retweets?"
0:05:50  Using contains function to select only columns that contain a certain string ("retweet" in this case) (function: contains)
0:08:05  Exploring likes/retweets ratio, including dealing with one or the other being 0 (which would cause a divide-by-zero problem)
0:11:00  Starting exploration of actual text of tweets
0:11:35  Using unnest_tokens function from tidytext package to break tweets into individual words (using token argument specifically for tweet-style text) (function: unnest_tokens; package: tidytext)
0:12:55  Using anti_join function to filter out stop words (e.g., "and", "or", "the") from tokenized data frame (function: anti_join)
0:14:45  Calculating summary statistics per word (average retweets and likes), then looking at distributions
0:16:00  Explanation of Poisson log-normal distribution (number of retweets fits this distribution)
0:17:45  Additional example of Poisson log-normal distribution (number of likes)
0:18:20  Explanation of geometric mean as better summary statistic than median or arithmetic mean
0:25:20  Using floor_date function from lubridate package to floor dates to the week level and tweaking so that a week starts on Monday (default is Sunday) (function: floor_date; package: lubridate)
0:30:20  Asking, "What topic is each week about?" using just the tweet text
0:31:30  Calculating TF-IDF of tweets, with week as the "document" (function: bind_tf_idf; package: tidytext)
0:33:45  Using top_n and group_by functions to select the top tf-idf score for each week (function: top_n)
0:37:55  Using str_detect function to filter out "words" that are just numbers (e.g., 16, 36) (function: str_detect)
0:41:00  Using distinct function with .keep_all argument to ensure only top 1 result, as alternative to top_n function (which includes ties) (function: distinct)
0:42:30  Making Jenny Bryan disappointed
0:42:55  Using geom_text function to add text labels to graph to show the word associated with each week (function: geom_text)
0:44:10  Using geom_text_repel function from ggrepel package as an alternative to geom_text function for adding text labels to graph (function: geom_text_repel; package: ggrepel)
0:46:30  Using rvest package to scrape web data from a table in Tidy Tuesday README (package: rvest)
0:51:00  Starting to look at #rstats tweets
0:56:35  Spotting signs of fake accounts with purchased followers (lots of hashtags)
0:59:15  Explanation of spotting fake accounts
1:00:45  Using str_detect to filter out web URLs (function: str_detect)
1:03:55  Using str_count function and some regex to count how many hashtags a tweet has (function: str_count)
1:07:25  Creating a Bland-Altman plot (total on x-axis, variable of interest on y-axis)
1:08:45  Using geom_text function with check_overlap argument to add labels to scatterplot (function: geom_text)
1:12:20  Asking, "Who are the most active #rstats tweeters?"
1:15:00  Summary of screencast
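
The weekly-topic steps listed in the timestamps (flooring dates to a Monday-start week, tokenizing, dropping stop words and number-only "words", computing TF-IDF with week as the document, and keeping one top word per week) can be sketched roughly as below. This is a minimal illustration, not the screencast's actual code: the `tweets` data frame, its column names, and the `geometric_mean` helper are hypothetical, and the `+ 1` offset in the geometric mean is one assumed way to handle zero counts. The screencast uses a tweet-specific `token` argument; the sketch falls back on the default word tokenizer, since the tweet option may not be available in recent tidytext releases.

```r
library(dplyr)
library(lubridate)
library(stringr)
library(tidytext)

# Hypothetical toy data standing in for the TidyTuesday tweets
tweets <- tibble(
  created_at = as.POSIXct(
    c("2018-04-03 10:00:00", "2018-04-04 12:00:00",
      "2018-04-10 09:00:00", "2018-04-12 15:00:00"),
    tz = "UTC"
  ),
  text = c(
    "Tuition costs by state in 2018",
    "Mapping tuition across the US",
    "NFL salary data and the 16 positions",
    "NFL contract trends by position"
  ),
  retweet_count = c(0, 3, 10, 1)
)

# Assumed helper: geometric mean with a +1 offset so zero counts are allowed
geometric_mean <- function(x) exp(mean(log(x + 1))) - 1

tweet_words <- tweets %>%
  # Floor to the week level, with weeks starting on Monday
  mutate(week = as.Date(floor_date(created_at, "week", week_start = 1))) %>%
  # One row per word (default word tokenizer, lowercased)
  unnest_tokens(word, text) %>%
  # Drop stop words such as "and", "or", "the"
  anti_join(stop_words, by = "word") %>%
  # Drop "words" that are just numbers
  filter(!str_detect(word, "^[0-9]+$"))

top_words <- tweet_words %>%
  count(week, word) %>%
  # Treat each week as the "document"
  bind_tf_idf(word, week, n) %>%
  arrange(week, desc(tf_idf)) %>%
  # Keep exactly one row per week, even when scores are tied
  distinct(week, .keep_all = TRUE)

print(top_words)
print(geometric_mean(tweets$retweet_count))
```

Using `distinct()` after sorting keeps a single row per week, whereas `top_n()` would return every tied word.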