AERO 2015: Making Shared Twitter Links Useful with R

On Friday December 4th, AERO hosted its annual fall conference at the Old Mill. The speakers included:

Dr. Joe Kim, McMaster University, “The Science of Durable Learning”
Don Buchanan, Hamilton Wentworth DSB , E-BEST, “Putting education in ‘educational’ apps: Lessons from the science of learning”
Dr. Daniel Ansari, Western University, “Building blocks of mathematical abilities: Evidence from brain and behaviour”

Twitter was again a staple at the conference (#AEROAOCE) with backroom discussions and sharing/extending resources and articles highlights by the speaker. As with previous years, an archive of the social media exchanges was created using Martin Hawksey’s TAGS 6.0 utility. Twitterfall was also used as a live twitter feed so everyone could see what was resonating.

Although the compilation of tweets is straight forward, it is seldom in a format that I would share with other stakeholders. To facilitate the cleaning process, I use a small R file that extracts the shared URLs and then expands them from the bit.ly or t.co formats. Following are the code snippets with descriptions of each step. If you are more interested in the resources that were shared rather than the process to clean them, scroll down to the bottom of this post.

The following code is saved as the file twittercleaner.r Each time I use it I change the name of the dataframes to reflect the conference tweets that have been compiled (in this case AERO). The file begins by loading the three packages dplyr, stringr and long url.


library(dplyr)
library(stringr)
library(longurl)

Load the data file containing the tweets (a csv extract from the TAGS 6.0 archive):


AERO <- read.csv("C:/00_Data/AERO2015_Enduring_Learning.csv")

Identify the characters that may be contained in a url:

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

Use the stringr package to create a new column in the dataframe and extract the urls into it:

AERO$ContentURL <- str_extract(AERO$text, url_pattern)

Using dplyr to create a new dataframe and then (%>%),
remove the null values (!is.na) and then (%>%),only keep the column with the URLs:

AEROurl <- AERO %>%
filter(!is.na(ContentURL)) %>%
select (ContentURL)

Remove the duplicate URLs (keep unique URLs):

AEROurl <- unique(AEROurl$ContentURL)

Remove the rownames from the table:

attr(AEROurl, "rownames") <- NULL

Up to this point the URLs included in tweets have been shortened using bit.ly or t.co. The following step uses the longurl package to expand the URLs:

AEROExpanded <- expand_urls(AEROurl, check=TRUE, warn=TRUE)

Remove URLs that could not be expanded (and result in a Null value):

AEROExpanded <- filter(AEROExpanded, !is.na(expanded_url))

Create a .csv file containing the extracted and expanded URLs:

write.csv(AEROurl, "C:/AEROurl.csv")

Full version:


library(dplyr)
library(stringr)
library(longurl)

AERO <- read.csv("C:/00_Data/AERO2015_Enduring_Learning.csv")

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

AERO$ContentURL <- str_extract(AERO$text, url_pattern)
AEROurl <- AERO %>%
        filter(!is.na(ContentURL)) %>%
        select (ContentURL)

AEROurl <- unique(AEROurl$ContentURL)
attr(AEROurl, "rownames") <- NULL

AEROExpanded <- expand_urls(AEROurl, check=TRUE, warn=TRUE)

AEROExpanded <- filter(AEROExpanded, !is.na(expanded_url))

write.csv(AEROurl, "C:/AEROurl.csv")

Results (sorted):

AERO 2015: Making Shared Twitter Links Useful with R

1 Response to AERO 2015: Making Shared Twitter Links Useful with R

Leave a comment Cancel reply

Recent Posts

Archives

Categories

ResearchChat Tweets

Meta

AERO 2015: Making Shared Twitter Links Useful with R

Share this:

Related

1 Response to AERO 2015: Making Shared Twitter Links Useful with R

Leave a comment Cancel reply

Recent Posts

Archives

Categories

ResearchChat Tweets

Meta