After several failed attempts to preserve Twitter communication about a local incident using Archive-It (I probably did something wrong), I explored other tools such as twarc, developed by Ed Summers. Because of Twitter API limitations, I can't use it for that past incident, but I would like to learn it for the future. For me, the best way to learn is to try.
I discovered an article, An Open-Source Strategy for Documenting Events: The Case Study of the 42nd Canadian Federal Election on Twitter, which introduces tools and approaches for using twarc and other tools to document events expressed via Twitter. Since Open Access Week ran from October 21 to October 27, 2019, I replicated their study using tweets with the #oaweek, #oaweek19, #oaweek2019, #openaccessweek, and #openaccessweek2019 hashtags, collected on October 25 and 26, 2019.
I used the twarc search function to collect the Twitter data.
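# first collection run (October 25, 2019)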
twarc search '#oaweek' > oaweek.jsonl
twarc search '#oaweek19' > oaweek19.jsonl
twarc search '#oaweek2019' > oaweek2019.jsonl
twarc search '#openaccessweek' > openaccessweek.jsonl
twarc search '#openaccessweek2019' > openaccessweek2019.jsonl
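# second collection run (October 26, 2019; hence the _1026 suffix)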
twarc search '#oaweek' > oaweek_1026.jsonl
twarc search '#oaweek19' > oaweek19_1026.jsonl
twarc search '#oaweek2019' > oaweek2019_1026.jsonl
twarc search '#openaccessweek' > openaccessweek_1026.jsonl
twarc search '#openaccessweek2019' > openaccessweek2019_1026.jsonl
Then, I combined all these files into one file called all_combined_oaweek_2019.jsonl and created the final version, deduped.jsonl, using the deduplicate.py utility from twarc to remove any tweets with duplicate IDs.
cat oaweek.jsonl oaweek19.jsonl oaweek2019.jsonl openaccessweek.jsonl openaccessweek2019.jsonl oaweek_1026.jsonl oaweek19_1026.jsonl oaweek2019_1026.jsonl openaccessweek_1026.jsonl openaccessweek2019_1026.jsonl > all_combined_oaweek_2019.jsonl
utils/deduplicate.py all_combined_oaweek_2019.jsonl > deduped.jsonl
There were 24,041 tweets in the final dataset.
To unshorten every URL in the dataset, I applied the unshorten.py utility from twarc in combination with unshrtn.
docker run -p 3000:3000 docnow/unshrtn
cat deduped.jsonl | python3 twarc/utils/unshorten.py > deduped-unshortened.jsonl
The wordcloud.py utility from twarc was used to create a word cloud of tweets from the Open Access Week dataset.
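Following the same pattern as the other utilities, an invocation along these lines should generate the word cloud as an HTML page (the output filename is a placeholder of my own):
utils/wordcloud.py deduped.jsonl > wordcloud.html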
I was able to see the most retweeted tweets with the commands below:
utils/retweets.py deduped.jsonl > deduped-retweets.jsonl
utils/tweet_urls.py deduped-retweets.jsonl > deduped-retweets.txt
I ran the code as suggested by the article, but only 97 tweets provided geographic information, so I skipped that step.
utils/geo.py deduped.jsonl > deduped-geo.jsonl
cat deduped-geo.jsonl | wc -l
#97
Using the users.py utility from twarc, I was able to identify the Twitter users who tweeted most about Open Access Week.
utils/users.py deduped.jsonl > deduped-users.txt
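# count tweets per user and sort ascending, so the most active accounts end up at the bottom of the file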
cat deduped-users.txt | sort | uniq -c | sort -n > deduped-users-uniq.txt
cat deduped-users-uniq.txt | wc -l
#10078
tail deduped-users-uniq.txt
Using the tags.py utility from twarc, I was able to see the other hashtags used in the dataset.
utils/tags.py deduped.jsonl > deduped-tags.txt
cat deduped-tags.txt | wc -l
#2604
head deduped-tags.txt
Using the urls.py utility from twarc, I was able to see the URLs tweeted in the dataset.
utils/urls.py deduped-unshortened.jsonl > deduped-unshortened-urls.txt
cat deduped-unshortened-urls.txt | sort | uniq -c | sort -n > deduped-unshortened-urls-uniq.txt
cat deduped-unshortened-urls.txt | wc -l
# 7504
cat deduped-unshortened-urls-uniq.txt | wc -l
# 4221
tail deduped-unshortened-urls-uniq.txt
The original article used image_urls.py, which is no longer available and has been replaced by media_urls.py.
utils/media_urls.py deduped.jsonl > deduped-images.txt
cat deduped-images.txt | sort | uniq -c | sort -n > deduped-images-uniq.txt
cat deduped-images.txt | wc -l
#4942
cat deduped-images-uniq.txt | wc -l
#4827
tail deduped-images-uniq.txt
The original article didn't cover emoji, but I found that twarc provides emojis.py, so I tried it just for fun.
utils/emojis.py deduped.jsonl > deduped-emoji.txt
I wish I had known about this tool earlier and that there were dedicated staff responsible for digital preservation at my institution. Nonetheless, I learned about several tools (e.g., Social Feed Manager) from the digital preservation community (#digipres), so I will keep learning by doing and do my best to convince other people that this work is important and should be one of our responsibilities. I also discovered research by Anthony T. Pinter and Ben Goldman, Recount: Revisiting the 42nd Canadian Federal Election to Evaluate the Efficacy of Retroactive Tweet Collection, on collecting tweets beyond Twitter's seven-day search limit. I am not an expert, but I do believe this is important. It may be a small contribution, but I would like to keep improving so I can contribute to the community, even in a small way.
P.S. - I uploaded the dehydrated tweet IDs to figshare if you want to reproduce this research or replicate it with your own dataset.
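For reference, twarc's dehydrate command converts a JSONL dataset into a plain list of tweet IDs, and hydrate rebuilds the tweets from those IDs (minus anything deleted or protected in the meantime); the filenames here are placeholders of my own:
twarc dehydrate deduped.jsonl > deduped-ids.txt
twarc hydrate deduped-ids.txt > hydrated.jsonl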