Tim Squirrell is a PhD candidate in Science and Technology Studies at the University of Edinburgh. His research focusses on the construction and negotiation of authority and expertise on the internet, particularly within fitness and nutrition communities.

Introduction to computational/data-driven/digital methods of researching Reddit: 10 ways to explore the data

My friend Jason Baumgartner (stuck_in_the_matrix on Reddit) performs an extremely valuable service in pulling every single comment and post from reddit and storing them on a site that anyone can access (https://files.pushshift.io/reddit/). In contrast with Facebook (who just blocked Netvizz, the main way academics scrape data) and Instagram (who are similarly restrictive about the data you can use), that makes Reddit an extremely open platform for research and analysis.

Given that reddit is one of the most popular sites in the world (particularly in the US), it’s a shame that so little research has been done on its social dynamics. Quick shout-out for existing work:

Adrienne Massanari has written a book, as well as multiple papers on Reddit, with a particular focus on the way it facilitates the creation of “toxic techno cultures” like GamerGate and The Fappening.

Alex Halavais has a paper in First Monday which looks at a number of different subreddits and thinks about how reddit users draw upon evidence.

My own work has been published on Quartz, the New Statesman, and the LSE Impact Blog, using computational methods based on Jason’s data to analyse the linguistic practices of the alt-right and show how the various communities are connected. I do some temporal analysis of the word “cuck”, and also look at the “dictionary” of the alt-right. I’ve also made arguments about why Reddit refuses to ban The_Donald, a notorious source of hate speech. And then there’s my Reddit fanboy article.

If there’s more, let me know and I’ll happily add it.


How you can use computational methods and pushshift.io to analyse Reddit

Anyway, here are some of the things you can do with Reddit, illustrated through the medium of white supremacist communities.
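If you want to try any of this on the raw data yourself, the first step is reading a downloaded dump file. The sketch below assumes the file is bz2-compressed, newline-delimited JSON with one comment per line and a `subreddit` field on each record; the function name `iter_comments` is made up for illustration, and other compression formats would need a different reader.

```python
import bz2
import json


def iter_comments(path, subreddit=None):
    """Yield comment dicts from a bz2-compressed, newline-delimited JSON file.

    Assumption: one JSON object per line, each with a "subreddit" key.
    """
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            comment = json.loads(line)
            # Keep everything, or only the requested subreddit.
            if subreddit is None or comment.get("subreddit") == subreddit:
                yield comment
```

Because it is a generator, something like `for comment in iter_comments("comments.bz2", subreddit="The_Donald")` (the file name is hypothetical) streams one comment at a time rather than loading a whole month of Reddit into memory.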

1. Basic word cloud analysis

You can generate a word cloud of the words most commonly used in a subreddit or by an author. Here’s The_Donald for the last 90 days:

Word cloud for The_Donald
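A word cloud is really just a picture of a word-frequency table. Here is a minimal sketch of building that table, assuming each comment is a dict with a `body` field; the stop-word list and the `word_frequencies` name are illustrative only, and the drawing of the cloud itself is left to whatever plotting tool you prefer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "and", "that", "this", "for", "are", "was", "with", "you"}


def word_frequencies(comments, top_n=50):
    """Return the top_n (word, count) pairs across all comment bodies."""
    counts = Counter()
    for comment in comments:
        # Lowercase, then treat runs of letters/apostrophes as words.
        tokens = re.findall(r"[a-z']+", comment["body"].lower())
        counts.update(
            t for t in tokens if len(t) > 2 and t not in STOP_WORDS
        )
    return counts.most_common(top_n)
```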

2. Word clouds based on phrases

You can create word clouds based on which terms are most likely to turn up in the same comment as a word or phrase, like these for “jew” in The_Donald and then the subreddit ChadRight:

Word cloud for "jew" in The_Donald

Word cloud for "jew" in r/chadright
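The phrase-based clouds can be approximated by counting which words share a comment with the search term. This sketch uses raw co-occurrence counts, which is a simplification of "most likely to turn up"; it assumes a `body` field on each comment, and `cooccurring_words` is a hypothetical name. Note that the match is a plain substring test, so a short term will also match inside longer words.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "and", "that", "this", "for", "are", "with", "you"}
TOKEN = re.compile(r"[a-z']+")


def cooccurring_words(comments, phrase, top_n=50):
    """Count words that appear in the same comment as `phrase`."""
    phrase = phrase.lower()
    phrase_tokens = set(TOKEN.findall(phrase))
    counts = Counter()
    for comment in comments:
        body = comment["body"].lower()
        if phrase not in body:
            continue
        # Use a set so each word counts once per comment.
        tokens = set(TOKEN.findall(body))
        counts.update(
            t for t in tokens
            if len(t) > 2 and t not in STOP_WORDS and t not in phrase_tokens
        )
    return counts.most_common(top_n)
```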

3. Subreddit word frequency analysis

You can look at which subreddits are using a word or phrase the most, like this graph of the word “cuck” over the last 90 days:

Subreddits most commonly using "cuck" over most recent 90 days

4. Taking account of subreddit size

You can adjust the parameters of these functions, for example normalising the results by subreddit size so that they show the proportion of comments containing the phrase, rather than the absolute number:

Subreddit frequency analysis of "cuck", normalised for subreddit size
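The difference between the absolute and the normalised view is easiest to see in code. A minimal sketch, assuming comments are dicts with `subreddit` and `body` fields (`subreddit_usage` is a made-up name): the raw count favours big subreddits, while the normalised version divides by each subreddit's total number of comments.

```python
from collections import Counter


def subreddit_usage(comments, term, normalise=False):
    """Map subreddit -> number (or proportion) of comments containing term."""
    term = term.lower()
    totals = Counter()  # all comments per subreddit
    hits = Counter()    # comments containing the term per subreddit
    for comment in comments:
        totals[comment["subreddit"]] += 1
        if term in comment["body"].lower():
            hits[comment["subreddit"]] += 1
    if not normalise:
        return dict(hits)
    # Proportion of each subreddit's comments that contain the term.
    return {sub: hits[sub] / totals[sub] for sub in totals}
```

With two hits out of four comments in a large subreddit and one hit out of one in a small one, the raw view ranks the large subreddit first and the normalised view ranks the small one first.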

5. Detecting speech patterns in subreddits (e.g. hate speech & slurs)

Using a combination of terms, you can for example compile a list of subreddits which use racial slurs the most. Here’s a normalised graph of the subreddits which use the following terms the most: bitch|cunt|nigger|niggers|fucker|libtard|libtards|cucks

Normalised frequency of slurs by subreddit
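The pipe-separated list of terms reads like a regular-expression alternation. Assuming that is roughly how a combined query is matched (the source does not say), here is a sketch that compiles such a list into a whole-word, case-insensitive pattern and ranks subreddits by the share of their comments that match. The function names are hypothetical and the placeholder terms in the usage note are deliberately neutral.

```python
import re
from collections import Counter


def compile_terms(pipe_separated):
    """Turn 'a|b|c' into a case-insensitive, whole-word regex."""
    terms = [re.escape(t) for t in pipe_separated.split("|") if t]
    return re.compile(r"\b(?:" + "|".join(terms) + r")\b", re.IGNORECASE)


def proportion_matching(comments, pattern):
    """Return (subreddit, share of comments matching), highest share first."""
    totals, hits = Counter(), Counter()
    for comment in comments:
        totals[comment["subreddit"]] += 1
        if pattern.search(comment["body"]):
            hits[comment["subreddit"]] += 1
    shares = ((sub, hits[sub] / totals[sub]) for sub in totals)
    return sorted(shares, key=lambda pair: pair[1], reverse=True)
```

For example, `compile_terms("foo|bar")` matches "Foo" and "BAR" but not "food", because of the word boundaries.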

6. Analysis of authors' comments & activity

You can take an individual user and look at the subreddits in which they are most active. Let’s use reddit CEO Steve Huffman’s account as an example here:

Steve Huffman (spez) subreddit activity
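Per-author activity is a simple group-and-count. A minimal sketch, assuming each comment dict carries `author` and `subreddit` fields; `author_activity` is an illustrative name.

```python
from collections import Counter


def author_activity(comments, author, top_n=10):
    """Return the subreddits where `author` comments most, with counts."""
    counts = Counter(
        comment["subreddit"]
        for comment in comments
        if comment["author"] == author
    )
    return counts.most_common(top_n)
```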

7. Which users are saying words the most?

You can see which users are using a particular phrase the most. For example, here’s the list of users whose comments most commonly include a racial slur over the last 90 days:

Users whose comments most commonly include a racial slur

 

8. Which places do subreddits link to?

You can find which sites are most commonly linked to by a given author or subreddit, seen with The_Donald here:

The_Donald outward links by domain frequency
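Domain counts can be sketched by pulling URLs out of text and keeping only the host name. This assumes you have a list of strings (comment bodies or post URLs); the URL regex is deliberately crude and `linked_domains` is a hypothetical name.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Crude URL matcher: stops at whitespace, quotes and closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")


def linked_domains(texts, top_n=20):
    """Count how often each domain is linked across the given texts."""
    counts = Counter()
    for text in texts:
        for url in URL_PATTERN.findall(text):
            try:
                host = urlparse(url).netloc.lower()
            except ValueError:
                continue  # skip malformed URLs
            if host.startswith("www."):
                host = host[4:]  # treat www.example.com as example.com
            if host:
                counts[host] += 1
    return counts.most_common(top_n)
```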

9. Time period analysis of word frequency

You can look at how frequently a word has been used over time, as with “cuck” here:

Timeline analysis of the frequency of "cuck"
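A timeline like this comes from bucketing matching comments by date. The sketch below groups by calendar month and assumes each comment has a `created_utc` field holding a Unix timestamp in seconds, plus a `body` field; `monthly_counts` is a made-up name and the bucket size is an arbitrary choice.

```python
from collections import Counter
from datetime import datetime, timezone


def monthly_counts(comments, term):
    """Return sorted (YYYY-MM, count) pairs for comments containing term."""
    term = term.lower()
    counts = Counter()
    for comment in comments:
        if term in comment["body"].lower():
            when = datetime.fromtimestamp(
                int(comment["created_utc"]), tz=timezone.utc
            )
            counts[when.strftime("%Y-%m")] += 1
    # YYYY-MM strings sort chronologically.
    return sorted(counts.items())
```

Dividing each month's count by the total number of comments in that month would give a normalised timeline, which matters because Reddit as a whole has grown over time.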

10. Time of Day analysis

We can look at when someone is posting the most, or when a subreddit is most active, and in doing so potentially deduce which time zone they are likely to be in (or, for a subreddit, which time zone a plurality of its users are likely to be in). Here’s The_Donald vs r/politics:

politics time of day post frequency

the_donald time of day post frequency
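The time-of-day view is a 24-bin histogram of posting hours. A minimal sketch, again assuming a `created_utc` Unix timestamp on each item; the function names are illustrative. Hours are in UTC, so any guess about a time zone means shifting the histogram and asking which offset makes the quiet period look like night.

```python
from datetime import datetime, timezone


def hourly_activity(items):
    """Return a list of 24 counts, indexed by hour of day in UTC."""
    hours = [0] * 24
    for item in items:
        when = datetime.fromtimestamp(int(item["created_utc"]), tz=timezone.utc)
        hours[when.hour] += 1
    return hours


def peak_hour_utc(items):
    """Return the UTC hour with the most activity (earliest hour on ties)."""
    hours = hourly_activity(items)
    return max(range(24), key=hours.__getitem__)
```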

11. Word association based on Natural Language Processing

Finally (for now), there’s a function called “describe”, which uses Natural Language Processing to find the words and sentence fragments that accompany a given word or phrase. That might sound confusing, but it’s reasonably intuitive when you see the result. Let’s look at “jews” in The_Donald:

white | not white | descended from swine and apes | massacred in morocco | a treacherous | a bunch of paedophiles | hated by allah to the extent that they are destined for eternal doom as a result of their beliefs | liberals | done on purpose thousands of years ago | simply telling their own to save the nastiness for gentile kids | a nation of liars | their generals | bad | supposed to be back when christ returns at the rapture so he can go back to dealing with them | really touchy when it comes to group criticism | on their way to hell | oppressing us" vs "white people oppressed black people" | inconsequential | white or not | the new fad by the left now | liberal | nazis | the color white | often without a seat when the music stops | no problem

Let’s try “jews” in r/chadright:

in fact | eveil monsters | vastly over-represented in ivy league and are classified as white - that make it harder for non-jewish whites to get into ivy league | satanists and control all the evil in the west | overrepresented in position of power just as feminists are concerned that men are overrepresented in positions of power | just bigoted nonsense | not semite so you can't be anti-semitic to asheknazi | jewing each other and in the end it may be even worse to them than it is to us | cursed and covered with malediction | the best | arabs | bigoted | so low iq it is incredible | evil monsters who want to rule the worldwe never talked about it | fucking kids as your universal message | oppressed and need to stand up for themselves | not anti-semitic | certainly important but one would have to show they were motivated by their perception of jewish interests | a monolith out to destroy white people as there are nonreligious jews | very  successful in this country | god's chosen and we don't deserve the same privilege and wealth they enjoy
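How “describe” is implemented isn't stated here, but the output looks like whatever follows the search term plus a form of "to be". As a rough approximation only (a regular expression, not real language parsing, and not the actual function), the idea can be sketched like this, assuming a `body` field on each comment:

```python
import re


def describe(comments, term):
    """Collect the text following '<term> is/are/was/were' in each comment.

    Crude stand-in for an NLP-based approach: it captures everything up to
    the next sentence-ending punctuation mark or line break.
    """
    pattern = re.compile(
        r"\b" + re.escape(term) + r"\s+(?:are|is|were|was)\s+([^.!?\n]+)",
        re.IGNORECASE,
    )
    results = []
    for comment in comments:
        for match in pattern.finditer(comment["body"]):
            results.append(match.group(1).strip().lower())
    return results
```

Joining the returned list with " | " gives output in the same shape as the lists above.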
 


Computational methods for analysing reddit: how to get involved

So there you have it. Eleven different functions for analysing data from reddit, from 2006 to present. There’s an enormous amount of scope for combining these functions, for example by altering time periods, subreddits, words, or authors. It would be really great to see more people using these methods to analyse reddit. If you’re interested in learning more, ping me or Jason. If you want to help support what is an extremely time- and resource-intensive (but worthwhile) project, Jason has a Patreon: https://www.patreon.com/pushshift.

 
