My friend Jason Baumgartner (stuck_in_the_matrix on Reddit) performs an extremely valuable service in pulling every single comment and post from reddit and storing them on a site that anyone can access (https://files.pushshift.io/reddit/). In contrast with Facebook (who just blocked Netvizz, the main way academics scrape data) and Instagram (who are similarly restrictive about the data you can use), that makes Reddit an extremely open platform for research and analysis.
Given that reddit is one of the most popular sites in the world (particularly in the US), it’s a shame that so little research has been done on its social dynamics. Quick shout-out for existing work:
Adrienne Massanari has written a book, as well as multiple papers on Reddit, with a particular focus on the way it facilitates the creation of “toxic techno cultures” like GamerGate and The Fappening.
Alex Halavais has a paper in First Monday which looks at a number of different subreddits and thinks about how reddit users draw upon evidence.
My own work has been published on Quartz, the New Statesman, and the LSE Impact Blog, using computational methods based on Jason’s data to analyse the linguistic practices of the alt-right and show how the various communities are connected. I do some temporal analysis of the word “cuck”, and also look at the “dictionary” of the alt-right. I’ve also made arguments about why Reddit refuses to ban The_Donald, a notorious source of hate speech. And then there’s my Reddit fanboy article.
If there’s more, let me know and I’ll happily add it.
How you can use computational methods and pushshift.io to analyse Reddit
Anyway, here are some of the things you can do with Reddit data, illustrated through the medium of white supremacist communities.
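If you want to follow along with the raw data rather than with ready-made functions, here is a minimal sketch of reading one of the downloaded dump files in Python. It assumes the file is bz2-compressed with one JSON object per line (other compression formats would need a different opener), and the function name and the file name in the usage note are made up for illustration.

```python
import bz2
import json


def iter_comments(path):
    """Yield one comment (as a dict) per non-empty line of a bz2 dump file."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

You would then loop over something like `iter_comments("RC_sample.bz2")` (a hypothetical file name) and filter on whichever fields you care about, such as the subreddit or the author.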
1. Basic word cloud analysis
You can generate the most commonly used words from a subreddit or author. Here’s The_Donald for the last 90 days:
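If you want to try something similar yourself, the counting behind a word cloud is simple. This is a rough sketch rather than the function used for the graphic: it assumes each comment is a dict with the text under `body`, and the stop-word list and sample comments are made up.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it", "with"}


def top_words(comments, n=10):
    """Return the n most common non-stop-words across all comment bodies."""
    counts = Counter()
    for comment in comments:
        words = re.findall(r"[a-z']+", comment["body"].lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Made-up sample data, standing in for one subreddit's comments.
sample = [
    {"body": "The garden is full of tomatoes"},
    {"body": "Tomatoes and basil, and more tomatoes"},
]
print(top_words(sample, n=3))
```

The resulting (word, count) pairs are what a word cloud library would scale its font sizes by.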
2. Word clouds based on phrases
You can create word clouds based on which terms are most likely to turn up in the same comment as a word or phrase, like these for “jew” in The_Donald and then the subreddit ChadRight:
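One simple way to approximate this kind of co-occurrence count is sketched below. It is an assumption about the approach, not the actual function: it handles a single target word only (a phrase would need substring matching instead), and it expects the comment text under `body`.

```python
import re
from collections import Counter


def cooccurring_words(comments, target, n=10):
    """Count words that appear in the same comment as the target word."""
    target = target.lower()
    counts = Counter()
    for comment in comments:
        words = re.findall(r"[a-z']+", comment["body"].lower())
        if target in words:
            # Count each co-occurring word once per comment.
            counts.update(set(words) - {target})
    return counts.most_common(n)


# Made-up sample data.
sample = [
    {"body": "coffee and cake"},
    {"body": "coffee with cake"},
    {"body": "tea and biscuits"},
]
print(cooccurring_words(sample, "coffee", n=1))
```

Counting each word once per comment stops a single long, repetitive comment from dominating the cloud.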
3. Subreddit word frequency analysis
You can look at which subreddits are using a word or phrase the most, like this graph of the word “cuck” over the last 90 days:
4. Taking account of subreddit size
You can adjust the parameters of these functions, for example normalising the function so that it takes account of the proportion of comments containing the phrase, rather than the absolute number:
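To make the difference between absolute and normalised counts concrete, here is a small sketch that computes the share of each subreddit's comments containing a phrase. The field names `subreddit` and `body` are assumptions about the data, and the sample comments are invented.

```python
from collections import Counter


def phrase_share_by_subreddit(comments, phrase):
    """Return {subreddit: fraction of its comments containing the phrase}."""
    phrase = phrase.lower()
    totals = Counter()
    hits = Counter()
    for comment in comments:
        totals[comment["subreddit"]] += 1
        # Plain substring match, so "tea" would also match "steam".
        if phrase in comment["body"].lower():
            hits[comment["subreddit"]] += 1
    return {sub: hits[sub] / totals[sub] for sub in totals}


# Made-up sample: each subreddit mentions the phrase once,
# but "small" has half as many comments overall.
sample = [
    {"subreddit": "big", "body": "tea time"},
    {"subreddit": "big", "body": "coffee"},
    {"subreddit": "big", "body": "juice"},
    {"subreddit": "big", "body": "water"},
    {"subreddit": "small", "body": "tea please"},
    {"subreddit": "small", "body": "milk"},
]
print(phrase_share_by_subreddit(sample, "tea"))
```

Both subreddits have one matching comment in absolute terms, but the smaller one uses the phrase in a larger share of its comments, which is the effect normalising is meant to surface.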
5. Detecting speech patterns in subreddits (e.g. hate speech & slurs)
By combining terms, you can compile a list of the subreddits that use slurs the most. Here’s a normalised graph of the subreddits that most often use the following terms: bitch|cunt|nigger|niggers|fucker|libtard|libtards|cucks
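The pipe-separated list reads like a regular-expression alternation, although whether the original function treats it that way is an assumption on my part. As a sketch, you could compile a list of terms into one whole-word pattern and count matching comments per subreddit (or per author, by changing `key`). The placeholder terms and sample comments below are deliberately neutral and made up.

```python
import re
from collections import Counter


def build_term_pattern(terms):
    """Compile terms into one case-insensitive, whole-word pattern."""
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


def count_matches_by(comments, terms, key="subreddit"):
    """Count comments matching any term, grouped by the given field."""
    pattern = build_term_pattern(terms)
    hits = Counter()
    for comment in comments:
        if pattern.search(comment["body"]):
            hits[comment[key]] += 1
    return hits.most_common()


# Made-up sample data with placeholder terms.
sample = [
    {"author": "alice", "subreddit": "baking", "body": "Foo is great"},
    {"author": "alice", "subreddit": "cooking", "body": "bar and more bar"},
    {"author": "bob", "subreddit": "baking", "body": "foobar is one word"},
    {"author": "bob", "subreddit": "baking", "body": "nothing here"},
]
terms = ["foo", "bar"]
print(count_matches_by(sample, terms))
print(count_matches_by(sample, terms, key="author"))
```

The `\b` word boundaries mean "foobar" does not count as a use of "foo" or "bar", which matters a lot when the terms are short.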
6. Analysis of authors' comments & activity
You can take an individual user and look at the subreddits in which they are most active. Let’s use reddit CEO Steve Huffman’s account as an example here:
7. Which users are saying words the most?
You can see which users are using a particular phrase the most. For example, here’s the list of users whose comments most commonly include a racial slur over the last 90 days:
8. Which places do subreddits link to?
You can find which sites are most commonly linked to by a given author or subreddit, seen with The_Donald here:
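A rough way to get at linked sites from comment text is to pull out URLs and count their domains. This sketch assumes the links appear inline in the `body` field; link posts may store the URL in a separate field, which is not handled here. The sample comments are made up.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Matches http(s) URLs up to whitespace or a closing bracket.
URL_RE = re.compile(r"https?://[^\s)\]>]+")


def top_domains(comments, n=10):
    """Return the n most commonly linked domains in the comment bodies."""
    counts = Counter()
    for comment in comments:
        for url in URL_RE.findall(comment["body"]):
            host = urlparse(url).netloc.lower()
            if host.startswith("www."):
                host = host[4:]  # treat www.example.com and example.com alike
            if host:
                counts[host] += 1
    return counts.most_common(n)


# Made-up sample data.
sample = [
    {"body": "see https://www.example.com/a and http://example.com/b"},
    {"body": "a markdown [link](https://example.org/x) here"},
]
print(top_domains(sample))
```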
9. Time period analysis of word frequency
You can look at how frequently a word has been used over time, as with “cuck” here:
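For anyone wanting to build a chart like this themselves, the underlying step is bucketing matching comments by time period. Here is a minimal sketch that counts by month, assuming each comment carries a Unix timestamp under `created_utc` (stored as a number or a numeric string). The sample data is invented.

```python
import re
from collections import Counter
from datetime import datetime, timezone


def monthly_counts(comments, word):
    """Return {"YYYY-MM": number of comments containing the word}."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    counts = Counter()
    for comment in comments:
        if pattern.search(comment["body"]):
            stamp = datetime.fromtimestamp(
                int(comment["created_utc"]), tz=timezone.utc
            )
            counts[stamp.strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))


# Made-up sample data: two matches in January 2017, one in February.
sample = [
    {"body": "Tea again", "created_utc": 1483228800},
    {"body": "more tea", "created_utc": "1483315200"},
    {"body": "tea", "created_utc": 1485907200},
    {"body": "coffee", "created_utc": 1485907300},
]
print(monthly_counts(sample, "tea"))
```

Swapping the `strftime` format for a daily or weekly one changes the resolution of the graph.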
10. Time of Day analysis
You can look at when a user posts the most, or when a subreddit is most active, and from that potentially deduce which time zone they are likely to be in (or, for a subreddit, which time zone a plurality of its users are likely to be in). Here’s The_Donald vs r/politics:
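The time-of-day idea boils down to a 24-bucket histogram of posting hours. This sketch assumes a Unix timestamp under `created_utc` and reports hours in UTC; guessing a time zone from the quiet hours is left to the reader. The sample timestamps are made up.

```python
from collections import Counter
from datetime import datetime, timezone


def hourly_activity(comments):
    """Return a 24-item list of comment counts, indexed by UTC hour."""
    counts = Counter()
    for comment in comments:
        stamp = datetime.fromtimestamp(
            int(comment["created_utc"]), tz=timezone.utc
        )
        counts[stamp.hour] += 1
    return [counts[hour] for hour in range(24)]


# Made-up sample data: one comment at 00:00 UTC, two shortly after 01:00 UTC.
sample = [
    {"created_utc": 1483228800},
    {"created_utc": 1483232400},
    {"created_utc": 1483232460},
]
histogram = hourly_activity(sample)
print(histogram)
```

A long dip in the histogram usually marks the hours when most of the posters are asleep, which is what makes the time-zone guess possible.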
11. Word association based on Natural Language Processing
Finally (for now), there’s a function called “describe”, which uses Natural Language Processing to pull out the ways in which a word or phrase is described in the comments where it appears. That might sound confusing, but it’s reasonably intuitive when you see the result. Let’s look at “jews” in The_Donald:
white | not white | descended from swine and apes | massacred in morocco | a treacherous | a bunch of paedophiles | hated by allah to the extent that they are destined for eternal doom as a result of their beliefs | liberals | done on purpose thousands of years ago | simply telling their own to save the nastiness for gentile kids | a nation of liars | their generals | bad | supposed to be back when christ returns at the rapture so he can go back to dealing with them | really touchy when it comes to group criticism | on their way to hell | oppressing us" vs "white people oppressed black people" | inconsequential | white or not | the new fad by the left now | liberal | nazis | the color white | often without a seat when the music stops | no problem
Let’s try “jews” in r/chadright:
in fact | eveil monsters | vastly over-represented in ivy league and are classified as white - that make it harder for non-jewish whites to get into ivy league | satanists and control all the evil in the west | overrepresented in position of power just as feminists are concerned that men are overrepresented in positions of power | just bigoted nonsense | not semite so you can't be anti-semitic to asheknazi | jewing each other and in the end it may be even worse to them than it is to us | cursed and covered with malediction | the best | arabs | bigoted | so low iq it is incredible | evil monsters who want to rule the worldwe never talked about it | fucking kids as your universal message | oppressed and need to stand up for themselves | not anti-semitic | certainly important but one would have to show they were motivated by their perception of jewish interests | a monolith out to destroy white people as there are nonreligious jews | very successful in this country | god's chosen and we don't deserve the same privilege and wealth they enjoy
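How “describe” works internally isn't covered here, so the following is only a crude approximation of the idea, not the real thing: a regular expression that captures whatever follows “<term> is” or “<term> are” up to the end of the sentence. It involves no actual Natural Language Processing, the function name is made up, and the sample comments are deliberately harmless.

```python
import re


def describe_sketch(comments, term):
    """Collect the text following '<term> is/are' in each comment body."""
    pattern = re.compile(
        rf"\b{re.escape(term)}\s+(?:is|are)\s+([^.!?\n]+)", re.IGNORECASE
    )
    results = []
    for comment in comments:
        for match in pattern.finditer(comment["body"]):
            results.append(match.group(1).strip().lower())
    return results


# Made-up sample data.
sample = [
    {"body": "Cats are very independent. Dogs are loyal!"},
    {"body": "I think cats are not fond of water"},
]
print(" | ".join(describe_sketch(sample, "cats")))
```

A proper implementation would presumably parse the sentence rather than pattern-match it, which would also catch descriptions that do not follow the word directly.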
Computational methods for analysing reddit: how to get involved
So there you have it. Eleven different functions for analysing data from Reddit, from 2006 to the present. There’s an enormous amount of scope for combining these functions, for example by altering time periods, subreddits, words, or authors. It would be really great to see more people using these methods to analyse Reddit. If you’re interested in learning more, ping me or Jason. If you want to help support what is an extremely time- and resource-intensive (but worthwhile) project, Jason has a Patreon: https://www.patreon.com/pushshift.