Digital Methods: when data could be dangerous
For the last week, I've been attending the Digital Methods Initiative Summer School at the University of Amsterdam. It's a conceptually interesting project that attempts to use the digital to study the digital. The theory behind it is that there is a lot of work that uses traditional methods to study the internet or the digital (e.g. interviews, ethnographies, surveys, and so on); likewise, there's quite a bit that uses the digital to study the "real world" (digital archives, interviews conducted over Skype or email, OCR programmes, etc.). However, there's comparatively little academic research that uses the digital to study the digital, and that's the gap in the market that DMI attempts to fill.
This looks like using digital tools and Snowden's data leaks to study mass surveillance, using content analysis to study climate change, or studying Wikipedia as a site of cultural heritage. In the case of the group I've been working with, we've been collaborating with the British Home Office and using various tools to study the Alt-Right. Others in the group have used tools that scrape data from YouTube, Twitter and Facebook and allow them to map the networks that result from this.
As my own PhD primarily uses Reddit as its site of analysis, and I wanted to get some methodological skills out of this summer school, I decided to join the project and help to map the origin, spread and use of the language of the Alt-Right across Reddit. To that end, I've been using Google's BigQuery API, along with good old Microsoft Excel, to look at the words "cuck" and "kek", and use them as a window onto the social dynamics of Alt-Right communities.
I've already written quite a lot on my empirical findings, which I'll publish in due course, but here I just wanted to take some time to reflect on the things I've learned about methodology so far. For my own research, I primarily use qualitative methods: I look at all of the new posts published on the Paleo subreddit, and then I categorise them by topics of conversation and use that categorisation and my own readings of the posts and comments to try to draw inferences about the nature of discourse in that community, focussing on how authority is negotiated and contested. In September, I plan to start seeking interviews with members of that community and the wider paleo and nutrition communities in order to better understand what it is that drives people towards believing and practising one set of nutritional precepts over another. These are all reasonably tried-and-tested qualitative methods that provide rich understandings of interactions between people and the dynamics involved in the communities one studies.
But there's always been this nagging sensation that I could be doing something different and better. The people who do quantitative methods produce these firm conclusions that they can make inferences from, right? And they always have these cool, pretty graphics that let them display their research in a way that just invites people to click and read. They get to write things that make headlines. Their research has impact. They get to make GIFs showing the most used words in /r/The_Donald in every month of its existence:
So I wanted to dip my toe in this other academic pool and see if it could resolve some of the anxieties about my research that I've been feeling. Perhaps what I'm doing right now is just intrinsically less worthwhile than a different methodology that would allow me to say more things, process more data, be more interesting.
If you're reading this in the tone I intended, you've likely already concluded that the above is not an accurate reflection of what this research has been like. In reality, the fact that I'm typing something into a fancy programming interface, using SQL syntax I'd never heard of before Tuesday, doesn't mean that what I've been doing has produced any more significant results, or been any better, than the traditional methods I'd been using up until last week.
When looking at words, and trying to understand where they've come from and how they've spread, being able to query a big database and get a picture of when they have and haven't appeared is useful. But it doesn't tell you all that much: you don't know anything about the context in which the word was used. It doesn't give you the complete picture, in the same way that getting the metadata about someone's phone calls gives you an incomplete picture of who they are and what they've been doing. That picture can be filled in with more knowledge of that person, in the same way that my picture could be filled in with background knowledge about the events that have driven the rise and spread of the Alt-Right. But without that knowledge, and without the ability to theorise, it's pretty useless. Worse than that, it's dangerous, because it gives you the false impression that you're finding interesting things when you might not be.
The dangers of data, even with theory
To pin this down with an example: the other day, I was trying to understand who gets labelled a "cuck" and who gets labelled "based" in The_Donald, the biggest Alt-Right community. A cuck is someone who is emasculated and inferior and so on; someone who is "based" has accepted the dogma of the Alt-Right and is generally considered a good egg. You can see my findings below:
You can see that the words that tend to come up with "cuck" are those you might expect: globalists, Macron (this data was from May 2017, when they were quite upset at France's failure to elect Le Pen), liberals, "betas", and so on. Likewise, Sean Hannity and Poland are based, as are patriots.
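Association lists like these come out of simple co-occurrence counting: for every appearance of the keyword, tally the words that appear within a few tokens of it. A sketch of the idea (the window size and tokenisation here are my own illustrative choices, not necessarily what the tools I used do):

```python
from collections import Counter
import re

def collocates(comments, keyword, window=3):
    """Count words appearing within `window` tokens of `keyword`."""
    counts = Counter()
    for body in comments:
        tokens = re.findall(r"[a-z']+", body.lower())
        for i, tok in enumerate(tokens):
            if tok == keyword:
                lo = max(0, i - window)
                neighbours = tokens[lo:i] + tokens[i + 1:i + 1 + window]
                counts.update(neighbours)
    return counts

sample = [
    "macron is such a cuck",
    "liberals are cucks",  # "cucks" != "cuck": exact matching misses it
    "what a cuck macron is",
]
print(collocates(sample, "cuck").most_common(3))
```

Even this toy example hints at the fragility of the method: stopwords dominate the counts, and inflected forms slip through exact matching.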
But then I fell afoul of not having the complete picture, because I started to look at the other associations with "based" and saw words associated with rationalism: logic, science, data and so on. That made me assume they were talking about "based science", "based logic", etc, and that they just thought that these things were really cool and were fetishising a certain conception of knowledge.
Now, they do fetishise science and data and a certain conception of logic, but they don't call those things "based". They were obviously saying things like "making decisions based on the science", but because I had no context to how the word was being used, my own understanding of what the word was being used for in this context was blinding me to its much more commonplace usage.
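With hindsight, even a crude check would have caught the confound: "based" followed immediately by a preposition is almost always the mundane English sense. A hedged sketch of the check I wish I'd run (the heuristic and the preposition list are my own guesses, not a serious tagger):

```python
import re

def based_uses(body):
    """Split occurrences of "based" into the slang sense and the
    mundane "based on/upon/in ..." sense, using a crude heuristic:
    a following preposition signals the ordinary English usage.
    """
    slang, mundane = 0, 0
    tokens = re.findall(r"[a-z']+", body.lower())
    prepositions = {"on", "upon", "in", "around", "off"}
    for i, tok in enumerate(tokens):
        if tok == "based":
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            if nxt in prepositions:
                mundane += 1
            else:
                slang += 1
    return slang, mundane

print(based_uses("Poland is based. Decisions should be based on the science."))
# → (1, 1): one slang use, one mundane use
```

A heuristic like this is no substitute for actually reading the comments, but it would at least have flagged that the "science" and "logic" associations were coming from a different use of the word.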
This is the danger of powerful tools: they give you false confidence in your results. I could explain the data in front of me with the theory I had, and I could make up an incredibly convincing argument about it. But because I didn't have the complete picture, I was actually just making things up without even knowing I was making them up.
We're often warned about the dangers of data without theory: "p-hacking" and other dodgy statistical methods have become commonplace terms in the scientific and academic communities recently. We rarely talk about the dangers of data with theory when that data is incomplete.
I've really enjoyed getting to grips with digital tools. I like the way that they can guide me towards new places to explore, and point me in directions I didn't previously realise were relevant. They certainly produce some really cool pictures. But when it comes to research significance and impact, I no longer feel inferior because I don't use numbers and software as much as some people. What I do might not be perfect, but I think I'll stick with it.