Project Overview¶
SSHRC-funded computational social science project on the characterization of radicalization in online forums. Worked alongside Professor Finlay Maguire (Computer Science / Community Health & Epidemiology) and Professor Michael Halpin (Sociology & Social Anthropology).
A main goal of my work on this project is using computational methods to accelerate and expand qualitative sociological analyses. Using NLP and related techniques, we can augment and partly automate the qualitative work of sociologists.
The communities we analyze have been involved in multiple acts of violence. They have previously been banned from other platforms, namely Reddit, and have since migrated to an isolated online forum. We seek to characterize the radical ideas they discuss and to better understand how those ideas spread between users of the forum.
Web Scraping¶
To acquire the data from these forums we use web scraping. I wrote a scraping script which automates the process of collecting all the posts on the forum along with associated thread- and user-level metadata. The script runs every morning to collect new posts.
The forums contain several million posts across hundreds of thousands of threads written by tens of thousands of users. All the text and metadata from these forums is stored in MongoDB, a NoSQL database well suited to the large amounts of text data we work with.
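The shape of a stored post looks something like the sketch below. The field names here are illustrative, not the project's actual schema; the point is that each post carries the thread- and user-level metadata needed for later queries.

```python
from datetime import datetime, timezone

def post_to_document(post_id, thread_id, author, body, posted_at):
    """Normalize one scraped forum post into a MongoDB-ready document.

    Field names are illustrative, not the project's actual schema.
    """
    return {
        "_id": post_id,          # the forum's post ID doubles as the document key
        "thread_id": thread_id,  # thread-level metadata joins on this field
        "author": author,        # user-level metadata joins on this field
        "body": body,            # full post text, queryable via a text index
        "posted_at": posted_at,  # stored as a datetime so range queries work
        "scraped_at": datetime.now(timezone.utc),
    }
```

Using the post ID as `_id` also makes the nightly run idempotent: with pymongo, re-scraped posts can be written with `update_one(..., upsert=True)` instead of creating duplicates.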
Web App¶
I built a web app for exploring the data collected from scraping and stored in MongoDB. Since we work with sociologists who aren't necessarily familiar with NoSQL queries, we wanted a web app that provides a no-code platform for exploring the data.
This web app was built using Plotly Dash. I chose Dash because it is a versatile framework, built on top of Flask, that offers a lot of flexibility and handles large amounts of data well. The web app connects to MongoDB and queries the data from there.
The queries are very flexible, with multiple field input options to query by date, full text, regex, various metadata, and more. The results can be output as either JSON or CSV, and each output contains interactive features for exploring the data further. The results can also be downloaded for further analysis elsewhere.
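The core of this is translating form fields into a MongoDB filter document. A minimal sketch of the idea, with assumed field names (`posted_at`, `body`, `author`) and far fewer options than the real app:

```python
def build_filter(start=None, end=None, text=None, author=None, use_regex=False):
    """Assemble a MongoDB filter dict from web-app form inputs.

    Field names are assumptions for illustration; the real app exposes
    many more query options.
    """
    query = {}
    if start or end:
        date_range = {}
        if start:
            date_range["$gte"] = start  # inclusive lower bound
        if end:
            date_range["$lt"] = end     # exclusive upper bound
        query["posted_at"] = date_range
    if text:
        if use_regex:
            query["body"] = {"$regex": text, "$options": "i"}
        else:
            query["$text"] = {"$search": text}  # requires a text index
    if author:
        query["author"] = author
    return query
```

The resulting dict is passed straight to `collection.find(query)`, so the sociologists never touch query syntax themselves.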
Multiple figures are automatically generated with each query to show the distributions in posts, threads, and users over time along with other useful summary statistics.
The web app is only used internally and therefore is not accessible to the wider public.
Topic Modelling¶
To characterize the radical ideas being discussed on the forums we use topic models. These models have been around for decades, and over time multiple kinds of topic models have been created. We have explored and tested several on our data.
The two main types of models we have explored are LDA and BERTopic.
We first explored an older, well-known and commonly used topic model called Latent Dirichlet Allocation (LDA). This is a probabilistic model which assumes that every document is a mixture of topics and that every topic is a mixture of words. The training process identifies which words are most likely to belong to the same topic and which topics are most likely to appear in each document.
The base LDA model that is commonly used does have limitations. Since it is probabilistic in nature and its training process starts from random inputs, the results can vary quite a bit between runs. This led us to explore a variant of this model called Ensemble LDA.
The basic idea of Ensemble LDA is to train multiple LDA models on the same data and only keep the topics which overlap across multiple individual models. This filters out some of the noise and variation between models and returns more stable, consistent topics.
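A toy illustration of that filtering step, assuming topics are represented as sets of their top words and compared by Jaccard overlap (the real model, e.g. gensim's `EnsembleLda`, matches topics on their full word distributions rather than top-word sets):

```python
def stable_topics(runs, threshold=0.5):
    """Keep topics (sets of top words) that recur across independent LDA runs.

    A toy sketch of the Ensemble LDA idea: a topic survives only if some
    other run produced a sufficiently similar topic.
    """
    def jaccard(a, b):
        return len(a & b) / len(a | b)

    kept = []
    for i, run in enumerate(runs):
        for topic in run:
            # stable = a similar topic appears in at least one other run
            recurs = any(
                jaccard(topic, other) >= threshold
                for j, other_run in enumerate(runs) if j != i
                for other in other_run
            )
            # deduplicate against topics we've already kept
            if recurs and not any(jaccard(topic, k) >= threshold for k in kept):
                kept.append(topic)
    return kept
```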
Though the results from this model were good we wanted to explore more modern, context-aware approaches. This led us to BERTopic, a model which clusters sentence embeddings to identify topics in a corpus.
The benefit of BERTopic and the embedding approach is that sentence embeddings are context-aware, whereas LDA ignores both the order of words and the context in which they are used.
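The clustering step can be sketched in miniature. This is a deliberately simplified stand-in for what BERTopic actually does (it uses UMAP for dimensionality reduction, HDBSCAN for clustering, and c-TF-IDF to label the resulting topics); here a greedy cosine-similarity pass just shows how nearby embeddings end up in the same topic:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cluster_embeddings(embeddings, threshold=0.8):
    """Greedy cosine clustering: assign each embedding to the first
    cluster whose seed vector is similar enough, else start a new cluster.

    A toy sketch only; not BERTopic's actual algorithm.
    """
    seeds, labels = [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for cid, seed in enumerate(seeds):
            sim = cosine(emb, seed)
            if sim >= best_sim:
                best, best_sim = cid, sim
        if best is None:
            seeds.append(list(emb))       # this embedding starts a new topic
            labels.append(len(seeds) - 1)
        else:
            labels.append(best)
    return labels
```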
However, BERTopic requires significantly more computational resources to run. This led me to set up the training script on one of the compute clusters of the Digital Research Alliance of Canada. These clusters offer substantial resources, including powerful GPUs, which lets us use GPU acceleration to speed up the training of this model.
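A job on these clusters is submitted through SLURM. The script below is a minimal sketch of such a submission; the account name, resource numbers, and training script name are all placeholders, not the project's actual configuration:

```shell
#!/bin/bash
#SBATCH --account=def-someuser     # placeholder allocation account
#SBATCH --gpus-per-node=1          # one GPU for embedding + training
#SBATCH --mem=64G                  # placeholder memory request
#SBATCH --time=12:00:00            # placeholder wall-time limit

module load python/3.11            # load a Python module on the cluster
source ~/env/bin/activate          # activate a pre-built virtualenv
python train_bertopic.py           # hypothetical training script name
```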
This is an active area of our research. We continue to work on training this model to get the best results we can.
Social Network Analysis¶
Beyond characterizing the radical ideas being discussed on these forums, we also seek to understand how they are shared and spread between users. To analyze this we turn to social network analysis and information diffusion.
The users of these forums form a social network through their interactions on the threads of the forum. Every time a user posts on a thread and another user posts afterwards, a connection between them is formed. Both users are likely to have read the other's post and therefore to have been exposed to the ideas within it.
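Building those connections can be sketched as follows, assuming each thread is a chronologically ordered list of post authors (a simplification; the connection rule could also span non-consecutive posts):

```python
from collections import Counter

def reply_edges(threads):
    """Build weighted edges from who-posts-after-whom within each thread.

    `threads` maps thread ID -> ordered list of post authors. Edge weights
    count how often two users posted consecutively; self-replies are skipped.
    """
    edges = Counter()
    for authors in threads.values():
        for earlier, later in zip(authors, authors[1:]):
            if earlier != later:
                edges[frozenset((earlier, later))] += 1  # undirected edge
    return {tuple(sorted(pair)): w for pair, w in edges.items()}
```

The `(user, user) -> weight` output drops straight into a graph library; with networkx, for instance, `G.add_weighted_edges_from((u, v, w) for (u, v), w in edges.items())`.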
Through constructing a social network of the users we can understand the communities they form within this larger forum, and identify which users are more central to the network and may have more influence on others.
With a social network of the users we can also begin to analyze the diffusion of their ideas. Using information diffusion models, we can model the spread of radical content. These models are often based on epidemiological models of infection spread, in which each person is assumed to be either infected, susceptible, or (sometimes) recovered, with a probability of transitioning between these states. In information diffusion models we apply this process to a social network graph to get a deeper understanding of who is spreading what information and who has the greatest influence on others in the forum.
This is an active area of research on this project. We continue to work on these models.