To acquire the data from these forums, we use web scraping. I wrote a scraping script that automates collecting every post on the forums, along with the associated thread-level and user-level metadata. The script runs every morning to collect new posts.
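
As a rough illustration of the parsing step, here is a minimal sketch using only the Python standard library. It is not the actual scraping script: the page markup (a `div` with class `post` carrying `data-post-id` and `data-author` attributes) and the names `PostParser`, `parse_posts`, and `new_posts` are assumptions made for the example, and downloading the pages is left out.

```python
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Collects posts from one thread page (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.posts = []       # finished posts, in page order
        self._current = None  # post being read, or None outside a post

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Assumed markup: <div class="post" data-post-id="..." data-author="...">
        if tag == "div" and attrs.get("class") == "post":
            self._current = {
                "post_id": attrs.get("data-post-id"),
                "author": attrs.get("data-author"),
                "text": "",
            }

    def handle_data(self, data):
        # Accumulate text only while inside a post element.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Assumes a post body contains no nested <div> elements.
        if tag == "div" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.posts.append(self._current)
            self._current = None


def parse_posts(html):
    """Return a list of post dicts found in the given HTML string."""
    parser = PostParser()
    parser.feed(html)
    parser.close()
    return parser.posts


def new_posts(posts, seen_ids):
    """Keep only posts whose id was not collected on an earlier run."""
    return [p for p in posts if p["post_id"] not in seen_ids]
```

A daily run could call `parse_posts` on each downloaded page and pass the result through `new_posts` with the set of ids already collected, so that only unseen posts are kept.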

 

The forums contain several million posts across hundreds of thousands of threads, written by tens of thousands of users. All the text and metadata from these forums are stored in MongoDB, a NoSQL database well suited to the large volumes of text data we work with.
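
To make the storage step concrete, the sketch below shows one way a post might be written to a MongoDB collection. The document layout, the field names, and the helpers `post_document` and `store_posts` are assumptions for illustration, not the schema actually used. Each post is assumed to be a dict with `post_id`, `author`, and `text` keys. The `collection` argument is expected to behave like a PyMongo collection, of which only `update_one` with `upsert=True` is used.

```python
def post_document(post, thread_id):
    """Build the document stored for one post (hypothetical layout)."""
    return {
        "_id": post["post_id"],  # forum post id doubles as the primary key
        "thread_id": thread_id,
        "author": post["author"],
        "text": post["text"],
    }


def store_posts(collection, posts, thread_id):
    """Upsert each post so that re-running the scraper creates no duplicates."""
    for post in posts:
        doc = post_document(post, thread_id)
        # Everything except the key goes into $set; the key is the filter.
        fields = {k: v for k, v in doc.items() if k != "_id"}
        collection.update_one({"_id": doc["_id"]}, {"$set": fields}, upsert=True)
    return len(posts)
```

With PyMongo installed and a server running, the collection could be obtained with something like `MongoClient()["forums"]["posts"]`, where the database and collection names are placeholders. Keying documents on the post id means a post scraped twice is updated in place rather than inserted again.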