Reddit Topic Telescope

Report for Document Analysis course - Reddit Topics and Sentiment Analysis

View the Project on GitHub OksanaKalytenko/docana-project-Reddit-Topic-Telescope

Group members: Oksana Kalytenko, Mariia Pyvovar, Alisea Stroligo

Abstract

This project analyses topics in the “ExplainLikeImFive” subreddit to understand community interests and sentiments through topic extraction and sentiment evaluation. It then combines the findings of these two text-analysis techniques (Topic Modeling and Sentiment Analysis) to also infer non-textual properties of the dataset (e.g. general behaviors of users online).

Introduction

Given the provided dataset, we looked for inspiration on what could be achieved with this kind of data. After some research, we found commercial dashboard tools that analyse topics and sentiment in documents. We liked the idea of visualising topics together with sentiment to reveal hidden meaning in texts written by different users. This served as our main inspiration, and the analysis was split into three parts: Topic Modeling, Sentiment Analysis and Network Analysis.

With these analyses, our goal is to gain summarised insights into a large number of documents, with the prospect of drawing conclusions that can inform actions in the real world.

For topic modeling we decided to use BERTopic. As a starting point we used the original paper by Grootendorst (2022) [1] as well as a short write-up on the usage of the model by the same author on towardsdatascience.com [2]. This model extracts topics by clustering the documents while keeping the most important words of each topic, which allows us to manually assign a best-fitting name to each topic.
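To illustrate how the "most important words" of a cluster can be found, the following is a minimal sketch of the class-based TF-IDF weighting described in [1], written in plain Python rather than with the BERTopic library. The function names and the toy clusters are hypothetical, and the sketch assumes that documents have already been clustered and tokenised; it is not the code used in our notebooks.

```python
import math
from collections import Counter


def class_tfidf(clusters):
    """Score words per cluster with a class-based TF-IDF.

    clusters: dict mapping a cluster id to a list of tokenised documents.
    Returns a dict mapping each cluster id to {word: score}.
    """
    # Treat each cluster as one long document and count its words.
    tf = {
        cid: Counter(word for doc in docs for word in doc)
        for cid, docs in clusters.items()
    }
    # Frequency of each word across all clusters.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # Average number of words per cluster.
    avg_words = sum(total.values()) / len(tf)
    # Frequent-in-cluster words score high; words shared by clusters are damped.
    return {
        cid: {w: n * math.log(1 + avg_words / total[w]) for w, n in counts.items()}
        for cid, counts in tf.items()
    }


def top_words(scores, k=3):
    """Return the k best-scoring words (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


# Hypothetical toy clusters, not taken from our dataset.
toy_clusters = {
    0: [["bank", "interest", "loan", "the"], ["interest", "rate", "bank"]],
    1: [["gene", "dna", "evolution"], ["dna", "gene", "the"]],
}
scores = class_tfidf(toy_clusters)
print(top_words(scores[0], 2))
print(top_words(scores[1], 2))
```

Words that appear in several clusters (here "the") receive a lower weight than words that are specific to one cluster, which is what makes the top words usable as a basis for naming a topic.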

For sentiment analysis we chose VADER, inspired by the foundational work of Hutto and Gilbert (2014), ‘VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text’ [3], which introduced VADER as an effective tool for analyzing sentiment in social media content. Our exploration was also informed by the paper ‘Comparative Analysis of Lexicon and Machine Learning Approach for Sentiment Analysis’ [4], which discusses the strengths and weaknesses of two prominent models, VADER and Afinn.

Additionally, we took inspiration from the paper by Buntain et al. (2014) [5] and from the following project on Reddit Network Analysis https://github.com/samridhprasad/reddit-analysis [6], both of which are concerned with the distribution of posts by users between different subreddits and find Reddit-specific properties in the users’ activity. We were therefore motivated to try to find similar features in our dataset. In the absence of complete information on the user interactions and given the choice of reducing the dataset to a single subreddit, we challenged ourselves to find similar non-textual properties (i.e., users’ behavior on the specific online platform) by exploiting the results of computational linguistics tools (i.e. Topic Modeling and Sentiment Analysis on text).

Dataset

The dataset used for the analyses is a corpus of preprocessed posts from the Reddit dataset Webis-TLDR-17, available at https://huggingface.co/datasets/webis/tldr-17. After preprocessing, all further analyses (Topic Modeling, Sentiment Analysis and Network Analysis) used only a subset of the data, namely all content from the subreddit ‘explainlikeimfive’. We chose this subreddit for three reasons. First, content: we expected it to cover a wide variety of topics, which is ideal for Topic Modeling. Second, size: after preprocessing, it ranked among the 20 most visited subreddits but not among the top ten, so it provided a large amount of content without being among the most computationally expensive to analyse. Third, representativeness: its number of posts, number of authors and frequency of posts per author (important features for Network Analysis) were similar to those of other subreddits, so it seemed to be a representative sample of the whole original dataset. The complete data from the subreddit ‘explainlikeimfive’ can be found in a .zip file in the code folder of our GitHub repository.
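The selection criteria above (number of posts, number of authors, posts per author) can be sketched as follows. This is an illustrative example on hypothetical toy records, assuming each post is a dictionary with `author`, `subreddit` and `content` keys; the helper names are ours and the sketch does not reproduce the preprocessing notebook.

```python
from collections import Counter, defaultdict


def subreddit_stats(posts):
    """Return {subreddit: (n_posts, n_authors, posts_per_author)}."""
    n_posts = Counter()
    authors = defaultdict(set)
    for post in posts:
        n_posts[post["subreddit"]] += 1
        authors[post["subreddit"]].add(post["author"])
    return {
        sub: (n, len(authors[sub]), n / len(authors[sub]))
        for sub, n in n_posts.items()
    }


def select_subreddit(posts, name):
    """Keep only the posts of one subreddit (case-insensitive match)."""
    return [p for p in posts if p["subreddit"].lower() == name.lower()]


# Hypothetical toy records standing in for the preprocessed corpus.
toy_posts = [
    {"author": "a", "subreddit": "explainlikeimfive", "content": "Why is the sky blue?"},
    {"author": "b", "subreddit": "explainlikeimfive", "content": "How do loans work?"},
    {"author": "a", "subreddit": "AskReddit", "content": "Favourite book?"},
    {"author": "a", "subreddit": "explainlikeimfive", "content": "What is DNA?"},
]

stats = subreddit_stats(toy_posts)
eli5 = select_subreddit(toy_posts, "explainlikeimfive")
print(stats["explainlikeimfive"], len(eli5))
```

Comparing these three numbers across subreddits is one simple way to check whether a single subreddit looks similar to the rest of the corpus before restricting the analysis to it.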

Methods

Setup

All the notebooks are in the /code folder, together with their requirements files. After installing the requirements, the notebooks can be run in any compatible environment.

Data Preprocessing Setup

To install the Python libraries for Data Preprocessing with the correct versions, run the following command in the terminal:

pip install -r requirements_preprocessing.txt

Topic Modeling Setup

To install the Python libraries for Topic Modeling using BERTopic with the correct versions, run the following command in the terminal:

pip install -r requirements_topic.txt

Sentiment Analysis Setup

Operating System: Darwin (Darwin Kernel Version 23.5.0)

To install the Python libraries for Sentiment Analysis with the correct versions, run the following command in the terminal:

pip install -r requirements_sentiment.txt  

Network Analysis Setup

To install the Python libraries for Network Analysis with the correct versions, run the following command in the terminal:

pip install -r requirements_networkanalysis.txt  

Experiments

Initial Preprocessing Steps

Summary of Topic Modeling Workflow

Summary of Sentiment Analysis Workflow

In our sentiment analysis experiments, we explored two models: VADER and the lexicon-based text analyzer Afinn. Both rely on pre-defined sentiment lexicons rather than training data, but they excel in different aspects of sentiment analysis.

Without access to human-labeled data for direct accuracy comparisons, however, it is difficult to assess definitively which model was more accurate.

VADER is known for handling informal language, emojis, nuanced sentiments, and sarcasm. This makes it particularly well suited to social media content and other informal text. Although we could not verify this against human-labeled data, VADER's results in our experiments consistently appeared stronger in these areas.
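To make the lexicon-based idea concrete, the sketch below scores a text by summing word valences (the basic principle behind Afinn) and then squashes the raw sum into the range (-1, 1) in the way VADER computes its compound score, as we understand it from [3]. The mini-lexicon, the function names and the 0.05 labelling threshold are assumptions for illustration only; the real tools use much larger lexicons, and VADER adds rules for negation, capitalisation and punctuation that are not modelled here.

```python
import math

# Hypothetical mini-lexicon; real lexicons contain thousands of entries.
TOY_LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.5, "terrible": -3.0}


def summed_valence(text, lexicon=TOY_LEXICON):
    """Afinn-style score: the plain sum of the valences of known words."""
    tokens = text.lower().split()
    return sum(lexicon.get(tok.strip(".,!?"), 0.0) for tok in tokens)


def normalised_score(raw, alpha=15.0):
    """VADER-style normalisation of a raw sum into the interval (-1, 1)."""
    return raw / math.sqrt(raw * raw + alpha)


def label(compound, threshold=0.05):
    """Map a normalised score to a coarse label (assumed threshold)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


for sentence in ["Terrible explanation!", "A great answer, but a terrible title."]:
    raw = summed_valence(sentence)
    print(sentence, raw, label(normalised_score(raw)))
```

The normalisation makes scores comparable between short and long posts, which matters when averaging sentiment over texts of very different lengths.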

Summary of Network Analysis Workflow

Results and Discussion

Our results are available as a dashboard.

Topic Modeling

A total of 30 different topics were identified in the dataset. “Economic and Financial Systems” is the largest topic in our subreddit. The next most popular topics are “Genetics and Human Evolution”, “Physics and Cosmology”, “Gaming and Software Development” and “Nutrition and Dietary Health”.

Some topics may overlap, such as “Mathematical Concepts and Number Theory” and “Gaming and Software Development”. The “Genetics and Human Evolution” topic contains many posts on sexuality and women's rights, so it could perhaps be split further into two topics. More manual inspection appears to be needed.

Next steps:

Sentiment Analysis

Our analysis reveals that topics such as Cultural Diversity and Ethnic Background, Firearms Public Safety, Human Anatomy and Health, Immunology and Diseases, Law Enforcement, Affairs and Religious Extremism, Welfare, and Diplomatic Relations tend to have more negative sentiment on average.
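The per-topic averages behind this statement can be computed by joining each post's topic with its sentiment score. The following is a minimal sketch under the assumption that every post has one topic label and one compound score; the helper names are hypothetical and the scores in the toy rows are made up for illustration, not taken from our results.

```python
from collections import defaultdict


def mean_sentiment_by_topic(rows):
    """rows: iterable of (topic_name, compound_score) pairs, one per post."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for topic, score in rows:
        sums[topic] += score
        counts[topic] += 1
    return {topic: sums[topic] / counts[topic] for topic in sums}


def most_negative(means, k=3):
    """Return the k topics with the lowest mean score, most negative first."""
    return sorted(means, key=means.get)[:k]


# Made-up scores for illustration only.
toy_rows = [
    ("Law Enforcement", -0.5),
    ("Law Enforcement", -0.25),
    ("Physics and Cosmology", 0.5),
    ("Physics and Cosmology", 0.0),
    ("Nutrition and Dietary Health", -0.125),
]
means = mean_sentiment_by_topic(toy_rows)
print(most_negative(means, 2))
```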

Network Analysis

Our main findings in this final part of the project were that most users participated in the subreddit with only one post and (therefore) were also interested in only one topic on average. However, there do seem to be a few very active users who participate in conversations on multiple topics; these might be the “answer-persons” [5], a well-known role on many other online platforms. Overall, there seemed to be little interaction between users, and each user focused on a specific topic, potentially indicating that Reddit is used mainly to retrieve information and opinions rather than to connect with people through extensive and repeated discussion. The distribution of user activity therefore seems to be in line with our chosen reference paper [5] and project [6], which show similar results. A few questions emerged from the network analysis:

These findings and related questions provide insights into potential further analyses.
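The user-level measures discussed in this section (posts per user and distinct topics per user) can be sketched as follows. The example assumes that each post has already been reduced to an (author, topic) pair; the user names, helper names and toy data are hypothetical and do not reproduce our network analysis notebook.

```python
from collections import Counter, defaultdict


def user_activity(posts):
    """posts: iterable of (author, topic). Returns {author: (n_posts, n_topics)}."""
    n_posts = Counter()
    topics = defaultdict(set)
    for author, topic in posts:
        n_posts[author] += 1
        topics[author].add(topic)
    return {author: (n_posts[author], len(topics[author])) for author in n_posts}


def summarise(activity, min_topics=2):
    """Return the share of single-post users and the multi-topic users."""
    one_post = sum(1 for n, _ in activity.values() if n == 1)
    multi = sorted(a for a, (_, t) in activity.items() if t >= min_topics)
    return one_post / len(activity), multi


# Hypothetical toy data: three one-off users and one very active user.
toy_posts = [
    ("user_a", "Physics and Cosmology"),
    ("user_b", "Economic and Financial Systems"),
    ("user_c", "Nutrition and Dietary Health"),
    ("user_d", "Physics and Cosmology"),
    ("user_d", "Economic and Financial Systems"),
    ("user_d", "Genetics and Human Evolution"),
]
activity = user_activity(toy_posts)
share_single, multi_topic_users = summarise(activity)
print(share_single, multi_topic_users)
```

Users returned in the second value are candidates for the “answer-person” role, since they post repeatedly and across several topics.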

Conclusion

Topic Modeling

This analysis shows how natural language processing helps uncover what interests the community. “Economic and Financial Systems” stood out as the most discussed topic, indicating a strong interest in simplified explanations of complex financial concepts. Other popular topics included “Genetics and Human Evolution”, “Physics and Cosmology”, “Gaming and Software Development”, and “Nutrition and Dietary Health”, showing the community’s diverse interests in science, technology, and health. These insights might be helpful for governments and other creators of educational content to focus their efforts towards the topics outlined in this project and create more accessible content in these rather complex areas.

Sentiment Analysis

Topics like Cultural Diversity, Firearms Public Safety, and Law Enforcement tend to show more negative sentiment, probably because these discussions are controversial and often touch on sensitive, polarizing issues. They often provoke strong negative emotions and frustration in the community and reflect broader societal issues such as discrimination, racism, politics, and religion. Understanding these sentiments helps us track public opinion and anticipate how the community might react.

Network Analysis

In this final part of our analyses we were able to combine Topic Modeling and Sentiment Analysis through Network Analysis in order to find non-textual properties of our dataset starting from the users’ posts only. Our results, discussed in the previous section, also seem to be in line with the findings of our reference paper [5] and project [6].

Overall, we were able to test, on our chosen subreddit, how effectively computational linguistic tools provide insights into a dataset along a variety of dimensions, including properties that are not strictly textual.

Contributions

| Team Member | Contributions |
| --- | --- |
| Oksana Kalytenko | Data Preprocessing, Topic Modeling, Dashboard Creation & Design |
| Mariia Pyvovar | Data Preprocessing, Sentiment Analysis |
| Alisea Stroligo | Data Preprocessing, Network Analysis |

References