Report for Document Analysis course - Reddit Topics and Sentiment Analysis
View the Project on GitHub OksanaKalytenko/docana-project-Reddit-Topic-Telescope
Group members: Oksana Kalytenko, Mariia Pyvovar, Alisea Stroligo
This project analyses topics in the “ExplainLikeImFive” subreddit to understand community interests and sentiments through topic extraction and sentiment evaluation. It then combines the findings of these two text-analysis techniques (Topic Modeling and Sentiment Analysis) to also infer non-textual properties of the dataset (e.g. general behaviors of users online).
With the provided dataset, we looked for inspiration on what could be achieved with this kind of data. After some research we found commercial dashboard tools that analyse topics and sentiment in documents. We liked the idea of visualising topics together with sentiment to show hidden meaning inside texts written by different users. This served as our main inspiration, and the analysis was split into three parts: Topic Modeling, Sentiment Analysis and Network Analysis.
With these analyses, our goal is to gain summarised insights into a large number of documents, with the prospect of drawing conclusions that can inform actions in the real world.
For topic modeling we decided to use BERTopic. As a starting point we used the original paper by Grootendorst M. (2022) [1] as well as a short write-up on the usage of the model by the original author on towardsdatascience.com [2]. With this model we are able to extract topics by leveraging clustering, while keeping the most important words inside each topic. This allows us to manually create the best-fitting name for each topic.
Inspired by the foundational work of Hutto and Gilbert (2014) on VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text [3], which introduced VADER as an effective tool for analyzing sentiment in social media content, we chose to conduct our analysis specifically using VADER. Additionally, our exploration was informed by the comparative analysis presented in the paper ‘Comparative Analysis of Lexicon and Machine Learning Approach for Sentiment Analysis’ [4]. This study discussed the strengths and weaknesses of two prominent models, VADER and Afinn.
Additionally, we took inspiration from the paper by Buntain et al. (2014) [5] and from the following project on Reddit Network Analysis https://github.com/samridhprasad/reddit-analysis [6], both of which are concerned with the distribution of posts by users between different subreddits and find Reddit-specific properties in the users’ activity. We were therefore motivated to try to find similar features in our dataset. In the absence of complete information on the user interactions and given the choice of reducing the dataset to a single subreddit, we challenged ourselves to find similar non-textual properties (i.e., users’ behavior on the specific online platform) by exploiting the results of computational linguistics tools (i.e. Topic Modeling and Sentiment Analysis on text).
The dataset used for the analyses is a corpus of preprocessed posts from the Reddit dataset Webis-TLDR-17, available at https://huggingface.co/datasets/webis/tldr-17. After preprocessing, all further analyses (Topic Modeling, Sentiment Analysis and Network Analysis) used only a subset of the data: all content from the subreddit ‘explainlikeimfive’. This subreddit was chosen for three reasons. First, content: we expected it to cover a variety of topics, which is optimal for Topic Modeling. Second, size: after preprocessing, it appeared among the 20 most visited subreddits but not among the first ten, so it provided a large amount of content without being among the most computationally expensive to analyse. Third, representativeness: its number of posts, number of authors and frequency of posts per author (important features for Network Analysis) were similar to those of other subreddits, so it seemed to be a representative sample of the whole original dataset. The complete data from the subreddit ‘explainlikeimfive’ can be found in a .zip file in the code folder of our GitHub repository.
All the notebooks are in the /code folder with their requirements.txt files. After installing the requirements the notebooks can be run in any compatible environment.
To install Python libraries for Data Preprocessing with the correct version run the following command in the terminal:
pip install -r requirements_preprocessing.txt
To install Python libraries for Topic Modeling using BERTopic with the correct version run the following command in the terminal:
pip install -r requirements_topic.txt
To install Python libraries for Sentiment Analysis with the correct version run the following command in the terminal (operating system used: Darwin, Darwin Kernel Version 23.5.0):
pip install -r requirements_sentiment.txt
To install Python libraries for Network Analysis with the correct version run the following command in the terminal:
pip install -r requirements_networkanalysis.txt
First, embeddings for the content were pre-calculated using the SentenceTransformer model “all-MiniLM-L6-v2”. This step ensures efficient handling of text data in subsequent processes.
To ensure reproducibility in topic modeling results, the UMAP dimensionality reduction algorithm was configured with a fixed random state. This setup mitigates the stochastic nature of UMAP, providing consistent results across multiple runs.
The number of topics was controlled using HDBSCAN’s min_cluster_size parameter. Adjusting this parameter influences the number of clusters formed: a higher value results in fewer topics, while a lower value generates more topics.
A CountVectorizer was employed to transform the text data into a numerical format, removing English stop words and considering both unigrams and bigrams.
Various models inspired by KeyBERT, SpaCy’s Part-of-Speech tagging, and Maximal Marginal Relevance (MMR) were used to diversify and enhance the topic words. These models were combined into a representation model for BERTopic.
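To illustrate how Maximal Marginal Relevance diversifies topic words, here is a minimal pure-Python sketch. It is not the BERTopic implementation: the function names, the toy two-dimensional vectors and the `diversity` values are assumptions chosen only to show the trade-off between relevance to the topic and redundancy with words already selected.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(topic_vec, candidates, top_n=3, diversity=0.3):
    """Pick top_n words, trading relevance to the topic against redundancy.

    candidates maps each word to its (hypothetical) embedding vector.
    diversity=0 ranks purely by relevance; higher values penalise words
    that are similar to words already selected.
    """
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < top_n:

        def score(word):
            relevance = cosine(remaining[word], topic_vec)
            redundancy = max(
                (cosine(remaining[word], candidates[s]) for s in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy embeddings (invented for illustration): "cash" is nearly a duplicate of "money".
toy_words = {"money": [1.0, 0.0], "cash": [0.98, 0.2], "bank": [0.6, 0.8]}
relevance_only = mmr([1.0, 0.0], toy_words, top_n=2, diversity=0.0)
diversified = mmr([1.0, 0.0], toy_words, top_n=2, diversity=0.7)
```

With no diversity penalty the two near-synonyms are chosen; with a strong penalty the second slot goes to the less redundant word.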
The BERTopic model was configured with the pre-calculated embedding model, UMAP model, HDBSCAN model, vectorizer model, and the combined representation models. The model was set to identify the top 10 words per topic and was run in verbose mode to provide detailed outputs.
The BERTopic model was fitted and transformed on the content data, using the pre-calculated embeddings to generate topics and their probabilities.
To ensure that all documents were assigned to a topic, outliers generated by HDBSCAN were mapped to existing topics. This step helps in creating accurate topic representations and reduces unassigned data points.
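The steps above can be sketched as a single function. This is a sketch under assumptions, not the project's notebook code: the function name `run_topic_pipeline`, the numeric hyperparameters (`min_cluster_size=150`, `random_state=42`, the UMAP settings, `diversity=0.3`) and the spaCy model name `en_core_web_sm` are illustrative choices; only the embedding model name, the stop-word and n-gram settings, the top 10 words and verbose mode come from the description above.

```python
def run_topic_pipeline(docs, *, min_cluster_size=150, random_state=42, top_n_words=10):
    """Fit BERTopic on docs with explicit sub-models and reassign outliers.

    All numeric defaults are illustrative assumptions, not the project's values.
    """
    # Heavy third-party imports are kept local so the function can be defined
    # (and inspected) without the libraries being installed.
    from bertopic import BERTopic
    from bertopic.representation import (
        KeyBERTInspired,
        MaximalMarginalRelevance,
        PartOfSpeech,
    )
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    # 1. Pre-calculate embeddings once so they can be reused across runs.
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = embedding_model.encode(docs, show_progress_bar=False)

    # 2. A fixed random_state makes the UMAP projection reproducible.
    umap_model = UMAP(
        n_neighbors=15, n_components=5, min_dist=0.0,
        metric="cosine", random_state=random_state,
    )

    # 3. A larger min_cluster_size yields fewer, broader topics.
    hdbscan_model = HDBSCAN(
        min_cluster_size=min_cluster_size, metric="euclidean",
        cluster_selection_method="eom", prediction_data=True,
    )

    # 4. Bag-of-words over unigrams and bigrams without English stop words.
    vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))

    # 5. Several representation models, combined as named aspects.
    representation_model = {
        "KeyBERT": KeyBERTInspired(),
        "POS": PartOfSpeech("en_core_web_sm"),
        "MMR": MaximalMarginalRelevance(diversity=0.3),
    }

    topic_model = BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
        representation_model=representation_model,
        top_n_words=top_n_words,
        verbose=True,
    )
    topics, _probs = topic_model.fit_transform(docs, embeddings)

    # 6. Map HDBSCAN outliers (topic -1) onto existing topics, then refresh
    #    the topic words so they reflect the new assignments.
    new_topics = topic_model.reduce_outliers(docs, topics)
    topic_model.update_topics(docs, topics=new_topics, vectorizer_model=vectorizer_model)
    return topic_model, new_topics
```

Calling `run_topic_pipeline(list_of_post_texts)` would return the fitted model and one topic id per post; it requires the libraries from requirements_topic.txt and a downloaded spaCy model.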
In our sentiment analysis experiments, we explored two different models: VADER and the lexicon-based text analyzer Afinn. Both VADER and Afinn operate using pre-defined sentiment lexicons rather than requiring training data. However, they excel in different aspects of sentiment analysis.
Without access to human-labeled data for direct accuracy comparisons, it is difficult to assess definitively which model performed more accurately.
VADER is known for handling informal language, emojis, nuanced sentiments, and sarcasm. These capabilities make VADER particularly well suited to analyzing social media content and informal text. Although we could not measure accuracy against human-labeled data, VADER appeared to perform better than Afinn in these areas throughout our experiments.
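Without labeled data, one thing that can still be measured is how often the two lexicon models agree. The sketch below assumes the scores have already been computed: VADER's compound score lies in [-1, 1] and is bucketed with the commonly used ±0.05 thresholds, while the Afinn score is bucketed by sign. Both bucketing rules and all function names are assumptions for illustration, not the thresholds used in our notebooks.

```python
from collections import Counter


def vader_label(compound, threshold=0.05):
    """Bucket a VADER compound score (range -1..1) into three classes."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def afinn_label(score):
    """Bucket an Afinn score (sum of word valences) by its sign."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def agreement(vader_scores, afinn_scores):
    """Return the share of posts with matching labels and a confusion count."""
    if len(vader_scores) != len(afinn_scores) or not vader_scores:
        raise ValueError("need two non-empty score lists of equal length")
    pairs = [
        (vader_label(v), afinn_label(a))
        for v, a in zip(vader_scores, afinn_scores)
    ]
    rate = sum(v == a for v, a in pairs) / len(pairs)
    return rate, Counter(pairs)


# Invented scores for four posts, for illustration only.
rate, confusion = agreement([0.6, -0.4, 0.0, 0.3], [3, -2, 1, 0])
```

The confusion counts show where the models diverge (for example, posts VADER treats as neutral but Afinn as positive), which are good candidates for manual inspection.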
The data used in the first step of the Network Analysis part of the project initially consisted of the complete provided Reddit dataset. This data underwent the same general filtering and preprocessing criteria listed at the beginning of this section, and a sample of 10000 posts was then used for further analysis. This data can be found in our .zip data folder as the .csv file ‘webis_tldr_clean’.
First, we explored the possibility of using all subreddit data for the network analysis. This required understanding the number of authors (the nodes of our graph) and the number of posts (the node sizes) per subreddit (each same-colored edge connects all authors active in a specific subreddit).
We subsequently plotted a Network Graph to visualize the properties of the dataset and confirm the suitability of the ‘explainlikeimfive’ subreddit selection also on the dimensions important for further network analysis.
For this part of analysis we relied on a sub-sample (100 most visited subreddits) of the dataset for visualization.
These analyses can be found in the following file: AllData_Network_Analysis.ipynb
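The per-subreddit counts described above can be sketched with the standard library alone. The field names `author` and `subreddit` follow the Webis-TLDR-17 schema; the function name and the choice to rank subreddits by number of posts are assumptions made for this illustration.

```python
from collections import defaultdict


def subreddit_stats(posts, top_k=100):
    """Return (subreddit, n_posts, n_authors) for the top_k largest subreddits.

    posts is an iterable of dicts with at least 'subreddit' and 'author' keys.
    Ranking by post count (ties broken alphabetically) is an assumption.
    """
    post_counts = defaultdict(int)
    authors = defaultdict(set)
    for post in posts:
        post_counts[post["subreddit"]] += 1
        authors[post["subreddit"]].add(post["author"])
    ranked = sorted(post_counts, key=lambda s: (-post_counts[s], s))[:top_k]
    return [(s, post_counts[s], len(authors[s])) for s in ranked]


# Tiny invented sample, for illustration only.
sample_posts = [
    {"subreddit": "explainlikeimfive", "author": "u1"},
    {"subreddit": "explainlikeimfive", "author": "u1"},
    {"subreddit": "explainlikeimfive", "author": "u2"},
    {"subreddit": "askscience", "author": "u3"},
]
stats = subreddit_stats(sample_posts, top_k=2)
```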
This part of the analysis was conducted on the output of the previous two analysis steps (Topic and Sentiment Analysis) applied to the subreddit data, which is stored in our .zip data folder as the .csv file ‘explainlikeimfive’. The combined results are stored in the file ‘webis_tldr_subreddit(topic+sentiment).csv’. This data underwent the same general filtering and preprocessing criteria listed at the beginning of this section. We plotted a Network Graph similar to the previous one, in which nodes represent users and node size represents the number of posts per user. In this case, however, edges represent topics (not subreddits, as above) and nodes are colored according to the average sentiment of each user’s posts. For this part of the analysis we again relied on a sample (1000 posts) of the dataset for visualization. These analyses can be found in the following file: Subreddit_Network_Analysis.ipynb
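As a rough sketch of how such a graph could be assembled before plotting, the function below derives node attributes and topic-labelled edges from per-post records. The record keys (`author`, `topic`, `sentiment`) and the function name are hypothetical and may not match the column names in our .csv file; connecting every pair of users who posted on the same topic is also an assumption about how the edges are drawn.

```python
from collections import defaultdict
from itertools import combinations


def build_user_graph(posts):
    """Build node attributes and topic-labelled edges from per-post records.

    Each post is a dict with hypothetical keys 'author', 'topic', 'sentiment'.
    Node size is the number of posts per user; node colour would be driven by
    the user's mean sentiment; an edge links two users sharing a topic.
    """
    sentiments_by_user = defaultdict(list)
    users_by_topic = defaultdict(set)
    for post in posts:
        sentiments_by_user[post["author"]].append(post["sentiment"])
        users_by_topic[post["topic"]].add(post["author"])

    nodes = {
        user: {"size": len(scores), "mean_sentiment": sum(scores) / len(scores)}
        for user, scores in sentiments_by_user.items()
    }
    edges = [
        (a, b, topic)
        for topic in sorted(users_by_topic)
        for a, b in combinations(sorted(users_by_topic[topic]), 2)
    ]
    return nodes, edges


# Invented records, for illustration only.
records = [
    {"author": "a", "topic": 1, "sentiment": 0.5},
    {"author": "a", "topic": 2, "sentiment": -0.5},
    {"author": "b", "topic": 1, "sentiment": 0.2},
    {"author": "c", "topic": 2, "sentiment": -0.1},
]
nodes, edges = build_user_graph(records)
```

The resulting dictionaries can be passed to a graph library of choice; note that the number of edges grows quadratically with the number of users per topic, which is one reason to plot only a sample.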
Our result as a dashboard can be found here.
A total of 30 different topics were identified in the dataset. “Economic and Financial Systems” is the largest topic in our subreddit. The next most popular topics are “Genetics and Human Evolution”, “Physics and Cosmology”, “Gaming and Software Development” and “Nutrition and Dietary Health”.
Some topics may overlap, such as “Mathematical Concepts and Number Theory” and “Gaming and Software Development”. The “Genetics and Human Evolution” topic appears to contain many posts on sexuality and women’s rights, so it could perhaps be split further into two topics. More manual inspection is needed.
Next steps:
Our analysis reveals that topics such as Cultural Diversity and Ethnic Background, Firearms Public Safety, Human Anatomy and Health, Immunology and Diseases, Law Enforcement, Affairs and Religious Extremism, Welfare, and Diplomatic Relations tend to have more negative sentiment on average.
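For clarity, "more negative sentiment on average" refers to the mean sentiment score of all posts assigned to a topic. A minimal sketch of that aggregation follows; the function name, the placeholder topic names and the scores are invented for illustration and are not our results.

```python
from collections import defaultdict
from statistics import mean


def mean_sentiment_by_topic(rows):
    """Average sentiment per topic, sorted from most negative to most positive.

    rows is an iterable of (topic_name, sentiment_score) pairs.
    """
    scores = defaultdict(list)
    for topic, score in rows:
        scores[topic].append(score)
    return sorted(
        ((topic, mean(values)) for topic, values in scores.items()),
        key=lambda item: item[1],
    )


# Invented scores for two placeholder topics.
ranking = mean_sentiment_by_topic([
    ("topic_a", -0.5),
    ("topic_b", 0.5),
    ("topic_a", -0.25),
    ("topic_b", 0.0),
])
```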
Our main findings in this concluding part of our project were that most users participated in the subreddit with only one post and (therefore) were also interested in only one topic on average. There do, however, seem to be a few users who are very active and participate in conversations on multiple topics; these might be the “answer-persons” [5], a well-known role present in many other online platforms. Overall there seemed to be little interaction between users and a focus by each user on one specific topic, potentially indicating that Reddit is mainly used not to connect with people but to retrieve information and opinions without extensive and repeated discussion among users. The distribution of user activity therefore seems to be in line with our chosen reference paper [5] and project [6], which show similar results. A few questions emerged from the network analysis:
These findings and related questions provide insights into potential further analyses.
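The activity pattern described above (many single-post users, a few users active across topics) can be quantified with a short summary function. This is a sketch with hypothetical names and invented input, not the code from our notebook.

```python
from collections import defaultdict


def activity_summary(posts):
    """Summarise user activity from (author, topic) pairs.

    Returns the number of users, the share of users with exactly one post,
    and the sorted list of users who posted on more than one topic.
    """
    post_count = defaultdict(int)
    topics = defaultdict(set)
    for author, topic in posts:
        post_count[author] += 1
        topics[author].add(topic)

    n_users = len(post_count)
    single = sum(1 for count in post_count.values() if count == 1)
    return {
        "users": n_users,
        "single_post_share": single / n_users if n_users else 0.0,
        "multi_topic_users": sorted(a for a, t in topics.items() if len(t) > 1),
    }


# Invented (author, topic) pairs: u3 is the only multi-topic user.
summary = activity_summary([("u1", 3), ("u2", 3), ("u3", 7), ("u3", 9), ("u3", 7)])
```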
This analysis shows how natural language processing helps uncover what interests the community. “Economic and Financial Systems” stood out as the most discussed topic, indicating a strong interest in simplified explanations of complex financial concepts. Other popular topics included “Genetics and Human Evolution”, “Physics and Cosmology”, “Gaming and Software Development”, and “Nutrition and Dietary Health”, showing the community’s diverse interests in science, technology, and health. These insights might be helpful for governments and other creators of educational content to focus their efforts towards the topics outlined in this project and create more accessible content in these rather complex areas.
Topics like Cultural Diversity, Firearms Public Safety, and Law Enforcement tend to attract more negative sentiment, likely because of the controversial nature of these discussions, which often touch on sensitive and polarizing issues. These discussions often provoke strong negative emotions and frustration in the community, reflecting broader societal issues such as discrimination, racism, politics, and religion. Understanding these feelings helps us track public opinion and predict how the community might react.
In this concluding part of our analyses we were able to combine Topic Modeling and Sentiment Analysis through Network Analysis in order to find non-textual properties of our dataset starting from the users’ posts only. Our results, discussed in the previous section, also seem to be in line with the findings of our reference paper [5] and project [6].
Overall, we were able to test, in our chosen subreddit, the effectiveness of computational linguistic tools in providing insights into a dataset along a variety of dimensions, including properties that are not strictly textual.
Team Member | Contributions
---|---
Oksana Kalytenko | Data Preprocessing, Topic Modeling, Dashboard Creation & Design
Mariia Pyvovar | Data Preprocessing, Sentiment Analysis
Alisea Stroligo | Data Preprocessing, Network Analysis
[1] Grootendorst, M., 2022. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.
[2] Grootendorst, M., 2021. Interactive Topic Modeling with BERTopic. https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8
[3] Hutto, C., & Gilbert, E. (2014). VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. Proceedings of the International AAAI Conference on Web and Social Media, 8(1), 216-225.
[4] Srivastava, R., Bharti, P.K., & Verma, P. (2022). Comparative Analysis of Lexicon and Machine Learning Approach for Sentiment Analysis. International Journal of Advanced Computer Science and Applications (IJACSA).
[5] Buntain, C.; Golbeck, J. Identifying social roles in reddit using network structure. In Proceedings of the 23rd International Conference on World Wide Web, Seoul, Republic of Korea, 7–11 April 2014; pp. 615–620.
[6] Samridh Prasad, reddit-analysis, (2019), GitHub repository, https://github.com/samridhprasad/reddit-analysis