Imagine if we could run any text through a program, press a button, and out would come a full analysis of the complex human emotions and nuances behind it. That was what drew our team to this project and led us to step into the alien waters of sentiment analysis.
A few weeks ago, we were challenged as a group to come up with a student-life-related problem and use data analytics (DA) solutions to solve it. Instead of coming up with the problem first, my group thought of the field that we wanted to learn about, then found a problem within the field to work on.
We decided to explore the field of sentiment analysis, which is the process of computationally identifying and categorising a given text to determine the writer’s attitude towards certain issues. When applied to a collective set of texts, we can identify the general sentiments of a population towards a subject.
With sentiment analysis as the solution, it was easy to choose a project. We decided to hit close to home and find out: what is the general sentiment of NUS students at any given point in time?
To achieve this, we tapped on NUSwhispers – a Facebook group with close to 23,000 followers for students to share their thoughts and feelings anonymously. We determined that data from this social media page would give us the widest coverage of opinions on student life. Hence, we consolidated all the posts and conducted sentiment analysis on them to determine the general sentiments amongst NUS students who post on the page.
Tools & Resources
Below is a list of the tools and resources we used to scrape the data and complete sentiment analysis. Both Python and R are viable platforms for the task.
Facebook Access Token
Facebook Scraping Scripts for Python
NLP Guide: http://www.nltk.org/book/
Packages For Sentiment Analysis
Syuzhet Package: https://cran.r-project.org/web/packages/syuzhet/
Lexicon for Text Analysis: https://cran.r-project.org/web/packages/lexicon/index.html
Text Cleaning Tools
Syuzhet Package for Sentiment analysis codes: https://colinpriest.com/2017/04/30/tutorial-sentiment-analysis-of-airlines-using-the-syuzhet-package-and-twitter/, https://medium.com/swlh/exploring-sentiment-analysis-a6b53b026131
Web Scraping & Sentiment Analysis with Python
Before the team could conduct sentiment analysis on the posts on the NUSwhispers page, we needed to first scrape every post that had been submitted since the page was created and record them, along with other relevant information, in a spreadsheet.
Through our research, we discovered that someone on the internet had already written a Facebook scraper which we could use. Those who are interested in scraping Facebook group posts may find the instructions and the program here: https://nocodewebscraping.com/facebook-scraper/
You will need to run Python 2.7 on Windows Powershell to use this program.
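Under the hood, scrapers of this kind page through the Facebook Graph API using an access token. The sketch below (ours, not the linked program) shows the general shape of that loop in Python 3; the Graph API version string is an assumption, and `scrape_feed` needs network access and a valid token.

```python
import json
import urllib.request

# Base feed endpoint; the Graph API version here is an assumption
GRAPH_URL = "https://graph.facebook.com/v2.12/{page}/feed"

def extract_page(response):
    """Pull post texts and the next-page URL out of one Graph API JSON response."""
    posts = [p.get("message", "") for p in response.get("data", [])]
    next_url = response.get("paging", {}).get("next")
    return posts, next_url

def scrape_feed(page_id, access_token):
    """Follow 'paging.next' links until the feed is exhausted (needs network access)."""
    url = GRAPH_URL.format(page=page_id) + "?access_token=" + access_token
    all_posts = []
    while url:
        with urllib.request.urlopen(url) as resp:
            posts, url = extract_page(json.load(resp))
        all_posts.extend(posts)
    return all_posts
```

The key idea is simply that each response carries a `paging.next` URL, so the scraper keeps requesting until that field disappears.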
After we had the dataset ready, we could proceed with analysing the sentiments of all the posts. Initially, we wanted to conduct sentiment analysis the ‘machine learning way’. However, the impracticality of such an approach soon dawned upon us – we would have to manually associate 2,557 posts with different emotions before the machine could ‘learn’ on its own. This was a step too heavy for us to take.
Had we taken that step, we would have completed the project using the code shown in the image below. The abridged flow of the program can be summarized as follows:
- Keep only letters in every post and remove all symbols;
- Transform all uppercase letters to lowercase;
- Keep only non-stopwords (eliminate common words that don’t mean much such as am, do, etc.) and take only the root of those words, i.e. transforming words such as eaten, ate and eats to eat;
- Put those words inside a ‘Corpus’;
- Create a sparse matrix for the posts and the words inside the ‘Corpus’, i.e. if there are 3000 words found in all the posts combined, and the first post consists of only 3 of those words, row 1 will have 3 cells of ‘TRUE’ (those 3 words found in the post) and 2997 cells of ‘FALSE’;
- Use the Naïve Bayes Classification algorithm to train the dataset
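The preprocessing steps above can be sketched in Python using only the standard library. The stopword list here is a tiny stand-in for NLTK's full list, and stemming is omitted (a real pipeline would use something like NLTK's stemmers); the boolean rows produced are what a Naive Bayes classifier would then be trained on.

```python
import re

# Tiny illustrative stopword list; the real pipeline would use NLTK's full list
STOPWORDS = {"am", "is", "are", "do", "i", "a", "an", "the", "and", "to", "of"}

def preprocess(post):
    # Steps 1-3: keep only letters, lowercase everything, drop stopwords.
    # (A real pipeline would also stem, e.g. eats -> eat, via NLTK.)
    words = re.sub(r"[^A-Za-z]+", " ", post).lower().split()
    return [w for w in words if w not in STOPWORDS]

def sparse_matrix(posts):
    # Steps 4-5: build the 'Corpus' vocabulary, then one boolean row per post,
    # TRUE where the vocabulary word appears in that post.
    docs = [preprocess(p) for p in posts]
    vocab = sorted({w for d in docs for w in d})
    rows = [[w in set(d) for w in vocab] for d in docs]
    return vocab, rows
```

In practice one would hand these rows and their labels to an off-the-shelf classifier (e.g. a Bernoulli Naive Bayes implementation) rather than train by hand.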
In order to complete this project more efficiently, we also considered categorizing each post according to the reactions that it received using the following rule:
- If the post has fewer than 8 reactions, we cannot confidently identify the sentiment that the post is expressing;
- Otherwise, the most chosen reaction will determine the sentiment attached to the post, e.g. a post with 23 ‘Hahas’, 3 ‘Likes’, 1 ‘Angry’ and no other reactions will be classified as a humorous post
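The rule above is simple enough to express in a few lines of Python (the function name and the dict shape are ours, for illustration):

```python
def label_by_reactions(reactions, min_total=8):
    """Apply the reaction rule: 'reactions' maps reaction name -> count,
    e.g. {'Haha': 23, 'Like': 3}. Returns the dominant reaction, or None
    when the total count is below min_total (too few to be confident)."""
    if sum(reactions.values()) < min_total:
        return None
    return max(reactions, key=reactions.get)
```

For example, `label_by_reactions({'Haha': 23, 'Like': 3, 'Angry': 1})` picks ‘Haha’, while a post with only a handful of reactions is left unlabelled.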
This method is feasible, but we must be careful when dealing with posts that were created before Facebook introduced the reaction feature, and with posts that have few reactions. The accuracy of such an analysis is also questionable. Another possible solution we came up with was to analyse the texts and determine the emotions in those words using a predefined library of associations. This would certainly be efficient but, like the previous solution, the analysis may also not be accurate. Therefore, the team is currently looking at other alternatives that could provide a better balance between reliability and efficiency.
Sentiment Analysis with R Studio
Another way which our team chose to conduct analysis on the dataset we scraped from Facebook was through R programming. Our first task after scraping the data would be to clean it. The R package “tm” supports multiple text mining functions.
Utilizing the “tm” package, we were able to input these posts into a ‘Corpus’ and then clean posts the way we described previously.
One thing we noted while cleaning is that the order in which we cleaned the data mattered. For example, if we were to delete punctuation before clearing the URLs in our posts, it would break “http://www.nuswhispers.com” into several parts (“http”, “www”, “nuswhispers” and so on), which would in turn not be deleted by our function to clear URLs and might cause problems in our analysis.
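The ordering pitfall can be demonstrated with two regular expressions (shown here in Python for brevity; the same ordering applies to the corresponding tm_map calls in R):

```python
import re

URL_RE = re.compile(r"http\S+")     # crude URL matcher: 'http' up to next space
PUNCT_RE = re.compile(r"[^\w\s]")   # strips punctuation characters

text = "Check http://www.nuswhispers.com now!"

# Wrong order: stripping punctuation first shatters the URL into plain words,
# so the URL filter no longer finds anything to remove
wrong = URL_RE.sub("", PUNCT_RE.sub(" ", text))

# Right order: remove URLs while they are still intact, then strip punctuation
right = PUNCT_RE.sub(" ", URL_RE.sub("", text))
```

After the wrong order, fragments like “www” and “nuswhispers” survive as ordinary tokens and pollute the corpus; after the right order, the URL is gone entirely.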
After we have cleaned the data, we would then be able to perform sentiment analysis using the get_nrc_sentiment() function in the Syuzhet package.
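Conceptually, get_nrc_sentiment() looks each word up in the NRC emotion lexicon and tallies per-emotion counts for a piece of text. A toy Python analogue, with a hand-made two-word lexicon standing in for the real one (which has thousands of entries), might look like this:

```python
from collections import Counter

# Toy stand-in for the NRC lexicon: word -> associated emotions/polarities.
# These two entries are illustrative assumptions, not the real lexicon.
NRC_TOY = {
    "exam":  ["fear", "negative"],
    "happy": ["joy", "positive"],
}

def nrc_sentiment(post):
    """Count emotion associations for every word in one post."""
    counts = Counter()
    for word in post.lower().split():
        counts.update(NRC_TOY.get(word.strip(".,!?"), []))
    return counts
```

Summing these counters across all posts gives the per-emotion totals that the graphs below are built from.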
To be able to visualize this in clearer fashion, we could convert these results into a dataframe and plot it using the qplot function in ggplot.
The results of our analysis of NUSwhispers posts from February 2018 can be visualised either as a graph depicting many different emotions, or with those emotions grouped into positive and negative, as seen below.
However, we realized that any results extracted from sentiment analysis would have to be taken with a pinch of salt, given that we have yet to verify the reliability of the Syuzhet package in analysing sentiments. In particular, we believe that the “Singlish” used amongst students who post on NUSwhispers may be a problem, as the package has no way of identifying these slang terms and categorizing them into different sentiments. Sentence structure is another limitation in our analysis, because the Syuzhet package only analyses keywords and hence cannot pick up on sarcastic comments, which may have a tone contradictory to the sentiments the individual words convey.
Challenges & Final Prototype
As with all projects, we faced challenges in getting to our goal. But since we had a rough idea of what sentiment analysis was, we knew in advance about the major problem that we would face. Code, after all, is just a string of text, and it is very challenging to write a program that can detect sarcasm, which can drastically alter the sentiment of a post. The simplest way around this would be to accept the limited accuracy of the sentiment analysis, since the tested accuracy is around 60%, which is rather decent. Another method would be to find a way to analyse sentence structures to detect sarcasm. Although this might appear to be a very good solution, sarcasm detection is still generally difficult to implement and, even if we had an implementation, it might not improve the accuracy much. The Python library that could potentially be used for this is NLTK.
What we didn’t expect was having to label around two and a half thousand posts just to be able to conduct machine learning. As web scraping only gives us the posts and the number of likes/reactions, there is no guaranteed way of telling what sentiments each post contains. In order to perform machine learning, a training set with labelled sentiments would be required. One way to do that is to manually assign a sentiment to each post through sheer brute force, but that would be very inefficient even if the workload were evenly distributed. We would thus have to find an alternative feasible route.
Even though it has not been fully realized, we envisage our final prototype to be a dashboard illustrating the differences in sentiments between students in NUS and NTU. We hope our final prototype will someday be able to analyse the sentiments in each post as accurately as possible, but that would involve automated detection of sarcasm. While the task seems straightforward, behind it lies deep pools of knowledge and actualizing it will certainly be a challenging task. We hope that you have gained some insight on sentiment analysis and, who knows, perhaps one day you’ll be the one to advance our project and take it one step nearer to completion.
From left to right in the foreground: Kai Cong, Medric, Simon, Leon and Kelvin.