How an Uninformed Idea Became a Working Sentiment Analysis Model

Connor Sparks
Published in The Startup
5 min read · Nov 25, 2019


Photo by Luca Bravo on Unsplash

While I hate Java, I do have to thank it for getting me back into Python and, more importantly, machine learning.

During my Java class, I was introduced to the idea of a sentiment value generator: a way to find out whether a word has a positive or negative connotation. In this class, we were not actually using Java to create one of these machines, but rather using a word list generated by researchers at Stanford to rate different reviews.

The idea was that we would loop through all of the words in a review, look up each word's sentiment value in a CSV, add the values together, then use a simple list of if statements to guess the star rating of said review. This part was rather simple, but it was the concepts behind this exercise that led me not only to explore the relationship between sentiment value and star rating but to create my own sentiment value generator.
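For a sense of what that exercise looked like, here is a minimal sketch of the approach. The file layout, score scale, and rating thresholds are my assumptions for illustration, not the actual Stanford lexicon or the class's cutoffs:

```python
import csv

def load_sentiments(path):
    """Read a hypothetical word,score CSV into a dict for fast lookup."""
    with open(path) as f:
        return {word: float(score) for word, score in csv.reader(f)}

def guess_star_rating(review, sentiments):
    """Sum per-word sentiment values, then bucket the total with if statements."""
    total = sum(sentiments.get(w, 0.0) for w in review.lower().split())
    # Thresholds are made up; the class exercise tuned these by hand.
    if total > 4:
        return 5
    elif total > 2:
        return 4
    elif total > 0:
        return 3
    elif total > -2:
        return 2
    else:
        return 1
```

Words missing from the lexicon simply contribute 0, which is why the quality of the word list matters so much here.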

No Relation

My original intention was not to create a sentiment value generator but rather to explore the relationship between the total sentiment value of a review and its star rating. In class, we were simply asked to use the two test cases we were given to create a list of if statements that could find the star rating of a review. To me, this didn't make much sense: I felt I didn't have enough data to figure this out accurately. But I knew I could combine data from online with some basic machine learning concepts I had picked up.

I started out by finding a list of product reviews on Kaggle that I could use to train my algorithm. My thought was that by taking a list of reviews with known star ratings and finding the sentiment value of each, I could find the relationship between the two data points. Using a Colab notebook, I was able to store and access both the Kaggle data and the sentiment value data to train the machine.

View it Here

The first part of my function took the CSV and converted it into a Python dictionary that I could use to look up the sentiment value of a word. I then created a for loop that went through each review and ran each word in said review through the dictionary to generate a total sentiment value for the review. This data was then passed into a Keras ANN that would find the correlation between the data. However, all I got was this.

Accuracy of 63%

Which made no sense. Until I graphed this.

No relation

It turned out there was no relation between the two. I quickly understood why: people simply don't think about star ratings the same way. To some, a five-star review means the product is simply okay, and to others, a three-star review means the product is a masterpiece.
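For reference, the Keras side of the experiment above might look something like this. The layer sizes, loss, and training call are my guesses at a typical setup, not the notebook's actual code:

```python
import numpy as np
from tensorflow import keras

def build_model():
    """A tiny ANN regressing a review's total sentiment onto its star rating."""
    model = keras.Sequential([
        keras.layers.Input(shape=(1,)),          # one feature: total sentiment
        keras.layers.Dense(16, activation="relu"),
        keras.layers.Dense(16, activation="relu"),
        keras.layers.Dense(1),                   # predicted star rating
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# totals: shape (n, 1) sentiment sums; stars: shape (n,) known ratings
# model = build_model()
# model.fit(totals, stars, epochs=10, validation_split=0.2)
```

With only a single input feature, no amount of network capacity can recover a relationship that isn't in the data, which is exactly what the graph showed.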

Making it Work

At this point, I wanted to take on the challenge of creating my own sentiment value generator, though that name may be somewhat misleading. In reality, I trained an ANN to figure out whether a review is positive or negative, using over 50k IMDB reviews as training data.

My idea for this was simple. I would use a function to find every unique word within the 50k reviews and create a list containing them. Then, for each review in the dataset, I would create a vector containing a 1 or 0 at each word's index, depending on whether that word appeared in the review. All of this, once again, within the confines of another Colab notebook.
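The encoding described above is a classic bag-of-words scheme, and a from-scratch sketch of it is short. The function and variable names here are mine, not the notebook's:

```python
def build_vocab(reviews):
    """Collect every unique word across the reviews into a word -> index map."""
    vocab = sorted({w for review in reviews for w in review.lower().split()})
    return {word: i for i, word in enumerate(vocab)}

def vectorize(review, index):
    """1 at a word's index if the word appears in the review, else 0."""
    vec = [0] * len(index)
    for w in review.lower().split():
        i = index.get(w)          # skip words not seen during vocab building
        if i is not None:
            vec[i] = 1
    return vec
```

Every review becomes a vector as long as the vocabulary, which is exactly where the size problem described next comes from.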

View it Here

This idea didn't work right off the bat, as there were a couple of major issues. The first was simply that the resulting dataset was too large: trying to train the generator on all unique words crashed the VM by exhausting its RAM, since there were over 181,000 unique words, leading to an array with over 9 billion entries. The other was that it was slow, since building the dataset meant looping through every single word.

Some simple revisions made this project actually work. First off, I used a fancy Python function to remove all punctuation and then build a list of the top 5,000 words; this ran in about 10 seconds. To fix the for loop issue, I used a NumPy function to compare the array of words in a review against the array of the most commonly used words.
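My reconstruction of those two fixes is below; the exact calls in the notebook may differ, but `str.translate` plus `collections.Counter` and `np.isin` do the same jobs:

```python
import string
from collections import Counter
import numpy as np

STRIP = str.maketrans("", "", string.punctuation)

def top_words(reviews, n=5000):
    """Strip punctuation and return the n most common words as an array."""
    counts = Counter(
        w for review in reviews for w in review.lower().translate(STRIP).split()
    )
    return np.array([w for w, _ in counts.most_common(n)])

def encode(review, vocab):
    """0/1 vector marking which of the top words appear in this review."""
    words = np.array(review.lower().translate(STRIP).split())
    return np.isin(vocab, words).astype(np.int8)
```

Capping the vocabulary at 5,000 words shrinks each review's vector by more than 30x versus the full 181,000-word vocabulary, and `np.isin` replaces the slow word-by-word Python loop with one vectorized comparison.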

These changes actually worked, getting it up to an accuracy of around 87% while also working on random reviews from the internet.

Accuracy

Needless to say, it was the coolest thing I did all week.


👋 My musings here are basically a public stream-of-consciousness journal, so please forgive my writing: it’s getting better, I promise.