Sentiment Analysis in PyTorch using CNN

Harika Bonthu
7 min read · Jul 2, 2020


What is Sentiment Analysis, and why do we need it?

It is commonly estimated that around 80% of the world’s existing data is text, i.e. unstructured data.

Sentiment Analysis (also known as opinion mining or emotion AI) is a sub-field of NLP that measures the inclination of people’s opinions (Positive/Negative) within the unstructured text.

But where are we getting this unstructured data from?

Social media tweets/posts, call transcripts, survey or interview reviews, text across blogs, forums, news, etc.

The two measures that are used to analyze the sentiment are:

  • Subjectivity: measures how subjective (opinion-based) rather than factual the text is.
  • Polarity: measures how positive or negative the opinion is.

What is Natural Language Processing(NLP)?

Natural language is how we, humans, communicate with each other; it can be speech or text. NLP is the automatic processing of natural language by software.

Basic steps to be performed as part of NLP are:

  • Word Tokenization
  • Predicting POS(parts of speech) for each token
  • Obtaining the root word: Stemming or Lemmatization
  • Removing the stop words
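
As a quick illustration, here is a minimal sketch of these steps using nltk (the sample sentence is made up, and the nltk data packages must be downloaded once):

# Download the nltk resources needed for tokenization, POS tagging, lemmatization and stop words
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')

sentence = "The battery life of this phone is surprisingly good"
tokens = word_tokenize(sentence)                       # word tokenization
pos_tags = nltk.pos_tag(tokens)                        # POS tag for each token
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]     # root words via lemmatization
stop_words = set(stopwords.words('english'))
filtered = [t for t in lemmas if t.lower() not in stop_words]  # removing the stop words
print(pos_tags)
print(filtered)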

For our project, we are going to build our own dataset by extracting reviews from the internet.

These reviews can be extracted from the webpages by a technique called web scraping. Below are the steps of web scraping:

  • Find the URL of the webpage that you want to scrape
  • Select the element you need by inspecting it (Right-click -> Inspect)
  • Write the code to get the content of the selected elements
  • Store the data in the required format

And how do we do web scraping? By using any of the libraries below:

  • Selenium
  • BeautifulSoup
  • Pandas

Let’s jump into the code.

# Install necessary packages
pip install requests beautifulsoup4 matplotlib seaborn nltk pandas textblob scikit-learn
  • requests: to send HTTP requests
  • BeautifulSoup: to perform web scraping
  • matplotlib, seaborn: for visualization
  • nltk(natural language tool kit): word tokenization, stemming, lemmatization, finding stop words, etc
  • textblob: for part-of-speech tagging, sentiment analysis, classification, translation, etc
  • sklearn: to make use of various Machine learning algorithms
# import necessary libraries
import pandas as pd
import re # regular expression
import requests
from bs4 import BeautifulSoup as bs
import matplotlib.pyplot as plt
# import seaborn as sb
from textblob import TextBlob

Web Scraping to fetch the reviews using BeautifulSoup
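
The scraping loop itself only takes a few lines. Below is a minimal sketch with requests and BeautifulSoup that produces the all_pages_reviews list used in the next cell; the URL pattern and the review-text class are placeholders and will differ for the site you scrape.

# A sketch of scraping a few pages of reviews (URL and CSS class are hypothetical placeholders)
all_pages_reviews = []
for page in range(1, 6):                                           # first 5 pages
    url = f"https://www.example.com/product-reviews?page={page}"   # placeholder URL
    response = requests.get(url)
    soup = bs(response.text, "html.parser")
    for review in soup.find_all("div", class_="review-text"):      # assumed review element
        all_pages_reviews.append(review.get_text(strip=True))
print(len(all_pages_reviews), "reviews collected")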

Store the reviews in a pandas dataframe

# Creating a dataframe

df = pd.DataFrame(all_pages_reviews, columns = ['Reviews'])
df.head() # printing the first five reviews

Cleaning the data: remove unwanted/special characters and numbers
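
One simple way to do this is with the re module we imported earlier (a small sketch; the clean_text helper name is just illustrative):

# Keep only letters and spaces, collapse repeated whitespace, lower-case everything
def clean_text(text):
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # remove numbers and special characters
    text = re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace
    return text.lower()

df['Reviews'] = df['Reviews'].apply(clean_text)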

Perform NLP tasks (Tokenization, Stemming/Lemmatization, remove stopwords)

Tokenization means chopping the text into pieces; the basic unit (token) is a word, though sentence-level tokenization can be performed as well. Stemming or lemmatization is then used to obtain the root words.
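
A minimal sketch of these steps applied to the Reviews column with nltk (assuming the nltk data shown earlier has been downloaded; the preprocess helper name is illustrative):

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    tokens = word_tokenize(text)                                  # tokenization
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]            # root words via lemmatization
    return " ".join(t for t in lemmas if t not in stop_words)     # remove stop words

df['Reviews'] = df['Reviews'].apply(preprocess)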

Defining functions to calculate Subjectivity, Polarity and finally Analysis
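
These functions are straightforward with TextBlob. Here is a sketch of what they could look like (the function names are illustrative; Subjectivity ranges from 0 to 1, Polarity from -1 to +1):

def get_subjectivity(text):
    return TextBlob(text).sentiment.subjectivity

def get_polarity(text):
    return TextBlob(text).sentiment.polarity

def get_analysis(score):
    return 'Negative' if score < 0 else 'Positive'

df['Subjectivity'] = df['Reviews'].apply(get_subjectivity)
df['Polarity'] = df['Reviews'].apply(get_polarity)
df['Analysis'] = df['Polarity'].apply(get_analysis)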

Saving the resulting data into a CSV file

We are only interested in the input text (Reviews) and output (Analysis) columns.

header = ['Reviews', 'Analysis']
df.to_csv('ownReviewDataset.csv', columns = header)

Converting pandas dataframe to torchtext dataset

# splitting the data into training, testing, and validation datasets
from sklearn.model_selection import train_test_split

train, test_df = train_test_split(df[["Reviews", "Analysis"]], test_size=0.2)
# split the remaining data into train and validation
train_df, valid_df = train_test_split(train, test_size=0.1)

Importing necessary libraries

!python -m spacy download en
!pip install torch
import spacy
spacy.load('en')
import torch
from torchtext import data
# from torchtext import datasets
import random

Defining the text and label fields

TEXT = data.Field(tokenize='spacy', batch_first = True)
LABEL = data.LabelField(dtype=torch.float)

Defining a class to convert pandas dataframe into a torchtext dataset
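
The class is not reproduced here in full; the sketch below follows a common pattern for wrapping a pandas dataframe with the (legacy) torchtext Example/Dataset API and is one way such a class could look:

class DataFrameDataset(data.Dataset):
    """Wrap a pandas dataframe (Reviews, Analysis) as a torchtext Dataset."""
    def __init__(self, df, fields, **kwargs):
        examples = [
            data.Example.fromlist([row.Reviews, row.Analysis], fields)
            for row in df.itertuples(index=False)
        ]
        super().__init__(examples, fields, **kwargs)

    @classmethod
    def splits(cls, fields, train_df, val_df=None, test_df=None, **kwargs):
        train_data = cls(train_df, fields, **kwargs)
        val_data = cls(val_df, fields, **kwargs) if val_df is not None else None
        test_data = cls(test_df, fields, **kwargs) if test_df is not None else None
        return tuple(d for d in (train_data, val_data, test_data) if d is not None)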

fields = [('text', TEXT), ('label', LABEL)]

train_ds, val_ds, test_ds = DataFrameDataset.splits(fields, train_df=train_df, val_df=valid_df, test_df=test_df)

Build the Vocabulary

Build the vocab and load the pre-trained word embeddings.

TEXT.build_vocab(train_ds, max_size=25000, vectors="glove.6B.100d")
LABEL.build_vocab(train_ds)

Create Dataloader iterator

BATCH_SIZE = 12

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
(train_ds, val_ds, test_ds),
batch_size=BATCH_SIZE,
device=device)

We are now ready with the data, but let’s understand how a convolutional neural network can be used on text data.

Usually, CNNs are used for datasets consisting of 2D or 3D images, but the textual data we have is 1-dimensional. So, our first step is to convert each word into a word embedding. This lets us view the text in 2 dimensions: the words along one axis and the elements of their embedding vectors along the other. Below is an example of a 2-dimensional representation of an embedded sentence.

source: https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/4%20-%20Convolutional%20Sentiment%20Analysis.ipynb

We can then use a kernel/filter that is [n x emb_dim]. This will cover n sequential words entirely, as their width will be emb_dim dimensions. Consider the image below, with our word vectors represented in green. Here we have 4 words with 5-dimensional embeddings, creating a [4x5] “image” tensor. A filter that covers two words at a time (i.e. bi-grams) will be a [2x5] filter, shown in yellow, and each element of the filter will have a weight associated with it. The output of this filter (shown in red) will be a single real number that is the weighted sum of all elements covered by the filter.

The filter then moves “down” the image (or across the sentence) to cover the next bi-gram and another output (weighted sum) is calculated.

Finally, the filter moves down again and the final output for this filter is calculated.

The final output will be a vector whose number of elements equals the height of the image (or the length of the sentence) minus the height of the filter plus one: 4 - 2 + 1 = 3 in our case.

In the above example, we used a single filter, but in reality, a CNN will have many such filters. The intuition is that we will be looking for the occurrence of different tri-grams, 4-grams, and 5-grams that are relevant for analyzing the sentiment of the reviews.

The next step is to apply max pooling (taking the maximum value over a dimension) to the output of the convolutional layer.

The maximum value obtained by max pooling is the “most important” feature for determining the sentiment of the review, which corresponds to the “most important” n-gram within the review. But how do we know which n-gram is the most important? The weights of the filters are updated through backpropagation, so that whenever certain n-grams that are highly indicative of the sentiment are seen, the output of the filter is a “high” value. This “high” value then passes through the max pooling layer if it is the maximum value in the output.
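
The shape arithmetic above can be verified directly in PyTorch. In this small sketch a single [2 x 5] bi-gram filter slides over a random [4 x 5] “sentence image”, producing 3 outputs, and max pooling keeps only the largest of them:

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 1, 4, 5)                        # [batch, channel, 4 words, 5-dim embeddings]
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=(2, 5))  # bi-gram filter
conved = conv(x)                                   # shape [1, 1, 3, 1]  (4 - 2 + 1 = 3)
conved = conved.squeeze(3)                         # shape [1, 1, 3]
pooled = F.max_pool1d(conved, conved.shape[2])     # shape [1, 1, 1] -> the "most important" bi-gram
print(conved.shape, pooled.shape)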

Build the Model
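
The CNN class itself is not reproduced in this excerpt; the sketch below follows the standard convolutional sentiment-analysis architecture described above (embedding layer, parallel Conv2d filters of different sizes, max pooling, dropout, and a final linear layer) and matches the constructor arguments used in the next cell:

import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.convs = nn.ModuleList([
            nn.Conv2d(in_channels=1, out_channels=n_filters, kernel_size=(fs, embedding_dim))
            for fs in filter_sizes
        ])
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text: [batch size, sent len] because TEXT was defined with batch_first=True
        embedded = self.embedding(text).unsqueeze(1)              # [batch, 1, sent len, emb dim]
        conved = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs]
        pooled = [F.max_pool1d(c, c.shape[2]).squeeze(2) for c in conved]
        cat = self.dropout(torch.cat(pooled, dim=1))              # [batch, n_filters * len(filter_sizes)]
        return self.fc(cat)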

Create an instance of our CNN class

INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
N_FILTERS = 1000
FILTER_SIZES = [3,4,5]
OUTPUT_DIM = 1
DROPOUT = 0.5
model = CNN(INPUT_DIM, EMBEDDING_DIM, N_FILTERS, FILTER_SIZES, OUTPUT_DIM, DROPOUT)

Next, we’ll load the pre-trained embeddings

pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)

Train the Model

# initialize the optimizer and the loss function (criterion), and place the model and criterion on the GPU (if available)
import torch.optim as optim
import torch.nn as nn

optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)

Function to calculate accuracy

def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    # round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()  # convert into float for division
    acc = correct.sum() / len(correct)
    return acc

We define a function for training our model…

Note: as we are using dropout again, we must remember to use model.train() to ensure the dropout is “turned on” while training.
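
A sketch of such a training function, assuming the batch attributes are named text and label as in the fields defined above:

def train(model, iterator, optimizer, criterion):
    epoch_loss, epoch_acc = 0, 0
    model.train()                                     # "turn on" dropout
    for batch in iterator:
        optimizer.zero_grad()
        predictions = model(batch.text).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)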

We define a function for testing our model…

Note: again, as we are now using dropout, we must remember to use model.eval() to ensure the dropout is “turned off” while evaluating.
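
And a matching evaluation function (no gradients, no optimizer step):

def evaluate(model, iterator, criterion):
    epoch_loss, epoch_acc = 0, 0
    model.eval()                                      # "turn off" dropout
    with torch.no_grad():
        for batch in iterator:
            predictions = model(batch.text).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)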

Training the model
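
A minimal sketch of the epoch loop, keeping the weights from the epoch with the lowest validation loss (the number of epochs and the checkpoint filename are just illustrative choices):

N_EPOCHS = 5
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'cnn-model.pt')
    print(f'Epoch {epoch+1:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}% | Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')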

Evaluating the model’s performance on the test data (test_iterator)
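
For example, loading the best checkpoint saved above and evaluating on the test iterator:

model.load_state_dict(torch.load('cnn-model.pt'))
test_loss, test_acc = evaluate(model, test_iterator, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')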

We achieve around 55–60% accuracy, which is considerably lower than what this architecture achieves on the IMDB dataset. A likely reason (just a hunch) is the small size of our scraped training dataset.

References:

  • Ben Trevett, Convolutional Sentiment Analysis: https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/4%20-%20Convolutional%20Sentiment%20Analysis.ipynb

Note: some of the explanations and the images are borrowed from the reference mentioned above.

Check out the complete Jupyter Notebook here.

This Jupyter notebook and blog were written as part of the project for the course Zero to GANs. Many thanks to our instructor and @Jovian.ml for this informative course.
