TensorFlow: Text Classification of Movie Reviews

Text classification is a well-known problem in Natural Language Processing in which we need to understand the context of a text and predict, for example, whether it is positive or negative. Sometimes a text classification problem requires predicting a score on a scale, such as 1 to 10.

Sentiment analysis is the classic example of text classification. Many companies run sentiment analysis on use-cases such as movie reviews, product reviews, or service reviews; knowing the user experience and feedback about a product or service helps them improve their business.

In this tutorial, you will learn to train a neural network for movie review sentiment analysis using TensorFlow. The task is to predict whether a movie review is positive or negative. We use the IMDB movie review dataset, which consists of 25,000 training and 25,000 test text samples labelled positive or negative. The IMDB movie review dataset was published by the Stanford AI Lab; please refer to this link for more information about the dataset.

TensorFlow also lets you load a pre-processed version of this dataset that is ready to use in a deep learning model, so you don’t have to perform pre-processing steps such as tokenization yourself. In this tutorial, however, we will use the raw dataset and perform the tokenization with TensorFlow.
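
For reference, the pre-tokenized version ships with Keras and can be loaded in a single call; a minimal sketch (the num_words value here is only an illustrative choice that mirrors the one used later in this tutorial):

import tensorflow as tf

# Load the pre-tokenized IMDB dataset bundled with Keras.
# Each review is already a list of integer word indices.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=8000)
print(len(x_train), len(x_test))  # 25000 training and 25000 test reviews
print(x_train[0][:10])            # first ten token ids of the first review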

Let’s start preparing a model for movie review sentiment analysis. First, we import the required packages.

Import Required Packages

import tensorflow as tf
import pandas as pd
import numpy as np
import os
print(tf.__version__)
2.1.0

This entire program works with TensorFlow version 2.1.0. Let’s load the IMDB dataset.

Import IMDB movie review dataset

data_path = tf.keras.utils.get_file("IMDB_Dataset.csv",
"https://studymachinelearning.com/wp-content/uploads/2020/03/IMDB-Dataset.csv",
)
data_path 
Downloading data from https://studymachinelearning.com/wp-content/uploads/2020/03/IMDB-Dataset.csv
66215936/66212309 [==============================] - 155s 2us/step
'/home/.keras/datasets/IMDB_Dataset.csv'
df = pd.read_csv(data_path)

Explore the data

Let’s explore the dataset before training the model. There are a total of 50,000 movie reviews in the dataset, each labelled with positive or negative sentiment.

df.shape
(50000, 2)

The dataset is balanced, with an equal split of 25,000 positive and 25,000 negative reviews.

df['sentiment'].value_counts()
positive    25000
negative    25000
Name: sentiment, dtype: int64

Let’s print a sample movie review along with its label:

print(f"Review : \n\n{df['review'].iloc[1]} \n ")
print(f"Label  : {df['sentiment'].iloc[1]}")
Review : 

A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done. 
 
Label  : positive

Prepare Data for Training

Let’s split the data into train and test sets and apply label encoding to the target variable (here positive is mapped to 0 and negative to 1).

df['sentiment'] = df['sentiment'].map({'positive':0,'negative':1})

train_df = df.sample(frac=0.8,random_state=100)
test_df = df.drop(train_df.index)

print(f"Train data shape: {train_df.shape}")
print(f"Test  data shape: {test_df.shape}")
Train data shape: (40000, 2)
Test  data shape: (10000, 2)

Tokenize the Data

A machine understands numbers, not text, so we need to convert each word into a number by assigning a unique integer to it. This process is called tokenization, and the unique number assigned to a word is called a token. TensorFlow provides a Tokenizer class for this.
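
Before applying it to the full dataset, here is a tiny toy example of what the Tokenizer does (not part of the training pipeline); the most frequent words receive the smallest indices:

# Toy example: fit a Tokenizer on two short sentences and inspect the mapping.
demo_tokenizer = tf.keras.preprocessing.text.Tokenizer()
demo_tokenizer.fit_on_texts(["the movie was great", "the movie was terrible"])
print(demo_tokenizer.word_index)                                    # e.g. {'the': 1, 'movie': 2, 'was': 3, 'great': 4, 'terrible': 5}
print(demo_tokenizer.texts_to_sequences(["the movie was great"]))   # [[1, 2, 3, 4]]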

Let’s now perform the tokenization step on our dataset.

tokenizer  = tf.keras.preprocessing.text.Tokenizer(num_words=8000)
tokenizer.fit_on_texts(np.append(train_df['review'].values,test_df['review'].values))

word_index = tokenizer.word_index
nb_words = len(word_index) + 1

train_seq = tokenizer.texts_to_sequences(train_df["review"])
test_seq = tokenizer.texts_to_sequences(test_df["review"])

train_data = tf.keras.preprocessing.sequence.pad_sequences(train_seq, maxlen=100)
test_data = tf.keras.preprocessing.sequence.pad_sequences(test_seq, maxlen=100)

print(f"Train data shape: {train_data.shape}")
print(f"Test  data shape: {test_data.shape}")
Train data shape: (40000, 100)
Test  data shape: (10000, 100)

tokenizer.word_index returns the dictionary of words along with their uniquely assigned integers. Let’s print how many unique words are present in our dataset.

len(word_index)
124252

There are a total of 124,252 unique words in the dataset.
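
You can also look up the integer assigned to any individual word; a quick sanity check (the returned values depend on word frequencies in this corpus):

# Inspect the integer ids assigned to a couple of common words.
print(word_index.get('movie'), word_index.get('film'))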

train_label = train_df['sentiment'].values
test_label = test_df['sentiment'].values

Build the Model

Let’s build the neural network architecture by stacking multiple layers: an Embedding layer, a GlobalAveragePooling1D layer that averages over the sequence dimension, and two Dense layers. The final Dense layer has a single unit with no activation, so the loss is configured with from_logits=True.

def create_model():
    model = tf.keras.Sequential([
      tf.keras.layers.Embedding(nb_words, 128),
      tf.keras.layers.GlobalAveragePooling1D(),
      tf.keras.layers.Dense(64,activation='relu'),
      tf.keras.layers.Dense(1)])

    model.compile(optimizer='adam',
              loss=tf.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

    return model

model = create_model()
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 128)         15904384  
_________________________________________________________________
global_average_pooling1d_1 ( (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 65        
=================================================================
Total params: 15,912,705
Trainable params: 15,912,705
Non-trainable params: 0
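
The parameter count is dominated by the embedding layer: 124,253 rows (124,252 unique words plus one for index 0) × 128 dimensions = 15,904,384 weights. The first Dense layer adds 128 × 64 + 64 = 8,256 parameters and the output layer 64 + 1 = 65, which gives the 15,912,705 total shown above. Note that because the Tokenizer was created with num_words=8000, only the roughly 8,000 most frequent words actually appear in the padded sequences, so the remaining embedding rows are never updated during training.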

Train the Model

Let’s define an EarlyStopping callback and fit the model.

call_back = [tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=2,
                                              verbose=1, restore_best_weights=True)]

model.fit(train_data, train_label, epochs=10, batch_size=32,
          validation_data = (test_data,test_label),
          callbacks=call_back)
Train on 40000 samples, validate on 10000 samples
Epoch 1/10
40000/40000 [==============================] - 214s 5ms/sample - loss: 0.3712 - accuracy: 0.8160 - val_loss: 0.3195 - val_accuracy: 0.8650
Epoch 2/10
40000/40000 [==============================] - 217s 5ms/sample - loss: 0.2052 - accuracy: 0.9165 - val_loss: 0.3346 - val_accuracy: 0.8534
Epoch 3/10
39968/40000 [============================>.] - ETA: 0s - loss: 0.1260 - accuracy: 0.9525Restoring model weights from the end of the best epoch.
40000/40000 [==============================] - 216s 5ms/sample - loss: 0.1259 - accuracy: 0.9525 - val_loss: 0.4185 - val_accuracy: 0.8516
Epoch 00003: early stopping

Evaluate the Model

Let’s see how our model performs on test data.

loss, accuracy = model.evaluate(test_data,test_label)

print(f"Accuracy : {accuracy}")
print(f"Loss     : {loss}")
10000/10000 [==============================] - 0s 43us/sample - loss: 0.3195 - accuracy: 0.8650

Accuracy : 0.8650000095367432
Loss     : 0.31954611089229584
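
Because the model outputs a raw logit (we compiled the loss with from_logits=True), a sigmoid is needed to turn the score into a probability when classifying new text. A minimal sketch of scoring an unseen review (the example sentence is made up; recall that the label mapping used here is positive → 0 and negative → 1):

# Score a new review with the fitted tokenizer and trained model.
sample = ["The film was a waste of time, I would not recommend it."]
sample_seq = tokenizer.texts_to_sequences(sample)
sample_data = tf.keras.preprocessing.sequence.pad_sequences(sample_seq, maxlen=100)
negative_prob = float(tf.sigmoid(model.predict(sample_data))[0, 0])  # probability of class 1 (negative)
print(f"Negative probability: {negative_prob:.3f}")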

Our model achieves quite impressive performance with a small neural network architecture, even though we haven’t cleaned the data. We fed the raw text to the model, including stopwords, HTML tags, misspellings, and so on. So there is a good chance of improving performance simply by cleaning the raw data before feeding it to the model, as sketched below.
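
A simple cleaning function could, for example, lowercase the text and strip HTML tags and punctuation before tokenization; a minimal sketch using the standard library (the exact cleaning rules are a design choice, not part of the original pipeline):

import re

def clean_review(text):
    """Small cleaning sketch: lowercase, drop HTML tags and punctuation."""
    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)         # remove HTML tags such as <br />
    text = re.sub(r"[^a-z\s]", " ", text)      # keep only letters and whitespace
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated spaces

# You could apply this to the reviews before the tokenization step, e.g.:
# df['review'] = df['review'].apply(clean_review)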

Please refer to this tutorial to discover various approaches to text cleaning/preprocessing. Pre-trained word embeddings can also help improve performance; please refer to this tutorial to learn how to use pre-trained word embeddings for text classification.
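
To give a flavour of the pre-trained embedding approach, the sketch below builds an embedding matrix from a GloVe text file and plugs it into the Embedding layer; the file glove.6B.100d.txt must be downloaded separately (the path here is a placeholder), and the embedding dimension has to match the file. The linked tutorial covers the full workflow.

# Sketch: initialise the Embedding layer with pre-trained GloVe vectors.
embedding_dim = 100
embeddings = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:   # placeholder path
    for line in f:
        values = line.split()
        embeddings[values[0]] = np.asarray(values[1:], dtype="float32")

embedding_matrix = np.zeros((nb_words, embedding_dim))
for word, i in word_index.items():
    vector = embeddings.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

# Drop-in replacement for the first layer in create_model().
embedding_layer = tf.keras.layers.Embedding(nb_words, embedding_dim,
                                            weights=[embedding_matrix],
                                            trainable=False)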

In this tutorial, you have learned how to build a neural network model for text classification (sentiment analysis) on the IMDB movie review dataset using TensorFlow. Please write a comment in the section below if you have any questions about this tutorial.
