Text classification is a classic problem in Natural Language Processing (NLP) in which we need to understand the context of a text and predict its class, for example whether it is positive or negative. Sometimes a text classification problem requires predicting a score on a scale, such as 1 to 10.
Sentiment analysis is one of the best-known examples of text classification. Many companies run sentiment analysis on use-cases such as movie reviews, product reviews, or service reviews, which helps them improve their business by learning about user experience and feedback on a product or service.
In this tutorial, you will learn to train a neural network for movie review sentiment analysis using TensorFlow. Given a movie review, we need to predict whether it is positive or negative. We use the IMDB movie review dataset, which consists of 25,000 training and 25,000 test text samples labelled as positive or negative. This dataset was published by the Stanford AI Lab; please refer to this link for more information about it.
TensorFlow also lets you load a pre-processed version of this dataset that is ready to feed into a deep learning model; in that case you don't need to perform text pre-processing steps such as tokenization, because the data is already tokenized. In this tutorial, however, we will use the raw dataset and perform the tokenization ourselves with TensorFlow.
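For reference, here is a minimal sketch of loading the pre-tokenized version that ships with Keras (we won't use it in the rest of this tutorial; the num_words=8000 value simply mirrors the vocabulary size we use later):

import tensorflow as tf

# Each review comes back as a list of integer word indices, already tokenized.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=8000)
print(len(x_train), len(x_test))  # 25000 25000
print(x_train[0][:10])            # first 10 token ids of the first review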
Let’s start preparing a model for movie review sentiment analysis. First, we will import the required packages.
Import Required Packages
import tensorflow as tf
import pandas as pd
import numpy as np
import os

print(tf.__version__)
2.1.0
This entire program works with TensorFlow version 2.1.0. Let’s load the IMDB dataset.
Import the IMDB Movie Review Dataset
data_path = tf.keras.utils.get_file(
    "IMDB_Dataset.csv",
    "https://studymachinelearning.com/wp-content/uploads/2020/03/IMDB-Dataset.csv",
)
data_path
Downloading data from https://studymachinelearning.com/wp-content/uploads/2020/03/IMDB-Dataset.csv
66215936/66212309 [==============================] - 155s 2us/step
'/home/.keras/datasets/IMDB_Dataset.csv'
Explore the Data
Let’s explore the dataset before training the model. There are 50,000 movie reviews in total, each labelled with a positive or negative sentiment. First we read the downloaded CSV into a DataFrame and check its shape.
# Load the downloaded CSV into the DataFrame used throughout the tutorial.
df = pd.read_csv(data_path)
df.shape
(50000, 2)
The dataset is balanced: it is equally distributed, with 25,000 positive and 25,000 negative reviews.
df['sentiment'].value_counts()
positive    25000
negative    25000
Name: sentiment, dtype: int64
Let’s print a sample movie review:
print(f"Review : \n\n{df['review'].iloc[1]} \n ") print(f"Label : {df['sentiment'].iloc[1]}")
Review : 

A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.

Label : positive
Prepare Data for Training
Let’s create the train and test data, and apply label encoding to the target variable.
df['sentiment'] = df['sentiment'].map({'positive': 0, 'negative': 1})
train_df = df.sample(frac=0.8, random_state=100)
test_df = df.drop(train_df.index)

print(f"Train data shape: {train_df.shape}")
print(f"Test data shape: {test_df.shape}")
Train data shape: (40000, 2) Test data shape: (10000, 2)
Tokenize the Data
A machine understands only numbers, not text, so we need to convert each word into a number by assigning a unique integer to every word. This process is called tokenization, and the unique number assigned to a word is called a token. TensorFlow provides a text tokenization class, tf.keras.preprocessing.text.Tokenizer, for this purpose.
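To see what the tokenizer does, here is a small illustrative example on a made-up two-sentence corpus (the sentences are purely for demonstration):

toy = tf.keras.preprocessing.text.Tokenizer(num_words=100)
toy.fit_on_texts(["the movie was great", "the movie was terrible"])

# Words are indexed starting from 1, most frequent words first.
print(toy.word_index)
# {'the': 1, 'movie': 2, 'was': 3, 'great': 4, 'terrible': 5}

print(toy.texts_to_sequences(["the movie was great"]))
# [[1, 2, 3, 4]]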
Now let’s perform the tokenization step on our dataset.
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=8000)
tokenizer.fit_on_texts(np.append(train_df['review'].values, test_df['review'].values))

word_index = tokenizer.word_index
nb_words = len(word_index) + 1

train_seq = tokenizer.texts_to_sequences(train_df["review"])
test_seq = tokenizer.texts_to_sequences(test_df["review"])

train_data = tf.keras.preprocessing.sequence.pad_sequences(train_seq, maxlen=100)
test_data = tf.keras.preprocessing.sequence.pad_sequences(test_seq, maxlen=100)

print(f"Train data shape: {train_data.shape}")
print(f"Test data shape: {test_data.shape}")
Train data shape: (40000, 100) Test data shape: (10000, 100)
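Note that pad_sequences pads and truncates at the start of a sequence by default ('pre'), so reviews longer than 100 tokens keep their last 100 tokens, and shorter ones are left-padded with zeros. A quick illustration:

print(tf.keras.preprocessing.sequence.pad_sequences([[1, 2, 3]], maxlen=5))
# [[0 0 1 2 3]]
print(tf.keras.preprocessing.sequence.pad_sequences([[1, 2, 3, 4, 5, 6]], maxlen=5))
# [[2 3 4 5 6]]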
tokenizer.word_index returns a dictionary mapping each word to its uniquely assigned integer. Let’s print how many unique words are present in our dataset.
len(word_index)
124252
There are 124,252 unique words present in the dataset. Note that because we created the tokenizer with num_words=8000, texts_to_sequences keeps only the most frequent words (those with an index below 8,000); the rest are dropped from the sequences.
train_label = train_df['sentiment'].values
test_label = test_df['sentiment'].values
Build the Model
Let’s build the neural network architecture by stacking multiple layers. The model maps each token to a 128-dimensional embedding, averages the embeddings over the sequence with GlobalAveragePooling1D, and passes the result through a dense layer. The final Dense(1) layer has no activation, so it outputs a raw logit, which is why the loss is configured with from_logits=True.
def create_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(nb_words, 128),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1)
    ])
    model.compile(optimizer='adam',
                  loss=tf.losses.BinaryCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    return model

model = create_model()
model.summary()
Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_1 (Embedding) (None, None, 128) 15904384 _________________________________________________________________ global_average_pooling1d_1 ( (None, 128) 0 _________________________________________________________________ dense_1 (Dense) (None, 64) 8256 _________________________________________________________________ dense_2 (Dense) (None, 1) 65 ================================================================= Total params: 15,912,705 Trainable params: 15,912,705 Non-trainable params: 0
Train the Model
Let’s define an EarlyStopping callback and fit the model. The callback monitors validation accuracy, stops training once it fails to improve for two consecutive epochs (patience=2), and restores the weights from the best epoch.
call_back = [tf.keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                              patience=2,
                                              verbose=1,
                                              restore_best_weights=True)]

model.fit(train_data, train_label,
          epochs=10,
          batch_size=32,
          validation_data=(test_data, test_label),
          callbacks=call_back)
Train on 40000 samples, validate on 10000 samples
Epoch 1/10
40000/40000 [==============================] - 214s 5ms/sample - loss: 0.3712 - accuracy: 0.8160 - val_loss: 0.3195 - val_accuracy: 0.8650
Epoch 2/10
40000/40000 [==============================] - 217s 5ms/sample - loss: 0.2052 - accuracy: 0.9165 - val_loss: 0.3346 - val_accuracy: 0.8534
Epoch 3/10
39968/40000 [============================>.] - ETA: 0s - loss: 0.1260 - accuracy: 0.9525
Restoring model weights from the end of the best epoch.
40000/40000 [==============================] - 216s 5ms/sample - loss: 0.1259 - accuracy: 0.9525 - val_loss: 0.4185 - val_accuracy: 0.8516
Epoch 00003: early stopping
Evaluate the Model
Let’s see how our model performs on the test data. Because restore_best_weights=True, the model we evaluate holds the weights from the epoch with the highest validation accuracy.
loss, accuracy = model.evaluate(test_data, test_label)
print(f"Accuracy : {accuracy}")
print(f"Loss : {loss}")
. . .
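Finally, here is a minimal usage sketch, not part of the original tutorial, showing how you might score a new review with the trained model (the example review string is made up). Since the network outputs a logit, we apply a sigmoid to turn it into a probability; with our label encoding, 0 means positive and 1 means negative, so a probability close to 0 indicates a positive review:

# Hypothetical example review; any string would do.
new_review = ["The movie was absolutely wonderful, I loved every minute of it."]

# Reuse the fitted tokenizer and the same maxlen as in training.
seq = tokenizer.texts_to_sequences(new_review)
padded = tf.keras.preprocessing.sequence.pad_sequences(seq, maxlen=100)

logit = model.predict(padded)
prob_negative = tf.sigmoid(logit).numpy()[0][0]
print(f"P(negative) = {prob_negative:.3f}")
print("Prediction :", "negative" if prob_negative > 0.5 else "positive")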