NLP – Stop Words

Stop words are words that occur very frequently in text documents, such as a, an, the, you, and your. Because they appear in almost every document, they usually carry little information for text analysis, so it is often better to remove them. Once the stop words have been removed, we can focus on the important words.

The NLTK package provides a list of stop words. If the NLTK module is not installed on your local machine, install it with the command below before proceeding.

pip install nltk

After installing the NLTK module successfully, install the NLTK data, which contains many corpora, toy grammars, trained models, etc.

Installing NLTK Data

First, open the Python interpreter and type the following command.

import nltk
nltk.download()

Running this command opens the NLTK Downloader window. Click the Download button to download the NLTK corpora.
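If you only need the stop word lists, you can pass the corpus name to nltk.download() instead of using the downloader window. The snippet below is a minimal sketch of that idea; the helper name ensure_stopwords_corpus is our own and not part of NLTK.

```python
def ensure_stopwords_corpus():
    """Download the NLTK 'stopwords' corpus only if it is not installed yet."""
    # Imported inside the function so that defining the helper does not
    # itself require NLTK to be importable.
    import nltk

    try:
        # Raises LookupError when the corpus is not found locally.
        nltk.data.find('corpora/stopwords')
    except LookupError:
        # Fetch just this one corpus, without opening the downloader window.
        nltk.download('stopwords', quiet=True)
```

Call ensure_stopwords_corpus() once before importing the stop words. It needs a network connection the first time it runs.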

Let’s load the English stop words in Python.

In [1]:
from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))
In [2]:
# Number of stop words in the list (the count may differ between NLTK data versions).
print(len(stopWords))
Out[2]:
179

NLTK supports stop word lists for several languages. Let’s list the languages covered by the NLTK stopwords corpus.

In [3]:
from nltk.corpus import stopwords
stopwords.fileids()
Out[3]:
['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'turkish']
In [4]:
# Print 10 stop words. stopWords is a set, so the order is arbitrary.
for i, word in enumerate(stopWords):
    if i >= 10:
        break
    print(word)
Out[4]:
were
of
should've
through
yourselves
isn
won't
y
other
myself
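
With the list loaded, removing stop words comes down to filtering the tokens of a text against it. Here is a rough sketch under a few assumptions: the helper names are hypothetical, the regex tokenizer is deliberately simplistic, and the small fallback set is only a stand-in that lets the example run when NLTK or its data is missing.

```python
import re


def load_english_stop_words():
    """Return NLTK's English stop words, or a tiny stand-in set without NLTK."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words('english'))
    except (ImportError, LookupError):
        # Assumption: illustrative fallback only, not NLTK's real list.
        return {'a', 'an', 'the', 'is', 'are', 'of', 'on', 'in', 'and', 'to'}


def remove_stop_words(text, stop_words):
    """Lowercase the text, split it into word tokens, and drop the stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stop_words]


stop_words = load_english_stop_words()
filtered = remove_stop_words("The cat is sleeping on the sofa", stop_words)
print(filtered)  # ['cat', 'sleeping', 'sofa']
```

The text is lowercased before filtering because the NLTK lists are stored in lowercase.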

.     .     .

Customized Stop Words

We can also build a customized, domain-specific stop word list. For example, in medical text documents, words such as Dr., drug, patient, and medicine appear in most documents, so we can treat them as stop words. Another example is Twitter data, where tokens such as #, @, and RT occur in most tweets.
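
One simple way to do this is to take the union of a general-purpose list and your own domain terms. The sketch below assumes a short hand-written base set standing in for set(stopwords.words('english')), and the medical terms are only examples taken from the paragraph above.

```python
# Assumption: a short sample standing in for a general-purpose list such as
# set(stopwords.words('english')) from NLTK.
base_stop_words = {'the', 'was', 'a', 'for', 'to', 'of', 'and'}

# Hypothetical domain-specific additions for medical text (lowercased).
medical_stop_words = {'dr', 'drug', 'patient', 'medicine'}

# The customized list is the union of both sets.
custom_stop_words = base_stop_words | medical_stop_words

text = "The patient was given a new drug for migraine"
tokens = text.lower().split()
filtered_tokens = [t for t in tokens if t not in custom_stop_words]
print(filtered_tokens)  # ['given', 'new', 'migraine']
```

Keeping the domain terms in a separate set makes it easy to swap them out when you move to another kind of text, such as tweets.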
