NLP – Stop Words

Stop words are words that occur very frequently in text documents, such as a, an, the, you, and your. Because they appear in nearly every document, they usually carry little information for text analysis, so it is often better to remove them from the text. With the stop words removed, we can focus on the more important words.
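To make the idea concrete, here is a minimal sketch of stop word removal in plain Python. The tiny stop word set and the helper name `remove_stop_words` are made up for this illustration, and splitting on whitespace is an assumed, deliberately naive form of tokenization.

```python
# Hypothetical, hand-picked stop word set used only for this illustration.
SAMPLE_STOP_WORDS = {"a", "an", "the", "you", "your", "is", "on"}


def remove_stop_words(text, stop_words):
    """Return the tokens of `text` that are not in `stop_words`."""
    tokens = text.lower().split()  # naive whitespace tokenization
    return [t for t in tokens if t not in stop_words]


print(remove_stop_words("The cat is on your mat", SAMPLE_STOP_WORDS))
# ['cat', 'mat']
```

Only the content-bearing words survive, which is the effect stop word removal is meant to have.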

The NLTK package provides a list of stop words. If the NLTK module is not installed on your local machine, install it with the command below before proceeding.

pip install nltk

After installing the NLTK module successfully, install the NLTK data, which contains many corpora, toy grammars, trained models, etc.

Installing NLTK Data

First, open the Python interpreter and type the following command.

import nltk
nltk.download()

After running this command, the NLTK Downloader window opens. Click the Download button to download the NLTK corpora.
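If only the stop word lists are needed, the graphical downloader can be skipped. The sketch below assumes network access on first use and relies on `nltk.download('stopwords')` to fetch just that corpus; the helper name `ensure_stopwords_corpus` is made up for this example.

```python
def ensure_stopwords_corpus():
    """Download the NLTK 'stopwords' corpus only if it is not already present."""
    # Imported inside the function so that defining the helper
    # does not itself require NLTK to be installed.
    import nltk

    try:
        # Raises LookupError when the corpus cannot be found locally.
        nltk.data.find("corpora/stopwords")
    except LookupError:
        # Fetch only the stop word lists, without opening the downloader window.
        nltk.download("stopwords", quiet=True)
```

Calling `ensure_stopwords_corpus()` once before importing `stopwords` from `nltk.corpus` should be enough for the examples in this tutorial.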

Let’s load the English stop words in Python.

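In [1]:
from nltk.corpus import stopwords
# Store the words in a set for fast membership tests.
stopWords = set(stopwords.words('english'))
In [2]:
# Number of stop words in the list (the exact count depends on the NLTK data version).
len(stopWords)
Out[2]: 179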

NLTK supports various languages for text processing. Let’s list the languages supported by the NLTK stopwords corpus.

In [3]:
from nltk.corpus import stopwords
stopwords.fileids()
Out[3]:
['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'turkish']
In [4]:
# Print 10 stop words. A set is unordered, so the words shown can differ between runs.
for i, e in enumerate(stopWords):
    if i >= 10:
        break
    print(e)
Out[4]:
were
of
should've
through
yourselves
isn
won't
y
other
myself

.     .     .

Customized Stop Words

We can also construct a domain-specific, customized stop word list. For example, in medical text documents, words like Dr., drug, patient, and medicine appear in most documents, so we can treat them as stop words. Another example is Twitter text data, where tokens such as #, @, and RT occur in most tweets.
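A customized list can be sketched as a set union. In the example below, the small `base_stop_words` set is a hypothetical stand-in (in practice it could be `set(stopwords.words('english'))`), the helper name `build_stop_words` is made up, and whitespace splitting is again an assumed, naive tokenizer.

```python
# Hypothetical base list; in practice this could be set(stopwords.words('english')).
base_stop_words = {"a", "an", "the", "of", "to", "is"}

# Domain-specific additions taken from the medical example in the text.
medical_stop_words = {"Dr.", "drug", "patient", "medicine"}


def build_stop_words(base, *extras):
    """Return a new lowercase set combining the base list with extra word sets."""
    combined = {w.lower() for w in base}
    for extra in extras:
        combined |= {w.lower() for w in extra}
    return combined


custom = build_stop_words(base_stop_words, medical_stop_words)
tokens = "the patient was given a new drug by dr. smith".split()
filtered = [t for t in tokens if t not in custom]
print(filtered)
# ['was', 'given', 'new', 'by', 'smith']
```

Because the helper returns a new set, the base list stays untouched and different domains (medical, Twitter, and so on) can each get their own combined list.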
