Stop words are very common words in text documents, such as a, an, the, you, and your. They occur frequently but in many cases contribute little to text analysis, so it is often better to remove them. Once the stop words are removed, the analysis can focus on the more informative words.
The NLTK package provides lists of stop words. If the NLTK module is not installed on your local machine, install it with the command below before proceeding.
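To make this concrete, here is a minimal sketch that counts word frequencies in a short sample before and after filtering. The sample sentence and the tiny `STOP_WORDS` set are made up for illustration; a real stop word list is much longer.

```python
from collections import Counter

# Tiny hand-picked stop list, for illustration only.
STOP_WORDS = {"a", "an", "the", "you", "your", "is", "of", "and", "to", "in"}

text = "the cat sat on the mat and the dog sat in the garden"
tokens = text.split()

# Most frequent token before filtering: the stop word "the" dominates.
print(Counter(tokens).most_common(1))

# Drop the stop words and count again.
content_tokens = [t for t in tokens if t not in STOP_WORDS]
print(Counter(content_tokens).most_common(1))
```

Before filtering, the most frequent token is "the"; after filtering, it is "sat", which says more about the content of the text.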
pip install nltk
After installing the NLTK module successfully, install the NLTK data, which contains many corpora, toy grammars, trained models, etc.
Installing NLTK Data
First, open the Python interpreter and type the following command.
import nltk
nltk.download()
After running this command, the NLTK Downloader window opens. Click the Download button to download the NLTK corpora.
Let’s load the English stop words in Python.
In [1]:
from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))
# Number of stop words in the list (the count may vary with the NLTK data version).
In [2]:
print(len(stopWords))
Out[2]:
179
NLTK supports various languages for text processing. Let’s list the languages for which NLTK provides stop words.
In [3]:
from nltk.corpus import stopwords
stopwords.fileids()
Out[3]:
['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'turkish']
# Let's print 10 stop words. A set is unordered, so the words shown may differ between runs.
In [4]:
i = 0
for e in stopWords:
    print(e)
    i = i + 1
    if i >= 10:
        break
Out[4]:
were
of
should've
through
yourselves
isn
won't
y
other
myself
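With the list loaded, removing stop words from a piece of text comes down to a membership test per token. The sketch below is one possible approach, not the only one. It splits on whitespace for simplicity, so punctuation stays attached to words; a proper tokenizer would handle that better. The helper names `load_english_stop_words` and `remove_stop_words` and the small fallback set are hypothetical additions so the example still runs when NLTK or its data is not installed.

```python
def load_english_stop_words():
    """Return NLTK's English stop words, or a tiny stand-in set if unavailable."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words('english'))
    except (ImportError, LookupError):
        # Assumption: fallback for illustration only, far smaller than NLTK's list.
        return {"this", "is", "a", "an", "the", "of", "to", "and"}


def remove_stop_words(text, stop_words):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in stop_words]


stopWords = load_english_stop_words()
print(remove_stop_words("This is a sample sentence", stopWords))
```

Lowercasing before the lookup matters because the words in NLTK's English list are lowercase, so "This" would otherwise slip through.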
. . .
Customized Stop Words
We can also construct a customized, domain-specific stop word list. For example, in medical text documents, words such as Dr., drug, patient, and medicine appear in most documents, so we can treat them as stop words. Another example is Twitter text data, where terms such as #, @, and RT occur in most tweets.
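One way to build such a list is to take a general-purpose stop word set and union it with the domain terms. In the sketch below, `base_stop_words` is a small stand-in so the example runs on its own; in practice it could be `set(stopwords.words('english'))`. The domain terms come from the examples above, and the variable names and sample texts are illustrative.

```python
# Stand-in for a general-purpose list (assumption: replace with a full list in practice).
base_stop_words = {"the", "a", "an", "is", "was", "to", "by", "for"}

# Domain-specific additions, lowercased to match the lowercased tokens below.
medical_terms = {"dr.", "drug", "patient", "medicine"}
twitter_terms = {"#", "@", "rt"}

medical_stop_words = base_stop_words | medical_terms
twitter_stop_words = base_stop_words | twitter_terms

note = "The patient was referred to Dr. Smith for chest pain"
kept = [w for w in note.lower().split() if w not in medical_stop_words]
print(kept)

# Note: "#" and "@" usually stick to the following word ("#nlp", "@user"),
# so whitespace splitting only catches them when they stand alone.
tweet = "RT great news for the team"
kept_tweet = [w for w in tweet.lower().split() if w not in twitter_stop_words]
print(kept_tweet)
```

Keeping the domain terms in separate sets makes it easy to reuse the same base list across different kinds of documents.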
. . .