Stemming and Lemmatization

Stemming and lemmatization are methods for normalizing text documents. The main goal of text normalization is to keep the vocabulary small, which helps improve the accuracy of many language-modelling tasks.

For example, the vocabulary size is reduced if we transform each word to lowercase, so the difference between "How" and "how" is ignored.
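A minimal sketch of this idea (the word list here is purely illustrative): lowercasing collapses case variants into a single vocabulary entry.

words = ["How", "how", "How", "are", "you"]

# Vocabulary size before and after case normalization
vocab = set(words)                        # {'How', 'how', 'are', 'you'} -> 4 entries
vocab_lower = {w.lower() for w in words}  # {'how', 'are', 'you'} -> 3 entries

print(len(vocab), len(vocab_lower))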

Stemming and lemmatization help us normalize text and shrink the vocabulary by reducing inflectional forms: both convert a word to its base form.

 

Stemming

Stemming usually refers to a crude process of chopping off the last few characters of a word. It operates on a single word without knowledge of the context. Stemming is not a well-defined process: it can change the meaning of a word and often produces stems that are not valid words.
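As a quick illustration (a small sketch, not part of the example that follows), the Porter stemmer reduces the classic words "argue", "argued" and "arguing" to "argu", which is not a dictionary word:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
# Each form is chopped down to the stem "argu"
for w in ["argue", "argued", "arguing"]:
    print(w, '->', ps.stem(w))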

Stemmers use language-specific rules, but they require less linguistic knowledge than a lemmatizer. Several stemming algorithms exist, and each has its own behaviour.

NLTK provides stemmers for various languages such as English, German, French, Finnish, Danish, Dutch, Hungarian and Italian. The nltk.stem package provides the implementations.
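For instance, the languages supported by the Snowball stemmer can be listed directly (a small sketch; the exact contents of the tuple depend on your NLTK version):

from nltk.stem.snowball import SnowballStemmer

# Tuple of language names supported by the Snowball stemmer
print(SnowballStemmer.languages)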

Let’s look at the behaviour of the following three stemmers (the example below compares the Porter and Snowball stemmers; a short Lancaster sketch follows its output):

  • Porter Stemmer
  • Snowball Stemmer
  • Lancaster Stemmer

 

Example

from nltk.stem.snowball import SnowballStemmer
from nltk.stem import PorterStemmer

# Create one instance of each stemmer
ss = SnowballStemmer('english')
ps = PorterStemmer()

# Stem each inflected form of "wait" with both stemmers
words = ["wait", "waiting", "waited", "waits"]
for e in words:
    ps_stem_word = ps.stem(e)
    ss_stem_word = ss.stem(e)
    print('Word: {}   ->   PorterStemmer: {}   &   SnowballStemmer: {} '.format(e, ps_stem_word, ss_stem_word))

Output –

Word: wait   ->   PorterStemmer: wait   &   SnowballStemmer: wait 
Word: waiting   ->   PorterStemmer: wait   &   SnowballStemmer: wait 
Word: waited   ->   PorterStemmer: wait   &   SnowballStemmer: wait 
Word: waits   ->   PorterStemmer: wait   &   SnowballStemmer: wait
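The Lancaster stemmer listed above is the most aggressive of the three. A minimal sketch of its use on the same words (its stems can differ from Porter and Snowball for other inputs):

from nltk.stem import LancasterStemmer

ls = LancasterStemmer()
# Apply the Lancaster stemmer to each inflected form of "wait"
for e in ["wait", "waiting", "waited", "waits"]:
    print('Word: {}   ->   LancasterStemmer: {}'.format(e, ls.stem(e)))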

Lemmatization

Lemmatization is closely related to stemming, but the main difference is that lemmatization performs a morphological analysis of the word and maps it to a meaningful base form (its lemma).

Lemmatization needs a complete vocabulary and morphological analysis to lemmatize words correctly. It is often preferable to stemming because, unlike stemming, it does not destroy the meaning of the word.

NLTK provides the WordNet lemmatizer, which only removes affixes if the resulting word is present in its dictionary. Lemmatizers are slower than stemmers because the resulting word has to be looked up in the dictionary.

Let’s see an example of the WordNet lemmatizer:

Example

from nltk.stem import WordNetLemmatizer

wordnet_lemma = WordNetLemmatizer()

# Lemmatize each inflected form of "wait"
words = ["wait", "waiting", "waited", "waits"]
for e in words:
    lemma_word = wordnet_lemma.lemmatize(e)
    print('{} -> {}'.format(e, lemma_word))

Output – 

wait -> wait
waiting -> waiting
waited -> waited
waits -> wait
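Note that lemmatize() treats every word as a noun by default, which is why "waiting" and "waited" come back unchanged; only the plural noun "waits" is reduced. Passing a part-of-speech tag tells the lemmatizer to analyse the words as verbs instead (a small sketch):

from nltk.stem import WordNetLemmatizer

wordnet_lemma = WordNetLemmatizer()

# pos="v" marks each word as a verb, so the verb lemma is returned
for e in ["wait", "waiting", "waited", "waits"]:
    print('{} -> {}'.format(e, wordnet_lemma.lemmatize(e, pos="v")))

With pos="v", all four forms map to the lemma "wait".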
