Different Tokenization Techniques for Text Processing

In this article, I describe different tokenization methods for text preprocessing. As we all know, a machine only understands numbers, so it is necessary to convert text into numbers that a machine can understand. The process of converting text into numbers (tokens) is called tokenization. Many methods exist for tokenization; here, I have listed several of these techniques with examples.

Keras Tokenization

Let’s see how Keras splits text into words as tokens.
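A minimal sketch using text_to_word_sequence (the sample sentence is my own choice):

from tensorflow.keras.preprocessing.text import text_to_word_sequence

text = "The movie was good!"

# split the text into a list of lowercase word tokens
print(text_to_word_sequence(text))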

Output:
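['the', 'movie', 'was', 'good']

Note that the words are lowercased and the punctuation is stripped.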

Let’s see how the Keras tokenizer works:
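A sketch with the Tokenizer class; the train and test sentences are my own stand-ins, chosen to carry opposite meanings:

from tensorflow.keras.preprocessing.text import Tokenizer

train_texts = ["The movie was good"]
test_texts = ["The movie was not very good"]  # opposite meaning

tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_texts)  # build the vocabulary from the train sample only

print(tokenizer.texts_to_sequences(train_texts))
print(tokenizer.texts_to_sequences(test_texts))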

Output:
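[[1, 2, 3, 4]]
[[1, 2, 3, 4]]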

When the tokenizer is applied to unseen text, such as a test sample, any word that is not present in the training sample is simply skipped. In the example above, “not” and “very” are unknown words for the tokenizer because they do not appear in the train sample.

Note: here the train and test texts have opposite meanings, but we get the same tokenization.

To overcome this problem, Keras provides the “oov_token” parameter, which handles unknown words.

From Keras documentation:

  • oov_token: if given, it will be added to word_index and used to replace out-of-vocabulary words during text_to_sequence calls
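The same sketch with oov_token set (any unused string works; '<OOV>' is a common choice):

from tensorflow.keras.preprocessing.text import Tokenizer

train_texts = ["The movie was good"]
test_texts = ["The movie was not very good"]

tokenizer = Tokenizer(oov_token='<OOV>')  # reserve a token for unknown words
tokenizer.fit_on_texts(train_texts)

print(tokenizer.texts_to_sequences(train_texts))
print(tokenizer.texts_to_sequences(test_texts))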

Output:
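[[2, 3, 4, 5]]
[[2, 3, 4, 1, 1, 5]]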

In the above example, “not” and “very” are unknown words, also called out-of-vocabulary (OOV) words. Because we set the “oov_token” parameter when creating the tokenizer, it assigns a reserved token (index 1 here) to every unknown word. This approach is much better than the previous one; at least it can distinguish between the different texts.

We can also inspect the token assigned to each specific word.
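Continuing the sketch above:

print(tokenizer.word_index)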

Output:
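{'<OOV>': 1, 'the': 2, 'movie': 3, 'was': 4, 'good': 5}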

Spacy Tokenizer

Let’s see how the Spacy tokenizer splits words.
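A sketch assuming the small English model is installed (python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The movie was not very good!")

# each element of doc is a Token object; .text gives its raw string
print([token.text for token in doc])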

Output:
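['The', 'movie', 'was', 'not', 'very', 'good', '!']

Unlike Keras, Spacy keeps the original casing and treats punctuation as a separate token.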

Difference between Keras tokenizer and Spacy tokenizer

The Keras tokenizer lowercases text, strips punctuation, and maps words to integer ids in a single step. The Spacy tokenizer only splits text into tokens; it does not assign ids, so to get integer sequences you have to build the word-to-index dictionary yourself and look up each token, e.g. word_seq.append(word_dict[token.text]).

Example: tokenize text using the Spacy tokenizer
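A minimal runnable sketch around that line (the dictionary-building part is my own reconstruction):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The movie was not very good")

word_dict = {}  # word -> integer id, built by hand
word_seq = []   # the resulting sequence of ids

for token in doc:
    if token.text not in word_dict:
        word_dict[token.text] = len(word_dict) + 1
    word_seq.append(word_dict[token.text])

print(word_seq)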

Output:
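[1, 2, 3, 4, 5, 6]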

NLTK Tokenizer

NLTK has three different word tokenizers:

  • WhitespaceTokenizer: tokenizes using whitespace only
  • WordPunctTokenizer: tokenizes using punctuation, so punctuation marks become separate tokens
  • TreebankWordTokenizer: tokenizes using Penn Treebank grammar rules, e.g. contractions such as “wasn’t” are split into “was” and “n’t”

Let’s see how these tokenizers work.
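A sketch comparing all three on one sentence of my own:

from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, TreebankWordTokenizer

text = "The movie wasn't very good, really."

print(WhitespaceTokenizer().tokenize(text))    # whitespace only
print(WordPunctTokenizer().tokenize(text))     # alphabetic and punctuation runs
print(TreebankWordTokenizer().tokenize(text))  # Penn Treebank rules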

Output:
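['The', 'movie', "wasn't", 'very', 'good,', 'really.']
['The', 'movie', 'wasn', "'", 't', 'very', 'good', ',', 'really', '.']
['The', 'movie', 'was', "n't", 'very', 'good', ',', 'really', '.']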

Summary

In this article, I have explained different tokenization techniques with examples. These tokenization methods are used to convert text into tokens for text analysis.

Thanks for reading the full article. If you found this article helpful, please clap for it… It means a lot to me…
