Different Tokenization Techniques for Text Processing

In this article, I describe different tokenization methods for text preprocessing. As we all know, machines only understand numbers, so it is necessary to convert text into numbers that a machine can understand. The process of converting text into numbers (tokens) is called tokenization. Many methods exist for tokenization. Here, I have listed several tokenization techniques with examples.

Keras Tokenization

Let’s see how Keras splits text into word tokens.
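
Below is a minimal sketch using Keras’s text_to_word_sequence helper; the sample sentence is my own illustration.

from tensorflow.keras.preprocessing.text import text_to_word_sequence

# Split a sentence into lowercase word tokens; punctuation such as
# "," and "?" is stripped by the default filters.
text = "Machine learning is fun, isn't it?"
tokens = text_to_word_sequence(text)
print(tokens)
# ['machine', 'learning', 'is', 'fun', "isn't", 'it']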

Now, let’s see how the Keras tokenizer works:
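
A minimal sketch of the Tokenizer workflow; the train and test sentences are my own illustration, chosen so that the test text contains a word unseen during training.

from tensorflow.keras.preprocessing.text import Tokenizer

train = ["It is a good movie"]
test = ["It is not a good movie"]   # opposite meaning of the train text

# Build the vocabulary from the training text only.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train)

print(tokenizer.texts_to_sequences(train))
# [[1, 2, 3, 4, 5]]
print(tokenizer.texts_to_sequences(test))
# [[1, 2, 3, 4, 5]]  <- "not" is skipped, so both sequences are identical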

When the tokenizer is applied to unseen text, such as a test sample, any word that was not present in the training sample is simply skipped. In the example above, the keyword “not” is unknown to the tokenizer because it does not appear in the training sample.

Note: here, the train and test texts have opposite meanings, yet we get the same tokenization.

To overcome this problem, Keras provides the “oov_token” parameter, which handles unknown words.

From Keras documentation:

  • oov_token: if given, it will be added to word_index and used to replace out-of-vocabulary words during text_to_sequence calls
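
A minimal sketch of the same workflow with an OOV token; note that oov_token takes a placeholder string (here "<OOV>", my own choice), and the test sentence now contains two unseen words.

from tensorflow.keras.preprocessing.text import Tokenizer

train = ["It is a good movie"]
test = ["It is not a very good movie"]

# Reserve index 1 for out-of-vocabulary words.
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(train)

print(tokenizer.texts_to_sequences(test))
# [[2, 3, 1, 4, 1, 5, 6]]  <- "not" and "very" both map to OOV index 1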

In the above example, the keywords “not” and “very” are unknown; we can also call them out-of-vocabulary (OOV) words. When creating the tokenizer, we set the oov_token parameter to a placeholder string, so the tokenizer assigns a dedicated token to unknown words. This approach is much better than the previous one: at least it can distinguish between the different texts.

We can also see the token assigned to each specific word.
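
Continuing the sketch above, the mapping lives in the tokenizer’s word_index attribute:

print(tokenizer.word_index)
# {'<OOV>': 1, 'it': 2, 'is': 3, 'a': 4, 'good': 5, 'movie': 6}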

spaCy Tokenizer

Let’s see how the spaCy tokenizer splits text into tokens.
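
A minimal sketch of spaCy tokenization, assuming the small English model en_core_web_sm is installed; the sample sentence is my own illustration.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Machine learning is fun, isn't it?")

# spaCy keeps punctuation as separate tokens and splits contractions.
print([token.text for token in doc])
# ['Machine', 'learning', 'is', 'fun', ',', 'is', "n't", 'it', '?']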

Difference between the Keras tokenizer and the spaCy tokenizer

The Keras tokenizer both splits text and maps each word to an integer token, while spaCy only splits text into tokens and leaves the integer mapping to us.

Example: tokenize text into an integer sequence using the spaCy tokenizer

# Completed sketch built around the original fragment: word_dict and
# word_seq mimic Keras's word_index and texts_to_sequences (their
# construction here is my own illustration).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("It is a good movie")

word_dict = {}   # word -> integer id, like Keras's word_index
word_seq = []    # integer sequence for this text

for token in doc:
    if token.text not in word_dict:
        word_dict[token.text] = len(word_dict) + 1
    word_seq.append(word_dict[token.text])

print(word_seq)
# [1, 2, 3, 4, 5]

NLTK Tokenizer

NLTK has three different word tokenizers:

  • WhitespaceTokenizer: tokenizes using whitespace
  • WordPunctTokenizer: tokenizes using punctuation
  • TreebankWordTokenizer: tokenizes using Penn Treebank grammar rules

Let’s see how these tokenizers work.
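
A minimal sketch comparing the three NLTK tokenizers on the same sentence; the sample text is my own illustration.

from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, TreebankWordTokenizer

text = "Machine learning is fun, isn't it?"

# Splits on whitespace only, so punctuation stays attached to words.
print(WhitespaceTokenizer().tokenize(text))
# ['Machine', 'learning', 'is', 'fun,', "isn't", 'it?']

# Splits on punctuation, so the contraction breaks into three pieces.
print(WordPunctTokenizer().tokenize(text))
# ['Machine', 'learning', 'is', 'fun', ',', 'isn', "'", 't', 'it', '?']

# Uses Penn Treebank rules, so "isn't" becomes "is" + "n't".
print(TreebankWordTokenizer().tokenize(text))
# ['Machine', 'learning', 'is', 'fun', ',', 'is', "n't", 'it', '?']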

Summary

Here, I have explained different tokenization techniques with examples. These tokenization methods are used to convert text into tokens for text analysis.

Thanks for reading the full article. If you found this article helpful, please clap for it. It means a lot to me.
