In this article, I describe different tokenization methods for text preprocessing. As we all know, machines only understand numbers, so it is necessary to convert text into numbers that a machine can work with. The method that converts text into numbers (tokens) is called tokenization. Many tokenization methods exist; here, I have listed several of these techniques with examples.
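Before looking at specific libraries, here is a minimal, hand-rolled sketch of the idea (not tied to any library): split the text into tokens, then map each token to an integer id.

# a minimal illustration of tokenization: text -> tokens -> integer ids
text = "machines only understand numbers"
tokens = text.split()  # naive whitespace "tokenizer"
vocab = {word: i + 1 for i, word in enumerate(dict.fromkeys(tokens))}
ids = [vocab[word] for word in tokens]
print(tokens)  # ['machines', 'only', 'understand', 'numbers']
print(ids)     # [1, 2, 3, 4]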
Keras Tokenization
Let’s see how Keras splits text into word tokens.
from keras.preprocessing.text import text_to_word_sequence

text = "It's very easy to understand."
result = text_to_word_sequence(text)
print(result)
Output:
["it's", 'very', 'easy', 'to', 'understand']
Let’s see how the Keras Tokenizer works:
from keras.preprocessing.text import Tokenizer

tok = Tokenizer()
train_text = ["this girl is looking beautiful!!"]
test_text = ["this girl is not looking beautiful"]
tok.fit_on_texts(train_text)
t1 = tok.texts_to_sequences(train_text)
t2 = tok.texts_to_sequences(test_text)
print(t1)
print(t2)
Output:
[[1, 2, 3, 4, 5]]
[[1, 2, 3, 4, 5]]
When the tokenizer is applied to unseen text such as the test sample, any word that was not seen during fitting is simply skipped. In the example above, “not” is an unknown word for the tokenizer because it does not appear in the training sample.
Note: the train and test sentences have opposite meanings here, yet they produce exactly the same token sequence.
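One way to see why “not” disappears is to inspect the fitted vocabulary (a hedged sketch, continuing from the block above):

# continuing from the block above: the fitted vocabulary only contains training words
print(tok.word_index)
# expected something like (exact ordering depends on Keras internals):
# {'this': 1, 'girl': 2, 'is': 3, 'looking': 4, 'beautiful': 5}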
To overcome this problem, Keras provides the “oov_token” parameter, which handles unknown words.
From Keras documentation:
- oov_token: if given, it will be added to word_index and used to replace out-of-vocabulary words during text_to_sequence calls
from keras.preprocessing.text import Tokenizer

tok = Tokenizer(oov_token=True)
train_text = ["this girl is looking beautiful!!"]
test_text = ["this girl is not looking very beautiful"]
tok.fit_on_texts(train_text)
t1 = tok.texts_to_sequences(train_text)
t2 = tok.texts_to_sequences(test_text)
print(t1)
print(t2)
Output:
[[1, 2, 3, 4, 5]]
[[1, 2, 3, 6, 4, 6, 5]]
In the above example, “not” and “very” are unknown words; we can also call them out-of-vocabulary (OOV) words. Because we created the Tokenizer with ‘oov_token=True’, it assigns a single reserved token to unknown words instead of dropping them. This approach is much better than the previous one: at least it can distinguish the two texts.
We can also see which token is assigned to each word.
print(tok.word_index)
Output:
{True: 6, 'beautiful': 5, 'is': 3, 'girl': 2, 'looking': 4, 'this': 1}
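A note on oov_token=True: Keras accepts any value here and simply adds it as a key in word_index, which is why True itself appears in the dictionary above. It is more common to pass a placeholder string; a minimal sketch, assuming the same training and test sentences:

from keras.preprocessing.text import Tokenizer

# "<OOV>" is an arbitrary placeholder string chosen here, not a Keras requirement
tok = Tokenizer(oov_token="<OOV>")
tok.fit_on_texts(["this girl is looking beautiful!!"])
print(tok.texts_to_sequences(["this girl is not looking very beautiful"]))
print(tok.word_index)  # "<OOV>" now appears as a normal vocabulary entry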
spaCy Tokenizer
Let’s see how the spaCy tokenizer splits words.
import spacy

nlp = spacy.load("en")
doc = nlp("It's beautiful toy")
token_li = []
for token in doc:
    token_li.append(token.text)
print(token_li)
Output:
['It', "'s", 'beautiful', 'toy']
Difference between the Keras tokenizer and the spaCy tokenizer
text = "It's beautiful toy"Keras : [“it’s”, ‘beautiful’, ‘toy’] Spacy : [‘It’, “‘s”, ‘beautiful’, ‘toy’]
Example: tokenize text using the spaCy tokenizer
import spacy

nlp = spacy.load("en")
doc = nlp("It's beautiful toy")

word_index = 1
word_dict = {}
word_seq = []
for token in doc:
    if token.text not in word_dict:
        word_dict[token.text] = word_index
        word_index += 1
    word_seq.append(word_dict[token.text])

print("Word Dict : ", word_dict)
print("Word seq : ", word_seq)
Output:
Word Dict : {"'s": 2, 'It': 1, 'beautiful': 3, 'toy': 4}
Word seq : [1, 2, 3, 4]
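To mirror the Keras oov_token idea on top of this hand-rolled spaCy vocabulary, here is a hedged sketch (the helper name and the reserved index are my own, continuing from the code above):

# continuing from the block above (nlp and word_dict are already defined)
OOV_INDEX = 0  # reserved index for unknown words; an assumption, not from the original

def doc_to_sequence(doc, word_dict):
    # unknown words fall back to the reserved OOV index instead of being skipped
    return [word_dict.get(token.text, OOV_INDEX) for token in doc]

test_doc = nlp("It's a beautiful toy")
print(doc_to_sequence(test_doc, word_dict))
# expected: [1, 2, 0, 3, 4]  ('a' falls back to the OOV index)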
NLTK Tokenizer
NLTK has three different word tokenizers:
- WhitespaceTokenizer : tokenizes on white space
- WordPunctTokenizer : tokenizes on punctuation
- TreebankWordTokenizer : tokenizes using Penn Treebank grammar rules
Let’s see how these tokenizers work.
import nltk

text = "Wow, it's awesome place!!"
whiteSpace_tk = nltk.tokenize.WhitespaceTokenizer().tokenize(text)
wordPunct_tk = nltk.tokenize.WordPunctTokenizer().tokenize(text)
treeBank_tk = nltk.tokenize.TreebankWordTokenizer().tokenize(text)
print("WhitespaceTokenizer :\t", whiteSpace_tk)
print("WordPunctTokenizer :\t", wordPunct_tk)
print("TreebankWordTokenizer :\t", treeBank_tk)
Output:
WhitespaceTokenizer : ['Wow,', "it's", 'awesome', 'place!!']
WordPunctTokenizer : ['Wow', ',', 'it', "'", 's', 'awesome', 'place', '!!']
TreebankWordTokenizer : ['Wow', ',', 'it', "'s", 'awesome', 'place', '!', '!']
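For completeness, NLTK’s everyday entry point is nltk.word_tokenize, which (to the best of my knowledge) builds on the Treebank tokenizer after sentence splitting and needs the punkt models downloaded; a hedged sketch:

import nltk

# assumption: the punkt tokenizer models are required for word_tokenize
nltk.download('punkt')

text = "Wow, it's awesome place!!"
print(nltk.word_tokenize(text))
# expected to closely match the TreebankWordTokenizer output above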
Summary
Here I have explained different tokenization techniques with examples. These tokenization methods are used to convert text into tokens for text analysis.
Thanks for reading the full article. If you found this article helpful, please clap for it… It means a lot to me…