In this article, I describe different tokenization methods for text preprocessing. As we all know, machines only understand numbers, so it is necessary to convert text into numbers that a machine can work with. The method that converts text into numbers (tokens) is called tokenization. Many tokenization methods exist; here, I have listed several of these techniques with examples.
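Before looking at specific libraries, here is a minimal, hand-rolled sketch of the idea (not tied to any library): split the text into tokens, then map each token to an integer id.

# a minimal illustration of tokenization: text -> tokens -> integer ids
text = "machines only understand numbers"
tokens = text.split()  # naive whitespace "tokenizer"
vocab = {word: i + 1 for i, word in enumerate(dict.fromkeys(tokens))}
ids = [vocab[word] for word in tokens]
print(tokens)  # ['machines', 'only', 'understand', 'numbers']
print(ids)     # [1, 2, 3, 4]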
Keras Tokenization
Let’s see how Keras splits text into word tokens.
from keras.preprocessing.text import text_to_word_sequence

text = "It's very easy to understand."
result = text_to_word_sequence(text)
print(result)
Output:
["it's", 'very', 'easy', 'to', 'understand']
Let’s see how the Keras Tokenizer works:
from keras.preprocessing.text import Tokenizer

tok = Tokenizer()
train_text = ["this girl is looking beautiful!!"]
test_text = ["this girl is not looking beautiful"]
tok.fit_on_texts(train_text)
t1 = tok.texts_to_sequences(train_text)
t2 = tok.texts_to_sequences(test_text)
print(t1)
print(t2)
Output:
[[1, 2, 3, 4, 5]]
[[1, 2, 3, 4, 5]]
When the tokenizer is applied to unseen text such as the test sample, any word that was not seen during fitting is simply skipped. In the example above, “not” is an unknown word for the tokenizer because it does not appear in the training sample.
Note: the train and test sentences have opposite meanings here, yet they produce exactly the same token sequence.
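One way to see why “not” disappears is to inspect the fitted vocabulary (a hedged sketch, continuing from the block above):

# continuing from the block above: the fitted vocabulary only contains training words
print(tok.word_index)
# expected something like (exact ordering depends on Keras internals):
# {'this': 1, 'girl': 2, 'is': 3, 'looking': 4, 'beautiful': 5}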
To overcome this problem, Keras provides the “oov_token” parameter, which handles unknown words.
From Keras documentation:
- oov_token: if given, it will be added to word_index and used to replace out-of-vocabulary words during text_to_sequence calls
from keras.preprocessing.text import Tokenizer

tok = Tokenizer(oov_token=True)
train_text = ["this girl is looking beautiful!!"]
test_text = ["this girl is not looking very beautiful"]
tok.fit_on_texts(train_text)
t1 = tok.texts_to_sequences(train_text)
t2 = tok.texts_to_sequences(test_text)
print(t1)
print(t2)
Output:
[[1, 2, 3, 4, 5]]
[[1, 2, 3, 6, 4, 6, 5]]
In the above example, “not” and “very” are unknown words; we can also call them out-of-vocabulary (OOV) words. Because we created the Tokenizer with ‘oov_token=True’, it assigns a single reserved token to unknown words instead of dropping them. This approach is much better than the previous one: at least it can distinguish the two texts.
We can also see which token is assigned to each word.
print(tok.word_index)
Output:
{True: 6, 'beautiful': 5, 'is': 3, 'girl': 2, 'looking': 4, 'this': 1}
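A note on oov_token=True: Keras accepts any value here and simply adds it as a key in word_index, which is why True itself appears in the dictionary above. It is more common to pass a placeholder string; a minimal sketch, assuming the same training and test sentences:

from keras.preprocessing.text import Tokenizer

# "<OOV>" is an arbitrary placeholder string chosen here, not a Keras requirement
tok = Tokenizer(oov_token="<OOV>")
tok.fit_on_texts(["this girl is looking beautiful!!"])
print(tok.texts_to_sequences(["this girl is not looking very beautiful"]))
print(tok.word_index)  # "<OOV>" now appears as a normal vocabulary entry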
spaCy Tokenizer
Let’s see how the spaCy tokenizer splits words.
import spacy

nlp = spacy.load("en")
doc = nlp("It's beautiful toy")
token_li = []
for token in doc:
    token_li.append(token.text)
print(token_li)
Output:
['It', "'s", 'beautiful', 'toy']
Difference between the Keras tokenizer and the spaCy tokenizer
text = "It's beautiful toy"Keras : [“it’s”, ‘beautiful’, ‘toy’] Spacy : [‘It’, “‘s”, ‘beautiful’, ‘toy’]
Example: tokenize text using the spaCy tokenizer
import spacy

nlp = spacy.load("en")
doc = nlp("It's beautiful toy")

word_index = 1
word_dict = {}
word_seq = []
for token in doc:
    if token.text not in word_dict:
        word_dict[token.text] = word_index
        word_index += 1
    word_seq.append(word_dict[token.text])

print("Word Dict : ", word_dict)
print("Word seq : ", word_seq)
Output:
Word Dict : {"'s": 2, 'It': 1, 'beautiful': 3, 'toy': 4}
Word seq : [1, 2, 3, 4]
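To mirror the Keras oov_token idea on top of this hand-rolled spaCy vocabulary, here is a hedged sketch (the helper name and the reserved index are my own, continuing from the code above):

# continuing from the block above (nlp and word_dict are already defined)
OOV_INDEX = 0  # reserved index for unknown words; an assumption, not from the original

def doc_to_sequence(doc, word_dict):
    # unknown words fall back to the reserved OOV index instead of being skipped
    return [word_dict.get(token.text, OOV_INDEX) for token in doc]

test_doc = nlp("It's a beautiful toy")
print(doc_to_sequence(test_doc, word_dict))
# expected: [1, 2, 0, 3, 4]  ('a' falls back to the OOV index)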
NLTK Tokenizer
NLTK has three different word tokenizers:
- WhitespaceTokenizer : tokenizes on white space
- WordPunctTokenizer : tokenizes on punctuation
- TreebankWordTokenizer : tokenizes using Penn Treebank grammar rules
Let’s see how these tokenizers work.
import nltk

text = "Wow, it's awesome place!!"
whiteSpace_tk = nltk.tokenize.WhitespaceTokenizer().tokenize(text)
wordPunct_tk = nltk.tokenize.WordPunctTokenizer().tokenize(text)
treeBank_tk = nltk.tokenize.TreebankWordTokenizer().tokenize(text)
print("WhitespaceTokenizer :\t", whiteSpace_tk)
print("WordPunctTokenizer :\t", wordPunct_tk)
print("TreebankWordTokenizer :\t", treeBank_tk)
Output:
WhitespaceTokenizer : ['Wow,', "it's", 'awesome', 'place!!']
WordPunctTokenizer : ['Wow', ',', 'it', "'", 's', 'awesome', 'place', '!!']
TreebankWordTokenizer : ['Wow', ',', 'it', "'s", 'awesome', 'place', '!', '!']
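For completeness, NLTK’s everyday entry point is nltk.word_tokenize, which (to the best of my knowledge) builds on the Treebank tokenizer after sentence splitting and needs the punkt models downloaded; a hedged sketch:

import nltk

# assumption: the punkt tokenizer models are required for word_tokenize
nltk.download('punkt')

text = "Wow, it's awesome place!!"
print(nltk.word_tokenize(text))
# expected to closely match the TreebankWordTokenizer output above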
Summary
Here I have explained different tokenization techniques with examples. These tokenization methods are used to convert text into tokens for text analysis.
Thanks for reading the full article. If you found this article helpful, please clap for it… It means a lot to me…