Word tokenization is the process of splitting text into words; each resulting piece is called a token. Tokenization is an important part of the field of Natural Language Processing.
NLTK's tokenize module provides two tokenizers:
- word tokenizer
- sentence tokenizer
Word tokenizer
The word_tokenize() function splits the text and returns a Python list of word tokens, with punctuation marks kept as separate tokens.
In [1]: from nltk.tokenize import word_tokenize
        text = "Hello, world!!, Good Morning"
        print(word_tokenize(text))
Out[1]: ['Hello', ',', 'world', '!', '!', ',', 'Good', 'Morning']
Sentence tokenizer
NLTK’s sentence tokenizer, sent_tokenize(), is used to split the text into sentences.
In [2]: from nltk.tokenize import sent_tokenize
        text = "Hello, world!!, Good Morning"
        print(sent_tokenize(text))
Out[2]: ['Hello, world!', '!, Good Morning']
Tokenization breaks a text string into identifiable linguistic units that constitute a piece of language data. By default, NLTK’s word tokenizer splits the text into words on whitespace and punctuation.
However, with real-life text data you may need to customize the word tokenizer. To do this, you can use a regular expression to tokenize the text and gain much more control over the tokenization process.
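As a minimal sketch of this idea, NLTK’s regexp_tokenize() (from nltk.tokenize) takes a pattern and keeps only the matching substrings as tokens; the pattern r"\w+" used here is just an assumed example that keeps alphanumeric runs and drops punctuation.

In [3]: from nltk.tokenize import regexp_tokenize
        text = "Hello, world!!, Good Morning"
        # keep only runs of word characters, dropping punctuation
        print(regexp_tokenize(text, r"\w+"))
Out[3]: ['Hello', 'world', 'Good', 'Morning']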
Regular expressions are a powerful way to identify specific patterns in text. Python provides the re module for working with regular expressions. Using the re.findall() method, we can find all substrings, or tokens, in the text that match a pattern.
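For example, the following sketch uses re.findall() on the same sample sentence; the pattern r"\w+|[^\w\s]" is an assumed one that treats runs of word characters as one kind of token and individual punctuation marks as another, reproducing the earlier word_tokenize() output.

In [4]: import re
        text = "Hello, world!!, Good Morning"
        # match either a run of word characters or a single punctuation mark
        print(re.findall(r"\w+|[^\w\s]", text))
Out[4]: ['Hello', ',', 'world', '!', '!', ',', 'Good', 'Morning']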
. . .