Word Tokenization with NLTK

Word tokenization is the process of splitting text into individual words, called tokens. Tokenization is a fundamental step in Natural Language Processing.

NLTK provides two tokenizer functions:

  • word tokenizer (word_tokenize)
  • sentence tokenizer (sent_tokenize)

Word tokenizer

word_tokenize splits the text and returns a Python list of words and punctuation tokens.

In [1]:
from nltk.tokenize import word_tokenize  # requires the Punkt models: nltk.download('punkt')
text = "Hello, world!!, Good Morning"
print(word_tokenize(text))

Out[1]:
['Hello', ',', 'world', '!', '!', ',', 'Good', 'Morning']

Sentence tokenizer 

NLTK’s sentence tokenizer, sent_tokenize, is used to split the text into sentences.

In [2]:
from nltk.tokenize import sent_tokenize
text = "Hello, world!!, Good Morning"
print(sent_tokenize(text))

Out[2]:
['Hello, world!', '!, Good Morning']

Tokenization breaks a text string into identifiable linguistic units that constitute a piece of language data. Note that NLTK’s word tokenizer does more than split on whitespace: as Out[1] shows, it also separates punctuation marks into their own tokens.
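
To see the difference, compare a plain whitespace split with word_tokenize on the same string; with str.split(), punctuation stays glued to the words:

In [3]:
text = "Hello, world!!, Good Morning"
# str.split() breaks only on whitespace, so punctuation stays attached
print(text.split())

Out[3]:
['Hello,', 'world!!,', 'Good', 'Morning']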

However, with real-world text data you may need to customize the word tokenizer. To do this, you can use a regular expression to tokenize the text, which gives you much more control over the tokenization process.

Regular expressions are a powerful method of identifying specific patterns in text. Python provides the re module for working with regular expressions. Using the re.findall() method, we can find all substrings, i.e. tokens, in the text that match a pattern.
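
As a minimal sketch, the pattern \w+ below (an illustrative choice, not the only option) keeps runs of letters, digits, and underscores and silently drops all punctuation; NLTK's RegexpTokenizer wraps the same pattern-based approach in a reusable tokenizer object:

In [4]:
import re
from nltk.tokenize import RegexpTokenizer

text = "Hello, world!!, Good Morning"

# re.findall() returns every non-overlapping match of the pattern;
# \w+ matches runs of word characters, so punctuation is dropped
print(re.findall(r"\w+", text))

# the same pattern with NLTK's RegexpTokenizer
tokenizer = RegexpTokenizer(r"\w+")
print(tokenizer.tokenize(text))

Out[4]:
['Hello', 'world', 'Good', 'Morning']
['Hello', 'world', 'Good', 'Morning']

By changing the pattern, you control exactly what counts as a token; for instance, r"[A-Za-z]+" would keep purely alphabetic tokens and drop digits as well.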
