An Introduction to N-grams

An N-gram is a contiguous sequence of n items from a given sample of text or speech. In Natural Language Processing, N-grams are widely used for text analysis. An N-gram of size 1 is called a “unigram”, size 2 a “bigram”, and size 3 a “trigram”.

Example:

text = “The Margherita pizza has very good taste.”

With N = 2, the N-grams (bigrams) of this sentence are:

  • The Margherita
  • Margherita pizza
  • pizza has
  • has very
  • very good
  • good taste
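The list above can be reproduced with a few lines of Python. This is a minimal sketch, not a reference implementation: the helper name `ngrams` is chosen here for illustration, and it assumes simple whitespace tokenization with trailing punctuation stripped.

```python
def ngrams(text, n):
    """Return the word-level n-grams of `text` as a list of strings."""
    # Assumption: split on whitespace and strip basic punctuation from each token.
    tokens = [t.strip(".,!?") for t in text.split()]
    tokens = [t for t in tokens if t]  # drop tokens that were only punctuation
    # Slide a window of n tokens over the sentence.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "The Margherita pizza has very good taste."
bigrams = ngrams(text, 2)
for gram in bigrams:
    print(gram)
```

Calling `ngrams(text, 3)` on the same sentence would give the trigrams instead.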

.     .      .

Why N-gram?

N-grams play an important role in text analysis in Machine Learning. Sometimes a single word alone isn’t sufficient to capture the context of a text. Let’s see how N-grams are useful for text analysis using an example.

Suppose we need to predict whether the sentiment of a text is positive or negative.

text = “The Margherita pizza is not bad taste”

If we consider only unigrams (single words) for text analysis, the negative word “bad” leads to a wrong prediction: the text is classified as negative. But if we use bigrams, the bigram “not bad” helps to predict the text as having a positive sentiment.
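The difference can be illustrated with a toy scorer. This is only a sketch under strong assumptions: the two lexicons and their scores are made up for this example, and a real sentiment model would learn such weights from data.

```python
# Hypothetical toy lexicons, invented for illustration only.
UNIGRAM_SCORES = {"bad": -1, "good": 1}
BIGRAM_SCORES = {"not bad": 1, "not good": -1}


def score_unigrams(text):
    """Sum the scores of individual words; unknown words count as 0."""
    tokens = text.lower().split()
    return sum(UNIGRAM_SCORES.get(t, 0) for t in tokens)


def score_bigrams(text):
    """Prefer a matching bigram; fall back to the single word otherwise."""
    tokens = text.lower().split()
    score = 0
    i = 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2])
        if pair in BIGRAM_SCORES:
            score += BIGRAM_SCORES[pair]
            i += 2  # both words of the bigram are consumed
        else:
            score += UNIGRAM_SCORES.get(tokens[i], 0)
            i += 1
    return score


text = "The Margherita pizza is not bad taste"
print(score_unigrams(text))  # negative: only "bad" is seen
print(score_bigrams(text))   # positive: "not bad" is seen as a unit
```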

.     .      .

Number of N-grams in a sentence:

Number of N-grams = X - (N - 1)

Where,

X is the total number of words in the sentence.

N is the N-gram size (1 for unigrams, 2 for bigrams, and so on).
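As a quick check of the formula, the sketch below applies it to the seven-word example sentence. The function name `ngram_count` is hypothetical, and clamping the result at zero for sentences shorter than N is an assumption added here, not part of the formula above.

```python
def ngram_count(num_words, n):
    """Number of n-grams in a sentence of `num_words` words: X - (N - 1)."""
    # Assumption: a sentence shorter than n words has zero n-grams.
    return max(num_words - (n - 1), 0)


words = "The Margherita pizza has very good taste".split()
counts = {n: ngram_count(len(words), n) for n in (1, 2, 3)}
print(counts)  # 7 words give 7 unigrams, 6 bigrams, 5 trigrams
```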
