TfidfVectorizer for text classification

Counting words in text documents is a very basic starting point. However, a simple word count is not sufficient for text processing, because words like “the”, “an”, and “your” occur very frequently in text documents, and their large counts are meaningless for the analysis of the text. TF-IDF can also be used effectively to filter such stop words out of a text document.

A better way to solve this problem is to weight word frequencies. This method is called TF-IDF, which stands for Term Frequency – Inverse Document Frequency. TF-IDF is a numerical statistic that measures how important a word is to a document in a collection of documents.

  • Term Frequency: The number of times a word appears in a text document.
  • Inverse Document Frequency: A measure of whether a word is rare or common across the documents in the collection.

 

The formulas used to compute TF-IDF are:

tf(t,d) = (Number of times term t appears in a document) / (Total number of terms in the document)

where,
tf(t, d) - term frequency
t - term
d - document

idf(t) = log [ n / df(t) ] + 1

where,
idf(t) - inverse document frequency
n - total number of documents
df(t) - document frequency of term t, i.e. the number of documents that contain t

tf-idf(t, d) = tf(t, d) * idf(t)

Example:

Consider a document that contains a total of 100 words, in which the word “book” occurs 5 times.

Term frequency (tf) = 5 / 100 = 0.05

Let’s assume we have 10,000 documents and the word “book” occurs in 1,000 of them. Then the idf (taking the log to base 10) is:

Inverse Document Frequency (IDF) = log[10000 / 1000] + 1 = 2

TF-IDF = 0.05 * 2 = 0.1
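
This arithmetic can be reproduced in a few lines of Python (a minimal sketch; math.log10 is used so that the numbers match the base-10 example above):

import math

tf = 5 / 100                        # "book" appears 5 times in a 100-word document
idf = math.log10(10000 / 1000) + 1  # "book" appears in 1,000 of 10,000 documents
tf_idf = tf * idf

print(tf, idf, tf_idf)              # prints: 0.05 2.0 0.1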

.     .     .

Scikit-Learn provides an implementation of this technique in the TfidfVectorizer class. Its most commonly used parameters are listed below, followed by a short sketch showing how a few of them can be combined.

Parameters:

  • input : ‘filename’, ‘file’ or ‘content’ (default ‘content’). Whether the items passed in are file names, file objects, or the raw text itself.
  • lowercase : bool (default True). Convert all characters to lowercase before tokenizing.
  • stop_words : Remove the given words from the resulting vocabulary.
  • ngram_range : The lower and upper boundary of the range of n-values for the n-grams to be extracted.
  • max_df : Ignore terms that have a document frequency higher than the given threshold.
  • min_df : Ignore terms that have a document frequency lower than the given threshold.
  • max_features : Build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
  • norm : ‘l1’, ‘l2’ or None (default ‘l2’). Normalization applied to each output row.
  • use_idf : boolean (default=True). Enable inverse-document-frequency reweighting.
  • smooth_idf : boolean (default=True). Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
  • sublinear_tf : boolean (default=False). Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
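
As a quick illustration, here is a minimal sketch combining a few of these parameters (the particular values are arbitrary choices for demonstration, not recommendations):

from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(
    lowercase=True,        # default: lowercase before tokenizing
    stop_words="english",  # drop common English stop words
    ngram_range=(1, 2),    # extract unigrams and bigrams
    max_df=0.9,            # ignore terms that appear in more than 90% of documents
    min_df=1,              # keep terms that appear in at least 1 document
    max_features=5000,     # keep only the 5,000 most frequent terms
    sublinear_tf=False,    # default: raw counts (set True to use 1 + log(tf))
)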

 

Example

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
text = ["Do not limit your challenges,challenge your limits",
        "your challenges",
        "their limits"]

vect = TfidfVectorizer()  # create an object 
vect.fit(text)            # build vocabulary
tokenize_text = vect.transform(text)  # encode the text data

# Let's print vocabulary
In [2]: vect.vocabulary_ 
Out[2]: {'your': 7, 'limits': 4, 'challenge': 0, 'limit': 3, 'do': 2, 'not': 5, 'their': 6, 'challenges': 1}

In [3]: vect.get_feature_names_out()
Out[3]: array(['challenge', 'challenges', 'do', 'limit', 'limits', 'not', 'their',
       'your'], dtype=object)

In [4]: tokenize_text.shape
Out[4]: (3, 8)

# Let's print idf score of terms
In [5]: vect.idf_
Out[5]: array([1.69314718, 1.28768207, 1.69314718, 1.69314718, 1.28768207,
       1.69314718, 1.69314718, 1.28768207])
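
# Note: with the default smooth_idf=True, scikit-learn computes the idf with the
# natural log as idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is why these values
# differ slightly from the plain log[n / df(t)] + 1 formula given earlier.
# For example, "your" appears in 2 of the 3 documents:
#   ln((1 + 3) / (1 + 2)) + 1 = ln(4/3) + 1 ≈ 1.28768207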

# Let’s apply this vocabulary to encode new text data.
In [6]: 
new_text = ["push yourself to your limit"]
new_txt_encode = vect.transform(new_text)
In [7]: new_txt_encode.toarray()
Out[7]: array([[0.        , 0.        , 0.        , 0.79596054, 0.        ,
        0.        , 0.        , 0.60534851]])
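
Since the goal here is text classification, the TF-IDF features are usually fed into a classifier. Below is a minimal sketch using a scikit-learn pipeline with LogisticRegression; the training texts and labels are made up purely for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: 1 = positive review, 0 = negative review
train_texts = ["a great and inspiring book",
               "I loved this book",
               "a boring and predictable story",
               "I did not like this book"]
train_labels = [1, 1, 0, 0]

# Pipeline: TF-IDF features followed by a logistic regression classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

print(model.predict(["what a great book"]))  # expected to lean towards the positive class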

.     .     .
