Counting words in a text document is a natural starting point. However, a simple word count is not sufficient for text processing, because words such as “the”, “an”, and “your” occur very frequently in almost every document. Their large counts contribute nothing to the analysis of the text. TF-IDF can also be used to filter such stop words out of a document.
A better way to handle this problem is to weight the word frequencies. This method is called TF-IDF, which stands for “Term Frequency – Inverse Document Frequency”. TF-IDF is a numerical statistic that measures how important a word is to a document.
- Term Frequency: the number of times a word appears in a text document.
- Inverse Document Frequency: a measure of whether a word is rare or common across the documents in a collection.
The formulas used to compute TF-IDF are:
tf(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
where tf(t, d) is the term frequency, t is the term, and d is the document.
idf(t) = log[ n / df(t) ] + 1
where idf(t) is the inverse document frequency, n is the total number of documents, and df(t) is the document frequency of term t (the number of documents that contain t).
tf-idf(t, d) = tf(t, d) * idf(t)
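To make these formulas concrete, here is a minimal Python sketch of them. The function names are purely illustrative, and the base-10 logarithm is an assumption chosen to match the worked example below.

import math

def term_frequency(term_count, total_terms):
    # tf(t, d) = (times t appears in d) / (total terms in d)
    return term_count / total_terms

def inverse_document_frequency(n_documents, document_frequency):
    # idf(t) = log[ n / df(t) ] + 1  (base-10 log assumed here)
    return math.log10(n_documents / document_frequency) + 1

def tf_idf(term_count, total_terms, n_documents, document_frequency):
    # tf-idf(t, d) = tf(t, d) * idf(t)
    return term_frequency(term_count, total_terms) * inverse_document_frequency(n_documents, document_frequency)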
Example:
Consider a document with a total of 100 words in which the word “book” occurs 5 times.
Term frequency (tf) = 5 / 100 = 0.05
Let’s assume we have 10,000 documents and the word “book” occurs in 1,000 of them. Then the idf is:
Inverse Document Frequency (IDF) = log[10000/1000] + 1 = 2
TF-IDF = 0.05 * 2 = 0.1
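The same numbers can be checked directly in Python (again assuming a base-10 logarithm):

import math

tf = 5 / 100                          # term frequency of "book"
idf = math.log10(10000 / 1000) + 1    # = 1 + 1 = 2
print(tf * idf)                       # 0.1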
. . .
Scikit-learn provides an implementation of TF-IDF in the TfidfVectorizer class. Its main parameters are listed below, followed by a short sketch that combines several of them.
Parameters:
- input : {'filename', 'file', 'content'} (default='content'). Whether the items to vectorize are raw strings, filenames, or file objects.
- lowercase : bool (default=True). Convert all characters to lowercase before tokenizing.
- stop_words : Remove the given list of words (or the built-in 'english' list) from the resulting vocabulary.
- ngram_range : The lower and upper boundary of the range of n-values for the n-grams to be extracted.
- max_df : Ignore terms that have a document frequency higher than the given threshold.
- min_df : Ignore terms that have a document frequency lower than the given threshold.
- max_features : Build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
- norm : 'l1', 'l2' or None (default='l2').
- use_idf : boolean (default=True). Enable inverse-document-frequency reweighting.
- smooth_idf : boolean (default=True). Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
- sublinear_tf : boolean (default=False). Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
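As a rough sketch of how several of these parameters might be combined (the toy corpus and the specific parameter values here are chosen only for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Do not limit your challenges, challenge your limits",
        "your challenges",
        "their limits"]

# stop_words drops common English words, ngram_range also extracts bigrams,
# min_df/max_df filter out very rare or very common terms, and
# max_features caps the vocabulary size.
vect = TfidfVectorizer(stop_words='english',
                       ngram_range=(1, 2),
                       min_df=1,
                       max_df=1.0,
                       max_features=10,
                       sublinear_tf=True)

X = vect.fit_transform(docs)      # build vocabulary and encode in one step
print(vect.get_feature_names())   # inspect the resulting vocabulary
print(X.shape)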
Example
In [1]: from sklearn.feature_extraction.text import TfidfVectorizer
        text = ["Do not limit your challenges,challenge your limits",
                "your challenges", "their limits"]
        vect = TfidfVectorizer()               # create an object
        vect.fit(text)                         # build vocabulary
        tokenize_text = vect.transform(text)   # encode the text data

# Let's print the vocabulary
In [2]: vect.vocabulary_
Out[2]: {'your': 7, 'limits': 4, 'challenge': 0, 'limit': 3, 'do': 2, 'not': 5, 'their': 6, 'challenges': 1}

In [3]: vect.get_feature_names()
Out[3]: ['challenge', 'challenges', 'do', 'limit', 'limits', 'not', 'their', 'your']

In [4]: tokenize_text.shape
Out[4]: (3, 8)

# Let's print the idf score of each term
In [5]: vect.idf_
Out[5]: [1.69314718, 1.28768207, 1.69314718, 1.69314718, 1.28768207, 1.69314718, 1.69314718, 1.28768207]

# Let's apply this vocabulary to encode new text data
In [6]: new_text = ["push yourself to your limit"]
        new_txt_encode = vect.transform(new_text)

In [7]: new_txt_encode.toarray()
Out[7]: [[0.        , 0.        , 0.        , 0.79596054, 0.        , 0.        , 0.        , 0.60534851]]
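For readability, the idf scores above can be paired with their terms; a small sketch of this, reusing the same toy corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

text = ["Do not limit your challenges,challenge your limits",
        "your challenges", "their limits"]
vect = TfidfVectorizer()
vect.fit(text)

# Pair each vocabulary term with its idf weight: terms that appear in
# more documents (e.g. "your", "limits") receive a lower idf score.
for term, idf in zip(vect.get_feature_names(), vect.idf_):
    print(term, round(idf, 4))

In the encoded new_text vector above, only “limit” and “your” receive non-zero weights, because the other words in the sentence are not part of the fitted vocabulary.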
. . .