Machines understand numbers, not text, so text data must be encoded as numbers before a model can use it. The process of assigning a unique number to each word is called tokenization.
The Scikit-Learn library provides CountVectorizer, which converts a collection of text documents to a matrix of token counts. CountVectorizer offers a simple way to both tokenize text data and build a vocabulary of known words, and it can then encode new text data using that vocabulary. The encoded vector is stored as a sparse matrix because it contains mostly zeros. The following steps are taken to use CountVectorizer:
- Create an object of the CountVectorizer class.
- Call the fit() function to build a vocabulary of words from the text data.
- Call the transform() function to encode the text data using the built vocabulary (steps 2 and 3 can also be combined, as shown in the sketch after this list).
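As a convenience, fit() and transform() can be collapsed into a single fit_transform() call. A minimal sketch, using a made-up sentence for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["machines understand numbers not text"]  # illustrative sentence

vect = CountVectorizer()
# fit_transform() builds the vocabulary and encodes the corpus in one call
encoded = vect.fit_transform(corpus)
print(encoded.toarray())  # count matrix, one row per document
```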
Parameters:
- input : a sequence of strings.
- lowercase : bool (default=True). Convert all characters to lowercase before tokenizing.
- stop_words : Remove the specified words from the resulting vocabulary.
- ngram_range : The lower and upper boundary of the range of n-values for the n-grams to be extracted.
- max_df : Ignore terms that have a document frequency higher than the given threshold.
- min_df : Ignore terms that have a document frequency lower than the given threshold.
- max_features : Build a vocabulary that considers only the top max_features terms, ordered by term frequency across the corpus.
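A quick sketch showing several of these parameters together; the three-document corpus below is invented for illustration, and the exact vocabulary you get may vary with your scikit-learn version:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

vect = CountVectorizer(
    stop_words="english",  # drop built-in English stop words such as "the" and "on"
    ngram_range=(1, 2),    # extract unigrams and bigrams
    min_df=1,              # keep terms appearing in at least one document
    max_df=1.0,            # keep terms no matter how common they are
    max_features=10,       # cap the vocabulary at the 10 most frequent terms
)
X = vect.fit_transform(docs)
print(sorted(vect.vocabulary_))  # the learned (possibly truncated) vocabulary
```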
Example:
```python
In [1]: from sklearn.feature_extraction.text import CountVectorizer
   ...: text = ["Do not limit your challenges, challenge your limits"]
   ...: vect = CountVectorizer()               # create an object
   ...: vect.fit(text)                         # build the vocabulary
   ...: tokenize_text = vect.transform(text)   # encode the text data

# Let's print the vocabulary
In [2]: vect.vocabulary_
Out[2]: {'limit': 3, 'do': 2, 'challenge': 0, 'challenges': 1, 'not': 5, 'your': 6, 'limits': 4}

In [3]: vect.get_feature_names()  # use get_feature_names_out() in scikit-learn >= 1.0
Out[3]: ['challenge', 'challenges', 'do', 'limit', 'limits', 'not', 'your']
```
Let's look at the encoded vector, which shows a count of 1 for every word except the last one (index 6, 'your'), which occurs twice.

```python
In [4]: tokenize_text.toarray()
Out[4]: array([[1, 1, 1, 1, 1, 1, 2]])
```
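As noted earlier, transform() actually returns a SciPy sparse matrix rather than a dense array; toarray() is what produces the dense view above. Printing the sparse matrix directly shows only the stored non-zero entries, roughly as (row, column) count pairs:

```python
# Peek at the sparse representation of the encoded vector
print(type(tokenize_text))  # a scipy.sparse CSR matrix
print(tokenize_text)        # only non-zero entries, e.g. (0, 6) -> 2
```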
Let’s apply this vocabulary to encode new text data.
```python
In [5]: new_text = ["push yourself to your limit"]
   ...: new_txt_encode = vect.transform(new_text)

In [6]: new_txt_encode.toarray()
Out[6]: array([[0, 0, 0, 1, 0, 0, 1]])
```
The result above shows that the words at indices 3 and 6 ('limit' and 'your') each occur once in the new document. The other words of the new document ('push', 'yourself', 'to') are not in the vocabulary, so they are silently ignored, while the vocabulary words that do not appear are encoded as 0.
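As a small follow-up sketch, CountVectorizer also provides inverse_transform(), which maps an encoded vector back to the vocabulary terms that actually occurred (word order and counts are not preserved):

```python
# Recover which vocabulary words the encoded vector represents
print(vect.inverse_transform(new_txt_encode))
# expected to contain 'limit' and 'your', the two in-vocabulary words
```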