Text Preprocessing: Removal of Punctuations

Text cleaning, or text pre-processing, is a mandatory step when working with text in Natural Language Processing (NLP). Real-life text data is noisy: it contains misspelled words, abbreviations, special symbols, emojis, and so on. We need to clean this kind of noisy text before feeding it to a machine learning model.

There are different methods for cleaning the data. In this tutorial, you will learn how to handle special symbols and punctuation in text data. When working in Natural Language Processing, we generally use pre-trained word embeddings such as GloVe, fastText, etc. Each of these embeddings handles punctuation in its own way.

Many word embeddings support punctuation and special symbols. In that case, we should retain the punctuation, because those models are aware of the difference between hurray and hurray!. In this scenario, the model works better with the punctuation kept in place.

On the other hand, it’s perfectly fine to remove the punctuation from the text if your word embedding doesn’t support it.

So, depending on the situation, we either remove punctuation from the text or retain it as-is. Below, I explain both approaches with examples.

Remove Punctuation

We need to choose the list of punctuation characters to discard carefully, based on the use case. For example, Python’s string module contains the following list of punctuation characters:

>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

You can add or remove punctuation characters as per your requirements. The function below replaces each punctuation symbol in the text with a space:

import string

regular_punct = list(string.punctuation)

def remove_punctuation(text, punct_list):
    # Replace every punctuation character with a space, then trim the ends.
    for punc in punct_list:
        if punc in text:
            text = text.replace(punc, ' ')
    return text.strip()

remove_punctuation(" Good Morning!How are you? ",regular_punct)

This generates the following output:

'Good Morning How are you'
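As an alternative sketch, Python’s built-in str.translate with a translation table does the same replacement in a single pass, which is usually faster on large texts. The function name remove_punctuation_fast is my own illustrative choice, not part of the tutorial:

```python
import string

# Map every punctuation character to a space in one translation table.
table = str.maketrans({p: ' ' for p in string.punctuation})

def remove_punctuation_fast(text):
    # Translate punctuation to spaces, then collapse any repeated whitespace.
    return ' '.join(text.translate(table).split())

print(remove_punctuation_fast(" Good Morning!How are you? "))
# 'Good Morning How are you'
```

Note that the split/join step also normalizes the double spaces that plain replacement can leave behind.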

You can also add other special symbols you want to discard to the list. Below is a list of special characters that rarely appear in a sentence:

extra_punct = [
    ',', '.', '"', ':', ')', '(', '!', '?', '|', ';', "'", '$', '&',
    '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£',
    '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',
    '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', '“', '★', '”',
    '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾',
    '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', '▒', ':', '¼', '⊕', '▼',
    '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲',
    'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', '∙', ')', '↓', '、', '│', '(', '»',
    ',', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø',
    '¹', '≤', '‡', '√', '«', '»', '´', 'º', '¾', '¡', '§', '£', '₤']

Handling these kinds of special characters in the text preprocessing step helps the model improve its performance.
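To combine the extra symbols with Python’s built-in punctuation, you can merge the two lists and de-duplicate them with a set before passing them to the removal function. The sketch below uses only a small sample of the extra symbols for brevity:

```python
import string

regular_punct = list(string.punctuation)
extra_punct = ['•', '€', '™', '“', '”']   # a small sample of the full list above

# De-duplicate, since some symbols may appear in both lists.
all_punct = list(set(regular_punct + extra_punct))

def remove_punctuation(text, punct_list):
    # Replace every listed character with a space, then trim the ends.
    for punc in punct_list:
        if punc in text:
            text = text.replace(punc, ' ')
    return text.strip()

result = remove_punctuation('“Offer” • only €5!', all_punct)
# Plain replacement leaves double spaces; split() normalizes them.
print(result.split())
# ['Offer', 'only', '5']
```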

Retain Punctuation

If your word embedding supports special symbols, you should retain them in the text rather than discarding them, because the model is aware of the difference between happy and happy!!. Each special symbol carries its own meaning in a sentence, so it is a good idea to keep it; if we remove a special character, we may lose useful information.

In this case, let’s handle the special symbols by simply adding a space on both sides of each symbol:

import string

regular_punct = list(string.punctuation)

def spacing_punctuation(text, regular_punct):
    # Surround each punctuation character with spaces so that it
    # becomes a separate token on a whitespace split.
    for punc in regular_punct:
        if punc in text:
            text = text.replace(punc, f' {punc} ')
    return text.strip()

spacing_punctuation("Good Morning!! How are you?? ", regular_punct)

This produces the following result:

'Good Morning ! ! How are you ? ?'
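Since every punctuation mark is now surrounded by whitespace, a plain split() yields each mark as a standalone token, ready to be looked up in the embedding:

```python
import string

regular_punct = list(string.punctuation)

def spacing_punctuation(text, punct_list):
    # Surround each punctuation character with spaces.
    for punc in punct_list:
        if punc in text:
            text = text.replace(punc, f' {punc} ')
    return text.strip()

# Each punctuation mark becomes its own token on a whitespace split.
tokens = spacing_punctuation("Good Morning!! How are you??", regular_punct).split()
print(tokens)
# ['Good', 'Morning', '!', '!', 'How', 'are', 'you', '?', '?']
```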

Using this approach, the model can take the meaning of special characters into account while training. Let’s check whether the GloVe embedding supports special symbols or not.

First, let’s load the GloVe embeddings:

import numpy as np

embeddings_index = {}
path = "glove.6B/glove.6B.300d.txt"

with open(path, encoding='utf-8') as f:
    for line in f:
        values = line.split(' ')
        # The first element is the word itself.
        word = values[0]
        # The rest of the line is the vector representation of that word.
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('The number of words in GloVe : ', len(embeddings_index))

Output:

The number of words in GloVe :  400000

The GloVe word embedding contains 400,000 words. Let’s check the vector representations of punctuation characters.

import string

regular_punct = list(string.punctuation)
print(regular_punct)
# ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']

# Let's add a few special characters to the list
extra_punct = ['§', '£', '₤', 'º']
all_punct = list(set(regular_punct + extra_punct))

not_supported_symbol = []
for e in all_punct:
    if e not in embeddings_index:
        not_supported_symbol.append(e)
print("Special symbol not present in word embedding : ", not_supported_symbol)

Output:

Special symbol not present in word embedding :  ['§', 'º']

The GloVe word vectors also support various currency symbols. It is safe to remove unknown special characters from a sentence.

The main reason to separate punctuation with spaces is to get vector representations for as many words as possible. For example, the word hello! doesn’t have a vector representation in the GloVe word embedding, but the word hello does. So we need to separate each special character with spaces to get vector representations for the maximum number of words in a text document.
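This lookup behaviour can be sketched with a hypothetical miniature embedding index standing in for GloVe (the real index has 400,000 entries):

```python
import numpy as np

# Hypothetical miniature embedding index, for illustration only.
embeddings_index = {
    'hello': np.array([0.1, 0.2, 0.3], dtype='float32'),
    '!':     np.array([0.4, 0.5, 0.6], dtype='float32'),
}

def lookup(token):
    # Return the vector if the token is known, else None (out of vocabulary).
    return embeddings_index.get(token)

print(lookup('hello!'))   # None: 'hello!' is out of vocabulary
print(lookup('hello'))    # the vector for 'hello'
print(lookup('!'))        # the vector for '!'
```

With spaced punctuation, both hello and ! are separate tokens and each gets its own vector, instead of the single out-of-vocabulary token hello!.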

The word hello! is not present in the word embedding:

>>> embeddings_index['hello!']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-26-90bec1d858a6> in <module>
----> 1 embeddings_index['hello!']

KeyError: 'hello!'

But the word hello does exist in the word embedding:

>>> embeddings_index['hello']

This will print the vector representation of the word hello.

>>> embeddings_index['!']

This will print the vector representation of the punctuation mark ‘!’.

In this tutorial, you have learned about special symbols and punctuation and how to deal with them in various scenarios. Please write a comment in the section below if you have any questions about text preprocessing.
