Regular Expression for Text Cleaning in NLP

Regular expressions are very useful for manipulating text during the text cleaning phase of Natural Language Processing (NLP). If you are not yet comfortable with regular expressions, I recommend reading this tutorial on Regular Expression in Python first.

Real-world, human-written text contains emojis, abbreviations, misspellings, special symbols, and so on. Tweets are a good example: they are very noisy and full of elements such as hashtags. Such non-useful information needs to be removed from a tweet before further processing.

Find hashtag

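In [1]:
import re
tweet = "wow!,it is a natural beauty.#nature #_beautiful #"
# '#' followed by optional underscores and at least one lowercase letter; the bare '#' at the end is skipped.
x = re.findall(r'#_*[a-z]+', tweet)

In [2]: x
Out[2]: ['#nature', '#_beautiful']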
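
Finding hashtags is only half of the cleaning step; the introduction says they should be removed. The sketch below shows one way to do that with re.sub. The helper name strip_hashtags and the pattern '#\w*' are assumptions made for illustration and are not part of the original tutorial. Unlike the findall pattern above, this one also removes a bare '#' and hashtags containing digits or uppercase letters.

```python
import re

# Hypothetical helper, not from the original tutorial.
# '#\w*' matches '#' plus any following letters, digits or underscores,
# so a bare '#' is removed as well.
HASHTAG_RE = re.compile(r'#\w*')

def strip_hashtags(text):
    """Remove hashtags and collapse the whitespace they leave behind."""
    without_tags = HASHTAG_RE.sub(' ', text)
    return re.sub(r'\s+', ' ', without_tags).strip()

tweet = "wow!,it is a natural beauty.#nature #_beautiful #"
cleaned = strip_hashtags(tweet)
print(cleaned)  # wow!,it is a natural beauty.
```

Whether to delete the whole hashtag or keep the word after '#' depends on the task; for sentiment analysis the word itself may carry useful signal.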

Regex for Dates

In [3]:
import re
date = '23 oct 2019 23 oct,2019 23 october,2019 oct 26,2020'
# Match only whitespace between day, month and year.
x = re.findall(r'\d{2} [a-z]{3} \d{4}', date)

# Match whitespace or a comma between day, month and year.
# Inside a character class '|' is a literal pipe, so [ ,] is enough.
y = re.findall(r'\d{2}[ ,][a-z]{3}[ ,]\d{4}', date)

In [4]: x
Out[4]: ['23 oct 2019']

In [5]: y
Out[5]: ['23 oct 2019', '23 oct,2019']

In [6]: x2 = re.findall(r'\d{2}[ ,](?:Jan|Feb|Mar|oct)[a-z]*[ ,]\d{4}', date)
In [7]: x2
Out[7]: ['23 oct 2019', '23 oct,2019', '23 october,2019']

In [8]: x3 = re.findall(r'(?:\d{2})?[ ,](?:Jan|Feb|Mar|oct)[a-z]*[ ,](?:\d{2},)?\d{4}', date)
In [9]: x3
Out[9]: ['23 oct 2019', '23 oct,2019', '23 october,2019', ' oct 26,2020']
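
The patterns above list only a few month names, mix upper and lower case, and the last one keeps a leading space in its match. As a rough sketch of how these could be tightened, the example below builds one compiled pattern that covers all twelve month abbreviations, ignores case, and accepts both day-month-year and month-day-year order. The names MONTH, SEP and DATE_RE are assumptions introduced here for illustration.

```python
import re

# Hypothetical building blocks, not from the original tutorial.
MONTH = r'(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*'
SEP = r'[ ,]'  # a single space or comma between the date parts

DATE_RE = re.compile(
    rf'\b\d{{1,2}}{SEP}{MONTH}{SEP}\d{{4}}\b'    # day month year
    rf'|\b{MONTH}{SEP}\d{{1,2}}{SEP}\d{{4}}\b',  # month day year
    re.IGNORECASE,
)

date = '23 oct 2019 23 oct,2019 23 october,2019 oct 26,2020'
matches = DATE_RE.findall(date)
print(matches)
# ['23 oct 2019', '23 oct,2019', '23 october,2019', 'oct 26,2020']
```

The word boundaries (\b) keep the match from starting in the middle of a longer number or word. This is still a loose pattern: it accepts any word that starts with a month abbreviation and does not check that the day is valid, so a date parser is a better fit when the dates must be validated.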

Detect Bad words using Regex

In [10]:
import re
s = "f**k f** fu*k fu** f**king f**king news"
# Replace bad words with the text "B_word" in the string
x = re.sub(r'f[a-z]*\*+[a-z]*', 'B_word', s)

In [11]: x
Out[11]: 'B_word B_word B_word B_word B_word B_word news'
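
The pattern above can also fire inside a longer word that merely contains an "f" followed by asterisks, and it misses capitalised spellings. The sketch below is one possible refinement, assuming that masked words always start at a word boundary; the function name replace_masked_words is hypothetical. It uses re.subn, which returns the number of replacements along with the new string.

```python
import re

# Hypothetical refinement, not from the original tutorial.
# \b anchors the match at the start of a word; IGNORECASE also catches "F**K".
MASKED_WORD_RE = re.compile(r'\bf[a-z]*\*+[a-z]*', re.IGNORECASE)

def replace_masked_words(text, placeholder='B_word'):
    """Return (cleaned_text, number_of_replacements)."""
    return MASKED_WORD_RE.subn(placeholder, text)

s = "f**k f** fu*k fu** f**king f**king news"
cleaned, count = replace_masked_words(s)
print(cleaned)  # B_word B_word B_word B_word B_word B_word news
print(count)    # 6
```

The replacement count can be kept as a simple feature, for example as a signal of how aggressive a text is.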

.     .     .

Suppose we want to predict the sentiment of a text. To achieve better performance, we first need to remove information that does not help the prediction.

In the example below, the price $1000 does not contribute to predicting the sentiment of the text, so it is better to remove it.

In [12]:
import re
s = "The cost of mobile is $1000"
x = re.sub(r'\$\d+', '_', s)
In [13]: x
Out[13]: 'The cost of mobile is _'
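
The pattern '\$\d+' only covers whole numbers, so a price such as $1,000.50 would leave ",000.50" behind. A minimal sketch of a more tolerant version follows, assuming prices use a dollar sign, optional thousands separators, and an optional decimal part; the name mask_prices is hypothetical.

```python
import re

# Hypothetical extension, not from the original tutorial.
# \$            literal dollar sign
# \d+           leading digits
# (?:,\d{3})*   optional thousands groups such as ",000"
# (?:\.\d+)?    optional decimal part such as ".50"
PRICE_RE = re.compile(r'\$\d+(?:,\d{3})*(?:\.\d+)?')

def mask_prices(text, placeholder='_'):
    """Replace dollar amounts with a placeholder."""
    return PRICE_RE.sub(placeholder, text)

print(mask_prices("The cost of mobile is $1000"))
# The cost of mobile is _
print(mask_prices("The cost of mobile is $1,000.50 today"))
# The cost of mobile is _ today
```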

 
