Regular expressions are very useful for text manipulation during the text-cleaning phase of Natural Language Processing (NLP). If you don't have a sufficient understanding of regular expressions, I recommend reading a tutorial on Regular Expressions in Python.
Real-life, human-written text contains emojis, shortened words, misspellings, special symbols, etc. Twitter is a good example: tweets contain very noisy text such as hashtags, so it is necessary to remove non-useful information from a tweet.
Find hashtags
In [1]: import re
        tweet = "wow!,it is a natural beauty.#nature #_beautiful #"
        # '#' followed by optional underscores and one or more lowercase letters
        x = re.findall('#[_]*[a-z]+', tweet)

In [2]: x
Out[2]: ['#nature', '#_beautiful']
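Note that the pattern above only matches lowercase letters, so hashtags containing uppercase letters or digits would be missed. A minimal sketch of a more general pattern, using `\w` (which matches letters, digits, and underscore):

```python
import re

# A tweet with mixed-case and numeric hashtags that '#[_]*[a-z]+' would miss.
tweet = "Loving the view #Nature2021 #sunset #_hidden_gem"

# \w matches letters, digits, and underscore, so this also captures
# uppercase letters and numbers inside hashtags.
hashtags = re.findall(r'#\w+', tweet)
print(hashtags)  # ['#Nature2021', '#sunset', '#_hidden_gem']
```

Whether you want this broader behavior depends on your cleaning task; `\w` will also pick up trailing digits that a stricter pattern would cut off.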
Regex for Dates
In [3]: import re
        date = '23 oct 2019 23 oct,2019 23 october,2019 oct 26,2020'
        # allow only whitespace between day, month and year
        x = re.findall(r'\d{2} [a-z]{3} \d{4}', date)
        # allow whitespace or a comma between day, month and year
        # (note: inside [...] a '|' is a literal character, so [ ,] is enough)
        y = re.findall(r'\d{2}[ ,][a-z]{3}[ ,]\d{4}', date)

In [4]: x
Out[4]: ['23 oct 2019']

In [5]: y
Out[5]: ['23 oct 2019', '23 oct,2019']

In [6]: x2 = re.findall(r'\d{2}[ ,](?:Jan|Feb|Mar|oct)[a-z]*[ ,]\d{4}', date)

In [7]: x2
Out[7]: ['23 oct 2019', '23 oct,2019', '23 october,2019']

In [8]: x3 = re.findall(r'(?:\d{2})*[ ,](?:Jan|Feb|Mar|oct)[a-z]*[ ,](?:\d{2},)*\d{4}', date)

In [9]: x3
Out[9]: ['23 oct 2019', '23 oct,2019', '23 october,2019', ' oct 26,2020']
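Real text often mixes capitalizations such as 'Oct', 'OCT', and 'October'. Rather than listing every variant in the alternation, one sketch is to pass `re.IGNORECASE` so a single lowercase month list covers them all:

```python
import re

# Dates with inconsistent month capitalization.
date = '23 Oct 2019 23 OCT,2019 23 October,2019'

# With re.IGNORECASE, one lowercase alternation matches 'Oct', 'OCT',
# 'October', etc. [a-z]* absorbs any trailing letters of the full name.
pattern = r'\d{2}[ ,](?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*[ ,]\d{4}'
matches = re.findall(pattern, date, flags=re.IGNORECASE)
print(matches)  # ['23 Oct 2019', '23 OCT,2019', '23 October,2019']
```

The flag also makes the `[a-z]*` class case-insensitive, so names like 'OCTOBER' would match as well.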
Detect Bad Words Using Regex
In [10]: import re
         s = "f**k f** fu*k fu** f**king f**king news"
         # Replace each bad word with the placeholder "B_word"
         x = re.sub(r'f[a-z]*\*+[a-z]*', 'B_word', s)

In [11]: x
Out[11]: 'B_word B_word B_word B_word B_word B_word news'
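Instead of a fixed placeholder, `re.sub` also accepts a function as the replacement, which lets the substitution depend on the matched text. A minimal sketch that masks each bad word with a same-length run of asterisks:

```python
import re

s = "f**k f** fu*k this f**king news"

# When the replacement argument of re.sub is a callable, it receives
# the match object and returns the replacement string.
def censor(match):
    return '*' * len(match.group())

x = re.sub(r'f[a-z]*\*+[a-z]*', censor, s)
print(x)  # '**** *** **** this ******* news'
```

This preserves the length of each censored word, which can be useful when the cleaned text must keep its original alignment.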
. . .
Suppose we want to predict the sentiment of a text. We need to remove non-useful information to achieve better performance.
In the example below, the price $1000 will not contribute to predicting the sentiment of the text, so it is better to remove it.
In [12]: import re
         s = "The cost of mobile is $1000"
         # '\$' matches a literal dollar sign, '\d+' one or more digits
         x = re.sub(r'\$\d+', '_', s)

In [13]: x
Out[13]: 'The cost of mobile is _'
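In practice, these substitutions are usually chained into a single cleaning step. A sketch of a hypothetical helper (the function name and pattern choices are illustrative, not a standard API) that strips URLs, mentions, hashtags, and prices before sentiment analysis:

```python
import re

# Hypothetical helper chaining the substitutions above to strip
# common tweet noise before feeding text to a sentiment model.
def clean_tweet(text):
    text = re.sub(r'https?://\S+', '', text)  # URLs
    text = re.sub(r'@\w+', '', text)          # @mentions
    text = re.sub(r'#\w+', '', text)          # hashtags
    text = re.sub(r'\$\d+', '', text)         # prices like $1000
    return ' '.join(text.split())             # collapse leftover whitespace

print(clean_tweet("@shop The mobile costs $1000 #deal https://t.co/xyz"))
# 'The mobile costs'
```

The order of the substitutions matters only slightly here, but removing URLs first avoids their `#` fragments being misread as hashtags.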