Introduction to BERT

BERT stands for Bidirectional Encoder Representations from Transformers. It is an NLP framework introduced by researchers at Google AI. BERT is a pre-trained language representation model that obtains state-of-the-art results on a wide range of Natural Language Processing (NLP) tasks. The pre-trained BERT model can be fine-tuned for a downstream task by adding just a single output layer.

How BERT works

The model architecture of BERT is a multi-layer bidirectional Transformer encoder, which considers both the left and right context in all layers.

Pre-trained representations can be either context-free or contextual, and contextual representations can further be unidirectional or bidirectional.

Word2Vec and GloVe are context-free models: they generate a single “word embedding” representation for each word in the vocabulary. So the word bank would have the same representation in river bank and bank deposit.
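
Here is a minimal sketch of a context-free lookup, assuming the gensim library and its downloadable "glove-wiki-gigaword-100" GloVe vectors (any GloVe file would behave the same way):

```python
# Minimal sketch of a context-free lookup, assuming the gensim library and
# its downloadable "glove-wiki-gigaword-100" GloVe vectors.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")   # pre-trained GloVe word vectors

# The lookup depends only on the word itself, never on the sentence,
# so "bank" gets the same 100-dimensional vector in every context.
print(glove["bank"][:5])
print(glove.most_similar("bank", topn=3))
```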

A contextual model, in contrast, generates the representation of a word based on the other words in the sentence.

A unidirectional contextual model contextualizes each word using only the words to its left or only the words to its right, not both. In a bidirectional contextual model, each word is contextualized using both its left and right context.

For example, consider the sentence I made a bank deposit.

A unidirectional (left-to-right) representation of the word bank is based only on the words to its left, I made a, and not on deposit.

A bidirectional representation, however, considers both the left and the right context. BERT is a deeply bidirectional model: it represents the word bank using both its left and right context.
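
We can see the difference in code. The sketch below assumes the Hugging Face transformers library (any BERT implementation shows the same effect): the contextual vector for bank changes with the surrounding sentence.

```python
# Hedged sketch using the Hugging Face transformers library (an assumption;
# the original Google implementation behaves the same way).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence):
    """Return the contextual vector of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = inputs["input_ids"][0].tolist().index(bank_id)
    return hidden[position]

v_deposit = bank_vector("I made a bank deposit.")
v_river = bank_vector("We sat on the river bank.")

# The two vectors differ because BERT reads both left and right context.
print(torch.cosine_similarity(v_deposit, v_river, dim=0))
```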

The BERT model is trained on very large corpora: English Wikipedia (2,500M words) and BooksCorpus (800M words).

BERT Architecture

There are two model sizes for BERT:

  • BERT Base – 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
  • BERT Large – 24 layers (transformer blocks), 16 attention heads, and 340 million parameters
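
To make the two sizes concrete, here is an illustrative configuration sketch, assuming the Hugging Face transformers BertConfig class (the hidden sizes of 768 and 1024 are listed in the pre-trained model list further below):

```python
# Illustrative sketch only, assuming the Hugging Face transformers BertConfig class.
from transformers import BertConfig

bert_base = BertConfig(num_hidden_layers=12, hidden_size=768,
                       num_attention_heads=12, intermediate_size=3072)
bert_large = BertConfig(num_hidden_layers=24, hidden_size=1024,
                        num_attention_heads=16, intermediate_size=4096)

print(bert_base.num_hidden_layers, bert_base.num_attention_heads)    # 12 12
print(bert_large.num_hidden_layers, bert_large.num_attention_heads)  # 24 16
```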

BERT has two stages: pre-training and fine-tuning.

Pre-training

BERT is a very large model (a 12- to 24-layer Transformer) and is trained on a large corpus for a long time. Pre-training BERT is very expensive: it takes approximately four days on 4 to 16 Cloud TPUs. But don’t worry, Google has released various pre-trained BERT models, so we do not need to train the model from scratch.

Fine-tuning

Fine-tuning is inexpensive and straightforward compared to pre-training. We can use the pre-trained BERT model to create state-of-the-art models for a wide range of NLP tasks, such as question answering and language inference, without substantial task-specific architecture modifications. We just need to add a single additional output layer to the pre-trained model.

For fine-tuning, the model is first initialized with the pre-trained parameters, and then all of the parameters are fine-tuned using labelled data from the downstream task.

During fine-tuning, most model hyperparameters are the same as in pre-training, with the exception of the batch size, learning rate, and number of training epochs. The dropout probability is always kept at 0.1.
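
Putting this together, below is a rough fine-tuning sketch. It assumes the Hugging Face transformers library and a tiny hypothetical labelled dataset; the single additional output layer is the classification head that BertForSequenceClassification places on top of the pre-trained encoder, and the batch size, learning rate, and number of epochs follow the ranges recommended in the BERT paper.

```python
# Rough fine-tuning sketch, assuming the Hugging Face transformers library
# and a tiny hypothetical labelled dataset (train_texts, train_labels).
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# The pre-trained encoder plus a single additional output (classification) layer.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# Dropout stays at the model default of 0.1.

train_texts = ["great movie", "terrible plot"]   # hypothetical labelled data
train_labels = [1, 0]

enc = tokenizer(train_texts, padding=True, truncation=True, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(train_labels))
loader = DataLoader(dataset, batch_size=32, shuffle=True)    # paper suggests 16 or 32

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)   # paper suggests 5e-5, 3e-5, or 2e-5

model.train()
for epoch in range(3):                                       # paper suggests 2-4 epochs
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        outputs.loss.backward()
        optimizer.step()
```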

Pre-trained Models

Google has released various pre-trained BERT models with differing numbers of layers, hidden units, and attention heads. The models come in two casing variants:

  • Uncased – the text is lowercased before WordPiece tokenization, and any accent markers are stripped.
  • Cased – the true case and accent markers are preserved.
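
The difference shows up at tokenization time. A minimal sketch, assuming the Hugging Face tokenizers for the uncased and cased checkpoints:

```python
# Minimal sketch, assuming the Hugging Face tokenizers for both checkpoints.
from transformers import BertTokenizer

uncased = BertTokenizer.from_pretrained("bert-base-uncased")
cased = BertTokenizer.from_pretrained("bert-base-cased")

text = "Héllo from Berlin"
print(uncased.tokenize(text))  # lowercased and accent-stripped before WordPiece
print(cased.tokenize(text))    # case and accents preserved (may split into word pieces)
```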

  • BERT-Base, Uncased – 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Uncased – 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Cased – 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Cased – 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Multilingual Cased (New) – 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Multilingual Cased (Old) – 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Chinese – Simplified and Traditional Chinese, 12-layer, 768-hidden, 12-heads, 110M parameters
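
As a quick sanity check, loading one of these checkpoints (here BERT-Base, Cased; the sketch assumes the weights are fetched from the Hugging Face model hub) reproduces the sizes listed above:

```python
# Quick check, assuming the checkpoint is fetched from the Hugging Face model hub.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

cfg = model.config
print(cfg.num_hidden_layers, cfg.hidden_size, cfg.num_attention_heads)   # 12 768 12
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")      # roughly 110M
```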
