Introduction to BERT

BERT stands for Bidirectional Encoder Representations from Transformers. It is an NLP framework introduced by researchers at Google AI. BERT is a pre-trained language representation model that obtains state-of-the-art results on a wide range of Natural Language Processing (NLP) tasks, and the pre-trained model can be fine-tuned for a specific task by adding just a single output layer.

How BERT works

The model architecture of BERT is a multi-layer bidirectional Transformer encoder, which considers both left and right context in all layers.

Pre-trained representations can be either context-free or contextual, and contextual representations can further be unidirectional or bidirectional.

Word2Vec and GloVe are context-free models: they generate a single “word embedding” representation for each word in the vocabulary. So the word bank would have the same representation in river bank and bank deposit.
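As a quick illustration, the sketch below (an assumption, not part of the original article: it uses the gensim library and its downloadable GloVe vectors) shows that a context-free lookup returns exactly the same vector for bank no matter which sentence it appears in:

```python
# Hedged sketch: assumes the gensim library and its pre-trained
# "glove-wiki-gigaword-50" vectors (not part of the original article).
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")   # static, context-free embeddings

vec_in_river_bank = glove["bank"]            # "bank" as in "river bank"
vec_in_bank_deposit = glove["bank"]          # "bank" as in "bank deposit"

# A context-free model has one vector per vocabulary word, so both are identical.
print((vec_in_river_bank == vec_in_bank_deposit).all())   # True
```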

A contextual model, in contrast, generates the representation of each word based on the other words in the sentence.

A unidirectional contextual model contextualizes each word using only the words to its left or only the words to its right, not both. In a bidirectional contextual model, each word is contextualized using the words on both its left and its right.

For example, consider the sentence I made a bank deposit.

A unidirectional (left-to-right) representation of the word bank is based only on the words to its left, I made a, and not on deposit.

A bidirectional representation, however, considers both the left and right context. BERT is a deeply bidirectional model: it represents the word bank using both I made a and deposit.
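To see this concretely, the sketch below (assuming the Hugging Face transformers library and its bert-base-uncased checkpoint, neither of which is part of the original article) extracts the vector BERT produces for bank in two different sentences; the vectors differ because each one is conditioned on the full surrounding context:

```python
# Hedged sketch: assumes the Hugging Face transformers library and the
# publicly released bert-base-uncased checkpoint.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence):
    # Encode the sentence and return BERT's hidden state for the "bank" token.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v_deposit = bank_vector("I made a bank deposit.")
v_river = bank_vector("I sat on the river bank.")

# The cosine similarity is well below 1.0: "bank" gets a different,
# context-dependent vector in each sentence.
print(torch.cosine_similarity(v_deposit, v_river, dim=0).item())
```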

The BERT model is trained on very large corpora: English Wikipedia (2,500M words) and BooksCorpus (800M words).

BERT Architecture

There are two model sizes for BERT:

  • BERT Base – 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
  • BERT Large – 24 layers (transformer blocks), 16 attention heads, and 340 million parameters
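As a sanity check, the following sketch (assuming the Hugging Face transformers library and the released bert-base-uncased and bert-large-uncased checkpoints) loads both sizes and reads the layer, head, and parameter counts from the model itself:

```python
# Hedged sketch: assumes the Hugging Face transformers library and the
# publicly released uncased checkpoints.
from transformers import BertModel

for name in ("bert-base-uncased", "bert-large-uncased"):
    model = BertModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    n_layers = model.config.num_hidden_layers
    n_heads = model.config.num_attention_heads
    print(f"{name}: {n_layers} layers, {n_heads} heads, ~{n_params / 1e6:.0f}M parameters")
```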

 

BERT has two stages: pre-training and fine-tuning.

Pre-training

BERT is a very large model (a 12-layer to 24-layer Transformer) and is trained on a large corpus for a long time, so pre-training is very expensive: it takes approximately four days on 4 to 16 Cloud TPUs. But don’t worry, Google has released various pre-trained BERT models, so we do not need to train the model from scratch.

Fine-tuning

Fine-tuning is inexpensive and straightforward compared to pre-training. We can use the pre-trained BERT model to create state-of-the-art models for a wide range of NLP tasks, such as question answering and language inference, without substantial task-specific architecture modifications. We just need to add a single additional output layer to the pre-trained model.

During fine-tuning, the model is first initialized with the pre-trained parameters, and then all of the parameters are fine-tuned using labelled data from the downstream task.

For fine-tuning, most model hyperparameters are the same as in pre-training, with the exception of the batch size, learning rate, and number of training epochs. The dropout probability is always kept at 0.1.
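A minimal fine-tuning sketch is shown below. It assumes the Hugging Face transformers library; the toy texts, the two labels, the learning rate, and the epoch count are illustrative placeholders rather than values from the paper. BertForSequenceClassification wraps the pre-trained encoder and adds the single output (classification) layer mentioned above, and the training loop updates all parameters:

```python
# Hedged sketch: assumes the Hugging Face transformers library; data and
# hyperparameters below are illustrative placeholders.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Pre-trained BERT encoder + one newly initialized output layer for 2 classes.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["great movie", "terrible plot"]          # toy labelled data
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # small LR, typical for fine-tuning

model.train()
for _ in range(3):                                 # a few epochs are usually enough
    outputs = model(**batch, labels=labels)        # forward pass computes the loss
    outputs.loss.backward()                        # gradients flow through all parameters
    optimizer.step()
    optimizer.zero_grad()
```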

Pre-trained Models

Google has released various pre-trained BERT models with differing numbers of layers, hidden units, and attention heads.

  • Uncased – the text is lowercased before WordPiece tokenization, and any accent markers are stripped out.
  • Cased – the true case and accent markers are preserved.
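The difference is easy to see with the tokenizers themselves. The sketch below assumes the Hugging Face transformers library and the bert-base-uncased and bert-base-cased checkpoints (not part of the original article):

```python
# Hedged sketch: contrasts the uncased and cased WordPiece tokenizers
# on the same input text.
from transformers import BertTokenizer

uncased = BertTokenizer.from_pretrained("bert-base-uncased")
cased = BertTokenizer.from_pretrained("bert-base-cased")

text = "Überraschung in Paris"
print(uncased.tokenize(text))  # text lowercased and accent markers stripped before WordPiece
print(cased.tokenize(text))    # true case and accent markers preserved
```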

 

  • BERT-Base, Uncased – 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Uncased – 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Cased – 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Cased – 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Multilingual Cased (New) – 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Multilingual Uncased (Old) – 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Chinese – Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
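Google distributes these as TensorFlow checkpoint archives; the same checkpoints are also mirrored on the Hugging Face model hub under names such as bert-base-uncased, bert-large-cased, bert-base-multilingual-cased, and bert-base-chinese. A small loading sketch (the transformers library is an assumption, not the original TensorFlow release workflow):

```python
# Hedged sketch: loads one of the released checkpoints via the
# Hugging Face transformers library.
from transformers import BertTokenizer, BertModel

checkpoint = "bert-base-multilingual-cased"   # the 104-language model listed above
tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = BertModel.from_pretrained(checkpoint)

print(model.config.num_hidden_layers)   # 12
print(model.config.hidden_size)         # 768
```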
