BERT stands for Bidirectional Encoder Representations from Transformers. It is an NLP framework introduced by researchers at Google AI. BERT is a pre-trained language representation model that obtains state-of-the-art results on a wide range of Natural Language Processing (NLP) tasks, and the pre-trained model can be fine-tuned by adding just a single output layer.
How BERT works
The model architecture of BERT is a multi-layer bidirectional Transformer encoder, which considers both left and right context in all layers.
Pre-trained representations can be either context-free or contextual, and contextual representations can further be unidirectional or bidirectional.
Word2Vec and GloVe are context-free models: they generate a single “word embedding” representation for each word in the vocabulary. So the word bank would have the same representation in river bank and bank deposits.
A contextual model, by contrast, generates the representation of a word based on the other words in the sentence.
A unidirectional contextual model contextualizes each word using only the words to its left (or only the words to its right), not both. In a bidirectional contextual model, each word is contextualized using both its left and right context.
For example, in the sentence I made a bank deposit, the unidirectional representation of the word bank is based only on its left context I made a, not on deposit.
However, the bidirectional representation considers both the left and right context. BERT is a deeply bidirectional model: it represents the word bank using both its left and right context.
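To make this concrete, here is a minimal sketch (it assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, which are not part of this article) that compares the contextual embedding BERT produces for bank in two different sentences:

```python
# Minimal sketch: BERT's embedding for "bank" depends on the sentence it
# appears in, unlike a context-free model such as Word2Vec or GloVe.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_embedding(sentence):
    # Run the sentence through BERT and pull out the hidden state of the
    # "bank" token from the last encoder layer.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return outputs.last_hidden_state[0, tokens.index("bank")]

river = bank_embedding("I sat on the river bank.")
money = bank_embedding("I made a bank deposit.")

# The cosine similarity is noticeably below 1.0: the two "bank" vectors
# differ because BERT uses the surrounding context in both directions.
print(torch.cosine_similarity(river, money, dim=0).item())
```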
The BERT model is trained on a very large corpus: English Wikipedia (2,500M words) and BooksCorpus (800M words).
BERT Architecture
There are two model sizes for BERT:
- BERT Base – 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
- BERT Large – 24 layers (transformer blocks), 16 attention heads, and 340 million parameters
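Expressed as configuration objects (a sketch assuming the Hugging Face transformers library; the parameter counts are totals that follow from the settings, not values you set directly), the two sizes look like this:

```python
# Sketch of the two published BERT sizes as Hugging Face BertConfig objects.
from transformers import BertConfig

bert_base = BertConfig(
    num_hidden_layers=12,     # 12 transformer blocks
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,   # feed-forward size; ~110M parameters in total
)

bert_large = BertConfig(
    num_hidden_layers=24,     # 24 transformer blocks
    hidden_size=1024,
    num_attention_heads=16,
    intermediate_size=4096,   # feed-forward size; ~340M parameters in total
)
```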
BERT training has two stages: pre-training and fine-tuning.
Pre-training
BERT is a very large model (12-layer to 24-layer Transformer) and is trained on a large corpus for a long period of time. Pre-training BERT is very expensive: it takes approximately four days on 4 to 16 Cloud TPUs. But don’t worry, Google has released various pre-trained BERT models, so we do not need to train the model from scratch.
Fine-tuning
Fine-tuning is inexpensive and straightforward compared to pre-training. We can use the pre-trained BERT model to create state-of-the-art models for a wide range of NLP tasks such as question answering and language inference, without substantial task-specific architecture modifications. We just need to add a single additional output layer to the pre-trained model for fine-tuning.
The fine-tuning process first initializes the model with the pre-trained parameters, and then all of the parameters are fine-tuned using labelled data from the downstream task.
For fine-tuning, most model hyperparameters are the same as in pre-training, with the exception of the batch size, learning rate, and the number of training epochs. The dropout probability was always kept at 0.1.
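Putting the pieces together, the sketch below (assuming the Hugging Face transformers library; the task, data, and hyperparameter values are illustrative, chosen from the ranges the BERT paper recommends) shows what adding a single output layer and fine-tuning all parameters looks like:

```python
# Fine-tuning sketch: a pre-trained BERT encoder plus one added output
# (classification) layer, with all parameters updated on labelled data.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g. a binary sentiment task
)

# Illustrative fine-tuning hyperparameters (the paper recommends batch
# sizes of 16/32, learning rates of 5e-5/3e-5/2e-5, and 2-4 epochs).
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# A toy labelled batch standing in for a real downstream dataset.
texts = ["I loved this movie.", "This was a terrible film."]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

model.train()
for epoch in range(3):
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)  # loss from the added output layer
    outputs.loss.backward()                  # gradients flow into every parameter
    optimizer.step()
```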
Pre-trained Model
Google has released various pre-trained BERT models with differing numbers of layers, hidden units, and attention heads.
- Uncased – the text is lowercased before WordPiece tokenization, and any accent markers are stripped out.
- Cased – the true case and accent markers are preserved (see the tokenizer sketch after the table below).
Model | Details
BERT-Base, Uncased | 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Large, Uncased | 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Base, Cased | 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Large, Cased | 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Base, Multilingual Cased (New) | 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Base, Multilingual Cased (Old) | 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Base, Chinese | Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
. . .
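As a quick illustration of the cased/uncased distinction above, here is a sketch assuming the Hugging Face transformers library, whose bert-base-uncased and bert-base-cased checkpoints correspond to the first two Base rows of the table:

```python
# Sketch: the uncased tokenizer lowercases and strips accent markers before
# WordPiece tokenization, while the cased tokenizer preserves both.
from transformers import BertTokenizer

uncased = BertTokenizer.from_pretrained("bert-base-uncased")
cased = BertTokenizer.from_pretrained("bert-base-cased")

text = "Résumé for John Smith"
print(uncased.tokenize(text))  # lowercased, accents stripped
print(cased.tokenize(text))    # true case and accent markers preserved
```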