The shortage of training data is one of the biggest challenges in Natural Language Processing (NLP). NLP is a diverse field with many distinct tasks, often spanning multilingual data, and most task-specific datasets contain only a few thousand labeled examples, which is not enough to train an accurate model.
Modern deep learning-based NLP models, however, see the biggest gains when trained on millions or billions of examples. To bridge this gap, researchers have developed methods for training general-purpose language representation models on the huge amount of unannotated text available on the web. This step is called pre-training.
These pre-trained models can then be adapted to produce state-of-the-art results on a wide range of NLP tasks, such as question answering and text classification. This second step is known as fine-tuning, and it is especially effective when we do not have a sufficient amount of task-specific training data.
BERT
BERT stands for Bidirectional Encoder Representations from Transformers. It is an NLP framework introduced by researchers at Google AI: a pre-training language representation model that obtains state-of-the-art results on various Natural Language Processing (NLP) tasks. The pre-trained BERT model can be fine-tuned for a downstream task by adding just a single output layer. You can find the academic paper on BERT here: https://arxiv.org/abs/1810.04805.
In this tutorial, you will learn how to fine-tune a BERT model with an example. You can refer to the previous BERT tutorial, which explains the architecture of the BERT model.
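To make the "single output layer" idea concrete, here is a minimal sketch of how a classification head sits on top of BERT's pooled [CLS] representation. It mirrors what the create_model function in Google's run_classifier.py does internally, but the function name classification_head and the shapes shown are our own illustration, not code you need for this tutorial:

import tensorflow as tf

def classification_head(pooled_output, num_labels):
    # pooled_output: [batch_size, hidden_size] vector for the [CLS] token
    hidden_size = pooled_output.shape[-1].value
    output_weights = tf.get_variable(
        'output_weights', [num_labels, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    output_bias = tf.get_variable(
        'output_bias', [num_labels], initializer=tf.zeros_initializer())
    # A single dense projection from the pooled output to the label logits
    logits = tf.matmul(pooled_output, output_weights, transpose_b=True) + output_bias
    return logits  # softmax + cross-entropy are applied on top during fine-tuning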
We will use the data from Kaggle's Quora Insincere Questions Classification task for the demonstration.
In [1]:
# Let's load the required packages
import pandas as pd
import numpy as np
import datetime
import zipfile
import sys
import os
Download the pre-trained BERT model, along with its weights and configuration file.
In [2]: !wget storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
Extract the downloaded model zip file.
In [3]:
repo = 'model_repo'
if not os.path.exists(repo):
    print("Dir created!")
    os.mkdir(repo)

with zipfile.ZipFile("uncased_L-12_H-768_A-12.zip", "r") as zip_ref:
    zip_ref.extractall(repo)
In [4]:
BERT_MODEL = 'uncased_L-12_H-768_A-12'
BERT_PRETRAINED_DIR = f'{repo}/uncased_L-12_H-768_A-12'
OUTPUT_DIR = f'{repo}/outputs'

if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)

print(f'***** Model output directory: {OUTPUT_DIR} *****')
print(f'***** BERT pretrained directory: {BERT_PRETRAINED_DIR} *****')

Out[4]:
***** Model output directory: model_repo/outputs *****
***** BERT pretrained directory: model_repo/uncased_L-12_H-768_A-12 *****
Prepare and Import BERT modules
The following cells clone the BERT module source code from GitHub and import the modules.
In [5]:
# Download the BERT modules
!wget raw.githubusercontent.com/google-research/bert/master/modeling.py
!wget raw.githubusercontent.com/google-research/bert/master/optimization.py
!wget raw.githubusercontent.com/google-research/bert/master/run_classifier.py
!wget raw.githubusercontent.com/google-research/bert/master/tokenization.py
!wget raw.githubusercontent.com/google-research/bert/master/run_classifier_with_tfhub.py
In [6]:
# Import BERT modules
import modeling
import optimization
import run_classifier
import tokenization
import tensorflow as tf
import run_classifier_with_tfhub
Prepare the training data
Here, we will fine-tune the BERT model on a small fraction of the training data.
In [7]:
from sklearn.model_selection import train_test_split

train_df = pd.read_csv('input/train.csv')
train_df = train_df.sample(2000)  # Train on a sample of 2,000 examples

train, val = train_test_split(train_df, test_size=0.1, random_state=42)

train_lines, train_labels = train.question_text.values, train.target.values
val_lines, val_labels = val.question_text.values, val.target.values

label_list = ['0', '1']
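Before converting this sample into BERT's input format, it is worth a quick look at the class balance; in the Quora Insincere Questions data, insincere questions (target = 1) are a small minority, so the 2,000-row sample will be quite imbalanced. A quick check you could run (the exact counts will vary with the random sample):

# Inspect the label balance of the 2,000-question sample (counts vary per sample)
print(train_df.target.value_counts())
# Rough idea of question lengths, useful for sanity-checking MAX_SEQ_LENGTH later
print(train_df.question_text.str.len().describe())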
In [8]:
def create_examples(lines, set_type, labels=None):
    guid = f'{set_type}'
    examples = []
    if guid == 'train':
        for line, label in zip(lines, labels):
            text_a = line
            label = str(label)
            examples.append(
                run_classifier.InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    else:
        for line in lines:
            text_a = line
            label = '0'
            examples.append(
                run_classifier.InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples
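As a quick illustration of what this helper returns (the question below is made up for the example), each element is a run_classifier.InputExample that simply holds an id, the raw text, and a string label:

# Hypothetical usage of create_examples on a single made-up question
demo = create_examples(['Why is the sky blue?'], 'train', labels=[0])
print(demo[0].guid, demo[0].label)   # -> train 0
print(demo[0].text_a)                # -> Why is the sky blue?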
Specify the BERT pre-trained model.
Here, the uncased_L-12_H-768_A-12 model is used. It consists of 12 layers, 768 hidden units, 12 attention heads, and 110M parameters. It is an uncased model, which means the text is lowercased before tokenization.
In [9]:
BERT_MODEL = 'uncased_L-12_H-768_A-12'
BERT_MODEL_HUB = 'https://tfhub.dev/google/bert_' + BERT_MODEL + '/1'
Initialize model hyperparameters.
In [10]:
TRAIN_BATCH_SIZE = 32
EVAL_BATCH_SIZE = 8
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 3.0
WARMUP_PROPORTION = 0.1
MAX_SEQ_LENGTH = 128

# Model Configuration
SAVE_CHECKPOINTS_STEPS = 1000
ITERATIONS_PER_LOOP = 1000
NUM_TPU_CORES = 8

VOCAB_FILE = os.path.join(BERT_PRETRAINED_DIR, 'vocab.txt')
CONFIG_FILE = os.path.join(BERT_PRETRAINED_DIR, 'bert_config.json')
INIT_CHECKPOINT = os.path.join(BERT_PRETRAINED_DIR, 'bert_model.ckpt')
DO_LOWER_CASE = BERT_MODEL.startswith('uncased')

tpu_cluster_resolver = None  # Model trained on GPU, we won't need a cluster resolver

def get_run_config(output_dir):
    return tf.contrib.tpu.RunConfig(
        cluster=tpu_cluster_resolver,
        model_dir=output_dir,
        save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
        tpu_config=tf.contrib.tpu.TPUConfig(
            iterations_per_loop=ITERATIONS_PER_LOOP,
            num_shards=NUM_TPU_CORES,
            per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))
Load tokenizer module
Note: When using a Cased model, pass do_lower_case=False.
In [11]:
tokenizer = tokenization.FullTokenizer(vocab_file=VOCAB_FILE, do_lower_case=DO_LOWER_CASE)

train_examples = create_examples(train_lines, 'train', labels=train_labels)

# Compute number of train and warmup steps from batch size
num_train_steps = int(len(train_examples) / TRAIN_BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)
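Because we loaded the uncased model, the tokenizer lowercases the text before splitting it into WordPiece tokens. A quick sanity check you might run at this point (the sentence is made up, and the exact sub-word split depends on the vocabulary file):

# Sanity check: the uncased tokenizer lowercases and splits text into WordPiece tokens
sample_tokens = tokenizer.tokenize('Is BERT fine-tuning really this easy?')
print(sample_tokens)
print(tokenizer.convert_tokens_to_ids(sample_tokens))  # vocabulary ids fed to the model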
Fine-tune on a pre-trained BERT Model from TF Hub
This section illustrates fine-tuning a pre-trained BERT model loaded from a TensorFlow Hub module.
In [12]:
model_fn = run_classifier_with_tfhub.model_fn_builder(
    num_labels=len(label_list),
    learning_rate=LEARNING_RATE,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=False,
    bert_hub_module_handle=BERT_MODEL_HUB
)

estimator_from_tfhub = tf.contrib.tpu.TPUEstimator(
    use_tpu=False,  # If False, training falls back on CPU or GPU
    model_fn=model_fn,
    config=get_run_config(OUTPUT_DIR),
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE,
)
In [13]:
# Train the model
def model_train(estimator):
    print('Please wait...')
    train_features = run_classifier.convert_examples_to_features(
        train_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
    print('***** Started training at {} *****'.format(datetime.datetime.now()))
    print(' Num examples = {}'.format(len(train_examples)))
    print(' Batch size = {}'.format(TRAIN_BATCH_SIZE))
    tf.logging.info(" Num steps = %d", num_train_steps)
    train_input_fn = run_classifier.input_fn_builder(
        features=train_features,
        seq_length=MAX_SEQ_LENGTH,
        is_training=True,
        drop_remainder=True)
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
    print('***** Finished training at {} *****'.format(datetime.datetime.now()))
In [14]: model_train(estimator_from_tfhub)
In [15]:
# Evaluate the model
def model_eval(estimator):
    eval_examples = create_examples(val_lines, 'test')
    eval_features = run_classifier.convert_examples_to_features(
        eval_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
    print('***** Started evaluation at {} *****'.format(datetime.datetime.now()))
    print(' Num examples = {}'.format(len(eval_examples)))
    print(' Batch size = {}'.format(EVAL_BATCH_SIZE))
    eval_steps = int(len(eval_examples) / EVAL_BATCH_SIZE)
    eval_input_fn = run_classifier.input_fn_builder(
        features=eval_features,
        seq_length=MAX_SEQ_LENGTH,
        is_training=False,
        drop_remainder=True)
    result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
    print('***** Finished evaluation at {} *****'.format(datetime.datetime.now()))
    print("***** Eval results *****")
    for key in sorted(result.keys()):
        print(' {} = {}'.format(key, str(result[key])))
In [16]: model_eval(estimator_from_tfhub)
Fine-tune on a pre-trained BERT model from checkpoints
You can also load the pre-trained BERT model from saved checkpoints.
In [17]:
CONFIG_FILE = os.path.join(BERT_PRETRAINED_DIR, 'bert_config.json')
INIT_CHECKPOINT = os.path.join(BERT_PRETRAINED_DIR, 'bert_model.ckpt')

OUTPUT_DIR = f'{repo}/outputs_checkpoints'
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)

model_fn = run_classifier.model_fn_builder(
    bert_config=modeling.BertConfig.from_json_file(CONFIG_FILE),
    num_labels=len(label_list),
    init_checkpoint=INIT_CHECKPOINT,
    learning_rate=LEARNING_RATE,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=False,  # If False, training will fall back on CPU or GPU
    use_one_hot_embeddings=True)

estimator_from_checkpoints = tf.contrib.tpu.TPUEstimator(
    use_tpu=False,
    model_fn=model_fn,
    config=get_run_config(OUTPUT_DIR),
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE)
In [18]:
# Train the Model
model_train(estimator_from_checkpoints)
In [19]:
# Evaluate the Model
model_eval(estimator_from_checkpoints)
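After fine-tuning, you will usually want predictions on new questions. The cells above only train and evaluate, but a prediction pass can be assembled from the same building blocks. The sketch below is untested, and the names new_questions, predict_estimator, and predict_input_fn are our own; it reuses the model_fn, get_run_config, and OUTPUT_DIR from the checkpoint section, and passes predict_batch_size so the input function receives a batch size in prediction mode:

# Sketch (untested): score new questions with the fine-tuned checkpoint model
predict_estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=False,
    model_fn=model_fn,                  # the checkpoint-based model_fn from In [17]
    config=get_run_config(OUTPUT_DIR),  # same model_dir, so the tuned weights are loaded
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE,
    predict_batch_size=EVAL_BATCH_SIZE)

new_questions = ['Why do people ask insincere questions on Quora?']
predict_examples = create_examples(new_questions, 'test')
predict_features = run_classifier.convert_examples_to_features(
    predict_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
predict_input_fn = run_classifier.input_fn_builder(
    features=predict_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=False,
    drop_remainder=False)

for prediction in predict_estimator.predict(input_fn=predict_input_fn):
    print(prediction['probabilities'])  # probabilities for label_list = ['0', '1']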