Model Quantization Methods in TensorFlow Lite

Edge hardware such as mobile phones, embedded boards, and IoT devices often has limited computational resources and usually cannot execute a full TensorFlow model efficiently. We therefore need to optimize the model, reducing its size so that it can run quickly on these devices.

TensorFlow Lite provides one of the most popular model optimization techniques, called quantization. Quantization reduces the precision of the model's parameters, such as weights and activation outputs, to 8-bit integers. By default, all weight parameters are 32-bit floating-point numbers, so quantization greatly reduces the model size because 8-bit integers occupy less memory than 32-bit floats. These 8-bit representations are less precise, however, so quantization can cause a small degradation in model accuracy.

Quantization can take place during model training or after it. Quantization applied during training is called quantization-aware training, while quantization applied to an already trained model is called post-training quantization. This tutorial covers post-training quantization with examples.

TensorFlow Lite offers several degrees of post-training quantization. The following table summarizes the post-training quantization options available in TensorFlow Lite.

 

Technique | Data requirements | Size reduction | Accuracy | Supported hardware
Float16 quantization | No data | Up to 50% | Insignificant accuracy loss | CPU, GPU
Dynamic range quantization | No data | Up to 75% | Accuracy loss | CPU, GPU (Android)
Integer quantization | Unlabelled representative sample | Up to 75% | Smaller accuracy loss | CPU, GPU (Android), EdgeTPU, Hexagon DSP

 

The following sections explain each post-training quantization technique with an example and compare the techniques with one another.

To quantize a model, we first need a trained TensorFlow model, so let's train a simple CNN on the CIFAR-10 image dataset from scratch. We will then compare the original TensorFlow model with the converted, quantized models.

Generate a TensorFlow Model

import tensorflow as tf

from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt
import numpy as np

tf.__version__
'2.3.1'
# Load cifar10 dataset
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

# Define the model architecture
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))

# Train the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=10,
                    validation_data=(test_images, test_labels))
Epoch 1/10
1563/1563 [==============================] - 26s 16ms/step - loss: 1.5000 - accuracy: 0.4564 - val_loss: 1.2685 - val_accuracy: 0.5540
Epoch 2/10
1563/1563 [==============================] - 25s 16ms/step - loss: 1.1422 - accuracy: 0.5965 - val_loss: 1.0859 - val_accuracy: 0.6181
Epoch 3/10
1563/1563 [==============================] - 27s 17ms/step - loss: 0.9902 - accuracy: 0.6524 - val_loss: 0.9618 - val_accuracy: 0.6665
Epoch 4/10
1563/1563 [==============================] - 29s 18ms/step - loss: 0.9006 - accuracy: 0.6874 - val_loss: 0.9606 - val_accuracy: 0.6630
Epoch 5/10
1563/1563 [==============================] - 33s 21ms/step - loss: 0.8308 - accuracy: 0.7086 - val_loss: 0.8756 - val_accuracy: 0.6956
Epoch 6/10
1563/1563 [==============================] - 32s 20ms/step - loss: 0.7692 - accuracy: 0.7314 - val_loss: 0.8990 - val_accuracy: 0.6929
Epoch 7/10
1563/1563 [==============================] - 26s 17ms/step - loss: 0.7223 - accuracy: 0.7485 - val_loss: 0.8578 - val_accuracy: 0.7034
Epoch 8/10
1563/1563 [==============================] - 27s 17ms/step - loss: 0.6782 - accuracy: 0.7612 - val_loss: 0.8746 - val_accuracy: 0.7109
Epoch 9/10
1563/1563 [==============================] - 29s 19ms/step - loss: 0.6415 - accuracy: 0.7742 - val_loss: 0.8381 - val_accuracy: 0.7262
Epoch 10/10
1563/1563 [==============================] - 25s 16ms/step - loss: 0.6086 - accuracy: 0.7870 - val_loss: 0.8420 - val_accuracy: 0.7215
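
Before converting the model, it is worth recording the baseline test accuracy of the float32 Keras model so the quantized variants have a reference point. A minimal check using the test set loaded above:

# Evaluate the original float32 Keras model on the test set
baseline_loss, baseline_accuracy = model.evaluate(test_images, test_labels, verbose=0)
print("Baseline float32 test accuracy:", baseline_accuracy)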

Convert to a TensorFlow Lite model

Let's convert the trained TensorFlow model to the TensorFlow Lite format using the TFLiteConverter API, first without quantization and then with the different quantization options applied.

No quantization

First, let's convert the trained model to TFLite with no quantization. Here, all of the model's parameters are stored as 32-bit float values.

# Convert the model
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# save the model
open("tflite_model.tflite", "wb").write(tflite_model)
493624

The generated TFLite model is approximately 494 KB (493,624 bytes).
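
To compare accuracy as well as size, we can run a converted model with the tf.lite.Interpreter API. The helper below is a minimal sketch (the function name evaluate_tflite_model and the 1,000-image subset are choices made here for illustration, not part of the TFLite API):

def evaluate_tflite_model(tflite_path, images, labels):
    # Load the TFLite model and allocate its tensors
    interpreter = tf.lite.Interpreter(model_path=tflite_path)
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()[0]
    output_details = interpreter.get_output_details()[0]

    correct = 0
    for image, label in zip(images, labels):
        # Add a batch dimension and match the model's float32 input type
        input_data = np.expand_dims(image, axis=0).astype(np.float32)
        interpreter.set_tensor(input_details['index'], input_data)
        interpreter.invoke()
        output = interpreter.get_tensor(output_details['index'])
        if np.argmax(output[0]) == label[0]:
            correct += 1
    return correct / len(images)

# Evaluate on a subset of the test set to keep inference time short
print("TFLite float32 accuracy:",
      evaluate_tflite_model("tflite_model.tflite", test_images[:1000], test_labels[:1000]))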

Float16 quantization

Float16 quantization reduces the model size by quantizing the model's weight parameters to 16-bit floating-point (float16) numbers, with minimal impact on accuracy and latency. This technique cuts the model size roughly in half.

Let's apply float16 quantization of the weights while converting the model to TensorFlow Lite. First, set the optimizations flag to the default optimizations, which quantize all fixed parameters such as weights. Then specify float16 as a supported type for the target platform.

Note that, by default, the converted model still treats the input and output as float data types. This quantization method only quantizes the weight parameters; activations are still stored in floating point.

converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Set the optimization mode 
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Specify float16 as a supported type on the target platform
converter.target_spec.supported_types = [tf.float16]

# Convert and Save the model
tflite_model = converter.convert()
open("converted_model.tflite", "wb").write(tflite_model)
249936

Here, we can observe that the float16-quantized model (about 250 KB) is roughly half the size of the float32 TFLite model generated above.
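
One way to confirm that the weights are now stored as float16 while the input and output stay float32 is to inspect the converted model's tensor details (a quick sketch; the exact tensor list depends on the model):

interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()

# Input and output should still be float32
print("Input dtype:", interpreter.get_input_details()[0]['dtype'])
print("Output dtype:", interpreter.get_output_details()[0]['dtype'])

# The set of tensor dtypes should now include float16 for the weight tensors
print("Tensor dtypes:", {str(t['dtype']) for t in interpreter.get_tensor_details()})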

Dynamic range quantization

Post-training dynamic range quantization converts the model's weights to 8-bit precision during conversion from a TensorFlow graph to the TensorFlow Lite format, enabling roughly a 4x reduction in model size.

The model's activation outputs are always stored in floating point. In dynamic range quantization, the weight parameters are quantized after training, while the activations are quantized dynamically at inference time.

To apply dynamic range quantization, simply set the optimizations flag so that all fixed parameters, such as weights, are quantized.

converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Set the optimization mode 
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Convert and Save the model
tflite_model = converter.convert()
open("converted_model.tflite", "wb").write(tflite_model)
131472

The TFLite model generated with dynamic range quantization (about 131 KB) is approximately a quarter of the size of the original float32 model.
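
As a quick sanity check (a sketch, not required for conversion), we can verify that the weight tensors are now int8 while the model's input and output tensors remain float32:

from collections import Counter

interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()

print("Input dtype:", interpreter.get_input_details()[0]['dtype'])    # expected: float32
print("Output dtype:", interpreter.get_output_details()[0]['dtype'])  # expected: float32

# Count tensors per dtype; the large weight tensors should show up as int8
print(Counter(str(t['dtype']) for t in interpreter.get_tensor_details()))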

Integer quantization

Microcontrollers and accelerators such as the Edge TPU perform integer-only operations, so the TFLite models generated above are not compatible with integer-only hardware. To execute the model on such hardware, we need to quantize all model parameters as well as the input and output tensors to integers.

Post-training integer quantization is an optimization technique that converts both the model's weights and its activation outputs from 32-bit floating-point numbers to the nearest 8-bit fixed-point numbers. It also quantizes the model's input and output data. This leads to a smaller model and faster inference, which makes it the most suitable option for deploying TensorFlow models on low-powered devices such as microcontrollers.

This is sometimes called full integer quantization, since it converts all model parameters, including weights and activations, to 8-bit integers.
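
Under the hood, TFLite maps real values to 8-bit integers with an affine scheme, real_value ≈ (q - zero_point) * scale, where scale and zero_point are chosen per tensor (or per channel for weights). The sketch below is illustrative only and is not part of the conversion API:

# Illustrative affine quantization: real_value ≈ (q - zero_point) * scale
def quantize_to_int8(real_values, scale, zero_point):
    q = np.round(real_values / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_from_int8(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
q = quantize_to_int8(x, scale=1/128, zero_point=0)   # example scale for values in [-1, 1]
print(q, dequantize_from_int8(q, 1/128, 0))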

To quantize the variable data such as a model’s input/output and intermediates between layers, we need to provide a RepresentativeDataset by supplying a set of input data in a generator function. This enables the converter to estimate a dynamic range for all the variable data.

For compatibility with integer-only hardware, the input and output tensors must also be integers. By default, the TensorFlow Lite converter leaves the model's input and output tensors as floats, so we explicitly set them to uint8 below.

Let’s apply a full integer quantization technique to convert a model into TFLite:

def representative_data_gen():
    # A small sample of training images is enough to calibrate the dynamic ranges.
    # Cast to float32 so the calibration data matches the model's float32 input type.
    data = tf.data.Dataset.from_tensor_slices(train_images.astype(np.float32)).batch(1).take(100)
    for input_value in data:
        yield [input_value]

converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Set the optimization mode 
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Pass representative dataset to the converter
converter.representative_dataset = representative_data_gen

# Restricting supported target op specification to INT8
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

# Set the input and output tensors to uint8 
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

# Convert and Save the model
tflite_model = converter.convert()
open("converted_model.tflite", "wb").write(tflite_model)
133216

Here, we can see that the fully integer-quantized model (about 133 KB) is again approximately a quarter of the original float32 model's size.
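
Because the input and output tensors are now uint8, running inference requires quantizing the input with the scale and zero point reported by the interpreter. Below is a minimal sketch that classifies a single test image (the output scores stay uint8, so argmax can be applied directly):

interpreter = tf.lite.Interpreter(model_path="converted_model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Quantize a float32 test image into the uint8 range expected by the model
scale, zero_point = input_details['quantization']
image = test_images[0].astype(np.float32)
quantized_image = np.clip(np.round(image / scale + zero_point), 0, 255).astype(np.uint8)

interpreter.set_tensor(input_details['index'], np.expand_dims(quantized_image, axis=0))
interpreter.invoke()
output = interpreter.get_tensor(output_details['index'])[0]

print("Predicted:", class_names[np.argmax(output)], "- actual:", class_names[test_labels[0][0]])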
