Edge hardware such as mobile phones, embedded devices, and IoT devices often has limited computational resources. Under these constraints, a full TensorFlow model may be too large or too slow to run. We need to optimize the model to reduce its size so that it can run quickly on these devices.

One of the most popular model optimization techniques that TensorFlow Lite provides is quantization. Quantization reduces the precision of a model’s parameters, such as weights and activation outputs, typically to 8-bit integers. By default, all weight parameters are 32-bit floating-point numbers. Because an 8-bit integer occupies a quarter of the memory of a 32-bit floating-point number, quantization greatly reduces the model size. The 8-bit representation is less precise, however, so it usually causes a small degradation in model accuracy.
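To make the idea concrete, here is a minimal pure-Python sketch of affine (scale plus zero-point) quantization to 8-bit integers. The function names and sample weights are made up for illustration, and TensorFlow Lite's own kernels differ in their details; the point is only to show how a float range is mapped onto 256 integer levels and how much precision is lost on the way back.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers with an affine (scale + zero-point) scheme."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # -128..127 for 8 bits
    # The represented range must include 0.0 so that zero maps to an exact integer.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float values from the quantized integers."""
    return [(x - zero_point) * scale for x in q]


weights = [-0.51, -0.2, 0.0, 0.13, 0.76]  # illustrative values only
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print("scale:", scale, "zero_point:", zero_point)
print("max round-trip error:", max_error)
```

Each stored value shrinks from four bytes to one, and in this example the round-trip error stays within half of one quantization step (`scale / 2`).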

Quantization can take place during model training or after it. Quantization applied during training is called **quantization-aware training**, and quantization applied after training is called **post-training quantization**. This tutorial covers post-training quantization with examples.

TensorFlow Lite provides several levels of post-training quantization. The following table lists the post-training quantization options available in TensorFlow Lite.

Technique | Data requirements | Size reduction | Accuracy | Supported hardware |
---|---|---|---|---|
Float16 quantization | No data | Up to 50% | Insignificant accuracy loss | CPU, GPU |
Dynamic range quantization | No data | Up to 75% | Accuracy loss | CPU, GPU (Android) |
Integer quantization | Unlabelled representative sample | Up to 75% | Smaller accuracy loss | CPU, GPU (Android), EdgeTPU, Hexagon DSP |

The following sections explain each post-training quantization technique with an example and compare the techniques with one another.

To quantize a model, we first need a trained TensorFlow model. So, let’s train a simple CNN on the CIFAR-10 image dataset from scratch. We will then compare the accuracy of the original TensorFlow model with that of the converted, quantized models.

**Generate a TensorFlow Model**

    import tensorflow as tf
    from tensorflow.keras import datasets, layers, models
    import matplotlib.pyplot as plt
    import numpy as np

    tf.__version__

'2.3.1'

    # Load cifar10 dataset
    (train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

    # Normalize pixel values to be between 0 and 1
    train_images, test_images = train_images / 255.0, test_images / 255.0

    class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                   'dog', 'frog', 'horse', 'ship', 'truck']

    # Define the model architecture
    model = models.Sequential()
    model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(10))

    # Train the model
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])

    history = model.fit(train_images, train_labels, epochs=10,
                        validation_data=(test_images, test_labels))

    Epoch 1/10
    1563/1563 [==============================] - 26s 16ms/step - loss: 1.5000 - accuracy: 0.4564 - val_loss: 1.2685 - val_accuracy: 0.5540
    Epoch 2/10
    1563/1563 [==============================] - 25s 16ms/step - loss: 1.1422 - accuracy: 0.5965 - val_loss: 1.0859 - val_accuracy: 0.6181
    Epoch 3/10
    1563/1563 [==============================] - 27s 17ms/step - loss: 0.9902 - accuracy: 0.6524 - val_loss: 0.9618 - val_accuracy: 0.6665
    Epoch 4/10
    1563/1563 [==============================] - 29s 18ms/step - loss: 0.9006 - accuracy: 0.6874 - val_loss: 0.9606 - val_accuracy: 0.6630
    Epoch 5/10
    1563/1563 [==============================] - 33s 21ms/step - loss: 0.8308 - accuracy: 0.7086 - val_loss: 0.8756 - val_accuracy: 0.6956
    Epoch 6/10
    1563/1563 [==============================] - 32s 20ms/step - loss: 0.7692 - accuracy: 0.7314 - val_loss: 0.8990 - val_accuracy: 0.6929
    Epoch 7/10
    1563/1563 [==============================] - 26s 17ms/step - loss: 0.7223 - accuracy: 0.7485 - val_loss: 0.8578 - val_accuracy: 0.7034
    Epoch 8/10
    1563/1563 [==============================] - 27s 17ms/step - loss: 0.6782 - accuracy: 0.7612 - val_loss: 0.8746 - val_accuracy: 0.7109
    Epoch 9/10
    1563/1563 [==============================] - 29s 19ms/step - loss: 0.6415 - accuracy: 0.7742 - val_loss: 0.8381 - val_accuracy: 0.7262
    Epoch 10/10
    1563/1563 [==============================] - 25s 16ms/step - loss: 0.6086 - accuracy: 0.7870 - val_loss: 0.8420 - val_accuracy: 0.7215

**Convert to a TensorFlow Lite model**

Let’s convert the trained TensorFlow model to the TensorFlow Lite format using the **TFLiteConverter** API, applying quantization along the way.

**No quantization**

First, let’s convert the trained model to TFLite with no quantization. Here, all of the model’s parameters remain 32-bit floating-point values.

    # Convert the model
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    tflite_model = converter.convert()

    # Save the model (write() returns the number of bytes written)
    open("tflite_model.tflite", "wb").write(tflite_model)

493624

The generated TFLite model is approximately 494 KB (493,624 bytes).
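As a rough sanity check, the file size can be estimated from the number of trainable parameters in the CNN defined earlier, at four bytes per float32 value. The helper functions below are illustrative, not part of any TensorFlow API, and the calculation assumes the Keras defaults used in the model definition (no padding, so each 3x3 convolution shrinks the feature map by two pixels).

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units


# Spatial size through the network: 32 -> conv 30 -> pool 15 -> conv 13 -> pool 6 -> conv 4
flattened = 4 * 4 * 64

total_params = (
    conv2d_params(3, 3, 3, 32)
    + conv2d_params(3, 3, 32, 64)
    + conv2d_params(3, 3, 64, 64)
    + dense_params(flattened, 64)
    + dense_params(64, 10)
)

float32_bytes = total_params * 4   # 4 bytes per float32 parameter
reported_bytes = 493624            # file size reported by write() above

print(total_params)                    # 122570
print(float32_bytes)                   # 490280
print(reported_bytes - float32_bytes)  # 3344
```

The parameters alone account for nearly all of the file. The remaining few kilobytes are presumably taken up by the graph structure and metadata stored alongside the weights.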

**Float16 quantization**

Float16 quantization reduces the model size by converting the model’s weight parameters to 16-bit floating-point numbers (float16), with minimal impact on accuracy and latency. This technique cuts the model size roughly in half.
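The following sketch illustrates why the accuracy impact is small. It uses Python's standard `struct` module to round a few sample values to half precision and back; the weight values are arbitrary and the helper name `to_float16` is made up for this example.

```python
import struct


def to_float16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Storage per value: 4 bytes for float32 versus 2 bytes for float16
print(struct.calcsize("<f"), struct.calcsize("<e"))

weights = [0.1234567, -0.0004321, 1.5, 3.14159265]  # illustrative values only
for w in weights:
    h = to_float16(w)
    print(f"{w:+.7f} -> {h:+.7f}  (relative error {abs(w - h) / abs(w):.2e})")
```

For values in float16's normal range, the relative rounding error is bounded by 2^-11 (about 0.05%), which is typically well below the noise that a trained network already tolerates.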

Let’s apply float16 quantization to the weights while converting the model to TensorFlow Lite. First, set the **optimizations** flag to use the default optimizations, which quantize all fixed parameters such as weights. Then specify float16 as the supported type on the target platform.

Note that, by default, the converted model still takes its input and produces its output as floats. This method quantizes only the weight parameters; activations are still stored in floating-point.

    converter = tf.lite.TFLiteConverter.from_keras_model(model)

    # Set the optimization mode
    converter.optimizations = [tf.lite.Optimize.DEFAULT]

    # Set float16 as the supported type on the target platform
    converter.target_spec.supported_types = [tf.float16]

    # Convert and save the model
    tflite_model = converter.convert()
    open("converted_model.tflite", "wb").write(tflite_model)

249936

Here, we can observe that the float16-quantized model is approximately half the size of the unquantized model.

**Dynamic range quantization**

Post-training dynamic range quantization converts the model weights to 8-bit precision during conversion from a TensorFlow model to the TensorFlow Lite format. It enables up to a 4x reduction in model size.

With dynamic range quantization, the weight parameters are quantized after training, and activations are quantized dynamically at inference time. The model’s activation outputs are still stored in floating-point.
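The split between "weights quantized ahead of time" and "activations quantized at inference" can be sketched in a few lines of pure Python. This is a simplified illustration with made-up values and function names, using a single symmetric scale per tensor; it is not how TensorFlow Lite's kernels are actually implemented.

```python
def symmetric_quantize(values):
    """Quantize floats to int8 with a symmetric scale (zero-point fixed at 0)."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127
    return [max(-127, min(127, round(v / scale))) for v in values], scale


# Done once, at conversion time: weights are stored as int8 plus one float scale.
weights = [0.42, -0.17, 0.08, -0.63]  # illustrative values only
q_weights, w_scale = symmetric_quantize(weights)


def dynamic_range_dot(activations):
    """Dot product with activations quantized on the fly from their observed range."""
    q_acts, a_scale = symmetric_quantize(activations)        # done on every call
    acc = sum(qa * qw for qa, qw in zip(q_acts, q_weights))  # integer arithmetic
    return acc * a_scale * w_scale                           # result returned as float


activations = [1.5, -0.3, 2.2, 0.9]
exact = sum(a * w for a, w in zip(activations, weights))
approx = dynamic_range_dot(activations)
print("float result:    ", exact)
print("quantized result:", approx)
```

Because the activation scale is computed from the values seen at run time, no calibration data is needed during conversion, which matches the "No data" entry in the table above.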

To quantize the model using dynamic range quantization, set the **optimizations** flag to the default optimizations, which quantize all fixed parameters such as weights.

    converter = tf.lite.TFLiteConverter.from_keras_model(model)

    # Set the optimization mode
    converter.optimizations = [tf.lite.Optimize.DEFAULT]

    # Convert and save the model
    tflite_model = converter.convert()
    open("converted_model.tflite", "wb").write(tflite_model)

131472

The TFLite model generated with dynamic range quantization is approximately **1/4** the size of the unquantized model.

**Integer quantization**

Microcontrollers and accelerators such as the Edge TPU perform integer-only operations, so the TFLite models generated above are not compatible with integer-only hardware. To run a TensorFlow model on such hardware, we need to quantize all model parameters, as well as the input and output tensors, to integers.

Post-training integer quantization is an optimization technique that converts both the model’s weights and its activation outputs from 32-bit floating-point numbers to the nearest 8-bit fixed-point numbers. It can also quantize the model’s input and output data. This results in a smaller model and faster inference, which makes it well suited to deploying TensorFlow models on low-powered devices such as microcontrollers.

It is sometimes called full integer quantization because it converts all model parameters, weights and activations alike, into 8-bit integers.

To quantize variable data such as the model’s input/output and the intermediate tensors between layers, we need to provide a **RepresentativeDataset**: a generator function that yields a set of sample inputs. This enables the converter to estimate the dynamic range of all the variable data.
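The calibration idea can be sketched as follows. The generator below is a hypothetical stand-in for a representative dataset (random numbers rather than real activations), and simple min/max tracking is an assumption made for illustration; the converter's own calibration logic may be more elaborate.

```python
import random


def representative_batches(num_batches=100, batch_size=8, seed=0):
    """Hypothetical stand-in for a representative dataset: batches of float values."""
    rng = random.Random(seed)
    for _ in range(num_batches):
        yield [rng.gauss(0.5, 1.0) for _ in range(batch_size)]


def calibrate(batches, qmin=-128, qmax=127):
    """Track the observed min/max over all batches, then derive int8 parameters."""
    lo, hi = 0.0, 0.0  # start at 0 so the range always contains zero
    for batch in batches:
        lo, hi = min(lo, min(batch)), max(hi, max(batch))
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    return lo, hi, scale, zero_point


lo, hi, scale, zero_point = calibrate(representative_batches())
print(f"observed range [{lo:.3f}, {hi:.3f}] -> scale={scale:.5f}, zero_point={zero_point}")
```

The more closely the sample inputs resemble real inference data, the better the estimated range fits, which is why the samples are usually drawn from the training or validation set.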

For compatibility with integer-only hardware, the input and output tensors must also be integers. By default, the TensorFlow Lite converter keeps the model’s input and output tensors as floats, so we set them to an integer type explicitly.
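One consequence of integer input and output tensors is that the calling code has to convert its own data to and from integers. A minimal sketch of that conversion is shown below. The scale and zero-point values here are hypothetical; for a real model they would be read from the converted model, for example from the `quantization` entry returned by `tf.lite.Interpreter.get_input_details()` and `get_output_details()`.

```python
def quantize_input(pixels, scale, zero_point):
    """Convert float inputs (already normalized to 0..1) to uint8 values."""
    return [max(0, min(255, round(p / scale) + zero_point)) for p in pixels]


def dequantize_output(raw, scale, zero_point):
    """Convert uint8 outputs back to approximate float values."""
    return [(r - zero_point) * scale for r in raw]


# Hypothetical parameters, chosen only for illustration.
input_scale, input_zero_point = 1 / 255, 0
output_scale, output_zero_point = 0.1, 128

pixels = [0.0, 0.5, 1.0]
print(quantize_input(pixels, input_scale, input_zero_point))

raw_outputs = [118, 128, 160]
print(dequantize_output(raw_outputs, output_scale, output_zero_point))
```

For classification, the dequantization step can often be skipped, since the index of the largest output is the same before and after applying a positive scale.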

Let’s apply a full integer quantization technique to convert a model into TFLite:

    def representative_data_gen():
        # Feed 100 training images, one per batch, so the converter can calibrate ranges
        data = tf.data.Dataset.from_tensor_slices(train_images).batch(1).take(100)
        for input_value in data:
            # Cast to float32 to match the model's input type
            # (the normalized images are float64)
            yield [tf.cast(input_value, tf.float32)]

    converter = tf.lite.TFLiteConverter.from_keras_model(model)

    # Set the optimization mode
    converter.optimizations = [tf.lite.Optimize.DEFAULT]

    # Pass the representative dataset to the converter
    converter.representative_dataset = representative_data_gen

    # Restrict the supported target op specification to INT8
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

    # Set the input and output tensors to uint8
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.uint8

    # Convert and save the model
    tflite_model = converter.convert()
    open("converted_model.tflite", "wb").write(tflite_model)

133216

Here, we can see that the fully integer-quantized model is also approximately **1/4** the size of the unquantized model.
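To put the four results side by side, the short script below computes each model's size relative to the unquantized baseline. The byte counts are the ones reported by the `write()` calls in this tutorial; the dictionary keys are just labels chosen for this summary.

```python
sizes_in_bytes = {
    "no quantization": 493624,
    "float16": 249936,
    "dynamic range": 131472,
    "full integer": 133216,
}

baseline = sizes_in_bytes["no quantization"]
ratios = {name: size / baseline for name, size in sizes_in_bytes.items()}

for name, size in sizes_in_bytes.items():
    print(f"{name:<16} {size:>7} bytes  {ratios[name]:6.1%} of baseline")
```

The float16 model comes out at roughly 51% of the baseline and both 8-bit variants at roughly 27%, slightly above the ideal 50% and 25% because some data in the file is not reduced by quantization.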