The learning rate is one of the most important hyperparameters to tune for a neural network to achieve good performance. It determines the step size taken at each training iteration while moving toward a minimum of the loss function.

In the back-propagation method, the weight and bias parameters are updated using a gradient descent optimization algorithm, which computes the gradient of the loss function with respect to those parameters.

The learning rate scales how much the weight and bias parameters are changed on each update. The update rule for a weight **W** and bias **b** with learning rate **α** is:

**W = W − α · ∂L/∂W**, **b = b − α · ∂L/∂b**

Here, the gradient term **∂L/∂W** is the partial derivative of the loss function with respect to the weight parameter. It describes the rate of change of the error with respect to changes in the weight parameter.
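To make the update rule concrete, here is a minimal sketch applied to a hypothetical one-dimensional loss (the loss function and the values below are illustrative, not from the text):

```
# Hypothetical example: loss L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3)
def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0       # initial weight
alpha = 0.1   # learning rate (alpha)

for _ in range(50):
    w = w - alpha * gradient(w)   # W = W - alpha * dL/dW

# w has converged close to the minimum at w = 3.0
```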

While updating the weight parameters, it is better to use a fraction of the gradient value instead of the full amount.

A learning rate that is too high allows the model to learn faster, but it may overshoot the minimum because the weights are updated too rapidly.

A small learning rate allows the model to learn slowly and carefully, making smaller changes to the weights on each update. However, a learning rate that is too small may cause training to get stuck in an undesirable local minimum. Therefore, the learning rate should be neither too large nor too small.
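A hypothetical numeric sketch makes this trade-off concrete (the quadratic loss and the rates below are illustrative, not from the text):

```
# Toy loss L(w) = w^2 with gradient 2w; the minimum is at w = 0
def run(alpha, steps=20, w=1.0):
    for _ in range(steps):
        w = w - alpha * 2.0 * w   # gradient descent: w = w - alpha * dL/dw
    return w

too_large = run(1.1)    # each update flips and grows w, so it diverges
too_small = run(0.001)  # after 20 steps, w has barely moved from 1.0
moderate = run(0.3)     # converges close to the minimum
```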

The best learning rate depends on your neural network architecture as well as your training dataset, and it is tricky to find. In practice, it is better to use a dynamic learning rate instead of a constant one.

Several methods exist that help find a good learning rate. In this tutorial, you will learn how to configure the learning rate when training a neural network.

**Keras’ LearningRateScheduler callback**

The Keras deep learning library provides the LearningRateScheduler callback, which adjusts the learning rate at each epoch according to a function you specify.

You need to define a custom Python function that takes the epoch number and the current learning rate as input and returns the new learning rate as output.

**Example:**

Let’s define a function for a custom learning rate schedule. This function uses a learning rate of **α = 0.01** for the first five training epochs and decreases it exponentially after that.

```
import tensorflow as tf

def custom_LearningRate_schedular(epoch, lr):
    if epoch < 5:
        return 0.01
    else:
        return 0.01 * tf.math.exp(0.1 * (5 - epoch))
```

You can pass this function to Keras’ LearningRateScheduler callback:

```
from keras.callbacks import LearningRateScheduler

callback = LearningRateScheduler(custom_LearningRate_schedular)
model.fit(X, Y, epochs=100, callbacks=[callback], validation_data=(val_X, val_Y))
```

**Keras’ ReduceLROnPlateau callback**

Keras provides another callback, **ReduceLROnPlateau**, that also manages the learning rate. It reduces the learning rate when a monitored metric has stopped improving.

Models often benefit from reducing the learning rate once learning stagnates. This callback monitors the specified metric, and if no improvement is seen for a ‘patience’ number of epochs, the learning rate is reduced.

**Example**

The callback below reduces the learning rate by a factor of 0.2 if the validation loss has not improved for 3 epochs.

```
from keras.callbacks import ReduceLROnPlateau
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3, min_lr=0.001)
model.fit(X_train, Y_train, callbacks=[reduce_lr])
```

**Adaptive Learning Rate**

Keras also provides extensions of classical stochastic gradient descent that support adaptive learning rates, such as **Adagrad**, **Adadelta**, **RMSprop**, and **Adam**.

Adagrad is an optimizer with parameter-specific learning rates, which are adapted relative to how frequently a parameter gets updated during training. The more updates a parameter receives, the smaller the learning rate.
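As a rough sketch of the idea (illustrative NumPy code, not Keras’ actual implementation):

```
import numpy as np

# Each parameter accumulates the sum of its own squared gradients;
# a larger accumulated sum gives that parameter a smaller effective step.
def adagrad_step(w, grad, accum, lr=0.01, epsilon=1e-8):
    accum = accum + grad ** 2
    w = w - lr * grad / (np.sqrt(accum) + epsilon)
    return w, accum

w = np.array([1.0, 1.0])
accum = np.zeros_like(w)
for _ in range(10):
    grad = np.array([1.0, 0.1])   # first parameter receives larger gradients
    w, accum = adagrad_step(w, grad, accum)

# The first parameter's accumulated sum is larger, so its effective
# learning rate lr / sqrt(accum) is smaller.
```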

Adadelta is a more robust extension of Adagrad that adapts learning rates based on a moving window of gradient updates, instead of accumulating all past gradients.
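A minimal sketch of the Adadelta update rule looks like this (illustrative code, not Keras’ actual implementation; the hyperparameter values are assumptions):

```
import numpy as np

# Decaying averages of squared gradients and squared updates replace
# Adagrad's ever-growing sum, so the step size does not vanish over time.
def adadelta_step(w, grad, eg2, edx2, rho=0.95, epsilon=1e-6):
    eg2 = rho * eg2 + (1 - rho) * grad ** 2            # running avg of g^2
    delta = -np.sqrt(edx2 + epsilon) / np.sqrt(eg2 + epsilon) * grad
    edx2 = rho * edx2 + (1 - rho) * delta ** 2         # running avg of step^2
    return w + delta, eg2, edx2

# One step on L(w) = w^2 (gradient 2w), starting from w = 1.0
w, eg2, edx2 = adadelta_step(1.0, 2.0, 0.0, 0.0)
```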

```
from keras.optimizers import Adagrad, Adadelta, RMSprop, Adam

opt = Adagrad(lr=0.01, epsilon=1e-08, decay=0.0)
opt = Adadelta(lr=1.0, rho=0.95, epsilon=1e-08, decay=0.0)
opt = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)
opt = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)

model.compile(..., optimizer=opt)
```

It is recommended to leave the parameters of these optimizers at their default values except the learning rate, which can be freely tuned.

**. . .**