A video can be viewed as a sequence of images with two components to consider: a spatial component (the pixels of each frame, i.e. its resolution) and a temporal component (how the frames change over time). Video processing can therefore be treated as image processing extended with temporal features.
Below are a few video classification techniques:
- CNN and RNN as two separate models
- CNN and RNN as one model
- 3D CNN
CNN and RNN as two separate models
As the name suggests, this technique uses two different models for video classification:
- CNN model for feature extraction
- RNN model for interpreting features
How does it work?
- The video is fed to the CNN model one frame at a time.
- Each frame is passed through the CNN model, and the output of its final pooling layer is saved in memory. If a pre-trained model is used, its last output layer has to be removed first.
- This gives us one feature vector per frame. These vectors will be the input to our second model, but they need some processing first.
- The feature vectors are grouped into sequences. The sequence length depends on how many frames the user wants the model to consider at a time.
- Now, these sequences become the input of the RNN model.
- We can test multiple RNN models on these sequences to get a better result.
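The grouping step above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: `make_sequences`, `seq_len`, and `stride` are hypothetical names, and the feature vectors are dummy values standing in for real CNN output.

```python
def make_sequences(frame_features, seq_len, stride=None):
    """Group per-frame feature vectors into fixed-length sequences.

    frame_features: list of feature vectors, one per video frame.
    seq_len: number of frames per sequence (chosen by the user).
    stride: step between sequence starts; defaults to seq_len (no overlap).
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    stride = seq_len if stride is None else stride
    if stride <= 0:
        raise ValueError("stride must be positive")
    sequences = []
    # Trailing frames that cannot fill a whole sequence are dropped.
    for start in range(0, len(frame_features) - seq_len + 1, stride):
        sequences.append(frame_features[start:start + seq_len])
    return sequences


# Dummy stand-in for CNN output: 10 frames, each a 5-dimensional vector.
frame_features = [[float(i)] * 5 for i in range(10)]

sequences = make_sequences(frame_features, seq_len=4, stride=3)
print(len(sequences), len(sequences[0]), len(sequences[0][0]))  # 3 4 5
```

Each sequence has the shape (seq_len, feature_dim) and can be passed to the RNN as one sample. A stride smaller than `seq_len` produces overlapping sequences, which yields more training samples from the same video.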
Advantages
- With the help of the RNN, we can capture the temporal information (i.e. the time element) of videos.
- Multiple RNN models can be tested on the same stored feature vectors.
- Pre-trained CNN weights can be reused for feature extraction.
Disadvantages
- It may need extra memory to store feature vectors.
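To get a feel for how much extra memory this can mean, the storage cost can be estimated with simple arithmetic. The numbers below are assumptions chosen for illustration only (1,000 clips of 300 frames, 2048-dimensional features stored as 32-bit floats), and `feature_store_bytes` is a hypothetical helper.

```python
def feature_store_bytes(num_frames, feature_dim, bytes_per_value=4):
    """Bytes needed to keep one feature vector per frame (float32 by default)."""
    return num_frames * feature_dim * bytes_per_value


# Assumed dataset: 1,000 clips x 300 frames, 2048-dimensional features.
total_bytes = feature_store_bytes(num_frames=1000 * 300, feature_dim=2048)
print(f"{total_bytes / 1024**3:.2f} GiB")  # 2.29 GiB
```

The cost grows linearly with the number of frames and with the feature dimension, so sampling fewer frames per clip is the simplest way to reduce it.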
CNN AND RNN AS ONE MODEL (CRNN)
In this technique, two models are used as sub-parts of a single model:
- CNN model for feature extraction from video frames
- RNN model for interpreting these features
Process
- Every frame of the video is passed through the CNN model, which produces a feature vector. A pre-trained CNN model can also be used, in which case its last output layer has to be removed.
- Here the CNN model works as the encoder.
- After flattening the feature vector, it is fed into the RNN model, which produces the final output of the network.
- Here the RNN model works as the decoder.
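The encoder/decoder flow described above can be sketched as a toy forward pass in plain Python. Everything here is an assumption made for illustration: a single random linear projection stands in for the CNN encoder, the decoder is a vanilla RNN with untrained random weights, and all names and sizes are hypothetical. A real model would use a deep learning framework and trained weights.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def encode_frame(frame, w_enc):
    """Stand-in for the CNN encoder: flatten the frame, then project it."""
    flat = [pixel for row in frame for pixel in row]
    return [math.tanh(v) for v in matvec(w_enc, flat)]


def rnn_decode(features, w_in, w_rec, w_out):
    """Vanilla RNN decoder: update the hidden state once per frame,
    then compute one score per class from the final hidden state."""
    hidden = [0.0] * len(w_rec)
    for feat in features:
        from_input = matvec(w_in, feat)
        from_hidden = matvec(w_rec, hidden)
        hidden = [math.tanh(a + b) for a, b in zip(from_input, from_hidden)]
    return matvec(w_out, hidden)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
HEIGHT, WIDTH, FEAT, HIDDEN, CLASSES = 4, 4, 8, 6, 3  # toy sizes
w_enc = random_matrix(FEAT, HEIGHT * WIDTH, rng)
w_in = random_matrix(HIDDEN, FEAT, rng)
w_rec = random_matrix(HIDDEN, HIDDEN, rng)
w_out = random_matrix(CLASSES, HIDDEN, rng)


def classify(video):
    """Run the whole model: encode every frame, then decode the sequence."""
    features = [encode_frame(frame, w_enc) for frame in video]
    return rnn_decode(features, w_in, w_rec, w_out)


def random_clip(num_frames):
    return [[[rng.random() for _ in range(WIDTH)] for _ in range(HEIGHT)]
            for _ in range(num_frames)]


short_clip, long_clip = random_clip(5), random_clip(12)
print(len(classify(short_clip)), len(classify(long_clip)))  # 3 3
```

Because the RNN consumes one feature vector per step, the same weights handle the 5-frame and the 12-frame clip and return one score per class in both cases.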
Advantages
- This method can handle sequences of varying lengths.
- It can therefore adapt easily to different tasks such as image captioning and video summarization.
Disadvantages
- More computational power is required, since both models (CNN and RNN) have to be computed back to back.
- The RNN model can have a hard time learning temporal features (the time element of the video).
USING 3D CONVOLUTIONAL NETWORK
- With the help of a 3D Convolutional Network, the model can learn temporal features (time element) in a better way.
- A 3D CNN is just like a 2D CNN, but it also operates along the temporal dimension.
- In a 2D CNN we use filters of size height x width, whereas in a 3D CNN we use filters of size height x width x depth, where the depth spans consecutive frames. The filters are swept horizontally, vertically, and along the depth (time) axis, and the same applies to the pooling operation.
- In a 3D Convolutional Network, filtering and pooling operations are performed spatio-temporally, while in a 2D Convolutional Network they are performed only spatially.
- So using a 3D Convolutional Network we can work with many video classification problems.
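To make the sweep over height, width, and depth concrete, here is a deliberately naive single-channel 3D convolution in plain Python. It is a sketch under simplifying assumptions (one channel, stride 1, no padding, no bias), and `conv3d_valid` is a hypothetical name; real frameworks provide optimized multi-channel versions.

```python
def conv3d_valid(video, kernel):
    """Naive 3D convolution (computed as cross-correlation, as deep learning
    libraries usually do) with stride 1 and no padding.

    video:  nested list indexed as [frame][row][column]
    kernel: nested list indexed as [depth][height][width]
    """
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    k_depth, k_height, k_width = len(kernel), len(kernel[0]), len(kernel[0][0])
    output = []
    for t in range(frames - k_depth + 1):          # sweep along time
        plane = []
        for y in range(height - k_height + 1):     # sweep vertically
            row = []
            for x in range(width - k_width + 1):   # sweep horizontally
                total = 0.0
                for dt in range(k_depth):
                    for dy in range(k_height):
                        for dx in range(k_width):
                            total += (video[t + dt][y + dy][x + dx]
                                      * kernel[dt][dy][dx])
                row.append(total)
            plane.append(row)
        output.append(plane)
    return output


# Toy clip: 5 frames of 6x6 pixels, where every pixel equals its frame index.
video = [[[float(t) for _ in range(6)] for _ in range(6)] for t in range(5)]
# 3x3x3 averaging kernel: each output mixes 3 consecutive frames.
kernel = [[[1.0 / 27] * 3 for _ in range(3)] for _ in range(3)]

out = conv3d_valid(video, kernel)
print(len(out), len(out[0]), len(out[0][0]))  # 3 4 4
```

The output shrinks along the time axis as well as along height and width (5 frames become 3), which shows that the filter slides through time. With this averaging kernel, each output value is the mean of three consecutive frames, so motion between frames directly affects the result.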
Advantages
- Compared to a 2D CNN, a 3D CNN can extract temporal information directly.
- We can perform the same operations (convolution and pooling) as in a 2D CNN.
- Given sufficient computational resources, this architecture can help to process videos in real time.
Disadvantages
- 3D CNNs often use more GPU memory compared to traditional 2D CNNs.
- A 3D CNN has more parameters than a comparable 2D CNN, which may lead to overfitting.
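The difference in parameter count follows directly from the extra filter dimension. The sketch below compares a single convolutional layer in both variants. The layer sizes (64 input channels, 128 output channels, 3x3 and 3x3x3 filters, one bias per output channel) are assumptions picked for illustration, and the function names are hypothetical.

```python
def conv2d_params(in_channels, out_channels, k):
    """Weights of a k x k filter per input channel, plus one bias per output."""
    return out_channels * (in_channels * k * k + 1)


def conv3d_params(in_channels, out_channels, k, k_depth):
    """Same as above, with the filter extended by k_depth along time."""
    return out_channels * (in_channels * k_depth * k * k + 1)


print(conv2d_params(64, 128, k=3))             # 73856
print(conv3d_params(64, 128, k=3, k_depth=3))  # 221312
```

For this layer, the 3D version has roughly three times as many parameters, because every filter is three frames deep. The factor grows with the temporal depth of the filter.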