UIJRT » United International Journal for Research & Technology

Comparative Study for Neural Image Caption Generation Using Different Transfer Learning Along with Diverse Beam Search & Bi-Directional RNN



Indulkar, Y. and Patil, A., 2021. Comparative Study for Neural Image Caption Generation Using Different Transfer Learning Along with Diverse Beam Search & Bi-Directional RNN. United International Journal for Research & Technology (UIJRT), 2(9), pp.01-09.


Captions are important for understanding the meaning of an image and for representing it in the best way possible. Image captioning requires a precise and apt description of an image based on the features present in it. Deep learning has advanced to the point where the power of the computer can be used to tag an image with the caption intended for it, and computer vision provides an ideal approach to extracting and understanding the features in an image for subsequent captioning. This research compares different transfer learning models, Xception, InceptionV3, and VGG16, used for feature extraction from an image as part of caption generation. For generating captions from these features, this paper proposes an alternate RNN model (ARNN) that uses a bi-directional layer; it is compared against the standard RNN model to select the best model, along with the best transfer learning model, for neural image caption generation. To produce an apt caption from the extracted features and the RNN model, the diverse beam search algorithm is used to obtain the top-k alternative sequences with the highest probability, which produces better captions than argmax decoding. Each model combination, Xception-RNN, InceptionV3-RNN, InceptionV3-ARNN, VGG16-RNN, and finally VGG16-ARNN, was evaluated using BLEU (Bilingual Evaluation Understudy) along with the training and validation loss.

Keywords: Beam Search, Bi-directional RNN, Caption Generation, Natural Language Processing, Transfer Learning.
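The diverse beam search decoding described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `step_fn`, the toy vocabulary, and all probabilities here are hypothetical, and a real captioner would derive the next-token distribution from the image features and the RNN state.

```python
import math

def diverse_beam_search(step_fn, start, steps, groups=2, beams_per_group=2, diversity=1.0):
    """Minimal diverse beam search sketch (after Vijayakumar et al.).

    step_fn(seq) -> {token: log_probability} for the next token.
    Beams are split into groups; at each time step a group penalizes
    tokens already chosen by earlier groups, so the final candidates
    are diverse rather than near-duplicates of the single top beam.
    """
    group_beams = [[(0.0, [start])] for _ in range(groups)]
    for _ in range(steps):
        for g in range(groups):
            # Last tokens picked by earlier groups at this step get penalized.
            used = {seq[-1] for h in range(g) for _, seq in group_beams[h]}
            candidates = []
            for score, seq in group_beams[g]:
                for tok, lp in step_fn(seq).items():
                    penalty = diversity if tok in used else 0.0
                    candidates.append((score + lp - penalty, seq + [tok]))
            candidates.sort(key=lambda c: c[0], reverse=True)
            group_beams[g] = candidates[:beams_per_group]
    merged = [b for group in group_beams for b in group]
    return sorted(merged, key=lambda b: b[0], reverse=True)

def toy_step(seq):
    # Hypothetical next-token distribution (history-independent for brevity).
    return {tok: math.log(p) for tok, p in {"dog": 0.5, "cat": 0.3, "bird": 0.2}.items()}

beams = diverse_beam_search(toy_step, "<start>", steps=2)
```

Greedy argmax would only ever emit "dog" here; the diversity penalty pushes the second beam group onto alternatives such as "bird", yielding k distinct candidates to choose from.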
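BLEU, used above for evaluation, combines clipped n-gram precisions with a brevity penalty. Below is a minimal single-reference sketch; real evaluations typically use a library implementation such as NLTK's `corpus_bleu`, which supports multiple references and smoothing.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=2):
    """Minimal BLEU sketch: clipped n-gram precision plus brevity penalty.

    Simplified to one reference and max_n-gram order 2 for illustration.
    """
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "a dog runs on the grass".split()
candidate = "a dog runs on grass".split()
score = bleu(candidate, reference)  # high, but penalized for the missing word
```

A perfect match scores 1.0; dropping "the" costs a little bigram precision and incurs the brevity penalty.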


