Captions help convey the meaning of an image and describe its content. Image captioning requires a precise and apt description of an image based on the features present in it. Deep learning has advanced to the point where a computer can automatically generate a caption for an image. Computer vision provides feature extraction techniques for identifying the features present in an image, which can then be used for captioning. This research examines different transfer learning models for feature extraction from an image, combined with caption generation. The transfer learning models used for feature extraction are Xception, InceptionV3, and VGG16. To generate captions from the extracted features, this paper proposes an alternate RNN model (ARNN) that uses a bi-directional layer. It is compared with the standard RNN model to select the best caption model and the best transfer learning model for neural image caption generation. To produce the caption from the extracted features and the RNN model, the diverse beam search algorithm is used to retain the top k candidates with the highest probability, which produces a better caption than argmax (greedy) decoding. Each combination, namely Xception-RNN, InceptionV3-RNN, InceptionV3-ARNN, VGG16-RNN, and VGG16-ARNN, was evaluated using BLEU (Bilingual Evaluation Understudy) along with the training and validation loss.
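
To illustrate why retaining the top k candidates can yield a more probable caption than argmax decoding, the following is a minimal Python sketch. It implements plain beam search, not the diverse variant used in this work, and it is not the paper's implementation. The names `TOY_MODEL`, `next_token_probs`, `greedy_decode`, and `beam_search`, as well as the probabilities, are hypothetical and stand in for a trained caption model.

```python
import math

END = "<end>"

# Hypothetical toy "caption model": maps a prefix of tokens to
# next-token probabilities. A real system would query the RNN decoder.
TOY_MODEL = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"dog": 0.5, "cat": 0.5},
    ("the",): {"dog": 0.9, "cat": 0.1},
}


def next_token_probs(prefix):
    """Return next-token probabilities; unknown prefixes end the caption."""
    return TOY_MODEL.get(tuple(prefix), {END: 1.0})


def greedy_decode(max_len=5):
    """Argmax decoding: always take the single most probable next token."""
    seq, logp = [], 0.0
    for _ in range(max_len):
        probs = next_token_probs(seq)
        token = max(probs, key=probs.get)
        logp += math.log(probs[token])
        if token == END:
            break
        seq.append(token)
    return seq, logp


def beam_search(k=2, max_len=5):
    """Plain beam search: keep the k highest-scoring partial captions."""
    beams = [([], 0.0, False)]  # (tokens, log-probability, finished?)
    for _ in range(max_len):
        candidates = []
        for seq, logp, done in beams:
            if done:
                candidates.append((seq, logp, True))
                continue
            for token, p in next_token_probs(seq).items():
                if token == END:
                    candidates.append((seq, logp + math.log(p), True))
                else:
                    candidates.append((seq + [token], logp + math.log(p), False))
        # Keep only the k best candidates by cumulative log-probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(done for _, _, done in beams):
            break
    best_seq, best_logp, _ = beams[0]
    return best_seq, best_logp


if __name__ == "__main__":
    print("greedy:", greedy_decode())
    print("beam k=2:", beam_search(k=2))
```

In this toy setting, greedy decoding commits to the locally best first token and cannot recover, whereas a beam of width 2 keeps the alternative prefix alive and finds a caption with a higher overall probability. With `k=1`, the beam search reduces to greedy decoding.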