Caption generation is the challenging neural network problem of producing a human-readable textual description of a given photograph. It draws on both computer vision and natural language processing. Every day we encounter a large number of images on social media, and most of them carry no description, so viewers have to interpret them on their own. Automatic image captioning addresses this gap. For example, platforms such as Facebook and Twitter could generate descriptions directly from images, covering what we are wearing, where we are (e.g., beach, cafe), and what we are doing there. Generating a caption automatically first requires image understanding: the system must detect and recognize objects, and it must also capture their properties, their interactions with other objects, and the scene type or location. Producing a well-formed sentence then requires both semantic and syntactic understanding of the language. Deep learning based techniques learn features automatically from training data and can handle large collections of images and videos. In this work, a CNN will be used for image classification, and RNN encoders and decoders will be used for text generation, that is, for producing the caption of the provided image. LSTM-based language models will also be implemented for both sentiment analysis and caption generation.
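The encoder-decoder flow described above can be illustrated with a toy sketch in plain Python. This is not a trained model and not the implementation used in this work: `encode_image`, `next_word_scores`, `greedy_caption`, the tiny vocabulary, and the hand-written score table are all hypothetical stand-ins, chosen only to show how an image representation is turned into a caption one word at a time until an end token is produced.

```python
# Toy sketch of the encoder-decoder captioning loop (not a trained model).
# All function names, the vocabulary, and the score table are hypothetical.

START, END = "<start>", "<end>"


def encode_image(pixels):
    """Stand-in for a CNN encoder: reduce an image to a small feature record.

    A real encoder would be a convolutional network. Here we only compute
    the mean brightness and derive a coarse 'scene' label from it.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return {"mean": mean, "scene": "beach" if mean > 0.5 else "cafe"}


def next_word_scores(features, previous_word):
    """Stand-in for one RNN/LSTM decoder step.

    Returns a score per candidate next word, conditioned on the image
    features and the previous word. A real decoder would also carry a
    hidden state from step to step.
    """
    table = {
        START: {"a": 0.9, "person": 0.1},
        "a": {"person": 0.8, "a": 0.2},
        "person": {"at": 0.7, END: 0.3},
        "at": {"the": 0.9, END: 0.1},
        "the": {features["scene"]: 0.9, END: 0.1},
        features["scene"]: {END: 1.0},
    }
    return table[previous_word]


def greedy_caption(pixels, max_len=10):
    """Greedy decoding: pick the highest-scoring word at every step."""
    features = encode_image(pixels)
    words = []
    previous = START
    for _ in range(max_len):
        scores = next_word_scores(features, previous)
        previous = max(scores, key=scores.get)
        if previous == END:
            break
        words.append(previous)
    return " ".join(words)


if __name__ == "__main__":
    bright = [[0.9, 0.8], [0.7, 0.6]]  # hypothetical "bright" image
    dark = [[0.1, 0.2], [0.0, 0.3]]    # hypothetical "dark" image
    print(greedy_caption(bright))  # a person at the beach
    print(greedy_caption(dark))    # a person at the cafe
```

In a real system the two stand-ins would be replaced by a CNN feature extractor and a trained LSTM decoder; the loop itself (start token, repeated next-word prediction, stop at the end token or a length limit) keeps the same shape. Greedy selection is the simplest decoding choice and is used here only for illustration.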