
CNN-RNN based Image-Text Interactions for Image Captioning

1Gurmeet Chahal, 2A. Ranjith Kumar, 3Ravi Prakash Modanwal, 4Ankit Kumar, 5Govind Singh Rautela

2Assistant Professor

1,2,3,4,5Computer Science and Engineering, Lovely Professional University, Phagwara, India

Abstract

This paper presents a novel method of training neural networks that explores guidance from natural-language text. Dual attention mechanisms are used to facilitate the interaction between visual data and semantic information, which supports the CNN in distilling visual features under the control of semantic features. The proposed text-image embedding approach realizes asynchronous training and inference behaviour: the trained model classifies images irrespective of whether text is available. Scalability is improved by extending the model to multimodal vision tasks, and the attention maps support interpretable decision making. Semantic information is used effectively to improve CNN accuracy; consequently, the accuracy of caption-based image retrieval and multi-label image classification also improves.

Keywords: Convolutional Neural Network, Dual attention mechanism, Semantic

information, Text-Image caption.

1. Introduction

Most images are not accompanied by a description, yet humans can identify their contents without one. For a computer to generate image captions automatically, it must interpret the image in some caption-like form. Image captioning is important for a variety of purposes: it can be used to automate image indexing, which is essential for Content-Based Image Retrieval (CBIR) and is applied in fields such as biomedicine, commerce, the military, education, digital libraries, and web searching. To support interactions in a multimodal system, two kinds of attention mechanisms (a dual attention mechanism [4]) are proposed to facilitate the interaction between semantic (text) and visual (image) information.

Convolutional Neural Networks (CNNs) have steadily advanced standard vision problems [1] such as image classification [2], semantic segmentation [3], and object detection. CNNs are trained to classify images from large sets of human-provided image labels. They have proved effective at discovering representations and at generalizing, but the training and testing behaviour of a CNN remains difficult to justify and interpret [9].

In human learning, by contrast, teachers use natural language to help learners understand images through observation.

2. Related Works

Image classification using CNNs is the foundation of machine vision tasks. Recently, one key research focus for convolutional networks has been to investigate their interpretability, in order to explain their internal operations and analyse their representations. Interpretable machine prediction has many benefits. Related work includes visualizing unit patterns of deep representations [6] and diagnosing network predictions. These methods explore the representations of already-trained networks in a purely engineering style, trusting that CNNs learn correct patterns from large-scale data, which, in general, is successful. For example, one line of work combines neural networks with logical rules to enhance model interpretability, and [7] modifies and trains interpretable CNNs in which particular filters are active at specific semantic object parts. A strategy for teaching CNNs to disentangle objects from their surroundings was described in [13].

The attention mechanism is a key trend in deep learning; it builds an auxiliary module that further distils visual information from convolutional feature maps and provides a degree of interpretability by revealing "where a network looks". For example, the attention mechanism enables Recurrent Neural Networks (RNNs) to selectively use past memories when making decisions [8]. In computer vision it is widely used in image captioning [5] and Visual Question Answering (VQA) [14], where it helps neural networks capture multi-level context to make specific predictions. Moreover, the idea of co-attention or dual attention (as in TandemNet) has also been used in VQA, although with different motivations. In addition, the proposed technique aims to control and regularize the convolutions so as to distil visual information by incorporating the human interpretation present in image-related captions.

3. Proposed model: CNN-RNN based Image – Text interaction

A novel method is proposed to train CNNs with the guidance of text, with asynchronous training and testing behaviour that facilitates image captioning whether or not text is available. Two approaches are presented, TandemNet and TandemNet2, which exploit the interaction between visual and semantic knowledge for visual data distillation. TandemNet2 adds two techniques: modality transfer and semantic attention.

a. CNN for Image encoding

Convolutional neural networks are the established tool in computer vision for learning image representations. The residual network (ResNet) and its variants are widely used to support broad vision tasks such as object detection and image description. Using a pre-trained CNN is standard practice, and the same network can be shared across tasks to represent the image; here it is used for image encoding and for producing feature maps. Given an RGB image of dimension H×W×3 (height, width, and three channels), the CNN extracts feature maps from the last convolutional/pooling layer, yielding an input image representation denoted V ∈ R^{C×G}. Here G is the size of the feature map and C is the number of channels, both of which depend on the CNN architecture and the size of the input image. For example, with ResNet101 and an image size of 224×224, G = 7×7 and C = 2048. Figure 1 shows the block diagram of the proposed model.
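A minimal sketch of this encoding step, assuming a torchvision ResNet101 backbone and a 224×224 input (the backbone weights and preprocessing are illustrative, not the authors' specific pipeline):

```python
import torch
import torchvision

# Minimal sketch: extract the C x G feature map V described above with a
# pre-trained ResNet101. The torchvision backbone and the 224x224 input size
# are illustrative assumptions.
backbone = torchvision.models.resnet101(weights="IMAGENET1K_V1")
# Keep everything up to the last convolutional block, dropping the global
# average pool and the classification head.
encoder = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()

image = torch.randn(1, 3, 224, 224)      # one RGB image, H x W = 224 x 224
with torch.no_grad():
    fmap = encoder(image)                # shape: (1, 2048, 7, 7)

C = fmap.shape[1]                        # C = 2048 channels
V = fmap.flatten(2).squeeze(0)           # V in R^{C x G}, with G = 7*7 = 49
print(V.shape)                           # torch.Size([2048, 49])
```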


Figure 1: Workflow of the proposed method

b. Text Encoding using RNN

RNNs [10] are widely applied to sequence-to-sequence processing applications [11]. The RNN processes each element of the sentence (a word or a character) at a particular time step, and the process is repeated recurrently to encode the entire sentence. Parallel fusion with Long Short-Term Memory (LSTM) [15] helps alleviate some key problems of recurrent networks and adapts the structure of the semantic information.

LSTM adds a gating design that captures long-term dependencies and mitigates the vanishing-gradient problems that occur in plain RNNs. Given a sequence of words x_1, x_2, ..., x_N, the LSTM reads one word at a time while maintaining a memory state m_t and a hidden state h_t ∈ R^D. The LSTM is updated at each time step as given in equation (1).

h_t, m_t = LSTM(x_t, h_{t-1}, m_{t-1})        (1)

Here x_t is the embedding of the input word at time t: each word is initially encoded as a one-hot vector (of length equal to the vocabulary size) and multiplied by a learned word-embedding matrix.
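A minimal sketch of this recurrence in PyTorch, where the vocabulary size, embedding width, and hidden size D are illustrative placeholders rather than values from the paper:

```python
import torch
import torch.nn as nn

# Minimal sketch of the text-encoding step in equation (1): each word index is
# mapped through a learned embedding matrix and fed to an LSTM one step at a time.
vocab_size, embed_dim, D = 10_000, 256, 512

embedding = nn.Embedding(vocab_size, embed_dim)   # learned word-embedding matrix
lstm_cell = nn.LSTMCell(embed_dim, D)             # computes (h_t, m_t) from (x_t, h_{t-1}, m_{t-1})

word_ids = torch.randint(0, vocab_size, (1, 12))  # one sentence of 12 word indices
h = torch.zeros(1, D)                             # hidden state h_t
m = torch.zeros(1, D)                             # memory (cell) state m_t
for t in range(word_ids.shape[1]):
    x_t = embedding(word_ids[:, t])               # x_t: embedded input word
    h, m = lstm_cell(x_t, (h, m))                 # equation (1)

S_f = h   # final hidden state used here as the sentence encoding (an assumption)
```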


c. Image-Text Interaction

The central contribution of the proposed approach is the interaction computation unit, which is driven by visual-semantic interactions. The interaction I between the visual features V and the semantic features S is given in equation (2).

I = f_att(V, S),   where S = S_f if text is available,
                         S = S_u if text is unavailable        (2)

Here ρ denotes the label likelihood. A method for controlling the degree of coupling between the visual and semantic information is proposed alongside the visual-semantic interaction module, so that dedicated and repeated embeddings can be avoided. Generating a "simulated" text encoding even when no text is given is what achieves the asynchronous train/test behaviour. Thus the semantic features are produced by the text encoder when text is available (w/ text), or by the automatic encoding S_u when text is unavailable (w/o text).
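A minimal sketch of the switching behaviour in equation (2), assuming pooled feature vectors and a simple linear layer as a stand-in for the paper's attention-based f_att (the class, attribute names, and dimensions are illustrative):

```python
from typing import Optional

import torch
import torch.nn as nn

# Minimal sketch of equation (2): use the text encoding S_f when a caption is
# available and a learned "simulated" encoding S_u otherwise.
class ImageTextInteraction(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.S_u = nn.Parameter(torch.zeros(1, dim))  # learned encoding used when text is unavailable
        self.f_att = nn.Linear(2 * dim, dim)          # placeholder for the interaction function f_att

    def forward(self, V: torch.Tensor, S_f: Optional[torch.Tensor]) -> torch.Tensor:
        # V: (batch, dim) pooled visual features; S_f: (batch, dim) text features or None
        S = S_f if S_f is not None else self.S_u.expand(V.shape[0], -1)
        return self.f_att(torch.cat([V, S], dim=-1))  # I = f_att(V, S)
```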

d. TandemNet and TandemNet2

TandemNet is built around the dual attention mechanism, which aims to establish an interaction between the images and the text. Dual attention provides a generic network that generates attention over both inputs and preserves the useful information in each. This strongly supports information distillation and thereby improves network performance.

The image representation V and the text representation S are each embedded by a 1×1 convolutional layer into a common dimension M, followed by batch normalization. The dual-attention model then generates attention over the significant image regions and, simultaneously, over the relevant parts of the sentence. The attention function f_att, which computes a piecewise weight vector α, is given in equation (3).


α = f_att(V, S),   α_i = exp(e_i) / Σ_j exp(e_j)        (3)
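A minimal sketch of such a dual-attention layer, assuming a single joint softmax over image regions and sentence positions; the 1×1 convolutions and batch normalization follow the description above, while the scoring network and class names are illustrative rather than the exact TandemNet design:

```python
import torch
import torch.nn as nn

# Minimal sketch of the dual attention in equation (3): image region features V
# and word features S are embedded to a common size M by 1x1 convolutions with
# batch norm, scored, and softmax-normalized into weights alpha.
class DualAttention(nn.Module):
    def __init__(self, c_img: int, c_txt: int, m: int):
        super().__init__()
        self.embed_v = nn.Sequential(nn.Conv1d(c_img, m, 1), nn.BatchNorm1d(m))
        self.embed_s = nn.Sequential(nn.Conv1d(c_txt, m, 1), nn.BatchNorm1d(m))
        self.score = nn.Linear(m, 1)

    def forward(self, V, S):
        # V: (batch, c_img, G) image regions; S: (batch, c_txt, T) word features
        v = self.embed_v(V).transpose(1, 2)            # (batch, G, M)
        s = self.embed_s(S).transpose(1, 2)            # (batch, T, M)
        ctx = torch.cat([v, s], dim=1)                 # joint image + text positions
        e = self.score(torch.tanh(ctx)).squeeze(-1)    # scores e_i
        alpha = torch.softmax(e, dim=1)                # alpha_i = exp(e_i) / sum_j exp(e_j)
        return (alpha.unsqueeze(-1) * ctx).sum(dim=1)  # attention-weighted summary
```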

TandemNet2 models the multimodal relation to build the image-text interactions. Its scalability is improved through a modality-transfer design that extends it to several kinds of vision tasks, such as caption-based image retrieval and object localization. The input consists of a query vector q, a matrix K of key vectors, and a matrix Q of value vectors. The attention module is given in equation (4).

Att(q, K, Q) = softmax(qK^T / √d_q) Q        (4)

Here d_q denotes the dimension of the query q. The output is a weighted average of Q, i.e. a context vector: softmax(qK^T / √d_q) produces the weight vector that decides which values in Q are significant. In this way the image content described in the text is captured latently by the automatic text encoding. The attention transfer function is given in equation (5),

h_T = Att(φ(V); T)        (5)
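A minimal sketch of the attention module in equation (4), keeping the paper's naming in which K holds the keys and Q holds the values; the tensor shapes are illustrative assumptions:

```python
import math

import torch

# Scaled dot-product attention as in equation (4): softmax(q K^T / sqrt(d_q)) Q.
def attention(q: torch.Tensor, K: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    d_q = q.shape[-1]
    weights = torch.softmax(q @ K.transpose(-2, -1) / math.sqrt(d_q), dim=-1)
    return weights @ Q                      # weighted average of the values

# Example: one query of size 64 attending over 49 key/value vectors.
q = torch.randn(1, 64)
K = torch.randn(49, 64)
Q_values = torch.randn(49, 64)
context = attention(q, K, Q_values)         # shape (1, 64)
```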

4. Results and Discussion

TandemNet2's stochastic modality-transfer function provides a parametric similarity measure between an input image and a text (the negative mean squared error). This similarity measure can be used to perform caption-based image retrieval, a common way to evaluate image-text embedding mechanisms.

The experimental setup follows standard caption-image retrieval: a subset of 1000 image-caption pairs is selected at random, five times, from COCO.
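A minimal sketch of this evaluation protocol, assuming caption and image embeddings of equal size and the negative-MSE similarity described above; the embedding dimension and the recall@k metric are illustrative choices, not the paper's exact protocol:

```python
import torch

# Rank images for each caption by negative mean squared error and report
# recall@k, where caption i is assumed to match image i.
def neg_mse_similarity(captions: torch.Tensor, images: torch.Tensor) -> torch.Tensor:
    diff = captions.unsqueeze(1) - images.unsqueeze(0)   # (N, N, d) pairwise differences
    return -(diff ** 2).mean(dim=-1)                     # higher = more similar

def recall_at_k(sim: torch.Tensor, k: int = 5) -> float:
    topk = sim.topk(k, dim=1).indices                    # best-k images per caption
    hits = (topk == torch.arange(sim.shape[0]).unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

caps = torch.randn(1000, 512)    # placeholder caption embeddings (1000-pair subset)
imgs = torch.randn(1000, 512)    # placeholder image embeddings
print(recall_at_k(neg_mse_similarity(caps, imgs), k=5))
```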

TandemNet2 is trained with various drop-rate values as an extension of the mRNN image-captioning method [12]. Real text is used here rather than simulated text, so lower drop rates lead to closer matching during training. Most TandemNet2 models use a drop rate of 0.5 for the other analyses; in the proposed method the best results are obtained when the drop rate is set to 1. TandemNet2 performs well compared to TandemNet, achieving a considerably higher accuracy margin, which confirms the effectiveness of the modality transfer function for image-text captioning.

TandemNet therefore uses the semantic information effectively to improve visual attention, and this attention capability is maintained even when no text is given. Comparing the with-text attention to the without-text attention in each row, the without-text attention is observed to be substantially "corrected". TandemNet2 achieves promising performance compared with TandemNet on multi-label image classification and retrieval.

Figure 2: Accuracy with varying drop rates

The text drop rate of TandemNet is analysed in Figure 2. At low drop rates the model relies heavily on the text information, so accuracy is low when text is unavailable. If the drop rate is raised too high the text is under-used, resulting in lower accuracy with text and good accuracy without text. A drop rate of 0.5 gives a good balance and is used when training the datasets for the evaluation of TandemNet.
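A minimal sketch of how a text drop rate can be applied during training, under the assumption that withholding the caption routes the model to the simulated (w/o-text) encoding; the helper name and interface are hypothetical:

```python
import torch

# With probability drop_rate the caption is withheld for a training step,
# forcing the model to rely on its simulated text encoding S_u.
def apply_text_dropout(captions: torch.Tensor, drop_rate: float = 0.5):
    return captions if torch.rand(()).item() >= drop_rate else None
```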

Figure 3 analyzes the drop rate for TandemNet2. TandemNet2 is comparatively insensitive to the drop rate and shows less variation across drop-rate values: its F1/C score (without text) drops by only 0.38% when the drop rate is decreased from 0.5 to 0.3. A drop rate of 0.5 is used for most of the experiments (except the caption-image retrieval model).

Figure 3: F1/C score for varying drop rates

5. Conclusion

The proposed dual attention mechanisms facilitate communication between visual data and semantic information, and this process effectively supports the CNN in distilling visual features under the control of semantic features. The proposed text-image embedding approach realizes asynchronous training and inference behaviour: the trained model classifies images irrespective of whether text is available. Scalability is improved by extending the model to multimodal vision tasks, and the attention maps support interpretable decision making. Semantic information is used effectively to improve CNN accuracy; as a result, the accuracy of caption-based image retrieval and multi-label image classification also improves.

References

1. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 1097- 1105.


2. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).

3. Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431-3440).

4. Zhang, Z., Chen, P., Sapkota, M., & Yang, L. (2017, September). Tandemnet: Distilling knowledge from medical images using diagnostic reports as optional semantic references. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 320-328). Springer, Cham.

5. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., ... & Bengio, Y. (2015, June). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning (pp. 2048-2057). PMLR.

6. Simonyan, K., Vedaldi, A., & Zisserman, A. (2013). Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.

7. Zhang, Q., Wu, Y. N., & Zhu, S. C. (2018). Interpretable convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8827- 8836).

8. Denil, M., Bazzani, L., Larochelle, H., & de Freitas, N. (2012). Learning where to attend with deep architectures for image tracking. Neural computation, 24(8), 2151-2184.

9. Nam, H., Ha, J. W., & Kim, J. (2017). Dual attention networks for multimodal reasoning and matching. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 299-307).

10. Jozefowicz, R., Zaremba, W., & Sutskever, I. (2015, June). An empirical exploration of recurrent network architectures. In International conference on machine learning (pp. 2342- 2350). PMLR.

11. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215.


12. Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., & Yuille, A. (2014). Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632.

13. Stone, A., Wang, H., Stark, M., Liu, Y., Scott Phoenix, D., & George, D. (2017). Teaching compositionality to CNNs. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5058-5067).

14. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., & Parikh, D. (2017). Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6904- 6913).

15. Zhang, J., Li, K., & Wang, Z. (2021). Parallel-fusion LSTM with synchronous semantic and visual information for image captioning. Journal of Visual Communication and Image Representation, 75, 103044.
