A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi, Don't just assume; look and answer: Overcoming priors for visual question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4971-4980, 2018.

P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson et al., Bottom-up and top-down attention for image captioning and visual question answering, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra et al., VQA: Visual Question Answering, International Conference on Computer Vision (ICCV), 2015.

D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, International Conference on Learning Representations (ICLR), 2015.

Y. Bai, J. Fu, T. Zhao, and T. Mei, Deep attention neural tensor network for visual question answering, The European Conference on Computer Vision (ECCV), 2018.

P. W. Battaglia et al., Relational inductive biases, deep learning, and graph networks, arXiv preprint arXiv:1806.01261, 2018.

H. Ben-younes, R. Cadène, N. Thome, and M. Cord, MUTAN: Multimodal Tucker fusion for visual question answering, IEEE International Conference on Computer Vision (ICCV), 2017.
URL : https://hal.archives-ouvertes.fr/hal-02073637

X. Chen, L. Li, L. Fei-Fei, and A. Gupta, Iterative visual reasoning beyond convolutions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

C. Zhu, Y. Zhao, S. Huang, K. Tu, and Y. Ma, Structured attentions for visual question answering, IEEE International Conference on Computer Vision (ICCV), 2017.

A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell et al., Multimodal compact bilinear pooling for visual question answering and visual grounding, EMNLP. The Association for Computational Linguistics, 2016.

Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, Learning to reason: End-to-end module networks for visual question answering, Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

D. A. Hudson and C. D. Manning, Compositional attention networks for machine reasoning, International Conference on Learning Representations, 2018.

J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick et al., CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei et al., Inferring and executing programs for visual reasoning, IEEE International Conference on Computer Vision (ICCV), 2017.

K. Kafle and C. Kanan, An analysis of visual question answering algorithms, The IEEE International Conference on Computer Vision (ICCV), 2017.

J. Kim, J. Jun, and B. Zhang, Bilinear attention networks, Advances in Neural Information Processing Systems, 2018.

J. Kim, K. W. On, W. Lim, J. Kim, J. Ha et al., Hadamard Product for Low-rank Bilinear Pooling, The 5th International Conference on Learning Representations, 2017.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, ICLR, 2015.

Y. Zhang, J. Hare, and A. Prügel-Bennett, Learning to count objects in natural images for visual question answering, International Conference on Learning Representations, 2018.