A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi, Don't just assume; look and answer: Overcoming priors for visual question answering, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4971-4980, 2018.

P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson et al., Bottom-up and top-down attention for image captioning and visual question answering, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra et al., VQA: Visual Question Answering, IEEE International Conference on Computer Vision (ICCV), 2015.

D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, International Conference on Learning Representations (ICLR), 2015.

Y. Bai, J. Fu, T. Zhao, and T. Mei, Deep attention neural tensor network for visual question answering, The European Conference on Computer Vision (ECCV), 2018.

P. W. Battaglia et al., Relational inductive biases, deep learning, and graph networks, arXiv preprint arXiv:1806.01261, 2018.

H. Ben-younes, R. Cadène, N. Thome, and M. Cord, MUTAN: Multimodal Tucker fusion for visual question answering, IEEE International Conference on Computer Vision (ICCV), 2017.
URL : https://hal.archives-ouvertes.fr/hal-02073637

X. Chen, L. Li, L. Fei-Fei, and A. Gupta, Iterative visual reasoning beyond convolutions, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

C. Zhu, Y. Zhao, S. Huang, K. Tu, and Y. Ma, Structured attentions for visual question answering, IEEE International Conference on Computer Vision (ICCV), 2017.

A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell et al., Multimodal compact bilinear pooling for visual question answering and visual grounding, Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.

Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, Learning to reason: End-to-end module networks for visual question answering, IEEE International Conference on Computer Vision (ICCV), 2017.

D. A. Hudson and C. D. Manning, Compositional attention networks for machine reasoning, International Conference on Learning Representations (ICLR), 2018.

J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick et al., CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei et al., Inferring and executing programs for visual reasoning, IEEE International Conference on Computer Vision (ICCV), 2017.

K. Kafle and C. Kanan, An analysis of visual question answering algorithms, IEEE International Conference on Computer Vision (ICCV), 2017.

J. Kim, J. Jun, and B. Zhang, Bilinear attention networks, Advances in Neural Information Processing Systems, 2018.

J. Kim, K. W. On, W. Lim, J. Kim, J. Ha et al., Hadamard product for low-rank bilinear pooling, International Conference on Learning Representations (ICLR), 2017.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations (ICLR), 2015.

R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba et al., Skip-thought vectors, Proceedings of the 28th International Conference on Neural Information Processing Systems, vol.2, pp.3294-3302, 2015.

R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata et al., Visual genome: Connecting language and vision using crowdsourced dense image annotations, International Journal of Computer Vision, vol. 123, no. 1, pp. 32-73, 2017.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems, vol.25, pp.1097-1105, 2012.

D. Lopez-Paz, R. Nishihara, S. Chintala, B. Schölkopf, and L. Bottou, Discovering causal signals in images, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei, Visual relationship detection with language priors, European Conference on Computer Vision (ECCV), 2016.

M. Malinowski, C. Doersch, A. Santoro, and P. Battaglia, Learning visual question answering by bootstrapping hard attention, The European Conference on Computer Vision (ECCV), 2018.

M. Malinowski and M. Fritz, A multi-world approach to question answering about real-world scenes based on uncertain input, Advances in Neural Information Processing Systems, vol.27, pp.1682-1690, 2014.

D. Mascharka, P. Tran, R. Soklaski, and A. Majumdar, Transparency by design: Closing the gap between performance and interpretability in visual reasoning, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

H. Noh and B. Han, Training recurrent answering units with joint loss minimization for VQA, arXiv preprint, 2016.

W. Norcliffe-Brown, E. Vafeias, and S. Parisot, Learning conditioned graph structures for interpretable visual question answering, Advances in Neural Information Processing Systems, 2018.

E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. C. Courville, FiLM: Visual reasoning with a general conditioning layer, AAAI Conference on Artificial Intelligence (AAAI), 2018.
URL : https://hal.archives-ouvertes.fr/hal-01648685

A. Santoro, F. Hill, D. G. Barrett, A. S. Morcos, and T. P. Lillicrap, Measuring abstract reasoning in neural networks, Proceedings of the 35th International Conference on Machine Learning (ICML), PMLR, vol. 80, pp. 4477-4486, 2018.

A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu et al., A simple neural network module for relational reasoning, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, pp.4974-4983, 2017.

Y. Shi, T. Furlanello, S. Zha, and A. Anandkumar, Question type guided attention in visual question answering, The European Conference on Computer Vision (ECCV), 2018.

K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville et al., Show, attend and tell: Neural image caption generation with visual attention, Proceedings of the 32nd International Conference on Machine Learning (ICML), vol. 37, pp. 2048-2057, 2015.

Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola, Stacked attention networks for image question answering, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Z. Yu, J. Yu, J. Fan, and D. Tao, Multi-modal factorized bilinear pooling with co-attention learning for visual question answering, IEEE International Conference on Computer Vision (ICCV), 2017.

Z. Yu, J. Yu, C. Xiang, J. Fan, and D. Tao, Beyond bilinear: Generalized multi-modal factorized high-order pooling for visual question answering, IEEE Transactions on Neural Networks and Learning Systems, 2018.

Y. Jiang, V. Natarajan, X. Chen et al., Pythia v0.1: The winning entry to the VQA Challenge 2018, arXiv preprint, 2018.

Y. Zhang, J. Hare, and A. Prügel-Bennett, Learning to count objects in natural images for visual question answering, International Conference on Learning Representations (ICLR), 2018.