A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems, vol.25, pp.1097-1105, 2012.

T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient estimation of word representations in vector space, 2013.

R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel et al., Skip-thought vectors, Advances in Neural Information Processing Systems, pp.3294-3302, 2015.

A. Karpathy and L. Fei-Fei, Deep visual-semantic alignments for generating image descriptions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.3128-3137, 2015.

C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei, Visual relationship detection with language priors, European Conference on Computer Vision, pp.852-869, 2016.

A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav et al., Visual Dialog, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

H. de Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle et al., GuessWhat?! Visual object discovery through multi-modal dialogue, Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra et al., VQA: Visual Question Answering, International Conference on Computer Vision (ICCV), 2015.

Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi, Don't just assume; look and answer: Overcoming priors for visual question answering, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

K. Kafle and C. Kanan, An analysis of visual question answering algorithms, The IEEE International Conference on Computer Vision (ICCV), 2017.

D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin et al., VizWiz grand challenge: Answering visual questions from blind people, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.3608-3617, 2018.

D. A. Hudson and C. D. Manning, GQA: A new dataset for compositional question answering over real-world images, 2019.

R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi, From recognition to cognition: Visual commonsense reasoning, 2019.

P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson et al., Bottom-up and top-down attention for image captioning and visual question answering, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

R. Cadene, H. Ben-younes, N. Thome, and M. Cord, MUREL: Multimodal Relational Reasoning for Visual Question Answering, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
URL : https://hal.archives-ouvertes.fr/hal-02073649

H. Ben-younes, R. Cadene, N. Thome, and M. Cord, BLOCK: Bilinear superdiagonal fusion for visual question answering and visual relationship detection, Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI), 2019.
URL : https://hal.archives-ouvertes.fr/hal-02073644

R. Hu, J. Andreas, T. Darrell, and K. Saenko, Explainable neural computation via stack neural module networks, ECCV, 2018.

J. Kim, J. Jun, and B. Zhang, Bilinear attention networks, Advances in Neural Information Processing Systems, pp.1564-1574, 2018.

J. Shi, H. Zhang, and J. Li, Explainable and explicit visual reasoning over scene graphs, CVPR, 2019.

C. Wu, J. Liu, X. Wang, and X. Dong, Chain of Reasoning for Visual Question Answering, Advances in Neural Information Processing Systems, vol.31, pp.275-285, 2018.

P. Gao, Z. Jiang, H. You, P. Lu et al., Dynamic Fusion with Intra- and Inter-Modality Attention Flow for Visual Question Answering, CVPR, 2019.

J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick et al., CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

A. Agrawal, D. Batra, and D. Parikh, Analyzing the behavior of visual question answering models, EMNLP, 2016.

S. Ramakrishnan, A. Agrawal, and S. Lee, Overcoming language priors in visual question answering with adversarial regularization, Advances in Neural Information Processing Systems, pp.1541-1551, 2018.

Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, Stacked attention networks for image question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.21-29, 2016.

J. Gordon and B. Van Durme, Reporting bias and knowledge acquisition, Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, pp.25-30, 2013.

W. Chao, H. Hu, and F. Sha, Being negative but constructively: Lessons learnt from creating better visual question answering datasets, 2018.

A. Torralba and A. A. Efros, Unbiased look at dataset bias, CVPR, 2011.

P. Stock and M. Cisse, Convnets and imagenet beyond accuracy: Understanding mistakes and uncovering biases, The European Conference on Computer Vision (ECCV), 2018.

S. Jia, T. Lansdall-welfare, and N. Cristianini, Right for the Right Reason: Training Agnostic Networks, Lecture Notes in Computer Science, pp.164-174, 2018.

V. Manjunatha, N. Saini, and L. S. Davis, Explicit Bias Discovery in Visual Question Answering Models, CVPR, 2019.

L. A. Hendricks, K. Burns, K. Saenko, T. Darrell, and A. Rohrbach, Women also snowboard: Overcoming bias in captioning models, ECCV, 2018.

J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K. Chang, Men also like shopping: Reducing gender bias amplification using corpus-level constraints, Conference on Empirical Methods in Natural Language Processing, 2017.

I. Misra, L. Zitnick, M. Mitchell, and R. Girshick, Seeing through the human reporting bias: Visual classifiers from noisy human-centric labels, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.2930-2939, 2016.

A. Gupta, A. Murali, D. Gandhi, and L. Pinto, Robot learning in homes: Improving generalization and reducing dataset bias, Advances in Neural Information Processing Systems, pp.9094-9104, 2018.

A. Anand, E. Belilovsky, K. Kastner, H. Larochelle, and A. Courville, Blindfold baselines for embodied QA, 2018.

J. Thomason, D. Gordon, and Y. Bisk, Shifting the baseline: Single modality performance on visual navigation & QA, NAACL, 2019.

A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko, Object hallucination in image captioning, EMNLP, 2018.

A. Jabri, A. Joulin, and L. van der Maaten, Revisiting visual question answering baselines, European Conference on Computer Vision, pp.727-739, 2016.

J. Lu, J. Yang, D. Batra, and D. Parikh, Hierarchical question-image co-attention for visual question answering, Advances In Neural Information Processing Systems, pp.289-297, 2016.

H. Ben-younes, R. Cadène, N. Thome, and M. Cord, MUTAN: Multimodal Tucker fusion for visual question answering, IEEE International Conference on Computer Vision (ICCV), 2017.
URL : https://hal.archives-ouvertes.fr/hal-02073637

Z. Yu, J. Yu, C. Xiang, J. Fan, and D. Tao, Beyond bilinear: Generalized multi-modal factorized high-order pooling for visual question answering, IEEE Transactions on Neural Networks and Learning Systems, 2018.

A. Das, H. Agrawal, C. L. Zitnick, D. Parikh, and D. Batra, Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?, Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.

R. Shrestha, K. Kafle, and C. Kanan, Answer them all! Toward universal visual question answering models, 2019.